Case Study · Semiconductor / EDA
EDA License Optimization Across FlexLM, LSF, and Pricing
The Problem With How This Is Usually Done
Most EDA shops measure license utilization the same way: pull DENIED counts from FlexLM, divide checkouts by seats, flag anything over 80%. The number goes into a spreadsheet. The spreadsheet goes to the CAD manager. The CAD manager asks for more licenses. Procurement approves it.
The problem is that raw denial counts and average utilization are the wrong metrics, and optimizing for them produces the wrong decisions in both directions.
Denial counts overstate impact. A single user polling aggressively for a blocked feature emits hundreds of DENIED events per hour. A feature that looks catastrophically constrained from raw counts may be affecting one engineer running a script in a loop. The signal you need is unique users actually blocked, correlated to what they were trying to accomplish.
Utilization averages hide the actual constraint. A feature at 95% average daily utilization sounds like it needs more seats. But if that 95% happens entirely during a two-hour tape-out window and the feature sits at 12% the rest of the week, you don't need more seats — you need a scheduler policy. The p95 concurrency per hour-of-week is a completely different number than the daily average, and it's the one that maps to actual demand shape.
Checkouts are bundles, not units. A single OUT event for a synthesis tool may pull 4–12 sub-feature tokens. The concurrent demand is the envelope of the bundle across all active checkouts, not the count of top-level tool launches. Getting this wrong causes systematic over-purchasing of the wrong features.
Not all denials have the same cost. A DENIED event that interrupts a 14-hour formal verification run is not the same as one that hit a 30-second interactive query. If you can't correlate denials to the jobs they blocked — and weight by the cost of that block — you can't build a defensible recommendation.
The Data
FlexLM daemon logs are structured text. Each line is an event: OUT (checkout), IN (checkin), DENIED, UNSUPPORTED, or QUEUED, tagged with user, host, feature, version, and a handle that ties OUT/IN pairs together. The problem is that the raw log has no concept of duration — that's implicit in the OUT/IN pairing — and no concept of the job context that produced the checkout.
LSF accounting records are separate. Every batch job has a submission time, start time, end time, and the user and host it ran on. The overlap between a DENIED event's timestamp and an LSF job's submit/start window is what links license pressure to actual engineering work.
flexlm_events (ts, event, user, host, feature, version, count, handle)
-- event ∈ {OUT, IN, DENIED, UNSUPPORTED, QUEUED}
lsf_jobs (job_id, user, queue, project, submit_ts, start_ts, end_ts,
cpu_s, mem_max_mb, exit_status)
license_pricing (feature, vendor, unit_price_year, token_weight)
features (feature, tool, vendor, suite)

FlexLM logs land through a tail ingest that parses the textual event stream into typed records at ingest time. LSF accounting records come in via batch ingest from the scheduler's log directory. Both feed the same query layer — there's no ETL pipeline syncing them to a separate warehouse, no separate time-series store for the event data.
Most Questions Don't Need SQL
The agents start in Query DSL. Most CAD-manager questions hit a single index and compose as structured filters plus aggregations — no joins, no SQL parser in the hot path. "Which features denied the most users in the last 24 hours?" is a single QueryDSL request against flexlm_events:
{
"index": "flexlm_events",
"query": {
"bool": {
"filter": [
{ "term": { "event": "DENIED" } },
{ "range": { "@timestamp": { "gte": "now-24h" } } }
]
}
},
"aggs": {
"by_feature": {
"terms": { "field": "feature", "size": 50 },
"aggs": {
"users_blocked": { "cardinality": { "field": "user" } },
"denial_events": { "value_count": { "field": "event" } }
}
}
},
"size": 0
}

That's the morning brief. It's also the shape of every "top N offenders by feature, by user, by host, by project" question — a terms aggregation with a couple of cardinality sub-aggregations, scoped by a time range. The agent emits the JSON directly.
When DSL Stops, SQL Starts
Concurrent usage is reconstructed from paired OUT/IN events as a step function over time. Each OUT event starts an interval; the next IN event on the same handle ends it. Denials land at their actual timestamps. The 95th percentile of concurrency per (feature, hour-of-week) is the real headroom signal — not the mean, not the daily peak, and not the raw denial count.
WITH paired AS (
    -- order every OUT/IN on a handle; LEAD hands each OUT the timestamp
    -- of the IN that closes it
    SELECT feature, event, ts,
           LEAD(ts) OVER (PARTITION BY feature, handle ORDER BY ts) AS next_ts
    FROM flexlm_events
    WHERE event IN ('OUT', 'IN')
),
checkouts AS (
    SELECT feature, ts AS t_out,
           COALESCE(next_ts, NOW()) AS t_in   -- still-open checkouts run to now
    FROM paired
    WHERE event = 'OUT'
),
edges AS (
    -- the step function: +1 at each checkout, -1 at each checkin
    SELECT feature, t_out AS ts, 1 AS delta FROM checkouts
    UNION ALL
    SELECT feature, t_in, -1 FROM checkouts
),
steps AS (
    -- running sum over the time-ordered edges = concurrent seats at each instant
    SELECT feature, ts,
           SUM(delta) OVER (PARTITION BY feature ORDER BY ts) AS concurrent
    FROM edges
),
concurrency AS (
    SELECT feature,
           ts_bucket('5m', ts) AS bucket,
           MAX(concurrent) AS concurrent
    FROM steps
    GROUP BY feature, bucket
),
denials AS (
    SELECT feature,
           COUNT(DISTINCT user) AS users_blocked,
           COUNT(*) AS denial_events
    FROM flexlm_events
    WHERE event = 'DENIED'
    GROUP BY feature
)
SELECT f.feature,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY c.concurrent) AS p95_concurrent,
       d.users_blocked,
       d.denial_events
FROM features f
LEFT JOIN concurrency c USING (feature)
LEFT JOIN denials d USING (feature)
GROUP BY f.feature, d.users_blocked, d.denial_events;

DSL stops being the right surface here for two reasons: LEAD() and the running SUM() are window functions (the OUT/IN pairing and the step function both need row-to-row context), and the final projection joins the derived concurrency and denial CTEs against the features catalog. Both are out of scope for a single-index DSL request. ts_bucket('5m', ...) is a native time-series bucketing function — no application-side resampling, no separate TSDB. The concurrency and denial datasets are stored together and queried in one plan.
The distinction between denial_events and users_blocked here is the entire point. A feature with 40,000 denial events and 3 unique blocked users is a different problem than one with 400 denial events and 40 unique blocked users. The raw event count told you nothing useful.
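The hour-of-week shape from the problem statement is the same plan with one more grouping key. A minimal sketch, reusing the steps CTE from the query above (EXTRACT syntax follows the common PostgreSQL convention; the engine's date functions may differ):

-- p95 concurrency per (feature, hour-of-week): up to 168 rows per feature,
-- exposing the tape-out-window spike that a daily average flattens
SELECT feature,
       EXTRACT(DOW FROM ts) * 24 + EXTRACT(HOUR FROM ts) AS hour_of_week,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY concurrent) AS p95_concurrent
FROM steps
GROUP BY feature, hour_of_week;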
Joining License Pressure to Actual Engineering Work
Once you know which features are generating denials and when, the next question is what those blocked users were actually trying to do. A denial that interrupts a job that completed anyway — because it retried and got a seat — has a different cost than one that killed a queued signoff job that then needed to be manually rescheduled.
Each DENIED event is correlated to the LSF job pending for that user at that timestamp:
SELECT d.feature,
j.queue,
j.project,
COUNT(*) AS blocked_jobs,
SUM(j.cpu_s) AS blocked_cpu_seconds
FROM flexlm_events d
JOIN lsf_jobs j
ON d.user = j.user
AND d.ts BETWEEN j.submit_ts
AND COALESCE(j.start_ts, j.submit_ts + INTERVAL '1 hour')
WHERE d.event = 'DENIED'
GROUP BY d.feature, j.queue, j.project;

This is a temporal join across two datasets that arrive from completely separate systems — the FlexLM daemon and the LSF scheduler — with no foreign key between them. The join predicate is ts BETWEEN submit_ts AND start_ts, which requires a range scan over the LSF events indexed by timestamp. Running this as a cross-system join in application code means either pulling both datasets into memory or making sequential queries and merging. Here it's one plan.
The blocked_cpu_seconds aggregate is what feeds the cost weighting. A denial that blocked 40,000 CPU-seconds of batch work in the signoff queue is prioritized differently in the recommendation than one that hit interactive scratch work.
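Folding pricing into that weighting is one more join. A hedged sketch, where pressure_per_dollar is an illustrative ranking metric rather than anything from the schema above:

-- rank features by blocked batch work relative to the cost of one more seat
SELECT i.feature,
       i.blocked_cpu_seconds,
       p.unit_price_year,
       i.blocked_cpu_seconds / NULLIF(p.unit_price_year, 0) AS pressure_per_dollar
FROM (
    SELECT d.feature, SUM(j.cpu_s) AS blocked_cpu_seconds
    FROM flexlm_events d
    JOIN lsf_jobs j
      ON d.user = j.user
     AND d.ts BETWEEN j.submit_ts
                  AND COALESCE(j.start_ts, j.submit_ts + INTERVAL '1 hour')
    WHERE d.event = 'DENIED'
    GROUP BY d.feature
) i
JOIN license_pricing p USING (feature)
ORDER BY pressure_per_dollar DESC;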
Classification and Cost
The final step joins p95 concurrency, unique-user denial impact, blocked CPU-hours, and per-seat pricing into a per-feature recommendation:
- Undersized: p95 concurrency exceeds licensed seats during normal working hours, unique blocked users > threshold, blocked CPU cost > threshold
- Oversized: p95 concurrency consistently below licensed seats with no denial pressure — dollar delta between current spend and right-sized spend
- Right-sized: concurrency within headroom, denial impact low or zero, no reallocation needed
Each feature gets the classification, the dollar delta, and the SQL plan that produced it — not a black-box score. CAD managers can ask follow-up questions in natural language ("which teams are driving the undersized signal on Calibre DRC?"), get a new query, inspect it, and the exchange is recorded in the audit log. Every recommendation is reproducible: the same query against the same data returns the same result.
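A minimal sketch of that classification step, assuming the earlier query's output is materialized as feature_metrics and a licensed_seats dimension carries per-feature seat counts (neither name appears in the schema above, and the thresholds are illustrative):

SELECT m.feature,
       CASE
         WHEN m.p95_concurrent > s.licensed_seats
              AND m.users_blocked >= 5 THEN 'undersized'
         WHEN m.p95_concurrent < 0.6 * s.licensed_seats
              AND COALESCE(m.denial_events, 0) = 0 THEN 'oversized'
         ELSE 'right-sized'
       END AS classification,
       -- positive for oversized features: spend recoverable by right-sizing
       (s.licensed_seats - CEIL(m.p95_concurrent)) * p.unit_price_year AS dollar_delta
FROM feature_metrics m                    -- hypothetical: output of the query above
JOIN licensed_seats s USING (feature)     -- hypothetical seat-count dimension
JOIN license_pricing p USING (feature);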
What Makes This Hard Without the Right Data Layer
The analysis sounds straightforward once it's written down. The engineering complexity is in the data layer:
Event reconstruction across handle pairs. The OUT/IN step function requires a window function over ordered events partitioned by handle. This is a SQL LEAD over a time-ordered stream — doable in any database, but it needs the full event history in one queryable place, not archived in gzipped log files.
Temporal joins across independent systems. The LSF/FlexLM correlation requires range-predicate joins across data that arrives from separate sources with no shared keys. Building this in application code means pulling both sides into memory or writing a custom merge. In the database, it's a join predicate.
Time-series aggregation at the right granularity. ts_bucket('5m', ...) is a native function here. In a general-purpose SQL database you'd write this as date_trunc plus arithmetic, which is verbose and doesn't compose cleanly with window functions over the bucketed result.
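For comparison, the same 5-minute bucket written against a general-purpose engine (PostgreSQL-style date arithmetic, shown only to make the contrast concrete):

-- ts_bucket('5m', ts) spelled out as date_trunc plus arithmetic
SELECT date_trunc('hour', ts)
       + FLOOR(EXTRACT(MINUTE FROM ts) / 5) * INTERVAL '5 minutes' AS bucket,
       COUNT(*) AS events
FROM flexlm_events
GROUP BY bucket;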
Governance on who sees what. License pricing data — per-seat costs, contract terms — is not something every engineer should have access to. The same query layer enforces field-level access so the CAD manager running the full cost analysis sees the dollar figures, while an engineer asking "why is my job getting denied" sees the utilization data but not the pricing. No second pipeline, no second access model.
A Team of Agents Running the Analysis
The license-optimization workload isn't one agent producing one number. It's a team of specialist agents sharing one backend, one identity model, and one audit log:
- Ingest agent — tails FlexLM daemon logs and pulls LSF accounting records, converting both into typed records at write time. Watches for new feature codes and backfills the features and license_pricing dimensions as they appear.
- Concurrency agent — reconstructs the OUT/IN step function, computes p95 concurrency per (feature, hour-of-week), and emits the headroom signal.
- Impact agent — joins DENIED events to LSF jobs, aggregates blocked CPU-hours per queue and project, and separates "annoyed interactive user" from "killed signoff job."
- Cost agent — joins concurrency and impact metrics to per-seat pricing. Sees pricing fields the other agents can't. Produces the dollar delta for each feature.
- Recommender agent — combines the three signals into the undersized / oversized / right-sized classification, attaches the SQL plan, and writes the recommendation to the shared decision store.
- Customer agent — the one embedded in the customer's console. Answers natural-language questions from a specific CAD manager, scoped to their organization's rows only.
Coordination happens through shared state in the backend — not through orchestration plumbing. The impact agent reads what the concurrency agent wrote. The recommender agent reads what both produced. When a CAD manager asks the customer agent "why is Calibre DRC flagged as undersized," it pulls the recommendation record, the underlying concurrency and impact metrics, and the source FlexLM and LSF events that produced them. The audit log records every read. Every recommendation is fully traceable back to the agent that made it and the data it was grounded in.
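The shape of that drill-down is itself one query. A sketch, assuming the decision store materializes as recommendations, feature_metrics, and impact_metrics datasets (all three names are illustrative, not from the schema above):

-- from the recommendation back to the metrics that grounded it, in one plan
SELECT r.feature, r.classification, r.dollar_delta,
       m.p95_concurrent, m.users_blocked,
       i.blocked_cpu_seconds
FROM recommendations r                     -- written by the recommender agent
JOIN feature_metrics m USING (feature)     -- written by the concurrency agent
JOIN impact_metrics i USING (feature)      -- written by the impact agent
WHERE r.feature = 'calibre_drc';           -- illustrative feature code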
Governance Declared Once, Not Stitched Across Tools
The thing that makes an agent team practical here is that governance is declared once on the backend and applied to every query any agent issues. There is no separate auth model for FlexLM logs, for LSF, for pricing, for the customer UI:
sdk.create_role("cad_manager", permissions={
"datasets": ["flexlm_events", "lsf_jobs", "license_pricing", "features"],
"field_allow": {
"license_pricing": ["feature", "unit_price_year", "token_weight"]
}
})
sdk.create_role("engineer", permissions={
"datasets": ["flexlm_events", "lsf_jobs", "features"],
"field_deny": { "license_pricing": ["unit_price_year"] },
"row_filter": { "lsf_jobs": "user = $user.username" }
})
sdk.create_role("customer_org", permissions={
"datasets": ["flexlm_events", "lsf_jobs", "features", "license_pricing"],
"row_filter": { "flexlm_events": "org_id = $user.org_id",
"lsf_jobs": "org_id = $user.org_id" }
})Three distinct layers of access, enforced server-side on every query the corresponding agent issues:
- RBAC — which datasets each role can touch at all.
- Field-level security — pricing columns visible to the cost and recommender agents, masked from the concurrency, impact, and engineer-facing agents even though they all query the same backend.
- Row-level (document-level) security — the customer agent running in the company's UI can only read rows from its own organization; an engineer's self-service queries only surface their own jobs. The row filter is applied inside the engine before the query executes, as the sketch below illustrates.
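Concretely, the row filter is a server-side rewrite. A sketch of the before/after for the customer agent (illustrative; the predicate is injected by the engine, never written by the caller):

-- what the customer agent sends:
SELECT feature, COUNT(DISTINCT user) AS users_blocked
FROM flexlm_events
WHERE event = 'DENIED'
GROUP BY feature;

-- what the engine executes after applying the customer_org row filter:
SELECT feature, COUNT(DISTINCT user) AS users_blocked
FROM flexlm_events
WHERE event = 'DENIED'
  AND org_id = $user.org_id   -- bound to the caller's identity
GROUP BY feature;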
The alternative is the usual pattern: column privileges in the warehouse, filtering middleware in the app, a separate auth model in the log store, a fourth one for the pricing system, and a best-effort audit pipeline bolted on. Every new agent is a new surface where a scope can leak. Declaring the rules once at the data layer — where every agent's query has to pass through — removes that category of work. The team adds a new specialist agent by declaring its role, not by wiring up security in five places.
Embedded in the Product
The customer-facing console and the internal analyst surface run against the same backend through the same identity and governance primitives. A customer sees their own organization's data; an internal analyst sees the full portfolio; both go through the same row- and field-level rules applied to the same datasets. There's no read replica, no data copy for the customer-facing tier, and no per-surface security model to keep in sync.
What Changed
Before: FlexLM reports from vendor tooling, periodic CSV exports, spreadsheet joins to pricing, manual analysis. The denial counts were the primary signal. Recommendations took days to produce and weren't reproducible — run the analysis again a week later and you'd get a different number because the raw data had aged out of the log files.
After: Continuous ingest from both FlexLM and LSF. A team of specialist agents — ingest, concurrency, impact, cost, recommender, customer — shares one backend, one identity model, and one audit log. A CAD manager asks a question, gets a SQL plan they can inspect, and the recommendation traces back through the agent that made it to the specific events and jobs that grounded it. Governance is declared once; every agent's query inherits it.
The license portfolio that used to take a week of analyst time to review runs in seconds. The recommendations are defensible because they're grounded in the actual concurrency distribution, and reproducible because every agent's plan and output is captured as queryable data.
