
Case Study · Managed Security Services Provider

Multi-Tenant Azure Log Monitoring at a Managed Security Services Provider

Industry: Managed Security — Multi-tenant SOC
Workload: Scheduled detection reports and real-time monitors across an enterprise client portfolio
Data sources: Azure Activity logs · Azure Sign-in logs · Azure Resource inventory · per-client identity context
Stack before / after: SIEM with per-tenant KQL rule sets, scheduler, Python glue, separate alert router → Infino

The Problem

A managed security services provider runs the SOC for a portfolio of enterprise clients — each with their own Azure tenant, their own resource taxonomy, their own privileged-user conventions. Every morning the SOC sends each client a report summarizing unusual activity in the last 24 hours. Every 5 minutes, real-time monitors sweep the same data looking for bursts of failed sign-ins, privilege escalations, and anomalous resource changes.

The expensive part of this workload is not writing a detection. Writing one detection is a morning's work. The expensive part is keeping one copy of that detection aligned across every client, as the clients' environments evolve independently. A new detection idea should take a day to roll out to the full portfolio, not a sprint.

The default SIEM architecture does not give you that. Every detection is a per-tenant KQL query, authored and maintained in a separate workspace, scheduled by a separate orchestration layer, alerted through a separate router. When a client renames a resource group or adds a custom role, their copy of the rule set drifts. When a new detection is added, it's a pull request against each tenant's workspace. The rule sets never stay in sync. The margin on the managed service evaporates into drift management.

The Three Hard Parts

Tenant isolation has to be enforced at the engine, not the application. A SOC query that correctly filters by tenant_id in application code is one deployment mistake away from cross-tenant disclosure. For a managed service, that is the business-ending bug. The right place to enforce "this query can only see rows from tenant X" is at the row-source level, before the query executes, applied uniformly to SQL, search, and PromQL.

Detections have to generalize across tenants. A detection written against a specific resource-naming convention works for that client and fails silently on the next. Detections should be authored once as parameterized queries and bound, at execution time, to the specific row-source of whatever tenant they run against.

Alerts have to be structured events, not LLM prose. Monitors that emit free-text descriptions are unusable by the downstream response workflow. A burst-of-failed-signins alert needs to emit a typed record — tenant, principal, country, window, count — that the incident pipeline can route and correlate. The wire format for alerts is structured.

Dataset Layout

Azure Activity, Sign-in, and Resource data flow from Event Hub into a unified backend. Every fact is keyed on tenant_id, and tenant_id is declared as a row-level security attribute at dataset-creation time:

azure_activity   (tenant_id, @timestamp, caller, op, resource_id, result, properties)
azure_signin     (tenant_id, @timestamp, user_principal, ip, country, app_id, status, risk)
azure_resources  (tenant_id, resource_id, type, location, tags, last_modified)

tenants          (tenant_id, client_name, region, plan)
soc_principals   (principal_id, scoped_to_tenant, role)

Scheduled jobs and monitors run under a SOC service principal that is scoped to exactly one tenant per execution. A query issued under that principal cannot resolve rows from any other tenant — the engine rewrites the row-source with WHERE tenant_id = $principal.tenant before the plan runs. This is a property of the principal and the dataset declaration, not of the query text. A detection that forgets to add a tenant filter is still correctly scoped; a detection that tries to bypass the filter still can't.
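The scoping guarantee can be pictured with a minimal sketch, assuming the engine filters the row-source by the principal's tenant before any query logic runs (names here are illustrative, not the actual Infino internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    name: str
    tenant: str  # a SOC service principal is scoped to exactly one tenant

def scoped_rows(dataset, principal):
    """Engine-side row-source: only the principal's tenant rows are reachable,
    regardless of what the query text says."""
    return [row for row in dataset if row["tenant_id"] == principal.tenant]

activity = [
    {"tenant_id": "t1", "op": "roleAssignments/write"},
    {"tenant_id": "t2", "op": "roleAssignments/write"},
]

soc_t1 = Principal("soc_agent", tenant="t1")
rows = scoped_rows(activity, soc_t1)
assert all(r["tenant_id"] == "t1" for r in rows)  # t2 rows never reach the plan
```

Because the filter is applied to the row-source rather than the query text, omitting a tenant predicate in a detection changes nothing, and adding a malicious one changes nothing either.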

Field-level security follows the same pattern for competitor-sensitive columns: plan tier, contract terms, and contact metadata are declared sensitive at the field level and are invisible to the SOC role that runs detections.
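A minimal sketch of what such declarations might look like, assuming a declarative creation-time shape; the key names and structure are illustrative, not the actual Infino SDK:

```python
# Hypothetical dataset declarations: row-level and field-level security
# are stated once, at creation time, not re-implemented per query.
azure_signin_decl = {
    "name": "azure_signin",
    "columns": ["tenant_id", "@timestamp", "user_principal", "ip",
                "country", "app_id", "status", "risk"],
    # every read under a tenant-scoped principal is filtered on this attribute
    "row_level_security": {"attribute": "tenant_id"},
}

tenants_decl = {
    "name": "tenants",
    "columns": ["tenant_id", "client_name", "region", "plan"],
    # competitor-sensitive fields are invisible to the detection-running role
    "field_level_security": {"hidden_from": {"soc_agent": ["plan"]}},
}
```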

Scheduled Detection Reports

A morning brief is authored once as a stored query and fans out per tenant. The detection is a typed query — reviewed once, then scheduled. Every tenant runs the same logic against its own row-source.

sdk.create_scheduled_query({
    "name": "morning_brief.privileged_role_assignments",
    "schedule": "0 7 * * *",
    "fan_out": "tenant_id",
    "principal": "soc_agent",
    "sql": """
        WITH prior_callers AS (
            SELECT DISTINCT caller
            FROM   azure_activity
            WHERE  op LIKE 'Microsoft.Authorization/roleAssignments/%'
              AND  @timestamp BETWEEN NOW() - INTERVAL '30 days'
                                  AND NOW() - INTERVAL '24 hours'
        ),
        new_today AS (
            SELECT a.caller,
                   a.resource_id,
                   a.@timestamp,
                   a.properties ->> 'roleDefinitionId' AS role_id,
                   (p.caller IS NULL) AS new_caller_30d
            FROM   azure_activity a
            LEFT   JOIN prior_callers p USING (caller)
            WHERE  a.op LIKE 'Microsoft.Authorization/roleAssignments/%'
              AND  a.@timestamp > NOW() - INTERVAL '24 hours'
              AND  a.result = 'Success'
        )
        SELECT * FROM new_today
        ORDER BY new_caller_30d DESC, @timestamp DESC;
    """,
    "sink_index": "morning_brief.{tenant_id}"
})

One definition, reviewed once. Thirty executions, one per tenant, each bound to its own row-source. The detection logic is the same everywhere; only the row-source filter changes. No LLM in the hot path, no per-run translation cost, no drift.
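The fan-out itself is simple to picture. A minimal sketch, assuming one stored definition is executed once per tenant under a tenant-scoped principal (the helper and its shape are illustrative, not the SDK):

```python
# One stored definition, N tenants, N executions: the SQL never changes,
# only the principal binding (and therefore the row-source) does.
def fan_out(definition, tenant_ids, run):
    executions = []
    for tenant in tenant_ids:
        principal = {"name": definition["principal"], "tenant": tenant}
        executions.append(run(definition["sql"], principal))
    return executions

definition = {"principal": "soc_agent", "sql": "SELECT ..."}
runs = fan_out(definition, ["t1", "t2", "t3"],
               run=lambda sql, p: {"tenant": p["tenant"], "sql": sql})
assert runs[0]["sql"] == runs[2]["sql"]  # identical logic for every tenant
```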

Real-Time Monitors

Monitors follow the same pattern — a stored query on the read side, a typed event contract on the emit side:

sdk.create_monitor({
    "name": "burst_failed_signins_new_country",
    "fan_out": "tenant_id",
    "every": "5m",
    "sql": """
        WITH recent AS (
            SELECT tenant_id, user_principal, country, @timestamp
            FROM   azure_signin
            WHERE  status = 'failure'
              AND  @timestamp > NOW() - INTERVAL '15 minutes'
        ),
        prior_countries AS (
            SELECT DISTINCT user_principal, country
            FROM   azure_signin
            WHERE  @timestamp BETWEEN NOW() - INTERVAL '30 days'
                                  AND NOW() - INTERVAL '15 minutes'
        )
        SELECT r.tenant_id, r.user_principal, r.country,
               MIN(r.@timestamp) AS window_start,
               MAX(r.@timestamp) AS window_end,
               COUNT(*) AS count
        FROM   recent r
        LEFT   JOIN prior_countries p USING (user_principal, country)
        WHERE  p.country IS NULL
        GROUP  BY r.tenant_id, r.user_principal, r.country
        HAVING COUNT(*) > 10;
    """,
    "emit": {
        "type": "alert.signin_burst",
        "fields": ["tenant_id", "user_principal", "country",
                   "window_start", "window_end", "count"]
    }
})

The monitor emits a typed event into the incident response workflow. The downstream router does not need to parse prose — it consumes a record with a known schema and routes on it. The wire contract for the alert is structured, versioned, and stable.
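A downstream router against that contract can be sketched in a few lines; the routing table and validation here are illustrative, only the event type and field names come from the monitor definition above:

```python
# Dispatch on the declared event type and the typed fields; no prose parsing.
ROUTES = {"alert.signin_burst": "identity-response-queue"}

REQUIRED = {"tenant_id", "user_principal", "country",
            "window_start", "window_end", "count"}

def route(event):
    missing = REQUIRED - event.keys()
    if missing:
        raise ValueError(f"malformed alert, missing {sorted(missing)}")
    return ROUTES[event["type"]], event["tenant_id"]

queue, tenant = route({
    "type": "alert.signin_burst", "tenant_id": "t7",
    "user_principal": "alice@example.com", "country": "BR",
    "window_start": "2024-03-01T07:00:00Z",
    "window_end": "2024-03-01T07:15:00Z", "count": 14,
})
assert queue == "identity-response-queue"
```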

Hybrid Queries Against the Same Backend

The questions that matter mix structured filters with semantic search over unstructured context — error messages, resource descriptions, activity properties. One example: find resource changes in the last 24h where the operation description semantically matches "publicly exposing storage or network," filtered to production-tagged resources.

{
  "index": "azure_activity",
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } },
        { "term":  { "resource.tags.environment": "production" } }
      ],
      "must": [
        {
          "semantic": {
            "field": "op_description",
            "query": "public exposure of storage or network",
            "k": 50
          }
        }
      ]
    }
  },
  "fields": [
    "tenant_id", "resource_id", "resource.type", "resource.tags",
    "op", "properties", "@timestamp"
  ],
  "sort": [{ "@timestamp": "desc" }]
}

This is the query DSL form — a JSON document the agent emits directly, no SQL parsing in the hot path. Structured filters, semantic ranking, and field projection compose in one request against one row-source, correctly scoped to the calling principal's tenant. The same engine still answers the relational CTEs above in SQL when joins are required; the DSL is the right surface when the question fits a single index.
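Because the DSL is just a JSON document, parameterizing it is ordinary code. A minimal sketch of building the same request programmatically (field names come from the example above; the builder function is illustrative):

```python
def hybrid_query(index, semantic_field, text, hours=24, env="production", k=50):
    """Assemble the bool-filter + semantic-rank DSL document shown above."""
    return {
        "index": index,
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{hours}h"}}},
                    {"term": {"resource.tags.environment": env}},
                ],
                "must": [
                    {"semantic": {"field": semantic_field,
                                  "query": text, "k": k}}
                ],
            }
        },
        "sort": [{"@timestamp": "desc"}],
    }

q = hybrid_query("azure_activity", "op_description",
                 "public exposure of storage or network")
```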

Audit as a Queryable Dataset

Every scheduled-query execution, every monitor tick, every on-demand query from a SOC engineer is appended to an audit dataset — who ran it, which principal, which tenant, which datasets and fields were read, which rows were filtered out by row-level security, and when. The audit dataset is itself a queryable dataset.

For the managed service, this changes a recurring conversation. When a client asks "what did you check for us this morning?" the answer is a query, not a slide. When a client's compliance team asks "show me every query that touched our sign-in logs in March," the answer is a query. When the internal SOC lead wants to confirm that no cross-tenant read ever occurred, the answer is a query. The audit log is the primary artifact, not an afterthought bolted on by instrumentation.
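As a sketch, the March compliance question might reduce to a single query; the audit dataset's column names here are assumptions for illustration, not the actual schema:

```sql
-- every query that touched a client's sign-in logs in March
-- (column names illustrative)
SELECT @timestamp, principal, query_text
FROM   audit
WHERE  'azure_signin' = ANY(datasets_read)
  AND  @timestamp >= '2024-03-01'
  AND  @timestamp <  '2024-04-01'
ORDER  BY @timestamp;
```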

What Changed

Before: thirty parallel rule sets, one per tenant, in a SIEM query language. Every new detection a thirty-way pull request. Every client environment change a silent drift. Tenant isolation enforced in application code. Alerts flowing out as free-text that the incident pipeline had to parse. Audit assembled by hand from scheduler logs, SIEM logs, and router logs when asked.

After: one backend, one governance layer, one audit surface. A detection is authored once as a stored query, fanned out per tenant, and bound to a row-source that cannot be bypassed. Monitors emit structured events. Compliance answers are queries. The managed-service margin is preserved because the SOC engineers spend their time on detection quality, not on keeping thirty copies of the same rule in sync.

The SIEM is still in the picture — as a data source, not as the system of record.

Questions? Get in touch.