Validating DAX Against Your Lakehouse with Semantic Link

A semantic model is a promise. It promises that the numbers in your reports match the data in your lakehouse. But after enough model changes (renamed columns, new relationships, tweaked measures), that promise gets harder to verify. I wanted a way to check it programmatically.

This is my second submission to the Fabric Semantic Link Developer Experience Challenge. The first was a DAX unit test harness that compares measures against hardcoded expected values. That works well for known business rules, but it has a limitation: someone has to decide and maintain what the “right” answer is. For a model with hundreds of measures across dozens of filter contexts, that does not scale.

So I built something different. Instead of hardcoding expected values, I use the Lakehouse as the ground truth.

The idea

If your semantic model sits on top of a Fabric Lakehouse, then both the DAX layer and the SQL layer should agree on the same numbers. A COUNTROWS('fact_ticket_metrics') in DAX should return the same count as SELECT COUNT(*) FROM Gold.fact_ticket_metrics in Spark SQL. If they diverge, something changed in the model that needs attention.

The notebook takes pairs of queries: one DAX, one SQL. It executes both, normalizes the results into comparable DataFrames, and reports pass or fail. The Lakehouse is the single source of truth.

How it works

Each test case is a dictionary with a description, a DAX query, and a SQL query. Optionally, you can specify sort columns, a floating-point tolerance, or a column mapping if the names differ between the two result sets.

test_cases = [
    {
        "description": "Row count tickets",
        "dax_query": """
            EVALUATE
            ROW("RowCount", COUNTROWS('fact_ticket_metrics'))
        """,
        "sql_query": """
            SELECT COUNT(*) AS RowCount
            FROM Gold.fact_ticket_metrics
        """,
    },
    {
        "description": "Total # Incidents",
        "dax_query": """
            EVALUATE
            ROW("Incidents", [Incidents])
        """,
        "sql_query": """
            SELECT COUNT(id) AS Incidents
            FROM Gold.fact_ticket_metrics
        """,
        "tolerance": 0.01,
    },
]

The harness loops through each test case, executes the DAX query via sempy_labs.evaluate_dax_impersonation, executes the SQL query via Spark, and then compares the two result DataFrames.
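
A minimal sketch of that loop follows. It is not the notebook's code: run_comparisons and its parameters are hypothetical names, and the two query executors are passed in as plain callables so the loop can be tried outside Fabric. Treating an exception as a failed test is also an assumption of the sketch.

```python
from typing import Callable

import pandas as pd


def run_comparisons(
    test_cases: list[dict],
    run_dax: Callable[[str], pd.DataFrame],
    run_sql: Callable[[str], pd.DataFrame],
    compare: Callable[[pd.DataFrame, pd.DataFrame, dict], bool],
) -> pd.DataFrame:
    """Execute each DAX/SQL pair and record pass or fail per test case."""
    rows = []
    for case in test_cases:
        try:
            dax_df = run_dax(case["dax_query"])
            sql_df = run_sql(case["sql_query"])
            passed = bool(compare(dax_df, sql_df, case))
            error = None
        except Exception as exc:  # assumption: a broken query counts as a failure
            passed, error = False, str(exc)
        rows.append(
            {"description": case["description"], "passed": passed, "error": error}
        )
    return pd.DataFrame(rows)


# In a Fabric notebook the executors might be wired roughly like this
# (placeholder names, not executed here; check the signature in your version):
#   run_dax = lambda q: sempy_labs.evaluate_dax_impersonation(
#       dataset="<model name>", dax_query=q, workspace="<workspace name>")
#   run_sql = lambda q: spark.sql(q).toPandas()
```

Passing the executors in keeps the loop itself free of Fabric dependencies, which makes it easy to exercise with stub functions before pointing it at a real model.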

Column normalization

DAX result sets qualify column names, as in 'Table'[Column], or wrap them in brackets, as in [RowCount] for a column named in ROW. SQL returns plain column names. If you want to compare them, those names need to match.

A small regex function strips DAX table qualifiers: 'Sales'[Amount] becomes Amount, and [Amount] also becomes Amount. After normalization, the harness aligns both DataFrames on their common columns, sorted alphabetically. Any extra columns on either side get flagged as warnings but do not block the comparison.
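
As an illustration, a normalizer along those lines could look like the sketch below. The function name and the exact pattern are assumptions, not the notebook's code, and the pattern does not handle table names that contain escaped single quotes.

```python
import re

# Optional table qualifier ('Table' or Table), then [Column].
_DAX_COLUMN = re.compile(r"^\s*(?:'[^']*'|[^\[\]']*)\[(?P<column>[^\]]+)\]\s*$")


def normalize_column_name(name: str) -> str:
    """Strip a DAX table qualifier and brackets; leave plain names untouched."""
    match = _DAX_COLUMN.match(name)
    return match.group("column") if match else name.strip()


# Applied to a result DataFrame:
#   dax_df.columns = [normalize_column_name(c) for c in dax_df.columns]
```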

If the normalized names still do not match (say the DAX column is RowCount but the SQL column is row_count), you can pass a column_mapping dictionary to handle the translation explicitly.
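
A sketch of the mapping and alignment step, under two assumptions: the mapping goes from DAX column name to SQL column name, and align_columns is a hypothetical helper name.

```python
import pandas as pd


def align_columns(dax_df, sql_df, column_mapping=None):
    """Rename DAX columns via the mapping, then keep shared columns in sorted order."""
    dax_df = dax_df.rename(columns=column_mapping or {})
    common = sorted(set(dax_df.columns) & set(sql_df.columns))
    # Extra columns are reported as warnings and do not block the comparison.
    warnings = [
        f"Column only in DAX result: {c}" for c in dax_df.columns if c not in common
    ] + [
        f"Column only in SQL result: {c}" for c in sql_df.columns if c not in common
    ]
    return dax_df[common], sql_df[common], warnings
```

With a DAX column RowCount and a SQL column row_count, passing {"RowCount": "row_count"} makes the two sides line up without touching either query.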

Floating-point tolerance

Not every number comparison should demand exact equality. Aggregations in DAX and Spark can produce slightly different floating-point results depending on processing order and precision. The harness uses numpy.isclose with a configurable relative tolerance (default: 0.0001) for numeric columns. String columns are compared as exact matches.

When a numeric mismatch exceeds the tolerance, the harness reports the specific column, row, DAX value, and SQL value. It caps the output at five mismatches per column to keep the report readable when something is broadly wrong.
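
The check can be sketched with numpy.isclose as follows. The helper name and the returned column names are hypothetical, and setting atol to 0 so the check is purely relative is a choice made for this sketch, not something the notebook is known to do.

```python
import numpy as np
import pandas as pd


def numeric_mismatches(dax_col, sql_col, column, tolerance=1e-4, max_report=5):
    """Return up to max_report rows where two numeric columns differ beyond tolerance."""
    dax_values = np.asarray(dax_col, dtype=float)
    sql_values = np.asarray(sql_col, dtype=float)
    # np.isclose tests |dax - sql| <= atol + rtol * |sql|, so rtol is relative
    # to the SQL value; atol=0.0 is an assumption of this sketch.
    close = np.isclose(dax_values, sql_values, rtol=tolerance, atol=0.0, equal_nan=True)
    bad_rows = np.flatnonzero(~close)[:max_report]
    return pd.DataFrame(
        {
            "column": [column] * len(bad_rows),
            "row": bad_rows,
            "dax_value": dax_values[bad_rows],
            "sql_value": sql_values[bad_rows],
        }
    )
```

Note that a purely relative tolerance is strict around zero: with atol at 0, a DAX value of 0 only matches a SQL value of exactly 0.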

The comparison engine

The comparison works at the DataFrame level, not just scalar values. This matters because many useful validation queries return multiple rows: a count per category, a sum per month, a distinct count per workspace. Scalar-only testing misses structural issues like missing rows or extra groupings.

The engine does three things in sequence:

  1. Aligns both DataFrames on their common columns and sorts them consistently
  2. Compares numeric columns with tolerance, string columns with exact match
  3. Collects mismatches into a diff DataFrame for inspection

A shape mismatch (different number of rows or columns) is an immediate failure. You get the exact dimensions from both sides so you know whether the issue is missing data or a query that groups differently.
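
Put together, the sequence could look like this sketch. It assumes both DataFrames already share the same column names (after normalization and mapping); compare_frames and its return shape are hypothetical, not the notebook's actual engine.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def compare_frames(dax_df, sql_df, sort_columns=None, tolerance=1e-4):
    """Return (passed, message, diff) for two frames with identical column names."""
    # A shape mismatch fails immediately and reports both dimensions.
    if dax_df.shape != sql_df.shape:
        message = f"Shape mismatch: DAX {dax_df.shape} vs SQL {sql_df.shape}"
        return False, message, pd.DataFrame()

    # Step 1: same column order and same row order on both sides.
    columns = sorted(dax_df.columns)
    keys = sort_columns or columns
    left = dax_df[columns].sort_values(keys).reset_index(drop=True)
    right = sql_df[columns].sort_values(keys).reset_index(drop=True)

    # Step 2: numeric columns with tolerance, everything else as exact strings.
    mismatches = []
    for col in columns:
        if is_numeric_dtype(left[col]) and is_numeric_dtype(right[col]):
            equal = np.isclose(
                left[col].to_numpy(dtype=float),
                right[col].to_numpy(dtype=float),
                rtol=tolerance,
                atol=0.0,
            )
        else:
            equal = (left[col].astype(str) == right[col].astype(str)).to_numpy()
        for row in np.flatnonzero(~equal):
            mismatches.append(
                {"column": col, "row": int(row),
                 "dax": left[col].iloc[row], "sql": right[col].iloc[row]}
            )

    # Step 3: collect the mismatches into a diff DataFrame.
    diff = pd.DataFrame(mismatches, columns=["column", "row", "dax", "sql"])
    message = "OK" if diff.empty else f"{len(diff)} mismatched cell(s)"
    return diff.empty, message, diff
```

For grouped queries it helps to pass the grouping columns as sort_columns, so that a small numeric difference in a measure column cannot change the row order on one side only.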

What I tested against

I ran this against a Zendesk reporting model that sits on a Gold-layer lakehouse. The model has ticket metrics, incident counts, and support analytics. The test cases validated that the semantic model’s row counts and measure aggregations matched the underlying SQL tables.

This is the kind of model where schema drift is common. New ticket categories get added, fields get renamed upstream, and the Gold layer evolves. Having an automated check that the DAX layer still reflects reality saves the awkward moment when someone asks why the dashboard numbers do not match the data export.

How this differs from unit testing

My other submission, the Semantic Model Test Harness, validates DAX against hardcoded expected values. That is unit testing: does this specific measure, with this specific filter, return this specific number?

This notebook is closer to integration testing. It validates that the semantic model agrees with its source data. The two approaches complement each other:

  • Unit tests catch business logic errors (a measure formula was changed incorrectly)
  • Lakehouse comparison tests catch data layer drift (the model no longer reflects what is in the tables)

Running both gives you confidence from two different angles.

Where this goes next

The test case format supports multi-row comparisons, so extending this to validate entire dimension tables (not just aggregated measures) is straightforward. I can also see connecting this to Fabric pipeline orchestration, running the comparison notebook as a post-refresh step to detect drift immediately after data lands.

Another natural extension: generating test cases automatically by introspecting the semantic model’s measures and matching them to lakehouse tables. Semantic Link Labs has functions for listing model metadata that could feed into a test case generator. I have not built that yet, but the structure is there.

The notebook is submitted to the Fabric Notebook Gallery as part of the Semantic Link Developer Experience Challenge. If you are running semantic models on top of a Lakehouse and have ever wondered whether they still agree, this might save you some manual checking.

Unit Testing DAX with Semantic Link

Every BI developer has felt it. You change a measure, update a relationship, or rename a column in a semantic model, and then you spend the next hour clicking through report pages to check if something broke. Manual spot-checking is how most teams validate DAX today. It works until it does not.

I have been building and maintaining semantic models for years. The further I get into Fabric-based development, the more my models start to feel like production code. They power dashboards that drive decisions. They feed downstream pipelines. When something breaks, the blast radius is real. And yet, the testing story has always been: deploy, open the report, squint at the numbers.

That gap bothered me enough to do something about it.

The challenge

Microsoft recently launched the Fabric Semantic Link Developer Experience Challenge, a community contest focused on building reusable tools that improve how teams develop, test, document, and maintain semantic models in Microsoft Fabric. The requirement: use Semantic Link as a core component and solve a real developer pain point.

I have been eyeing Semantic Link Labs for a while. The library exposes evaluate_dax_impersonation, which lets you execute arbitrary DAX queries against a Fabric semantic model from a notebook. That single function is what makes programmatic testing possible.

The idea for my submission: a test harness that brings unit testing and regression detection to DAX measures. Define your test cases. Run them. Get a pass/fail report. No browser required.

What I built

The Semantic Model Test Harness is a single Fabric notebook. No external services, no complex infrastructure. You define test cases as rows in a pandas DataFrame, each row specifying three things:

  • The DAX measure to evaluate
  • The filter context to apply (a DAX boolean expression simulating a slicer or page filter)
  • The expected value

Here is what a test case definition looks like:

import pandas as pd

dax_tests = pd.DataFrame([
    {
        "measure": "# Reports",
        "filter_context": "'Catalog - Report'[Report Workspace] = 'Arla DK'",
        "expected_value": 61
    },
    {
        "measure": "# Workspace Users",
        "filter_context": "'Catalog - Workspace'[Workspace] = 'CatMan Next DK - Demo'",
        "expected_value": 15
    },
])

Each test case gets transformed into an EVALUATE ROW(...) DAX query that wraps the measure in a CALCULATE with the specified filter. The harness sends that query to the semantic model via sempy_labs.evaluate_dax_impersonation(), compares the result to the expected value, and records pass or fail.

DAX query generation

One thing I had to sort out early: DAX has opinions about quoting. Single quotes wrap table names, double quotes wrap string values. Filter expressions like 'Catalog - Report'[Report Workspace] = 'Arla DK' need the 'Arla DK' portion converted to "Arla DK" before execution. A small regex helper handles that conversion automatically.
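
One way such a helper might work is sketched below; the function name and pattern are assumptions rather than the notebook's code. The idea is to visit every single-quoted segment once, leave it alone when it is followed by a bracket (a table name), and otherwise rewrite it as a double-quoted string. Values that themselves contain an apostrophe are not handled.

```python
import re

# A single-quoted segment, optionally followed by the "[" of a column reference.
_QUOTED = re.compile(r"'([^']*)'(\s*\[)?")


def to_dax_string_literals(filter_expression: str) -> str:
    """Convert 'value' to "value" while keeping 'Table'[Column] references intact."""

    def swap(match: re.Match) -> str:
        if match.group(2):  # quoted table name followed by [Column]: keep as is
            return match.group(0)
        # DAX escapes a double quote inside a string by doubling it.
        return '"' + match.group(1).replace('"', '""') + '"'

    return _QUOTED.sub(swap, filter_expression)
```

Matching table names and string values in the same pass matters: a pattern that only looks for quoted text not followed by a bracket can pair the closing quote of the table name with the opening quote of the value.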

The harness also distinguishes between boolean filter expressions (like 'Table'[Column] = "Value") and table function expressions (like FILTER(...) or VALUES(...)). Both are valid in a CALCULATE, but the detection matters for correct query construction. A simple heuristic checks for DAX table function prefixes and falls back to boolean if none are found.

The generated DAX query for a test case looks like this:

EVALUATE ROW("Value", CALCULATE([# Reports], 'Catalog - Report'[Report Workspace] = "Arla DK"))
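
A sketch of a builder that produces a query of that shape, together with a prefix-based heuristic for spotting table expressions. Both function names and the prefix list are illustrative assumptions. In this sketch either kind of filter ends up as a CALCULATE argument in the same way; how the notebook uses the classification is not shown here.

```python
# Illustrative, incomplete list of DAX functions that return tables.
TABLE_FUNCTION_PREFIXES = (
    "FILTER(", "VALUES(", "ALL(", "ALLEXCEPT(", "TREATAS(", "DATESBETWEEN(",
)


def is_table_expression(filter_context: str) -> bool:
    """Heuristic: does the filter start with a known table function call?"""
    head = filter_context.upper().replace(" ", "")
    return head.startswith(TABLE_FUNCTION_PREFIXES)


def build_test_query(measure: str, filter_context: str) -> str:
    """Wrap a measure and a filter (string values already double-quoted) in a query."""
    return f'EVALUATE ROW("Value", CALCULATE([{measure}], {filter_context}))'
```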

Running the tests

Execution is straightforward. The harness loops through every row in the test DataFrame, builds the DAX query, sends it to the model, and collects results. Each test produces a row in the results DataFrame showing the measure, filter context, expected value, actual value, pass/fail status, and the exact DAX query used.

If a test fails because of a connectivity error, invalid DAX, or anything else unexpected, the exception is caught and logged as a failure with the error message preserved. No silent swallowing of errors.
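
The loop and its error handling could be sketched like this. run_tests, execute_dax, and build_query are hypothetical names; the executor and the query builder are passed in so the sketch runs without a Fabric connection, and reading the result from the first cell of the returned DataFrame is an assumption.

```python
import pandas as pd


def run_tests(dax_tests: pd.DataFrame, execute_dax, build_query) -> pd.DataFrame:
    """Run every test row and return one result row per test."""
    results = []
    for test in dax_tests.to_dict("records"):
        query = build_query(test["measure"], test["filter_context"])
        record = {
            **test,
            "actual_value": None,
            "passed": False,
            "error": None,
            "dax_query": query,
        }
        try:
            result = execute_dax(query)
            # Assumption: a one-row, one-column result holds the measure value.
            record["actual_value"] = result.iloc[0, 0]
            record["passed"] = bool(record["actual_value"] == test["expected_value"])
        except Exception as exc:
            # Keep the message so the failure can be diagnosed later.
            record["error"] = str(exc)
        results.append(record)
    return pd.DataFrame(results)
```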

The results summary counts passes and failures. Failed tests get highlighted separately so they stand out:

Total tests: 2
Passed: 2
Failed: 0
✅ All tests passed!

When something does fail, you get the actual value alongside the expected value, plus the DAX query that was sent. That gives you everything needed to diagnose whether the issue is in the model, the test definition, or the filter context.
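
A summary in that format can be produced from a results DataFrame with a boolean passed column, as in this sketch. The function name and the wording of the failure line are assumptions.

```python
import pandas as pd


def summarize(results: pd.DataFrame) -> str:
    """Build the pass/fail summary text, listing failed tests separately."""
    total = len(results)
    passed = int(results["passed"].sum())
    failed = total - passed
    lines = [f"Total tests: {total}", f"Passed: {passed}", f"Failed: {failed}"]
    if failed == 0:
        lines.append("✅ All tests passed!")
    else:
        # Hypothetical wording for the failure case.
        lines.append("❌ Failed tests:")
        failures = results.loc[~results["passed"].astype(bool)]
        lines.append(failures.to_string(index=False))
    return "\n".join(lines)
```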

The model I tested against

I ran this against my CatMan BI Tenant Stats semantic model, a model I maintain for Power BI tenant administration and monitoring. It tracks workspaces, reports, datasets, users, activity, and permissions across our organization’s Power BI tenant. The model has 17 tables covering catalog metadata, user licensing, activity logs, and calendar dimensions.

This is a model that changes regularly as new workspaces spin up, users rotate, and reporting patterns shift. Exactly the kind of model where silent measure breakage is a real risk.

What I learned

Test case design is the hard part. Writing the harness code was relatively quick. Deciding which measures to test, with which filter contexts, and what counts as a correct expected value requires genuine domain knowledge. This is not something you can auto-generate meaningfully. You need a human who knows the business logic.

Filter context quoting will trip you up. DAX’s quoting rules are well-documented, but in practice, switching between single and double quotes across table names and string values is a reliable source of errors when constructing queries programmatically. The regex helper saved me repeated debugging sessions.

evaluate_dax_impersonation is the unlock. Without this function from Semantic Link Labs, you would need to stand up an XMLA endpoint connection, handle authentication separately, and manage the query lifecycle yourself. Semantic Link wraps all of that. The function takes a dataset name, a workspace name, and a DAX query string, then returns a DataFrame. That simplicity is what makes a notebook-based test harness practical.

Regression testing needs baselines. The current harness compares against hardcoded expected values. For a production CI/CD integration, you would want a baseline snapshot mechanism: run tests, store results, then compare future runs against the stored baseline rather than manually maintained numbers. I have not built that yet, but the architecture supports it.
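
To make the idea concrete, here is one possible shape for such a mechanism, which is not part of the current harness. It assumes results are reduced to a dictionary from a test key to a JSON-serializable value and stored as a JSON file; the function names, the key format, and the storage location are all hypothetical.

```python
import json
from pathlib import Path


def save_baseline(results: dict, path) -> None:
    """Store the current results as the baseline for later runs."""
    Path(path).write_text(json.dumps(results, indent=2, sort_keys=True), encoding="utf-8")


def compare_to_baseline(results: dict, path) -> dict:
    """Return every test whose value differs from the stored baseline."""
    baseline = json.loads(Path(path).read_text(encoding="utf-8"))
    changed = {}
    for key in sorted(set(baseline) | set(results)):
        old, new = baseline.get(key), results.get(key)
        if old != new:  # also catches tests that were added or removed
            changed[key] = {"baseline": old, "current": new}
    return changed
```

A first run would call save_baseline once the numbers have been reviewed; later runs call compare_to_baseline and treat a non-empty result as a regression to investigate.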

Where this goes next

The notebook is designed to be dropped into a Fabric workspace and run on demand or triggered as part of a deployment pipeline. Fabric notebooks can be orchestrated through pipelines, so running this harness as a post-deployment validation step is a natural fit.

I can also see extending the test case format to include tolerance thresholds for measures that fluctuate (like row counts on live data) rather than requiring exact matches. And grouping tests by business domain or model area would help when you want to run a targeted suite after changing a specific part of the model.

For now, it works. I define my tests, I run the notebook, and I get a clear answer: did something break, or is the model still behaving as expected? That is a better answer than opening six report pages and eyeballing numbers.

The notebook is submitted to the Fabric Notebook Gallery as part of the Semantic Link Developer Experience Challenge. If you are maintaining semantic models in Fabric and have felt that same testing gap, give it a try. Let me know in the comments if you find it useful, or if you run into edge cases I have not covered.