Every BI developer has felt it. You change a measure, update a relationship, or rename a column in a semantic model, and then you spend the next hour clicking through report pages to check if something broke. Manual spot-checking is how most teams validate DAX today. It works until it does not.
I have been building and maintaining semantic models for years. The further I get into Fabric-based development, the more my models start to feel like production code. They power dashboards that drive decisions. They feed downstream pipelines. When something breaks, the blast radius is real. And yet, the testing story has always been: deploy, open the report, squint at the numbers.
That gap bothered me enough to do something about it.
The challenge
Microsoft recently launched the Fabric Semantic Link Developer Experience Challenge, a community contest focused on building reusable tools that improve how teams develop, test, document, and maintain semantic models in Microsoft Fabric. The requirement: use Semantic Link as a core component and solve a real developer pain point.
I have been eyeing Semantic Link Labs for a while. The library exposes evaluate_dax_impersonation, which lets you execute arbitrary DAX queries against a Fabric semantic model from a notebook. That single function is what makes programmatic testing possible.
The idea for my submission: a test harness that brings unit testing and regression detection to DAX measures. Define your test cases. Run them. Get a pass/fail report. No browser required.
What I built
The Semantic Model Test Harness is a single Fabric notebook. No external services, no complex infrastructure. You define test cases as rows in a pandas DataFrame, each row specifying three things:
- The DAX measure to evaluate
- The filter context to apply (a DAX boolean expression simulating a slicer or page filter)
- The expected value
Here is what a test case definition looks like:
import pandas as pd

dax_tests = pd.DataFrame([
    {
        "measure": "# Reports",
        "filter_context": "'Catalog - Report'[Report Workspace] = 'Arla DK'",
        "expected_value": 61,
    },
    {
        "measure": "# Workspace Users",
        "filter_context": "'Catalog - Workspace'[Workspace] = 'CatMan Next DK - Demo'",
        "expected_value": 15,
    },
])
Each test case gets transformed into an EVALUATE ROW(...) DAX query that wraps the measure in a CALCULATE with the specified filter. The harness sends that query to the semantic model via sempy_labs.evaluate_dax_impersonation(), compares the result to the expected value, and records pass or fail.
DAX query generation
One thing I had to sort out early: DAX has opinions about quoting. Single quotes wrap table names, double quotes wrap string values. Filter expressions like 'Catalog - Report'[Report Workspace] = 'Arla DK' need the 'Arla DK' portion converted to "Arla DK" before execution. A small regex helper handles that conversion automatically.
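A minimal sketch of that kind of helper, assuming the convention that a single-quoted span not immediately followed by [ is a string literal rather than a table name (the function name is illustrative, not the exact notebook code):

import re

def normalize_filter_quotes(expr: str) -> str:
    # Single-quoted spans followed by '[' are table names and stay as-is;
    # any other single-quoted span is treated as a string literal and
    # rewritten with DAX double quotes.
    return re.sub(r"'([^']*)'(?!\s*\[)", r'"\1"', expr)

# 'Catalog - Report'[Report Workspace] = 'Arla DK'
# becomes: 'Catalog - Report'[Report Workspace] = "Arla DK"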
The harness also distinguishes between boolean filter expressions (like 'Table'[Column] = "Value") and table function expressions (like FILTER(...) or VALUES(...)). Both are valid in a CALCULATE, but the detection matters for correct query construction. A simple heuristic checks for DAX table function prefixes and falls back to boolean if none are found.
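A sketch of that heuristic, with an illustrative rather than exhaustive list of table functions:

# Hypothetical helper: is this filter context a table expression
# (FILTER, VALUES, ...) or a plain boolean predicate?
TABLE_FUNCTIONS = ("FILTER", "VALUES", "ALL", "ALLEXCEPT",
                   "CALCULATETABLE", "TREATAS", "SUMMARIZE")

def is_table_expression(filter_context: str) -> bool:
    head = filter_context.lstrip().upper()
    return any(head.startswith(fn + "(") or head.startswith(fn + " (")
               for fn in TABLE_FUNCTIONS)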
The generated DAX query for a test case looks like this:
EVALUATE ROW("Value", CALCULATE([# Reports], 'Catalog - Report'[Report Workspace] = "Arla DK"))
Running the tests
Execution is straightforward. The harness loops through every row in the test DataFrame, builds the DAX query, sends it to the model, and collects results. Each test produces a row in the results DataFrame showing the measure, filter context, expected value, actual value, pass/fail status, and the exact DAX query used.
If a test fails because of a connectivity error, invalid DAX, or anything else unexpected, the exception is caught and logged as a failure with the error message preserved. No silent swallowing of errors.
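A condensed sketch of that loop, using the build_test_query helper above; the workspace name is a placeholder, and the keyword arguments follow the dataset, workspace, and DAX query parameters described later in this post:

import pandas as pd
import sempy_labs

results = []
for _, test in dax_tests.iterrows():
    query = build_test_query(test["measure"], test["filter_context"])
    try:
        df = sempy_labs.evaluate_dax_impersonation(
            dataset="CatMan BI Tenant Stats",  # semantic model name
            workspace="Tenant Admin",          # placeholder workspace name
            dax_query=query,
        )
        actual = df.iloc[0, 0]
        status, error = ("pass" if actual == test["expected_value"] else "fail"), None
    except Exception as exc:  # connectivity errors, invalid DAX, anything unexpected
        actual, status, error = None, "fail", str(exc)
    results.append({
        "measure": test["measure"],
        "filter_context": test["filter_context"],
        "expected": test["expected_value"],
        "actual": actual,
        "status": status,
        "error": error,
        "dax_query": query,
    })

results_df = pd.DataFrame(results)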
The results summary counts passes and failures. Failed tests get highlighted separately so they stand out:
Total tests: 2
Passed: 2
Failed: 0
✅ All tests passed!
When something does fail, you get the actual value alongside the expected value, plus the DAX query that was sent. That gives you everything needed to diagnose whether the issue is in the model, the test definition, or the filter context.
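The summary and the failure highlighting are a few lines of pandas on top of that results DataFrame, roughly:

failed = results_df[results_df["status"] == "fail"]
print(f"Total tests: {len(results_df)}")
print(f"Passed: {len(results_df) - len(failed)}")
print(f"Failed: {len(failed)}")

if failed.empty:
    print("✅ All tests passed!")
else:
    # Expected vs. actual plus the exact query, for each failing test
    print(failed[["measure", "expected", "actual", "error", "dax_query"]])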
The model I tested against
I ran this against my CatMan BI Tenant Stats semantic model, a model I maintain for Power BI tenant administration and monitoring. It tracks workspaces, reports, datasets, users, activity, and permissions across our organization’s Power BI tenant. The model has 17 tables covering catalog metadata, user licensing, activity logs, and calendar dimensions.
This is a model that changes regularly as new workspaces spin up, users rotate, and reporting patterns shift. Exactly the kind of model where silent measure breakage is a real risk.
What I learned
Test case design is the hard part. Writing the harness code was relatively quick. Deciding which measures to test, with which filter contexts, and what counts as a correct expected value requires genuine domain knowledge. This is not something you can auto-generate meaningfully. You need a human who knows the business logic.
Filter context quoting will trip you up. DAX’s quoting rules are well-documented, but in practice, switching between single and double quotes across table names and string values is a reliable source of errors when constructing queries programmatically. The regex helper saved me repeated debugging sessions.
evaluate_dax_impersonation is the unlock. Without this function from Semantic Link Labs, you would need to stand up an XMLA endpoint connection, handle authentication separately, and manage the query lifecycle yourself. Semantic Link wraps all of that. The function takes a dataset name, a workspace name, and a DAX query string, then returns a DataFrame. That simplicity is what makes a notebook-based test harness practical.
Regression testing needs baselines. The current harness compares against hardcoded expected values. For a production CI/CD integration, you would want a baseline snapshot mechanism: run tests, store results, then compare future runs against the stored baseline rather than manually maintained numbers. I have not built that yet, but the architecture supports it.
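A rough sketch of what that could look like, assuming the baseline is persisted as a CSV in the attached lakehouse (the path and column names are placeholders):

import os
import pandas as pd

BASELINE_PATH = "/lakehouse/default/Files/dax_test_baseline.csv"  # placeholder path

def compare_to_baseline(results_df: pd.DataFrame) -> pd.DataFrame:
    # First run: persist the current actuals as the baseline.
    if not os.path.exists(BASELINE_PATH):
        results_df.to_csv(BASELINE_PATH, index=False)
        return results_df.assign(baseline=pd.NA, regressed=False)

    baseline = pd.read_csv(BASELINE_PATH)[["measure", "filter_context", "actual"]]
    merged = results_df.merge(
        baseline.rename(columns={"actual": "baseline"}),
        on=["measure", "filter_context"],
        how="left",
    )
    merged["regressed"] = merged["actual"] != merged["baseline"]
    return merged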
Where this goes next
The notebook is designed to be dropped into a Fabric workspace and run on demand or triggered as part of a deployment pipeline. Fabric notebooks can be orchestrated through pipelines, so running this harness as a post-deployment validation step is a natural fit.
I can also see extending the test case format to include tolerance thresholds for measures that fluctuate (like row counts on live data) rather than requiring exact matches. And grouping tests by business domain or model area would help when you want to run a targeted suite after changing a specific part of the model.
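Tolerance support would be a small change to the comparison step, something like:

def passes(actual, expected, tolerance=0.0):
    # tolerance is a relative fraction, e.g. 0.05 allows a 5% deviation;
    # 0.0 keeps the current exact-match behavior.
    if tolerance == 0.0:
        return actual == expected
    return abs(actual - expected) <= abs(expected) * tolerance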
For now, it works. I define my tests, I run the notebook, and I get a clear answer: did something break, or is the model still behaving as expected? That is a better answer than opening six report pages and eyeballing numbers.
The notebook is submitted to the Fabric Notebook Gallery as part of the Semantic Link Developer Experience Challenge. If you are maintaining semantic models in Fabric and have felt that same testing gap, give it a try. Let me know in the comments if you find it useful, or if you run into edge cases I have not covered.






