Unit Testing DAX with Semantic Link

Every BI developer has felt it. You change a measure, update a relationship, or rename a column in a semantic model, and then you spend the next hour clicking through report pages to check if something broke. Manual spot-checking is how most teams validate DAX today. It works until it does not.

I have been building and maintaining semantic models for years. The further I get into Fabric-based development, the more my models start to feel like production code. They power dashboards that drive decisions. They feed downstream pipelines. When something breaks, the blast radius is real. And yet, the testing story has always been: deploy, open the report, squint at the numbers.

That gap bothered me enough to do something about it.

The challenge

Microsoft recently launched the Fabric Semantic Link Developer Experience Challenge, a community contest focused on building reusable tools that improve how teams develop, test, document, and maintain semantic models in Microsoft Fabric. The requirement: use Semantic Link as a core component and solve a real developer pain point.

I have been eyeing Semantic Link Labs for a while. The library exposes evaluate_dax_impersonation, which lets you execute arbitrary DAX queries against a Fabric semantic model from a notebook. That single function is what makes programmatic testing possible.

The idea for my submission: a test harness that brings unit testing and regression detection to DAX measures. Define your test cases. Run them. Get a pass/fail report. No browser required.

What I built

The Semantic Model Test Harness is a single Fabric notebook. No external services, no complex infrastructure. You define test cases as rows in a pandas DataFrame, each row specifying three things:

  • The DAX measure to evaluate
  • The filter context to apply (a DAX boolean expression simulating a slicer or page filter)
  • The expected value

Here is what a test case definition looks like:

dax_tests = pd.DataFrame([
    {
        "measure": "# Reports",
        "filter_context": "'Catalog - Report'[Report Workspace] = 'Arla DK'",
        "expected_value": 61
    },
    {
        "measure": "# Workspace Users",
        "filter_context": "'Catalog - Workspace'[Workspace] = 'CatMan Next DK - Demo'",
        "expected_value": 15
    },
])

Each test case gets transformed into a EVALUATE ROW(...) DAX query that wraps the measure in a CALCULATE with the specified filter. The harness sends that query to the semantic model via sempy_labs.evaluate_dax_impersonation(), compares the result to the expected value, and records pass or fail.

DAX query generation

One thing I had to sort out early: DAX has opinions about quoting. Single quotes wrap table names, double quotes wrap string values. Filter expressions like 'Catalog - Report'[Report Workspace] = 'Arla DK' need the 'Arla DK' portion converted to "Arla DK" before execution. A small regex helper handles that conversion automatically.

The harness also distinguishes between boolean filter expressions (like 'Table'[Column] = "Value") and table function expressions (like FILTER(...) or VALUES(...)). Both are valid in a CALCULATE, but the detection matters for correct query construction. A simple heuristic checks for DAX table function prefixes and falls back to boolean if none are found.

The generated DAX query for a test case looks like this:

EVALUATE ROW("Value", CALCULATE([# Reports], 'Catalog - Report'[Report Workspace] = "Arla DK"))

Running the tests

Execution is straightforward. The harness loops through every row in the test DataFrame, builds the DAX query, sends it to the model, and collects results. Each test produces a row in the results DataFrame showing the measure, filter context, expected value, actual value, pass/fail status, and the exact DAX query used.

If a test fails because of a connectivity error, invalid DAX, or anything else unexpected, the exception is caught and logged as a failure with the error message preserved. No silent swallowing of errors.

The results summary counts passes and failures. Failed tests get highlighted separately so they stand out:

Total tests: 2
Passed: 2
Failed: 0
โœ… All tests passed!

When something does fail, you get the actual value alongside the expected value, plus the DAX query that was sent. That gives you everything needed to diagnose whether the issue is in the model, the test definition, or the filter context.

The model I tested against

I ran this against my CatMan BI Tenant Stats semantic model, a model I maintain for Power BI tenant administration and monitoring. It tracks workspaces, reports, datasets, users, activity, and permissions across our organization’s Power BI tenant. The model has 17 tables covering catalog metadata, user licensing, activity logs, and calendar dimensions.

This is a model that changes regularly as new workspaces spin up, users rotate, and reporting patterns shift. Exactly the kind of model where silent measure breakage is a real risk.

What I learned

Test case design is the hard part. Writing the harness code was relatively quick. Deciding which measures to test, with which filter contexts, and what counts as a correct expected value requires genuine domain knowledge. This is not something you can auto-generate meaningfully. You need a human who knows the business logic.

Filter context quoting will trip you up. DAX’s quoting rules are well-documented, but in practice, switching between single and double quotes across table names and string values is a reliable source of errors when constructing queries programmatically. The regex helper saved me repeated debugging sessions.

evaluate_dax_impersonation is the unlock. Without this function from Semantic Link Labs, you would need to stand up a XMLA endpoint connection, handle authentication separately, and manage the query lifecycle yourself. Semantic Link wraps all of that. The function takes a dataset name, a workspace name, and a DAX query string, then returns a DataFrame. That simplicity is what makes a notebook-based test harness practical.

Regression testing needs baselines. The current harness compares against hardcoded expected values. For a production CI/CD integration, you would want a baseline snapshot mechanism: run tests, store results, then compare future runs against the stored baseline rather than manually maintained numbers. I have not built that yet, but the architecture supports it.

Where this goes next

The notebook is designed to be dropped into a Fabric workspace and run on demand or triggered as part of a deployment pipeline. Fabric notebooks can be orchestrated through pipelines, so running this harness as a post-deployment validation step is a natural fit.

I can also see extending the test case format to include tolerance thresholds for measures that fluctuate (like row counts on live data) rather than requiring exact matches. And grouping tests by business domain or model area would help when you want to run a targeted suite after changing a specific part of the model.

For now, it works. I define my tests, I run the notebook, and I get a clear answer: did something break, or is the model still behaving as expected? That is a better answer than opening six report pages and eyeballing numbers.

The notebook is submitted to the Fabric Notebook Gallery as part of the Semantic Link Developer Experience Challenge. If you are maintaining semantic models in Fabric and have felt that same testing gap, give it a try. Let me know in the comments if you find it useful, or if you run into edge cases I have not covered.

Developer Experience Challenge – DAX Dependency Graph

The Fabric Semantic Link Developer Experience Challenge gave me a concrete reason to build something I had been meaning to build for a while. The result is a Fabric notebook that extracts every DAX measure from a Power BI semantic model, maps measure-to-measure dependencies into a directed graph, and produces a ranked complexity audit with risk ratings. This post is about what it does, how it works, and why the approach is useful beyond the challenge context.

The problem it solves

If you have worked on a Power BI semantic model that has grown organically over a few years, you already know the problem. Measures reference other measures. Those measures reference other measures. Nobody wrote it down. The person who built the original [Gross Margin Adj YTD] has since moved on, and the name was self-explanatory at the time.

Power BI Desktop shows you a measure’s DAX expression when you click on it. It does not show you which other measures that expression depends on, how deep the dependency chain goes, or whether any measure sits in a circular reference loop that the engine has been quietly working around. Tabular Editor helps, but it still requires manual navigation. There is no built-in view that answers “what are the ten most complex measures in this model, and which ones does everything else depend on?”

That is what this notebook answers.

Getting the measures out

The notebook uses sempy.fabric.list_measures() from Semantic Link to pull every measure from the target model in a single call. It returns a pandas DataFrame with measure name, parent table, DAX expression, visibility, and description per row.

measures_df = fabric.list_measures(dataset=DATASET_NAME, workspace=WORKSPACE_NAME)

Under the hood, Semantic Link connects via the Tabular Object Model (TOM) over the XMLA endpoint. Fabric handles authentication from the notebook’s identity. Two config values is all the setup needed:

WORKSPACE_NAME = None                      # None = current workspace
DATASET_NAME   = "YourSemanticModelName"

Then Run All.

Parsing measure references: three passes

The interesting part is working out which measures each expression actually references. DAX uses square bracket notation for both measures and columns: [Total Revenue] is a measure reference, Sales[Amount] is a column reference. The parser has to distinguish them correctly.

It does this in three passes:

  1. Strip string literals ("...") to avoid false positives. A FORMAT call like FORMAT([Date], "Total Revenue") would otherwise incorrectly register a dependency on [Total Revenue].
  2. Strip single-line and multi-line comments (// and /* */).
  3. Extract all [Name] patterns where the opening bracket is not preceded by a word character, digit, or apostrophe. That lookbehind excludes table-qualified references like Sales[Amount] and 'My Table'[Column].

The extracted names are then cross-referenced against the full set of known measure names. Anything that is not a measure name is discarded.

pattern = r"(?<![a-zA-Z0-9_'])\[([^\]]+)\]"
matches = re.findall(pattern, cleaned)
return list({m for m in matches if m in measure_names})

This correctly handles the common edge cases in real models. The one known limitation: measures referenced via SELECTEDMEASURE() or through a disconnected table SWITCH pattern cannot be resolved statically. If your model uses those patterns heavily, some dependencies will be missing from the graph.

Building the graph

Once the parser has run on every expression, the dependencies go into a NetworkX directed graph. Each measure is a node. An edge A -> B means “A’s DAX expression references measure B” โ€” A depends on B.

The graph direction is important. It lets the tool compute:

  • In-degree (fan-in): how many measures depend on this one. High fan-in means “hub” measure. Breaking it cascades everywhere.
  • Out-degree (fan-out): how many measures this one calls. High fan-out means complex composition.
  • Longest path from any node: the transitive dependency depth.
  • Cycles: circular reference chains.

From those two properties alone, the next three analyses fall out naturally.

Five analyses

Dead measures. In-degree of zero means no other measure references this one. It might be a top-level report measure used directly in a visual, or it might be genuinely unused. The notebook flags all of them; cross-referencing with report usage is a follow-up step.

Root measures. Out-degree of zero means no dependencies on other measures. These are the foundation: the SUM(Sales[Amount]) base calculations that everything else builds on. Errors in root measures propagate silently through every measure above them.

Circular references. The notebook runs Johnson’s algorithm via nx.simple_cycles() to find every elementary cycle in the graph. In a well-designed model the result is: “No circular dependencies detected.” When it is not, the full chain is printed โ€” A -> B -> C -> A โ€” so you know exactly what to untangle.

Complexity scoring. Each measure gets a weighted composite score across six dimensions:

DimensionWeightRationale
CALCULATE / CALCULATETABLE count3Context transitions are the primary source of subtle DAX bugs
Max parenthesis nesting depth1Readability proxy
Branching (IF / SWITCH)2Code path count
Filter functions (FILTER, ALL, ALLEXCEPT, etc.)2Filter-context manipulation
Dependency depth (longest downstream path)2Transitive error amplification
Fan-out (direct measure references)1Composition width

The weights are the part most open to debate. I gave CALCULATE the highest weight because context-transition confusion is, in my experience, where DAX models accumulate the most invisible risk. The depth and branching weights reflect that those properties make a measure harder to verify than to write.

Visual dependency DAG. The notebook renders the graph with matplotlib. Node color encodes complexity score on a green-to-red scale. Node size encodes in-degree, so hub measures are physically larger. An optional pyvis interactive version renders inline in the notebook via an iframe: zoomable, draggable, with hover tooltips showing measure name, table, and score.

The audit report

The final step consolidates everything into a single DataFrame sorted by risk rating, then by descending complexity score within each tier.

Risk logic:

  • Critical: participates in any circular reference
  • High: complexity score โ‰ฅ 20
  • Medium: score between 10 and 19
  • Low: everything else

The summary banner above the table looks like this:

============================================================
  DAX DEPENDENCY GRAPH - AUDIT SUMMARY
============================================================
  Total measures:            147
  Dependency edges:          312
  Unreferenced measures:     38
  Root measures (no deps):   22
  Circular references:       0
  High complexity (>=20):    11
  Medium complexity (10-19): 29
============================================================

That is the starting point for any refactoring conversation. Eleven measures scoring High is a concrete prioritisation signal: start there, not with the 38 unreferenced ones.

Limitations worth knowing

The parser only resolves measure-to-measure dependencies. Column-level lineage is out of scope. Calculation groups are not modeled as nodes, so models that use them heavily will have gaps. Cross-model references (composite model / DirectQuery to external datasets) are not in scope either.

The “unreferenced” flag does not mean unused. It means not referenced by other measures. A measure used directly in 15 report visuals will still show as unreferenced in this graph, because the tool has no report-level visibility. That cross-reference is worth doing separately with sempy.fabric.list_reports() if you are planning to delete anything.

Getting the notebook

The notebook is on GitHub: github.com/vestergaardj/DDG-DAX-Dependency-Graph. Upload it to any Fabric workspace, set the two configuration values in the Configuration cell, and run all. Everything else is automatic.

It requires semantic-link-sempy, networkx, and matplotlib, all of which come pre-installed in Fabric. pyvis is optional; the static graph still renders without it.

Are you still the air traffic controller?

In February 2025 I wrote about building an event-driven ETL system in Microsoft Fabric. The metaphor was air traffic control: notebooks as flights, Azure Service Bus as the control tower, the Bronze/Silver/Gold medallion layers as the runway sequence. The whole system existed because Fabric has core-based execution limits that throttle how many Spark jobs run simultaneously on a given capacity SKU.

The post was about working around a constraint. You could not just fire all your notebooks at once. You needed something to manage the queue.

More than a year on, it is worth being honest about what held up and what has changed.

The original architecture, briefly

Three components:

Azure Service Bus acted as the message queue. When a source system finished loading raw data, it dropped a message onto the bus. Each message represented one flight waiting for clearance.

A capacity monitor notebook ran on a short schedule. It checked how many notebooks were currently executing and compared that count against the available core capacity. If capacity was available, it pulled the next message from the queue and triggered the appropriate notebook via the Fabric REST API.

The processing notebooks were standard Bronze, Silver, and Gold Spark notebooks. They ran as normal Fabric notebooks with no awareness of the orchestration layer above them. On completion, they acknowledged the Service Bus message.

The deliberate design choice was to keep the notebooks clean and put the complexity in the orchestration layer. A notebook should not need to know whether it is being called by a scheduled job, a pipeline, or a service bus monitor. That separation held up well.

What has changed in Fabric

Two relevant changes since February 2025:

Native job queueing

Fabric now queues Spark jobs automatically when capacity is exhausted rather than rejecting them outright. Jobs queue in FIFO order and wait up to 24 hours before expiring. The platform starts them automatically as capacity becomes available.

There is a hard constraint that limits how far this goes: the queue depth is bounded by the SKU’s CU allocation. It is not unlimited. A sudden burst of 100+ notebooks would exceed both the concurrent execution limit and the queue depth, and the excess jobs would be rejected rather than queued — the same failure mode as before native queueing existed.

So native FIFO queueing helps if your workload arrives gradually. It does not change the original problem if your trigger pattern involves large simultaneous batches. The Service Bus buffer sits outside Fabric and has no queue depth constraint. That distinction is why the architecture is still relevant.

Job-level bursting controls

Fabric capacities support bursting at up to 3x the nominal CU allocation. You can now disable bursting for specific Spark jobs, giving finer-grained control over which jobs are allowed to consume burst headroom. Useful for ring-fencing critical workloads. This is an additive improvement to the platform regardless of which orchestration approach you use.

What held up

The decoupled architecture held up. Keeping orchestration logic out of the processing notebooks made them easier to test, modify, and redeploy independently. A notebook that does not know how it was triggered is easier to reason about than one that contains scheduling and queueing logic alongside its data transformation. Nothing about that changed.

Azure Service Bus held up as a reliable messaging backbone. At-least-once delivery, dead-letter queues, message peek-lock, and configurable time-to-live are production-grade features. There were no reliability issues with the messaging layer over the period.

The Bronze, Silver, Gold medallion structure held up. That is sound data architecture independent of the orchestration tool above it.

What I would do differently today

Not much, given the workload pattern. The original system was designed for burst scenarios, and the burst scenario has not changed. The native queue depth limit tied to CU allocation means there is still no Fabric-native replacement for an external buffer that absorbs an unbounded number of incoming messages and feeds them into Fabric at a controlled rate.

The one thing I would reconsider is the capacity monitor notebook’s polling loop. It runs on a short schedule and checks active job counts before pulling the next message. That works, but it adds latency and a scheduling dependency. Whether the Fabric REST API now exposes enough observability to build a leaner version of that loop is worth investigating — but that is an implementation detail, not a reason to replace the architecture.

The honest production reality

The system is still running. A working production system with understood failure modes and people who know how to debug it has a high replacement threshold. The architecture from February 2025 has not needed replacing because the problem it solved has not stopped existing.

The question I get asked: is there a more native way to do this now? The answer is no, not for the burst-buffering problem specifically. Fabric pipelines handle sequential orchestration well. Native FIFO queueing handles gradual workloads within its depth limit. Neither absorbs an unbounded burst and controls submission into a capacity-constrained runtime. Service Bus still does that job, and it still does it well.

The air traffic controller is still in the tower. The runway is a little wider than it was, but the traffic has grown too.

W-Order Just Dropped in Microsoft Fabric

I was poking around in the Lakehouse settings after the latest Fabric update rolled out last night, and noticed something I have not seen documented anywhere yet.

Buried in the optimization section of the Lakehouse table properties, there is a new toggle: W-Order.

If you have been following along with V-Order since Fabric went GA, you know it already does a solid job of optimizing Delta tables for read performance. W-Order is apparently the next generation. The acronym, according to the tooltip, stands for Wavelet-Optimized Recursive Delta Encoding Rewrite. It claims to apply wavelet decomposition to the Parquet column chunks and recursively re-encode them into what the UI calls “spectral micro-partitions” based on historical query access patterns.

I have no idea what half of that means. But I had to test it. Obviously.


Where to Find It

Open your Lakehouse in the Fabric portal. Navigate to Table properties on any Delta table, scroll down past the standard V-Order settings, and you should see the new section sitting right below it.

Flip the toggle, confirm the dialog, and the table enters what the UI calls a “spectral rewrite” phase. On my test table (around 48 million rows, partitioned by month), this took about four minutes. A small progress indicator shows up next to the table name in the explorer while it runs.

Spectral rewrite in progress, indicated by the spinner next to the table name

The Numbers

I ran the same aggregation query on the table before and after enabling W-Order. Same capacity, same time of day, same query, three runs each.

RunBefore (seconds)After (seconds)
114.30.34
213.80.33
314.10.34
Same query, same capacity, same data. Three consecutive runs.

That is roughly a 42x improvement. On a simple GROUP BY with a SUM. I nearly spilled my coffee.

0.34 seconds on 48 million rows. I had to run it again to believe it.

A Few Things to Note

  • Always check your runtime version first, the feature requires the April 2026 update
  • Premium or F64+ capacity is required. The toggle simply does not show up on lower SKUs
  • Reoptimization consumes CU credits, so keep an eye on your capacity metrics while the spectral rewrite runs
  • It only appears on Delta tables, not on shortcuts or mirrored tables
  • Latency on tables with heavy concurrent write loads is unknown, so proceed with some caution there

Microsoft has not published any documentation for this yet as far as I can tell. The feature might still be rolling out, so if you do not see the toggle, give it a day or two. The latest runtime update notes are here.

Go check your Lakehouse settings. And if something about this whole thing feels off, maybe take a second look at today’s date before you reorganize your entire data estate.

The Future of BI: Will AI Replace BI Developers?

I have been asked this question at every conference I have attended in the last two years. Not always directly. Sometimes it arrives as “what do you think about Copilot” or “is there still a point learning DAX properly.” But the underlying question is always the same: is my job going to exist in five years?

It is a fair question. When you watch GitHub Copilot complete a CALCULATE function before you have finished typing the first argument, or paste a business requirement into Claude and get back a working Power Query transformation, it is easy to understand why people are asking it.

Here is my honest take, as someone who has been building BI solutions since 2006 and who has spent the last year testing these tools in real work, not in demos.

What These Tools Can Actually Do Today

I have been using ChatGPT, Microsoft Copilot and Claude regularly over the last eighteen months, in actual client work and personal projects. Not in theory. In my editor, on real data models, with real business requirements.

ChatGPT is strong at generating DAX and SQL when you give it enough context. If you describe the table schema, the business logic and the expected output format, the first attempt is often close enough to use with minor adjustments. I have used it to draft measures I would have spent an hour on, in five minutes. It is not always right on the first pass, but it moves fast enough that iteration costs less than starting from scratch.

Microsoft Copilot inside Fabric and Power BI has improved noticeably over the last year. The report creation assistant went from generating generic placeholder visuals in early 2024 to producing layouts I would actually take and refine for production use in 2025. It will not replace design judgment, but it removes the blank canvas problem. For report creation at scale, that matters.

Claude has been the most useful for reasoning about model design. I used it to reverse-engineer a semantic model from an annotated schema diagram and it handled the relationships and measure dependencies better than I expected. I also hit real limits: it had no knowledge of my actual data, made assumptions that were wrong for our domain, and I needed five or six iterations before the output was usable. The hiccups were real. I am not smoothing those over.

The pattern that holds across all three: if the task is well-defined, the context is fully provided, and the output can be verified quickly, these tools are fast and genuinely useful. That describes a significant portion of the day-to-day coding work in a typical BI role.

Where They Still Fall Short

None of these tools know your business.

That sentence sounds simple, but it is where the real gap sits in practice. Generating a CALCULATE function is not difficult once you know the filter context. The difficult part is knowing that in your organization, “active customers” means accounts with at least one transaction in the last 90 days, but only in the consumer segment, and that this definition changed in Q2 2024 and needs to be handled differently across historical and current period comparisons.

That context lives in your head, in Confluence pages nobody reads, in a meeting that happened eighteen months ago, and in an email from a finance analyst who has since left the company. No AI tool picks that up from a prompt. You bring it, or it does not exist in the output.

The same gap shows up in data quality judgment. AI tools will generate a transformation pipeline from a spec without questioning whether the spec is correct. They will not notice that the order date column has 8% null values, or that this matters for the revenue calculation, or that the exceptions exist because of a legacy system migration that your company did in 2019. You notice, because you have seen the data before and know what those patterns mean.

Report creation has the same ceiling. The AI can build a layout, suggest a chart type and write a title. It cannot make the call that this dashboard will be shown on a 40-inch screen in a warehouse and that the font size from the default template will be unreadable from three meters away. That judgment comes from having sat with the people who use the reports.

What This Actually Means, by Role

The impact is not uniform. It depends on how much of your current working week involves mechanical repetition.

For BI developers, the most immediate change is in code generation. Boilerplate DAX, repetitive ETL transformations, standard report templates: these are exactly the tasks AI accelerates most. If this work is a large part of your week, your output volume will increase and expectations around it will rise with it. That is not a threat if you understand it early enough to stay ahead of it.

For data analysts, the change is most visible in exploration speed. Asking Claude or Copilot to produce a first-pass analysis of a dataset, flag anomalies or suggest groupings gets you to a hypothesis faster. The interpretation of that hypothesis, and the judgment about whether it is the right hypothesis for the actual decision being made, remains yours.

For data engineers, pipeline boilerplate generation and schema documentation are straightforward wins. Debugging complex transformation failures where the error is opaque is an area where these tools also provide real value, particularly if you can paste the full stack trace and table definitions into the prompt.

For data architects, the tools work well as thinking partners for structured design questions. Talking through a proposed model, generating documentation drafts, checking naming conventions across a schema. The decisions about what to govern, where domain boundaries sit, and how to design for the organizational reality rather than the theoretical ideal still require judgment that comes from knowing the context.

For data governance specialists, the upside is in documentation and lineage drafting. The real governance work, defining what quality means for a specific data product and making that stick across teams, is still a people and process problem that no AI tool solves for you.

My Perspective as a Practitioner

I am not going to tell you AI will not change BI work. It already has. But the question of whether it replaces BI developers wholesale is the wrong framing.

The more useful question is: which parts of your current work are mechanical enough that AI tools can do them faster and cheaper? Be honest about that list. If it is long, that is information worth having now rather than in two years when the market has already adjusted.

Here is what I have observed across the work I do and the people I talk to at conferences: the work that clients are most anxious, most uncertain, and most willing to pay for is not the part AI is good at. It is the part that requires knowing their business, understanding who trusts what report and why, having the conversation with the finance director who distrusts the new model, and standing next to the operations manager while they explain what the dashboard actually needs to show.

That is still practitioner work. AI does not do it. It accelerates the technical work that happens around it, which gives you more time for the part that matters most.

There is one shift I have started to notice though, and it is worth naming. The finance director on the other side of the table is also using these tools. They arrive at the meeting having already asked Claude to explain the variance, or having had Copilot summarize the dashboard. They come in with a better baseline understanding than they had two years ago, and that makes for a more productive conversation. You spend less time on mechanics and more time on the actual decision. That is a good thing, not a threat.

The hallucination problem is real and it is the most reasonable objection anyone raises to trusting AI output in production work. But it is also improving faster than most people expected a year ago. The gap between what these tools got wrong in early 2024 and what they get wrong now is measurable. I expect that to continue. The sensible approach is to verify outputs where the stakes are high and to track how often you have to correct them over time. That number has been going down in my experience.

The BI developers I have watched get uncomfortable recently are the ones who built careers around syntax knowledge, template maintenance and format conversion. Those jobs are genuinely changing. The developers who are doing well are the ones who understood that the syntax was never the point: the business problem was the point, and tools that help with the syntax give them more time for the problem.

My advice is straightforward: learn the tools, use them on real work rather than tutorial datasets, and find out where they fail on your actual data in your actual domain. That is the only way to know where your judgment matters more than their suggestion.

For me, the answer to whether AI will replace BI developers is: not the ones who are clear about what they are actually being hired to do.

What has your experience been with AI tools in your daily BI work? I would like to know what people are actually finding useful versus what sounds better in a conference session than it works in practice.