The AI Result Looks Good. Now Who Signs It Off?

There is a moment in BI projects that everybody knows. The report is built. The dashboard looks fine. The first review starts. And then someone from the business says: "That number is wrong." Nobody enjoys that moment, but at least it gives the project something concrete.

Thomas Howert

Co-Founder & Senior Advisor for BI, data projects, and technology decisions

The CRM says there are 1,240 active customers. The dashboard shows 1,217. Now the work has a direction. Maybe the filter is wrong. Maybe “active customer” was never properly defined. Maybe the nightly load failed. Maybe the dashboard is right and the source system is messier than expected.

Whatever the reason, the discussion has an object.

There is a number, a source, and a deviation

BI acceptance can still take weeks. People argue about definitions, ownership, exceptions and historical data. But the basic logic holds: the analytical result has to reconcile with whatever the organisation treats as its source of truth.

The report says X. The source says Y. Now we need to know why.
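As a minimal sketch of that logic, using the two customer counts from the example above: the function name `reconcile` and the `tolerance` parameter are illustrative assumptions, not part of any specific BI tool.

```python
def reconcile(report_value: int, source_value: int, tolerance: float = 0.0) -> dict:
    """Compare a reported figure with the figure from the source of truth."""
    deviation = report_value - source_value
    if source_value == 0:
        # Avoid dividing by zero: only an exact match reconciles against an empty source.
        relative = 0.0 if deviation == 0 else float("inf")
    else:
        relative = abs(deviation) / source_value
    return {
        "deviation": deviation,
        "relative": round(relative, 4),
        "reconciled": relative <= tolerance,
    }


# Dashboard shows 1,217 active customers, the CRM says 1,240.
result = reconcile(report_value=1217, source_value=1240)
print(result)  # {'deviation': -23, 'relative': 0.0185, 'reconciled': False}
```

The check does not explain the 23 missing customers. It only makes the deviation explicit, which is what gives the discussion its object.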

With AI, the review often starts differently

The result looks good. The wording is clean. The explanation sounds plausible enough that nobody immediately pushes back.

Then, usually around the third or fourth real example, someone says:

"Yes, but we would not do that in this case."

In BI, acceptance often starts when the numbers do not reconcile. With AI, acceptance often starts when the result looks right, but the business would not act on it.

Because now the issue is no longer only whether a value matches a field in a source system. The system has interpreted something: a customer situation, a support case, a risk signal, a contract clause, a process status.

And interpretation is much harder to sign off than a number.

Take a simple sales case

A dashboard shows open pipeline, last activity, expected revenue, product usage and account status. Those values can be checked against the CRM. The data may be messy, but the direction of the check is clear.

An AI assistant looks at the same account and concludes: contact this customer this week, because of expansion potential, recent engagement and a possible churn signal. That may be useful. It may be exactly what people want from AI.

But how do you accept it?

There is no CRM field called “correct next best action”.

The last activity may have been included but misunderstood. Three opened emails may look like engagement without meaning buying intent. The churn signal may be based on usage data that is already outdated. There may be an open escalation that changes everything. Or the suggestion may fit a generic sales playbook but be wrong for this specific account.

The dangerous part: the output may still look fine.

When AI fails in an obvious way, it is easy to catch. It invents a customer, uses the wrong date, confuses two accounts or produces something absurd. Annoying, but not what I worry about most. The more dangerous failures are often not obvious at first.

A customer summary reads well but misses the newest escalation. A prioritisation looks structured but relies on outdated context. A recommendation sounds logical but does not fit the real process. A classification is plausible but would send the case to the wrong team.

Nothing explodes. The process just moves slightly in the wrong direction.

That is harder to detect than a wrong number in a dashboard. A wrong number creates immediate resistance. Someone sees it and says: “This cannot be right.” With AI, the result can pass the first impression because it sounds reasonable.

That is exactly why “looks good” is a weak acceptance criterion.

This becomes critical once the AI result leaves the chat window

A rough summary next to the workflow can still be reviewed, ignored or challenged. A priority flag in CRM is already closer to the operation. A suggested next action inside a sales workflow is closer again. And once the system starts writing back into tools, opening tickets or changing statuses, the question is no longer whether the result sounded useful in a demo.

The question is where the result is allowed to go.

Can it move forward without approval? Where would a mistake become visible? Who is responsible if the recommendation was plausible, but wrong? Most demos do not answer this. They show that AI can generate something useful. They do not show clearly enough what happens after the useful-looking result appears.

Some checks should be boring

Wrong customer, wrong time period, wrong source, missing required field, unsupported claim. If that happens, the result does not move forward. No philosophical discussion needed.
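A rough sketch of what such boring checks could look like in code. The field names (`customer_id`, `period`, `source`, `claims`, `evidence`) and the list of approved sources are hypothetical; a real result schema will differ.

```python
REQUIRED_FIELDS = ("customer_id", "period", "source", "claims")  # hypothetical schema
APPROVED_SOURCES = {"crm", "support_desk"}  # hypothetical list of sources that count


def boring_checks(result: dict, expected_customer: str, expected_period: str) -> list[str]:
    """Return hard failures; an empty list means the result may go to further review."""
    failures = []
    missing = [name for name in REQUIRED_FIELDS if not result.get(name)]
    if missing:
        failures.append(f"missing required field: {', '.join(missing)}")
    if result.get("customer_id") != expected_customer:
        failures.append("wrong customer")
    if result.get("period") != expected_period:
        failures.append("wrong time period")
    if result.get("source") not in APPROVED_SOURCES:
        failures.append("wrong source")
    # A claim counts as supported only if it points at some piece of evidence.
    for claim in result.get("claims") or []:
        if not claim.get("evidence"):
            failures.append(f"unsupported claim: {claim.get('text', '?')}")
    return failures


ai_result = {
    "customer_id": "C-1001",
    "period": "2024-Q4",
    "source": "crm",
    "claims": [
        {"text": "usage dropped", "evidence": "usage_report_row_17"},
        {"text": "ready to expand", "evidence": None},
    ],
}
print(boring_checks(ai_result, "C-1001", "2024-Q4"))
# prints ['unsupported claim: ready to expand']
```

Any non-empty list stops the result. The check only confirms that evidence is referenced, not that the evidence was read correctly.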

But those checks only catch the obvious problems.

A result can use the approved source and still interpret it badly. It can reference the correct document and still miss the exception that matters. It can summarise the facts accurately and still suggest a next step nobody with process knowledge would take.

That is where domain expertise becomes unavoidable. Someone who knows the process has to ask whether the result would hold up in practice.

Would we really call this customer now? Would we really escalate this case? Would we really let the system update that field?

These are not benchmark questions. They are acceptance questions. And they are specific to one company, one process and one consequence.

The non-deterministic part makes this harder still

Nobody needs AI to produce the exact same sentence every time. But the decision behind the wording needs to be stable enough. If the same customer is “high priority” in one run, “wait” in the next, and “possible churn” in the third, that is not a style issue. It is an operational reliability issue.
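One simple way to make this measurable is to run the same case several times and compare only the decision label, not the wording. The sketch below assumes the decision has already been extracted as a short label per run; the function name and the 0.8 threshold are illustrative assumptions, not a recommended standard.

```python
from collections import Counter


def decision_stability(decisions: list[str]) -> tuple[str, float]:
    """Return the most frequent decision and the share of runs that agree with it."""
    if not decisions:
        raise ValueError("need at least one run")
    label, count = Counter(decisions).most_common(1)[0]
    return label, count / len(decisions)


# The three runs from the example above: same customer, three different decisions.
runs = ["high priority", "wait", "possible churn"]
label, agreement = decision_stability(runs)

# Assumed threshold: at least 80% of runs must land on the same decision.
stable = agreement >= 0.8
print(label, round(agreement, 2), stable)  # high priority 0.33 False
```

With only a third of the runs agreeing, the label "high priority" is little more than a coin toss, however well each individual run is worded.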

A second model can help. It can challenge the result, compare claims against the source and flag missing evidence. That can be useful.

But it should be a challenger, not the final authority. Otherwise one unchecked AI output approves another unchecked AI output, and the evaluation layer only looks more sophisticated than it really is.

In practice, this is less about building an elegant evaluation framework and more about deciding what is allowed to move forward. Which sources count. Where the system stops. Where a person has to look at it. And how we reconstruct later why a result was accepted.
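To make that concrete, here is a deliberately small sketch of such a gate. Everything in it is an assumption for illustration: the destination names, the approval policy and the record fields. It shows one possible shape, in which hard check failures block a result, the challenger can only escalate, and every decision leaves a record.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical policy: which destinations need a named human approver.
NEEDS_APPROVAL = {"chat_summary": False, "crm_flag": True, "write_back": True}


@dataclass
class AcceptanceRecord:
    destination: str
    sources: list
    check_failures: list
    challenger_flags: list
    approver: Optional[str]
    outcome: str
    decided_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def decide(destination, sources, check_failures, challenger_flags, approver=None):
    """Decide whether a result may move forward, and record why."""
    if check_failures:
        outcome = "blocked"  # hard failures stop the result, approver or not
    elif challenger_flags and approver is None:
        # The challenger can escalate a result, but never approve it.
        outcome = "needs_review"
    elif NEEDS_APPROVAL.get(destination, True) and approver is None:
        # Unknown destinations are treated as requiring approval.
        outcome = "needs_review"
    else:
        outcome = "accepted"
    return AcceptanceRecord(
        destination, sources, check_failures, challenger_flags, approver, outcome
    )


record = decide("crm_flag", sources=["crm"], check_failures=[], challenger_flags=[])
print(record.outcome)  # needs_review
print(json.dumps(asdict(record), indent=2))  # what would be stored for later review
```

Note that no branch lets a clean challenger result replace the human approver for a destination that requires one. The stored record is what lets someone reconstruct later which sources counted and who signed off.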

The difficult question is not whether AI can generate a plausible answer. It can. The difficult question is whether that answer can be accepted by the organisation and safely handed into the next process step.

In analytics, we learned to ask whether the number matches the source of truth.

With AI, we need to learn how to sign off an interpretation. Because the result can look good long before it is safe to use.
