AI in Finance
Data Quality for AI in Finance: Why Your Data Probably Isn't Ready
19 January 2026
The number one reason AI projects in finance underperform is not the AI. It is the data. The model is fine. The data it is working on is not.
I now treat a data quality assessment as a prerequisite for any AI implementation conversation. Not a preparatory step. A prerequisite. The organisations that skip this and go straight to vendor evaluation are the ones that end up with automation rates 30 points below what the pilot achieved, or with AI recommendations that are systematically wrong in ways that take months to trace back to their source.
Data quality problems are fixable. They require effort and discipline, but they are finite. The work required is almost always less than people expect and far less than the cost of a failed AI implementation.
The five data quality dimensions that matter for finance AI
Not all data quality problems are equal. For AI in finance, five dimensions determine whether your data will support the automation and insight you are trying to build.
Completeness. Missing values degrade AI performance in ways that are not always obvious. A supplier master where 30% of records have no payment terms. A chart of accounts where cost centre codes are populated on some transactions and blank on others. A customer database where contact details, credit limits, and trading terms are inconsistently captured. When an AI model encounters missing fields, it either ignores those records or makes inferences that may be systematically wrong. Either way, the output is less reliable than your automation rate targets require.
In finance, completeness problems tend to concentrate at the edges of processes: new entity additions, legacy data migrations, manual workarounds for system limitations. The fields that are optional in the system are almost always incomplete in practice.
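A completeness check of this kind is easy to run yourself. The sketch below profiles what share of records have each field populated, treating blanks and nulls as missing. It assumes records loaded as dicts (for example via `csv.DictReader`); the field names are illustrative, not a prescribed schema.

```python
# Sketch: field-level completeness profile for a supplier master.
# Field names ("payment_terms", "bank_details") are illustrative.

def completeness_profile(records, fields):
    """Return the share of records with a non-blank value per field."""
    total = len(records)
    profile = {}
    for field in fields:
        populated = sum(
            1 for r in records
            if str(r.get(field, "") or "").strip()  # None and "" both count as missing
        )
        profile[field] = populated / total if total else 0.0
    return profile

suppliers = [
    {"name": "Acme Ltd",   "payment_terms": "30 days", "bank_details": "GB..IBAN"},
    {"name": "Beta Plc",   "payment_terms": "",        "bank_details": "GB..IBAN"},
    {"name": "Gamma GmbH", "payment_terms": None,      "bank_details": ""},
]

print(completeness_profile(suppliers, ["payment_terms", "bank_details"]))
# payment_terms is populated on 1 of 3 records; bank_details on 2 of 3
```

A supplier master where `payment_terms` scores 0.70 is the "30% of records have no payment terms" situation described above, made measurable.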
Consistency. The same thing represented differently across systems and over time. A customer who appears as “Tesco” in your ERP, “Tesco Stores Ltd” in your CRM, and “TESCO PLC” in your reporting database. A cost centre called “Marketing” in one entity and “Mkt” in another, following an acquisition where nobody rationalised the chart of accounts. A supplier whose payment terms are “30 days” in the supplier master and “45 days” on every contract since 2019.
Consistency problems are particularly damaging for AI because machine learning works by recognising patterns. Inconsistency breaks the patterns. A deduplication algorithm that cannot reliably identify the same supplier across two systems cannot support automated three-way matching. A forecasting model that cannot consistently map cost centres across entities cannot produce accurate consolidated forecasts.
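The Tesco example above can be partially addressed with simple normalisation before matching: lowercase the name, strip punctuation, and drop legal-form suffixes so the comparable core remains. A minimal sketch, assuming an illustrative (not exhaustive) suffix list:

```python
import re

# Sketch: normalise entity names before cross-system matching, so that
# "Tesco", "Tesco Stores Ltd", and "TESCO PLC" key to the same core form.
# LEGAL_SUFFIXES is an illustrative assumption, not a complete list.

LEGAL_SUFFIXES = {"ltd", "limited", "plc", "stores", "gmbh", "inc", "llc"}

def normalise_name(name):
    tokens = re.findall(r"[a-z0-9]+", name.lower())   # lowercase, drop punctuation
    core = [t for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(core)

for raw in ("Tesco", "Tesco Stores Ltd", "TESCO PLC"):
    print(raw, "->", normalise_name(raw))
# all three normalise to "tesco"
```

Normalisation is a pre-processing step, not a substitute for a governed master: it makes matching feasible, but the authoritative record of which entities are the same still has to be maintained somewhere.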
Accuracy. Data that exists but does not reflect reality. Fixed asset registers that include assets disposed of years ago. Inventory records that diverge from physical counts by 15%. Debtor balances that have not been reconciled to customer statements in six months. Revenue recognition entries that reflect timing differences but are labelled in ways that obscure this.
Accuracy problems are the most consequential for AI because they are the hardest to detect automatically. Completeness and consistency can often be caught by rules-based validation. Accuracy requires comparison against an external reference: a physical count, a bank statement, a supplier confirmation. AI trained on inaccurate data learns the inaccuracies and replicates them at scale.
Timeliness. Data that is current enough to be useful. A supplier master last cleansed 18 months ago. Bank reconciliations running four days behind. Inventory updated weekly in a business where daily stock movements matter for forecasting. Cost centre allocations not updated since a restructuring six months ago.
Timeliness matters most for AI applications that depend on current data: cash flow forecasting, anomaly detection, approval routing. An anomaly detection system working off yesterday’s data in a high-volume AP environment is detecting anomalies that have already been paid. That is not control. It is reporting.
Lineage. Can you trace where each data point came from? Which system generated it, when it was last updated, and what transformation logic was applied on the way to the dataset your AI is working from? Data lineage matters for two reasons. First, it enables root cause analysis when AI outputs are wrong: you can trace back from the wrong result to the data input that caused it. Second, it is an audit requirement. An AI-assisted decision that your auditors cannot trace to a defined, documented data source is not a controlled decision.
In every finance environment I have worked in, data lineage is informal. People know roughly where the data comes from. They cannot document it at the level an auditor or an AI governance framework requires. That gap needs to close before you are running production AI on financial data.
The most common data quality problems by function
Different parts of the finance function have characteristic data quality failure modes. Knowing where to look saves time in the diagnostic.
AP supplier master. Duplicate supplier records are near-universal: suppliers added multiple times under slightly different names, legacy suppliers from acquisitions that were never deduped, test records that made it to production. Duplicate suppliers corrupt matching accuracy and create payment risk. Unverified bank details are a fraud exposure. Stale payment terms drive late payment penalties and damage supplier relationships. The supplier master is foundational for AP automation and it is almost always in worse shape than the finance team believes.
GL chart of accounts proliferation. Finance functions that have grown through acquisition or organic expansion almost always have chart of accounts problems: codes added locally without central governance, the same nominal account used for different purposes across entities, retired codes that still carry balances. A chart of accounts that started at 200 codes and is now at 600 without any rationalisation is a forecasting problem waiting to surface. AI models that work across entities need a consistent chart of accounts. Cleaning it up is uncomfortable and worth doing.
Intercompany inconsistencies. Intercompany transactions that do not reconcile between entities, differences in how intercompany balances are coded and described, elimination processes that rely on manual adjustment rather than systematic matching. For group reporting and consolidation AI, intercompany data quality is critical. A reconciliation AI tool applied to inconsistent intercompany data produces confident-sounding wrong answers.
Cost centre naming across acquisitions. Post-acquisition, the acquiring company’s cost structure and the acquired company’s cost structure need to be mapped together. This is almost never done properly in the heat of an integration. The result is years of reporting where cost centres cannot be compared across entities because the naming conventions are incompatible. The problem is visible to anyone looking at consolidated reporting. It is often not fixed because nobody wants to quantify what it is costing.
How to assess where you actually are
The most useful thing you can do before starting an AI project in finance is a structured data quality audit of the specific data your AI will use.
The method is simple. Start with the highest-volume AI candidate. If that is AP automation, start with your supplier master and invoice data. Pull 100 records at random. For each of the five quality dimensions, score each record: missing values, consistency with your definitions, accuracy against a reference, timeliness, traceability of origin.
Tally the results. If 40% of records have completeness issues, you have a completeness problem that will suppress your automation rate by a material amount. If 25% have consistency issues across systems, that is your deduplication problem. The distribution tells you where to focus remediation effort.
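The sample-and-tally step above can be sketched as a short script: draw a random sample, apply one pass/fail check per dimension, and report the failure rate for each. The check functions here are stand-ins; in practice each one encodes your own definitions (mandatory fields, reference data, review dates), and the field names are illustrative.

```python
import random

# Sketch of the audit tally. Each lambda is a placeholder for a real
# pass/fail check against your own quality definitions.
DIMENSION_CHECKS = {
    "completeness": lambda r: bool(str(r.get("payment_terms", "") or "").strip()),
    "consistency":  lambda r: r.get("terms_match_contract", True),
    "accuracy":     lambda r: r.get("verified_against_statement", True),
    "timeliness":   lambda r: r.get("reviewed_in_last_12m", True),
    "lineage":      lambda r: r.get("source_system") is not None,
}

def audit(records, sample_size=100, seed=1):
    """Sample records and return the failure rate per quality dimension."""
    sample = random.Random(seed).sample(records, min(sample_size, len(records)))
    return {
        dim: sum(1 for r in sample if not check(r)) / len(sample)
        for dim, check in DIMENSION_CHECKS.items()
    }

records = [
    {"payment_terms": "30 days", "source_system": "ERP"},
    {"payment_terms": "",        "source_system": "ERP"},
    {"payment_terms": "45 days", "source_system": None},
    {"payment_terms": "60 days", "source_system": "CRM"},
]
print(audit(records))
# completeness and lineage each fail on 1 of the 4 sampled records
```

A completeness result of 0.40 from this tally is exactly the "40% of records have completeness issues" threshold described above.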
This takes a working day if you are systematic about it. It tells you more than a month of vendor conversations. Vendors cannot tell you what your data quality is like. Only you can assess that.
One thing I have found consistently: finance teams almost always underestimate their data quality problems. Not because they are careless, but because the problems accumulate slowly, concentrate in areas that are not regularly reviewed, and are invisible in the normal run of daily processing. The audit surfaces what the daily work obscures.
The AI readiness assessment for finance functions includes a data quality dimension that maps directly to this diagnostic. It gives you a scoring framework for capturing and communicating the results.
Fixing it
Data quality remediation has a reputation for being endless. In practice, it is not. It is a finite body of work with a clear end state: data that meets defined quality standards for a specific AI use case.
Deduplication. Systematic deduplication of supplier masters, customer masters, and employee records using automated matching tools that compare names, addresses, tax numbers, and bank details. For most finance functions, a deduplication exercise on the supplier master takes two to four weeks of focused effort and, with proper governance in place afterwards, does not need to be repeated.
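The core of such a matching tool is small. A minimal sketch, using fuzzy name similarity plus exact matches on strong identifiers, with the standard-library `difflib.SequenceMatcher`; the field names and the 0.85 similarity threshold are illustrative assumptions, and a real exercise would add address matching and human review of the flagged pairs:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sketch: flag likely duplicate suppliers. A pair is flagged if the names
# are highly similar OR they share a tax number or bank account.
def likely_duplicates(suppliers, name_threshold=0.85):
    flagged = []
    for a, b in combinations(suppliers, 2):
        name_score = SequenceMatcher(
            None, a["name"].lower(), b["name"].lower()
        ).ratio()
        same_tax = a.get("tax_no") and a.get("tax_no") == b.get("tax_no")
        same_bank = a.get("bank") and a.get("bank") == b.get("bank")
        if name_score >= name_threshold or same_tax or same_bank:
            flagged.append((a["name"], b["name"]))
    return flagged

suppliers = [
    {"name": "Acme Supplies Ltd",     "tax_no": "GB111", "bank": "GB29NWBK1"},
    {"name": "ACME Supplies Limited", "tax_no": "GB111", "bank": "GB29NWBK2"},
    {"name": "Northern Widgets Plc",  "tax_no": "GB222", "bank": "GB29NWBK3"},
]
print(likely_duplicates(suppliers))
# flags the Acme pair: same tax number, near-identical names
```

The strong-identifier checks matter: two records with different-looking names but the same tax number are far more likely to be duplicates than two records with similar names and nothing else in common.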
Standardisation. Agreeing and enforcing standard formats for fields that need to be consistent: cost centre naming conventions, supplier category codes, payment terms definitions, currency codes. This requires a decision-making process more than technical work. Agreeing on the standard is usually harder than implementing it.
Governance rules. Validation rules at the point of data entry that prevent future quality problems from being created: mandatory fields that cannot be left blank, lookup fields that constrain entries to defined values rather than free text, approval requirements for new supplier additions. System-enforced consistency removes reliance on individual discipline. The connection between system design and data quality is relevant here: systems enforce rules that spreadsheets cannot.
Validation at point of entry. The cheapest time to fix a data quality problem is before it enters the system. A supplier onboarding process that requires bank detail verification before a supplier is activated, a cost centre allocation process that requires a valid code before an invoice can be posted, an inventory update process that requires a reference count: all of these prevent future quality problems from accumulating.
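Point-of-entry rules of this kind are straightforward to express in code. A minimal sketch of a supplier onboarding gate, combining mandatory fields, a constrained lookup, and a bank verification requirement; the field names and the approved payment-terms list are illustrative assumptions:

```python
# Sketch: point-of-entry validation for a new supplier record.
# An empty error list means the record is safe to activate.

ALLOWED_PAYMENT_TERMS = {"14 days", "30 days", "45 days", "60 days"}  # illustrative
MANDATORY_FIELDS = ("name", "payment_terms", "bank_details")

def validate_new_supplier(record):
    """Return a list of rule violations; empty means safe to activate."""
    errors = []
    for field in MANDATORY_FIELDS:
        if not str(record.get(field, "") or "").strip():
            errors.append(f"missing mandatory field: {field}")
    terms = record.get("payment_terms")
    if terms and terms not in ALLOWED_PAYMENT_TERMS:
        errors.append(f"payment_terms not in approved list: {terms!r}")
    if record.get("bank_details") and not record.get("bank_verified", False):
        errors.append("bank details present but not independently verified")
    return errors

print(validate_new_supplier(
    {"name": "Acme Ltd", "payment_terms": "37 days", "bank_details": "GB..IBAN"}
))
# two violations: non-standard payment terms, unverified bank details
```

The lookup constraint is the quiet workhorse here: free-text payment terms produce "30 days", "30d", "Net 30", and "thirty days" in the same column, and a constrained list prevents all four.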
The sequence matters. Fix existing problems first, then implement governance to prevent new ones. Governance alone is not enough: validation rules at the point of entry stop new duplicates from being created, but the duplicates already in the system continue to degrade your AI performance until they are removed.
The return on doing this work
Organisations that complete a data quality assessment before an AI pilot are three times more likely to achieve their target automation rates in production. That figure comes from implementation analysis, and it reflects a structural reality: AI tools do not adapt to poor data quality. They amplify it. A model trained on inconsistent data learns the inconsistencies. A matching algorithm applied to duplicate-heavy data generates false positives at a rate that consumes more human time than it saves.
The investment in data quality remediation is modest compared to the cost of an AI implementation that underdelivers. A supplier master deduplication project that takes four weeks and prevents six months of suppressed AP automation performance is a clear-cut investment.
Every organisation I have worked with that completed the data quality work first reports that their AI implementations performed at or above the pilot automation rates in production. The ones that skipped it report a gap between pilot and production that they spend months trying to close, often without success.
Data quality is not a glamorous part of AI strategy. It is the part that determines whether the strategy delivers. Start here, and the rest is considerably more manageable.
For the broader context, see AI in Finance Strategy. For the specific connection to AI reconciliation applications, see LLMs in financial reconciliation. For what happens when you skip this and deploy AI into a broken finance function, see AI won’t fix a broken finance function.
Maebh Collins is a Fellow Chartered Accountant (FCA, ICAEW) with Big 4 training and twenty years of operational experience as a founder and senior finance leader. She writes about AI in finance transformation from the inside out.