
The First 90 Days of AI in Your Finance Function

9 March 2026

Most AI implementations in finance do not fail dramatically. They drift.

The go-live happens. There is a period of active attention: monitoring outputs, reviewing exceptions, tuning configuration. Then the attention moves to something else. The automation rate sits at 62% when the projection was 85%. Nobody makes a formal decision that 62% is acceptable. It just becomes the number. The team continues to run the manual process alongside the AI tool to catch what the tool misses. The tool becomes one of the systems the function has, rather than the process transformation it was meant to be.

The decisions made in the first 90 days are disproportionately important. This is when the pattern is set. A well-managed first 90 days builds the foundation for a tool that performs. A poorly managed first 90 days produces the slow drift toward underperformance that is the most common failure mode in enterprise AI.

This post covers what should happen in each phase of the first 90 days, what to watch for, and how to read the early signals.


Days 1 to 30: Foundations and baseline

The first month is not about the AI doing anything useful. That is a hard thing to communicate to a senior leadership team that approved the investment and wants to see results. But it is true, and managing expectations around it is part of the Finance Director’s job.

The work of days 1 to 30 is confirming that the foundations are actually in place, establishing the baseline you will measure against, and completing integration and configuration.

Confirming foundations means going back through every assumption made during vendor evaluation and verifying that it still holds in the live environment. Data quality issues that were supposedly resolved: are they actually resolved in live data, or were they partially remediated in a test set? Integrations that worked in testing: do they hold under live data volumes? Approval workflows that were redesigned: are people operating them as redesigned, or reverting to previous behaviour?

This verification step is often skipped because it feels like going backwards. It is not backwards. It is the step that determines whether the next 60 days produce results or produce queries that cannot be answered.

Establishing the baseline means measuring current process performance before AI touches any live transactions. Processing time per invoice. Exception rate. Error rate. Cost per transaction. Volume by document type and supplier. Without this baseline, the day 90 review becomes an argument about what performance would have been expected without AI. That argument is unwinnable because nobody measured it.
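If it helps to make the baseline concrete, the calculation is simple enough to sketch in a few lines. This is a minimal illustration, not a prescription: the field names (processing_minutes, is_exception, handling_cost, and so on) are hypothetical stand-ins for whatever your AP system actually exports.

```python
from collections import Counter
from statistics import mean

# Hypothetical export from the AP system: one record per invoice
# processed in the month before go-live. Field names are illustrative.
transactions = [
    {"invoice_id": "INV-001", "doc_type": "PO-backed", "processing_minutes": 11,
     "is_exception": False, "had_error": False, "handling_cost": 4.20},
    {"invoice_id": "INV-002", "doc_type": "non-PO", "processing_minutes": 27,
     "is_exception": True, "had_error": False, "handling_cost": 9.80},
    # ... the full month of live volume
]

n = len(transactions)
baseline = {
    "avg_processing_minutes": mean(t["processing_minutes"] for t in transactions),
    "exception_rate": sum(t["is_exception"] for t in transactions) / n,
    "error_rate": sum(t["had_error"] for t in transactions) / n,
    "cost_per_transaction": sum(t["handling_cost"] for t in transactions) / n,
    "volume_by_doc_type": Counter(t["doc_type"] for t in transactions),
}
print(baseline)  # snapshot this before the AI touches a single live transaction
```

The point is not the code. The point is that every number in the day 90 review should be computable from records captured before go-live.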

Red flags in this phase: data quality problems that were assumed to be resolved but were not. This manifests as validation failure rates higher than projected in the first live transactions. Integration failures on live data volumes after clean testing. This manifests as manual intervention required to move data between systems that should be connected. Team members who were not fully engaged with the change now working around the new process. This manifests as parallel manual processes continuing after go-live, with the AI tool used selectively rather than as the primary workflow.

Each of these is addressable. None of them are addressable if they are not identified in the first 30 days. The building an AI-ready finance function roadmap is the preparation that should have eliminated these risks. If the preparation was not complete, the first 30 days are where you discover that.


Days 31 to 60: First live processing and exception review

This is the phase where the real learning happens. The AI is now processing live transactions. The automation rate, the exception distribution, and the error pattern are becoming visible in real data.

The automation rate in live production will almost always be lower than in testing. This is not a failure. It is expected. Test environments use cleaned, curated data with known document types. Live environments have the full complexity of real-world transaction data: the supplier who changed their invoice template last month, the expense claim with an unreadable receipt, the invoice where the PO number was entered incorrectly in the system and the three-way match fails.

The question is not why the automation rate is lower than in testing. The questions that matter are what it actually is, and what is causing the gap.

Categorise every exception. This is the most important discipline of days 31 to 60. Every transaction that routes to the human review queue should be categorised by failure reason. Four categories cover the majority.

Wrong because of data quality. The extraction was correct, but the validation failed because the vendor master has an inactive record, the PO number field has inconsistent formatting, or the approved supplier list has not been updated. These exceptions tell you the data remediation was incomplete.

Wrong because of process design. The exception is legitimate: it is a transaction type or edge case that the process design did not anticipate. The routing rules need to be updated. These exceptions tell you where the process design has gaps.

Wrong because the model needs tuning. The extraction was incorrect, or the classification was incorrect, in a way that is systematic: the same document type is consistently mishandled. These exceptions tell you where the model configuration needs adjustment.

Wrong because it was a judgment call. The transaction is ambiguous, unusual, or requires context that the AI tool does not have access to. These exceptions are expected. They should stay in the human queue. They tell you the human review process is working.

The distribution of that categorisation is what tells you where to focus. If 60% of exceptions are in the data quality category, the data remediation was not complete and that is where the work must go. If most exceptions are model configuration issues, the vendor needs to be engaged to tune the model on your document population. If exceptions are concentrated in edge cases and judgment calls, the tool is performing as expected and the automation rate will improve as the model trains on your live data.
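A simple weekly tally is enough to keep the categorisation discipline honest. A minimal sketch, assuming each resolved exception is tagged with one of the four reasons above (the tag values are illustrative):

```python
from collections import Counter

# Hypothetical review-queue export: one category tag per exception,
# assigned by the reviewer at the point of resolution.
exceptions = ["data_quality", "data_quality", "model_tuning",
              "process_design", "judgment_call", "data_quality"]

counts = Counter(exceptions)
total = len(exceptions)
for category, count in counts.most_common():
    print(f"{category}: {count} ({count / total:.0%})")
# If data_quality dominates, remediation is the priority; if judgment_call
# dominates, the human review process is doing exactly what it should.
```

Run it weekly and keep the history. The trend across weeks is as informative as any single snapshot.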


Days 61 to 90: Tuning and the first review

By day 60, you have enough live production data to make informed decisions. The day 90 review should be formally scheduled and should cover specific, measurable outcomes.

Actual automation rate versus pilot projection. The number, not a narrative explanation of why the number is what it is. If the projection was 80% and the actual rate is 68%, that is a 12 percentage-point shortfall, or 15% below projection in relative terms. Document it, understand it, and have a specific plan for closing it.
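Stating the gap both ways, in percentage points and relative to the projection, removes any ambiguity in the review. A two-line check using the figures above:

```python
projected, actual = 0.80, 0.68
print(f"Shortfall: {(projected - actual) * 100:.0f} percentage points")          # 12
print(f"Relative gap: {(projected - actual) / projected:.0%} below projection")  # 15%
```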

Exception category distribution. What does the 90-day exception analysis tell you about the underlying problems? The data quality category should be shrinking week-on-week as remediation work is completed. Model configuration issues should be addressed by the vendor within the first 60 days. Fixes for process design gaps should be designed and deployed by day 90. What remains in the exception queue at day 90 should be dominated by genuine edge cases.

Team adoption and behaviour change. Is the team using the tool as the primary workflow, or running it alongside manual processes? A tool that is used only selectively will never achieve its projected automation rate because the transactions where the team chooses manual processing are systematically the ones where the tool would produce exceptions. Selective use produces a misleadingly high rate on a low proportion of volume.
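A worked illustration of how selective use flatters the number (the volumes are hypothetical):

```python
total_volume = 1000       # all invoices in the period
routed_to_tool = 400      # the team sends only the "easy" ones
automated = 360           # the tool automates 90% of what it sees

headline_rate = automated / routed_to_tool  # 0.90 -- looks like success
true_rate = automated / total_volume        # 0.36 -- the real picture
print(f"Headline: {headline_rate:.0%}  True end-to-end: {true_rate:.0%}")
```

A 90% headline rate on 40% of volume is a 36% automation rate. Measure against total volume, not against what the tool was given.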

Cost against plan. Implementation cost, configuration cost, integration cost, the time cost of the exception handling workflow and the day 90 review itself. Is the actual cost-to-date tracking against the business case projection?

What you would do differently. An honest retrospective on the implementation process. This is not about blame. It is about building institutional knowledge before the next implementation.


Signs the implementation is working

Automation rate at or above 70% of the pilot projection within the first 90 days. If the projection was 85%, a 90-day rate of 60% or above is acceptable given the ramp. Anything below 60% of projection needs specific explanation and a recovery plan.

Exception categories concentrated in edge cases and judgment calls rather than data quality or model configuration errors. This means the foundational work was done properly and the tool is handling the routine population correctly.

Team actively using the tool rather than working around it. The parallel manual process, if it was run during go-live as a safety net, has been discontinued or reduced to spot-check frequency. The team is querying exceptions through the tool’s review interface rather than outside it.

Time savings visible in the process metrics. Not projected savings. Actual savings, measured against the baseline established in days 1 to 30. Month-end close is faster. Invoice processing backlog has reduced. The team is spending less time on routine data entry.

Vendor engagement is substantive. The vendor is reviewing your exception data, proposing model configuration changes based on your document population, and delivering on commitments made during implementation.


Signs the project is in trouble

Automation rate significantly below projection with no clear diagnosis. This is the warning sign that the categorisation discipline in days 31 to 60 was not applied. If you do not know which category is driving your exceptions, you cannot fix the right thing.

Most exceptions in the data quality category at day 60. This means the data remediation work that should have happened before go-live was not complete. It is not a fatal problem, but it requires immediate resource allocation to the remediation work and honest communication about the timeline impact on projected benefits.

Team reverting to manual processes alongside the AI tool. The most common manifestation is informal: team members processing invoices manually and then entering the result into the AI tool’s interface to get the audit trail, rather than using the tool to process the invoice. This produces a paper automation rate that does not reflect actual processing. It also means the tool is adding work rather than reducing it, which is unsustainable.

Vendor blame-shifting when problems arise. This is common and it is a significant problem.

A vendor who immediately points to your data quality, your integration complexity, or your team’s change management as the reason for underperformance did not do sufficient due diligence before go-live. A professional vendor reviews your data quality before committing to automation rate projections. They identify change management risks before signing the contract. A vendor who discovers these issues after go-live and uses them to explain away underperformance did not do the work they should have done. That is a shared failure at minimum. Do not accept the framing that your organisation’s imperfections are solely responsible for underperformance that competent due diligence should have anticipated. The evaluating AI vendors framework is designed in part to filter for vendors who do this work properly. When vendor behaviour in the first 90 days does not match what was represented in the sales cycle, document it and address it directly with their account team.


The governance layer in the first 90 days

The AI governance framework should be operational from day one of live processing, not built during the first 90 days. If you are building governance in parallel with running live processing, you are operating an uncontrolled AI process. That is a risk management failure, not merely a schedule delay.

What the first 90 days adds to governance is calibration. Human review thresholds set before go-live on projected performance may need adjustment based on actual performance. The model change management protocol should be tested with at least one vendor update, to confirm that the notification, validation, and rollback procedures work as designed.
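One way to make the validation step concrete is a fixed regression set: a sample of your own documents with known correct values, re-run after every vendor model update and compared against the pre-update pass rate. The sketch below assumes a generic extraction call; the function names, field names, and tolerance are all illustrative rather than taken from any particular vendor's API.

```python
# Hypothetical regression check, run after each vendor model update.
# `extract` stands in for whatever extraction call your tool exposes.

REGRESSION_SET = [
    ("invoices/doc_001.pdf", {"supplier": "Acme Ltd", "total": "1250.00"}),
    ("invoices/doc_002.pdf", {"supplier": "Beta GmbH", "total": "310.50"}),
    # ... a fixed, representative sample of your document population
]

def pass_rate(extract) -> float:
    """Fraction of regression documents extracted exactly as expected."""
    passed = sum(extract(path) == expected for path, expected in REGRESSION_SET)
    return passed / len(REGRESSION_SET)

def validate_update(extract, pre_update_rate: float, tolerance: float = 0.02) -> bool:
    """Flag a rollback if the update degrades accuracy beyond tolerance."""
    post = pass_rate(extract)
    if post < pre_update_rate - tolerance:
        print(f"FAIL: {post:.0%} vs baseline {pre_update_rate:.0%} -- invoke rollback")
        return False
    print(f"OK: {post:.0%} within tolerance of baseline {pre_update_rate:.0%}")
    return True
```

If the rollback path has never been exercised, assume it does not work. The first vendor update is the cheapest opportunity to find out.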

At day 90, update the governance artefacts to reflect what you have learned: actual thresholds, actual error rates, actual exception categories. This is the version presented to external auditors. It should reflect reality, not the implementation planning assumptions.

The change management in AI finance post covers the people dimension in more depth. Technology implementations succeed or fail at the behavioural layer. The first 90 days is when those patterns are set. Managing them actively is not optional.


Maebh Collins is a Fellow Chartered Accountant (FCA, ICAEW) with Big 4 training and twenty years of operational experience as a founder and senior finance leader. She writes about AI in finance transformation from the inside out.
