AI Governance for Finance

I started building an AI governance framework for my finance workflows about eight months ago. Not because anyone asked me to. Because I was using AI tools daily for variance commentary, forecast narrative drafts, and data pattern recognition, and I realised that I had no documented answer to a simple question: if the external auditor asks how I validate AI-generated analysis that touches financial data, what do I show them?

That question kept me up one night. By the next morning I had the outline of a framework. Eight months later, I can tell you exactly how it works, where it has been tested, and what I would change if I were starting over.

The reason to build governance before deployment (not after) is straightforward. Once an AI tool is producing outputs that flow into the actuals pack, the forecast, or the board deck, you have already created a dependency. Retrofitting governance onto a tool people rely on daily is harder than building it in from the start because you are fighting inertia and habit simultaneously. I have seen this with ERP implementations and I have seen it with BI tools. The pattern is the same: the tool gets adopted, everyone gets comfortable, and then governance becomes a cleanup project instead of a design decision.


The Validation Protocol

Every AI output that touches financial data in my workflow goes through a three-layer validation before it reaches anyone outside the finance team. No exceptions. I established this rule after an early incident where an LLM-generated variance commentary misstated a cost centre variance by three lakhs. The number looked plausible. The narrative around it was convincing. I caught it during my review, but the experience made the risk concrete in a way that theoretical concerns never do.

Layer 1: Numerical verification. Every quantitative claim in the AI output gets checked against the source data. Variance amounts, percentages, growth rates, and any derived figures. I do not trust the model to quote numbers accurately, even when those numbers appear in the input I provided. I wrote about this failure mode in detail in Prompt Engineering for Financial Data: What Actually Works. The short version is that LLMs optimise for plausible text, not numerical precision, and a plausible number is not always the correct number.

Layer 2: Causal verification. Every explanatory claim gets assessed against business context. If the model says marketing spend increased because of a campaign launch, I verify that a campaign actually launched. If the model attributes a favourable G&A variance to timing of licence renewals, I check whether that is the actual driver or whether the model invented a plausible-sounding explanation. About 30% of the causal explanations in a typical AI-generated variance commentary are fabricated. They read well. They are wrong.

Layer 3: Completeness check. Did the model miss any material variance? Did it flag the right items for follow-up? Did it correctly distinguish between timing differences and run-rate changes? Errors of omission are harder to catch than errors of commission because you are looking for something that is not there.

This three-layer protocol adds about 30 to 40 minutes to my process. The total time (including data preparation and AI generation) is still significantly less than writing commentary from scratch. But the validation is not optional. It is the control that makes the entire workflow defensible.
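Layer 1 lends itself to partial automation. The sketch below (a minimal illustration, not my production tooling; the function shape, tolerance, and regex are assumptions) flags any number quoted in AI commentary that does not match a figure in the source actuals:

```python
import re

def verify_numbers(ai_text: str, source_figures: dict[str, float],
                   tolerance: float = 0.005) -> list[str]:
    """Layer 1 sketch: flag quantitative claims in AI-generated commentary
    that do not match any figure in the source data (illustrative only)."""
    # Pull candidate numbers out of the commentary (handles 1,234.5 and 12.3)
    quoted = [float(m.replace(",", ""))
              for m in re.findall(r"\d[\d,]*\.?\d*", ai_text)]
    issues = []
    for value in quoted:
        # A quoted number passes if it is within tolerance of some source figure
        if not any(abs(value - actual) <= tolerance * max(abs(actual), 1.0)
                   for actual in source_figures.values()):
            issues.append(f"unverified figure: {value}")
    return issues
```

A check like this narrows the search; it does not replace the manual review, and Layers 2 and 3 remain judgement calls.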


Audit Trail Requirements

An external auditor will follow the trail from the financial statements back through every adjustment, judgement, and data transformation to the source document. When AI enters that chain, the trail needs to remain unbroken and interpretable.

I log every AI-assisted output with five pieces of information.

What tool produced it. The specific AI model and version. Not just “an AI tool” but which tool, which version, and which configuration (including the prompt template, if applicable). Model behaviour changes between versions, and an auditor needs to know which version was in use when a specific output was produced.

What input went in. The exact data and context provided to the model. For variance commentary, this is the actuals table and the structured prompt. For forecast narratives, this is the forecast data and the analytical context. I save a snapshot of the input for every production run because the model’s output is only as reliable as the data it received.

What output came out. The raw, unedited model output before any human modification. This matters because the auditor needs to see what the model actually said, not just the final version that went into the business review pack.

What changed during validation. Every edit I make during the three-layer validation gets tracked. Numbers corrected, explanations replaced, items added, items removed. This creates a clear record of where the AI was right and where human judgement overrode the model.

Who approved the final output. A named individual (in my current workflow, that is me) signs off on the validated output before it goes into any deliverable. “The AI generated it” is not an acceptable attribution. “I reviewed and approved it” is.

This documentation takes about five minutes per output. I store it in a structured log alongside the deliverable. It is the single most valuable artefact for audit readiness because it demonstrates that AI is a tool in the workflow, not a decision-maker.
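The five fields above map naturally onto a structured record. This is a hedged sketch of what one log entry might look like (the field names, tool string, file path, and the reviewer name "A. Sharma" are all hypothetical placeholders, not details from my actual log):

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AIOutputLogEntry:
    """One structured record per AI-assisted output (field names illustrative)."""
    tool: str                  # specific model, version, and prompt template
    input_snapshot: str        # path to the saved input data and prompt
    raw_output: str            # unedited model output, before human changes
    validation_edits: list[str] = field(default_factory=list)  # tracked changes
    approved_by: str = ""      # named individual who signed off
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

entry = AIOutputLogEntry(
    tool="llm-vendor-model-v4 / variance-commentary-template-v3",
    input_snapshot="logs/2024-11/actuals_input.json",
    raw_output="Marketing spend increased 12% driven by ...",
    validation_edits=["corrected G&A variance figure against source actuals"],
    approved_by="A. Sharma",
)
print(json.dumps(asdict(entry), indent=2))
```

Serialising each entry to JSON alongside the deliverable keeps the log machine-searchable when an auditor asks whether validation was performed consistently.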


Data Classification: What Goes Into AI Prompts

This is the governance area most finance professionals skip, and it is the one that creates the most risk. Not every piece of financial data should go into an AI prompt. The classification framework I use has three tiers.

Tier 1: Unrestricted. Aggregated data, publicly available information, and anonymised benchmarks. This data can go into any AI tool, including cloud-hosted models, without additional review. Examples: industry-level financial ratios, aggregated sector data, published accounting standards guidance.

Tier 2: Internal only. Company financial data that is aggregated at a level where individual transactions, customers, or employees are not identifiable. Monthly actuals summaries, cost centre totals, revenue by product line. This data can go into AI tools but only tools that meet the organisation’s data processing standards. I verify that the tool’s privacy policy does not allow training on user inputs and that the data processing agreement covers confidentiality. For the tools I use daily, I reviewed these agreements once and documented the findings.

Tier 3: Restricted. Transaction-level data, customer-identifiable information, employee compensation data, unpublished financial results, and anything that falls under insider trading regulations. This data does not go into external AI tools. Full stop. If I need AI assistance with restricted data, I use a locally hosted model or I anonymise the data before it enters any external system.

The classification is not complicated. It mirrors the data classification framework that most organisations already apply to email and file sharing. The difference is that many finance professionals who would never email a customer list to an external vendor will paste that same data into an AI chat interface without a second thought. The governance framework makes the standard explicit and consistent across all channels, including AI tools.
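The tier gate can be made mechanical. A minimal sketch, assuming a simple mapping from tool to the highest tier it is cleared for (the tool names and clearances here are invented for illustration):

```python
from enum import IntEnum

class DataTier(IntEnum):
    UNRESTRICTED = 1   # aggregated, public, anonymised benchmarks
    INTERNAL = 2       # company data, no identifiable transactions or people
    RESTRICTED = 3     # transaction-level, PII, unpublished results

# Highest tier each tool is cleared for (names and clearances illustrative)
TOOL_CLEARANCE = {
    "cloud_llm_no_dpa": DataTier.UNRESTRICTED,
    "cloud_llm_with_dpa": DataTier.INTERNAL,
    "local_model": DataTier.RESTRICTED,
}

def may_send(tool: str, data_tier: DataTier) -> bool:
    """Block any prompt whose data tier exceeds the tool's clearance.
    Unknown tools default to the most restrictive clearance."""
    return data_tier <= TOOL_CLEARANCE.get(tool, DataTier.UNRESTRICTED)
```

Defaulting unknown tools to Tier 1 clearance mirrors the principle in the text: the standard is explicit and the safe answer is the default.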


Model Risk Documentation

Every AI tool in my workflow has a one-page risk profile. Not a 30-page assessment. One page with five sections.

Purpose. What the tool does and why it is in the workflow. “Generates first-draft variance commentary from structured actuals data to reduce manual drafting time.”

Known limitations. Where the tool fails or produces unreliable output. “Fabricates causal explanations approximately 30% of the time. Misquotes numbers from input data approximately 15% of the time. Cannot account for business context not provided in the prompt.”

Controls. What validation and review steps are in place to catch failures. “Three-layer validation protocol. Numerical verification, causal verification, completeness check. All outputs reviewed and approved by a named finance team member before external distribution.”

Performance tracking. How I measure whether the tool is working as expected over time. I track error rates by type (numerical errors, fabricated explanations, omissions) on a monthly basis. If the error rate on any category increases by more than 10 percentage points from the trailing three-month average, the tool goes under review.

Escalation path. What happens when the tool fails in a material way. If an AI-generated output contains an error that reaches a deliverable before being caught, I document the failure, update the limitations section, and assess whether the control framework needs to change.

One page. Five sections. Updated quarterly or when a material failure occurs. This document answers the question that every auditor and every risk committee will eventually ask: “You are using AI in your finance workflows. What could go wrong, and how do you manage that risk?”
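The performance-tracking trigger described above is simple enough to express directly. A sketch, assuming monthly error rates are recorded as percentages (the function shape is illustrative, not my actual tooling):

```python
def needs_review(monthly_error_rates: list[float],
                 threshold_pp: float = 10.0) -> bool:
    """Flag a tool for review if the latest monthly error rate (in %) exceeds
    the trailing three-month average by more than threshold_pp points."""
    if len(monthly_error_rates) < 4:
        return False  # not enough history for a trailing three-month average
    latest = monthly_error_rates[-1]
    trailing_avg = sum(monthly_error_rates[-4:-1]) / 3
    return latest - trailing_avg > threshold_pp

# e.g. a fabrication rate climbing from ~30% to 45% trips the review
print(needs_review([29.0, 31.0, 30.0, 45.0]))
```

Running this per error category (numerical errors, fabricated explanations, omissions) keeps a drift in one failure mode from being masked by stability in the others.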


The Questions Your External Auditor Will Ask

I have now been through one audit cycle with AI tools in my workflow. The questions the auditors asked were not exotic. They were exactly the questions you would expect from professionals trained to evaluate internal controls.

“What AI tools are used in the financial reporting process?” They wanted a complete inventory. Not just the tools I use for analysis, but any AI capability embedded in the systems that process financial data. The AP automation tool’s anomaly detection model counts. The forecasting platform’s ML-based trend analysis counts. The list was longer than I initially thought, and building it was a useful exercise on its own.

“How do you validate AI-generated outputs?” This is where the three-layer validation protocol and the audit trail documentation proved their worth. I could show them the process, the evidence of review, and the tracked changes between raw AI output and final deliverable. Their follow-up question was about whether the validation is performed consistently. The structured log answered that.

“What happens when the AI tool is wrong?” They wanted to see evidence that errors had been caught, documented, and addressed. I showed them two examples from the error log: one where the model misstated a variance figure and one where it fabricated a causal explanation. In both cases, the error was caught during validation, corrected, and documented. The auditors were satisfied that the control was operating effectively.

“Who is accountable for AI-assisted outputs?” The answer was clear: the same person who would be accountable if the output had been produced manually. AI does not change the accountability chain. It changes the toolset. The person who reviews and approves the output owns it, regardless of how the first draft was produced.

“How do you manage data confidentiality when using AI tools?” The data classification framework answered this. I could show them which data tiers were permitted for which tools, the data processing agreements I had reviewed, and the policy that restricted data never enters external AI systems.

None of these questions required deep technical knowledge about how AI models work. They required evidence of governance, validation, and accountability. The same standards that apply to spreadsheet models and manual processes apply to AI-assisted processes. The framework just makes that explicit.


The Approval Workflow for New AI Tools

When I evaluate a new AI tool for my finance workflow, it goes through a four-step approval process before it touches any production data.

Step 1: Use case definition. What specific task will this tool perform? What data will it need? What output will it produce? If I cannot define the use case precisely, the tool does not move forward. “It would be useful for general analysis” is not a use case. “It will generate first-draft forecast commentary from the rolling forecast data, reducing manual drafting time from two hours to 30 minutes” is a use case.

Step 2: Data classification review. Based on the use case, what tier of data will the tool need? If it requires Tier 3 (restricted) data, does the tool meet the standards for restricted data processing? For cloud-hosted tools, this means reviewing the vendor’s data processing agreement, data residency policies, and training data practices. If the vendor’s terms allow training on user inputs, the tool is disqualified for anything above Tier 1 data.

Step 3: Pilot with validation. The tool runs in parallel with the existing process for at least one full cycle (typically one month-end close or one forecast cycle). I compare the AI-assisted output against the manually produced output and track accuracy, error types, and time savings. If the tool does not measurably improve the process, it does not proceed.

Step 4: Risk profile and control design. If the pilot succeeds, I create the one-page risk profile and define the validation controls before the tool enters production use. The controls are in place before the first production output, not added later.

This process takes about four to six weeks for a new tool. That timeline is not fast, but it means that every AI tool in my workflow has a documented purpose, defined controls, and a proven track record before it produces any output that reaches a stakeholder.
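Step 3 generates the evidence that decides whether a tool proceeds. A minimal sketch of how the parallel-run results might be aggregated (the function shape and metric names are assumptions for illustration):

```python
from collections import Counter

def summarise_pilot(errors_by_run: list[list[str]],
                    manual_minutes: float, ai_minutes: float) -> dict:
    """Step 3 sketch: aggregate error types across parallel-run cycles and
    compute time saved per cycle (illustrative, not a fixed API)."""
    error_counts = Counter(e for run in errors_by_run for e in run)
    return {
        "runs": len(errors_by_run),
        "errors_by_type": dict(error_counts),
        "minutes_saved_per_run": manual_minutes - ai_minutes,
    }
```

The output feeds directly into the Step 4 risk profile: the observed error types become the "Known limitations" section, and the time saved justifies the "Purpose".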


What I Would Change

If I were building this framework from scratch today, I would change two things.

First, I would start the tool inventory earlier. I underestimated how many AI capabilities were already embedded in the systems I use daily. The AP automation platform, the BI tool’s natural language query feature, the forecasting software’s pattern detection. Building the inventory retrospectively was harder than it needed to be because I had to trace capabilities through product documentation and vendor conversations. Starting with the inventory before building the governance framework would have given me a more complete picture from day one.

Second, I would involve the internal audit team earlier in the process. I built the framework and then presented it. They had useful feedback (particularly on the audit trail documentation and the performance tracking metrics) that would have been more efficient to incorporate during design rather than after. Internal audit teams think about controls for a living. Their input on an AI governance framework is genuinely valuable, and getting their buy-in early also smooths the path when the external auditors arrive.


I continue to refine this framework as AI tools evolve and as I learn from each audit cycle. The core principle has not changed: AI is a tool in the finance workflow, not a substitute for professional judgement. Governance makes that distinction explicit, documented, and auditable.

If you are building AI governance for your finance function, or if you are preparing for the audit questions that are coming (and they are coming), I would love to compare approaches. The best governance frameworks I have seen are built by practitioners who are actually using the tools daily, not by policy teams working from theory. Let’s connect.

Series Insight

Part of my series on AI in Finance

How AI and machine learning are reshaping FP&A, audit, and financial reporting. Practical frameworks for finance professionals working through automation, LLMs, and data-driven decision making.

