I Gave an LLM Our Variance Pack. Here Is What It Got Right.
Last October, I ran an experiment that changed how I think about variance analysis. I took our monthly actuals pack, anonymised it (replacing all entity names with generic labels, masking account numbers, and stripping any client or employer-identifiable information), and fed it into an LLM with a structured prompt. I asked it to produce the same variance commentary I normally spend three hours drafting after every month-end close.
The result was not what I expected. It was not a disaster, and it was not a replacement for my work. It was something more interesting: a draft that was about 70% useful and 30% confidently wrong, in ways that taught me more about how LLMs process financial data than any research paper could.
This is a detailed account of that experiment, what the model got right, what it got wrong, and the hybrid workflow I have been using since.
What I Fed the Model
The input was a standard monthly actuals pack for a mid-size business unit: three months of actuals, with budget and prior-year comparatives for the current month. Revenue by product line, COGS by category, operating expenses by cost centre, and headcount actuals versus plan. Nothing exotic. This is the same data set that sits in every FP&A analyst’s variance workbook.
I structured the input as a markdown table rather than a CSV dump. Early testing had shown me that LLMs handle tabular data more reliably when it is laid out as a markdown table than as raw comma-separated values. The model seems to “read” the column headers more accurately when they are visually aligned.
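For anyone who wants to reproduce the data prep, here is a minimal sketch of that conversion using pandas. The file name and column headers are placeholders, not the real pack structure, and the variance columns are pre-computed here because, as I describe below, letting the model do its own arithmetic turned out to be a mistake.

```python
# A minimal sketch of the CSV-to-markdown conversion, assuming the actuals
# already sit in a flat CSV extract. File name and column headers are
# placeholders, not the actual pack structure.
import pandas as pd

actuals = pd.read_csv("october_actuals.csv")

# Pre-compute the variance columns so the model never has to do arithmetic.
actuals["Var to Budget"] = actuals["Actual"] - actuals["Budget"]
actuals["Var to Budget %"] = (actuals["Var to Budget"] / actuals["Budget"] * 100).round(1)

# DataFrame.to_markdown (which needs the tabulate package installed) emits the
# pipe-delimited table that goes straight into the prompt.
print(actuals.to_markdown(index=False))
```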
I also provided context that a new analyst joining the team would get: the business is a B2B SaaS company, revenue is subscription-based with a professional services component, and the primary cost drivers are headcount and cloud infrastructure. Without this context, the model would produce generic commentary that could apply to any business. With it, the commentary started sounding like it came from someone who understood the company.
What the LLM Got Right
Pattern recognition across line items. The model identified that revenue was 4.2% above budget while professional services was 11% below, and connected this to a broader pattern: the product mix was shifting toward self-serve implementation. I had planned to make exactly this observation in my commentary. The model got there in three seconds.
Consistency of format. Every variance commentary I have ever written follows the same structure: state the variance, explain the driver, note whether it is a timing difference or a run-rate change, and flag any action required. I specified this format in the prompt, and the model followed it perfectly for every line item. No formatting drift, no missing sections, no inconsistency between how it described a favourable variance and an adverse one. This is something that human analysts (including me) routinely get wrong when we are tired at the end of a long close.
Speed. The full commentary, covering 14 line items with variance explanations, took about 40 seconds to generate. My typical time for the same output is two and a half to three hours. Even after the editing and validation I describe below, the total time dropped to about one hour. That is a genuine productivity gain.
Trend identification. I gave the model three months of data specifically to test whether it could spot trends. It identified that cloud infrastructure costs had increased sequentially for all three months and were now 8% above the Q3 run-rate. It flagged this as a potential reforecast item. Correct observation. Correct implication. I would have made the same call.
What the LLM Got Wrong
Hallucinated numbers. The model stated that marketing spend was “₹12.4L above budget, primarily driven by the product launch campaign in the third week of October.” Two problems. The marketing variance was ₹8.7L, not ₹12.4L. And I had not provided any information about a product launch campaign. The model fabricated a plausible-sounding explanation for a number it also fabricated. This is the failure mode that makes AI-generated financial commentary dangerous without validation. The text reads well. The numbers are wrong.
Missed context that requires institutional knowledge. The G&A line showed a ₹6L favourable variance. The model attributed it to “lower-than-expected administrative costs, likely reflecting timing of annual license renewals.” The actual reason was that we had delayed a consulting engagement to Q4 because the project sponsor changed roles. No model could know this. The explanation requires knowledge that lives in email threads and meeting conversations, not in the actuals pack.
Over-confident causal explanations. For every single variance, the model provided a causal explanation. Not once did it say “the driver of this variance is unclear and requires investigation.” In my experience, at least two or three line items in every monthly pack have variances where the honest answer is “I need to check with the business.” The model’s inability to express uncertainty is a structural limitation, not a fixable prompt issue. LLMs are trained to produce confident text. Variance analysis requires the analyst to distinguish between “I know why” and “I need to find out why.”
Calculation errors. When I asked the model to compute year-over-year growth rates from the data I provided, it got two out of seven calculations wrong. Not dramatically wrong. Off by a percentage point in both cases, likely from rounding errors in intermediate steps. But in financial reporting, approximately right is wrong. A variance commentary that says “14% year-over-year growth” when the actual figure is 15% undermines credibility with anyone who checks the math.
The Prompt Structure That Worked
After several iterations, I settled on a prompt structure with four distinct sections.
Section 1: Role and context. “You are a senior FP&A analyst reviewing the October monthly actuals for [business description]. Your commentary will be read by the CFO and the business unit VP.”
Section 2: The data. The full actuals table in markdown format, with clear column headers for Budget, Prior Year, Actual, Variance to Budget (absolute and percentage), and Variance to Prior Year.
Section 3: Output format specification. “For each line item with a variance greater than 5% or ₹2L, provide: (1) the variance amount and direction, (2) the primary driver if identifiable from the data, (3) whether this appears to be a timing difference or a run-rate change, (4) any recommended action or follow-up. If the driver is not identifiable from the data provided, state that explicitly.”
Section 4: Constraints. “Do not fabricate explanations. Do not invent data points not present in the input. If you cannot determine the driver of a variance from the data provided, say so. Do not perform calculations. Use only the numbers provided in the table.”
Section 4 was the breakthrough. Adding “do not perform calculations” eliminated the hallucinated numbers almost entirely. The model stopped trying to derive figures and instead quoted directly from the input table. Adding “if you cannot determine the driver, say so” did not fully fix the over-confidence problem, but it reduced it noticeably. The model began flagging about half of the genuinely ambiguous variances instead of inventing explanations for all of them.
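For anyone who wants to adapt the structure, here is a minimal sketch of how the four sections can be assembled into a single prompt string. The wording is lifted from the sections above; the function and variable names are mine and not tied to any particular LLM API.

```python
def build_variance_prompt(business_context: str, actuals_markdown: str) -> str:
    """Assemble the four-section variance-commentary prompt."""
    # Section 1: role and context.
    role = (
        "You are a senior FP&A analyst reviewing the October monthly actuals "
        f"for {business_context}. Your commentary will be read by the CFO and "
        "the business unit VP."
    )
    # Section 3: output format specification.
    output_format = (
        "For each line item with a variance greater than 5% or ₹2L, provide: "
        "(1) the variance amount and direction, (2) the primary driver if "
        "identifiable from the data, (3) whether this appears to be a timing "
        "difference or a run-rate change, (4) any recommended action or "
        "follow-up. If the driver is not identifiable from the data provided, "
        "state that explicitly."
    )
    # Section 4: constraints.
    constraints = (
        "Do not fabricate explanations. Do not invent data points not present "
        "in the input. If you cannot determine the driver of a variance from "
        "the data provided, say so. Do not perform calculations. Use only the "
        "numbers provided in the table."
    )
    # Section 2 (the data) slots in between role and format, as described above.
    return "\n\n".join([role, actuals_markdown, output_format, constraints])
```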
The Hybrid Workflow I Use Now
I do not use the LLM to produce final variance commentary. I use it to produce a first draft that I then edit. The workflow runs in four steps.
Step 1: Data preparation. I format the actuals pack as a markdown table with all the columns the model needs. This takes about ten minutes and I have templated it.
Step 2: LLM draft generation. I run the structured prompt and get the initial commentary. About 40 seconds.
Step 3: Validation pass. I go through the draft line by line. I check every number against the source data. I replace fabricated explanations with the actual drivers (which I know from business context, conversations with budget owners, and prior months). I flag any line item where neither I nor the model can explain the variance, and I mark those for follow-up. This takes about 30 to 40 minutes. A rough sketch of how the number check can be scripted follows the step list.
Step 4: Business context layer. I add the narrative that connects individual variances to the broader business story. The model can tell you that professional services revenue is below budget. It cannot tell you that this is because the implementation team has been pulled into a strategic account that will generate ₹40L of incremental ARR in Q1. That layer of interpretation is where the FP&A analyst adds value that no model can replicate.
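Part of the Step 3 number check can be automated. The sketch below rests on my own assumptions about the formats involved: it pulls every ₹-lakh amount and percentage out of the draft and flags any figure that never appears in the source table. It would catch a fabricated number like the ₹12.4L marketing example above; it cannot catch a correct number attached to the wrong explanation, so the line-by-line read stays in the workflow.

```python
import re

def figures_not_in_source(draft: str, source_markdown: str) -> list[str]:
    """Return figures quoted in the LLM draft that never appear in the source table.

    Catches hallucinated amounts (e.g. '₹12.4L' when the source only contains
    '₹8.7L'). It does not catch a correct number paired with a wrong driver.
    """
    # Match ₹-lakh amounts (e.g. ₹8.7L) and percentages (e.g. 4.2%).
    pattern = re.compile(r"₹\s?\d+(?:\.\d+)?L|\d+(?:\.\d+)?%")
    draft_figures = {m.replace(" ", "") for m in pattern.findall(draft)}
    source_figures = {m.replace(" ", "") for m in pattern.findall(source_markdown)}
    return sorted(draft_figures - source_figures)
```

Anything the function returns goes straight onto the follow-up list before the draft moves to Step 4.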
The total time is roughly one hour, down from three. The quality is at least as high as before because the LLM catches formatting inconsistencies and pattern-level observations that I sometimes miss when I am writing the commentary manually at the end of a long close week.
What This Means for FP&A Teams
I want to be direct about what this is and what it is not.
This is not a story about AI replacing the variance analyst. The model cannot do the job. It cannot explain why G&A was favourable (because it does not know about the delayed consulting engagement). It cannot connect the professional services miss to the strategic account decision. It cannot tell the CFO whether the cloud infrastructure trend is a genuine cost escalation or a timing artefact from a migration project.
What the model can do is eliminate the mechanical parts of the work: scanning 14 line items, writing up variances (once you pre-calculate them and provide them in the input), formatting commentary consistently, and identifying patterns across periods. These tasks take time. They are important. They are also not where the analyst’s judgement matters most.
The net effect is that the analyst spends less time on formatting and first-draft writing and more time on the work that actually changes decisions: understanding why the numbers are what they are, connecting the actuals to the forecast, and telling the story that helps the leadership team allocate resources differently.
If your FP&A team is spending three or four hours per month writing variance commentary that follows a consistent format and pulls from the same data structure, an LLM-assisted workflow can cut that time significantly. But only if you build the validation layer into the process. Skipping validation is how you end up presenting hallucinated numbers to the CFO, and that conversation goes exactly as badly as you would expect.
Where I Am Taking This Next
The variance commentary experiment was the starting point. I am now testing similar workflows for forecast narrative generation (where the model drafts the commentary around the rolling forecast update) and for anomaly detection in transactional data (where the model flags patterns that might indicate coding errors or unusual activity).
The common thread is the same: the model handles the mechanical pattern-matching layer, and the human provides the business context and validation. Neither layer works well alone. Both together produce output that is faster and more consistent than either could achieve independently.
I write more about the specific prompting techniques in Prompt Engineering for Financial Data: What Actually Works, and about how AI fits into forecasting workflows in AI-Powered Forecasting: Where Models Beat Gut and Where They Do Not.
If you are experimenting with LLMs in your FP&A process, or if you are trying to figure out where to start, I would genuinely love to hear what you are finding. The practice is moving fast, and the best insights I have gotten so far have come from comparing notes with other finance professionals who are building with these tools. Let’s connect.