Prompt Engineering for Financial Data: What Actually Works

Most prompt engineering advice on the internet is written for people building chatbots or generating marketing copy. Almost none of it translates directly to financial data. I learned this the hard way after spending two weeks trying to get an LLM to produce usable variance commentary using generic prompting techniques. The results ranged from vaguely plausible to dangerously wrong.

The problem is not that LLMs are bad at financial analysis. The problem is that financial data has properties that generic prompting does not account for: precision requirements measured in basis points, hierarchical structures (GL codes map to cost centres that map to business units that map to segments), and a professional audience that will catch any number that is off by even a small amount. Getting useful output from an LLM on financial data requires prompting techniques built specifically for these constraints.

This is what I have learned from six months of daily experimentation with LLMs on real FP&A data.

One critical note before we begin: every technique in this article assumes you are working with anonymised data. I never paste raw client data, employer-identifiable financials, or confidential information into any external LLM. I replace entity names with generic labels, mask account numbers, and strip anything that could identify the company or its counterparties before the data enters any prompt. Confidentiality is non-negotiable as a CA, and I wrote about the full governance framework in Building an AI Governance Framework.
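The masking step is mechanical enough to script. Below is a minimal sketch of the idea, assuming a hand-maintained mapping of entity names to generic labels and a 6-to-8-digit account number convention; both the names and the pattern are illustrative, not from any real dataset.

```python
import re

# Hypothetical mapping of real entity names to generic labels --
# in practice this comes from your own client/vendor master list.
ENTITY_MAP = {
    "Acme Corp": "Customer A",
    "Globex Ltd": "Vendor B",
}

# Mask anything that looks like a GL account number (assumed here to be
# 6-8 digits; adjust the pattern to your own chart of accounts).
ACCOUNT_PATTERN = re.compile(r"\b\d{6,8}\b")

def anonymise(text: str) -> str:
    """Replace entity names and account numbers before text enters a prompt."""
    for real, generic in ENTITY_MAP.items():
        text = text.replace(real, generic)
    return ACCOUNT_PATTERN.sub("ACCT-XXXX", text)

row = "Acme Corp, account 4100250, invoiced under Globex Ltd master agreement"
print(anonymise(row))
# → Customer A, account ACCT-XXXX, invoiced under Vendor B master agreement
```

A lookup table rather than pure pattern matching matters here: the same real entity must map to the same generic label across the whole pack, or the model loses the thread of which counterparty is which.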


The Fundamental Challenge

LLMs are trained on internet text. They have absorbed patterns from millions of documents, including financial reports, earnings transcripts, and analyst commentary. This gives them a surface-level fluency with financial language that is genuinely impressive and deeply misleading.

The model knows that “revenue was below expectations” is a reasonable thing to say in a variance commentary. It knows that phrases like “driven primarily by” and “partially offset by” are the standard connectors in financial narrative. It can produce text that looks exactly like a CFO would expect to read in a monthly business review. The structure is right. The language is right. The tone is right.

The numbers, however, might be completely fabricated.

This happens because the model is optimising for plausible text, not for numerical accuracy. When it encounters a data table and is asked to produce commentary, it will generate text that sounds correct rather than text that is correct. A 4.2% variance might become 4.5% in the output. A ₹12.8L overrun might become ₹14.2L. The model is not lying. It is doing what it was trained to do: produce the most probable next token. And in the context of financial narrative, a plausible number is often more probable than the exact number.

Understanding this failure mode is the foundation of every technique that follows. Every effective prompt for financial data is, at its core, a set of constraints designed to prevent the model from generating plausible-sounding numbers that do not match the input.


Technique 1: Structured Context Injection

The single most impactful improvement I made to my prompts was providing structured context before asking the model to do any work. Not just the data table. The entire analytical frame.

What to include:

  • The chart of accounts structure, or at least the hierarchy relevant to the analysis (revenue by product line, OPEX by cost centre, headcount by function)
  • Prior period data (the model needs a baseline to assess whether a variance is unusual or within normal range)
  • Variance thresholds that matter to your organisation (“flag variances greater than 5% or ₹3L, whichever is smaller”)
  • The reporting period and any known one-time items (“October includes a ₹4L provision release in G&A that should be called out as non-recurring”)

Why it works: The model cannot infer your chart of accounts structure from the data alone. If you provide a flat table with 20 line items, the model does not know that “Cloud Infrastructure” and “Data Centre Costs” roll up into “Technology OPEX” unless you tell it. Providing the hierarchy means the model can produce commentary at the right level of aggregation and can identify when sub-line movements offset each other.

What I actually provide:

A header section before the data that reads something like this: “This is the October actuals pack for the India B2B SaaS business unit. Revenue is subscription-based (ARR model) with a professional services component. The cost structure is 65% headcount, 20% technology infrastructure, 15% other OPEX. The business has 180 employees across engineering, sales, and G&A. The prior year comparator reflects a smaller headcount base of 140, so year-over-year OPEX variances are expected to be adverse.”

That paragraph costs me two minutes to write and it prevents an entire category of errors where the model misinterprets the data because it lacks business context.
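Because the context header, hierarchy, and thresholds are stable month to month, I find it useful to assemble the prompt programmatically and only swap in the data table. A minimal sketch, with illustrative business details and roll-ups standing in for a real chart of accounts:

```python
# Sketch of structured context injection: business context, account
# hierarchy, and variance thresholds are injected before the data table.
# All specifics below are illustrative placeholders.

CONTEXT = (
    "This is the October actuals pack for the India B2B SaaS business unit. "
    "Revenue is subscription-based (ARR model) with a services component. "
    "Cost structure: 65% headcount, 20% technology infrastructure, 15% other "
    "OPEX. Prior-year comparator has a smaller headcount base, so adverse "
    "year-over-year OPEX variances are expected."
)

HIERARCHY = {
    "Technology OPEX": ["Cloud Infrastructure", "Data Centre Costs"],
    "People Costs": ["Salaries", "Contractor Fees"],
}

THRESHOLD = "Flag variances greater than 5% or INR 3L, whichever is smaller."

def build_prompt(data_table: str) -> str:
    """Assemble context, hierarchy, and thresholds ahead of the data."""
    rollups = "\n".join(
        f"- {parent}: {', '.join(children)}"
        for parent, children in HIERARCHY.items()
    )
    return (
        f"{CONTEXT}\n\nAccount hierarchy (children roll up to parent):\n"
        f"{rollups}\n\nVariance threshold: {THRESHOLD}\n\nData:\n{data_table}"
    )

print(build_prompt("Line item | Budget | Actual\nCloud Infrastructure | 10.0 | 12.2"))
```

Keeping the hierarchy as data rather than prose also means the same structure can drive the validation step later: the script that builds the prompt knows which sub-lines roll up where.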


Technique 2: Role-Based Prompting

This technique is well-documented in general prompting literature, but the finance-specific application matters. The role you assign the model changes the register, depth, and assumptions of the output.

What works: “You are a senior FP&A analyst with ten years of experience in SaaS businesses. You are preparing variance commentary for the monthly business review, which will be read by the CFO, the VP of the business unit, and the VP of Sales. Your commentary should be precise, concise, and focused on decision-relevant insights.”

Why the audience specification matters: When I tell the model the CFO will read the output, the commentary becomes more concise and focuses on materiality. When I tell it the VP of Sales will read it, the revenue section gets more detail and the language avoids jargon that a non-finance executive might not follow. This mirrors what good FP&A analysts do naturally: adjust the commentary for the audience.

What does not work: “You are a world-class financial analyst.” Flattery produces no measurable improvement in output quality. The model does not try harder because you called it world-class. Specificity about the role, the context, and the audience drives better output. Adjectives do not.

A subtlety I discovered: Adding “you are cautious about causal claims and will distinguish between observations supported by the data and hypotheses that require further investigation” produces noticeably better output than the base role prompt alone. The model will still make some unsupported causal claims, but fewer. It starts hedging appropriately on ambiguous variances, which is exactly what I want.
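Since the role, audience, and causal-caution instruction are the three parts that measurably matter, I keep them as parameters rather than retyping the prompt. A sketch, with wording adapted from the examples above:

```python
# Sketch of a role-prompt builder: role specificity, named audience, and
# the causal-caution instruction are the inputs; flattery is deliberately
# absent. The exact phrasing is illustrative.

def role_prompt(years: int, domain: str, readers: list[str]) -> str:
    """Compose a finance role prompt from the parts that drive output quality."""
    audience = ", ".join(readers)
    return (
        f"You are a senior FP&A analyst with {years} years of experience in "
        f"{domain} businesses. You are preparing variance commentary for the "
        f"monthly business review, read by {audience}. Be precise, concise, "
        f"and focused on decision-relevant insights. Be cautious about causal "
        f"claims: distinguish observations supported by the data from "
        f"hypotheses that require further investigation."
    )

print(role_prompt(10, "SaaS", ["the CFO", "the BU VP", "the VP of Sales"]))
```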


Technique 3: Output Format Specification

LLMs will default to whatever format seems most natural for the text they are generating. For financial commentary, that default is usually a paragraph of narrative prose. Narrative prose is fine for executive summaries. It is terrible for line-by-line variance analysis because it buries numbers inside sentences and makes validation painful.

I specify the exact output format I want, and I provide an example.

The format I use most often:

“For each material variance, provide:

  1. Line item: [name]
  2. Variance to budget: [amount] ([percentage]), [favourable/adverse]
  3. Variance to prior year: [amount] ([percentage])
  4. Driver: [explanation based on data provided, or ‘requires investigation’ if not determinable]
  5. Classification: [timing / run-rate / one-time]
  6. Action: [none required / flag for reforecast / escalate to budget owner]”

Why this works better than free-form: Every number is in a predictable location, so I can validate the output against the source data in a systematic sweep rather than hunting through paragraphs. The classification field forces the model to make a judgement call that I can agree with or override. The action field produces a natural to-do list for follow-up conversations.

The executive summary layer: After the line-by-line output, I ask for a three-sentence executive summary that highlights the two or three most significant variances and their combined impact. This gives me the narrative paragraph for the top of the business review pack, drafted in a tone that is appropriate for senior leadership.

Providing an example: I include one completed example in the prompt so the model can see exactly what I want. This eliminates ambiguity about formatting, tone, and level of detail more effectively than instructions alone. The example acts as a template that the model follows for all subsequent line items.
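The predictable field layout is what makes the validation sweep scriptable. A sketch of one such check, assuming the six-field template above and an illustrative source table; the field labels in the regex must match whatever template your prompt specifies:

```python
import re

# Sketch of the "systematic sweep": parse the structured output format and
# check each stated budget variance against independently computed figures.
# Source figures are illustrative (INR lakh).

SOURCE = {"Cloud Infrastructure": {"budget": 10.0, "actual": 12.2}}

ITEM = re.compile(r"Line item:\s*(.+)")
FIELD = re.compile(r"Variance to budget:\s*([-\d.]+)")

def check_block(block: str, tolerance: float = 0.05) -> bool:
    """Return True if the stated variance matches actual minus budget."""
    name = ITEM.search(block).group(1).strip()
    stated = float(FIELD.search(block).group(1))
    expected = SOURCE[name]["actual"] - SOURCE[name]["budget"]
    return abs(stated - expected) <= tolerance

model_output = "Line item: Cloud Infrastructure\nVariance to budget: 2.2 (22%), adverse"
print(check_block(model_output))  # True: 12.2 - 10.0 = 2.2
```

This does not replace the human validation pass described later; it just turns the number check from hunting through paragraphs into scanning a list of pass/fail flags.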


Technique 4: Chain-of-Thought for Financial Reasoning

When I need the model to do anything that resembles analysis rather than description (for example, assessing whether a cost trend is concerning or explaining why a revenue beat did not translate to a margin beat), I force it to show its reasoning step by step.

The prompt addition: “Before providing your conclusion, show your reasoning. State the relevant data points, the comparison you are making, and the logic that leads to your conclusion. If the reasoning requires a calculation, state the formula and inputs but do not compute the result. I will verify calculations independently.”

Why “do not compute” is critical: LLMs are unreliable at arithmetic. They are not using a calculator. They are predicting the most likely next token, which means they sometimes produce a plausible-looking number that is not the correct result of the stated calculation. By telling the model to state the formula and inputs but not compute the result, I get the analytical logic without the arithmetic risk. I compute the result myself or cross-reference against the source data.
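The verification side of this pattern can be scripted. Below is a sketch assuming the model is asked to state its reasoning on a single conventional line (the `FORMULA: ...; INPUTS: ...` format is my assumed convention, not something the model produces by default), with the arithmetic done locally:

```python
# Sketch of "state the formula and inputs, do not compute": the model emits
# a claim line, and the result is calculated locally rather than trusted.

def verify(stated: str) -> float:
    """Parse 'FORMULA: <expr>; INPUTS: a=1, b=2' and compute <expr> locally."""
    formula, inputs = stated.split(";")
    expr = formula.split(":", 1)[1].strip()
    bindings = {
        k.strip(): float(v)
        for k, v in (pair.split("=") for pair in inputs.split(":", 1)[1].split(","))
    }
    # eval over a restricted namespace: arithmetic on the stated inputs only
    return eval(expr, {"__builtins__": {}}, bindings)

claim = "FORMULA: (actual - budget) / budget; INPUTS: actual=57.6, budget=50.4"
print(f"{verify(claim):.1%}")  # computed locally, not by the model
```

The point is the division of labour: the model supplies the analytical logic (which comparison, which inputs), and deterministic code supplies the number.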

An example of where this helped: I asked the model to analyse a margin compression. Without chain-of-thought, the model said “gross margin declined 2.3 percentage points, driven by higher cloud infrastructure costs.” With chain-of-thought, the model walked through the reasoning: revenue increased 8% while COGS increased 14%, with the COGS increase split between cloud infrastructure (up 22%, the primary driver) and support headcount (up 6%, secondary). The detailed reasoning exposed that the cloud infrastructure increase was the dominant factor, which changed the follow-up conversation from a general cost review to a specific infrastructure spending discussion.

The limit of this technique: Chain-of-thought prompting makes the output longer and slower. For a 14-line-item variance pack, the chain-of-thought version takes about three minutes to generate instead of 40 seconds. I use it selectively, only for variance items where the story is complex enough to warrant structured reasoning.


What Never Works

Some approaches consistently produce bad output regardless of how carefully I structure the prompt. These are worth listing because they are the traps that seem like they should work.

Vague prompts. “Analyse this data and give me insights” produces generic observations that any intern could make by looking at the same table. The model needs specific questions to produce specific answers. “Identify the three largest adverse variances to budget and explain the likely drivers” is a better prompt than “tell me what is going on.”

Asking for specific numbers without providing the data. “What was our Q3 marketing spend?” when the prompt contains only October data. The model will answer confidently with a fabricated number rather than saying it does not have the information. This sounds obvious, but it happens frequently in multi-turn conversations where the context window fills with earlier exchanges and the model starts referencing data from previous prompts that is no longer relevant.

Trusting any calculation. I have tested this extensively. Even when I provide the exact inputs and ask the model to multiply two numbers, it gets the answer wrong about 15% of the time for numbers with more than four significant digits. For financial data, where every number matters, this error rate is unacceptable. Every calculation in the output must be verified independently.

Asking the model to access external data. “Compare our margin to industry benchmarks” when you have not provided the benchmarks. The model will produce benchmarks from its training data, which may be outdated, wrong, or from a different industry segment. If you need benchmarks in the output, provide them in the prompt.

Long, complex prompts without structure. A 2,000-word prompt written as continuous prose produces worse output than a 500-word prompt with clear sections, numbered instructions, and explicit format specifications. The model responds to structure. Give it structure.


The Validation Layer

Every technique above improves the probability of useful output. None of them eliminate the need for human validation. This is not a temporary limitation that will be fixed in the next model release. It is a structural property of how these models work.

The validation process I follow takes three passes.

Pass 1: Number check. Every quantitative claim in the output gets checked against the source data. Variance amounts, percentages, and any derived figures. This is the fastest pass because it is purely mechanical: does the number in the commentary match the number in the table?

Pass 2: Causal check. Every explanatory claim gets assessed against what I know about the business. Did marketing spend actually increase because of a product launch? Or was it a timing difference in agency invoicing? The model’s causal explanations are hypotheses, not conclusions. Some will be right. Some will be plausible but wrong. Some will be fabricated entirely.

Pass 3: Completeness check. Did the model miss any material variance? Did it flag the right items for follow-up? Did it correctly classify timing differences versus run-rate changes? This pass catches errors of omission, which are harder to spot than errors of commission.

Three passes sounds like a lot of work. In practice, it takes 30 to 40 minutes for a full monthly actuals pack. The total time (10 minutes for data preparation, 40 seconds for generation, 35 minutes for validation) is still less than half the time I used to spend writing the commentary from scratch.
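Pass 1 is mechanical enough that part of it can be scripted: extract every number from the commentary and flag any that do not appear in the source pack. A crude sketch, with an illustrative set of source figures; it catches the "4.2% became 4.5%" class of error, though it cannot catch a wrong number that happens to exist elsewhere in the table:

```python
import re

# Sketch of an automated first cut at Pass 1: flag commentary numbers
# that are absent from the source pack. Figures below are illustrative.

SOURCE_FIGURES = {12.8, 4.2, 3.0, 57.6}  # every number in the source table

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def unmatched_numbers(commentary: str) -> list[float]:
    """Return numbers in the commentary that do not appear in the source."""
    found = [float(n) for n in NUMBER.findall(commentary)]
    return [n for n in found if n not in SOURCE_FIGURES]

text = "Marketing overran by 12.8 lakh, a 4.5% adverse variance."
print(unmatched_numbers(text))  # flags 4.5 -- the source says 4.2
```

Anything the sweep flags goes straight to manual review; anything it passes still gets the causal and completeness checks, which cannot be automated.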


Where This Is Heading

The techniques above represent where LLM prompting for financial data is today: useful but heavily constrained, productive but requiring validation at every step. I expect this to improve. Models are getting better at following numerical constraints, and the ability to connect LLMs to structured databases (rather than pasting tables into a text prompt) will reduce a whole category of formatting and parsing errors.

But the core dynamic will not change soon. Financial data requires precision. LLMs are probabilistic. The gap between those two realities means that prompt engineering for finance will always be about designing constraints that keep the model inside the boundaries where it performs well and away from the boundaries where it fails confidently.

I wrote about the broader experiment that led to these techniques in I Gave an LLM Our Variance Pack. Here Is What It Got Right, and about how AI fits into forecasting specifically in AI-Powered Forecasting: Where Models Beat Gut and Where They Do Not.

If you are building prompts for financial data and have found techniques that work (or failure modes I have not covered here), I would love to compare notes. The best prompting patterns I have found so far came from conversations with other finance professionals who are running similar experiments. Let’s connect.

Series Insight

Part of my series on AI in Finance

How AI and machine learning are reshaping FP&A, audit, and financial reporting. Practical frameworks for finance professionals working through automation, LLMs, and data-driven decision making.

View all articles in this series →

Work through this with me

I run focused learning cohorts on FP&A frameworks, financial modelling, and the CA-to-CFO transition. Small groups, real problems, practical output.

Join the Cohort