My AI Toolkit for FP&A
Over the past eighteen months, I have tested more AI tools than I want to admit. Some I abandoned after a single session. A few changed how I work every day. Most fell somewhere in between: useful in theory, frustrating in practice, replaced by something better within a month.
This post is the honest inventory. I am walking through every stage of my FP&A cycle and sharing the specific tools and workflows I use right now. Not what I tried once and wrote a LinkedIn post about. What I actually reach for when the month-end close is running hot and I need variance commentary by Thursday morning.
If you have read my earlier posts on LLM-powered variance analysis, prompt engineering for financial data, and AI-powered forecasting, this is the meta-layer that ties those pieces together. The toolkit behind the experiments.
Stage 1: Data Preparation
This is where AI saves me the most time and where the tools are the most mature. Data prep used to consume 30 to 40% of my FP&A cycle. Now it is closer to 15%.
SQL generation with LLMs. I write about 60% of my data warehouse queries by describing what I need in plain English and letting an LLM generate the SQL. The model I use for this is Claude, run through a local CLI tool. I describe the table structure and the output I want (for example, “monthly revenue by product line and customer segment for the last 24 months, with a trailing three-month average column”), and the model produces a query that works on the first attempt about 80% of the time. The remaining 20% need minor corrections, usually around JOIN logic or date handling that is specific to our warehouse schema.
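To make that concrete, here is the shape of query the model typically hands back for the trailing-average request above, run against a toy SQLite table so you can see it execute. The schema and numbers are invented; our real warehouse tables look different:

```python
import sqlite3

# Hypothetical schema for illustration only; real warehouse tables differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE revenue (
        month TEXT, product_line TEXT, segment TEXT, amount REAL
    )
""")
conn.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?, ?)",
    [("2024-01", "Core", "Enterprise", 100.0),
     ("2024-02", "Core", "Enterprise", 110.0),
     ("2024-03", "Core", "Enterprise", 120.0),
     ("2024-04", "Core", "Enterprise", 130.0)],
)

# Monthly revenue with a trailing three-month average via a window function.
rows = conn.execute("""
    SELECT month, product_line, segment, revenue,
           AVG(revenue) OVER (
               PARTITION BY product_line, segment
               ORDER BY month
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS trailing_3m_avg
    FROM (
        SELECT month, product_line, segment, SUM(amount) AS revenue
        FROM revenue
        GROUP BY month, product_line, segment
    )
    ORDER BY month
""").fetchall()

for r in rows:
    print(r)
```

The `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` frame is exactly the part the model sometimes gets wrong on our schema, which is why I always read the frame clause before running anything.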
I tried GitHub Copilot for this. It works well inside VS Code when the context is already open in the editor. But for ad hoc queries where I need to describe the business intent before writing the SQL, a conversational interface produces better results because I can front-load the context.
Data cleaning and transformation. Raw ERP exports are messy. Account codes that changed mid-year, cost centre reclassifications, currency conversion inconsistencies. I used to clean these manually in Excel. Now our data team maintains a set of automated scripts that handle the structural transformations, and I use Claude for the ambiguous cases that require business judgement.
Here is what I mean by ambiguous cases. When a GL account code maps to two different cost centre names across periods (because someone renamed a department), I need to decide which name to use for the consolidated view. I paste the historical mapping and the current period data into Claude and ask it to recommend a normalised name. The model gets this right about 95% of the time, and when it does not, the error is easy to spot because I keep a log of every mapping decision. This saves me approximately four hours per month-end.
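For readers who want to build a similar decision log, here is a minimal sketch of the idea in Python. The account code, department names, and the most-recent-name heuristic are all invented for illustration; this is not my actual tooling:

```python
# Hypothetical sketch of a mapping-decision log. In practice the "chosen"
# name would come from an LLM recommendation that I review before approving.

def normalise_cost_centre(account_code, period_names, canonical_map, decision_log):
    """Pick one cost-centre name for an account seen under several names.

    period_names:  {period: name used in that period}
    canonical_map: previously approved {account_code: name} decisions
    decision_log:  running list of every decision, kept for later review
    """
    if account_code in canonical_map:
        chosen = canonical_map[account_code]
    else:
        # Placeholder heuristic: default to the most recent period's name.
        chosen = period_names[max(period_names)]
        canonical_map[account_code] = chosen
    decision_log.append({"account": account_code,
                         "seen": sorted(set(period_names.values())),
                         "chosen": chosen})
    return chosen

log, approved = [], {}
name = normalise_cost_centre(
    "6410",
    {"2024-01": "Growth Marketing", "2024-06": "Demand Generation"},
    approved, log)
print(name)
```

The log is the point, not the heuristic: because every decision is recorded with the names that were seen, a wrong recommendation is visible the moment you scan the log.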
What I abandoned. I tested three dedicated “AI data cleaning” tools (Trifacta, now Alteryx Designer Cloud, was the most polished). They solve a general data wrangling problem, not the specific FP&A data problem. My data issues are almost always about chart of accounts logic and period-over-period consistency, not about missing values or duplicate rows. A targeted workflow that understands my GL structure outperforms a general-purpose tool every time.
Stage 2: Analysis
This is the stage where I have the strongest opinions, because I have spent the most time testing approaches that did not work.
Variance commentary drafting. I covered this in detail in I Gave an LLM Our Variance Pack. The short version: I anonymise the monthly actuals (removing entity names, client identifiers, and anything confidential), format the data as a markdown table with pre-calculated variances, and paste it into Claude with a structured prompt. I get a first draft of line-by-line commentary in about 40 seconds. I validate every number and replace fabricated explanations with real drivers. Total time drops from three hours to one. This is the single highest-ROI AI workflow in my toolkit.
The model I use is Claude (Opus for complex packs, Sonnet for simpler monthly reviews). I tested GPT-4 and Gemini on the same task. GPT-4 produced slightly more polished prose but hallucinated numbers at a higher rate. Gemini was faster but inconsistent on format compliance. Claude follows structured output instructions more reliably, which matters when you need every variance item in the same format for the business review pack.
Anomaly detection in transactional data. Our data team runs a weekly automated report that pulls GL transactions above a threshold, calculates z-scores against the trailing six-month distribution for each account, and flags outliers. That statistical layer does not use AI. The AI layer comes after: I take the flagged transactions (with vendor names and counterparty details stripped out), paste them into Claude with the account description and recent history, and ask for a preliminary classification (coding error, timing difference, genuine anomaly, or requires investigation). The model’s classification accuracy sits at about 75%, which is good enough to prioritise my review. I investigate the “genuine anomaly” and “requires investigation” items first and handle the rest in batch.
I tried using an LLM for the statistical detection step too. It was unreliable. The model flagged items that looked narratively interesting rather than statistically unusual. A large transaction to a vendor with an unusual name would get flagged even if it was a routine monthly payment. A small transaction coded to a dormant account would get missed because nothing about it seemed noteworthy in plain text. Statistical anomaly detection needs actual statistics, not pattern matching on descriptions. I keep those two steps separate now.
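The statistical step is deliberately boring. A minimal version of the z-score flagging, with invented account histories standing in for our six-month trailing distributions, looks like this:

```python
import statistics

def flag_outliers(history, current, threshold=3.0):
    """Flag current-period amounts whose z-score against the trailing
    history exceeds the threshold. Pure statistics, no LLM involved."""
    flags = {}
    for account, amounts in history.items():
        mean = statistics.mean(amounts)
        stdev = statistics.stdev(amounts)
        if stdev == 0:
            continue  # no variation to score against
        z = (current[account] - mean) / stdev
        if abs(z) > threshold:
            flags[account] = round(z, 2)
    return flags

# Invented numbers: six months of history per account.
history = {"6100-Travel":   [10, 11, 9, 10, 12, 10],
           "6200-Software": [50, 52, 49, 51, 50, 48]}
current = {"6100-Travel": 40, "6200-Software": 51}

flags = flag_outliers(history, current)
print(flags)
```

Only the flagged items go to the LLM for preliminary classification; the unflagged ones never enter the narrative layer at all, which is what keeps the two steps cleanly separate.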
What I abandoned. Several “AI-powered analytics” platforms that promised automated insights from financial data. The common failure mode: they produced observations that were technically correct and strategically useless. “Revenue in October was 4.2% above budget” is not an insight. I already know that from the data. An insight is “revenue beat budget because three Enterprise renewals pulled forward from November, and the underlying run rate is actually 2% below plan.” No automated analytics tool produced that level of interpretation because it requires business context that does not exist in the data.
Stage 3: Modelling
This is where AI helps the least, and where I see the most overblown claims from vendors.
Copilot in Excel. I tried Copilot in Microsoft 365 for financial modelling. It can write basic formulas and summarise tables, but it breaks down on anything that involves nested logic, structured references across multiple sheets, or the kind of circular reference handling that is common in three-statement models. I still build Excel models by hand. Copilot helps with documentation (writing cell comments that explain the formula logic), but the model-building itself remains a manual process.
Where Copilot breaks financial models. Copilot generates output based on patterns from its training data, and financial models have domain-specific conventions that general-purpose patterns violate. For example, Copilot once suggested a discount factor approach that was mathematically valid but applied annual discounting when mid-year convention was required. These errors are subtle and they produce output that passes a casual review. I caught it because I know what the correct output should be, but a less experienced analyst might not.
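The gap between the two conventions is easy to see with invented numbers. Discounting a cash flow at the end of Year 3 versus the middle of Year 3 moves the present value by a few percent, which is exactly the size of error that passes a casual review:

```python
# Annual vs mid-year discounting: the kind of subtle error described above.
# Rate, cash flow, and year are invented for illustration.
r, cf, year = 0.10, 100.0, 3

annual   = cf / (1 + r) ** year          # assumes cash arrives at year-end
mid_year = cf / (1 + r) ** (year - 0.5)  # assumes cash arrives mid-year

print(round(annual, 2), round(mid_year, 2))
```

Both numbers are "mathematically valid"; only one matches the convention the model was built on, and nothing in the formula bar tells you which one you have.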
LLMs for model review. After building a model, I paste the key formulas and assumptions into an LLM and ask it to review for logical consistency. “Given these assumptions (revenue growth of 15%, gross margin of 72%, OPEX growing at 10%), does the implied free cash flow margin in Year 5 seem reasonable for a SaaS business at this scale?” The model’s response is a sanity check, not an audit. It catches errors that happen when you change one assumption and forget to update a downstream dependency.
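You can also replicate the arithmetic behind that sanity check yourself before asking the LLM. This sketch projects five years from invented Year-0 base figures under the stated assumptions and reports the implied operating margin (a rough proxy for the FCF margin question; the base figures are mine, not from any real model):

```python
# Invented Year-0 base figures; the growth assumptions are the ones
# quoted in the review prompt above.
revenue, opex = 100.0, 60.0
growth, gross_margin, opex_growth = 0.15, 0.72, 0.10

for year in range(1, 6):
    revenue *= 1 + growth       # revenue compounds at 15%
    opex    *= 1 + opex_growth  # OPEX compounds at 10%

gross_profit = revenue * gross_margin
operating_margin = (gross_profit - opex) / revenue
print(f"Year 5 operating margin: {operating_margin:.1%}")
```

If the margin that falls out of five lines of compounding disagrees with what the model shows, something downstream did not get updated, and that is the class of error this review catches.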
Stage 4: Reporting
Reporting is where I have seen the most improvement in the past six months, and it is the stage where I think AI tools are evolving the fastest.
Automated narrative generation for board packs. I wrote about the underlying prompting techniques in Prompt Engineering for Financial Data. The operational workflow is: I prepare an anonymised data summary (financials, KPIs, key movements, with all entity and client names replaced by generic labels), paste it into a prompt template that specifies the tone, audience, and format for the board pack, and get a first draft of the narrative sections. The CFO reviews and edits the draft, but the starting point is significantly more polished than a blank page.
The prompt template matters enormously here. I built templates for each section of the board pack: financial summary, revenue deep-dive, OPEX review, cash flow commentary, and forward-looking guidance. Each template specifies the format (bullet points versus narrative, level of detail, which KPIs to highlight) and the voice (concise, direct, decision-oriented). Building these templates took about two weeks of iteration. Using them saves approximately five hours per quarterly board cycle.
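A template in this spirit can be as simple as a dictionary of format and voice instructions per section. This is a hypothetical sketch, not my actual templates, which took those two weeks of iteration to get right:

```python
# Invented section template for illustration; real templates are longer
# and carry worked examples of the desired output.
TEMPLATES = {
    "financial_summary": {
        "format": "five bullet points, each under 25 words",
        "voice": "concise, direct, decision-oriented",
        "kpis": ["revenue vs budget", "gross margin", "cash runway"],
    },
}

def build_prompt(section, data_table):
    t = TEMPLATES[section]
    return (
        f"You are drafting the {section.replace('_', ' ')} section of a board pack.\n"
        f"Format: {t['format']}.\n"
        f"Voice: {t['voice']}.\n"
        f"Highlight these KPIs: {', '.join(t['kpis'])}.\n"
        f"Use only numbers that appear in the data below.\n\n"
        f"{data_table}"
    )

prompt = build_prompt("financial_summary", "| KPI | Actual | Budget |")
print(prompt)
```

Keeping the templates in one structure means a tone decision made for the financial summary propagates to every other section with one edit, instead of being re-typed into five ad hoc prompts.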
Presentation slide drafts. I take the narrative output and use it to build slides in PowerPoint. For teams with data engineering support, this step can be automated with scripting tools that populate a corporate template. In my workflow, the AI-generated narrative eliminates the most tedious part of board prep: writing the same “revenue increased X% driven by Y” commentary that appears on every quarterly deck. The slide assembly itself is still manual, but the content is already drafted.
What I abandoned. I tried using AI to generate charts directly from data descriptions. The tools produced charts that were aesthetically acceptable but analytically wrong: wrong axis scales, misleading baselines, aggregation levels that obscured the story. Chart design for financial presentations requires understanding what the audience needs to see, and that judgement is context-specific. I build charts in Power BI and Excel with full control over the visual design and use AI only for the narrative that accompanies them.
The Glue Layer: Repeatable Workflows That Connect the Stages
The most valuable part of my toolkit is not any single tool. It is the set of repeatable workflows that connect the stages above. Some of these I run manually through Claude; others our data team has automated into scripts that I trigger and review.
The reconciliation narrator. I take the output of an automated reconciliation (two data sources compared line by line), anonymise it by stripping account identifiers and counterparty names, and paste it into Claude with a prompt asking for a plain-English summary of the differences. The output reads something like: “Three items totalling ₹8.4L are unmatched between the GL and the sub-ledger. Two appear to be timing differences. One is a potential coding error based on the account category mismatch.” I review the output and act on it. Before this workflow, I read through reconciliation exceptions manually, about an hour per complex reconciliation.
The forecast commentary updater. When the rolling forecast gets updated, I export the current and prior versions, compare the line items with the largest changes, and paste the anonymised delta into Claude to generate a summary of what moved and why (based on the assumption changes logged by the FP&A team). This accelerates the “bridge from last forecast to current forecast” commentary that consumes time in every forecast cycle.
The month-end checklist assistant. I copy the status of each close task from our tracking spreadsheet into Claude and ask for a summary email draft: what is complete, what is pending, what is at risk. Simple, but it eliminated a 30-minute manual task that happened every day during close week.
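The mechanical part of that summary, grouping tasks by status before anything gets drafted, is a few lines of code; you could even skip the LLM for a plain status list. Task names here are invented:

```python
# Invented close tasks for illustration.
tasks = [("Bank reconciliation",        "complete"),
         ("Accruals posted",            "pending"),
         ("Intercompany eliminations",  "at risk"),
         ("FX revaluation",             "complete")]

def draft_status_email(tasks):
    """Group close tasks by status and draft a plain-text summary."""
    groups = {"complete": [], "pending": [], "at risk": []}
    for name, status in tasks:
        groups[status].append(name)
    lines = ["Close status update:"]
    for status in ("complete", "pending", "at risk"):
        if groups[status]:
            lines.append(f"{status.title()}: " + ", ".join(groups[status]))
    return "\n".join(lines)

email = draft_status_email(tasks)
print(email)
```

In my workflow the LLM then turns this skeleton into the polished email, but the grouping keeps the model from having to parse a raw spreadsheet dump.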
All three workflows follow the same principle: take structured data from a defined source, anonymise anything confidential, feed it to an LLM with a specific prompt, and validate the output against the source before using it. The validation step is non-negotiable. Every number in the output must appear in the source data. Without that check, you are publishing LLM-generated financial content without review, and I have seen enough hallucinated numbers to know how that ends.
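The "every number must appear in the source" check can itself be partly automated. This sketch pulls every numeric token out of a draft and flags any number missing from the source data; it will not catch every fabrication (a wrong sign or a swapped label survives), so it supplements the manual pass rather than replacing it:

```python
import re

def extract_numbers(text):
    """Pull every numeric token (optional commas and decimals) from text."""
    return {m.replace(",", "").rstrip(",")
            for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def unverified_numbers(llm_output, source_data):
    """Numbers in the LLM output that never appear in the source."""
    return extract_numbers(llm_output) - extract_numbers(source_data)

# Invented example: the draft fabricates a 4.5% that is not in the source.
source = "Revenue 4,210 vs budget 4,040, variance 170 (4.2%)"
draft = "Revenue of 4,210 beat the 4,040 budget by 170, or 4.5%."
print(unverified_numbers(draft, source))  # the fabricated 4.5 is flagged
```

Anything this check flags goes back to the source before the draft moves on; anything it passes still gets read by a human.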
What the Full Toolkit Looks Like
If I list everything in one place, this is my current working stack:
Tools I use directly: Claude (browser and CLI) for variance commentary, data cleaning decisions, narrative drafting, and ad hoc analysis. Power BI for stakeholder dashboards. Excel for financial modelling. Copilot in Microsoft 365 for formula documentation.
Infrastructure our data team maintains: SQL queries against BigQuery, automated data pipelines for ERP exports and reconciliations, statistical anomaly detection reports. I define the business logic and validation rules. The data team handles the automation.
Confidentiality protocol: Every piece of data I paste into an external LLM is anonymised first. Entity names replaced with generic labels. Client and vendor identifiers stripped. Account numbers masked. Confidentiality is non-negotiable, and I wrote about the full governance framework in Building an AI Governance Framework.
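A minimal version of the entity-masking step looks like the sketch below, with invented client names. A production version would also handle account numbers, e-mail addresses, case and spacing variants, and partial matches:

```python
# Minimal anonymisation sketch; names are invented for illustration.
def anonymise(text, entities):
    """Replace each known entity name with a stable generic label,
    returning the masked text and a local map for reversing it later."""
    mapping = {}
    # Longest names first, so "Acme Corp Ltd" masks before "Acme Corp".
    for i, name in enumerate(sorted(entities, key=len, reverse=True), 1):
        label = f"Client-{i}"
        mapping[label] = name  # kept locally, never sent to the LLM
        text = text.replace(name, label)
    return text, mapping

masked, mapping = anonymise(
    "Acme Corp renewed early; Globex churned in March.",
    ["Acme Corp", "Globex"])
print(masked)
```

The reverse map stays on my machine, so the LLM sees only generic labels while the final commentary can be restored to real names during the validation pass.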
Validation discipline: Every workflow that produces financial output includes a manual validation pass where I check LLM-generated numbers against source data. Any mismatch gets corrected before the output goes anywhere.
What is not on the list: No-code AI platforms (tested three, abandoned all because they could not handle FP&A data specificity). Any tool that promises “AI-powered financial analysis” without giving me control over the prompt and validation layer.
The Honest Assessment
My AI toolkit saves me roughly 15 to 20 hours per month across the full FP&A cycle. That is real. It is not transformative in the way vendor demos suggest, but it shifts my time allocation meaningfully. I spend less time on data preparation and first-draft commentary and more time on the work that changes decisions: talking to business unit leaders about what the numbers mean, building scenarios for the CFO, and connecting the financial picture to operational reality.
The tools that survived all share three properties. They give me control over the input. They give me control over the validation. And they integrate into my existing workflow rather than requiring a new platform. The tools I abandoned all failed on at least one of these criteria, usually by trying to do too much automatically without giving me enough control over the output.
If you are building your own AI toolkit for finance, start with one workflow where you spend disproportionate time on mechanical work. Open Claude or ChatGPT, paste in an anonymised version of that data with a structured prompt, and see what comes back. For me, that starting point was variance commentary. For you, it might be reconciliation, or data prep, or forecast updates. Start narrow. Validate obsessively. Expand only when you trust the output.
I wrote about the data infrastructure that makes this toolkit possible in Beyond Excel: The AI-Augmented Finance Stack, and about the forecasting workflow specifically in AI-Powered Forecasting: Where Models Beat Gut and Where They Do Not.
If you are building AI into your FP&A workflow and want to compare notes on what is working (or share a tool I should be testing), I would love to hear from you. The best additions to my toolkit have come from conversations with other finance professionals running similar experiments. Let’s connect.
Part of my series on AI in Finance
How AI and machine learning are reshaping FP&A, audit, and financial reporting. Practical frameworks for finance professionals working through automation, LLMs, and data-driven decision making.
Work through this with me: I run focused learning cohorts on FP&A frameworks, financial modelling, and the CA-to-CFO transition. Small groups, real problems, practical output.