Finance

Maintaining 97% Task Completion Across Three Consecutive Model Upgrades with an Agentic Workflow

ABOUT

A Financial Services Firm Running Agentic Workflows at Scale

A mid-sized financial services company had deployed a multi-agent system to automate a high-volume document processing workflow: extracting, classifying, and routing thousands of client submissions daily across a chain of interdependent agent tasks. For a regulated business, consistency was non-negotiable. Every model upgrade carried the risk of subtle behavioral changes that could silently affect output quality, compliance adherence, or task completion, often without an immediate visible signal.

CHALLENGES

The Hidden Risk of Upgrading a Live Agentic System

Upgrading the underlying model was necessary to stay competitive, but for a live agentic system each upgrade introduced a layer of uncertainty the team couldn't ignore. Without a structured way to validate behavior end-to-end, even small changes in model output could ripple through agent workflows in unpredictable ways.

What should have been routine upgrades became high-stakes events, forcing the team to balance the need for progress against the risk of breaking critical logic, compliance safeguards, and overall system reliability.

- With no formal record of certified behavior, there was no way to measure drift after an upgrade
- Subtle output changes didn't trigger errors but broke downstream agent logic
- Each upgrade required weeks of spot-checking with no guarantee of real coverage
- Any undetected regression in the compliance-check agent carried direct regulatory risk

SOLUTION

Automated Regression Coverage Across the Full Agent Chain

ThoughtMinds began by establishing a full baseline: mapping every agent-to-agent handoff and producing a versioned system contract that defined expected behavior, output formats, quality thresholds, and hard compliance limits at each step.

From that baseline, we built an automated test suite covering end-to-end scenarios across normal conditions, edge cases, and known historical failure modes. Each time a model upgrade was applied, the suite ran automatically, comparing results against the certified baseline and flagging any regression with a severity classification and trace-level root cause analysis.
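As an illustrative sketch only (not the firm's actual implementation), the baseline comparison described above can be reduced to a few checks per certified scenario; the field names, thresholds, and severity tiers here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BaselineCase:
    """One certified scenario: expected output and hard limits (hypothetical fields)."""
    case_id: str
    expected_route: str      # where the document should be sent downstream
    min_confidence: float    # quality threshold from the system contract
    compliance_ok: bool      # hard compliance limit: must always hold

def classify_regression(case: BaselineCase, route: str,
                        confidence: float, compliant: bool) -> str:
    """Compare a new model's output against the certified baseline.

    Severity order: compliance breaches are critical, wrong routing is
    major (it breaks downstream agents), a confidence dip alone is minor.
    """
    if case.compliance_ok and not compliant:
        return "critical"    # direct regulatory risk
    if route != case.expected_route:
        return "major"       # breaks downstream agent logic
    if confidence < case.min_confidence:
        return "minor"       # quality drift below the contracted threshold
    return "pass"

baseline = BaselineCase("KYC-0042", expected_route="onboarding",
                        min_confidence=0.90, compliance_ok=True)
print(classify_regression(baseline, "onboarding", 0.95, True))   # pass
print(classify_regression(baseline, "archive", 0.95, True))      # major
```

The point of the ordering is that a single run can always be summarized by its worst finding, which is what makes an automated pass/fail gate possible.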

PROCESS

Replacing Manual Spot-Checks with Systematic, Comprehensive Testing

- A system contract was defined, outlining success criteria, output schemas, latency budgets, and compliance constraints for each agent
- A test suite of 400+ real and synthetic scenarios was developed to validate end-to-end behavior
- Every model upgrade triggered automated test runs in a sandboxed production mirror, with rapid regression classification by type and severity
- Resolved issues were fed back into the test suite, expanding coverage with each upgrade cycle
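In harness form, the steps above amount to running every certified scenario against each upgrade candidate and tallying outcomes, so a release gate can fail fast on any regression. This is a hypothetical sketch, not the suite ThoughtMinds built; the scenario names and the stub model are invented:

```python
from collections import Counter

# Hypothetical certified scenarios: expected route per input document type.
BASELINE = {
    "loan-application": "underwriting",
    "kyc-form": "onboarding",
    "complaint-letter": "escalation",
}

def run_suite(candidate_model) -> Counter:
    """Run every baseline scenario against an upgrade candidate and
    count outcomes by severity for the release gate."""
    results = Counter()
    for doc_type, expected_route in BASELINE.items():
        actual = candidate_model(doc_type)
        results["pass" if actual == expected_route else "major"] += 1
    return results

# A stub "model" that mis-routes complaints, simulating a regression.
def upgraded_model(doc_type: str) -> str:
    return "archive" if doc_type == "complaint-letter" else BASELINE[doc_type]

print(run_suite(upgraded_model))  # Counter({'pass': 2, 'major': 1})
```

Feeding each resolved issue back in is then just adding another entry to the baseline, which is why coverage grows with every upgrade cycle rather than resetting.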

TESTIMONIAL
We'd been treating every model upgrade like a potential production incident because it was. ThoughtMinds gave us the coverage to upgrade with confidence.

Head of AI Engineering, Financial Services

IMPACT

Impact That Went Beyond Stability

Model upgrades shifted from a risk event to a routine operation, with clear pass/fail criteria and traceable evidence for compliance and audit purposes.

- The system contract evolved into a living document, improving alignment across AI, compliance, and business teams
- Cross-functional clarity increased through clearly defined, shared expectations for system behavior
- The test suite became an institutional asset, capturing edge cases and regulatory requirements
- Continuous feedback from production interactions strengthened the system with every upgrade cycle

Quantifying the Transformation

97%

Task completion rate maintained across three model upgrades

70%+

Reduction in manual validation effort per upgrade cycle

< 4 hrs

Mean time to root cause of detected regressions