Data Is the Real Model: Lessons From Building an LLM-Powered KYC Agent

There's a belief I kept running into when we started building our KYC compliance agent at ByteDance: if the model is wrong, get a better model. Swap GPT-4 for something newer. Tune the prompt. Add a system instruction. Repeat.

It took a few painful production incidents to realise that was the wrong frame entirely.

Data quality fundamentally outweighs model sophistication in production LLM systems. Here's what that actually looks like in practice.

The Problem: Confidently Wrong Answers

Our KYC agent was designed to answer compliance questions such as "what documents are required for a business account in Singapore?" or "does this transaction pattern trigger a SAR filing?" The model would respond fluently, in full sentences, with the right tone. And sometimes it was completely wrong.

Not random-noise wrong. Plausible wrong. The kind of wrong that passes a casual read and fails an audit.

The initial instinct, as always, was to assume the model needed better prompting or that we were using the wrong model. That instinct was wrong.

What Was Actually Broken: Three Data Problems

After digging in, we found three distinct failure modes — none of which were about the model.

1. Inconsistent Source Data

Our knowledge base had three overlapping sources: old internal policies, updated regional regulations, and country-specific rule overrides. They contradicted each other in subtle ways — a rule that was valid in one document was superseded in another, but both still lived in the retrieval index.

The model wasn't hallucinating. It was seeing genuinely conflicting information and picking one version. Reasonably, but incorrectly.

2. Ambiguous Ground Truth

Some compliance questions don't have explicit written answers. They rely on institutional knowledge: things experienced compliance officers know but never wrote down, because they were taken for granted. The model had to guess at unstated assumptions that any human in the room would have understood immediately.

3. Poorly Chunked Retrieval

Even with RAG in place, the chunking strategy was naive — split by token count, not by logical boundary. A single policy rule would get sliced across two chunks. The model would see the first half of the rule in context, miss the qualifying clause that lived in the next chunk, and confidently answer based on incomplete information.
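To make the failure concrete, here is a minimal sketch of fixed-size chunking cutting a rule in half. The rule text is invented for illustration, and whitespace-separated words stand in for tokens (a real pipeline would use the model's tokenizer).

```python
def chunk_by_token_count(text: str, max_tokens: int) -> list[str]:
    """Naive chunking: cut every max_tokens words, ignoring sentence and clause boundaries."""
    words = text.split()  # assumption: words approximate tokens for this demo
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


# Hypothetical policy rule, not taken from any real regulation.
rule = (
    "Business accounts require a certificate of incorporation "
    "unless the entity is a sole proprietorship."
)

chunks = chunk_by_token_count(rule, max_tokens=7)
for chunk in chunks:
    print(repr(chunk))
```

The requirement lands in the first chunk and the "unless" exception in the second. If retrieval returns only the first chunk, the model has no way to know the exception exists.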

The Real Insight

Hallucination wasn't a bug — it was a rational response to bad inputs.

LLMs don't invent problems. They amplify whatever data quality problems already exist. If your data is contradictory, the model will reflect contradiction. If your data has gaps, the model will fill them with its best guess. The model is essentially a very expensive mirror.

What Actually Fixed It

None of the fixes involved changing the model. All of them were about data discipline.

Enforce a single source of truth per rule. We built a curation layer that resolved conflicts before they hit the retrieval index — one authoritative version of each policy, clearly versioned, with older versions archived and excluded from retrieval.
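As a rough sketch of what such a curation step could look like: group documents by rule, keep one authoritative version, and archive the rest so they never reach the index. The `PolicyDoc` fields, the rule IDs, and the "newest effective date wins" policy are all assumptions for illustration. Real conflict resolution may need explicit precedence between source types or a human reviewer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyDoc:
    rule_id: str      # hypothetical stable identifier for a rule
    source: str       # e.g. internal policy, regional regulation, country override
    effective: date   # date this version took effect
    text: str


def curate(docs: list[PolicyDoc]) -> tuple[list[PolicyDoc], list[PolicyDoc]]:
    """Return (authoritative, archived). Assumed policy: newest effective date wins."""
    latest: dict[str, PolicyDoc] = {}
    for doc in docs:
        current = latest.get(doc.rule_id)
        if current is None or doc.effective > current.effective:
            latest[doc.rule_id] = doc
    authoritative = list(latest.values())
    # Everything that lost to a newer version is archived and kept out of retrieval.
    archived = [doc for doc in docs if latest[doc.rule_id] is not doc]
    return authoritative, archived


# Placeholder documents; IDs, dates, and text are invented.
docs = [
    PolicyDoc("SG-BIZ-DOCS", "internal_policy", date(2019, 1, 1), "Older version of the rule."),
    PolicyDoc("SG-BIZ-DOCS", "regional_regulation", date(2023, 6, 1), "Updated version of the rule."),
    PolicyDoc("SG-ADDRESS", "internal_policy", date(2021, 3, 1), "Only version of this rule."),
]

index_ready, archived = curate(docs)
print([f"{d.rule_id} <- {d.source}" for d in index_ready])
print([f"{d.rule_id} <- {d.source}" for d in archived])
```

Only `index_ready` would be embedded and indexed. The archived versions stay available for audit but can no longer be retrieved as if they were current.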

Let the agent say "I don't know." We added answerability checks: if the retrieved context didn't fully support a response, the agent would decline rather than guess. This felt like a regression at first (lower answer rate), but it was the right trade-off — wrong answers in compliance are far more expensive than no answers.
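One way to structure such a gate, sketched under stated assumptions: drop weakly matching passages, draft an answer, then run a separate support check before returning anything. The function names, the score threshold, and the two callables are hypothetical. In a real system `generate` would call the LLM and `is_supported` might be an entailment model or a second LLM pass; the stubs below exist only to make the example runnable.

```python
from typing import Callable

DECLINE = "I don't know. The available policy text does not fully answer this."


def answer_or_decline(
    question: str,
    passages: list[tuple[str, float]],                # (text, retrieval score in [0, 1])
    generate: Callable[[str, list[str]], str],        # drafts an answer from context
    is_supported: Callable[[str, list[str]], bool],   # checks the draft against context
    min_score: float = 0.75,                          # assumed threshold, tune per system
) -> str:
    context = [text for text, score in passages if score >= min_score]
    if not context:
        return DECLINE  # nothing relevant enough was retrieved
    draft = generate(question, context)
    if not is_supported(draft, context):
        return DECLINE  # the draft goes beyond what the context states
    return draft


# Stubs for demonstration only.
def stub_generate(question: str, context: list[str]) -> str:
    return context[0]


def stub_is_supported(draft: str, context: list[str]) -> bool:
    return any(draft in passage for passage in context)


weak = [("Unrelated passage about card fees.", 0.31)]
strong = [("Placeholder text of the relevant rule.", 0.92)]

print(answer_or_decline("Which documents are required?", weak, stub_generate, stub_is_supported))
print(answer_or_decline("Which documents are required?", strong, stub_generate, stub_is_supported))
```

The design choice worth noting is that declining is the default path: an answer is returned only when both the retrieval check and the support check pass.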

Align chunking to policy boundaries. We rewrote the ingestion pipeline to chunk by section and clause rather than by token limit. A rule stays intact as a unit. The model sees the full context it needs, not an arbitrary slice of it.
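A minimal sketch of clause-aligned chunking, assuming each clause in the policy text starts on its own line with a number such as "4.1". That numbering convention and the sample text are assumptions for illustration; real documents would need a parser matched to their actual structure.

```python
import re

# Assumption: a clause begins at the start of a line with a number like "4" or "4.1.2".
# Caveat: a wrapped line that happens to start with a number would be treated as a new clause.
CLAUSE_START = re.compile(r"^\d+(?:\.\d+)*\s", re.MULTILINE)


def chunk_by_clause(policy_text: str) -> list[str]:
    """Split policy text so each numbered clause becomes exactly one chunk."""
    starts = [m.start() for m in CLAUSE_START.finditer(policy_text)]
    if not starts:
        return [policy_text.strip()]
    preamble = policy_text[:starts[0]].strip()
    bounds = starts + [len(policy_text)]
    clauses = [policy_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return ([preamble] if preamble else []) + clauses


# Hypothetical policy excerpt, invented for this example.
policy = """4.1 Business accounts require a certificate of incorporation,
unless the entity is a sole proprietorship.
4.2 Proof of address must be dated within the last three months.
"""

clause_chunks = chunk_by_clause(policy)
for chunk in clause_chunks:
    print(repr(chunk))
```

Each chunk now carries a whole clause, so the "unless" qualifier travels with the rule it modifies. Clauses longer than the embedding model's limit would still need a fallback, such as splitting on sentence boundaries and repeating the clause heading in each piece.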

Hallucinations dropped significantly after these changes. Not because we made the model smarter — because we made the data more truthful.

What This Means More Broadly

The AI industry has spent years obsessing over model benchmarks. And the models are remarkable. But most production failures I've seen aren't model failures — they're data failures that look like model failures.

If you're building on top of LLMs, treat your data as part of the model itself, not merely input to it. The curation pipeline, the retrieval strategy, the ground truth — these are engineering problems that no amount of model scaling will solve for you.

We don't have a model problem. We have a data discipline problem.

And that's actually good news — because data is something you can own, structure, and continuously improve. Models are something you rent.