An agent failed on me last week. Not spectacularly. Quietly.
It produced code that looked right, passed the tests I gave it, and would have shipped if I hadn't caught it. The problem wasn't the model. The model was fine. The problem was that I'd handed it a stale version of an interface and it built confidently against the old shape.
Garbage in, garbage out. Nothing new. Except now the garbage is moving faster and the out part is production code.
Everyone's racing to build better agents. Bigger models, smarter planners, longer reasoning chains. That race is interesting but it's the wrong one to be running. The teams who will win the next wave aren't the ones with the best agents. They're the ones who figured out how to feed those agents the right information at the right time.
Agents don't fail because they're dumb
They fail because they're uninformed.
Watch an agent hallucinate a function signature and you'll see it immediately. It had no way to know the real signature. It had the training data, it had your prompt, it had whatever context window you gave it, and none of those told it what the current codebase actually contained. So it guessed, and the guess was plausible, and the guess was wrong.
That's not a reasoning failure. That's a context failure. You asked it a question it didn't have the data to answer and it answered anyway.
The fix isn't a smarter model. The fix is better information delivery. Retrieval that pulls the actual current function signature before the agent generates. Memory that remembers what the codebase looked like last week and flags the diff. State that reflects what's running in production right now versus what the last commit claims.
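The retrieval half of that fix is small enough to sketch. This is a toy version, assuming a Python codebase and nothing beyond the standard library; the function names are made up, and a real pipeline would also handle methods, overloads, and async defs. The point is only that the agent sees the signature as it exists right now, not as it existed in its training data.

```python
# Toy retrieval step: read the signature from source as it exists right now.
# The function names here are hypothetical.
import ast
from pathlib import Path

def current_signature(source_path: str, func_name: str) -> str | None:
    """Parse the current file and return the def line for func_name, if it still exists."""
    tree = ast.parse(Path(source_path).read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            args = ", ".join(a.arg for a in node.args.args)
            return f"def {func_name}({args}):"
    return None  # gone or renamed; better to surface that than let the agent guess

def build_prompt(task: str, source_path: str, func_name: str) -> str:
    sig = current_signature(source_path, func_name)
    interface = sig if sig else f"WARNING: {func_name} not found in {source_path}"
    return f"Current interface:\n{interface}\n\nTask: {task}"
```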
That's infrastructure work. That's what nobody wants to build because it isn't glamorous, and that's exactly why the teams who do build it are going to pull ahead.
What context engineering actually is
Strip the jargon away and there are four things.
Retrieval. When the agent needs to know something, go get it. Not from training data. From the real current source. Codebase, database, API, whatever the authoritative version is. The engineering problem is relevance: fetching too much drowns the agent in noise, fetching too little leaves it blind. There's an art to deciding what matters for this specific question at this specific moment.
Memory. What has this agent seen before? What decisions were made in earlier sessions? What did the user correct last time that should inform this time? Without memory every interaction starts from zero. With memory you build compounding intelligence. The engineering problem is persistence: what's worth saving, what's noise to discard, how long it stays useful.
Streaming state. Some questions require knowing what's happening right now, not what was true an hour ago. Market prices. Inventory counts. Active user sessions. Build status. The engineering problem is latency: can you get the agent fresh-enough state without blocking its reasoning while you fetch?
Relevance ranking. You have all the data in the world. You have a fixed context window. What makes it in? The engineering problem is prioritization, and it's the hardest of the four because the right answer depends on what the agent is about to do, which you often don't know until it starts doing it.
None of this is glamorous. All of this is load-bearing.
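Here's one way the four pieces fit together in code. A sketch, not a framework: every name is invented for illustration, and the scoring, storage, and fetching behind each input are the real work.

```python
# Illustrative only: one context-assembly pass over the four inputs.
# Chunk, assemble_context, and the scoring behind `relevance` are hypothetical.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    relevance: float  # scored against the current question upstream (embeddings, BM25, recency...)
    tokens: int

def assemble_context(retrieved: list[Chunk],   # retrieval: pulled live from the authoritative source
                     remembered: list[Chunk],  # memory: prior sessions, prior corrections
                     live_state: list[Chunk],  # streaming state: what's true right now
                     budget_tokens: int) -> str:
    """Relevance ranking: order everything, then fill a fixed window best-first."""
    ranked = sorted(retrieved + remembered + live_state,
                    key=lambda c: c.relevance, reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        if used + chunk.tokens > budget_tokens:
            continue  # skip what doesn't fit rather than truncating mid-thought
        picked.append(chunk)
        used += chunk.tokens
    return "\n\n".join(c.text for c in picked)
```

The greedy fill is the least interesting part of that sketch; the three input lists and the relevance scores are where the engineering lives.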
This isn't actually new. It's just renamed.
If you've spent time writing smart contracts you've already thought about this problem. You just called it something else.
A contract is only as good as the data it reads. Put a bad price feed in front of a lending protocol and the protocol is compromised no matter how elegant the contract code is. Mango Markets. Compound during Luna. A long list of exploits where the contract did exactly what it was supposed to do with the information it was given, and the information was wrong.
Oracle design is context engineering. Decide what data the contract needs. Decide where to source it from. Decide how to aggregate multiple sources so one bad feed can't poison the result. Decide how stale is too stale. Decide what happens when the feed fails. Build monitoring that catches drift before it becomes exploitation.
Read that paragraph again and mentally substitute "agent" for "contract." It's the same problem. Both systems execute confidently on the information you hand them. Both systems fail in ways that look like the system failed when actually the data failed. Both systems require you to build the information layer with the same rigor you'd build the execution layer, because the execution layer is only as trustworthy as what's feeding it.
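That shared shape fits in a dozen lines. A sketch with hypothetical feeds and thresholds, not any particular protocol's oracle:

```python
# Oracle pattern in miniature: multiple sources, staleness bound, median aggregation.
# The sources, thresholds, and names are hypothetical.
import time
from dataclasses import dataclass
from statistics import median

@dataclass
class FeedReading:
    price: float
    fetched_at: float  # unix seconds

MAX_STALENESS_S = 60   # how stale is too stale
MIN_SOURCES = 3        # below this, refuse to act rather than act on thin data

def aggregate(readings: list[FeedReading]) -> float:
    """Median of fresh readings, so one bad or stale feed can't poison the result."""
    now = time.time()
    fresh = [r.price for r in readings if now - r.fetched_at <= MAX_STALENESS_S]
    if len(fresh) < MIN_SOURCES:
        raise RuntimeError("not enough fresh feeds")
    return median(fresh)
```

Swap "price feed" for "retrieved document" and every line still applies: cross-check sources, bound staleness, refuse to proceed when the data isn't there.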
Engineers coming from web3 are weirdly well-positioned for this shift. We've been thinking about trust-minimized data pipelines for years. The vocabulary changed. The problem didn't.
Where it breaks in production
Two failure modes, opposite directions.
Too much context. The team over-retrieves because retrieval feels free. Every query pulls back forty documents, the context window is 90% full before the agent starts thinking, and the signal gets lost in the noise. The agent then confidently builds against whichever chunk happened to be most recent or most verbose, not the chunk that actually answered the question. The symptom is plausible-looking output that's subtly wrong in ways nobody catches until production.
Too little context. The team under-retrieves because the infrastructure to retrieve well doesn't exist yet. The agent gets the question and a thin slice of context and fills in the gaps with training-data plausibility. Hallucinated APIs. Invented function signatures. References to libraries that don't exist in your stack. The symptom is output that looks right and fails immediately on first run, which is actually the better failure mode because at least you see it.
The middle, where the context is sized to the question and ranked by real relevance, is the craft. Most teams aren't doing it yet. The ones who figure it out first are going to look like their agents got magically smarter, when really their context pipeline got surgical.
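The craft doesn't have to start sophisticated. Even a crude pair of guardrails, sketched here with made-up thresholds, catches both failure modes: a relevance floor so over-retrieval doesn't bury the signal, and a refusal path so under-retrieval stops the agent instead of letting it guess.

```python
# Guardrails against both failure modes; the thresholds are illustrative, not tuned.
RELEVANCE_FLOOR = 0.35  # below this a chunk is noise: keeping it is the "too much" failure
MIN_SIGNAL = 1          # below this count, stop: generating anyway is the "too little" failure

def select_context(scored_chunks: list[tuple[float, str]]) -> list[str] | None:
    """Keep only chunks that clear the floor; return None to mean 'retrieve again, don't generate'."""
    signal = [(score, text) for score, text in scored_chunks if score >= RELEVANCE_FLOOR]
    if len(signal) < MIN_SIGNAL:
        return None
    signal.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in signal]
```

Returning None instead of a thin context is the important line: it turns the silent failure mode into a loud one.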
If you're hiring for this
Look at your "AI engineer" job description. Is it asking for prompt engineering? Fine-tuning? Model selection?
If so, you're hiring for the wrong role.
The engineer who'll actually move your product forward is the one who's going to look at your retrieval pipeline, your memory strategy, your state freshness guarantees, and your relevance ranking, and rebuild whichever one is the weakest link. That person is probably going to come from an infrastructure background, not an ML background. They're going to care about latency and staleness and cache invalidation more than they care about the latest model release.
That's not a glamorous hire. The job spec doesn't read like "AI." It reads like "senior backend engineer who understands information retrieval." Half the candidates you'd dismiss for "not being AI enough" are the ones who'll make your AI work.
Rename the role if you have to. Hire for the problem, not the hype.
The takeaway
The model is not the moat. The model is a commodity, or it will be by the end of next year. Every serious company will have access to approximately the same reasoning capability.
What separates the teams that ship agents that work from the teams that ship agents that hallucinate is the quality of the information pipeline feeding them. That pipeline is boring to build, hard to build well, and almost entirely invisible when it's working.
If you want to know which AI product will win its category a year from now, don't look at the model. Look at the context.