Everyone's building AI agents. Nobody's talking about what happens after you deploy them. The demos look incredible. The blog posts promise autonomous workflows that handle themselves. The reality is messier, more expensive, and more interesting than anyone is admitting.
We run Octopoda, a memory and observability engine for AI agents. Our system logs every memory write, every search, every decision, every loop, every failure across every agent connected to our platform. After a month of real production data from real developers building real things, we have numbers that tell a very different story from the hype.
This isn't a survey. Nobody filled out a form. This is raw operational data from agents doing actual work.
The headline numbers
In our first 30 days of production:
Write latency across the platform averages 30 milliseconds. Read latency averages 10 milliseconds. Semantic search averages 5 milliseconds. The infrastructure can handle the load. The agents themselves are the bottleneck.
72% of agent time is spent on things that aren't the actual task
This number shocked us. We broke down what agents actually spend their time doing:
Answering questions, making decisions, producing output
Relearning things the agent already knew from previous sessions
Deciding which tool to use and interpreting the result
Tools failing, retries, fallback logic, trying different approaches
Reasoning about what to do next without making progress
The agents that perform best all have three things in common
We ranked agents by their health scores and looked at what the top 20% do differently from the bottom 20%.
Persistent memory
Agents with memory make fewer redundant tool calls, give more consistent answers, and recover faster from errors because they can reference past context.
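To make the idea concrete, here is a minimal sketch of persistent memory using Python's built-in `sqlite3` module. It is not Octopoda's API: the `SessionMemory` class, the `recall_or_fetch` helper, and the `agent_memory.db` filename are all hypothetical names chosen for illustration. The point is only that a fact learned in one session is stored on disk, so the next session can look it up instead of calling a tool again.

```python
import sqlite3


class SessionMemory:
    """Hypothetical key-value memory that survives process restarts."""

    def __init__(self, path="agent_memory.db"):
        # A file path persists across sessions; ":memory:" does not.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)"
        )
        self.conn.commit()

    def remember(self, key, value):
        # Overwrite any earlier value stored under the same key.
        self.conn.execute(
            "INSERT OR REPLACE INTO memory (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.conn.commit()

    def recall(self, key):
        row = self.conn.execute(
            "SELECT value FROM memory WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def recall_or_fetch(memory, key, fetch):
    """Check memory first; only call the (expensive) fetch on a miss."""
    value = memory.recall(key)
    if value is None:
        value = fetch()
        memory.remember(key, value)
    return value
```

In this sketch, `fetch` stands in for whatever tool call the agent would otherwise repeat. The second lookup for the same key never reaches the tool.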
Focused prompts
The best performing agents have prompts under 500 words with clear role definitions and explicit rules. The worst have prompts over 2,000 words trying to cover every scenario. Longer prompts don't make agents smarter. They make agents confused.
Limited tool sets
More tools means more decisions about which one to use. Each decision is a chance to pick wrong. The best agents have exactly the tools they need and nothing more.
Multi-agent systems: the shared memory effect
Multi-agent is not automatically better. Multi-agent with shared knowledge is dramatically better. Multi-agent without shared knowledge is actually worse than doing it alone.
When agents can read what other agents have found, decided, and concluded, the whole team performs at a level no individual agent can match. Without that channel, each agent works in isolation, duplicates effort, and sometimes contradicts the others.
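One simple way to picture shared knowledge is a common board that every agent posts to and reads from before starting work. The sketch below is an in-process illustration only, and the `SharedBoard` class and its method names are hypothetical. A real deployment would need a store that several processes can reach.

```python
from collections import defaultdict


class SharedBoard:
    """Hypothetical shared memory: agents post findings under a topic."""

    def __init__(self):
        self._findings = defaultdict(list)

    def post(self, agent, topic, finding):
        self._findings[topic].append((agent, finding))

    def read(self, topic, exclude_agent=None):
        # Return what other agents have recorded for this topic.
        return [
            (agent, finding)
            for agent, finding in self._findings.get(topic, [])
            if agent != exclude_agent
        ]

    def already_covered(self, topic, by_other_than):
        # True if some other agent has already worked on this topic,
        # so the caller can reuse the result instead of duplicating it.
        return len(self.read(topic, exclude_agent=by_other_than)) > 0
```

An agent that calls `already_covered` before starting a task can skip work a teammate has finished, which addresses the duplicated effort described above.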
The most common failure mode costs the most money
We categorised every failure event in our system:
A looping agent quietly burns tokens doing nothing useful. Average cost: $4.70. Most expensive single loop: $180.
Agent picks web search when it should check memory. Mostly prompt and tool description issues.
Agent skips tools entirely and generates from training data. Confident, fluent, sometimes completely wrong.
Agent acts on outdated information. A customer's plan changed but memory wasn't updated.
APIs timing out, rate limits hit, unexpected response formats.
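Loops in particular can be capped with a small guard around the agent's step function. The following is a sketch under stated assumptions: the `LoopGuard` class, its default limits, and the idea of identifying an action by a string are all illustrative choices, not values taken from our data.

```python
from collections import Counter


class LoopGuardTripped(RuntimeError):
    """Raised when an agent run exceeds one of its limits."""


class LoopGuard:
    """Hypothetical guard: caps steps, spend, and repeated actions."""

    def __init__(self, max_steps=20, max_cost_usd=2.00, max_repeats=3):
        # The default limits are placeholders; tune them per workload.
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.steps = 0
        self.cost_usd = 0.0
        self.seen = Counter()

    def record(self, action, cost_usd):
        """Call once per agent step, before executing the action."""
        self.steps += 1
        self.cost_usd += cost_usd
        self.seen[action] += 1
        if self.steps > self.max_steps:
            raise LoopGuardTripped(f"step limit {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise LoopGuardTripped(f"budget ${self.max_cost_usd:.2f} exceeded")
        if self.seen[action] > self.max_repeats:
            raise LoopGuardTripped(f"action repeated too often: {action}")
```

Here `action` is any string that identifies a step, for example the tool name plus its arguments. Seeing the same string more than `max_repeats` times is treated as a sign the agent is stuck.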
The token economy is brutal at scale
| Task type | Avg tokens | Cost (GPT-4) | At 100/day |
|---|---|---|---|
| Simple Q&A | 2,400 | ~$0.08 | $240/mo |
| Research + web search | 12,000 | ~$0.40 | $1,200/mo |
| Multi-agent pipeline | 35,000 | ~$1.00 | $3,000/mo |
| Customer support | 6,500 | ~$0.22 | $660/mo |
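The monthly column follows from simple arithmetic: cost per task, times tasks per day, times days per month. The sketch below assumes a 30-day month, which is what the Simple Q&A row implies ($0.08 × 100 × 30 = $240). The function names are our own for illustration.

```python
def monthly_cost(cost_per_task_usd, tasks_per_day, days_per_month=30):
    """Projected monthly spend for one task type, in USD."""
    return round(cost_per_task_usd * tasks_per_day * days_per_month, 2)


def cost_per_1k_tokens(cost_per_task_usd, avg_tokens):
    """Blended price per 1,000 tokens implied by a row of the table."""
    return cost_per_task_usd / avg_tokens * 1000
```

For example, `monthly_cost(0.08, 100)` reproduces the $240/mo figure for Simple Q&A and `monthly_cost(0.40, 100)` reproduces the $1,200/mo figure for research with web search. The same function lets you plug in your own daily volume.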
The developers doing well financially all use the same strategies: cheaper models for simple routing, response caching for repeated questions, strict iteration limits, and persistent memory to avoid the 23% amnesia tax.
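Two of those strategies, model routing and response caching, fit in a few lines. This is a rough sketch rather than a recommended implementation. The model names `cheap-model` and `strong-model`, the word-count heuristic, and the `call_model` callable are all hypothetical stand-ins for whatever client and routing rule you actually use.

```python
import hashlib


def normalize(question):
    """Collapse case and whitespace so trivially different questions match."""
    return " ".join(question.lower().split())


class CachedRouter:
    """Hypothetical router: cheap model for short questions, cache for repeats."""

    def __init__(self, call_model, simple_max_words=12):
        # call_model(model_name, question) -> str is supplied by the caller.
        self.call_model = call_model
        self.simple_max_words = simple_max_words
        self.cache = {}

    def pick_model(self, question):
        # Crude heuristic (an assumption): short questions are "simple".
        if len(question.split()) <= self.simple_max_words:
            return "cheap-model"
        return "strong-model"

    def answer(self, question):
        key = hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.call_model(self.pick_model(question), question)
        return self.cache[key]
```

A repeated question is served from the cache without a model call. An in-memory dictionary is the simplest possible cache, and answers that can go stale would need an expiry policy on top.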
What the data says about the future
The average number of tools per agent is up 40% in 30 days. Developers are building more ambitious systems.
Total memories stored doubled in the last week. The infrastructure behind agent memory needs to scale faster than expected.
The share of multi-agent systems went from 1 in 10 to 1 in 5 over the month.
The platform average health score went from 65 to 70. Developers are learning from their debugging tools. Visibility drives improvement.
What this means for you
If you're building AI agents or planning to, here's what the data says you should focus on:
We publish these numbers because the AI agent industry needs more honesty about what's actually happening in production. The demos are impressive. The reality is harder. But the reality is also genuinely promising — if you build with your eyes open.
pip install octopoda. See what's actually happening inside your agents.

