Running Claude Agents in Production: 7 Things That Will Break
Demo-to-production is the hardest gap in AI agent development. Not because the technology is unreliable — Claude is remarkably capable — but because production exposes assumptions that a demo doesn’t test. Real data. Real infrastructure. Real failure modes. Real costs at real scale.
Here are the seven failure modes we see most often, what they look like in practice, and how to prevent them.
1. Runaway Loops
What it looks like: An agent tasked with “fetch the current lead data from the CRM” hits a rate limit. The responsible thing seems to be to retry. So it retries. And retries. 73 times over 19 minutes, spending $4.20, making no progress.
Why it happens: No circuit breaker. No maximum retry count. No monitoring that notices the pattern. The agent is technically following its instructions.
Prevention:
- Configure maximum retry attempts at the gateway level
- Set anomaly detection that alerts when an agent makes N identical calls in M minutes
- Set a daily token budget that would cap this at, say, $2 before stopping the session
- Monitor for sessions that are “alive but not making progress”
A loop that burns $4 is a minor incident. The same loop, with a higher budget and no monitoring, is a $200 surprise. The controls are cheap to implement and the alternative is expensive to discover.
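In code, the first two controls are only a few lines. Here's a minimal sketch of a retry guard with a hard attempt cap and a per-session budget cap; `RateLimitError`, `guarded_call`, and the cost figures are illustrative, not any specific SDK's API:

```python
import time

MAX_RETRIES = 5            # hard ceiling on identical retries
BACKOFF_BASE_S = 2.0       # exponential backoff base, in seconds
SESSION_BUDGET_USD = 2.00  # per-session spend cap

class RateLimitError(Exception):
    """Placeholder for whatever your tool layer raises on a 429."""

class BudgetExceeded(Exception):
    """Raised instead of silently burning money."""

def guarded_call(call, est_cost_usd, spend):
    """Retry with backoff, enforcing both an attempt cap and a budget cap."""
    for attempt in range(MAX_RETRIES):
        if spend["usd"] + est_cost_usd > SESSION_BUDGET_USD:
            raise BudgetExceeded("session budget reached; halting the loop")
        try:
            result = call()                  # the actual tool / API invocation
            spend["usd"] += est_cost_usd
            return result
        except RateLimitError:
            time.sleep(BACKOFF_BASE_S * 2 ** attempt)  # back off, don't hammer
    raise RuntimeError(f"gave up after {MAX_RETRIES} attempts")  # circuit opens
```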
2. Blast Radius Incidents
What it looks like: An agent with broad AWS permissions, tasked with “clean up the development environment,” interprets “clean up” more aggressively than you intended. S3 buckets, CloudWatch alarms, IAM roles, EC2 snapshots — all deleted. Some of them weren’t in the development environment.
Why it happens: No resource scoping in the policy. “Development environment” is a human concept; the agent doesn’t share your mental model of which resources are in scope.
Prevention:
- Scope policies to specific resource ARNs and prefixes, not wildcards
- Require approval for any deletion operation, regardless of environment
- Tag resources explicitly and enforce policy based on tags
- Test your policies by prompting the agent to do things it shouldn’t be able to do
The rule: an agent should only be able to affect resources that are explicitly in its scope. Everything else should return a 403.
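For AWS specifically, resource scoping plus a blanket deny on deletions might look like the sketch below, applied with boto3. The role, bucket, and policy names are placeholders; the important parts are that `Resource` is a specific prefix, not a wildcard, and that an explicit `Deny` wins over any `Allow`.

```python
import json
import boto3  # AWS SDK; the caller needs iam:PutRolePolicy permission

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow only the operations the agent actually needs,
            # scoped to an explicit prefix.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::dev-agent-workspace/*",
        },
        {
            # Deletions are denied outright, so they must go through
            # a human approval path instead.
            "Effect": "Deny",
            "Action": ["s3:DeleteBucket", "ec2:DeleteSnapshot", "iam:DeleteRole"],
            "Resource": "*",
        },
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="claude-dev-cleanup",   # one role per agent, not a shared key
    PolicyName="scoped-dev-only",
    PolicyDocument=json.dumps(policy),
)
```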
3. Missing Audit Trail
What it looks like: Something unexpected happens. You need to understand what the agent did. You have Claude’s conversation history (maybe), but no structured log of actions taken, what was allowed vs. denied, or what external effects were produced.
Why it happens: Audit logging is treated as optional infrastructure. It’s not.
Prevention:
- Route all agent operations through a control plane that logs every action
- Log the structured data: agent ID, session ID, action type, resource, result, timestamp
- Export logs to storage you own and control (not just a vendor dashboard)
- Test that you can reconstruct a session: given a session ID, can you replay what happened?
The test: pick any session from last week and answer “what did this agent do between 2:00 PM and 2:30 PM?” If you can’t do this in under 5 minutes, you don’t have a real audit trail.
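The log itself doesn't need to be fancy. Here's a sketch of one structured, replayable record per action, written as JSON Lines to storage you own; the field names are illustrative, and what matters is that every field from the list above is present and queryable:

```python
import json
import uuid
from datetime import datetime, timezone

def log_action(agent_id, session_id, action, resource, result, sink):
    """Append one structured, replayable record per agent action."""
    entry = {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,    # e.g. "claude-deploy-01"
        "session_id": session_id,
        "action": action,        # e.g. "s3:PutObject"
        "resource": resource,    # what was touched
        "result": result,        # "allowed" | "denied" | "error"
    }
    sink.write(json.dumps(entry) + "\n")  # JSON Lines: easy to grep and replay

# Usage:
# with open("audit.jsonl", "a") as f:
#     log_action("claude-deploy-01", "d97e2169", "s3:PutObject",
#                "arn:aws:s3:::dev-agent-workspace/report.csv", "allowed", f)
```

Given a session ID, replaying a session is then a filter over one file instead of an archaeology project.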
4. Shared Credentials
What it looks like: You have 8 agents. They all use the same AWS access key. One of them starts behaving unexpectedly and needs to be cut off. The only way to do that is to rotate the shared key, which breaks all 8 agents simultaneously.
Why it happens: Shared credentials are easier to set up initially. Per-agent identity requires more configuration.
Prevention:
- Each agent gets a unique identity (Sentrely manages this automatically)
- Credential rotation affects only the target agent, not the fleet
- Audit logs attribute actions to specific agents, not “a thing with these credentials”
- If one agent is compromised or misbehaving, it can be isolated without affecting others
This also matters for compliance: “an agent did this” is not sufficient attribution. “claude-deploy-01 in session d97e2169 did this at 14:23:07” is.
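On AWS, one concrete way to get per-agent identity (Sentrely does the equivalent for you, but the shape is worth seeing) is one IAM role per agent plus short-lived STS credentials. The account ID and role naming below are placeholders:

```python
import boto3

def credentials_for(agent_id: str, session_id: str) -> dict:
    """Issue short-lived, per-agent credentials instead of one shared key.

    Each agent assumes its own role; the session name ties every CloudTrail
    event back to a specific agent and session.
    """
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=f"arn:aws:iam::123456789012:role/{agent_id}",
        RoleSessionName=f"{agent_id}-{session_id}",  # shows up in CloudTrail
        DurationSeconds=900,  # 15 minutes; expiry alone limits blast radius
    )
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken
```

Revoking one agent now means disabling one role. The other seven keep running.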
5. No Cost Controls
What it looks like: Your agent fleet runs well for two weeks. Then a bug in one agent causes a loop. By the time someone notices, it’s spent $340 on a task that should have cost $3. The monthly API invoice is $800 over budget.
Why it happens: No token budgets. Manual cost monitoring. No alert at 80% of expected spend.
Prevention:
- Set per-project daily and monthly token budgets with hard limits
- Configure alerts at 50%, 80%, and 100% of budget
- Route cost alerts to a monitored Slack channel
- Set per-session limits so a single runaway session can’t consume a week’s budget
The asymmetry here is important: a budget that’s too tight costs you an interrupted task. No budget costs you an unexpected invoice. Default to conservative limits and loosen them based on observed usage.
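Here's a sketch of the enforcement side, assuming you can meter cost per call; the dollar figures are placeholders, and `alert` stands in for whatever posts to your monitored Slack channel:

```python
DAILY_BUDGET_USD = 50.00
THRESHOLDS = (0.5, 0.8, 1.0)  # alert at 50%, 80%, and 100% of budget

class TokenBudget:
    """Tracks spend against a hard daily cap and fires threshold alerts."""

    def __init__(self, limit_usd, alert):
        self.limit = limit_usd
        self.spent = 0.0
        self.alert = alert     # callable that notifies a human
        self.fired = set()     # thresholds already announced

    def record(self, cost_usd):
        self.spent += cost_usd
        for t in THRESHOLDS:
            if self.spent >= t * self.limit and t not in self.fired:
                self.fired.add(t)
                self.alert(f"token budget at {int(t * 100)}% "
                           f"(${self.spent:.2f} of ${self.limit:.2f})")
        if self.spent >= self.limit:
            # Hard limit: stop sessions rather than keep spending.
            raise RuntimeError("daily budget exhausted; halting sessions")

budget = TokenBudget(DAILY_BUDGET_USD, alert=print)  # swap print for Slack
```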
6. Context Drift in Long Sessions
What it looks like: An agent starts a long task — migrate 500 records from old schema to new schema. Three hours in, it's made inconsistent decisions because the early context about what it was doing has drifted out of the active window. Some records are migrated correctly, some are left in the old schema, some fail with errors.
Why it happens: LLM context windows are finite. Long-running agents that rely on conversational context accumulate noise and lose early instructions.
Prevention:
- Design agents to work in discrete, short-context chunks rather than long single sessions
- Checkpoint progress explicitly (write state to a file or database)
- For batch operations, process in smaller batches with verification between them
- Use structured task definitions that get re-injected at each checkpoint rather than relying on conversation history
Long sessions are a yellow flag. An agent that needs to maintain context for hours is usually an agent whose task could be decomposed better.
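A sketch of the checkpointing pattern: every chunk starts a fresh session seeded with the same structured task definition, and progress lives on disk rather than in the context window. `migrate_batch` and `verify_batch` stand in for your own agent invocation and validation logic:

```python
import json
from pathlib import Path

TASK = {  # structured task definition, re-injected at every chunk
    "goal": "migrate records from schema v1 to schema v2",
    "batch_size": 25,
    "invariants": ["every output record validates against v2"],
}
CHECKPOINT = Path("migration_checkpoint.json")

def run_migration(records, migrate_batch, verify_batch):
    """Process in short-context chunks; persist progress between them."""
    done = json.loads(CHECKPOINT.read_text())["done"] if CHECKPOINT.exists() else 0
    size = TASK["batch_size"]
    while done < len(records):
        batch = records[done:done + size]
        result = migrate_batch(TASK, batch)  # fresh session, full task context
        verify_batch(result)                 # fail fast before moving on
        done += len(batch)
        CHECKPOINT.write_text(json.dumps({"done": done}))  # durable progress
```

If the process dies at record 300, the next run resumes from the checkpoint with the same instructions the first chunk had, not a three-hour-old paraphrase of them.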
7. No Kill Switch
What it looks like: Something is going wrong. An agent is behaving unexpectedly. You need to stop it. But the only way you can think of to stop it is to revoke the credentials it’s using — which also breaks 7 other agents that are running fine.
Why it happens: Kill switches only matter during incidents, and incident response usually gets designed after the first incident, not before it.
Prevention:
- Every agent session has a unique session ID
- The control plane exposes a “terminate session” endpoint
- You’ve tested this endpoint in a non-incident context and know it works
- You’ve also tested “pause all agents in project X” for larger incidents
The test: can you stop a specific running agent in under 30 seconds, without affecting anything else? If not, fix that before you go to production.
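From the caller's side, the drill is a single authenticated request against a terminate endpoint. The URL and route below are hypothetical, not Sentrely's actual API:

```python
import requests  # pip install requests

CONTROL_PLANE = "https://control-plane.example.com"  # placeholder URL

def kill_session(session_id: str, token: str) -> None:
    """Terminate one running agent session without touching the rest."""
    resp = requests.post(
        f"{CONTROL_PLANE}/sessions/{session_id}/terminate",
        headers={"Authorization": f"Bearer {token}"},
        timeout=5,  # a kill switch that hangs is not a kill switch
    )
    resp.raise_for_status()

# Drill this outside of incidents, e.g. as a weekly runbook check:
# kill_session("d97e2169", token=os.environ["CONTROL_PLANE_TOKEN"])
```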
The Common Thread
None of these failure modes are exotic. They’re the predictable consequences of deploying powerful automation without the operational controls that every other kind of production system requires.
The good news: all seven are preventable with a control plane that’s properly configured before you go live. Not after the first incident. Before.
The teams that run agents most confidently aren’t the ones with the most powerful agents. They’re the ones who invested in the boring infrastructure: policies, audit trails, budgets, session management, kill switches. The unsexy stuff that makes the exciting stuff safe to actually run.
Put this into practice with Sentrely
Everything covered in this article is built into Sentrely's managed control plane. Get early access and have it running against your Claude agents in minutes.