MCP Servers in Production: Architecture Patterns That Survive Real Workloads
We've built twenty-two MCP implementations. Here are the architecture patterns that survived — and the mistakes that didn't.
Anthropic's Model Context Protocol is the most significant shift in how we build AI systems since function calling. Not because of what it does — tools for LLMs aren't new — but because of what it enables: a standard interface that decouples the model from everything it interacts with.
We've been building MCP servers since the protocol went public. Production systems handling real data, real users, and real failure modes. Twenty-two implementations across fintech, healthcare, developer tools, and operational monitoring.
Here's what we've learned. Some of it aligns with the docs. Much of it doesn't.
The Promise vs. The Reality
The MCP pitch is clean: define tools, expose them via a standard transport, and any compatible client can use them. Your AI assistant gets superpowers — database access, API integrations, file management — through a unified protocol.
The reality is a bit messier.
MCP gives you a protocol. It doesn't give you an architecture. And the gap between "my tool works in Claude Desktop" and "my tool handles 50 concurrent users in a regulated environment" is exactly where AI adoption stalls.
Three categories of problems come up in every production deployment:
1. State management — MCP tools are stateless by design, but real workflows aren't
2. Error propagation — when a tool fails mid-chain, the model needs to understand why and recover
3. Security boundaries — the model has access to everything the MCP server can do, which is exactly the problem
Pattern 1: The Layered Server Architecture
Most MCP tutorials show a flat structure: one server, a list of tools, each tool does one thing. This works for demos. It collapses under production requirements.
We use a three-layer architecture:
Transport Layer
Handles connection lifecycle, authentication, and protocol negotiation. We support three transports depending on deployment context:
- STDIO — for local development and CLI tools. Zero network overhead. The model talks directly to the process.
- SSE (Server-Sent Events) — for web-based clients that need real-time streaming. Works through corporate proxies.
- Streamable HTTP — for microservice deployments where you need load balancing and horizontal scaling.
The critical insight: your tool logic should never know which transport it's running on. If your database query function checks `if transport === 'sse'`, you've coupled things that should be independent.
Service Layer
Business logic lives here. Each service is a self-contained module with its own dependencies, configuration, and error handling. Services don't know about MCP — they're plain functions that take parameters and return results.
This separation is not optional. When (not if) you need to expose the same logic through a REST API, a scheduled job, or a different AI protocol, you'll be grateful the business logic isn't entangled with MCP's tool registration.
Tool Layer
Thin adapters that map MCP tool definitions to service functions. Input validation, parameter coercion, response formatting. Nothing else.
```
Pseudostructure — not production code
Transport (SSE/STDIO/HTTP)
└── Tool Layer (validation, MCP schema)
└── Service Layer (business logic)
└── Data Layer (DB, APIs, cache)
```
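The separation can be sketched in a few lines of Python. Names like `get_price_history` are hypothetical, not from any SDK; the point is that the service function has no idea MCP exists:

```python
# --- Service layer: a plain function, no MCP awareness ---
def get_price_history(ticker: str, days: int) -> dict:
    """Business logic; in production this would call the data layer."""
    return {"ticker": ticker, "days": days, "prices": []}

# --- Tool layer: a thin adapter mapping MCP-style input to the service ---
def price_history_tool(params: dict) -> dict:
    ticker = str(params["ticker"]).upper()        # parameter coercion
    days = min(int(params.get("days", 30)), 730)  # business limit: 2 years
    return get_price_history(ticker, days)        # response formatting elided
```

The adapter is the only place that knows about MCP's parameter shapes, which is what keeps each tool down to a few lines of glue.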
When we built a stock analysis MCP server with 39+ tools, this layering was the difference between maintainability and chaos. Each tool is 10-15 lines of glue. The real logic lives in services that we can test, profile, and deploy independently.
Pattern 2: Intelligent Caching (Because Models Are Expensive)
Here's something the MCP docs don't emphasize enough: every tool call costs tokens. The model sends a request, receives a response, and that response goes into the context window. If your tool returns 5KB of JSON for a stock analysis, that's ~1,500 tokens consumed. Multiply by a multi-step conversation, and you're burning through context window and API budget simultaneously.
Our caching strategy has three tiers:
Tier 1: Response Cache (Redis)
Tool responses cached by input hash. TTL varies by data type — stock prices get 15 minutes, company fundamentals get 24 hours, static reference data gets a week. Cache hit means no computation, no API calls, no latency.
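A minimal in-process stand-in for this tier: responses keyed by a hash of tool name plus canonicalized parameters, with per-data-type TTLs mirroring the values above. In production the store would be Redis; the dict here is just for illustration:

```python
import hashlib
import json
import time

TTLS = {"price": 15 * 60, "fundamentals": 24 * 3600, "reference": 7 * 24 * 3600}
_cache: dict = {}  # key -> (expiry_timestamp, value); Redis in production

def cache_key(tool: str, params: dict) -> str:
    # Canonicalize params (sorted keys) so equivalent inputs hash identically
    blob = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_call(tool: str, params: dict, data_type: str, compute):
    key = cache_key(tool, params)
    hit = _cache.get(key)
    if hit and time.time() < hit[0]:
        return hit[1]  # cache hit: no computation, no API call
    value = compute()
    _cache[key] = (time.time() + TTLS[data_type], value)
    return value
```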
Tier 2: Computation Cache (In-Memory)
Intermediate results that are expensive to compute but don't need persistence. Technical indicators, statistical calculations, correlation matrices. These live in a process-local LRU cache and survive across tool calls within a session.
Tier 3: Semantic Cache
This is the non-obvious one. If a user asks "How is AAPL doing?" and five minutes later asks "What about Apple stock?" — those are semantically identical queries that produce identical tool calls. We hash the normalized query intent, not just the raw parameters.
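A sketch of the normalization step, assuming a hypothetical alias table that maps entity mentions to canonical tickers before hashing:

```python
import hashlib

ALIASES = {"apple": "AAPL", "apple stock": "AAPL"}  # illustrative alias table

def intent_key(entity: str, intent: str) -> str:
    """Hash the normalized intent, not the raw query text."""
    ticker = ALIASES.get(entity.strip().lower(), entity.strip().upper())
    return hashlib.sha256(f"{intent}:{ticker}".encode()).hexdigest()
```

"How is AAPL doing?" and "What about Apple stock?" both resolve to the same `(intent, ticker)` pair, so they share a cache entry.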
The result: in a typical analysis session, 40-60% of tool calls hit cache. That's real money saved and real latency eliminated.
Pattern 3: Error Recovery That Actually Works
The default MCP error behavior is: tool throws an exception, model receives an error message, model decides what to do. In practice, the model usually apologizes and gives up. Or worse — hallucinates the answer, or retries the same failing call in a loop.
We implement structured error responses that give the model actionable information:
```json
{
"error": true,
"error_type": "rate_limit",
"retry_after_seconds": 30,
"fallback_available": true,
"fallback_tool": "get_cached_price_data",
"partial_data": { "last_cached": "2026-03-10T14:30:00Z" }
}
```
The model doesn't need to guess. It knows: this failed because of a rate limit, it can retry in 30 seconds, or it can call a fallback tool that returns slightly stale data. The user gets an answer, not an apology.
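A small helper makes this the path of least resistance: tools return the envelope above instead of raising. The field names match the JSON example; the function itself is illustrative:

```python
def rate_limit_error(retry_after: int, fallback_tool: str = None,
                     partial_data: dict = None) -> dict:
    """Build a structured rate-limit error instead of raising an exception."""
    resp = {
        "error": True,
        "error_type": "rate_limit",
        "retry_after_seconds": retry_after,
        "fallback_available": fallback_tool is not None,
    }
    if fallback_tool:
        resp["fallback_tool"] = fallback_tool
    if partial_data:
        resp["partial_data"] = partial_data
    return resp
```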
Three error patterns we've standardized:
1. Graceful degradation — when the primary data source fails, fall back to cached data with a staleness indicator
2. Circuit breakers — after N consecutive failures, stop calling the failing service entirely and switch to alternatives
3. Partial results — if 8 out of 10 API calls succeed, return the 8 that worked with a clear indication of what's missing
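The circuit breaker in particular is only a few lines. A minimal sketch, with a consecutive-failure threshold and a fallback path (a production version would also add a cool-down timer before retrying the primary):

```python
class CircuitBreaker:
    """After `threshold` consecutive failures, stop calling the service."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, primary, fallback):
        if self.open:
            return fallback()      # skip the failing service entirely
        try:
            result = primary()
            self.failures = 0      # any success resets the counter
            return result
        except Exception:
            self.failures += 1
            return fallback()
```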
Verifying That Tool Calls Actually Happen
Before deploying any MCP integration, verify that the model is actually calling your tools — not hallucinating responses. This sounds obvious. It isn't.
Models are remarkably good at producing plausible-looking tool output from parametric knowledge. If your MCP server goes down silently, the model might continue the conversation with fabricated data, and neither you nor the user will notice until the numbers don't add up.
Our approach: canary files. We maintain test resources with deliberately absurd content — a file named "Saint Anthony" that contains detailed technical documentation about automated cow milking systems. If you ask the model about Saint Anthony and it returns historical or religious content, you know it's generating from training data. If it returns dairy farming specifications, the MCP connection is live. The same principle generalizes: simple benchmark tasks with unambiguous, verifiable answers, run before every deployment.
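Mechanically, the canary check reduces to a string test on the model's answer. The marker phrase here is illustrative:

```python
CANARY_MARKER = "automated cow milking"  # phrase unique to the canary resource

def canary_ok(model_response: str) -> bool:
    """True if the response could only have come from the live MCP server."""
    return CANARY_MARKER in model_response.lower()
```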
Pattern 4: Security as Architecture, Not Afterthought
MCP has a fundamental security challenge: the model decides which tools to call. If your MCP server exposes a `delete_all_records` tool, the model can call it. Prompt injection, confused context, or simple misunderstanding — all paths lead to the same outcome.
Our security architecture has three layers:
Tool-Level Permissions
Not every tool should be available in every context. A read-only analysis session shouldn't see write tools. An unauthenticated user shouldn't see admin tools. We filter the tool list dynamically based on session context.
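A sketch of dynamic filtering, assuming each tool declares the scopes it requires (the registry and scope names are illustrative):

```python
TOOL_PERMISSIONS = {
    "get_price_history": {"read"},
    "update_watchlist": {"read", "write"},
    "delete_all_records": {"admin", "write"},
}

def visible_tools(session_scopes: set) -> list:
    """Expose only tools whose required scopes the session actually holds."""
    return [name for name, required in TOOL_PERMISSIONS.items()
            if required <= session_scopes]
```

The tool list the model sees is rebuilt per session, so a read-only session never even learns that write tools exist.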
Parameter Validation
Beyond schema validation (which MCP supports natively), we enforce business rules. A date range can't exceed 2 years. A stock ticker must exist in our universe. A query can't contain SQL injection patterns. The model is creative — your validation should be paranoid.
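A sketch of that paranoia, with an illustrative ticker universe and the two-year range rule; schema validation has already run by the time this is called:

```python
from datetime import date

KNOWN_TICKERS = {"AAPL", "MSFT", "GOOG"}  # illustrative universe

def validate_query(ticker: str, start: date, end: date) -> None:
    """Business-rule checks that sit behind MCP's schema validation."""
    if ticker not in KNOWN_TICKERS:
        raise ValueError(f"unknown ticker: {ticker}")
    if end < start:
        raise ValueError("end date precedes start date")
    if (end - start).days > 730:  # business rule: max 2-year range
        raise ValueError("date range exceeds 2 years")
```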
Audit Logging
Every tool call is logged: who, what, when, with what parameters, and what was returned. Not for compliance theater — for debugging. When a client asks "why did the AI say X?" you need to reconstruct the exact chain of tool calls that produced that response.
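In its simplest form, the audit trail is one structured line per tool call, for example:

```python
import json
import time

def audit_record(session_id: str, tool: str, params: dict,
                 result_summary: str) -> str:
    """One JSON line per tool call: who, what, when, with which parameters."""
    return json.dumps({
        "ts": time.time(),
        "session": session_id,
        "tool": tool,
        "params": params,
        "result": result_summary,
    }, sort_keys=True)
```

Reconstructing "why did the AI say X?" then becomes a grep over the session ID.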
For regulated industries — healthcare, finance, legal — this audit trail isn't optional. It's the difference between "we use AI" and "we use AI that we can explain to a regulator."
Pattern 5: Resource-Aware Tool Design
MCP supports both tools (actions the model can take) and resources (data the model can read). Most implementations ignore resources entirely and put everything in tools. This is a mistake.
Resources are for static or slowly-changing reference data: configuration, schemas, documentation, system status. The model reads them once and they stay in context.
Tools are for dynamic operations: queries, calculations, mutations. Each call costs tokens and latency.
The distinction matters for cost. A tool that returns your database schema every time the model needs to construct a query is wasting tokens — that schema changes once a quarter. Make it a resource. A function that runs the actual query against live data — that's a tool.
We typically split 70/30: 70% tools, 30% resources. The resources handle everything the model needs to "know" about the system. The tools handle everything the model needs to "do."
When MCP Makes Sense (and When It Doesn't)
After twenty-two implementations, here's our honest assessment:
MCP is the right choice when:
- You're building for multiple AI clients (Claude, GPT, local models) and don't want vendor lock-in
- Your system has complex, multi-step workflows that benefit from the model's reasoning about tool orchestration
- You need standardized audit trails across all AI-data interactions
- Your team already maintains APIs and wants to expose them to AI without rewriting
- You want architectural discipline — even with a single model provider, MCP enforces separation of concerns between your AI layer and business logic that pays off as the system grows
MCP is overkill when:
- Your "tools" are simple CRUD operations — a REST API with good documentation works fine
- Your latency budget is under 100ms — MCP's protocol overhead adds 20-50ms per call
- You don't need the model to decide which tool to call — hard-coded pipelines are simpler and more predictable
- Your task requires rigid formatting, style enforcement, or deterministic behavior — skills and system prompts handle this better than tool-calling
MCP is wrong when:
- You're trying to replace a well-functioning RAG pipeline — MCP and RAG solve different problems
- You want real-time streaming of large datasets — MCP is request-response, not a streaming protocol
- Security requirements demand zero model autonomy — if a human must approve every action, the model's tool-calling ability adds risk without value
What's Coming
MCP is evolving fast. The protocol is gaining OAuth support, better streaming primitives, and server-to-server composition. We're watching three developments:
1. MCP server composition — servers that call other servers, creating tool hierarchies. This enables complex multi-system workflows without exposing everything to the model.
2. Elicitation — the ability for the server to ask the model clarifying questions mid-tool-execution. Currently, tools are fire-and-forget. Elicitation turns them into conversations.
3. Remote MCP — servers hosted as cloud services rather than local processes. This changes the deployment model entirely — from "run this on your machine" to "subscribe to this capability."
We're already building against some of these. If you're evaluating MCP for enterprise use, talk to us before you commit to an architecture — the protocol is stable enough to build on, but some of the patterns that work today will need adjustment.
Conclusion
MCP isn't magic. It's plumbing. Very good plumbing — standardized, extensible, model-agnostic — but plumbing nonetheless. The value isn't in the protocol itself. It's in what you build on top of it.
The teams that succeed with MCP treat it as an architecture problem, not a tooling problem. They separate transport from logic, cache aggressively, handle errors as first-class citizens, and think about security before the first tool is registered.
The teams that struggle try to make everything a tool, skip caching because "it's just a prototype," and discover security concerns in production.
We've made both categories of mistakes. The patterns above are what survived.
Kenaz builds production MCP servers for enterprises in regulated industries. If you're evaluating MCP for your AI infrastructure, [book a consultation](/contact) — we'll tell you honestly whether it's the right fit.
