Daniel Cohen·
LLM gateway with a semantic cache and a cheap-model-first cascade that halved my token bill
Designs production-grade LLM integration with prompt management, fallback strategies, rate limiting, cost control, and observability for AI-powered applications.
LLM Integration Architecture Design
Act as an AI/ML Architect who has shipped LLM-powered features to millions of users. Design a production LLM integration architecture.
**Use Case**: {{use_case}} (chatbot, content generation, code assistant, document analysis, classification, summarization)
**LLM Provider Requirements**: {{llm_provider_requirements}} (OpenAI, Anthropic, local models, multi-provider strategy)
**Scale Targets**: {{scale_targets}} (requests per minute, concurrent users, token volume per day)
**Cost Constraints**: {{cost_constraints}} (max cost per 1K requests, monthly LLM budget, cost optimization priority)
Design the complete LLM integration:
1. **LLM Gateway Architecture** - Unified API gateway abstracting multiple providers, request routing logic, model selection strategy
2. **Multi-Provider Strategy** - Primary/fallback provider configuration, provider health checks, automatic failover, provider-agnostic interface
3. **Prompt Management** - Prompt versioning, A/B testing prompts, prompt templates with variable injection, prompt registry
4. **Context Window Management** - Token counting, context truncation strategies, summarization for long context, sliding window approach
5. **Response Handling** - Streaming responses (SSE), response parsing, structured output (JSON mode/function calling), validation
6. **Caching Strategy** - Semantic cache (embedding-based similarity), exact match cache, cache invalidation, TTL per use case
7. **Rate Limiting & Throttling** - Token bucket per user/tier, queue-based backpressure, graceful degradation under load
8. **Cost Optimization** - Model cascading (cheap model first, expensive if needed), prompt compression, batching, fine-tuning for high-volume use cases
9. **Observability** - Token usage tracking per request/user, latency percentiles, cost per request tracking, prompt/response logging
10. **Safety & Guardrails** - Input validation, output filtering (PII, toxicity, jailbreak detection), content moderation pipeline
11. **Retrieval-Augmented Generation** - Vector DB integration (Pinecone/Weaviate/pgvector), chunking strategy, relevance scoring, hybrid search
12. **Fine-tuning Pipeline** - When to fine-tune vs few-shot, training data collection, evaluation framework, deployment of custom models
13. **Testing Strategy** - Unit tests for prompts, LLM evaluation framework (eval sets), regression testing, human evaluation pipeline
Include the gateway pseudocode, prompt template structure, and rate limiting algorithm.
Results
Production LLM integration for a **support chatbot** (Anthropic primary, OpenAI fallback, ~600 req/min, strict cost ceiling). A gateway abstracts providers; semantic caching and model cascading carry the cost story.
**Gateway (pseudocode)**
```typescript
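async function complete(req: LlmRequest): Promise<LlmResponse> {
  // Semantic cache: reuse a stored answer when embedding similarity >= 0.95.
  const hit = await semanticCache.lookup(req.prompt, 0.95);
  if (hit) return hit; // ~40% of support Qs repeat

  // Automatic failover: try providers in priority order.
  for (const provider of [anthropic, openai]) {
    if (!provider.healthy()) continue;
    let res: LlmResponse;
    try {
      res = await withTimeout(provider.call(req), 30_000); // 30 s cap per call
    } catch (e) {
      if (!isRetryable(e)) throw e; // non-retryable errors surface immediately
      continue; // retryable error: fall through to the next provider
    }
    // Cache write is best-effort and sits outside the try, so a cache failure
    // neither fails the request nor triggers failover.
    await semanticCache.store(req.prompt, res, "1h").catch(() => {});
    track(provider.name, res.usage); // tokens + cost per request
    return res;
  }
  throw new AllProvidersDownError();
}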
```
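The `semanticCache` used above is not defined in the post. As a rough sketch of the idea only, the in-memory version below matches on cosine similarity with a linear scan. It assumes the caller has already turned the prompt into an embedding vector (the embedding call is provider-specific and left out), and every name in it (`VectorCache`, `Entry`, `cosineSimilarity`) is hypothetical.

```typescript
// Hypothetical in-memory semantic cache; a real deployment would likely use a vector index.
interface Entry<T> {
  vector: number[];
  value: T;
  expiresAtMs: number; // epoch milliseconds
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class VectorCache<T> {
  private entries: Entry<T>[] = [];
  private readonly now: () => number;

  // The clock is injectable so TTL behaviour can be tested deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  store(vector: number[], value: T, ttlMs: number): void {
    this.entries.push({ vector, value, expiresAtMs: this.now() + ttlMs });
  }

  // Returns the most similar unexpired entry at or above `threshold`, else undefined.
  lookup(query: number[], threshold: number): T | undefined {
    const t = this.now();
    this.entries = this.entries.filter((e) => e.expiresAtMs > t); // evict expired entries
    let best: Entry<T> | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(query, entry.vector);
      if (score >= threshold && score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.value;
  }
}
```

A threshold as high as 0.95 trades hit rate for safety: paraphrases of the same question tend to land above it, while merely related questions should not, though the right value depends on the embedding model and is worth checking against real traffic.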
**Cost optimization — model cascade:** a cheap small model answers first; only low-confidence or tool-requiring turns escalate to the frontier model. This alone cuts spend roughly in half on a support workload.
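The cascade can be sketched as a small routing function. This is illustrative only: the `Turn`, `Draft`, and `Model` types, the `needsTools` flag being known up front, and the 0.8 confidence threshold are all assumptions, and the post does not say how confidence is estimated (a self-rating, log-probabilities, or a separate classifier would all fit here).

```typescript
// Hypothetical types; none of these names come from the post.
interface Turn {
  prompt: string;
  needsTools: boolean; // assumed to be known before any model is called
}

interface Draft {
  text: string;
  confidence: number; // 0..1, from whatever estimator you trust
}

type Model = (prompt: string) => Promise<Draft>;

interface CascadeResult {
  text: string;
  servedBy: "cheap" | "frontier";
}

async function cascade(
  turn: Turn,
  cheap: Model,
  frontier: Model,
  minConfidence = 0.8, // illustrative threshold; tune it on an eval set
): Promise<CascadeResult> {
  // Tool-requiring turns skip the cheap model entirely.
  if (!turn.needsTools) {
    const draft = await cheap(turn.prompt);
    if (draft.confidence >= minConfidence) {
      return { text: draft.text, servedBy: "cheap" };
    }
  }
  // Low confidence or tools needed: escalate to the frontier model.
  const answer = await frontier(turn.prompt);
  return { text: answer.text, servedBy: "frontier" };
}
```

Note that an escalated turn pays for both models, so the saving depends on how often the cheap model's answer is accepted; logging `servedBy` per request makes that ratio visible.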
**Rate limiting** is a token bucket per user tier, with queue-based backpressure under load.
**RAG:** queries are embedded and run through a hybrid search over Postgres (pgvector for vector similarity plus BM25 keyword ranking); the top-k chunks are injected under a token budget so the context never overflows.
**Guardrails:** input jailbreak detection and output PII/toxicity filtering wrap every call.
**Observability:** per-request token usage, latency percentiles, and cost are logged and attributed per user for chargeback.
**Prompt management:** prompts are versioned in a registry, so an A/B test is a config change, not a deploy.
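The original prompt asks for the rate limiting algorithm, which the per-tier token bucket described above could look like in outline. This is a sketch under assumptions: the tier names, capacities, and refill rates are made-up values, `allow` and `TokenBucket` are hypothetical names, and the queue that provides backpressure is not shown.

```typescript
// Illustrative tiers; capacities and refill rates are assumptions, not from the post.
const TIERS = {
  free: { capacity: 5, refillPerSec: 0.5 },
  pro: { capacity: 60, refillPerSec: 5 },
} as const;

type Tier = keyof typeof TIERS;

class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity; // start full so bursts up to `capacity` are allowed
    this.lastRefillMs = now();
  }

  // Takes `cost` tokens if available; false means the caller should queue or reject.
  tryTake(cost = 1): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefillMs) / 1000;
    // Refill lazily based on elapsed time, capped at capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = t;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

const buckets = new Map<string, TokenBucket>();

// One bucket per user, sized by the user's tier when first seen.
function allow(userId: string, tier: Tier, now: () => number = Date.now): boolean {
  let bucket = buckets.get(userId);
  if (!bucket) {
    const { capacity, refillPerSec } = TIERS[tier];
    bucket = new TokenBucket(capacity, refillPerSec, now);
    buckets.set(userId, bucket);
  }
  return bucket.tryTake();
}
```

Passing a `cost` proportional to the estimated token count of a request, rather than 1 per request, would make the same bucket limit token volume instead of request count.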
Model: Claude Opus 4
1 comment
Tobias Keller·
Been looking for a solid system design prompt for ages, this is it.