Who wrote this post?

This post was published by Daniel Cohen on prompt2love.

Daniel Cohen·28.5.2026

LLM gateway with a semantic cache and a cheap-model-first cascade that halved my token bill

Designs production-grade LLM integration with prompt management, fallback strategies, rate limiting, cost control, and observability for AI-powered applications.

LLM Integration Architecture Design

Act as an AI/ML Architect who has shipped LLM-powered features to millions of users. Design a production LLM integration architecture.

- **Use Case**: {{use_case}} (chatbot, content generation, code assistant, document analysis, classification, summarization)
- **LLM Provider Requirements**: {{llm_provider_requirements}} (OpenAI, Anthropic, local models, multi-provider strategy)
- **Scale Targets**: {{scale_targets}} (requests per minute, concurrent users, token volume per day)
- **Cost Constraints**: {{cost_constraints}} (max cost per 1K requests, monthly LLM budget, cost optimization priority)

Design the complete LLM integration:

1. **LLM Gateway Architecture** - Unified API gateway abstracting multiple providers, request routing logic, model selection strategy
2. **Multi-Provider Strategy** - Primary/fallback provider configuration, provider health checks, automatic failover, provider-agnostic interface
3. **Prompt Management** - Prompt versioning, A/B testing prompts, prompt templates with variable injection, prompt registry
4. **Context Window Management** - Token counting, context truncation strategies, summarization for long context, sliding window approach
5. **Response Handling** - Streaming responses (SSE), response parsing, structured output (JSON mode/function calling), validation
6. **Caching Strategy** - Semantic cache (embedding-based similarity), exact match cache, cache invalidation, TTL per use case
7. **Rate Limiting & Throttling** - Token bucket per user/tier, queue-based backpressure, graceful degradation under load
8. **Cost Optimization** - Model cascading (cheap model first, expensive if needed), prompt compression, batching, fine-tuning for high-volume use cases
9. **Observability** - Token usage tracking per request/user, latency percentiles, cost per request tracking, prompt/response logging
10. **Safety & Guardrails** - Input validation, output filtering (PII, toxicity, jailbreak detection), content moderation pipeline
11. **Retrieval-Augmented Generation** - Vector DB integration (Pinecone/Weaviate/pgvector), chunking strategy, relevance scoring, hybrid search
12. **Fine-tuning Pipeline** - When to fine-tune vs few-shot, training data collection, evaluation framework, deployment of custom models
13. **Testing Strategy** - Unit tests for prompts, LLM evaluation framework (eval sets), regression testing, human evaluation pipeline

Include the gateway pseudocode, prompt template structure, and rate limiting algorithm.
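The prompt relies on `{{variable}}` placeholders, which is also the "prompt templates with variable injection" idea it asks for. A minimal sketch of filling such placeholders might look like the following; `renderPrompt` and `TemplateVars` are hypothetical names for illustration, not part of prompt2love or any library, and failing on missing variables is an assumed design choice.

```typescript
// Hypothetical helper: fills {{name}} placeholders in a prompt template.
type TemplateVars = Record<string, string>;

function renderPrompt(template: string, vars: TemplateVars): string {
  const missing: string[] = [];
  const output = template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (match: string, name: string): string => {
      const value = vars[name];
      if (value === undefined) {
        missing.push(name); // remember it so the error can list every gap
        return match;       // leave the placeholder untouched for now
      }
      return value;
    },
  );
  // Assumed policy: refuse to send a prompt that still contains placeholders.
  if (missing.length > 0) {
    throw new Error(`Missing template variables: ${missing.join(", ")}`);
  }
  return output;
}

const examplePrompt = renderPrompt(
  "**Use Case**: {{use_case}} **Scale Targets**: {{scale_targets}}",
  { use_case: "support chatbot", scale_targets: "600 req/min" },
);
```

Throwing on a missing variable, rather than silently sending a literal `{{cost_constraints}}` to the model, keeps template bugs out of production traffic.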

Results

Production LLM integration for a **support chatbot** (Anthropic primary, OpenAI fallback, ~600 req/min, strict cost ceiling). A gateway abstracts providers; semantic caching and model cascading carry the cost story.

**Gateway (pseudocode)**

```typescript
async function complete(req: LlmRequest): Promise<LlmResponse> {
  const hit = await semanticCache.lookup(req.prompt, 0.95); // embedding similarity
  if (hit) return hit; // ~40% of support Qs repeat

  for (const provider of [anthropic, openai]) { // automatic failover
    if (!provider.healthy()) continue;
    try {
      const res = await withTimeout(provider.call(req), 30_000);
      await semanticCache.store(req.prompt, res, "1h");
      track(provider.name, res.usage); // tokens + cost per request
      return res;
    } catch (e) {
      if (!isRetryable(e)) throw e;
    }
  }
  throw new AllProvidersDownError();
}
```

**Cost optimization (model cascade):** a cheap small model answers first; only low-confidence or tool-requiring turns escalate to the frontier model. This alone cuts spend roughly in half on a support workload.

**Rate limiting** is a token bucket per user tier with queue-based backpressure under load.

**RAG:** queries embed → pgvector hybrid search (vector + BM25) → top-k chunks injected with a token budget so context never overflows.

**Guardrails:** input jailbreak detection and output PII/toxicity filtering wrap every call.

**Observability:** per-request token usage, latency percentiles, and cost are logged and attributed per user for chargeback. Prompts are versioned in a registry so an A/B test is a config change, not a deploy.
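The model cascade described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Draft` shape, the idea that the application can attach a `confidence` score and a `needsTool` flag to a response, the `ModelFn` signature, and the 0.7 threshold are not taken from the post.

```typescript
// Illustrative types; none of these names come from the original post.
interface Draft {
  text: string;
  confidence: number; // 0..1, however the application estimates it (assumption)
  needsTool: boolean; // true when the turn requires a tool call
}

type ModelFn = (prompt: string) => Promise<Draft>;

interface CascadeResult {
  draft: Draft;
  tier: "cheap" | "frontier";
}

// Escalate when the cheap draft needs a tool or falls below the confidence bar.
function shouldEscalate(draft: Draft, minConfidence: number): boolean {
  return draft.needsTool || draft.confidence < minConfidence;
}

async function cascade(
  prompt: string,
  cheap: ModelFn,
  frontier: ModelFn,
  minConfidence = 0.7, // assumed threshold; tune it against an eval set
): Promise<CascadeResult> {
  const first = await cheap(prompt);
  if (!shouldEscalate(first, minConfidence)) {
    return { draft: first, tier: "cheap" };
  }
  // An escalated turn pays for both calls, so savings depend on how
  // often the cheap tier is allowed to answer on its own.
  const second = await frontier(prompt);
  return { draft: second, tier: "frontier" };
}
```

Returning the `tier` alongside the draft makes it easy to log the escalation rate, which is the number that determines whether the cascade actually saves money.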
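The prompt asks for a rate limiting algorithm, and the result names a token bucket per user tier without showing one. A minimal sketch, under stated assumptions: the `TierLimit` fields, the `allow` helper, and the in-memory `Map` are hypothetical, the post gives no concrete limits, and the clock is passed in as `nowMs` so the logic can be tested without waiting. Queue-based backpressure is left out; a `false` result is where a caller would queue or reject.

```typescript
// Hypothetical tier limits; the post does not state concrete numbers.
interface TierLimit {
  capacity: number;        // maximum burst size, in requests
  refillPerSecond: number; // sustained request rate
}

class TokenBucket {
  private readonly limit: TierLimit;
  private tokens: number;
  private lastRefillMs: number;

  constructor(limit: TierLimit, nowMs: number) {
    this.limit = limit;
    this.tokens = limit.capacity; // a new bucket starts full
    this.lastRefillMs = nowMs;
  }

  // Consumes `cost` tokens and returns true if allowed; returns false
  // (consuming nothing) when the caller should queue or reject.
  tryConsume(nowMs: number, cost = 1): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.limit.capacity,
      this.tokens + elapsedSeconds * this.limit.refillPerSecond,
    );
    // Never move the refill timestamp backwards if the clock jumps.
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// One bucket per user, created lazily with that user's tier limit.
const buckets = new Map<string, TokenBucket>();

function allow(userId: string, tier: TierLimit, nowMs: number): boolean {
  let bucket = buckets.get(userId);
  if (!bucket) {
    bucket = new TokenBucket(tier, nowMs);
    buckets.set(userId, bucket);
  }
  return bucket.tryConsume(nowMs);
}
```

In a multi-instance gateway the bucket state would need to live in shared storage rather than a process-local `Map`; that part is outside this sketch.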

Model: Claude Opus 4


1 comment

Tobias Keller·28.5.2026

Been looking for a solid system design prompt for ages, this is it.
