Priya Nair
OpenSearch mapping and a query with filters in filter-context - BM25 tuned for commerce out of the box
Designs a scalable search infrastructure with index sharding, query optimization, relevance tuning, and multi-language support.
Search System Architecture (Elasticsearch/OpenSearch)
Act as a Search Engineer who built search at scale for Amazon and LinkedIn. Design a production search system architecture.
**Search Domain**: {{search_domain}} (e-commerce products, job listings, content/articles, documents, people)
**Index Size**: {{index_size}} (number of documents, average document size, growth rate)
**Query Load**: {{query_load}} (queries per second, peak traffic, latency requirements P95/P99)
**Search Features Needed**: {{search_features}} (full-text, faceting, autocomplete, typo tolerance, geo-search, vector search/semantic)
Design the complete search architecture:
1. **Cluster Topology** - Node roles (master, data hot/warm/cold, coordinating, ingest, ML), node count calculation, shard sizing (20-50GB per shard)
2. **Index Strategy** - Index naming (alias-based rollover), index templates, mapping design, dynamic mapping policy
3. **Sharding & Replication** - Primary shard count formula, replica count for read scaling, rack/zone awareness, forced awareness
4. **Ingestion Pipeline** - Data flow from source to index, ingest node processors (lowercase, synonym, date parsing), bulk indexing optimization
5. **Query Architecture** - Bool query structure, filter context vs query context, function score for relevance, search templates
6. **Relevance Tuning** - BM25 parameters (k1, b), field boosting, function score (decay functions, random score), learning to rank plugin
7. **Autocomplete Design** - Edge n-grams, completion suggester, search-as-you-type field, prefix matching with ranking
8. **Faceted Search** - Aggregations structure (terms, range, date_histogram, nested), filter caching, cardinality optimization
9. **Vector/Semantic Search** - Dense vector field (kNN), embedding model integration, approximate nearest neighbors (HNSW), hybrid search
10. **Synonym & NLP** - Synonym filters (expansion and equivalence), word delimiters, stemmers, stop words, ICU analysis for multilingual
11. **Monitoring** - Query latency percentiles, indexing rate, JVM heap, circuit breaker trips, slow query log, shard allocation
12. **Backup & Recovery** - Snapshot repository (S3), snapshot lifecycle, restore procedures, cross-cluster replication
13. **Query Performance** - Request cache, fielddata vs doc_values, eager global ordinals, index sorting, force merge strategy
Include actual index mapping JSON, query DSL examples, and cluster settings.
Results
Search architecture for an **e-commerce catalog** (~20M products, ~2k QPS, full-text + facets + typo tolerance + autocomplete). OpenSearch, alias-based indices, BM25 tuned for commerce.
**Index mapping (the parts that matter):**
```json
{
"settings": { "number_of_shards": 6, "number_of_replicas": 1 },
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "english",
"fields": { "suggest": { "type": "search_as_you_type" } } },
"brand": { "type": "keyword" },
"price": { "type": "scaled_float", "scaling_factor": 100 },
"in_stock": { "type": "boolean" }
}
}
}
```
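The `title.suggest` subfield in the mapping above is only half of the autocomplete path; the query side is not shown. A minimal sketch of one common way to query a `search_as_you_type` field, using a `bool_prefix` `multi_match` over the field and its auto-generated `._2gram` / `._3gram` subfields. The helper name `autocomplete_query` and the default `size` are illustrative assumptions, not part of the original design.

```python
import json


def autocomplete_query(prefix: str, field: str = "title.suggest", size: int = 5) -> dict:
    """Build a search body for prefix-style autocomplete.

    Assumes `field` is mapped as `search_as_you_type`, which also creates
    the `._2gram` and `._3gram` shingle subfields queried here.
    """
    return {
        "size": size,  # hypothetical default; tune to the UI's suggestion list
        "query": {
            "multi_match": {
                "query": prefix,
                # bool_prefix: all terms are matched as terms except the last,
                # which is matched as a prefix.
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(autocomplete_query("running sh"), indent=2))
```

The resulting dict can be sent as the body of a `_search` request against the products alias.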
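**Query:** filters in filter context (cacheable, no scoring), relevance in query context. The recency/popularity boost is not shown in this snippet. Range values on a `scaled_float` field are given in the field's real units, so `"lte": 12000` means a price of 12,000, not 120.00 expressed in cents.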
```json
{ "query": { "bool": {
"must": [{ "multi_match": { "query": "running shoe", "fields": ["title^3", "brand"], "fuzziness": "AUTO" } }],
"filter": [{ "term": { "in_stock": true } }, { "range": { "price": { "lte": 12000 } } }]
}}}
```
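The query above hides out-of-stock items because `in_stock` sits in `filter`. A rough sketch of how the same query could be built with a switch between hiding and deboosting: when deboosting, the `in_stock` clause moves into `should`, where it only adds to the score of in-stock items. The function name `product_query` and the boost value `2.0` are assumptions for illustration.

```python
import json


def product_query(text: str, max_price: float,
                  hide_out_of_stock: bool = True,
                  in_stock_boost: float = 2.0) -> dict:
    """Build the bool query: scoring clauses in `must`, exact filters in `filter`."""
    filters = [{"range": {"price": {"lte": max_price}}}]
    bool_query = {
        "must": [{
            "multi_match": {
                "query": text,
                "fields": ["title^3", "brand"],
                "fuzziness": "AUTO",
            }
        }],
        "filter": filters,
    }
    if hide_out_of_stock:
        # Filter context: out-of-stock items never match.
        filters.append({"term": {"in_stock": True}})
    else:
        # Optional scoring clause: in-stock items rank higher, the rest still match.
        # The boost value is a hypothetical starting point, not a tuned number.
        bool_query["should"] = [
            {"term": {"in_stock": {"value": True, "boost": in_stock_boost}}}
        ]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    print(json.dumps(product_query("running shoe", 12000, hide_out_of_stock=False), indent=2))
```

Because a `must` clause is present, the `should` clause stays optional and affects ranking only.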
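**Sharding:** 6 primaries keep each shard in the healthy 20-50GB band, which assumes a total primary size of roughly 120-300GB; replicas scale reads.
**Autocomplete:** uses the `title.suggest` `search_as_you_type` subfield for sub-20ms prefix matches.
**Facets:** `terms`/`range` aggregations with filter caching.
**Relevance:** BM25 with `b=0.75` (the default), title boosted 3x. The intent is for out-of-stock items to be deboosted, not hidden; the query above filters on `in_stock: true`, which hides them, so the deboost variant needs that clause moved out of `filter` and into `should`.
**Ops:** nightly snapshots to S3, slow-query log threshold at 200ms, and an alert when JVM heap usage exceeds 85%, before circuit breakers start tripping.
**Ingestion:** index updates flow through a bulk pipeline (5k docs/batch) fed by CDC (change data capture).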
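The shard count and the 5k batch size can be sanity-checked with a little arithmetic. A minimal sketch, assuming a hypothetical total primary size of 180GB and a 30GB target per shard (the source gives neither number), plus a generic batching helper for feeding the bulk API. Both function names are illustrative.

```python
import math
from itertools import islice
from typing import Iterable, Iterator, List


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest shard count that keeps each shard at or under the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def batches(docs: Iterable[dict], batch_size: int = 5000) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` docs, e.g. one list per bulk request."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # 180GB is an assumed figure chosen only to illustrate the calculation.
    print(primary_shard_count(180))
    sizes = [len(b) for b in batches({"id": i} for i in range(12000))]
    print(sizes)
```

Primary shard count is fixed at index creation, so it is worth running this against the projected size after growth, not just the current one.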
Model: Claude Opus 4
4 Comments
Tobias Keller
Clean separation of concerns here, easy to drop into an existing service.
Luca Brunner
This is the best system design example I've seen on here.
Marco Rossi
Bookmarked. The migration steps are exactly the safe order I'd want.
Lena Fischer
The diff summary format is so much easier to scan than a wall of code.