Перейти к содержимому

System Design Cases (Middle 3 Deep)

Зачем знать на Middle 3: System design interview — must-pass для Senior+ ролей. От тебя ждут не “знать про CAP”, а уметь развернуть классические кейсы за 45-60 минут, спорить про trade-offs, обосновывать capacity. Ниже — 10 классических кейсов с deep dive (обычно интервьюер уходит вглубь одного-двух).

  1. URL Shortener (TinyURL)
  2. Distributed Cache (Redis-like)
  3. Notification System
  4. Rate Limiter
  5. Distributed Queue (Kafka-like)
  6. Search (Elasticsearch-like)
  7. Chat (WhatsApp-like)
  8. Feed (Twitter timeline)
  9. Payment System
  10. GPS Tracking / Logistics
  11. 30 общих вопросов System Design
  12. Источники

Functional:

  • POST /shorten body {url}{short}.
  • GET /{short} → 301 redirect.
  • Custom alias.
  • Optional TTL.
  • Analytics (clicks, geo, referer).

Non-functional:

  • 100M URLs/day → ~1200 writes/sec.
  • 100B total URLs за 100 дней retention.
  • Read:Write = 10:1 → ~12k reads/sec, peak ~25k.
  • p99 redirect < 50ms.
  • Availability 99.99%.
  • Storage: 100B * (long URL avg 500B + short 7B + metadata 100B) ≈ 60 TB.
  • QPS reads peak: 25k.
  • Bandwidth in: 600 MB/s, out: 30 MB/s.
POST /api/v1/shorten
body: {"long_url": "https://...", "alias": "promo", "ttl": 86400}
→ 201 {"short": "promo", "expires_at": "..."}
GET /:short
→ 301 Location: <long_url>

PG:

urls (
short VARCHAR(10) PRIMARY KEY,
long TEXT NOT NULL,
owner_id BIGINT,
created_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ
)

ClickHouse for analytics:

clicks (short, ts, ip, country, referer, user_agent) ENGINE=MergeTree
  • 7 символов base62 → 62^7 ≈ 3.5 * 10^12 (хватит на годы).
  • Варианты ID:
    1. Counter (auto-increment) + base62 encode. Risk: leaks order, requires single source of truth.
    2. Snowflake: timestamp(41) + worker(10) + seq(12) = 64 bit → base62.
    3. Random + check collision: redo on conflict (probability low at 62^7).
    4. Pre-generated batch: worker берёт 1000 ID из ID-service, отдаёт локально.

Choice: Snowflake-like (no DB round-trip), encode в base62.

+---------+
| CDN |
+----+----+
|
+----v----+
| LB |
+----+----+
|
+---------------+---------------+
| |
+----v-----+ +----v-----+
| Read API | | Write API|
+----+-----+ +----+-----+
| |
+----v-----+ +----v-----+
| Redis | <-- write-back --- | PG |
| cache | | (sharded)|
+----------+ +----+-----+
|
+----v-----+
| CDC |
+----+-----+
|
+----v-----+
| ClickHouse|
+----------+
  1. Request → CDN (cache hit → 301 return).
  2. CDN miss → LB → Read API.
  3. Read API → Redis (LRU, 10M hot keys, 90%+ hit).
  4. Cache miss → PG (read replica) → write back to cache.
  5. 301 response with Cache-Control: max-age=3600.
  1. Generate short via Snowflake.
  2. INSERT into PG primary.
  3. Write to Redis (write-through).
  4. Async event → Kafka → ClickHouse for indexing.
  • Write QPS: Snowflake distributed → no bottleneck.
  • Read on hot keys: CDN + Redis замыкают 95%.
  • Storage growth: shard PG by hash(short) % N. Add read replicas per shard.
  • Analytics: don’t aggregate on hot path; Kafka consumer batches into ClickHouse.
  • Random ID → no order leak, but no friendly enumeration.
  • 301 vs 302: 301 cacheable (good for traffic), but invalidation hard. Use 302 if URL может меняться.

  • Get/Set/Delete by key.
  • TTL.
  • LRU eviction.
  • Cluster (auto-sharding).
  • HA: replica per shard.
  • p99 < 1ms intra-DC.
client
|
[client-side hash] <- consistent hashing ring
|
shard 0 shard 1 ... shard N
| | | |
M R M R <- master + replica per shard
  • Каждый node — N virtual nodes (vnodes).
  • Key → hash → найти ближайший vnode по часовой.
  • Add/remove node перемещает только N/total keys.
type Ring struct {
nodes []uint64 // sorted hashes
nodeMap map[uint64]string
}
func (r *Ring) GetNode(key string) string {
h := xxhash.Sum64String(key)
idx := sort.Search(len(r.nodes), func(i int) bool { return r.nodes[i] >= h })
if idx == len(r.nodes) { idx = 0 }
return r.nodeMap[r.nodes[idx]]
}
  • Doubly-linked list + hashmap O(1).
  • Approximated LRU (Redis) — sample N keys, evict oldest. Cheaper.
  • Lazy expiration: проверка на read.
  • Active: periodic scan random keys, evict expired.
  • Master accepts writes.
  • Async push to replicas (low-latency).
  • Read from replicas (eventual consistency).
  • Failover: replica → master via sentinel/raft.
  • Hot key: один shard перегружен. Mitigation: hash key + suffix (key:shard1, key:shard2).
  • Network bandwidth: pipelining, compression.
  • Strong vs eventual: chose eventual за speed.
  • Memory cost: high (everything in RAM).

  • Multi-channel: email, SMS, push, in-app, webhook.
  • 1M notifications/sec peak.
  • Templates с переменными.
  • Rate limiting per user (max 5/day).
  • Provider abstraction.
  • Retry с backoff.
  • Idempotency.
+--------+ +--------+ +-----------+
| API |->| Kafka |->| Per-channel|
| (POST | | | | workers |
| /send) | +--------+ | |
+--------+ | email |
| sms |
| push |
+-----------+
|
provider (SendGrid, Twilio, FCM)
|
+-----v-----+
| Status DB |
| (delivery,|
| bounces) |
+-----------+
  • Client supplies Idempotency-Key.
  • API stores key + result for 24h (Redis).
  • Duplicate request → return cached.
  • Sliding window per user.
  • Redis: INCR user:123:notif:hour with EXPIRE.
  • Or token bucket (см. файл 45).
  • Primary: SendGrid.
  • Fallback: SES.
  • Health metrics per provider; circuit breaker switches.
  • Storage: Postgres / S3.
  • Render: Go text/template или Liquid.
  • Variables: {{.UserName}}, {{formatDate .ts}}.
  • Exponential backoff: 1s, 5s, 30s, 5m, 1h.
  • DLQ after 5 attempts.
  • Permanent errors (invalid email): no retry, mark as failed.
  • At-least-once delivery — accepted, but template idempotency mandatory.
  • Real-time vs batch — push/SMS real-time, digest emails batched hourly.

  • Per IP / user / API key.
  • Multiple tiers (free=100rpm, paid=10krpm).
  • Distributed (one limit across pods).
  • Low latency (<5ms).

Token bucket:

  • Bucket capacity C, refill R per second.
  • Request consumes 1 token.
  • Burst allowed up to C.

Leaky bucket:

  • Same as token but constant rate output. Smooths burst.

Sliding window log:

  • Store timestamps of each request.
  • Count в [now-window, now]. Точно, но память.

Sliding window counter:

  • 2 buckets (current + previous window).
  • Estimate = current + prev * (1 - elapsed_in_current/window).
  • Approximation, low memory.

Fixed window counter:

  • Counter per minute. Reset at boundary.
  • ⚠️ Boundary effect: 2x burst at minute change.
-- KEYS[1] = key, ARGV[1] = capacity, ARGV[2] = refill_rate, ARGV[3] = now, ARGV[4] = tokens_to_take
local data = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(data[1]) or tonumber(ARGV[1])
local ts = tonumber(data[2]) or tonumber(ARGV[3])
local elapsed = math.max(0, tonumber(ARGV[3]) - ts)
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))
local allowed = 0
if tokens >= tonumber(ARGV[4]) then
tokens = tokens - tonumber(ARGV[4])
allowed = 1
end
redis.call("HMSET", KEYS[1], "tokens", tokens, "ts", ARGV[3])
redis.call("EXPIRE", KEYS[1], 3600)
return {allowed, tokens}

Atomic via Lua → no race.

  • Headers: X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After.
  • Hot Redis key — sharding by hash(user).
  • Single Redis node — clustered.

  • High throughput (millions msg/s).
  • Durable.
  • Partitioned для parallel consume.
  • Ordered внутри partition.
  • At-least-once delivery.
producers ----> [broker] ---> consumers
|
partitions: p0, p1, p2, ...
each replicated x3
log on disk (append-only)
  • Topic = N partitions.
  • Producer выбирает partition: round-robin, hash(key), explicit.
  • Order guaranteed внутри partition.
  • Parallelism = N partitions.
  • Один leader per partition + followers.
  • ISR (In-Sync Replica set).
  • acks=all — leader ждёт ISR ack.
  • min.insync.replicas=2 — минимум для durability.
  • Append-only log на disk.
  • Segments по 1GB.
  • Retention by time (retention.ms=7d) или size.
  • Каждый consumer в группе — owner partition(s).
  • Coordinator distributes (sticky assignment).
  • Offset commit (per partition) — autocommit или manual.
  • Transactional producer + idempotent producer + read-committed isolation.
  • Sequence numbers per producer per partition.
  • vs RabbitMQ: Kafka — log (replay), Rabbit — queue (ack/del).
  • vs SQS: SQS managed, no order guarantee у standard, FIFO у FIFO type.

  • Full-text search across millions docs.
  • Faceting (filter by tags, dates).
  • Ranking (relevance).
  • p95 < 200ms.
  • Mapping: term → posting list (doc IDs, positions).
  • Tokenization, normalization (lowercase, stem).
  • N-gram for fuzzy.
  • By doc ID hash (each shard ~100M docs).
  • Each shard — independent Lucene index.
  • N replicas per shard for read scaling.
  1. Coordinator parses query.
  2. Fan-out to all shards (or specific).
  3. Each shard returns top-K.
  4. Coordinator merges, returns top-K global.
  • TF-IDF: term frequency * inverse doc frequency.
  • BM25: модификация TF-IDF (saturation, length normalization).
  • Custom: boost by signals (recency, popularity).
  • Real-time (Lucene segments + refresh interval 1s).
  • Bulk indexing для backfill.
  • Hot shard: skewed data. Mitigation: routing-aware sharding.
  • Deep pagination (page 1000): cost. Use search_after.

  • 1:1 + groups.
  • Real-time delivery.
  • Online presence.
  • Offline message storage.
  • Push for offline.
  • E2EE optional.
client <--ws--> gateway (sticky) <--> message bus (Kafka)
| |
v v
presence (Redis) storage (Cassandra)
|
v
push (FCM / APNS)
  • Sticky session (consistent hash by user_id).
  • Hold connection.
  • Send/receive frames.
  • Heartbeat (every 30s).
  • Partition: conversation_id.
  • Clustering: ts DESC.
  • Query: “last 50 messages in chat” — efficient.
CREATE TABLE messages (
conversation_id UUID,
ts TIMESTAMP,
message_id UUID,
sender_id UUID,
body BLOB,
PRIMARY KEY ((conversation_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
  • Redis SET online:users with TTL.
  • Pub/sub for friends notifications.
  • Fan-out на send (write amplification).
  • 1000-member group: 1 send → 1000 messages stored? Or one stored + fanout на read?
  • WhatsApp: fan-out на send (но cap group size).
  • Signal Protocol: X3DH + Double Ratchet.
  • Server видит ciphertext, не plaintext.
  • Forward secrecy: per-message key.
  • При offline (last_seen > Xs) → отправить FCM/APNS.

  • Followers timeline (chronological + ranked).
  • 500M users, avg 200 follows.
  • Latency timeline read < 200ms.

Fan-out on write (push):

  • При публикации: для каждого follower’а — INSERT в его timeline.
  • Pros: O(1) read.
  • Cons: O(F) write. Celebrity (50M followers) убьёт систему.

Fan-out on read (pull):

  • Timeline вычисляется при запросе: SELECT из feeds следуемых, sort.
  • Pros: O(1) write.
  • Cons: O(F) read.

Hybrid (Twitter):

  • Push для regular users.
  • Pull для celebrities.
  • Merge на read.
  • Redis Sorted Set per user (score=ts, member=tweet_id), capped at 1000.
  • Recency baseline.
  • Engagement signal (likes, replies, retweets).
  • ML model (LightGBM/transformer).
  • Hot accounts (Elon Musk).
  • Storage: 500M users * 1000 tweet IDs * 16B ≈ 8TB. Sharded Redis.

  • Strong consistency (no double-debit).
  • Idempotency.
  • Audit trail.
  • PCI-DSS.
  • 1k TPS peak.
client -> API -> Payments service -> Saga orchestrator
|
+---------------+------+---------+
v v v v
AuthZ Risk eng Balance Provider
(3DS) (Stripe)
  • Client supplies Idempotency-Key: <uuid>.
  • Server stores key + result for 24h.
  • Duplicate → cached response (no double-charge).

Two-phase commit:

  • Prepare phase on all participants.
  • Commit if all OK.
  • Blocking on coordinator failure.
  • Rarely используется в distributed system.

Saga:

  • Sequence of local transactions, each with compensating action.
  • E.g. reserve → charge → ship → ship_fail → refund.
Order placed
-> Reserve inventory
-> Charge payment
-> Ship
Compensations: refund, release_inventory.
  • Каждая транзакция = pair: debit on one account, credit on another.
  • Invariant: SUM(all entries) = 0.
  • Audit clear.
  • Append-only, immutable.
  • Кripted/signed.
  • Retain 7+ years (regulatory).
  • Daily job: compare internal balance vs provider report.
  • Discrepancy → manual review.
  • Don’t store PAN. Use tokenization (Stripe handles).
  • Network isolation (PCI scope).
  • Annual audit.

  • 1M vehicles, 1Hz updates → 1M points/sec.
  • Real-time dashboard.
  • Historical query (“vehicle X в pos за вчера”).
  • Geofencing alerts.
device -> MQTT broker -> Kafka -> ingestion svc -> ClickHouse / TimescaleDB
|
v
Redis (latest pos)
|
v
WebSocket gateway -> dashboards
  • TimescaleDB: PostgreSQL extension, SQL-friendly.
  • ClickHouse: columnar, fastest analytics.
  • InfluxDB: purpose-built TSDB.

Compare:

  • ClickHouse: лучший для batch analytics.
  • TimescaleDB: лучшее SQL + transactional.
  • InfluxDB: лучшее для metric-style.

Choose: ClickHouse для history, Redis для latest.

CREATE TABLE positions (
device_id UInt64,
ts DateTime64(3),
lat Float64,
lon Float64,
speed Float32,
heading UInt16
) ENGINE=MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (device_id, ts);
  • PostGIS: polygons + ST_Contains.
  • Or RTree in app memory for hot zones.
  • Event when device crosses boundary → alert.
  • Raw GPS noisy → snap to nearest road.
  • Algorithms: Hidden Markov Model on graph.
  • Vehicle Routing Problem (NP-hard).
  • Heuristics: Clarke-Wright savings, OR-Tools.
  • Ingestion: Kafka partition by device_id, parallel consumers.
  • Live updates: WebSocket fan-out to dashboards. Use Redis pub/sub.

  1. CAP теорема — когда P happens, что выбираешь и почему?
  2. ACID vs BASE — пример где где?
  3. Eventual consistency — как разрешать конфликты?
  4. Strong vs Weak consistency — пример из payment.
  5. Что такое quorum read/write (N=3, W=2, R=2)?
  6. Чем replication отличается от sharding?
  7. Master-master vs master-slave?
  8. Conflict-free Replicated Data Type (CRDT) — пример?
  9. Vector clock vs Lamport clock — разница?
  10. Что такое read repair / anti-entropy?
  11. Heartbeat и failure detection — phi accrual?
  12. Leader election — Raft высоко-уровнево.
  13. Consistent hashing — что решает?
  14. Bloom filter — где применять?
  15. CDN: edge vs origin, cache invalidation strategy.
  16. Reverse proxy: L4 vs L7?
  17. WebSocket vs SSE vs long-polling?
  18. gRPC vs REST vs GraphQL?
  19. Backpressure — паттерны?
  20. Circuit breaker состояния (closed/open/half-open).
  21. Bulkhead pattern.
  22. Saga vs 2PC.
  23. Outbox pattern — зачем?
  24. CDC — Debezium принцип.
  25. Event sourcing vs CRUD.
  26. CQRS — когда оправдано?
  27. Multi-region active-active vs active-passive.
  28. Latency budget: 200ms p95 — куда уходит время?
  29. Что такое observability triangle (logs, metrics, traces)?
  30. Capacity planning — формула для DB storage growth.

  1. “Designing Data-Intensive Applications” — Martin Kleppmann
  2. “System Design Interview” Vol 1, 2 — Alex Xu
  3. “Designing Distributed Systems” — Brendan Burns
  4. “Database Internals” — Alex Petrov
  5. ByteByteGo blog
  6. Highscalability.com
  7. Kafka: The Definitive Guide
  8. Cassandra: The Definitive Guide
  9. Stripe Engineering Blog
  10. Discord Engineering Blog
  11. Uber Engineering Blog
  12. Twitter Engineering Blog
  13. AWS Architecture Center
  14. Google SRE Workbook
  15. “Cloud Native Patterns” — Cornelia Davis
  16. “Microservice Patterns” — Chris Richardson
  17. “Building Microservices” — Sam Newman
  18. PostgreSQL docs
  19. Redis docs
  20. Elastic / OpenSearch docs