Resilience patterns: Circuit Breaker, Bulkhead, Retry, Timeout, Hedging

Why this matters at Middle 3: in microservices any downstream can slow down or fail. Without resilience patterns your service fails in a cascade: one hung backend → your goroutines pile up → OOM → the whole service is down. At the Senior level you know when a circuit breaker trips and how retry, timeout and CB work together; you implement idempotency keys the way Stripe does, use hedging against tail latency, and avoid over-retrying and killing the downstream through amplification.

Contents

Concept
Deeper / production practices
Gotchas
Real cases
Questions (25)
Practice
Sources

1. Concept

1.1 Why resilience

A microservice architecture is a network of interdependencies. A single point of failure propagates:

Service A → Service B → Service C
                                 ↓ fails
                              Service B retries → timeouts
                                 ↓ floods
                              Service A goroutines block → exhausted
                                 ↓
                              Service A goes down

This is a cascading failure. Resilience patterns prevent the cascade.

1.2 Pattern summary

Pattern	Protects against
Timeout	Waiting forever
Retry	Transient failures
Circuit Breaker	A persistently failing downstream
Bulkhead	One dependency sinking the whole service
Idempotency Key	Duplicate side effects
Hedging	Tail latency (one slow instance)
Fallback	Total failure where a degraded UX would do
Health checks	Routing traffic to unhealthy instances

All of these patterns combine. Robust client = timeout + retry + circuit breaker + idempotency.

2. Deeper / production practices

2.1 Circuit Breaker

Concept: detect a persistent downstream failure and temporarily stop calling it, giving the backend time to recover.

Three states:

    [Closed]  ── failure threshold reached ──→  [Open]
       ▲                                          │
       │ success in half-open                    │ timeout
       │                                          ▼
    [Half-Open] ←────────────────────────── (trial requests)

Closed: normal operation. Every request goes downstream. Failures are counted.
Open: the downstream is broken. All requests fail immediately without an attempt. This protects the downstream from amplification.
Half-Open: after a cool-down period, N trial requests are allowed. If they succeed, the breaker closes. If one fails, it goes back to open.

Trip threshold options:

Consecutive failures: trip after N failures in a row (e.g. 5).
Error rate over a window: trip if more than 50% of calls fail within a 10-second window.
Slow call rate: trip if more than 50% of calls take longer than 2 seconds.

In production, prefer error rate over consecutive failures: it is more robust to bursts.
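The three states and their transitions can be sketched as a small state machine. This is a minimal illustration under simplifying assumptions, not how any particular library does it: it trips on consecutive failures, lets every call through in half-open, and the names Breaker, NewBreaker, Call and ErrOpen are hypothetical.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// State is one of the three circuit breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is returned when the breaker rejects a call without trying it.
var ErrOpen = errors.New("circuit breaker is open")

// Breaker trips after maxFailures consecutive failures and allows trial
// calls again once cooldown has elapsed.
type Breaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// refresh moves Open → HalfOpen after the cooldown. Callers hold b.mu.
func (b *Breaker) refresh() {
	if b.state == Open && time.Since(b.openedAt) >= b.cooldown {
		b.state = HalfOpen
	}
}

// State reports the current state.
func (b *Breaker) State() State {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.refresh()
	return b.state
}

// Call runs fn unless the breaker is open, then records the outcome.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	b.refresh()
	if b.state == Open {
		b.mu.Unlock()
		return ErrOpen // fail fast: the downstream is not called
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = Closed, 0 // a success closes the breaker
		return nil
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.maxFailures {
		b.state, b.openedAt, b.failures = Open, time.Now(), 0 // trip or re-open
	}
	return err
}
```

A production breaker would also cap the number of concurrent trial calls in half-open and would normally use an error-rate window instead of a consecutive counter.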

2.2 Implementation: sony/gobreaker

import (
    "fmt"
    "log"
    "time"

    "github.com/sony/gobreaker"
)

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "downstream-api",
    MaxRequests: 5,                // number of trial requests allowed in half-open
    Interval:    30 * time.Second, // reset the counters every 30s while closed
    Timeout:     60 * time.Second, // open → half-open after 60s
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
        return counts.Requests >= 20 && failureRatio >= 0.5
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        log.Printf("CB %s: %s → %s", name, from, to)
    },
})

result, err := cb.Execute(func() (interface{}, error) {
    resp, err := httpClient.Get(url)
    if err != nil {
        return nil, err
    }
    // An HTTP 5xx arrives with err == nil, so report it as a failure explicitly.
    if resp.StatusCode >= 500 {
        resp.Body.Close()
        return nil, fmt.Errorf("downstream returned %d", resp.StatusCode)
    }
    return resp, nil
})

⚠️ MaxRequests in half-open is usually 1–5. If all 5 succeed, the breaker goes back to closed.

⚠️ Interval resets the counters in the closed state. Without it (Interval = 0) the counters are never cleared, so old failures keep counting forever.

⚠️ OnStateChange: export it to Prometheus and alert on the transition to Open.

2.3 Bulkhead

Concept: isolate resources per dependency so that one busy or slow dependency cannot consume all of them.

Named after a ship's bulkheads: a breach in one compartment does not sink the ship.

Implementations:

Goroutine semaphore:

type Bulkhead struct {
    sem chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
    return &Bulkhead{sem: make(chan struct{}, maxConcurrent)}
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
    select {
    case b.sem <- struct{}{}:
        defer func() { <-b.sem }()
        return fn()
    case <-ctx.Done():
        return ctx.Err()
    }
}

Separate connection pool per dependency:

clientA := &http.Client{Transport: poolA}  // pool 100 connections for A
clientB := &http.Client{Transport: poolB}  // pool 50 connections for B

Separate goroutine worker pool:

workerPoolA := pool.New(50)  // 50 goroutines for requests to A
workerPoolB := pool.New(20)

⚠️ Bulkhead protects against resource exhaustion, not against latency. Combine it with timeout + CB.
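The semaphore-based Execute shown earlier waits for a free slot until the context is done. When the caller would rather be rejected immediately, a fail-fast variant can use a non-blocking send. The sketch below is one way to do it; FailFastBulkhead, TryExecute and ErrBulkheadFull are hypothetical names.

```go
package main

import "errors"

// ErrBulkheadFull signals that all slots are taken.
var ErrBulkheadFull = errors.New("bulkhead full")

// FailFastBulkhead limits concurrency and rejects instead of queueing.
type FailFastBulkhead struct {
	sem chan struct{}
}

func NewFailFastBulkhead(maxConcurrent int) *FailFastBulkhead {
	return &FailFastBulkhead{sem: make(chan struct{}, maxConcurrent)}
}

// TryExecute runs fn if a slot is free and returns ErrBulkheadFull otherwise.
func (b *FailFastBulkhead) TryExecute(fn func() error) error {
	select {
	case b.sem <- struct{}{}: // slot acquired
		defer func() { <-b.sem }()
		return fn()
	default: // no free slot: reject without waiting
		return ErrBulkheadFull
	}
}
```

Rejecting at once keeps the caller's goroutine count bounded, at the price of more visible errors under bursts, so it pairs naturally with a fallback.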

2.4 Retry

When to retry:

Transient errors: network timeouts, 5xx, rate limits (honoring headers such as Retry-After).
Idempotent operations only (GET, or mutations with an idempotency key).

When NOT to retry:

4xx (client errors other than 429): bad request, not found, forbidden. A retry usually cannot fix them.
Non-idempotent operations without an idempotency key: duplicate side effects.

Exponential backoff with jitter:

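import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff/v4"
)

// GET is idempotent, so it is safe to retry.
operation := func() error {
    resp, err := httpClient.Get(url)
    if err != nil {
        return err // network error: retry
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 500 {
        return fmt.Errorf("server error: %d", resp.StatusCode) // 5xx: retry
    }
    return nil
}

b := backoff.NewExponentialBackOff()
b.InitialInterval = 100 * time.Millisecond
b.MaxInterval = 10 * time.Second
b.MaxElapsedTime = 30 * time.Second
b.RandomizationFactor = 0.5  // jitter

err := backoff.Retry(operation, b)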

Jitter: without it, thousands of clients retry at the same moment → thundering herd. RandomizationFactor = 0.5 randomizes each delay by ±50%.

Jitter algorithms:

Full jitter: delay = random(0, base * 2^n).
Equal jitter: delay = base * 2^n / 2 + random(0, base * 2^n / 2).
Decorrelated jitter: delay = min(cap, random(base, prev * 3)).

The AWS Architecture Blog recommends full jitter as the best trade-off.
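The three formulas translate almost directly into code. The sketch below is an illustration only: the function names are hypothetical, and a maxDelay cap is applied to all three variants (the formulas above show it only for decorrelated jitter), which is an assumption made here to keep delays bounded.

```go
package main

import (
	"math/rand"
	"time"
)

// expDelay returns min(maxDelay, base * 2^attempt).
func expDelay(base, maxDelay time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// FullJitter: delay = random(0, base * 2^n).
func FullJitter(base, maxDelay time.Duration, attempt int) time.Duration {
	d := expDelay(base, maxDelay, attempt)
	return time.Duration(rand.Int63n(int64(d) + 1))
}

// EqualJitter: delay = d/2 + random(0, d/2), where d = base * 2^n.
func EqualJitter(base, maxDelay time.Duration, attempt int) time.Duration {
	half := expDelay(base, maxDelay, attempt) / 2
	return half + time.Duration(rand.Int63n(int64(half)+1))
}

// DecorrelatedJitter: delay = min(maxDelay, random(base, prev * 3)).
// prev is the delay used for the previous attempt (start with base).
func DecorrelatedJitter(base, maxDelay, prev time.Duration) time.Duration {
	upper := prev * 3
	if upper < base {
		upper = base
	}
	d := base + time.Duration(rand.Int63n(int64(upper-base)+1))
	if d > maxDelay {
		d = maxDelay
	}
	return d
}
```

Full and equal jitter are stateless and depend only on the attempt number, while decorrelated jitter needs the previous delay to be carried between attempts.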

2.5 Retry budget

Problem: during a partial outage, every client makes 3 retries → load on the downstream ×4 → the outage gets worse.

Solution: retry budget — globally cap retries.

type RetryBudget struct {
    rate   float64  // e.g. 0.1 = at most 10% extra load from retries
    window time.Duration

    requests, retries atomic.Int64
}

func (rb *RetryBudget) Allow() bool {
    req := rb.requests.Load()
    ret := rb.retries.Load()
    if req == 0 { return true }
    return float64(ret) / float64(req) < rb.rate
}

Envoy, gRPC have built-in retry budgets.
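The RetryBudget struct above shows only the check. A fuller sketch also has to count first attempts, reserve retries, and forget old traffic. The version below uses a fixed window that resets its counters when it expires; this is a simplification of a true sliding window, and the type and method names are hypothetical.

```go
package main

import (
	"sync"
	"time"
)

// WindowedRetryBudget allows retries while retries <= ratio * requests
// within the current fixed window.
type WindowedRetryBudget struct {
	mu          sync.Mutex
	ratio       float64
	window      time.Duration
	windowStart time.Time
	requests    int
	retries     int
}

func NewWindowedRetryBudget(ratio float64, window time.Duration) *WindowedRetryBudget {
	return &WindowedRetryBudget{ratio: ratio, window: window, windowStart: time.Now()}
}

// rotate starts a new window once the current one has expired. Callers hold rb.mu.
func (rb *WindowedRetryBudget) rotate() {
	if now := time.Now(); now.Sub(rb.windowStart) >= rb.window {
		rb.windowStart, rb.requests, rb.retries = now, 0, 0
	}
}

// RecordRequest counts a first attempt (not a retry).
func (rb *WindowedRetryBudget) RecordRequest() {
	rb.mu.Lock()
	defer rb.mu.Unlock()
	rb.rotate()
	rb.requests++
}

// AllowRetry reserves one retry if the budget permits it.
func (rb *WindowedRetryBudget) AllowRetry() bool {
	rb.mu.Lock()
	defer rb.mu.Unlock()
	rb.rotate()
	if float64(rb.retries+1) > rb.ratio*float64(rb.requests) {
		return false // budget exhausted: surface the error instead of retrying
	}
	rb.retries++
	return true
}
```

A client would call RecordRequest before each first attempt and AllowRetry before each retry. Unlike the Allow method above, this sketch denies retries when no requests have been seen in the window.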

2.6 Timeout

Hierarchy:

Per-attempt timeout < Total request timeout < Client overall timeout.
Downstream timeout < Client timeout (cascade avoidance): timeouts shrink as a call moves down the chain.

If the client waits 30s and the downstream allows itself 60s, the downstream keeps working for nothing after the client has timed out.

ctx, cancel := context.WithTimeout(ctx, 5*time.Second)  // per-attempt
defer cancel()
req = req.WithContext(ctx)
resp, err := client.Do(req)

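Context propagation: every layer passes the context on, so the downstream sees the deadline. gRPC transmits the deadline over the wire; net/http aborts the in-flight request when the context expires, and on the server r.Context() is canceled when the client goes away.

⚠️ The backend must respect the context: ctx.Done() → stop work. Without that, the timeout does not help.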

2.7 Idempotency Key (Stripe-style)

Problem: POST /charge is resent → double charge.

Solution: the client includes a unique Idempotency-Key header. The server stores the result per key for a TTL (24h–7d).

POST /v1/charges
Idempotency-Key: abc-123-def
Content-Type: application/json

{"amount": 1000, "currency": "USD"}

Server logic:

func charge(w http.ResponseWriter, r *http.Request) {
    key := r.Header.Get("Idempotency-Key")
    if key == "" {
        http.Error(w, "missing Idempotency-Key", http.StatusBadRequest)
        return
    }

    // Fast path: already processed, replay the stored response.
    if cached, err := db.GetIdempotencyResult(key); err == nil && cached != nil {
        w.Write(cached.Body)
        return
    }

    // Lock per key (prevents concurrent duplicate processing).
    lock := db.AcquireLock(key)
    defer lock.Release()

    // Re-check inside the lock: another request may have finished meanwhile.
    if cached, err := db.GetIdempotencyResult(key); err == nil && cached != nil {
        w.Write(cached.Body)
        return
    }

    // Actually process.
    result := actuallyCharge(r)

    // Store the result for the TTL.
    db.SaveIdempotencyResult(key, result, 24*time.Hour)

    w.Write(marshal(result))
}

⚠️ Concurrent requests with the same key → the second waits for the first. Use a distributed lock via Redis or a PostgreSQL advisory lock.

⚠️ TTL: long enough to cover the client's retry window. Stripe: 24h.

⚠️ If the request body differs for the same key, return an error (the request is ambiguous).
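One way to detect a body mismatch is to store a hash of the request body next to the key and compare it on every later request. The sketch below keeps that mapping in memory as a stand-in for Redis or a database table and has no TTL; KeyStore, Check and ErrKeyReused are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"sync"
)

// ErrKeyReused means the key was seen before with a different body.
var ErrKeyReused = errors.New("idempotency key reused with a different request body")

// fingerprint is a stable hash of the request body.
func fingerprint(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// KeyStore remembers which body each idempotency key was first used with.
type KeyStore struct {
	mu   sync.Mutex
	seen map[string]string // key → body fingerprint
}

func NewKeyStore() *KeyStore {
	return &KeyStore{seen: make(map[string]string)}
}

// Check returns (false, nil) for a new key, (true, nil) for a replay of the
// same request, and ErrKeyReused when the body differs.
func (s *KeyStore) Check(key string, body []byte) (replay bool, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	fp := fingerprint(body)
	prev, ok := s.seen[key]
	switch {
	case !ok:
		s.seen[key] = fp
		return false, nil
	case prev == fp:
		return true, nil
	default:
		return false, ErrKeyReused
	}
}
```

Hashing the raw bytes treats two JSON documents that differ only in whitespace or field order as different requests; whether that is acceptable depends on the clients.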

2.8 Hedging requests

Tail latency problem: p99 latency is 2 seconds while p50 = 50 ms. The cause is one slow instance or a network issue.

Hedging: after waiting X ms, send a second request (to a different instance). Accept the first response and cancel the other.

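// result carries the outcome of one attempt.
type result struct {
    resp *http.Response
    err  error
}

func hedgedRequest(ctx context.Context, req *http.Request) (*http.Response, error) {
    // attempt starts one request with its own cancelable context.
    attempt := func() (<-chan result, context.CancelFunc) {
        attemptCtx, cancel := context.WithCancel(ctx)
        ch := make(chan result, 1) // buffered: a losing attempt never blocks on send
        go func() {
            // Clone assumes an idempotent request without a body.
            r, e := client.Do(req.Clone(attemptCtx))
            ch <- result{r, e}
        }()
        return ch, cancel
    }

    primary, cancelPrimary := attempt()
    select {
    case r := <-primary:
        return r.resp, r.err
    case <-time.After(100 * time.Millisecond): // after 100ms, hedge
    }

    backup, cancelBackup := attempt()
    // Take the first response and cancel the other attempt. The winner's
    // context stays alive so the caller can read the body; it ends with ctx.
    select {
    case r := <-primary:
        cancelBackup()
        return r.resp, r.err
    case r := <-backup:
        cancelPrimary()
        return r.resp, r.err
    }
}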

⚠️ Cost: up to 2x load on the downstream. Use it only for idempotent reads on critical endpoints.

⚠️ gRPC supports hedging natively through the service config (support varies by language implementation, so check yours):

{
  "hedgingPolicy": {
    "maxAttempts": 2,
    "hedgingDelay": "0.1s"
  }
}

2.9 Fallback patterns

1. Default response:

func getRecommendations(userID string) []Item { // the string type is an assumption
    items, err := mlService.Recommend(userID)
    if err != nil {
        return defaultPopularItems  // pre-computed list
    }
    return items
}

2. Cached value:

func getUserPrefs(userID string) Prefs { // the string type is an assumption
    p, err := db.GetPrefs(userID)
    if err != nil {
        if cached, ok := cache.Get(userID); ok {
            return cached.(Prefs)  // stale cache acceptable
        }
        return defaultPrefs
    }
    cache.Set(userID, p)
    return p
}

3. Degraded service:

Disable non-critical features (rec engine off, comments off).
Return partial response with warning.
Read-only mode.

2.10 Combinations: robust client

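// 1. Circuit Breaker wraps everything
// 2. Inside CB: Retry with backoff
// 3. Each attempt has timeout
// 4. Idempotency key for safe retry of mutations

func robustCall(ctx context.Context, req *http.Request) ([]byte, error) {
    // One key per logical call: every retry must carry the same key.
    req.Header.Set("Idempotency-Key", generateKey())

    out, err := cb.Execute(func() (interface{}, error) {
        var body []byte
        attempt := func() error {
            attemptCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
            defer cancel()

            // Clone assumes a request without a body; rebuild the body per attempt otherwise.
            resp, err := httpClient.Do(req.Clone(attemptCtx))
            if err != nil {
                if isRetryable(err) {
                    return err
                }
                return backoff.Permanent(err) // don't retry
            }
            defer resp.Body.Close()
            body, err = io.ReadAll(resp.Body) // read before cancel() ends the attempt
            return err
        }
        // A fresh policy per call: ExponentialBackOff is not safe for concurrent use.
        policy := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)
        return body, backoff.Retry(attempt, policy)
    })
    if err != nil {
        return nil, err
    }
    return out.([]byte), nil
}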

2.11 Health check propagation

If your service depends on downstream A, B, C, your health check should reflect the state of those dependencies.

type Health struct {
    DB    bool
    Redis bool
    A     bool
    B     bool
}

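func (h *Health) Handler(w http.ResponseWriter, r *http.Request) {
    // Bound the probes; the 1s limit is an assumption, tune it to the probe period.
    ctx, cancel := context.WithTimeout(r.Context(), time.Second)
    defer cancel()

    // Build a fresh snapshot per request: handlers run concurrently.
    status := Health{
        DB:    pingDB(ctx) == nil,
        Redis: pingRedis(ctx) == nil,
    }
    if !status.DB || !status.Redis {
        w.WriteHeader(http.StatusServiceUnavailable) // unhealthy → k8s removes the pod from the LB
    }
    json.NewEncoder(w).Encode(status)
}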

⚠️ Differentiate readiness vs liveness:

Liveness: am I alive? Pass even if dependencies are down (a restart will not help).
Readiness: can I serve traffic? Fail if dependencies are down.

⚠️ Cascading unhealth: if A considers itself unhealthy because of B, and B because of A, neither ever becomes ready (deadlock). Design dependencies in health checks carefully.

2.12 Don’t retry into broken backend

A circuit breaker already solves this. But in a multi-layer setup:

Client → LB → Service A → Service B

Every layer may retry. With 3 attempts at each of 3 layers, total = 3 × 3 × 3 = 27x load on B. Solution:

Retry only at outermost layer (client).
Inner layers — no retry, propagate error.
Or: agree on retry budget.
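The multiplication above generalizes to attempts raised to the power of the number of retrying layers. A tiny helper makes the worst case easy to compute for a given topology; worstCaseAmplification is a hypothetical name.

```go
package main

// worstCaseAmplification returns how many requests can reach the innermost
// service for one user request when each of `layers` layers makes up to
// `attempts` attempts (the original call plus its retries).
func worstCaseAmplification(attempts, layers int) int {
	total := 1
	for i := 0; i < layers; i++ {
		total *= attempts
	}
	return total
}
```

With 3 attempts at 3 layers the result is 27, and retrying only at the outermost layer (layers = 1) brings it down to 3.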

2.13 alibaba/sentinel-golang

Sentinel provides flow control and circuit breaking. It was originally written in Java and ported to Go. It is more full-featured than gobreaker:

Rate limiting (per QPS, per concurrent).
Adaptive — based on system load.
Hot spot parameter throttling.
Cluster flow control.

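import (
    "log"

    "github.com/alibaba/sentinel-golang/api"
    "github.com/alibaba/sentinel-golang/core/circuitbreaker"
)

// Initialize Sentinel once at startup, before loading rules.
if err := api.InitDefault(); err != nil {
    log.Fatal(err)
}

_, err := circuitbreaker.LoadRules([]*circuitbreaker.Rule{
    {
        Resource:         "downstream-api",
        Strategy:         circuitbreaker.ErrorRatio,
        Threshold:        0.5,   // trip at a 50% error ratio
        RetryTimeoutMs:   10000, // open → half-open after 10s
        StatIntervalMs:   60000, // statistics window of 60s
        MinRequestAmount: 20,    // ignore windows with fewer requests
    },
})

e, b := api.Entry("downstream-api")
if b != nil {
    // blocked by CB: fail fast or serve a fallback
    return
}
defer e.Exit()
// do work; report failures with api.TraceError(e, err) so they count toward the error ratio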

2.14 Service mesh resilience (Istio, Linkerd)

A service mesh delivers these patterns without application code:

Retries (with budget).
Circuit breaker.
Timeouts.
Hedging (Istio EnvoyFilter).

Trade-off:

Pros: language-agnostic, declarative (YAML), uniform across services.
Cons: extra latency (sidecar), operational complexity, harder to debug.

In production the usual approach is defense in depth: a library at the application level plus the mesh for defaults.

2.15 Failure injection (chaos engineering)

Test resilience by injecting failures:

Kubernetes: chaos-mesh, litmus.
Istio: VirtualService faults (delay, abort).
Application: a feature flag for intentional failure.

Pattern: GameDay. Once a month, deliberately kill a dependency and verify that the circuit breaker trips and the fallback works.

3. Gotchas

⚠️ Retry without timeout = waiting forever. Always combine them.

⚠️ Retrying non-idempotent operations without an idempotency key = data corruption.

⚠️ CB threshold too low = false positives (transient errors open the CB). Too high = late detection.

⚠️ MaxRequests = 1 in half-open means a single failed request reopens the breaker. Usually 3–10.

⚠️ CB per-instance vs per-service: one instance can trip falsely because of a local issue while the service as a whole is OK. Some setups share CB state via Redis.

⚠️ Hedging amplifies load. Don’t use for write operations.

⚠️ Idempotency key TTL too short. If the client retries after 25h and the TTL is 24h, the request is processed twice.

⚠️ Idempotency key not unique enough. Use UUID + namespace.

⚠️ Bulkhead semaphore size. Too small = false rejections. Too large = no isolation.

⚠️ Cascading retries через layers. Multiplicative load. Always retry budget.

⚠️ Circuit breaker state resets on restart. A restarted instance starts with a closed breaker; if the downstream is still broken, every new instance retries against it → amplification.

⚠️ gRPC retry semantics: a retry is safe only when the server did not process the request. A failure before response metadata is received = retry. A failure after the server saw the request depends on the RPC type.

⚠️ Context propagation breaks in a background goroutine that lacks a proper ctx. Use context.WithoutCancel (Go 1.21+) VERY carefully.

⚠️ Timeout < propagation latency. If the client timeout is 1s and the downstream needs 800ms of work + 200ms of network, that is borderline. Add a buffer.

⚠️ The health check itself fails under load. CPU exhausted = health check does not answer = k8s restarts the pod = even more load. Make the health check ultra-light.

⚠️ Fallback hides real problems. Always log and emit a metric when a fallback is used.

4. Real cases

Case 1: Cascading failure without a CB

Context: a payment service depends on a Bank API. The Bank API had a 30-second outage.

Without a CB: every request waited 30 seconds (the HTTP timeout). Goroutines piled up → 50K. RSS = 5 GB. The pod was OOM-killed. Pod restart → empty connection pool → still 30s timeouts. Death spiral.

With a CB: after 10 consecutive failures → open. All subsequent requests fail immediately, in about 1 ms. Resource usage stays bounded. The pod is stable. After 60s, a half-open trial → the Bank API is still down → back to open. Eventually the Bank API recovered → half-open success → closed.

Case 2: Stripe-style idempotency

Context: e-commerce checkout. The user clicks “Pay” twice in quick succession because of UI lag.

Without idempotency: 2 charge requests → double charge.

With an idempotency key (generated client-side per checkout):

Request 1: process it and save the result in Redis under the key.
Request 2 (same key): return cached result. No double charge.

Implementation:

func (s *PaymentService) Charge(ctx context.Context, req ChargeRequest) (*Charge, error) {
    key := req.IdempotencyKey // client provides

    // Distributed lock per key
    lock, err := redisLock.Acquire(ctx, "idem:"+key, 30*time.Second)
    if err != nil {
        return nil, err
    }
    defer lock.Release(ctx)

    // Replay the stored result if this key was already processed.
    if cached, _ := s.cache.Get(ctx, "result:"+key); cached != nil {
        var c Charge
        if err := json.Unmarshal(cached, &c); err != nil {
            return nil, err
        }
        return &c, nil
    }

    charge, err := s.actuallyCharge(ctx, req)
    if err != nil {
        return nil, err
    }

    // Store the JSON encoding so the read path above can unmarshal it.
    data, err := json.Marshal(charge)
    if err != nil {
        return nil, err
    }
    s.cache.Set(ctx, "result:"+key, data, 24*time.Hour)
    return charge, nil
}

Case 3: Hedging for p99

Context: search service, p50 = 20ms, p99 = 2s. Investigation showed that one or two slow instances kept appearing.

Hedging: gRPC config. After 50ms, send a second request to another replica.

Result: p99 → 80ms. Cost: ~5% extra load (only about 5% of requests were actually hedged).

Case 4: Retry storm

Context: a new version is deployed with a bug → 50% of requests fail with 500.

Without a retry budget: every client retries 3 times → up to 4x load → the downstream is more saturated → still 50% errors → even more retries.

With a retry budget (max 10% over baseline): retries are capped, the system stays stable, ops sees the alert and rolls back.

Case 5: Bulkhead saved service

Context: a recommendation service calls 3 backends: profile, history, ML. The profile service is slow (10s latency).

Without a bulkhead: 1000 RPS to the rec service → 1000 goroutines all blocked on profile. Profile pool exhausted. New requests pile up. OOM.

With a bulkhead (semaphore of 100 for profile):

Up to 100 concurrent profile calls.
101st request — fail fast with “service overloaded”.
Other backends (history, ML) — separate bulkhead → still work.
Degraded recommendations (no profile data) returned via fallback.

5. Questions (25)

What is a cascading failure, and how do resilience patterns prevent it?
The three circuit breaker states and the transitions between them.
Consecutive failures vs error rate: which trip threshold is better, and why?
MaxRequests in half-open: what does it mean?
Bulkhead pattern: purpose, implementation with a semaphore.
When to retry, and when NOT to retry?
Exponential backoff with jitter: the algorithm.
Full jitter vs equal jitter vs decorrelated: which one does AWS recommend?
Retry budget: why it is needed and how to implement it.
Per-attempt timeout vs total timeout: the hierarchy.
Cascade avoidance: downstream timeout < client timeout. Why?
Idempotency key (Stripe): the full flow.
What to do when 2 concurrent requests carry the same idempotency key?
Hedging request: algorithm, cost, when to apply.
gRPC native hedging config.
Fallback patterns: 3 types with examples.
Liveness vs readiness probe: the difference.
Cascading unhealth deadlock: example and solution.
Service mesh resilience: pros and cons vs a library.
Failure injection (chaos engineering) tools.
Combined robust client: which 4 patterns go together?
Compare sony/gobreaker and alibaba/sentinel-golang.
CB per-instance vs shared state in Redis: when to use which?
Describe a cascading failure incident from your experience.
Describe a retry storm and how a retry budget helps.

6. Practice

Task 1: Implement a circuit breaker from scratch (3 states, error rate threshold). Compare it with gobreaker.

Task 2: Bulkhead via a semaphore. Test: 200 concurrent calls with capacity 50, and observe the fail-fast behavior.

Task 3: Retry with exponential backoff + jitter. Test convergence vs thundering herd.

Task 4: Implement idempotency middleware for an HTTP server. Use Redis as the cache.

Task 5: Hedging request: send a request, send a second one after 100 ms, return the first response.

Task 6: Combined robust client: CB + retry + timeout + idempotency. Wire it all together.

Task 7: Failure injection: add middleware that fails 30% of requests. Verify that the CB trips correctly.

Task 8: Health check endpoint that checks dependencies. Differentiate liveness vs readiness.

Task 9 (advanced): Istio VirtualService with retries + CB, without application code. Verify the behavior through fault injection.

Task 10: Retry budget: implement globally bounded retries (sliding window).

7. Sources

Michael Nygard, “Release It!”, Pragmatic Bookshelf, 2nd ed 2018. The canonical book on resilience.
Sam Newman, “Building Microservices”, O’Reilly, 2nd ed 2021.
Netflix Tech Blog, “Fault Tolerance in a High Volume, Distributed System”, 2012.
AWS Architecture Blog, “Exponential Backoff And Jitter”, 2015.
Stripe API Docs, “Idempotent Requests”, https://stripe.com/docs/api/idempotent_requests
Google SRE Book, chapters 21–22 on overload, cascading failure.
sony/gobreaker source and docs.
alibaba/sentinel-golang documentation.
cenkalti/backoff library.
Envoy Proxy documentation: circuit breaker, retries, hedging.
Istio Documentation: VirtualService destination rules.
Casey Rosenthal, “Chaos Engineering”, O’Reilly, 2020.
gRPC documentation: retry policy, hedging policy.
Marc Brooker (AWS Principal Engineer), blog posts on timeouts and retries.
Adrian Cockcroft, talks on Netflix microservices resilience.