Перейти к содержимому

Distributed transactions и consensus: 2PC, TCC, Saga, Paxos, Raft

Зачем знать на Middle 3: В распределённой системе нет «atomic across nodes» — каждый паттерн платит чем-то: 2PC платит блокировкой, TCC сложностью, Saga consistency. Consensus (Raft) нужен везде, где есть leader election, replicated state machine, distributed locks — etcd, Consul, CockroachDB крутятся на Raft. На уровне Senior: понимаешь, почему 2PC — антипаттерн в большинстве случаев, реализуешь TCC, читаешь Raft paper, выбираешь между hashicorp/raft и etcd/raft, делаешь leader election через etcd lease.

  1. Концепция distributed transactions и consensus
  2. Глубже / production-практики (2PC, TCC, Paxos, Raft)
  3. Gotchas
  4. Real cases
  5. Вопросы (25)
  6. Practice
  7. Источники

В моноблоке BEGIN; ...; COMMIT; даёт ACID. В распределённой системе:

Node A → INSERT into accounts (balance -= 100);
Node B → INSERT into accounts (balance += 100);

Что если A commit, B fail? Money disappear. Если оба commit но между ними crash — partial state.

Решения:

  1. 2PC — пытаемся атомарно через coordinator.
  2. TCC — компенсируем через explicit reserve.
  3. Saga — eventually consistent через compensation (file 15).
  4. CRDTs — eventual consistent без consensus, но для specific data types.

Несколько nodes хотят согласиться на single value (или последовательность values). Любая может упасть, network может разделиться (partition), messages могут потеряться.

Theoretical foundation:

  • FLP impossibility (1985): нет deterministic consensus в asynchronous system с failure detection.
  • CAP theorem: Consistency, Availability, Partition tolerance — pick 2 из 3 во время partition.
  • PACELC: если P → A или C; иначе (Else) → Latency или Consistency.

Алгоритмы: Paxos (Lamport, 1989), Raft (Ongaro+Ousterhout, 2014), ZAB (ZooKeeper), Viewstamped Replication.


Phase 1 (Prepare):
Coordinator → all participants: "Can you commit?"
Each participant: lock resources, vote YES/NO
Phase 2 (Commit/Abort):
If all YES → Coordinator says "COMMIT"
If any NO → Coordinator says "ABORT"
Participants act.
Coordinator
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
Prepare? Prepare? Prepare?
│ │ │
▼ ▼ ▼
YES YES YES
│ │ │
└────────────┼────────────┘
COMMIT
┌────────────┼────────────┐
▼ ▼ ▼
Commit Commit Commit

Problems:

  1. Coordinator single point of failure: if coordinator crashes after Phase 1 but before Phase 2, participants stuck holding locks indefinitely.
  2. Blocking: until coordinator decides, participants can’t release resources.
  3. Latency: 2 RTTs минимум.
  4. Heuristic resolution: после timeout participants might decide unilaterally — risk of inconsistency.

XA standard: open spec for 2PC over heterogeneous resources (DB, JMS, etc). Most DBs support XA (Postgres PREPARE TRANSACTION, Oracle, etc).

В Go: database/sql не поддерживает XA из коробки. Можно вручную через PREPARE TRANSACTION:

BEGIN;
INSERT ...;
PREPARE TRANSACTION 'tx_id_123';
-- coordinator decides later:
COMMIT PREPARED 'tx_id_123';
-- or
ROLLBACK PREPARED 'tx_id_123';

⚠️ Prepared transactions hold locks, vacuum blocked. Если coordinator forgets — DB постепенно деградирует. Postgres max_prepared_transactions обычно 0 (disabled). Включается только если sознательно используете 2PC.

Расширение 2PC с дополнительной phase для non-blocking при coordinator failure:

  • Phase 1: CanCommit?
  • Phase 2: PreCommit (запись о намерении).
  • Phase 3: DoCommit.

Идея: если coordinator падает между Phase 2 и Phase 3, participants могут decide based on PreCommit state.

В практике почти не используется — добавляет latency без значимой выгоды над 2PC. Saga намного популярнее.

Аналог 2PC, но через application-level reservation вместо DB locks.

Phase 1 (Try):
All services reserve resources (e.g. earmark inventory, hold payment)
Phase 2:
If all OK → Confirm (commit reservation)
If any fail → Cancel (release reservation)

Example: Booking flow

Try:
Inventory: reserve 5 items (set "reserved_for_order=X" — not deducted)
Payment: authorize $100 (not captured)
Shipping: reserve slot
Confirm:
Inventory: actually deduct 5 items
Payment: capture $100
Shipping: confirm slot
Cancel:
Inventory: release reservation
Payment: void authorization
Shipping: release slot

Преимущества vs 2PC:

  • No DB locks (application-level state).
  • Можно tolerate partial failure (try → cancel).
  • Independent timeouts.

Trade-offs:

  • Каждый сервис must support try/confirm/cancel API explicitly.
  • More code than saga.
  • Reservation TTL needed (auto-cancel if forgotten).
Аспект2PCTCCSaga
AtomicityStrongStrong (Confirm phase)Eventual
BlockingYes (locks)Application-levelNo
Failure modeCoordinator SPOFCancel явноCompensate
Coding effortLow (transparent)High (Try/Conf/Cancel)Medium
Latency2 RTTs sync2 phases syncAsync possible
Production useRare (banking legacy)Some (e-commerce)Common (microservices)

В 2026 most companies use Saga или TCC. 2PC только в специальных случаях (high-consistency requirements внутри trusted boundary).

Replicated state machine: несколько replicas servers держат одинаковую копию state. Updates применяются consistent order.

Use cases:

  • Distributed locks (etcd, Consul).
  • Leader election (Kubernetes leader elections, master в Kafka KRaft).
  • Strongly consistent KV store (etcd, CockroachDB metadata).
  • Configuration (Consul, ZooKeeper).
  • Distributed databases (CockroachDB, TiKV, FoundationDB shards).

Original paper: “The Part-Time Parliament”, 1989/1998. Mathematically rigorous, hard to implement.

Single-decree Paxos (одно value):

Roles:

  • Proposer: предлагает value.
  • Acceptor: голосует за или против.
  • Learner: узнаёт result.

Algorithm:

  1. Prepare phase: Proposer sends prepare(n) с unique number n. Acceptors promise не accept-ить prepare с n’ < n.
  2. Accept phase: если majority promised, Proposer sends accept(n, value). Acceptors accept если they haven’t promised higher n.
  3. Если majority accepted — decision made.

⚠️ Описание выглядит просто, но dual-leader, ordering, recovery — extremely subtle. Multiple implementations failed correctness tests.

Multi-Paxos: multiple decisions (log of values), with leader optimization.

Goal: same correctness as Paxos, but understandable. Successful — most modern distributed systems use Raft.

Three sub-problems:

  1. Leader election.
  2. Log replication.
  3. Safety.
┌──────────┐
│ Follower │ ← initial state
└────┬─────┘
│ timeout (heartbeat lost)
┌──────────┐
│Candidate │ — requests votes
└────┬─────┘
│ majority votes
┌──────────┐
│ Leader │ — appends to log, sends heartbeats
└──────────┘
│ discover higher term
┌──────────┐
│ Follower │
└──────────┘
  • Every node starts as Follower.
  • Random election timeout (150–300ms).
  • If no heartbeat from leader → become Candidate.
  • Candidate increments term, votes for self, requests votes from others.
  • Majority votes → Leader.
  • Otherwise (split vote) → wait, retry с different timeout.
  • Leader receives commands from clients.
  • Appends to its log.
  • Sends AppendEntries to all followers.
  • When majority acknowledged → entry committed.
  • Followers apply to state machine.
  • Election safety: at most one leader per term.
  • Leader append-only: leader never overwrites its log.
  • Log matching: if two logs share entry at same index+term, all prior entries match.
  • Leader completeness: committed entry в любом prior term будет в logs всех future leaders.
  • State machine safety: if entry applied at index i, no other entry at i in any node.

Для N nodes, quorum = ⌊N/2⌋ + 1.

  • 3 nodes: quorum 2 (tolerate 1 failure).
  • 5 nodes: quorum 3 (tolerate 2).
  • 7 nodes: quorum 4 (tolerate 3).

Why odd N? 4 nodes → quorum 3, tolerate 1. Same as 3 nodes но dual write needed. Diminishing returns.

⚠️ Split-brain: если cluster split на (2, 2) в 4-node — no quorum в любой стороне. Cluster halts. Lesson: всегда odd N.

hashicorp/raft:

  • Production-grade, used in Consul, Nomad, Vault.
  • Persistent storage abstractions (BoltDB by default, can be custom).
  • Snapshots, log compaction.
  • API в use:
import "github.com/hashicorp/raft"
config := raft.DefaultConfig()
config.LocalID = "node1"
store, _ := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft.db"))
snapshots, _ := raft.NewFileSnapshotStore(dataDir, 3, os.Stderr)
transport, _ := raft.NewTCPTransport("0.0.0.0:7000", nil, 3, 10*time.Second, os.Stderr)
fsm := &MyFSM{} // your state machine
r, err := raft.NewRaft(config, fsm, store, store, snapshots, transport)
// Apply command (must be on leader)
f := r.Apply(commandBytes, 10*time.Second)
if err := f.Error(); err != nil { ... }

FSM — interface:

type FSM interface {
Apply(*raft.Log) interface{} // apply committed log entry
Snapshot() (FSMSnapshot, error)
Restore(io.ReadCloser) error
}

etcd/raft:

  • Original implementation in etcd.
  • Used by etcd, CockroachDB, TiKV (initially), M3DB.
  • More low-level, requires more wiring.
  • Better для embedding в высокопроизводительные системы.

dragonboat:

  • Multi-group Raft (один node может participate в many raft groups simultaneously).
  • High throughput.
  • Used by some сторадж systems.

В 2026 для new projects:

  • Need simple? — hashicorp/raft.
  • Need multi-group / max performance? — dragonboat or etcd/raft.
  • Linearizability: реальное время; все operations as if executed sequentially. The strongest model.
  • Sequential consistency: order существует, но не obязательно реальное время.
  • Eventual consistency: replicas eventually converge.

Raft даёт linearizability (с care на reads):

  • Writes: пройти через leader → committed.
  • Reads:
    • Linearizable read: leader checks с majority (ReadIndex pattern) что он ещё leader. Slow.
    • Lease read: leader uses time lease (assume no other leader exists). Faster, but requires synchronized clocks.
    • Follower read: stale, but fast.

Production systems offer all three с config.

Algorithm used in Apache ZooKeeper. Similar to Raft в spirit:

  • Leader-based.
  • 2-phase commit per write.
  • Linearizable writes, FIFO order from same client.

Predates Raft but similar concepts. ZooKeeper still widely used (Kafka до KRaft, HBase, Solr).

Older algorithm (1988), influenced both Paxos и Raft. Brief look:

  • View = “term” в Raft.
  • View change = leader election.
  • Primary-backup pattern.

Less used directly today, но academic значимый.

Способы:

1. etcd lease:

import "go.etcd.io/etcd/client/v3"
import "go.etcd.io/etcd/client/v3/concurrency"
cli, _ := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
session, _ := concurrency.NewSession(cli, concurrency.WithTTL(10))
election := concurrency.NewElection(session, "/leader/my-service")
// Block until elected
if err := election.Campaign(ctx, "node-1"); err != nil { panic(err) }
// I am leader now
// On exit: session.Close() — automatically resigns

2. Kubernetes lease (kubernetes/client-go):

import "k8s.io/client-go/tools/leaderelection"
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: &resourcelock.LeaseLock{...},
LeaseDuration: 15 * time.Second,
RenewDeadline: 10 * time.Second,
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) { /* do leader work */ },
OnStoppedLeading: func() { /* cleanup */ },
},
})

3. Raft library: вы сами строите cluster.

4. Redis SET NX EX: примитивный distributed lock, но не safe для нескольких clients (split-brain).

⚠️ Безопасное leader election требует fencing token (monotonic counter). Иначе старый leader, который вернулся после network partition, может попытаться выполнить запись с stale lease.

Constraints:

  • Leader-bound (single writer).
  • Network RTT × 2 (one to followers, one back).
  • Disk sync (fsync) per entry для durability.

Optimizations:

  • Batching: append N entries за один AppendEntries.
  • Pipelining: send next batch до acknowledgment предыдущего.
  • Async commit: client doesn’t wait для disk sync (durability trade-off).
  • Snapshot: log compaction (otherwise log infinite).

Typical numbers:

  • 10K–50K ops/sec (single leader).
  • Multi-group (dragonboat) — millions ops/sec aggregate via sharding.

etcd: Kubernetes’s source of truth. Raft под капотом. ~10K writes/sec.

Consul: service discovery, KV. Raft.

CockroachDB: SQL distributed. Range-level Raft groups (each range ~64 МБ имеет свой Raft group). Linearizable.

TiKV: Raft per region. Backend для TiDB.

HashiCorp Vault: integrated storage uses Raft (in HA mode).

Apache Kafka KRaft mode (since 2.8, production from 3.3): replaces ZooKeeper с собственным Raft impl.

Многие алгоритмы (TrueTime в Spanner, hybrid logical clocks в CockroachDB) полагаются на bounded clock skew.

Lamport clock: scalar counter per node, increment per event. Causal ordering, но не real-time.

Vector clock: vector of counters per node. Detects concurrent events.

Hybrid Logical Clock (HLC): physical time + logical counter. Used by CockroachDB.

Google TrueTime: hardware atomic clocks + GPS, API returns [earliest, latest] interval. Spanner waits for latest before commit для linearizability.

В Go: для most use cases используют time.Now() (sync через NTP, skew ~10–100 ms). Для strict ordering — explicit sequence numbers или Raft log index.

В CockroachDB / TiKV каждый range = Raft group. Если один range очень hot (one tenant писать 100K ops/sec), leader этого range — bottleneck.

Mitigations:

  • Range split: автоматический split по size or load.
  • Leader rebalance: move leader на менее loaded node.
  • Follower reads: stale reads с replicas снижают leader load.

⚠️ Hot range — частая проблема в production. Pre-split при known hot keys (e.g. time-series with timestamp prefix → next hour будет hot).

Добавить или удалить node из Raft cluster — не trivial. Если просто добавить node, может временно появиться 2 disjoint majorities → split-brain.

Joint consensus (original Raft paper): переход через intermediate state где quorum нужен от обоих old и new configurations. Safe but complex.

Single-server changes (Ongaro’s thesis): добавляем/удаляем по одной node за раз. Quorum changes плавно. Used in hashicorp/raft.

// hashicorp/raft: add voter
future := r.AddVoter("node4", "node4-addr:7000", 0, 0)
if err := future.Error(); err != nil { ... }

⚠️ В production: всегда меняйте membership по одной node. Не batch.


⚠️ 2PC blocks resources. Even в normal flow lock held для 1 RTT. Под scale это catastrophic.

⚠️ Coordinator crash в 2PC = participants stuck. Heuristic decision = data corruption risk.

⚠️ TCC reservation TTL — если short — premature cancellation. If long — resources held too long. Trade-off.

⚠️ Paxos correctness: не реализуйте сами. Используйте production library.

⚠️ Raft split-brain prevention: всегда odd cluster size.

⚠️ Raft committed != applied. Apply returns после fsync, но state machine может apply later. Reads ДО apply вернут stale.

⚠️ Read amplification: linearizable reads через leader. If many readers, leader bottleneck.

⚠️ etcd cluster size: typical 3 или 5 nodes. 7 nodes — write latency degrades (slowest follower) without proportional benefit.

⚠️ Snapshots: без log compaction Raft log infinite. Snapshot too frequent → IO cost. Too rarely → recovery slow.

⚠️ Membership changes (add/remove nodes) — separate sub-protocol в Raft (joint consensus или single-server changes). Easy to mess up.

⚠️ Lease read depends on clock skew. Если node clocks drift, two leaders могут coexist. Always use time bounds.

⚠️ Redis SET NX EX is NOT safe distributed lock. Без fencing token — old leader может corrupt data. Use Redlock (still не perfectly safe per Martin Kleppmann’s critique) или Raft-based.

⚠️ Kubernetes leader election requires coordination.k8s.io/v1.Lease. RBAC permissions.

⚠️ Leader churn (frequent re-elections) — symptom of network or clock issues. Не feature.

⚠️ CockroachDB range splits — automatic Raft group sharding. Hot range = bottleneck.

⚠️ Saga vs Raft confusion: Saga — application-level workflow. Raft — replication algorithm. Different layers.


Контекст: legacy bank running on Oracle XA cross-DB transactions.

Incident: один из DBs offline 10 минут. Coordinator timeouts. Prepared transactions stuck. Vacuum blocked. Disk full в 2 часа. Bank stopped processing transactions.

Lesson: 2PC в large-scale is fragile. Migration to saga over 1 год.

Контекст: Redis Sentinel cluster, 5 sentinels. Network partition — 3 в одной side, 2 в другой.

Без proper fencing: 3-side promoted new master. Когда 2-side reconnected, они уже не знали, что master changed. Wrote to old master. Diverged data.

Resolution: dropped writes from old master (manual reconciliation). Migrated к Raft-based KV (etcd-style) для critical state.

Setup: Kubernetes cluster с 3-node etcd.

Issue: одна node maintenance → 2/3 quorum. Если ещё одна fails — cluster halts. Window of vulnerability — стрессно.

Action: расширили до 5 nodes. Tolerate 2 failures.

⚠️ Larger size — slower writes (need majority ACK across more nodes).

Контекст: Custom distributed system, использует hashicorp/raft. KV store.

Problem: snapshot не запускался автоматически, log грос до 50 ГБ. Restart = 30 минут replay.

Fix: tuned config.SnapshotInterval = 1 * time.Hour, config.SnapshotThreshold = 100000. Log compaction после snapshot.

Контекст: e-commerce, ranger transactions «reserve inventory + charge + ship».

Old approach: pseudo-2PC через app-level coordinator. 5% transactions stuck weekly.

New approach (Temporal-based Saga, file 15): 0.01% stuck.

Key difference: Saga is async, no blocking locks. Failures handled через compensation.


  1. Почему distributed transactions сложны?
  2. Опишите 2PC: phases, participants, coordinator.
  3. Coordinator crash в 2PC: что происходит и почему opasно?
  4. XA standard и PREPARE TRANSACTION в Postgres.
  5. Чем 3PC отличается от 2PC и почему 3PC не популярен?
  6. TCC: три фазы, пример с booking.
  7. Сравните 2PC, TCC, Saga по таблице (atomicity, blocking, latency).
  8. Что такое FLP impossibility?
  9. CAP theorem: что нельзя одновременно.
  10. PACELC: что добавляет к CAP?
  11. Зачем нужен distributed consensus?
  12. Paxos roles: proposer, acceptor, learner.
  13. Почему Paxos сложен в реализации?
  14. Raft: 3 sub-problems.
  15. Leader election в Raft: алгоритм.
  16. Log replication: AppendEntries, commit, apply.
  17. Safety properties Raft (5 штук).
  18. Quorum = N/2 + 1. Почему odd N лучше?
  19. Linearizable read через ReadIndex vs lease read.
  20. hashicorp/raft vs etcd/raft vs dragonboat.
  21. etcd lease для leader election: код.
  22. Kubernetes lease — как использовать в Go.
  23. Fencing token: зачем при distributed lock.
  24. Raft performance: ~10K ops/sec. Что bottleneck?
  25. Опишите production incident вокруг 2PC или split-brain.

Задача 1: Реализовать простой 2PC coordinator с 2 participants. Симулировать coordinator crash после phase 1.

Задача 2: Реализовать TCC для booking flow (inventory + payment + shipping reserve).

Задача 3: Подключить hashicorp/raft, создать 3-node cluster, реализовать KV store. Тестировать leader election, append commands.

Задача 4: Симулировать network partition — kill одного node, проверить, что cluster ОК (с 3 nodes — leader election на 2 оставшихся).

Задача 5: etcd leader election: поднять etcd, реализовать leader election на Go.

Задача 6: Kubernetes leader election через client-go: реализовать background worker, который running только на одном pod.

Задача 7: Распределённый lock — implement Redlock на Go, тест с failure injection.

Задача 8: Раскрыть Raft log compaction: реализовать FSM с snapshot + restore.

Задача 9 (advanced): Поднять CockroachDB локально, проверить range splits и Raft groups через cockroach debug commands.

Задача 10: Сравнить write latency в etcd при 3-node, 5-node, 7-node cluster.


  1. Leslie Lamport, “The Part-Time Parliament” (Paxos), 1989/1998.
  2. Diego Ongaro, John Ousterhout, “In Search of an Understandable Consensus Algorithm” (Raft), 2014.
  3. Diego Ongaro PhD thesis, “Consensus: Bridging Theory and Practice”, 2014.
  4. Brian Oki, Barbara Liskov, “Viewstamped Replication”, 1988.
  5. Apache ZooKeeper, “ZAB Algorithm” technical report.
  6. Martin Kleppmann, “Designing Data-Intensive Applications”, chapters 8–9.
  7. CockroachDB Engineering Blog, “Living Without Atomic Clocks”, 2017.
  8. etcd documentation: linearizable reads, membership changes.
  9. hashicorp/raft source and docs.
  10. etcd/raft source.
  11. dragonboat documentation.
  12. Tyler Treat, “From the Ground Up: Distributed Systems” series.
  13. Pat Helland, “Life Beyond Distributed Transactions”, 2007.
  14. Daniel Abadi, “PACELC theorem”, 2010.
  15. Henrik Engström, talks on Apache Kafka KRaft (replacing ZooKeeper).
  16. Marc Brooker (AWS), blog on distributed locks и fencing tokens.
  17. Camille Fournier, “Consensus Systems for the Skeptical Architect”, talks 2017.