Lock-Free Queues Deep: SPSC, MPSC, MPMC, Hazard Pointers, EBR, RCU

Зачем знать на Middle 3: lock-free очереди — это сердце high-throughput систем: scheduler’ы (Go runtime work-stealing queue), networking (kernel bypass, DPDK), databases (lock-free MVCC), trading systems (LMAX Disruptor). Middle 3 / Senior должен понимать: (1) различия SPSC/MPSC/SPMC/MPMC и когда какую выбрать; (2) Michael-Scott queue, Vyukov MPSC, LMAX Disruptor; (3) проблему memory reclamation (Hazard Pointers, EBR, RCU); (4) когда lock-free действительно нужен и почему часто медленнее mutex. Без этого вы не поймёте Go runtime, не сможете правильно спроектировать high-throughput pipeline, и легко создадите subtle bugs.

Содержание

Краткое введение
Глубокое погружение
- 2.1 Lock-free vs wait-free vs obstruction-free
- 2.2 Linearizability vs serializability
- 2.3 SPSC: LMAX-style ring buffer
- 2.4 MPSC: Vyukov queue
- 2.5 SPMC: специфика
- 2.6 MPMC: Michael-Scott queue
- 2.7 Bounded MPMC (slot-based)
- 2.8 ABA tagging
- 2.9 Hazard Pointers
- 2.10 Epoch-Based Reclamation (EBR)
- 2.11 RCU
- 2.12 Performance trade-offs
Gotchas
Real cases
Вопросы
Practice
Источники

1. Краткое введение

Lock-free queue — структура данных, в которой операции Enqueue/Dequeue не используют lock’и, гарантируют progress (хотя бы один поток продвигается за конечное число шагов) и обычно строятся на atomic CAS.

Терминология progress guarantees

Wait-free   ⊂ Lock-free ⊂ Obstruction-free ⊂ (blocking algorithms)

Wait-free:        каждый поток завершает операцию за bounded шагов
Lock-free:        хотя бы один поток завершает операцию за bounded шагов
Obstruction-free: поток завершает операцию, если работает в изоляции
Blocking:         поток может ждать неограниченно (mutex)

Классификация очередей по producer/consumer

Тип	Producers	Consumers	Сложность	Примеры
SPSC	1	1	Простейший	Audio buffer, single pipeline
MPSC	N	1	Средняя	Event loop input, log aggregator
SPMC	1	N	Средняя	Broadcast (less common)
MPMC	N	N	Сложная	General-purpose work queue

Когда lock-free оправдан

Сценарий	Lock-free	Mutex
Низкая contention, короткая критическая секция	Может быть медленнее	Predictable, простой
Высокая contention, миллионы ops/sec	Часто выигрывает	Park/unpark dominates
Real-time / hard latency bounds	Wait-free обязателен	Mutex даёт unpredictable tail
Простой read-heavy	RCU/atomic.Pointer	RWMutex

⚠️ Главный миф: “lock-free всегда быстрее mutex”. На самом деле под низкой contention sync.Mutex (с его внутренним spin’ом) ОЧЕНЬ быстр (~25 ns uncontended). Lock-free структуры с CAS-loop’ом могут быть медленнее в average case.

2. Глубокое погружение

2.1 Lock-free vs wait-free vs obstruction-free

Wait-free — самое строгое: каждая операция завершается за фиксированное число шагов. Никакой retry loop’ов. Примеры: atomic.Add (на x86 — LOCK ADD instruction).

Lock-free — слабее: возможны retry, но system as a whole progresses. CAS-loop в lock-free стэке: один поток может повторять CAS многократно, но при этом другой поток гарантированно завершает push.

Obstruction-free — самое слабое: операция завершается, если другие потоки приостановлены. Под contention может быть infinite retry. Редко используется на практике.

2.2 Linearizability vs serializability

Linearizability — каждая операция выглядит, как будто произошла мгновенно в какой-то точке между её start и end. Сильнее, чем serializability.

Time:    -------[A enqueue 1]-------
                    -------[B enqueue 2]-------
                              -------[C dequeue]------

Если C видит 2 раньше 1 — нелинеаризуема (B завершил после A в реальном времени).

Serializability — операции эквивалентны какому-то sequential order. Не обязательно соблюдает real-time order.

Lock-free очереди обычно linearizable. Это критично, потому что иначе clients видят “временной парадокс”.

2.3 SPSC: LMAX-style ring buffer

Один producer, один consumer. Простейший случай, можно сделать wait-free.

Buffer (capacity = 8):
  +---+---+---+---+---+---+---+---+
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  +---+---+---+---+---+---+---+---+
            ^               ^
            head=2          tail=5
    consumer reads          producer writes

Memory ordering требования:

Producer пишет данные в slot до обновления head.
Consumer читает head (acquire) до чтения данных.
Это типичный publish-subscribe pattern.

Реализация в Go:

package spsc

import (
    "sync/atomic"
)

const cacheLine = 64

type SPSC[T any] struct {
    mask uint64
    buf  []T

    // Padding для cache line separation между head и tail
    _    [cacheLine]byte
    head atomic.Uint64 // producer writes here
    _    [cacheLine - 8]byte
    tail atomic.Uint64 // consumer reads here
    _    [cacheLine - 8]byte
}

func New[T any](capacityPow2 int) *SPSC[T] {
    cap := 1 << capacityPow2
    return &SPSC[T]{
        mask: uint64(cap - 1),
        buf:  make([]T, cap),
    }
}

// Push вызывается ТОЛЬКО одним producer'ом
func (q *SPSC[T]) Push(v T) bool {
    head := q.head.Load()
    tail := q.tail.Load() // acquire
    if head-tail >= uint64(len(q.buf)) {
        return false // full
    }
    q.buf[head&q.mask] = v
    q.head.Store(head + 1) // release
    return true
}

// Pop вызывается ТОЛЬКО одним consumer'ом
func (q *SPSC[T]) Pop() (T, bool) {
    var zero T
    tail := q.tail.Load()
    head := q.head.Load() // acquire
    if tail >= head {
        return zero, false // empty
    }
    v := q.buf[tail&q.mask]
    q.tail.Store(tail + 1) // release
    return v, true
}

2.3.1 Почему padding критичен

Без padding head и tail могут оказаться в одной cache line:

Cache line (64 bytes):
[ head | tail | ...other... ]
   ^      ^
   P      C  → producer и consumer обновляют одну линию
           → MESI ping-pong:

P updates head → P's cache: M, C's cache: I
C reads tail → C must fetch line → P's cache: S, C's cache: S
C updates tail → C's cache: M, P's cache: I
P reads head → ...

С padding head и tail в разных cache lines — нет conflict’a.

Бенчмарк:

// Без padding: ~150 ns/op
// С padding:   ~30 ns/op (5x speedup)

2.3.2 LMAX Disruptor

LMAX Disruptor (originally Java) — продвинутая версия SPSC/MPSC ring buffer с:

Cursor sequence — explicit publishing point.
Batching — consumer обрабатывает несколько items за раз.
Wait strategies — yielding, blocking, busy-spin.

В Go аналоги: lmax/disruptor-go (заброшен), smallnest/queue (поддерживается).

2.4 MPSC: Vyukov queue

Несколько producers, один consumer. Классическая реализация — Vyukov intrusive MPSC queue.

Идея:

Producers atomic’ом обменивают tail pointer на свой node, и затем связывают свой node со следующим (атомарно).
Consumer обходит linked list через next pointers.

Initial state:
  head -> stub -> nil
  tail = stub

Producer 1 pushes A:
  prev = atomic.Swap(&tail, A) // prev = stub
  prev.next = A

  head -> stub -> A -> nil
  tail = A

Producer 2 pushes B (concurrently with 1):
  prev = atomic.Swap(&tail, B) // prev = A (if after 1)
  prev.next = B

  head -> stub -> A -> B -> nil
  tail = B

package mpsc

import (
    "sync/atomic"
    "unsafe"
)

type node struct {
    value any
    next  unsafe.Pointer // *node
}

type Queue struct {
    head *node           // consumer accesses (no contention)
    _    [56]byte        // padding
    tail unsafe.Pointer  // *node, producer CAS target
}

func New() *Queue {
    stub := &node{}
    q := &Queue{head: stub}
    atomic.StorePointer(&q.tail, unsafe.Pointer(stub))
    return q
}

func (q *Queue) Push(v any) {
    n := &node{value: v}
    prev := atomic.SwapPointer(&q.tail, unsafe.Pointer(n))
    // ⚠️ window: после swap до linking
    // если consumer попытается прочитать в этот момент,
    // он увидит nil в next старого tail'а
    atomic.StorePointer(&(*node)(prev).next, unsafe.Pointer(n))
}

func (q *Queue) Pop() (any, bool) {
    head := q.head
    next := (*node)(atomic.LoadPointer(&head.next))
    if next == nil {
        return nil, false // empty (or producer in progress)
    }
    q.head = next
    return next.value, true
}

⚠️ Subtle race: между atomic.SwapPointer(&q.tail, ...) и atomic.StorePointer(&prev.next, ...) есть окно, где consumer видит queue “пустым”, хотя producer уже в процессе. Это lock-free, но не wait-free — consumer должен retry’нуть позже.

Используется в Go runtime для очередей в нескольких местах (runtime/proc.go, work-stealing адаптации).

2.5 SPMC: специфика

Один producer, несколько consumers. Менее распространён, потому что для классического “work queue” чаще нужен MPMC.

Простейший подход — producer пишет в shared array, consumers CAS’ят на индекс чтения:

type SPMC[T any] struct {
    buf []T
    head atomic.Uint64 // producer-only
    tail atomic.Uint64 // CAS by consumers
}

func (q *SPMC[T]) Pop() (T, bool) {
    for {
        tail := q.tail.Load()
        head := q.head.Load()
        if tail >= head {
            var zero T
            return zero, false
        }
        v := q.buf[tail&mask]
        if q.tail.CompareAndSwap(tail, tail+1) {
            return v, true
        }
        // retry
    }
}

⚠️ ABA risk: между tail.Load() и CAS другие consumers могут продвинуть tail на N и обратно (теоретически). В Go менее опасно (GC удержит buf).

Use case: pub/sub с одним publisher и многими subscribers (но обычно делается через broadcast channel или sync.Cond).

2.6 MPMC: Michael-Scott queue

Классическая lock-free MPMC очередь. Maged Michael, Michael Scott (1996). Используется в Java’s ConcurrentLinkedQueue.

head -> dummy -> A -> B -> C
                              ^
                              tail

Enqueue D:
  tail.next = D  (CAS)
  tail = D       (CAS, helps if previous step interleaved)

Dequeue:
  read head.next
  head = head.next  (CAS)

package msqueue

import (
    "sync/atomic"
    "unsafe"
)

type node struct {
    value any
    next  unsafe.Pointer // *node
}

type MSQueue struct {
    head unsafe.Pointer // *node
    tail unsafe.Pointer // *node
}

func New() *MSQueue {
    dummy := &node{}
    return &MSQueue{
        head: unsafe.Pointer(dummy),
        tail: unsafe.Pointer(dummy),
    }
}

func (q *MSQueue) Enqueue(v any) {
    n := &node{value: v}
    for {
        tail := atomic.LoadPointer(&q.tail)
        next := atomic.LoadPointer(&(*node)(tail).next)
        if tail != atomic.LoadPointer(&q.tail) {
            continue // tail changed, retry
        }
        if next != nil {
            // tail is lagging, help advance
            atomic.CompareAndSwapPointer(&q.tail, tail, next)
            continue
        }
        // try to link new node
        if atomic.CompareAndSwapPointer(&(*node)(tail).next, nil, unsafe.Pointer(n)) {
            atomic.CompareAndSwapPointer(&q.tail, tail, unsafe.Pointer(n))
            return
        }
    }
}

func (q *MSQueue) Dequeue() (any, bool) {
    for {
        head := atomic.LoadPointer(&q.head)
        tail := atomic.LoadPointer(&q.tail)
        next := atomic.LoadPointer(&(*node)(head).next)
        if head != atomic.LoadPointer(&q.head) {
            continue
        }
        if head == tail {
            if next == nil {
                return nil, false // empty
            }
            // tail is lagging, help advance
            atomic.CompareAndSwapPointer(&q.tail, tail, next)
            continue
        }
        v := (*node)(next).value
        if atomic.CompareAndSwapPointer(&q.head, head, next) {
            return v, true
        }
    }
}

Ключевые моменты:

Helping mechanism: если поток видит, что tail отстал (есть next, но tail указывает на старое), он помогает продвинуть tail. Без этого можно зависнуть.
Dummy node: head всегда указывает на dummy (уже извлечённый) node, чтобы не было race condition между empty check и pop.
ABA problem: classical Michael-Scott использует tagged pointers (counter в high bits). В Go ABA менее опасен из-за GC, но возможен с unsafe.
Memory reclamation: когда можно освободить старые nodes? В Go — автоматически через GC. В C/C++ нужны Hazard Pointers или EBR.

2.7 Bounded MPMC (slot-based)

Vyukov bounded MPMC queue — array-based, fixed capacity. Каждый slot имеет sequence counter, который кодирует state.

Slot state encoded via sequence counter:
  - seq == pos:        slot ready for write
  - seq == pos + 1:    slot ready for read
  - other:             busy / not ready

package boundedmpmc

import (
    "runtime"
    "sync/atomic"
)

type slot[T any] struct {
    seq   atomic.Uint64
    value T
}

type Queue[T any] struct {
    buf  []slot[T]
    mask uint64

    _    [56]byte
    enq  atomic.Uint64
    _    [56]byte
    deq  atomic.Uint64
}

func New[T any](capacityPow2 int) *Queue[T] {
    cap := uint64(1 << capacityPow2)
    q := &Queue[T]{
        buf:  make([]slot[T], cap),
        mask: cap - 1,
    }
    for i := range q.buf {
        q.buf[i].seq.Store(uint64(i))
    }
    return q
}

func (q *Queue[T]) Enqueue(v T) bool {
    var pos uint64
    for {
        pos = q.enq.Load()
        s := &q.buf[pos&q.mask]
        seq := s.seq.Load()
        diff := int64(seq) - int64(pos)
        if diff == 0 {
            if q.enq.CompareAndSwap(pos, pos+1) {
                s.value = v
                s.seq.Store(pos + 1)
                return true
            }
        } else if diff < 0 {
            return false // full
        } else {
            // another producer is ahead, retry
            runtime.Gosched()
        }
    }
}

func (q *Queue[T]) Dequeue() (T, bool) {
    var zero T
    var pos uint64
    for {
        pos = q.deq.Load()
        s := &q.buf[pos&q.mask]
        seq := s.seq.Load()
        diff := int64(seq) - int64(pos+1)
        if diff == 0 {
            if q.deq.CompareAndSwap(pos, pos+1) {
                v := s.value
                s.seq.Store(pos + q.mask + 1)
                return v, true
            }
        } else if diff < 0 {
            return zero, false // empty
        } else {
            runtime.Gosched()
        }
    }
}

Преимущества:

Bounded memory (нет аллокаций per op).
Cache-friendly (array layout).
Performance: 5-10x быстрее channel’ов под contention.

Недостатки:

Bounded — может вернуть full.
Complex memory ordering — легко сделать bug.
Не для production без тщательного теста с -race и стресс-тестами.

2.8 ABA tagging

Tagged pointers: используем high bits указателя для счётчика (counter).

64-bit pointer:
  [ 16-bit counter | 48-bit address ]

Каждый CAS обновляет counter:
  oldVal = load(ptr)         // tag = T, addr = A
  // ... do work ...
  newVal = (T+1)<<48 | A_new // tag = T+1
  CAS(ptr, oldVal, newVal)

Если другой поток сделал A → B → A между нашими load и CAS, tag будет другим (B сделал свой increment), и CAS провалится → retry.

В Go это сложно реализовать чисто, потому что unsafe.Pointer — 8 байт, и tag нужно держать отдельно. Альтернатива — struct с pointer + counter, CAS на 128-bit (но atomic.Value не поддерживает 128-bit на платформах без CX16).

Workaround в Go: использовать struct из двух полей и CAS через sync/atomic.Value или специальные двухступенчатые механизмы. Чаще проще использовать Hazard Pointers или просто полагаться на GC.

2.9 Hazard Pointers (HP)

Hazard Pointers — техника безопасной memory reclamation для lock-free структур.

Идея:

Каждый поток заявляет (publishes) указатели, которые он сейчас читает.
Перед reclaim (free) поток проверяет все hazard pointers всех потоков.
Если указатель в чьём-то hazard list — не освобождаем (defer).

Thread 1 reads node A:
  HP[1] = A
  read A.value
  ...
  HP[1] = nil

Thread 2 wants to free A:
  for each thread T:
    if HP[T] == A:
      defer (re-check later)
  if no hazard: free(A)

Pseudocode pop из lock-free стэка с HP:

func (s *Stack) Pop(myHP *atomic.Pointer[node]) (int, bool) {
    for {
        head := s.head.Load()
        if head == nil {
            return 0, false
        }
        myHP.Store(head)  // declare hazard
        if s.head.Load() != head {
            continue // head changed between load and HP set
        }
        next := head.next.Load()
        if s.head.CompareAndSwap(head, next) {
            myHP.Store(nil) // clear hazard
            retire(head)    // schedule for reclamation
            return head.value, true
        }
    }
}

func retire(n *node) {
    // add to local retire list
    // periodically scan all HPs and free those not in any HP
}

Преимущества HP:

Bounded memory overhead (per-thread HP slots — фиксированное число).
Не нужны grace periods.

Недостатки HP:

Каждое чтение требует publish + recheck → overhead.
Сложно для multiple pointers в одной операции.

В Go: HP не нужны (GC решает), но в low-level коде (использующем runtime/debug.SetFinalizer или unsafe) могут быть полезны.

Folly’s HP: folly/synchronization/Hazptr.h — production-grade implementation в Facebook’s library.

2.10 Epoch-Based Reclamation (EBR)

Идея EBR:

Глобальный счётчик epoch’ов.
Каждый поток входит в epoch перед чтением shared структур.
Reclaimed objects складываются в bucket по epoch’у.
Можно освободить bucket epoch E, когда все потоки прошли через E+2.

Global epoch: 5
Thread states:
  T1: epoch=5, reading
  T2: epoch=4, paused
  T3: epoch=5, reading

Cannot reclaim bucket E=4 (T2 might still hold pointers from E=4).
Can reclaim bucket E=3 (no thread in epoch ≤ 3).

package ebr

import (
    "sync/atomic"
)

type EBR struct {
    epoch    atomic.Uint64
    threads  []*threadState // per-thread state
    retired  [3][]any       // retired objects by epoch mod 3
}

type threadState struct {
    active atomic.Bool
    epoch  atomic.Uint64
}

func (e *EBR) Enter(t *threadState) {
    t.epoch.Store(e.epoch.Load())
    t.active.Store(true)
    // memory barrier
}

func (e *EBR) Exit(t *threadState) {
    t.active.Store(false)
}

func (e *EBR) Retire(obj any) {
    epoch := e.epoch.Load()
    e.retired[epoch%3] = append(e.retired[epoch%3], obj)
    if shouldAdvance() {
        e.advance()
    }
}

func (e *EBR) advance() {
    cur := e.epoch.Load()
    // check if all threads in epoch ≥ cur
    for _, t := range e.threads {
        if t.active.Load() && t.epoch.Load() < cur {
            return // can't advance yet
        }
    }
    newEpoch := cur + 1
    e.epoch.Store(newEpoch)
    // reclaim bucket (newEpoch+1)%3 (which is now safe)
    for _, obj := range e.retired[(newEpoch+1)%3] {
        _ = obj // free
    }
    e.retired[(newEpoch+1)%3] = nil
}

Преимущества EBR:

Меньше overhead per-read (один atomic increment + write).
Простая реализация.

Недостатки:

Memory amplification под высокой нагрузкой (если потоки долго не выходят из epoch, retired buckets растут).
Stalled thread (например в runtime.LockOSThread + блокировка) задерживает reclamation.

Используется в: Crossbeam (Rust), Folly’s UnboundedQueue, hazardous Linux kernel paths.

2.11 Read-Copy-Update (RCU)

RCU — паттерн из Linux kernel. Идея:

Readers: не блокируются, читают через atomic pointer load.
Writers: создают копию, модифицируют её, атомарно swap’ят pointer. Старая copy не освобождается до конца “grace period” — момента, когда все pre-swap readers завершились.

Initial:
  ptr -> [v1: {a:1, b:2}]

Reader 1 starts:
  local = atomic.Load(&ptr) // local = v1

Writer changes b:
  v2 = copy(*ptr)
  v2.b = 5
  atomic.Store(&ptr, v2)
  // ptr -> v2: {a:1, b:5}
  // wait for grace period
  // free v1

Reader 1 finishes (still using v1 — safe).
Reader 2 starts:
  local = atomic.Load(&ptr) // local = v2

В Linux kernel “grace period” определяется через quiescent states (когда поток точно не держит указатель — например, переключение контекста).

В Go — упрощённая RCU через atomic.Pointer:

type RCU[T any] struct {
    p atomic.Pointer[T]
    mu sync.Mutex // writers serialize
}

func (r *RCU[T]) Read() *T {
    return r.p.Load()
}

func (r *RCU[T]) Update(f func(*T) *T) {
    r.mu.Lock()
    defer r.mu.Unlock()
    old := r.p.Load()
    new := f(old)
    r.p.Store(new)
    // old will be GC'd when no readers hold reference
}

Когда подходит:

Read-mostly (1000:1).
Структура небольшая (копирование не дорого).
Eventual consistency (write visibility не моментальная).

Используется в: Linux kernel dcache, routing tables, RCU lists. В Go — atomic.Pointer patterns в gRPC, etcd.

2.12 Performance trade-offs

Бенчмарки (примерные, на современном x86):

Операция	sync.Mutex	sync.RWMutex (read)	atomic.AddInt64	Lock-free MPMC	channel buffered
Uncontended	25 ns	30 ns	5 ns	20 ns	50 ns
4 threads contended	200 ns	50 ns (reads)	100 ns (cache contention)	300 ns	500 ns
32 threads contended	5 µs	500 ns	2 µs	2 µs	50 µs

Выводы:

Под низкой contention — mutex быстрее lock-free (spin path).
Под высокой contention — lock-free выигрывает (нет park/unpark), но cache line bouncing замедляет.
Channels медленнее mutex’ов всегда (overhead select + scheduler involvement).
Sharded counters уделывают atomic.AddInt64 под contention (см. файл 08).

Главное правило: измеряйте профайлером, не угадывайте.

3. Gotchas

⚠️ Vyukov MPSC: race между Swap и Store of next pointer. Consumer может увидеть queue “пустой”, когда producer уже в Swap. Это известный feature — consumer должен retry.
⚠️ Michael-Scott Dequeue без helping. Если поток падает между двумя CAS’ами, очередь зависнет. Helping mechanism (читай tail, помогай ему продвинуться) — необходим.
⚠️ ABA в Go через sync.Pool. Если ноды лежат в Pool и переиспользуются, ABA реален. Lock-free структуры + Pool = очень осторожно.
⚠️ Memory ordering на ARM64 слабее x86. CAS на x86 имплицитно full barrier. На ARM нужны explicit acquire/release. Go atomic пакет это абстрагирует, но при использовании unsafe ловите за хвост.
⚠️ False sharing между head/tail в SPSC queue. Должны быть в разных cache lines. Без padding — 5-10x slowdown под нагрузкой.
⚠️ Bounded MPMC (Vyukov slot-based) backpressure. При full очереди возвращает false. Caller должен решать — retry, drop, или backoff. Простой retry-loop = busy wait.
⚠️ Hazard Pointers требуют global registry потоков. В Go горутины умирают, регистрация сложнее, чем в pthread.
⚠️ EBR + LockOSThread = stalled epoch advancement. Если поток заблокирован syscall’ом, epoch не продвигается → memory not reclaimed.
⚠️ RCU не для frequent writes. Каждый write — копия. При 50/50 read/write — катастрофа.
⚠️ atomic.Pointer[T] zero value — pointer to T’s zero. Если T — interface, Load() вернёт (*T)(nil), и вызов методов на нём — nil deref.
⚠️ Linearizability нарушается при retry с разными arguments. Если в retry-loop’е вы меняете input, операция не атомарна логически. Например: Pop() retry с другим item — bug.
⚠️ runtime.Gosched() в hot retry loop дёргает scheduler. Под высокой contention хуже, чем PAUSE.
⚠️ Wait-free алгоритмы часто экспоненциально медленнее lock-free. Wait-free queue (Kogan-Petrank) теоретически лучше, но на практике lock-free MS-queue быстрее.
⚠️ Memory leak в Vyukov MPSC. Если consumer не успевает за producer’ами, queue растёт неограниченно (unbounded). Для bounded — нужен другой алгоритм.
⚠️ Lock-free структура с unsafe.Pointer не работает с GC stack scanning корректно. Go GC scan’ит stack и heap, и если pointer хранится только в uintptr (escaped), GC может его освободить под вами. Используйте unsafe.Pointer, не uintptr.

4. Real cases

4.1 Go runtime: work-stealing run queue

Каждый P (processor) имеет local circular run queue (256 slots). Это SPSC, но с одной хитростью: другие P могут “украсть” половину очереди (work stealing). Когда P0 пуст, он CAS’ит на P1’s queue и тащит половину.

Реализация в runtime/proc.go:

Локальный enqueue — atomic ops без CAS (один writer).
Steal — CAS на queue head + tail.
Если local queue полна — overflow в global queue (mutex-protected).

4.2 Folly UnboundedQueue (Facebook)

folly::UnboundedQueue — production MPMC очередь. Использует:

Segmented linked list (chunks of array).
EBR (Epoch-Based Reclamation) для memory.
Multiple variants: blocking, lock-free, throughput-optimized.

В Go аналог — lock-free-mpmc-queue или smallnest/queue, но менее зрелые.

4.3 LMAX Disruptor (originally Java)

LMAX Exchange (trading platform). Single producer, multiple consumers ring buffer. Достигают 1M+ ops/sec на ядро. Ключевые оптимизации:

Padding для cache lines.
Batching на consumer side.
Custom wait strategies (busy-spin, yielding, blocking).
Pre-allocated objects (no GC pressure в Java).

В Go full LMAX-style — smallnest/queue (поддерживается).

4.4 Aeron messaging library

Aeron (originally Java, есть Rust/C++ ports) — high-performance messaging. SPSC и MPSC очереди в shared memory. Используется в финтехе, low-latency trading.

4.5 NATS server’s worker pool

NATS использует MPMC очередь работы для распределения сообщений по subscriber’ам. Внутри — собственная реализация (на channel’ах + рейтинг лимиты).

4.6 Kafka Java client’s RecordAccumulator

Kafka producer накапливает batches per-partition. Каждый partition — MPSC queue (много threads пишут в один batch, один thread flush’ит). Используется Deque + synchronization.

5. Вопросы

Что такое linearizability и чем отличается от serializability?
Linearizability — каждая операция выглядит мгновенной в какой-то точке между start и end, real-time order preserved. Serializability — операции эквивалентны какому-то sequential order (не обязательно real-time).
Различие wait-free, lock-free, obstruction-free?
Wait-free: каждый поток progresses за bounded шагов. Lock-free: хотя бы один поток progresses. Obstruction-free: поток progresses в изоляции.
Почему SPSC проще MPMC?
В SPSC нет contention на head (один writer) и tail (один reader). Никаких CAS-loop, только atomic load/store с правильным memory ordering.
Что такое cache line bouncing?
Несколько cores обновляют переменные в одной cache line. MESI протокол пересылает линию между cores при каждом write, что вызывает 10-100x slowdown.
Зачем padding между head и tail в ring buffer?
Чтобы producer (пишет head) и consumer (пишет tail) не модифицировали одну cache line. Без padding — false sharing → 5-10x slowdown.
Что такое Vyukov MPSC queue?
Lock-free intrusive linked list MPSC очередь. Producers atomic’ом swap’ят tail, потом linkают next. Consumer обходит linked list. Используется в Go runtime.
Что такое helping mechanism в Michael-Scott queue?
Если поток видит, что tail отстал (есть next, но tail указывает на предыдущий node), он помогает продвинуть tail через CAS. Без этого dead lock-free guarantee нарушается.
Что такое ABA problem?
Значение A → B → A между Load и CAS, но между этим что-то произошло (например, node был free’d и переиспользован). CAS пройдёт, но операция логически некорректна.
Как решить ABA?
Tagged pointers (counter в high bits), DCAS, Hazard Pointers, EBR. В Go GC решает большинство случаев, но не все.
Что такое Hazard Pointers?
Каждый поток публикует “опасные” указатели (которые сейчас читает). Перед free поток проверяет все hazard pointers всех потоков. Если найден — defer reclamation.
Чем EBR (Epoch-Based Reclamation) отличается от Hazard Pointers?
EBR использует глобальный счётчик epoch’ов. Объекты освобождаются, когда все потоки прошли через epoch+2. Меньше overhead на читателей, но больше memory amplification.
Что такое RCU и где используется?
Read-Copy-Update: readers без блокировки, writers копируют + atomic swap pointer’a. Старые copies удаляются после grace period. Используется в Linux kernel (routing tables, dcache).
Можно ли реализовать RCU в Go?
Упрощённо — да, через atomic.Pointer. Grace period обеспечивается GC. Не полный RCU (нет per-CPU reader lists), но рабочий паттерн.
Когда lock-free медленнее mutex?
Под низкой contention. sync.Mutex имеет spin path (uncontested ~25 ns). Lock-free retry loop’ы дают overhead, особенно если CAS-loop проигрывает много итераций.
Что такое bounded vs unbounded queue?
Bounded: фиксированная capacity, full → backpressure (return false). Unbounded: растёт неограниченно, OOM risk при slow consumer.
Какие memory ordering требования у SPSC ring buffer?
Producer: write data → release-store of head. Consumer: acquire-load of head → read data. На x86 (TSO) автоматически. На ARM64 нужен явный barrier.
Что такое Disruptor pattern (LMAX)?
SPSC/MPSC ring buffer с published cursor, batching, и pre-allocated objects. Достигает 1M+ ops/sec на ядро.
Где в Go runtime используется lock-free?
Run queue каждого P (work-stealing), глобальная очередь горутин (mutex-protected), некоторые semaphore структуры, mcache в memory allocator.
Что такое slot-based bounded MPMC?
Vyukov array-based queue. Каждый slot имеет sequence counter. Producer ждёт slot в состоянии “ready for write”, consumer — в состоянии “ready for read”.
Можно ли использовать unsafe.Pointer для tagged pointers в Go?
Технически да, но unsafe.Pointer должен быть валидным — нельзя arbitrary арифметику. Workaround: пара pointer+tag в struct, CAS через atomic.Value или 128-bit ops (не везде).
Почему GC в Go упрощает lock-free?
Не нужны Hazard Pointers/EBR для memory reclamation — GC сам определит, что объект не reachable. Минус: больше pressure на GC.
Что такое intrusive vs non-intrusive структура?
Intrusive: ноды связи (next, prev) часть самого объекта. Non-intrusive: ноды — отдельные wrapper’ы. Intrusive быстрее (одна аллокация), но навязывает structure объекту.
Когда нужен wait-free вместо lock-free?
Hard real-time системы (медицинские приборы, automotive), где tail latency должна быть bounded. Lock-free может зациклиться при contention.
Чем atomic.Pointer[T] отличается от unsafe.Pointer + atomic.LoadPointer?
atomic.Pointer[T] — типизированный (compile-time safety), нет boxing, нет casts. Performance — почти идентичен (compiler оптимизирует).
Что значит “linearization point”?
Точка в выполнении операции, где effects становятся observable. Для lock-free push — момент successful CAS. Для Dequeue в MS-queue — successful CAS on head.
Чем bounded MPMC проигрывает unbounded?
Bounded имеет фиксированный размер → backpressure при full. Unbounded растёт, но slow consumer = OOM. Trade-off предсказуемость vs throughput.
Можно ли сделать MPMC очередь wait-free?
Теоретически да (Kogan-Petrank queue), но на практике медленнее lock-free MS-queue в average case.
Что такое sequence number в bounded MPMC?
Метка состояния slot’а: seq == pos → готов к write, seq == pos+1 → готов к read. Producer/consumer проверяют, не пересекаются, не пересоздают.
Как профилировать lock-free структуру в Go?
go test -bench -benchmem, -race (но lock-free могут давать false negatives), pprof.Lookup("mutex") (для смешанных). Главное — стресс-тест на разных GOMAXPROCS.
Зачем нужен runtime/debug.SetGCPercent(-1) в lock-free бенчмарках?
Чтобы исключить GC pauses из measurements. Реальный production обычно с GC; но для микробенчмарка lock-free структуры — лучше отключить.

6. Practice

6.1 SPSC ring buffer с правильным cache line padding

Реализуйте SPSC ring buffer для int64. Бенчмарк vs chan int64 buffered. Должны увидеть 5-10x speedup.

6.2 Vyukov MPSC queue

Реализуйте MPSC queue по Vyukov алгоритму. Race detector должен молчать. Бенчмарк под 16 producers / 1 consumer.

6.3 Michael-Scott MPMC queue

Реализуйте MS-queue. Тесты с -race под 16 producers/16 consumers. Проверьте linearizability через property-based test.

6.4 Bounded MPMC (Vyukov slot-based)

Реализуйте bounded MPMC. Сравните с MS-queue под равной нагрузкой. Должны увидеть, что bounded быстрее (cache-friendly).

6.5 EBR в Go

Реализуйте упрощённую EBR (enter/exit/retire/advance). Используйте её в lock-free стэке (даже если GC делает то же, для образовательных целей).

6.6 RCU map

Реализуйте RCUMap[K, V] через atomic.Pointer[map[K]V] + writer mutex. Бенчмарк 99% reads / 1% writes vs sync.RWMutex.

6.7 Lock-free stack

Реализуйте Treiber stack (lock-free LIFO). Проверьте на ABA: переиспользуйте ноды через sync.Pool и проверьте, проявляется ли ABA.

6.8 Work-stealing run queue

Реализуйте work-stealing queue: local SPSC + steal half через CAS. Похоже на Go runtime’s run queue. Сравните throughput с обычным MPMC.

7. Источники

Maged Michael, Michael Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms” (1996) — classic MS-queue paper.
Dmitry Vyukov, 1024cores.net — MPSC queue, bounded MPMC, разные lock-free алгоритмы.
Maurice Herlihy, Nir Shavit, “The Art of Multiprocessor Programming” — главы про linearizability, wait-free, hazard pointers.
Paul McKenney, “Is Parallel Programming Hard?” — глубокое погружение в RCU.
LMAX Disruptor, white paper — pattern для high-throughput pipeline.
Folly UnboundedQueue — Facebook production-grade MPMC.
Crossbeam (Rust) crossbeam-queue — современные реализации SPSC/MPSC/MPMC.
Linux kernel RCU docs, Documentation/RCU/.
Go runtime source — src/runtime/runqueue.go, work-stealing.
Aeron white paper — high-performance messaging library.
Anthony Williams, “C++ Concurrency in Action” — общая теория lock-free.
Kogan, Petrank, “Wait-Free Queues with Multiple Enqueuers and Dequeuers” (2011) — theoretical wait-free MPMC.
Russ Cox, Go Memory Model — official memory ordering.
smallnest/queue Go MPMC implementations — Go-specific.