
KV Cache Memory Costs

A back-of-the-envelope model that makes KV cache memory growth legible, then shows how semantic scoping (MSP-1 style) can reduce the effective tokens carried forward — translating directly into VRAM headroom.

Type: Original analysis · Focus: Inference memory · Status: Draft

1) KV cache sizing

For a decoder-only transformer, KV cache scales linearly with context length and concurrent sequences.

Core approximation (FP16/BF16)

A commonly used approximation is:

KV_bytes_per_token ≈ 2 × num_layers × num_kv_heads × head_dim × bytes_per_element

The factor 2 accounts for both K and V.

Total KV memory

KV_GB ≈ (KV_bytes_per_token × tokens × concurrency) / 1e9
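
A minimal Python sketch of both formulas (the function names are illustrative, not taken from any library):

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # The factor 2 covers both the K and the V tensor at each layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_gb(num_layers, num_kv_heads, head_dim, tokens, concurrency=1, bytes_per_element=2):
    # Decimal gigabytes (1e9 bytes), matching the formula above.
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element)
    return per_token * tokens * concurrency / 1e9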

2) “Real numbers” examples

Concrete reference points you can plug into the model.

Example A — LLaMA 3 70B (published per-token KV size)

  • KV heads: K = 8
  • Head dim: H = 128
  • Layers: L = 80

KV per token ≈ 160 kB under the framing:

2 × 8 × 128 × 80 ≈ 164k K and V values per token → ~160 kB/token at one byte per element

At 32k tokens, KV ≈ 5.3 GB per sequence. At FP16/BF16 (two bytes per element, as in the core approximation above) both figures double: ~320 kB/token and ~10.7 GB per sequence.

Rule of thumb:

LLaMA 3-70B KV ≈ 0.00016 GB/token ≈ 0.16 MB/token

Notes: Values above are reproduced as presented in the referenced example and are suitable for back-of-the-envelope planning.
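
Plugging Example A into the sketch from section 1 reproduces these figures; bytes_per_element=1 is the implicit assumption behind the ~160 kB framing, and FP16/BF16 doubles everything:

# LLaMA 3 70B: 80 layers, 8 KV heads, head dim 128, one byte per element
per_token_bytes = kv_bytes_per_token(80, 8, 128, bytes_per_element=1)   # 163,840 B ≈ 160 kB
per_seq_gb = kv_gb(80, 8, 128, tokens=32_768, bytes_per_element=1)      # ≈ 5.37 GB, i.e. the ~5.3 GB above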

Example B — LLaMA 2 7B (common sizing reference)

Using the same general KV sizing approach, a widely cited concrete example is:

~0.5 MB/token for LLaMA-2-7B in FP16

Because this is a per-token figure, memory rises rapidly with long prompts and concurrency.
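
The same sketch recovers the widely cited 7B figure, assuming the standard LLaMA-2-7B shape (32 layers, 32 KV heads since the 7B model uses full multi-head attention rather than GQA, head dim 128):

# LLaMA 2 7B in FP16: 2 × 32 × 32 × 128 × 2 bytes = 524,288 B ≈ 0.5 MB/token
per_token_7b = kv_bytes_per_token(32, 32, 128, bytes_per_element=2)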

3) Convert “semantic bloat” into VRAM pressure

Use a crisp published reference number to translate context and concurrency into real memory residency.

Baseline scenario (no semantic scoping)

  • Context length: 32,768 tokens
  • Concurrency: 16 simultaneous sequences
  • KV per sequence: ~5.3 GB

Total KV for serving

Total KV ≈ 5.3 GB × 16 ≈ 84.8 GB

That is KV cache alone — before weights, activations, fragmentation, allocator overhead, and other runtime costs.
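
The same total falls out of one multiplication, or from the kv_gb sketch in section 1 without the intermediate rounding:

total_kv_gb = 5.3 * 16                                    # ≈ 84.8 GB with the rounded per-sequence figure
total_kv_gb_exact = kv_gb(80, 8, 128, tokens=32_768,
                          concurrency=16,
                          bytes_per_element=1)            # ≈ 85.9 GB unrounded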

4) What MSP-1 changes in the math

MSP-1 doesn’t compress tensors. It reduces the effective tokens that must be carried forward by making intent and scope explicit, so you don’t drag irrelevant context through the working set.

Model it as a reduction in active tokens

tokens_new = tokens_base × (1 − r)

where r is the fractional reduction in required context (e.g., 0.25, 0.50).

KV is linear in tokens

KV_new = KV_base × (1 − r)

Using 5.3 GB/sequence @ 32k tokens

  • 25% token reduction → KV per sequence: ~4.0 GB
  • 50% token reduction → KV per sequence: ~2.65 GB
  • 75% token reduction → KV per sequence: ~1.33 GB

Multiply by concurrency (16 sequences)

  • 25% reduction → ~64 GB total KV (saves ~21 GB)
  • 50% reduction → ~42 GB total KV (saves ~42 GB)
  • 75% reduction → ~21 GB total KV (saves ~64 GB)
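
A short loop reproduces both sets of numbers (up to rounding); the 5.3 GB baseline and the concurrency of 16 are the assumptions stated above:

kv_base_gb = 5.3        # per sequence at ~32k tokens (Example A framing)
concurrency = 16
for r in (0.25, 0.50, 0.75):
    kv_new = kv_base_gb * (1 - r)                # per-sequence KV after scoping
    total_new = kv_new * concurrency             # across all 16 sequences
    saved = (kv_base_gb - kv_new) * concurrency  # total VRAM freed
    print(f"r={r:.2f}: {kv_new:.2f} GB/seq, {total_new:.1f} GB total, saves {saved:.1f} GB")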

These are not “nice to have” savings — they directly determine:

  • whether you fit on N GPUs
  • how much batch / concurrency you can serve
  • whether you spill KV off-device (and pay latency)

5) Economic translation

Inference economics typically hinge on three knobs.

  1. Max concurrency per GPU (higher = cheaper per request)
  2. Context length you can sustain without paging/offload
  3. Latency (paging/offload kills QoS; degraded QoS kills revenue)

KV cache is a first-order term in all three. Even conservative wins (e.g., 20–30% less carried-forward context) can become:

  • one fewer GPU for the same throughput, or
  • higher concurrency at the same latency, or
  • same concurrency at lower latency (less paging/offload)
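
To make the first knob concrete, here is one way to bound concurrency from a KV budget. The 80 GB device and the 40 GB reserved for weights, activations, and runtime overhead are purely illustrative assumptions, not measurements:

def max_concurrency(device_gb, reserved_gb, kv_per_seq_gb):
    # How many sequences' KV caches fit in the memory left after weights/runtime.
    return int((device_gb - reserved_gb) // kv_per_seq_gb)

baseline     = max_concurrency(80, 40, 5.3)          # 7 sequences at 32k tokens
with_scoping = max_concurrency(80, 40, 5.3 * 0.70)   # r = 0.30 → 10 sequences

Under these assumed numbers, a 30% reduction in carried-forward context buys roughly 40% more concurrency on the same device.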

6) A realistic MSP-1 range to use

When presenting to a skeptical architect, it’s helpful to keep claimed reductions conservative and scenario-dependent.

Messy real-world flows

10–30% reduction in carried-forward context (safe, conservative framing).

Agent handoffs & tool orchestration

30–60% reduction (more plausible when intent is crisp and roles are explicit).

Tightly controlled domains

60%+ only when “semantic manifests” replace large textual context blocks (possible, but claim carefully).