Research · April 2026 · Updated 2026-04-14 · LIG-443 attribution clarified

Tailslayer
Channel Diversity Is Real

We ported LaurieWired's hedged DRAM read technique to Rust, ran 7 discriminating experiments across 2 platforms, got it wrong twice, and proved her core mechanism right.

Tailslayer hero panel: LaurieWired wielding a HEDGED READ sword against DRAM tail latency spikes, with DRAM cells showing refresh (red) and available (green) rows

The Tax You Didn't Know You Were Paying

Panel 1 — The Physics
DRAM cells leak charge

Every bit in your RAM lives in a cell built from one transistor and one tiny capacitor. The capacitor's charge leaks. Every 7.8μs, the memory controller pauses to refresh an entire row.

During refresh, reads to that channel stall. This has been true since IBM's original DRAM design in 1966.

Panel 2 — The Symptom
Invisible at p50, lethal at p99.99

At the median, you never see it. Your ~80ns read completes fine. But at the 99.99th percentile, you hit a refresh cycle.

Suddenly: 200–400ns stall. For trading systems, real-time audio, agent dispatch — this is a missed tick, a glitch, a stale signal.

Physics explainer: single channel blocked by refresh vs two channels where the read bypasses to the clear channel

The Trick: Race the Refresh

Step 1
Replicate

Store N copies of your hot data, each placed on a different DRAM channel using undocumented address-bit offsets.

Step 2
Hedge

Issue reads to all replicas simultaneously. Each channel has an independent refresh schedule.

Step 3
Win

Take the first result back. Probability all N channels are refreshing at once: (refresh_duty_cycle)^N. Exponential suppression.

P(all stalled) = (refresh_duty_cycle)^N → 0 as N grows
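To make the exponent concrete, here is a minimal sketch that evaluates the formula. The function names are illustrative, and the numbers in the usage note (a 350 ns refresh window every 7.8 μs) are an assumption chosen for the example, not figures measured on this page. The formula also assumes the channels' refresh windows are independent.

```rust
/// Fraction of time a channel spends refreshing:
/// refresh window length over refresh interval (same units).
fn duty_cycle(refresh_window_ns: f64, refresh_interval_ns: f64) -> f64 {
    refresh_window_ns / refresh_interval_ns
}

/// Probability that all `n` replicas are stalled at once, assuming each
/// channel is independently in refresh a fraction `duty_cycle` of the time.
fn p_all_stalled(duty_cycle: f64, n: u32) -> f64 {
    assert!((0.0..=1.0).contains(&duty_cycle), "duty cycle must be in [0, 1]");
    duty_cycle.powi(n as i32)
}
```

With the assumed 350 ns / 7.8 μs window, `duty_cycle` is about 4.5%, and `p_all_stalled` for two replicas is about 0.2%. Each extra replica multiplies the probability by the duty cycle again.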

This is the same conceptual move as hedged requests in distributed systems (Dean & Barroso, "The Tail at Scale"), but applied at the DRAM physical layer. To our knowledge, nobody had done this before LaurieWired.

The Undocumented Part

Which physical address bit selects which DRAM channel? AMD, Intel, and ARM all scramble addresses to distribute load across channels — but the exact bit mapping isn't public.

LaurieWired's trefi_probe discovers this empirically: flush a cache line, reload it, measure the latency. Repeat millions of times. Spike timing reveals the 7.8μs tREFI rhythm. Varying the address bits identifies which one flips the channel.
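The flush, reload, measure loop can be sketched roughly as follows. This is not the code of trefi_probe: `flush_line` and `probe` are hypothetical names, and the sketch times with `std::time::Instant` for portability, which is much coarser than the cycle counters a real probe would read. On non-x86_64 targets the flush is a no-op here, so the sketch only shows the loop structure there.

```rust
use std::time::Instant;

/// Evict the cache line holding `p` from the cache hierarchy (x86_64).
#[cfg(target_arch = "x86_64")]
fn flush_line(p: *const u8) {
    use std::arch::x86_64::{_mm_clflush, _mm_mfence};
    // SAFETY: `p` points into a live buffer owned by the caller.
    unsafe {
        _mm_clflush(p);
        _mm_mfence(); // make sure the flush completes before the timed reload
    }
}

/// Placeholder for other architectures: no flush is performed in this sketch.
#[cfg(not(target_arch = "x86_64"))]
fn flush_line(_p: *const u8) {}

/// Flush one line, reload it, and record the reload latency in nanoseconds.
fn probe(buf: &[u8], offset: usize, iterations: usize) -> Vec<u64> {
    let p: *const u8 = &buf[offset]; // panics if `offset` is out of bounds
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        flush_line(p);
        let start = Instant::now();
        // Volatile read so the compiler cannot drop or hoist the load.
        let value = unsafe { std::ptr::read_volatile(p) };
        std::hint::black_box(value);
        samples.push(start.elapsed().as_nanos() as u64);
    }
    samples
}
```

Running `probe` at different `offset` values and comparing when the latency spikes occur is the idea behind varying the address bits: offsets whose spikes line up in time are probably on the same channel.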

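We ported this to Rust with x86_64 and aarch64 inline assembly. The probe builds and runs on both Intel (DDR5) and Apple Silicon (LPDDR5), but the two behave differently: the x86_64 run below reports a weak refresh signal, while on Apple Silicon the flush-based path never reaches DRAM and the probe reports no periodic signal (see "Apple Silicon: What We Learned the Hard Way").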

    $ sudo chrt -f 99 taskset -c 3 ./trefi-probe
    TSC: 2.985 GHz
    Expected tREFI: 7.8 us = 23279 cycles

    === CALIBRATING ===
    500000 probes: median=202 p90=370 p99=1046 p99.9=1876

    === PERIODICITY ANALYSIS ===
    1T (±15%): 11.1%
    2T (±15%): 3.0%
    Harmonic total: 15.2%
    Histogram peak: 6961 cycles (6.96 us)

    VERDICT: WEAK SIGNAL — tREFI visible via clflush timing
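One plausible way to compute figures like the "1T (±15%)" and "2T (±15%)" lines is to count how many gaps between consecutive latency spikes land near a multiple of the expected refresh interval. The sketch below does that. The definition is our assumption for illustration; the probe's actual analysis may differ, and `harmonic_fraction` is a hypothetical name.

```rust
/// Fraction of inter-spike gaps that fall within `tolerance` (e.g. 0.15)
/// of `multiple * period`. `spike_times` must be sorted ascending and use
/// the same unit as `period` (cycles in the probe output above).
fn harmonic_fraction(spike_times: &[u64], period: f64, multiple: u32, tolerance: f64) -> f64 {
    if spike_times.len() < 2 {
        return 0.0;
    }
    let target = period * multiple as f64;
    let total_gaps = spike_times.len() - 1;
    let hits = spike_times
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]) as f64)
        .filter(|gap| (gap - target).abs() <= tolerance * target)
        .count();
    hits as f64 / total_gaps as f64
}
```

If spikes were unrelated to refresh, only a small share of gaps would fall in each band by chance, so the fractions are only meaningful relative to that chance level.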

Results: The Smoking Gun

Our initial results were wrong twice. First we claimed 22× on WSL2 without realizing it was compressing OS noise, not DRAM refresh. Then we got all-zeros on Apple Silicon because dc civac only evicts to SLC, not DRAM. After redesigning the experiments, here's what actually holds up.

The Parallel Discriminator — M4 Max, bare metal

LaurieWired's design uses parallel reads on pinned cores specifically because refresh stalls are correlated — sequential reads can't break the correlation, only simultaneous reads from different channels can. We implemented this with core-pinned worker threads on M4 Max, 128MB working set (forcing real DRAM misses past the 48MB SLC).

| Experiment | p99.99 | Improvement | Meaning |
| --- | --- | --- | --- |
| Single read baseline | 875 ns | | True DRAM tail (refresh stalls) |
| Sequential hedged | 542 ns | 1.6× | Temporal diversity only |
| Parallel, near (64B, same channel) | 875 ns | 1.0× (none) | Correlated stalls. min(stall, stall) = stall |
| Parallel, far (1MB, cross channel) | 583 ns | 1.5× | Channel diversity decorrelates refresh |
1.0× → 1.5×
same-channel vs cross-channel
Channel-aware placement isn't cosmetic; it's the entire mechanism
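The "issue reads in parallel, keep the first one back" step can be illustrated with plain threads. This is a deliberately simplified sketch, not the discriminator's implementation: it spawns a thread per read and does no core pinning or channel-aware placement, so its own overhead dwarfs a DRAM access. `hedged_read` here is a hypothetical free function that only shows the first-result-wins structure.

```rust
use std::sync::mpsc;
use std::thread;

/// Read every replica on its own thread and return whichever value arrives
/// first. Returns `None` if `replicas` is empty.
fn hedged_read(replicas: &[&u64]) -> Option<u64> {
    let (tx, rx) = mpsc::channel();
    thread::scope(|s| {
        for &replica in replicas {
            let tx = tx.clone();
            s.spawn(move || {
                // Volatile read so the load is actually performed.
                let value = unsafe { std::ptr::read_volatile(replica as *const u64) };
                // The receiver may already have its answer; ignore send errors.
                let _ = tx.send(value);
            });
        }
        drop(tx); // so recv() fails instead of blocking when there are no workers
        rx.recv().ok()
    })
}
```

The "near" and "far" rows in the table differ only in where the two replicas live in memory. The racing logic is identical, which is what lets the comparison isolate placement.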

Spike Correlation: Why Channels Matter

Latency spikes on bare metal are 38× more correlated than independence predicts (Pearson r = 0.23). When the memory controller stalls, all reads stall together. This is exactly why LaurieWired designed cross-channel placement — independent refresh schedules break the correlation. Our parallel reads reduced correlation from r = 0.23 to r = 0.15.
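Both statistics in this paragraph can be computed from two paired latency series. The sketch below shows one reasonable definition of each: Pearson r over the raw latencies, and the ratio of the observed joint spike rate to the rate independence would predict. The spike threshold and the exact definition behind the 38× figure are assumptions here, not taken from the original analysis.

```rust
/// Pearson correlation between two equal-length latency series.
fn pearson(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let n = a.len() as f64;
    let mean_a = a.iter().sum::<f64>() / n;
    let mean_b = b.iter().sum::<f64>() / n;
    let (mut cov, mut var_a, mut var_b) = (0.0, 0.0, 0.0);
    for (x, y) in a.iter().zip(b) {
        cov += (x - mean_a) * (y - mean_b);
        var_a += (x - mean_a).powi(2);
        var_b += (y - mean_b).powi(2);
    }
    cov / (var_a.sqrt() * var_b.sqrt())
}

/// How much more often both reads spike together than independence predicts:
/// P(a and b spike) / (P(a spikes) * P(b spikes)). A value of 1.0 means the
/// spikes look independent. Undefined (NaN or inf) if either series never spikes.
fn joint_spike_ratio(a: &[f64], b: &[f64], threshold: f64) -> f64 {
    assert_eq!(a.len(), b.len());
    let n = a.len() as f64;
    let p_a = a.iter().filter(|&&x| x > threshold).count() as f64 / n;
    let p_b = b.iter().filter(|&&y| y > threshold).count() as f64 / n;
    let p_both = a
        .iter()
        .zip(b)
        .filter(|(x, y)| **x > threshold && **y > threshold)
        .count() as f64
        / n;
    p_both / (p_a * p_b)
}
```

Because spikes are rare, a modest Pearson r over all samples can coexist with a very large joint-spike ratio: the ratio looks only at the tail, where the correlation is concentrated.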

WSL2 Caveat — Intel i9-13900K, DDR5

Our initial 29× result on WSL2 is real but misleading. The baseline p99.99 was 9,943 ns, which reflects Hyper-V VM exits, not DRAM refresh. Hedging compressed OS noise via order statistics (the order-statistics prediction matched the measurement within 1%). This is useful but doesn't test the DRAM refresh mechanism. Bare-metal Linux with a known channel hash would be the proper Intel test.
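The order-statistics prediction mentioned here follows from P(min > t) = P(X > t)^n for n independent, identically distributed reads. A minimal sketch of that calculation, using a nearest-rank quantile on the baseline samples (the function names are illustrative, and IID latencies are the assumption being tested):

```rust
/// Nearest-rank quantile of a sample, for q in (0, 1].
fn quantile(samples: &[u64], q: f64) -> u64 {
    assert!(!samples.is_empty() && q > 0.0 && q <= 1.0);
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Predicted hedged q-quantile if the `n` replicas' latencies were IID:
/// P(min > t) = P(X > t)^n, so the hedged q-quantile equals the
/// single-read quantile at 1 - (1 - q)^(1/n).
fn predicted_hedged_quantile(baseline: &[u64], q: f64, n: u32) -> u64 {
    let single_q = 1.0 - (1.0 - q).powf(1.0 / n as f64);
    quantile(baseline, single_q)
}
```

For q = 0.9999 and n = 2 the equivalent single-read quantile is 0.99, so under independence the hedged p99.99 should land near the baseline p99. A measured result close to that prediction is what you expect from uncorrelated noise such as VM exits, regardless of where the replicas sit in DRAM.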

Apple Silicon: What We Learned the Hard Way

dc civac (ARM's cache flush instruction) evicts to Apple's 48MB System Level Cache, not DRAM. SLC latency is ~9ns; DRAM latency is ~114ns. Every clflush-style (flush, then reload) measurement on Apple Silicon was measuring SLC hits, which is why all results showed zero. Working set overflow (128MB buffer > 48MB SLC) is required to force genuine DRAM access.

Update — April 13-14, 2026: Repaired Pointer-Chase Instrument

After the original dc civac → SLC discovery, we rebuilt the Apple Silicon probe path from scratch as a randomized pointer-chase ring larger than the 48 MB SLC, with data-dependent loads that force the timer to measure real tail behavior instead of collapsing to cache hits. The new binary is tailslayer-dram-discriminator.
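A randomized pointer-chase ring of this kind can be sketched as follows. This is an illustration of the general technique, not the code in tailslayer-dram-discriminator: the names are hypothetical, the slots are plain indices rather than cache-line-sized nodes, and the small xorshift generator is there only to avoid external crates. Sattolo's algorithm is used because it always yields a single cycle, so the chase cannot get trapped in a short loop that fits in cache.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Build `next[i]` so that following it from any slot visits every slot
/// exactly once before returning (one single cycle), via Sattolo's algorithm.
fn build_ring(len: usize, seed: u64) -> Vec<usize> {
    let mut next: Vec<usize> = (0..len).collect();
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    for i in (1..len).rev() {
        let j = (rng.next() % i as u64) as usize; // j in 0..i, never i itself
        next.swap(i, j);
    }
    next
}

/// Follow the ring for `steps` hops. Each load's address depends on the
/// previous load's result, so the hardware cannot overlap or prefetch them.
fn chase(next: &[usize], start: usize, steps: usize) -> usize {
    let mut at = start;
    for _ in 0..steps {
        at = next[at];
    }
    std::hint::black_box(at)
}
```

For a real measurement the ring's total footprint would need to exceed the 48 MB SLC, and each slot would normally be padded to a full cache line so that every hop touches a different line.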

We reran this repaired instrument on the local M4 Max on April 13, 2026 and again on April 14, 2026 (same hardware). Both days show a consistent tail-reduction effect under the repaired path. The exact numbers move between runs as thermal state and background noise floor change, which is the honest story and is preserved below rather than averaged away.

| Run | n | Baseline p99.99 | Hedged p99.99 | Ratio | Pearson r |
| --- | --- | --- | --- | --- | --- |
| 2026-04-13 (200k) | 200,000 | 1860 ns | 299 ns | 6.2× | 0.1178 |
| 2026-04-13 (2M) | 2,000,000 | 462 ns | 42 ns | 11.1× | 0.6117 |
| 2026-04-14 (200k) | 200,000 | 79 ns | 30 ns | 2.7× | 0.6788 |
| 2026-04-14 (2M) | 2,000,000 | 86 ns | 27 ns | 3.1× | 0.4623 |

Same-path replay control stays near zero across every run (p99.99 ≈ 1 ns), which rules out “the tail is the timer or the chase itself.” Correlation ratios over independence stay well above 1 in every run (492×, 562×, 504×, 562×), so the improvement is not purely the order-statistics collapse an IID tail would predict.

On April 14, 2026 we also reran the full broken-path family on this M4 Max and reproduced each original 2026-04-07 artifact a second time: tailslayer-bench still returns hedged = 0 ns, trefi-probe still reports NO PERIODIC SIGNAL at roughly 11% harmonics, and the 7-experiment tailslayer-discriminator still returns the all-zero same-address control with Pearson r near zero. Those are measurement artifacts of the clflush-style probe path on Apple Silicon, not proofs of a fast or a broken system. They are catalogued here so the failure of the old instrument stays on the record instead of being quietly dropped.

Claim boundary. The repaired pointer-chase instrument shows a real tail-reduction effect on an Apple M4 Max 128 GB, consistent across four runs on two different days. It is not yet a channel-bit proof. The separate parallel-discriminator channel-diversity result in the table above is from April 7, 2026; it has not been re-run on April 14 because the original C binary is no longer on this machine and we did not want to silently substitute a rebuild for the original artifact.

The Trade

You Pay
Memory & cores

N× memory for N replicas. One pinned core per replica for spin-wait workers. Channel-bit discovery per hardware config.

You Get
Decorrelated reads

1.5× p99.99 on M4 Max with unverified channel placement. LaurieWired reports 15× with verified channel bits on Intel/AMD. Full channel discovery is the gap.

This is the same trade distributed systems have made for decades: replicate for reliability. LaurieWired's insight is applying it at the memory level, exploiting the physical fact that channels refresh independently. Our contribution is the Rust port, the Apple Silicon SLC discovery, and the parallel discriminator proving the mechanism.

Code Availability — LaurieWired's C++ is public, our Rust port is not

LaurieWired's original work is a public C++ library at github.com/LaurieWired/tailslayer. This is the upstream inspiration for everything on this page. Cloning that repository gives you LaurieWired's C++ code, which is the work that proved the core mechanism first, and every experiment here is downstream of it.

Our Rust port — including the tailslayer-bench, trefi-probe, and tailslayer-dram-discriminator binaries referenced throughout this page, together with a HedgedReader<T> type backed by hugepage-allocated rings — lives in a private monorepo. It is not available at the LaurieWired repository and is not currently published. The dc civac → SLC discovery, the repaired pointer-chase instrument, and the numbers shown above all belong to the private Rust port.

This clarification closes LIG-443, an internal tracking item opened after noticing that an earlier version of this page described implementation details inline and then footer-linked to LaurieWired's repository in a way that could lead a reader to assume the described Rust binaries were available at that link. They are not.

Implementation

Our Rust port provides a HedgedReader<T> backed by 1GB hugepages (falling back to 2MB or regular pages). Data is replicated across channels with address offsets discovered by trefi-probe.

    use tailslayer::{Config, HedgedReader};

    let mut reader = HedgedReader::<u64>::new(Config {
        channel_bit: 8,
        channel_offset: 256,
        num_channels: 2,
        num_replicas: 2,
    })?;

    reader.insert(tick_data);

    // Hedged read: races both replicas, returns fastest
    let (value, cycles) = unsafe { reader.hedged_read(idx) };
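The relationship between channel_bit: 8 and channel_offset: 256 in the config above is simply 1 << 8 = 256. Under the simplifying assumption that a single address bit selects between two channels, replica placement could look like the sketch below. The helper names are hypothetical and are not part of the HedgedReader API; real memory controllers may hash several address bits together, which is why the bit has to be discovered empirically.

```rust
/// Byte offset of replica `k` from the base address, assuming replicas are
/// spaced `1 << channel_bit` bytes apart.
fn replica_offset(channel_bit: u32, k: usize) -> usize {
    k << channel_bit
}

/// Channel index (0 or 1) of an address under the single-bit model.
fn channel_of(addr: usize, channel_bit: u32) -> usize {
    (addr >> channel_bit) & 1
}
```

With channel_bit = 8, replica 0 sits at offset 0 on channel 0 and replica 1 at offset 256 on channel 1. Hugepages matter for this scheme because within a 1GB page the low 30 bits of a virtual address equal those of the physical address, so an offset chosen in virtual memory flips the intended physical bit.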

Cross-platform: x86_64 inline assembly (rdtsc, clflush, mfence) and aarch64 equivalents (cntvct_el0, dc civac, dmb sy). Runs on Intel, AMD, Apple Silicon, and Graviton.
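As an illustration of how the timer half of that split can be written, here is a minimal sketch of a per-architecture counter read. It is not the port's code: `read_counter` is a hypothetical name, and the fallback for other architectures is our addition. Note that the two counters tick at different rates (the TSC near CPU frequency, cntvct_el0 at the much slower system counter frequency), so raw values are not comparable across platforms.

```rust
/// Read the x86_64 time-stamp counter.
#[cfg(target_arch = "x86_64")]
fn read_counter() -> u64 {
    // SAFETY: rdtsc only reads a counter register.
    unsafe { std::arch::x86_64::_rdtsc() }
}

/// Read the aarch64 virtual counter (cntvct_el0).
#[cfg(target_arch = "aarch64")]
fn read_counter() -> u64 {
    let ticks: u64;
    // SAFETY: mrs from cntvct_el0 only reads a system register.
    unsafe {
        std::arch::asm!("mrs {t}, cntvct_el0", t = out(reg) ticks, options(nomem, nostack));
    }
    ticks
}

/// Fallback for other architectures: nanoseconds since first call.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn read_counter() -> u64 {
    use std::sync::OnceLock;
    use std::time::Instant;
    static START: OnceLock<Instant> = OnceLock::new();
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64
}
```

A careful measurement would also add serializing or barrier instructions around the read so that the timed load cannot drift outside the measured window, which is the role of mfence and dmb sy listed above.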

1966

IBM invents DRAM. Refresh penalty baked in from day one.

2013

Dean & Barroso publish "The Tail at Scale." Hedged requests become standard in distributed systems.

April 2026

LaurieWired applies hedging at the DRAM physical layer. Parallel reads on pinned cores across verified channels. 15× p99.99 in C++.

April 2026

We port to Rust, discover dc civac → SLC (not DRAM) on Apple Silicon. Parallel discriminator validates channel diversity: same-channel 1.0× (no improvement), cross-channel 1.5×.

Corrections Log

This page maintains a dated, append-only corrections log. Every future material change to the claims above will be dated, named, and preserved below as signal about the document's credibility, rather than quietly rewritten.

2026-04-07

Initial publication. Results table claimed channel diversity from a clflush path on Apple Silicon. The bench binary showed hedged p99.99 = 0 ns on M4 Max, which we momentarily interpreted as “TAIL ELIMINATED” instead of recognizing it as the dc civac → SLC measurement artifact it actually was.

2026-04-13

Added the “we got it wrong twice” framing after confirming the April 7 Mac SSH run actually ended NO SIGNIFICANT IMPROVEMENT, not success. Added the Apple Silicon SLC finding as a named failure mode. First repaired pointer-chase runs recorded (200k: 6.2×, 2M: 11.1×).

2026-04-14

Reran the repaired instrument a second day on this M4 Max (200k: 2.7×, 2M: 3.1×). Reran every broken path a second day, each still broken. Promoted the LaurieWired C++ / our Rust port attribution split from a 0.6rem footer line to a dedicated Code Availability section. Added the update block for the repaired instrument numbers. Closed LIG-443.

Original technique, mechanism design, and C++ implementation by LaurieWired
Rust port, Apple Silicon investigation, and parallel discriminator by Danielle Fong + Claude

GitHub: LaurieWired/tailslayer (original C++ lib) · hyperclaude.cc
Rust port is not yet public. Raw data: discriminator-hyperion-2026-04-07.txt, parallel-discriminator-m4max-2026-04-07.txt