We ported LaurieWired's hedged DRAM read technique to Rust, ran 7 discriminating experiments across 2 platforms, got it wrong twice, and proved her core mechanism right.
Every bit in your RAM lives in a tiny capacitor. Charge leaks. Every 7.8μs, the memory controller pauses to refresh a batch of rows.
During refresh, reads to that channel stall. This has been true since IBM's original DRAM design in 1966.
At the median, you never see it. Your ~80ns read completes fine. But at the 99.99th percentile, you hit a refresh cycle.
Suddenly: 200–400ns stall. For trading systems, real-time audio, agent dispatch — this is a missed tick, a glitch, a stale signal.
Store N copies of your hot data, each placed on a different DRAM channel using undocumented address-bit offsets.
Issue reads to all replicas simultaneously. Each channel has an independent refresh schedule.
Take the first result back. Probability all N channels are refreshing at once: (duty_cycle)^N. Exponential suppression.
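To put rough numbers on that exponent, here is a minimal sketch. It assumes refresh windows on different channels are fully independent, which is the idealized case. The function name `all_channels_refreshing` and the 350 ns refresh duration used in the note below are illustrative assumptions, not measurements from this page.

```rust
/// Probability that all `n` channels are mid-refresh at the same instant.
///
/// Assumption: each channel spends a fraction `duty_cycle` of its time
/// refreshing, and the channels' refresh schedules are independent.
fn all_channels_refreshing(duty_cycle: f64, n: u32) -> f64 {
    assert!((0.0..=1.0).contains(&duty_cycle), "duty cycle is a fraction");
    duty_cycle.powi(n as i32)
}
```

With a hypothetical 350 ns refresh every 7.8 μs, `all_channels_refreshing(350.0 / 7800.0, 1)` is about 4.5% and the two-channel case drops to about 0.2%. Real channels are unlikely to be perfectly independent, so treat this as an upper bound on the benefit.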
This is the same conceptual move as hedged requests in distributed systems (Dean & Barroso, "The Tail at Scale"), but applied at the DRAM physical layer. Nobody had done this before LaurieWired.
Which physical address bit selects which DRAM channel? AMD, Intel, and ARM all scramble addresses to distribute load across channels — but the exact bit mapping isn't public.
LaurieWired's trefi_probe discovers this empirically: flush a cache line,
reload it, measure the latency. Repeat millions of times. Spike timing reveals the
7.8μs tREFI rhythm. Varying the address bits identifies which one flips the channel.
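The timing analysis can be sketched in a few lines. This is not the code from trefi_probe or from our port: it assumes the flush-and-reload loop has already produced timestamped latencies, and it uses the median gap between spikes as a deliberately simple period estimate. `Sample`, `estimate_spike_period_ns`, and the threshold parameter are hypothetical names for the example.

```rust
/// One timed reload: when it happened and how long it took, in nanoseconds.
#[derive(Clone, Copy)]
struct Sample {
    t_ns: u64,
    latency_ns: u64,
}

/// Estimate the spike period as the median gap between consecutive spikes.
///
/// Assumes `samples` is ordered by time. Returns `None` when fewer than two
/// samples exceed `threshold_ns`.
fn estimate_spike_period_ns(samples: &[Sample], threshold_ns: u64) -> Option<u64> {
    // Keep only the timestamps of slow reloads.
    let spikes: Vec<u64> = samples
        .iter()
        .filter(|s| s.latency_ns > threshold_ns)
        .map(|s| s.t_ns)
        .collect();
    if spikes.len() < 2 {
        return None;
    }
    // Gaps between neighbouring spikes; saturating_sub guards against
    // out-of-order timestamps instead of panicking.
    let mut gaps: Vec<u64> = spikes
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]))
        .collect();
    gaps.sort_unstable();
    Some(gaps[gaps.len() / 2])
}
```

A real probe will not hit every refresh, so many gaps come out as multiples of the true period. A histogram or autocorrelation over the gaps copes with that better than a single median.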
We ported this to Rust with x86_64 and aarch64 inline assembly. The probe builds and runs on both Intel and Apple Silicon, targeting DDR5 and LPDDR5, but as the results below show, running is not the same as detecting a refresh signature.
Our initial results were wrong twice. First we claimed 22× on WSL2 without realizing
it was compressing OS noise, not DRAM refresh. Then we got all-zeros on Apple Silicon because
dc civac only evicts to SLC, not DRAM. After redesigning the experiments, here's
what actually holds up.
LaurieWired's design uses parallel reads on pinned cores specifically because refresh stalls are correlated — sequential reads can't break the correlation, only simultaneous reads from different channels can. We implemented this with core-pinned worker threads on M4 Max, 128MB working set (forcing real DRAM misses past the 48MB SLC).
| Experiment | p99.99 | Improvement | Meaning |
|---|---|---|---|
| Single read baseline | 875 ns | — | True DRAM tail (refresh stalls) |
| Sequential hedged | 542 ns | 1.6× | Temporal diversity only |
| Parallel — near (64B, same channel) | 875 ns | 1.0× — NONE | Correlated stalls. min(stall, stall) = stall |
| Parallel — far (1MB, cross channel) | 583 ns | 1.5× | Channel diversity decorrelates refresh |
Latency spikes on bare metal are 38× more correlated than independence predicts (Pearson r = 0.23). When the memory controller stalls, all reads stall together. This is exactly why LaurieWired designed cross-channel placement — independent refresh schedules break the correlation. Our parallel reads reduced correlation from r = 0.23 to r = 0.15.
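For readers who want to reproduce the two correlation figures, here is one way they could be computed from paired latency series. This is a sketch, not the discriminator's code. In particular, reading "N× more correlated than independence predicts" as the observed joint-spike rate divided by the product of the marginal spike rates is our assumption for this example, and `pearson_r` and `co_spike_ratio` are hypothetical names.

```rust
/// Pearson correlation coefficient between two equal-length series.
/// Returns `None` for mismatched lengths, fewer than two points, or zero variance.
fn pearson_r(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}

/// Observed rate of simultaneous spikes divided by the rate that
/// independent spikes would produce (P(x spike) * P(y spike)).
fn co_spike_ratio(x: &[f64], y: &[f64], threshold: f64) -> Option<f64> {
    if x.len() != y.len() || x.is_empty() {
        return None;
    }
    let n = x.len() as f64;
    let px = x.iter().filter(|&&v| v > threshold).count() as f64 / n;
    let py = y.iter().filter(|&&v| v > threshold).count() as f64 / n;
    let both = x
        .iter()
        .zip(y)
        .filter(|(a, b)| **a > threshold && **b > threshold)
        .count() as f64
        / n;
    if px == 0.0 || py == 0.0 {
        return None;
    }
    Some(both / (px * py))
}
```

A ratio near 1 means spikes on the two reads coincide only as often as chance predicts. A ratio well above 1 means the stalls share a cause, which is the case hedging on the same channel cannot help with.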
Our initial 29× result on WSL2 is real but misleading. The baseline p99.99 was 9,943 ns — Hyper-V VM exits, not DRAM refresh. Hedging compressed OS noise via order statistics (prediction matched within 1%). This is useful but doesn't test the DRAM refresh mechanism. Bare-metal Linux with known channel hash would be the proper Intel test.
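The order-statistics prediction mentioned above is simple enough to sketch. If the n hedged reads were independent draws from the baseline distribution, then P(min > t) = P(X > t)^n, so the hedged quantile can be read straight off the baseline samples. The function names and the nearest-rank quantile rule are assumptions made for this example, not the port's implementation.

```rust
/// Nearest-rank quantile of an already sorted, non-empty slice, with q in [0, 1].
fn quantile_sorted(sorted: &[u64], q: f64) -> u64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&q));
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Hedged quantile predicted from baseline samples, assuming the `n` reads
/// are independent and identically distributed.
///
/// A hedged tail mass of (1 - q) corresponds to a baseline tail mass of
/// (1 - q)^(1/n), because all n reads must be slow for the minimum to be slow.
fn predicted_hedged_quantile(baseline_sorted: &[u64], q: f64, n: u32) -> u64 {
    assert!(n >= 1);
    let baseline_tail = (1.0 - q).powf(1.0 / n as f64);
    quantile_sorted(baseline_sorted, 1.0 - baseline_tail)
}
```

With n = 2, the predicted hedged p99.99 is roughly the baseline p99, since 0.01² = 0.0001. If a measured hedged tail matches this prediction closely, the noise being compressed behaves like independent noise, which is what separates it from correlated refresh stalls.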
dc civac (ARM's cache flush instruction) evicts to Apple's 48MB System Level Cache,
not DRAM. SLC latency is ~9ns; DRAM latency is ~114ns. Every flush-based (clflush-style)
measurement on Apple Silicon was measuring SLC hits, which is why all results showed zero.
Working set overflow (128MB buffer > 48MB SLC) is required to force genuine DRAM access.
After the original dc civac → SLC discovery, we rebuilt the Apple
Silicon probe path from scratch as a randomized pointer-chase ring larger
than the 48 MB SLC, with data-dependent loads that force the timer
to measure real tail behavior instead of collapsing to cache hits. The
new binary is tailslayer-dram-discriminator.
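A randomized pointer-chase ring of this kind can be sketched as follows. This is not the code in tailslayer-dram-discriminator: the xorshift generator, Sattolo's algorithm for building a single-cycle permutation, the 64-byte node padding, and all names here are choices made for the example.

```rust
/// One ring node, padded to a 64-byte cache line so every hop touches a new line.
#[repr(align(64))]
#[derive(Clone, Copy)]
struct Node {
    next: usize,
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Build a random permutation that forms one single cycle (Sattolo's algorithm).
/// `ring[i].next` is the index visited after node `i`.
fn build_chase_ring(len: usize, seed: u64) -> Vec<Node> {
    assert!(len >= 2 && seed != 0);
    let mut rng = XorShift64(seed);
    let mut ring: Vec<Node> = (0..len).map(|i| Node { next: i }).collect();
    for i in (1..len).rev() {
        // j is drawn from 0..i and never equals i, which guarantees one cycle.
        let j = (rng.next() % i as u64) as usize;
        ring.swap(i, j);
    }
    ring
}

/// Follow the ring for `steps` hops. Each load address depends on the value
/// returned by the previous load, so the loads cannot be overlapped.
fn chase(ring: &[Node], start: usize, steps: usize) -> usize {
    let mut idx = start;
    for _ in 0..steps {
        idx = ring[idx].next;
    }
    std::hint::black_box(idx)
}
```

At 64 bytes per node, a 128 MB ring needs about 2 million nodes, which is what pushes the working set past the 48 MB SLC. The single-cycle property matters: an ordinary shuffle can produce short cycles that fit in cache and bring back the all-zeros result.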
We reran this repaired instrument on the local M4 Max on April 13, 2026 and again on April 14, 2026 (same hardware). Both days show a consistent tail-reduction effect under the repaired path. The exact numbers move between runs as thermal state and background noise floor change, which is the honest story and is preserved below rather than averaged away.
| Run | n | Baseline p99.99 | Hedged p99.99 | Ratio | Pearson r |
|---|---|---|---|---|---|
| 2026-04-13 (200k) | 200,000 | 1860 ns | 299 ns | 6.2× | 0.1178 |
| 2026-04-13 (2M) | 2,000,000 | 462 ns | 42 ns | 11.1× | 0.6117 |
| 2026-04-14 (200k) | 200,000 | 79 ns | 30 ns | 2.7× | 0.6788 |
| 2026-04-14 (2M) | 2,000,000 | 86 ns | 27 ns | 3.1× | 0.4623 |
Same-path replay control stays near zero across every run (p99.99 ≈ 1 ns), which rules out “the tail is the timer or the chase itself.” Correlation ratios over independence stay well above 1 in every run (492×, 562×, 504×, 562×), so the improvement is not purely the order-statistics collapse an IID tail would predict.
On April 14, 2026 we also reran the full broken-path family on this M4 Max and reproduced each original 2026-04-07 artifact a second time: tailslayer-bench still returns hedged = 0 ns, trefi-probe still reports NO PERIODIC SIGNAL at roughly 11% harmonics, and the 7-experiment tailslayer-discriminator still returns the all-zero same-address control with Pearson r near zero. Those are measurement artifacts of the clflush-style probe path on Apple Silicon, not proofs of a fast or a broken system. They are catalogued here so the failure of the old instrument stays on the record instead of being quietly dropped.
Claim boundary. The repaired pointer-chase instrument shows a real tail-reduction effect on an Apple M4 Max 128 GB, consistent across four runs on two different days. It is not yet a channel-bit proof. The separate parallel-discriminator channel-diversity result in the table above is from April 7, 2026; it has not been re-run on April 14 because the original C binary is no longer on this machine and we did not want to silently substitute a rebuild for the original artifact.
N× memory for N replicas. One pinned core per replica for spin-wait workers. Channel-bit discovery per hardware config.
1.5× p99.99 on M4 Max with unverified channel placement. LaurieWired reports 15× with verified channel bits on Intel/AMD. Full channel discovery is the gap.
This is the same trade distributed systems have made for decades: replicate for reliability. LaurieWired's insight is applying it at the memory level, exploiting the physical fact that channels refresh independently. Our contribution is the Rust port, the Apple Silicon SLC discovery, and the parallel discriminator proving the mechanism.
LaurieWired's original work is a public C++ library at github.com/LaurieWired/tailslayer. This is the upstream inspiration for everything on this page. Cloning that repository gives you LaurieWired's C++ code, which is the work that proved the core mechanism first, and every experiment here is downstream of it.
Our Rust port — including the
tailslayer-bench, trefi-probe, and
tailslayer-dram-discriminator binaries referenced throughout
this page, together with a HedgedReader<T> type backed
by hugepage-allocated rings — lives in a private monorepo. It is
not available at the LaurieWired repository and is not
currently published. The dc civac → SLC discovery,
the repaired pointer-chase instrument, and the numbers shown above all
belong to the private Rust port.
This clarification closes LIG-443, an internal tracking item opened after noticing that an earlier version of this page described implementation details inline and then footer-linked to LaurieWired's repository in a way that could lead a reader to assume the described Rust binaries were available at that link. They are not.
Our Rust port provides a HedgedReader<T> backed by 1GB hugepages
(falling back to 2MB or regular pages). Data is replicated across channels with
address offsets discovered by trefi-probe.
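Since the port itself is not published, here is a rough illustration of the "race the replicas, take the first answer" shape. It is not the port's `HedgedReader<T>`. The type is named `HedgedReaderSketch` to make that clear, it uses ordinary heap allocations instead of hugepages, it does no channel-aware placement or core pinning, and it spawns scoped threads per read where a real design would keep spin-waiting workers alive.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a hedged reader: N independent copies of one value.
struct HedgedReaderSketch<T> {
    replicas: Vec<Box<T>>,
}

impl<T: Copy + Send + Sync> HedgedReaderSketch<T> {
    /// Store `n` separately allocated copies of `value`.
    /// (No control over which DRAM channel each copy lands on.)
    fn new(value: T, n: usize) -> Self {
        assert!(n >= 1, "need at least one replica");
        Self {
            replicas: (0..n).map(|_| Box::new(value)).collect(),
        }
    }

    /// Read every replica on its own thread and return whichever copy
    /// is delivered first. Later results are ignored.
    fn read(&self) -> T {
        let (tx, rx) = mpsc::channel::<T>();
        thread::scope(|s| {
            for replica in &self.replicas {
                let tx = tx.clone();
                s.spawn(move || {
                    // black_box keeps the compiler from eliding the load.
                    let value = std::hint::black_box(**replica);
                    let _ = tx.send(value);
                });
            }
            rx.recv().expect("at least one replica sends a value")
        })
    }
}
```

Thread spawn cost here is orders of magnitude larger than the stall being hedged, so this shows the control flow only. Any latency benefit depends on long-lived pinned workers and replicas that actually sit on different channels.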
Cross-platform: x86_64 inline assembly (rdtsc, clflush,
mfence) and aarch64 equivalents (cntvct_el0,
dc civac, dmb sy). Runs on Intel, AMD, Apple Silicon,
and Graviton.
IBM invents DRAM. Refresh penalty baked in from day one.
Dean & Barroso publish "The Tail at Scale." Hedged requests become standard in distributed systems.
LaurieWired applies hedging at the DRAM physical layer. Parallel reads on pinned cores across verified channels. 15× p99.99 in C++.
We port to Rust, discover dc civac → SLC (not DRAM) on Apple Silicon. Parallel discriminator validates channel diversity: same-channel 1.0× (no improvement), cross-channel 1.5×.
This page maintains a dated, append-only corrections log. Every future material change to the claims above will be dated, named, and preserved below as signal about the document's credibility, rather than quietly rewritten.
Initial publication. Results table claimed channel diversity from a clflush path on Apple Silicon. The bench binary showed hedged p99.99 = 0 ns on M4 Max, which we momentarily interpreted as “TAIL ELIMINATED” instead of recognizing it as the dc civac → SLC measurement artifact it actually was.
Added the “we got it wrong twice” framing after confirming the April 7 Mac SSH run actually ended NO SIGNIFICANT IMPROVEMENT, not success. Added the Apple Silicon SLC finding as a named failure mode. First repaired pointer-chase runs recorded (200k: 6.2×, 2M: 11.1×).
Reran the repaired instrument a second day on this M4 Max (200k: 2.7×, 2M: 3.1×). Reran every broken path a second day, each still broken. Promoted the LaurieWired C++ / our Rust port attribution split from a 0.6rem footer line to a dedicated Code Availability section. Added the update block for the repaired instrument numbers. Closed LIG-443.