Research · May 26, 2026 · OCR Engine Benchmark

Four ways to OCR a physics book.

We had a 443-page Oxford monograph on radiation trapping in atomic vapours, scanned to PDF, and an honest question: which 2026-vintage OCR engine should read it? We ran the same 10-page sample through four engines on three machines and one cloud, then ran the winner across the full book. The results were less close than the marketing.

olmOCR-2 FP8 · RTX 5090
13.0s
Per page via vLLM on Hyperion. Equations as LaTeX. Watermark stripped.
olmOCR-2 mlx-8bit · M4 Max
17.5s
Per page via mlx-vlm. Same model, about 35% slower. Free + portable.
Gemini 3.1 Pro Preview · cloud
29.2s
Per page via REST. Best out-of-box. ~$5–15 per book.
book-convert / Tesseract · local
Zero equations recovered. Watermark gibberish on every page. Don't use for STEM.

An image PDF and a watermark that keeps showing up sideways.

Radiation Trapping in Atomic Vapours (Molisch & Oehry, Oxford University Press, 1999) is a 443-page graduate-level monograph on imprisoned resonance radiation — the physics that makes high-density sodium vapour glow for milliseconds longer than its bare-atom radiative lifetime suggests. Body prose, dense LaTeX, a few figures, and the usual rotated Downloaded from https://academic.oup.com/... watermark stamped down the side of every page by Oxford's web platform.

pdftotext on the raw PDF returns the watermark and nothing else. The page content is rasterized image bytes — exactly the kind of thing where the 2023–2026 transition from layout-then-OCR to VLM-based document understanding made the biggest difference.

We split the PDF to 200 dpi PNGs, picked a 10-page benchmark sample covering the table of contents (p-005, p-015), chapter prose with display equations (p-050, p-100, p-150), mid-book technical content (p-200, p-250), and late-chapter / appendix material (p-300, p-350, p-400), then ran each through every engine.
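The rasterisation step can be sketched as follows. This is a minimal sketch, not the harness's own code: it assumes poppler's pdftoppm is on the PATH (the same toolkit that ships pdftotext), and the function names and the `p` filename prefix are illustrative. pdftoppm zero-pads page numbers to the width of the page count, so a 443-page PDF should yield names of the form p-005.png.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_dir, dpi=200):
    """Build the pdftoppm argument list: -r sets the resolution, -png the format."""
    prefix = Path(out_dir) / "p"  # hypothetical prefix; output is <prefix>-NNN.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def split_pdf(pdf_path, out_dir, dpi=200):
    """Rasterise every page of the PDF to a PNG and return the sorted page paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, out_dir, dpi), check=True)
    return sorted(Path(out_dir).glob("p-*.png"))


# Show the command without running it.
print(" ".join(pdftoppm_command("book.pdf", "pages")))
```

Calling `split_pdf("book.pdf", "pages")` would then produce one PNG per page, from which the 10 sample pages can be picked by name.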

Three VLM engines, one Tesseract pipeline, one shared prompt.

We picked engines from the three categories that exist for OCR in 2026:

VLM, self-hosted

  • olmOCR-2-7B-1025-FP8 on an RTX 5090 via vLLM. AI2's October 2025 release. 82.4 on olmOCR-Bench.
  • olmOCR-2-7B-1025-mlx-8bit on an M4 Max via mlx-vlm. Same model, 8-bit MLX quantisation.

VLM, cloud

  • Gemini 3.1 Pro Preview via generativelanguage.googleapis.com. Single REST call per page, inline base64 image.

Traditional + LLM-prep

  • book-convert by @AndySparks, recommended on X — PyMuPDF text layer first, Tesseract 5.5.2 OCR fallback, Markdown output. Pitched for Claude Projects / NotebookLM ingestion.

Same prompt across the three VLM engines: transcribe the page as Markdown, math as LaTeX, headers as ##/###, strip the OUP watermark, record the page number as an HTML comment, no commentary. Differences in output are differences in how the model chose to obey, not in what we asked.
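For the cloud engine, a per-page request with an inline base64 image can be sketched like this. The prompt text is a paraphrase of the instructions listed above, not the wording the bench used, and `build_request_body` is a hypothetical helper. The body follows the Gemini API's generateContent JSON shape (a text part plus an `inline_data` part). The model identifier and API key are deliberately left out, since the article does not give them.

```python
import base64
import json

# Paraphrase of the shared instructions; the exact wording is an assumption.
PROMPT = (
    "Transcribe this page as Markdown. "
    "Write all math as LaTeX. "
    "Use ## and ### for headers. "
    "Omit the rotated OUP download watermark. "
    "Record the page number as an HTML comment. "
    "Return the transcription only, with no commentary."
)


def build_request_body(png_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Build a generateContent body holding the prompt and one inline PNG."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                }},
            ]
        }]
    }


# Placeholder bytes stand in for a real page image.
body = build_request_body(b"placeholder page bytes")
print(json.dumps(body)[:60])
```

The same `PROMPT` string can be reused unchanged for the two self-hosted engines, which keeps the prompt the one variable that does not move between runs.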

Three VLM engines agree. Tesseract doesn't see math.

Aggregate across 10 sample pages, per-engine totals:

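| Engine | Chars | Inline eq | Display eq | WM leak | Artifacts | s / page |
| --- | --- | --- | --- | --- | --- | --- |
| olmOCR-2 FP8 (Hyperion 5090) | 23,532 | 60 | 15 | 0 | 0 | 13.0 |
| olmOCR-2 mlx-8bit (M4 Max) | 23,123 | 94 | 15 | 0 | 0 | 17.5 |
| Gemini 3.1 Pro Preview | 24,333 | 116 | 11 | 0 | 0 | 29.2 |
| book-convert (Tesseract 5.5.2) | 23,110 | 0 | 0 | 36 (gibberish) | 9 | n/a |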

Character counts agree within ~5% across all four engines — the body prose is being recovered nearly identically. The differentiator is structure: equations, layout fidelity, and watermark handling.
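The ~5% figure can be checked from the aggregate character counts reported for the four engines. A short sketch, where "spread" is taken to mean the gap between the largest and smallest count relative to the smallest (one reasonable definition; the article does not say which it used):

```python
# Aggregate character counts over the 10 sample pages, per engine.
chars = {
    "olmOCR-2 FP8": 23_532,
    "olmOCR-2 mlx-8bit": 23_123,
    "Gemini 3.1 Pro Preview": 24_333,
    "book-convert (Tesseract)": 23_110,
}

lo, hi = min(chars.values()), max(chars.values())
spread = (hi - lo) / lo  # relative gap between the extremes

print(f"min={lo}, max={hi}, spread={spread:.1%}")
```

Under that definition the spread comes out at about 5.3%, consistent with the "within ~5%" claim.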

The VLM engines diverge on inline equation count (60 to 116) because they disagree on whether n(r,t) mid-sentence is an equation or prose. That's a notation-density choice, not a quality gap. Display-equation counts, which are the actual math content, fall in a narrow 11–15 range. The pairwise unified-diff count between any two VLM engines on a full-content page is ~24–38 lines on a page of ~30 lines, and most of that is whitespace and equation-delimiter style.

Tesseract has zero equations. It has 36 fragments of the OUP watermark OCR'd as gibberish — the rotated sideways text reads as 9zoz Aew 9z uo sasn Asesqry Agjeyseg when forced through a left-to-right recogniser. It has 9 letter-digit-letter artifacts of the n1ethod for method variety.

Page 100: the same equation, four ways.

Page 100 of the book is the introduction to numerical methods for the Holstein equation — equations 5.57 through 5.59, a fairly standard forward-Euler discretisation. The print is clean, the equations are moderately complex, and the page has body text wrapping around them. This is what each engine returned for equation 5.57 (the discretised time derivative):

olmOCR-2 (Hyperion · 5090)

The time derivative of the
excited-state distribution
is approximated as

\[
\frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
        - n(\mathbf{r}, t)}{\Delta t}
\]

Gemini 3.1 Pro Preview

The time derivative of the
excited-state distribution
is approximated as

$$ \frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
        - n(\mathbf{r}, t)}{\Delta t} $$ (5.57)

olmOCR-2 (M4 Max · MLX 8-bit)

The time derivative of the
excited-state distribution
is approximated as

\[
\frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
        - n(\mathbf{r}, t)}{\Delta t}
\] (5.57)

book-convert / Tesseract

The time derivative of the
excited-state distribution
is approximated as

an(r,t) _ n(r,t+At)—n(r,t)
ot At (5.57)

The three VLM engines produce LaTeX. Two of them use the \[ \] display convention (olmOCR-2's natural style); one uses $$ $$ (markdown style, what we asked for in the prompt — Gemini was the only engine that obeyed). All three are immediately renderable. Tesseract returned a string that is almost the equation if you squint, but is not equation enough to be parsed, rendered, or trusted.
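If a downstream renderer only understands one display-math delimiter, the two conventions can be reconciled after the fact. A minimal sketch, assuming the only goal is to rewrite \[ ... \] blocks as $$ ... $$. `normalise_display_math` is a hypothetical helper, not part of the bench harness.

```python
import re

# Match \[ ... \] across lines. The lookbehind skips "\\[" so that a LaTeX
# line break with optional spacing, such as \\[2pt], is not treated as an opener.
DISPLAY_BRACKETS = re.compile(r"(?<!\\)\\\[(.*?)\\\]", re.DOTALL)


def normalise_display_math(markdown: str) -> str:
    """Rewrite \\[ ... \\] display blocks as $$ ... $$, leaving other text alone."""
    # A function replacement avoids backslash-escape processing of the LaTeX body.
    return DISPLAY_BRACKETS.sub(lambda m: "$$" + m.group(1) + "$$", markdown)


sample = "\\[\n\\frac{a}{b}\n\\] (5.57)"
print(normalise_display_math(sample))
```

Text that already uses $$ ... $$ passes through unchanged, so the function is safe to run over all three engines' output.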

Per-engine notes for the next person doing this.

olmOCR-2 FP8 on an RTX 5090 (vLLM)

olmOCR-2 mlx-8bit on an M4 Max (mlx-vlm)

Gemini 3.1 Pro Preview (REST, free-tier Vertex)

book-convert / Tesseract 5.5.2

The Hyperion FP8 model extracted fewer inline equations.

We expected the cloud model with the most capacity to do the best. It did, by inline equation count: Gemini found 116 expressions worth wrapping in $...$; the M4 Max MLX found 94; the Hyperion FP8 found 60. The difference isn't equations missed — it's equations recognised as prose. A reference like n(r, t) in the middle of a sentence can reasonably be left as prose or promoted to math. The three models drew the line differently.

Same model, different quantisation: the mlx-8bit version was more eager to wrap inline math (94) than the FP8 version (60). It's not clear whether that's quantisation-induced or just sampling noise across two different inference stacks. Display equations, the actual numbered formulas, agreed within ±2 between MLX and FP8, and the sample-wide totals matched at 15 each.

The interesting thing about a 2026-vintage OCR benchmark is that every VLM engine got the math right. The disagreement is on typography, not transcription.

What we'd reach for next time.

One-shot OCR of one scanned book with equations: Gemini 3.1 Pro Preview via REST. Zero install, ~$5–15 for the book, no model management. The recitation filter is annoying but recoverable.

High volume, offline, or privacy-sensitive: olmOCR-2-7B-1025-FP8 on an RTX 5090 via vLLM. 13 s/page, free, local, fits in a 12 GB GPU per AI2's recommendation. Same engine on an L40S or A100 is similar; an H100 is wasted on a 7B model.

Mac-only, no cloud: olmOCR-2-7B-1025-mlx-8bit via mlx-vlm. 17.5 s/page on an M4 Max with 128 GB. The model is small enough that anyone with 16 GB+ of unified memory can run it. No discrete GPU needed.

Tesseract / book-convert: a fine pipeline pointed at a 1980s engine. Use it for clean prose. Do not use it for anything where typography or equations matter.

The full 443-page Gemini extraction is 1,024,254 bytes of Markdown. It renders cleanly in NotebookLM, Claude Projects, and an Obsidian vault. The equations are LaTeX. The watermark is gone. It took 30 minutes wall time and exactly zero re-OCR runs.
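A quick back-of-envelope check relates the 30-minute wall time to the 29.2 s/page sample latency. The article does not say how the full-book run was scheduled, so the concurrency figure below is an inference under the assumption that per-page latency stayed the same as in the 10-page sample:

```python
PAGES = 443
SECONDS_PER_PAGE = 29.2   # Gemini latency measured on the 10-page sample
WALL_MINUTES = 30         # reported wall time for the full book

# Time the book would take if pages were sent strictly one after another.
sequential_minutes = PAGES * SECONDS_PER_PAGE / 60

# Average number of requests that must have been in flight to finish in 30 minutes.
implied_concurrency = sequential_minutes / WALL_MINUTES

print(f"{sequential_minutes:.0f} min sequential, "
      f"~{implied_concurrency:.1f} requests in flight on average")
```

Sequentially the book would take roughly 216 minutes, so a 30-minute run suggests about seven requests in flight at a time, if the sample latency is representative.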

The complete bench harness, hosted.

Three artifacts live on this domain so you don't have to take our word for any of it:

The harness page has the full source of each script with copy buttons. The reader page is the same Markdown you'd ingest into any LLM tool, just rendered for human reading. The raw .md is what the four-way bench produced as its canonical full-book artifact.

Each engine's OCR script accepts image.png and writes image.md + image.meta.json. bench.py reads all four engine output directories, computes per-page metrics (length, equation count by notation, watermark leaks, OCR artifacts, wall time), pairwise diff line counts, and aggregates.
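The per-page metrics can be approximated with a few regular expressions. This is a sketch of the idea, not the source of bench.py: the patterns and the `page_metrics` name are assumptions, the watermark check only catches a cleanly transcribed URL (rotated gibberish needs a different detector), and the artifact pattern will also flag legitimate tokens such as chemical formulas.

```python
import re

# Display math in either notation: $$ ... $$ or \[ ... \].
DISPLAY = re.compile(r"\$\$.+?\$\$|\\\[.+?\\\]", re.DOTALL)
# Inline math in either notation: $ ... $ on one line, or \( ... \).
INLINE = re.compile(r"\$[^$\n]+\$|\\\(.+?\\\)")
# A cleanly transcribed watermark would contain the OUP host name.
WATERMARK = re.compile(r"academic\.oup\.com", re.IGNORECASE)
# Letter-digit-letter words of the "n1ethod" variety.
ARTIFACT = re.compile(r"\b[A-Za-z]+\d[A-Za-z]+\b")


def page_metrics(md: str) -> dict:
    """Compute simple per-page counts from one engine's Markdown output."""
    display = DISPLAY.findall(md)
    # Remove display blocks first so their $ signs are not counted as inline math.
    rest = DISPLAY.sub(" ", md)
    return {
        "chars": len(md),
        "display_eq": len(display),
        "inline_eq": len(INLINE.findall(rest)),
        "wm_leak": len(WATERMARK.findall(md)),
        "artifacts": len(ARTIFACT.findall(md)),
    }


sample = "Text with $n(r,t)$ and \\(x\\).\n\\[\na=b\n\\]\n$$ c=d $$ (5.57)\nthe n1ethod"
print(page_metrics(sample))
```

Counting equations by notation rather than in total is what makes the \[ \] versus $$ $$ split between engines visible in the aggregate table.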

The diff-counting metric is unified-diff line count: a noisy signal for total divergence, but the right shape. When two engines agree closely, the count stays around 30 lines on a full-content page (~24–38 between the VLM engines here); when one engine is producing garbage relative to the other, the count climbs into the 100s.
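A unified-diff line count is straightforward to compute with Python's standard difflib. The sketch below counts only added and removed lines and ignores file and hunk headers. Whether the harness counts headers or context lines is not stated, so treat `diff_line_count` as one plausible definition rather than the one bench.py uses.

```python
import difflib
from itertools import islice


def diff_line_count(a: str, b: str) -> int:
    """Count the added and removed lines in a unified diff of two Markdown pages."""
    diff = difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm="", n=0)
    # The first two lines are the '---' / '+++' file header; skip them so that
    # content lines that happen to start with dashes are still counted.
    body = islice(diff, 2, None)
    # Hunk headers start with '@@' and are not counted.
    return sum(1 for line in body if line.startswith(("+", "-")))


page_a = "Intro\n\\[\nx = 1\n\\]\n"
page_b = "Intro\n$$\nx = 1\n$$\n"
print(diff_line_count(page_a, page_b))
```

In this example the two pages differ only in delimiter style, yet each changed line is counted twice (once removed, once added), which is why delimiter and whitespace differences alone can account for tens of diff lines on a page.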

An honest note about the prior.

Before running this we expected the cloud Gemini Pro to be the clear winner — it's the largest model, the newest, the one with the trillion-token training corpus. It was the best, by inline equation density and by absolute character count. But it lost on speed (29 s/page) and on cost (real dollars if you're not on free Vertex), and the gap in output quality versus a 7B model running locally was small enough that for any high-volume use case the 7B model wins.

The 2026 OCR landscape is one where a 7B vision-language model fine-tuned on documents matches the best cloud frontier model on the task it was fine-tuned for. The interesting question is no longer which model is best but what's the right deployment. For Mac-native, MLX. For batch, vLLM on a consumer GPU. For one-shot, the cloud. For Tesseract — pick a different book.

The right move for a single scanned physics book in 2026 is one REST call per page to a cloud model. The right move for a hundred is a 7B local model. Tesseract isn't on the map any more.

Source PDF: Radiation Trapping in Atomic Vapours, A. F. Molisch & B. P. Oehry, Oxford University Press 1999. Engines benched 2026-05-26 on Apple M4 Max 128 GB, RTX 5090 32 GB on a 13900K Linux/WSL2 host, and Google Vertex free tier. Full bench harness at /gifts/ocr-bench/harness/, OCR'd book at /gifts/ocr-bench/read/. Thanks to @andysparks for the book-convert recommendation that gave us the Tesseract baseline, and to @nbaschez for the introduction.