Four Ways to OCR a Physics Book

The Source

An image PDF and a watermark that keeps showing up sideways.

Radiation Trapping in Atomic Vapours (Molisch & Oehry, Oxford University Press, 1999) is a 443-page graduate-level monograph on imprisoned resonance radiation — the physics that makes high-density sodium vapour glow for milliseconds longer than its bare-atom radiative lifetime suggests. Body prose, dense LaTeX, a few figures, and the usual rotated Downloaded from https://academic.oup.com/... watermark stamped down the side of every page by Oxford's web platform.

pdftotext on the raw PDF returns the watermark and nothing else. The page content is rasterized image bytes — exactly the kind of thing where the 2023–2026 transition from layout-then-OCR to VLM-based document understanding made the biggest difference.

We split the PDF to 200 dpi PNGs, picked a 10-page benchmark sample covering the table of contents (p-005, p-015), chapter prose with display equations (p-050, p-100, p-150), mid-book technical content (p-200, p-250), and late-chapter / appendix material (p-300, p-350, p-400), then ran each through every engine.

The Field

Three vision-language models, one Tesseract pipeline, one same prompt.

We picked engines from the three categories that exist for OCR in 2026:

VLM, self-hosted

olmOCR-2-7B-1025-FP8 on an RTX 5090 via vLLM. AI2's October 2025 release. 82.4 on olmOCR-Bench.
olmOCR-2-7B-1025-mlx-8bit on an M4 Max via mlx-vlm. Same weights, MLX kernels.

VLM, cloud

Gemini 3.1 Pro Preview via generativelanguage.googleapis.com. Single REST call per page, inline base64 image.

Traditional + LLM-prep

book-convert by @AndySparks, recommended on X — PyMuPDF text layer first, Tesseract 5.5.2 OCR fallback, Markdown output. Pitched for Claude Projects / NotebookLM ingestion.

Same prompt across the three VLM engines: transcribe the page as Markdown, math as LaTeX, headers as ##/###, strip the OUP watermark, record the page number as an HTML comment, no commentary. Differences in output are differences in how the model chose to obey, not in what we asked.

The Numbers

Three VLM engines agree. Tesseract doesn't see math.

Aggregate across 10 sample pages, per-engine totals:

Engine	Chars	Inline eq	Display eq	WM leak	Artifacts	s / page
olmOCR-2 FP8 (Hyperion 5090)	23,532	60	15	0	0	13.0
olmOCR-2 mlx-8bit (M4 Max)	23,123	94	15	0	0	17.5
Gemini 3.1 Pro Preview	24,333	116	11	0	0	29.2
book-convert (Tesseract 5.5.2)	23,110	0	0	36 gib.	9	n/a

Character counts agree within ~5% across all four engines — the body prose is being recovered nearly identically. The differentiator is structure: equations, layout fidelity, and watermark handling.

The VLM engines diverge on inline equation count (60 to 116) because they disagree on whether n(r,t) mid-sentence is an equation or prose. That's a notation-density choice, not a quality gap. Display-equation counts agree within 11–15, which is the actual math content. The pairwise diff between any two VLM engines on a full-content page is ~24–38 lines out of ~30 — i.e., a few lines of whitespace and equation-delimiter style.

Tesseract has zero equations. It has 36 fragments of the OUP watermark OCR'd as gibberish — the rotated sideways text reads as 9zoz Aew 9z uo sasn Asesqry Agjeyseg when forced through a left-to-right recogniser. It has 9 letter-digit-letter artifacts of the n1ethod for method variety.

The Side-by-Side

Page 100: the same equation, four ways.

Page 100 of the book is the introduction to numerical methods for the Holstein equation — equations 5.57 through 5.59, a fairly standard forward-Euler discretisation. The print is clean, the equations are moderately complex, and the page has body text wrapping around them. This is what each engine returned for equation 5.57 (the discretised time derivative):

olmOCR-2 (Hyperion · 5090)

The time derivative of the
excited-state distribution
is approximated as

\[
\frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
        - n(\mathbf{r}, t)}{\Delta t}
\]

Gemini 3.1 Pro Preview

The time derivative of the
excited-state distribution
is approximated as

$$ \frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
        - n(\mathbf{r}, t)}{\Delta t} $$ (5.57)

olmOCR-2 (M4 Max · MLX 8-bit)

The time derivative of the
excited-state distribution
is approximated as

\[
\frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
        - n(\mathbf{r}, t)}{\Delta t}
\] (5.57)

book-convert / Tesseract

The time derivative of the
excited-state distribution
is approximated as

an(r,t) _ n(r,t+At)—n(r,t)
ot At (5.57)

The three VLM engines produce LaTeX. Two of them use the \[ \] display convention (olmOCR-2's natural style); one uses $$ $$ (markdown style, what we asked for in the prompt — Gemini was the only engine that obeyed). All three are immediately renderable. Tesseract returned a string that is almost the equation if you squint, but is not equation enough to be parsed, rendered, or trusted.

The Engines

Per-engine notes for the next person doing this.

olmOCR-2 FP8 on an RTX 5090 (vLLM)

Install: uv pip install "olmocr[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128. Bring poppler-utils system-wide for the canonical pipeline; or skip it and call vLLM directly over the OpenAI-compatible API.
Serve: vllm serve allenai/olmOCR-2-7B-1025-FP8 --max-model-len 16384 --gpu-memory-utilization 0.6. Footprint: ~19 GB of 32 GB on the 5090.
vLLM binds to localhost. If your client is on a different host, tunnel: ssh -L 30024:localhost:30024. Don't try to expose port 30024 directly.
13.0 s / page single-threaded, range 4–16 s. We did not test concurrency — vLLM batches well, so a parallel client would likely do 2–4× better.
Uses $...$ / \[...\] LaTeX notation. You can prompt for $...$ / $$...$$ but it tends to drift back to its training distribution.

olmOCR-2 mlx-8bit on an M4 Max (mlx-vlm)

Install: uv pip install mlx-vlm torch torchvision. torch + torchvision are CPU-only here — required because the Qwen2-VL processor has a soft dependency on a video processor that imports torchvision at load time, even though no video is involved.
Weights: mlx-community/olmOCR-2-7B-1025-mlx-8bit, 8.4 GB on disk, ~10 GB resident in unified memory during inference.
17.5 s / page average (range 10–23 s) on M4 Max. Model load 1.1 s on warm cache, ~30 s cold.
Same model as the FP8 CUDA version, so the output is essentially identical: $...$ / \[...\] notation, watermark stripped, headings preserved.
Runs on any M-series Mac with 16+ GB. The headline portability story for self-hosted OCR.

Gemini 3.1 Pro Preview (REST, free-tier Vertex)

Submit each page as inline base64 PNG to v1beta/models/gemini-3.1-pro-preview:generateContent. We ran 8 parallel curl jobs against a 443-page book and finished in ~30 minutes wall time.
3 of 443 full-book pages hit finishReason: RECITATION, which is Google's separate-from-safety copyright filter. Fallback to gemini-2.5-pro on the same page recovers all of them. The fallback was the difference between a complete book and a book with three holes.
~5,000 tokens per page average (image input + text output). At Gemini 3.1 Pro pricing of roughly $1.25 / $10 per million tokens (input / output), the 443-page run is in the $5–15 range depending on thinking-token spend. We were on free Vertex, so didn't pay.
The only engine that obeyed the prompt instruction to use $...$ / $$...$$ markdown LaTeX style, which is what NotebookLM and Claude Projects expect.

book-convert / Tesseract 5.5.2

Install: brew install tesseract poppler, then pip install -r requirements-ocr.txt in the cloned repo. Smooth path.
The tool is well-built — it auto-detects whether to use PyMuPDF text-layer extraction or fall back to OCR, has a quality scorer, supports a Marker backend for higher-quality runs. The pipeline is good. The OCR engine inside it is the bottleneck for this domain.
For a scanned monograph with equations, Tesseract's character-level model has no notion of layout, math, or rotated marginalia. It will OCR a sideways watermark as if it were Latin script and inject the result into the page.
It scored 1.0 on book-convert's own quality scorer — which is honest: the prose IS recovered. The scorer doesn't know that the equations are gone.
Right tool for: clean prose books, novels, PDFs with broken font encodings. Not for: STEM, anything with equations, anything with rotated marginalia.

The Surprise

The Hyperion FP8 model extracted fewer inline equations.

We expected the cloud model with the most capacity to do the best. It did, by inline equation count: Gemini found 116 expressions worth wrapping in $...$ ; the M4 Max MLX found 94; the Hyperion FP8 found 60. The difference isn't equations missed — it's equations recognised as prose. A reference like n(r, t) in the middle of a sentence can reasonably be left as prose or promoted to math. The three models drew the line differently.

Same model, different quantisation: the mlx-8bit version was more eager to wrap inline math (94) than the FP8 version (60). It's not clear whether that's quantisation-induced or just sampling noise across two different inference stacks. Display equations — the actual numbered formulas — agreed within ±2 between MLX and FP8.

The interesting thing about a 2026-vintage OCR benchmark is that every engine got the math right. The disagreement is on typography, not transcription.

The Recommendation

What we'd reach for next time.

One-shot OCR of one scanned book with equations: Gemini 3.1 Pro Preview via REST. Zero install, ~$5–15 for the book, no model management. The recitation filter is annoying but recoverable.

High volume, offline, or privacy-sensitive: olmOCR-2-7B-1025-FP8 on an RTX 5090 via vLLM. 13 s/page, free, local, fits in a 12 GB GPU per AI2's recommendation. Same engine on an L40S or A100 is similar; an H100 is wasted on a 7B model.

Mac-only, no cloud: olmOCR-2-7B-1025-mlx-8bit via mlx-vlm. 17.5 s/page on an M4 Max with 128 GB. The model is small enough that anyone with 16 GB+ of unified memory can run it. No GPU needed.

Tesseract / book-convert: a fine pipeline pointed at a 1980s engine. Use it for clean prose. Do not use it for anything where typography or equations matter.

The full 443-page book Gemini extraction is 1,024,254 bytes of Markdown. It renders cleanly in NotebookLM, Claude Projects, and an Obsidian vault. The equations are LaTeX. The watermark is gone. It took 30 minutes wall time and exactly zero re-OCR runs.

Reproducing

The complete bench harness, hosted.

Three artifacts live on this domain so you don't have to take our word for any of it:

Read the OCR'd book — 443 pages, KaTeX-rendered, with a "copy TeX" button on every display equation.
Download the .md — 1 MB of Markdown with LaTeX equations and per-page anchors. Feed it into NotebookLM, Claude Projects, an Obsidian vault, or your own tooling.
Bench harness — every script we used, ready to copy. Gemini REST OCR, MLX batch, vLLM client, parallel runner, the diff/aggregate script.

The harness page has the full source of each script with copy buttons. The reader page is the same Markdown you'd ingest into any LLM tool, just rendered for human reading. The raw .md is what the four-way bench produced as its canonical full-book artifact.

Each engine's OCR script accepts image.png and writes image.md + image.meta.json. bench.py reads all four engine output directories, computes per-page metrics (length, equation count by notation, watermark leaks, OCR artifacts, wall time), pairwise diff line counts, and aggregates.

The diff-counting metric is unified-diff line count — a noisy signal for total divergence, but the right shape: when two engines agree closely, the count drops below 30 lines on a full-content page; when one engine is producing garbage relative to the other, the count climbs into the 100s.

Coda

An honest note about the prior.

Before running this we expected the cloud Gemini Pro to be the clear winner — it's the largest model, the newest, the one with the trillion-token training corpus. It was the best, by inline equation density and by absolute character count. But it lost on speed (29 s/page) and on cost (real dollars if you're not on free Vertex), and the gap in output quality versus a 7B model running locally was small enough that for any high-volume use case the 7B model wins.

The 2026 OCR landscape is one where a 7B vision-language model fine-tuned on documents matches the best cloud frontier model on the task it was fine-tuned for. The interesting question is no longer which model is best but what's the right deployment. For Mac-native, MLX. For batch, vLLM on a consumer GPU. For one-shot, the cloud. For Tesseract — pick a different book.

The right move for a single scanned physics book in 2026 is a single REST call. The right move for a hundred is a 7B local model. Tesseract isn't on the map any more.

Source PDF: Radiation Trapping in Atomic Vapours, A. F. Molisch & B. P. Oehry, Oxford University Press 1999. Engines benched 2026-05-26 on Apple M4 Max 128 GB, RTX 5090 32 GB on a 13900K Linux/WSL2 host, and Google Vertex free tier. Full bench harness at /gifts/ocr-bench/harness/, OCR'd book at /gifts/ocr-bench/read/. Thanks to @andysparks for the book-convert recommendation that gave us the Tesseract baseline, and to @nbaschez for the introduction.