VLM, self-hosted
- olmOCR-2-7B-1025-FP8 on an RTX 5090 via vLLM. AI2's October 2025 release. 82.4 on olmOCR-Bench.
- olmOCR-2-7B-1025-mlx-8bit on an M4 Max via mlx-vlm. Same weights, MLX kernels.
We had a 443-page Oxford monograph on radiation trapping in atomic vapours, scanned to PDF, and an honest question: which 2026-vintage OCR engine should read it? We ran the same 10-page sample through four engines on three machines and one cloud, then ran the winner across the full book. The results were less close than the marketing.
Radiation Trapping in Atomic Vapours (Molisch & Oehry,
Oxford University Press, 1999) is a 443-page graduate-level monograph
on imprisoned resonance radiation — the physics that makes high-density
sodium vapour glow for milliseconds longer than its bare-atom radiative
lifetime suggests. Body prose, dense LaTeX, a few figures, and the
usual rotated "Downloaded from
https://academic.oup.com/..." watermark stamped down the side of
every page by Oxford's web platform.
pdftotext on the raw PDF returns the watermark and nothing
else. The page content is rasterized image bytes — exactly the kind of
thing where the 2023–2026 transition from layout-then-OCR to VLM-based
document understanding made the biggest difference.
We split the PDF to 200 dpi PNGs, picked a 10-page benchmark
sample covering the table of contents (p-005, p-015),
chapter prose with display equations (p-050, p-100, p-150),
mid-book technical content (p-200, p-250), and
late-chapter / appendix material (p-300, p-350, p-400),
then ran each through every engine.
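The split-and-sample step can be sketched as below. This is a minimal sketch, not the harness itself: it assumes poppler's pdftoppm does the rasterising, and the file names (book.pdf, the pages/ directory, the p prefix) are illustrative. The three-digit zero padding relies on pdftoppm padding page numbers to the width of the page count, which is worth checking against your poppler version.

```python
from pathlib import Path

# Page numbers of the 10-page sample, as listed in the text.
SAMPLE_PAGES = [5, 15, 50, 100, 150, 200, 250, 300, 350, 400]


def pdftoppm_command(pdf: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Build the pdftoppm call that rasterises every page to a PNG."""
    # pdftoppm appends "-<page>.png" to the prefix, so the prefix "p"
    # is expected to yield p-001.png, p-002.png, ... for a 443-page PDF.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, str(Path(out_dir) / "p")]


def sample_paths(out_dir: str, pages=SAMPLE_PAGES, width: int = 3) -> list[Path]:
    """Map sample page numbers to the PNG names assumed above."""
    return [Path(out_dir) / f"p-{n:0{width}d}.png" for n in pages]


if __name__ == "__main__":
    print(" ".join(pdftoppm_command("book.pdf", "pages")))
    for path in sample_paths("pages"):
        print(path)
```

Pass the command list to subprocess.run(..., check=True) to actually rasterise the PDF; the sketch only builds it so that it runs without poppler installed.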
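We picked engines from the three categories that exist for OCR in 2026: self-hosted VLMs (the two olmOCR-2 builds above), a cloud VLM, and classical OCR.
VLM, cloud
- Gemini 3.1 Pro Preview via generativelanguage.googleapis.com. Single REST call per page, inline base64 image.
Classical OCR
- book-convert, a pipeline built on Tesseract 5.5.2.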
Same prompt across the three VLM engines: transcribe the page as
Markdown, math as LaTeX, headers as ##/###,
strip the OUP watermark, record the page number as an HTML comment,
no commentary. Differences in output are differences in how the model
chose to obey, not in what we asked.
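For the cloud engine, a request with an inline base64 image might be assembled as follows. The prompt wording here is a hypothetical paraphrase of the description above, not the exact prompt we sent, and the helper name gemini_request_body is ours. The contents / parts / inline_data layout follows the public generateContent REST format.

```python
import base64
import json

# Hypothetical wording: the article describes the prompt but does not quote it.
PROMPT = (
    "Transcribe this page as Markdown. Write math as LaTeX. "
    "Use ## and ### for headers. Omit the publisher watermark. "
    "Record the page number as an HTML comment. No commentary."
)


def gemini_request_body(png_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Build a generateContent JSON body: one text part, one inline PNG part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


if __name__ == "__main__":
    body = gemini_request_body(b"\x89PNG placeholder bytes")
    print(json.dumps(body)[:120])
```

Keeping the prompt in one constant shared by every engine's script is what makes the comparison a test of the models and not of the prompts.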
Aggregate across 10 sample pages, per-engine totals:
| Engine | Chars | Inline eq. | Display eq. | Watermark leaks | OCR artifacts | s / page |
|---|---|---|---|---|---|---|
| olmOCR-2 FP8 (Hyperion 5090) | 23,532 | 60 | 15 | 0 | 0 | 13.0 |
| olmOCR-2 mlx-8bit (M4 Max) | 23,123 | 94 | 15 | 0 | 0 | 17.5 |
| Gemini 3.1 Pro Preview | 24,333 | 116 | 11 | 0 | 0 | 29.2 |
| book-convert (Tesseract 5.5.2) | 23,110 | 0 | 0 | 36 (gibberish) | 9 | n/a |
Character counts agree within ~5% across all four engines — the body prose is being recovered nearly identically. The differentiator is structure: equations, layout fidelity, and watermark handling.
The VLM engines diverge on inline equation count (60 to 116) because
they disagree on whether n(r,t) mid-sentence is an
equation or prose. That's a notation-density choice, not a quality
gap. Display-equation counts stay in a narrow 11–15 range, which is the actual
math content. The pairwise diff between any two VLM engines on a
full-content page is ~24–38 lines out of ~30 — i.e., a few lines of
whitespace and equation-delimiter style.
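One way to count equations by notation is sketched below. This is an illustrative counter under simple assumptions (no escaped dollar signs, no currency amounts in the text), not necessarily the logic in bench.py. It treats both `$$...$$` and `\[...\]` as display math and both `$...$` and `\(...\)` as inline math.

```python
import re

# Display math: $$ ... $$ or \[ ... \], possibly spanning several lines.
DISPLAY = re.compile(r"\$\$.+?\$\$|\\\[.+?\\\]", re.DOTALL)
# Inline math: single-dollar spans on one line, or \( ... \).
INLINE = re.compile(r"(?<!\$)\$(?!\$)[^$\n]+?\$(?!\$)|\\\(.+?\\\)")


def count_equations(markdown: str) -> dict:
    """Count display and inline math spans in a Markdown page."""
    display = DISPLAY.findall(markdown)
    # Blank out display spans first so "$$...$$" is never counted as inline.
    remainder = DISPLAY.sub(" ", markdown)
    inline = INLINE.findall(remainder)
    return {"inline": len(inline), "display": len(display)}


if __name__ == "__main__":
    page = (
        "The density $n(r,t)$ and \\(x\\) appear inline, n(r,t) stays prose.\n"
        "\\[\na = b\n\\]\n"
        "$$ c = d $$ (5.57)\n"
    )
    print(count_equations(page))
```

The bare n(r,t) in the sample line is not counted at all, which is exactly the choice that separates an inline count of 60 from one of 116.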
Tesseract has zero equations. It has 36 fragments of the
OUP watermark OCR'd as gibberish — the rotated sideways text reads
as 9zoz Aew 9z uo sasn Asesqry Agjeyseg when forced
through a left-to-right recogniser. It has 9 letter-digit-letter
artifacts of the n1ethod for method
variety.
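A letter-digit-letter artifact of this kind can be flagged with a one-line regular expression. The pattern below is a heuristic sketch and an assumption about how such a metric could be computed. It will also flag legitimate tokens such as chemical formulas, so treat the count as an upper bound.

```python
import re

# A digit sandwiched between letters inside a single word, e.g. "n1ethod".
LETTER_DIGIT_LETTER = re.compile(r"\b[A-Za-z]+\d[A-Za-z]+\b")


def find_artifacts(text: str) -> list[str]:
    """Return words that look like a letter misread as a digit."""
    return LETTER_DIGIT_LETTER.findall(text)


if __name__ == "__main__":
    sample = "the n1ethod of Holstein, published in 1999, see eq. 5.57"
    print(find_artifacts(sample))
```

Plain numbers such as 1999 and 5.57 pass untouched because the pattern requires letters on both sides of the digit.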
Page 100 of the book is the introduction to numerical methods for the Holstein equation — equations 5.57 through 5.59, a fairly standard forward-Euler discretisation. The print is clean, the equations are moderately complex, and the page has body text wrapping around them. This is what each engine returned for equation 5.57 (the discretised time derivative):
The time derivative of the
excited-state distribution
is approximated as
\[
\frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
- n(\mathbf{r}, t)}{\Delta t}
\]
The time derivative of the
excited-state distribution
is approximated as
$$ \frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
- n(\mathbf{r}, t)}{\Delta t} $$ (5.57)
The time derivative of the
excited-state distribution
is approximated as
\[
\frac{\partial n(\mathbf{r}, t)}{\partial t}
\approx \frac{n(\mathbf{r}, t + \Delta t)
- n(\mathbf{r}, t)}{\Delta t}
\] (5.57)
The time derivative of the
excited-state distribution
is approximated as
an(r,t) _ n(r,t+At)—n(r,t)
ot At (5.57)
The three VLM engines produce LaTeX. Two of them use the
\[ \] display convention (olmOCR-2's natural style); one
uses $$ $$ (markdown style, what we asked for in the
prompt — Gemini was the only engine that obeyed). All three are
immediately renderable. Tesseract returned a string that is almost
the equation if you squint, but is not equation enough to be parsed,
rendered, or trusted.
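Because the two delimiter conventions carry the same content, olmOCR-2 output can be normalised to dollar delimiters after the fact. The function below is a minimal sketch of that post-processing step, with a name of our choosing. It assumes `\[` and `\(` only ever open math, so LaTeX line-break spacing such as `\\[2pt]` inside an array would be mangled and needs special handling.

```python
import re


def to_dollar_delimiters(markdown: str) -> str:
    """Rewrite \\[...\\] as $$...$$ and \\(...\\) as $...$."""
    # Lambdas avoid backslash-escape processing in the replacement string.
    out = re.sub(
        r"\\\[(.+?)\\\]",
        lambda m: "$$" + m.group(1) + "$$",
        markdown,
        flags=re.DOTALL,
    )
    out = re.sub(
        r"\\\((.+?)\\\)",
        lambda m: "$" + m.group(1) + "$",
        out,
        flags=re.DOTALL,
    )
    return out


if __name__ == "__main__":
    page = "\\[\n\\frac{a}{b}\n\\] (5.57) with \\(n\\) inline"
    print(to_dollar_delimiters(page))
```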
olmOCR-2 FP8 on the RTX 5090 (vLLM)
- Install: `uv pip install "olmocr[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128`. Bring poppler-utils system-wide for the canonical pipeline, or skip it and call vLLM directly over the OpenAI-compatible API.
- Serve: `vllm serve allenai/olmOCR-2-7B-1025-FP8 --max-model-len 16384 --gpu-memory-utilization 0.6`. Footprint: ~19 GB of 32 GB on the 5090.
- Remote access: `ssh -L 30024:localhost:30024`. Don't try to expose port 30024 directly.
- Output: `\(...\)` / `\[...\]` LaTeX notation. You can prompt for `$...$` / `$$...$$` but it tends to drift back to its training distribution.
olmOCR-2 mlx-8bit on the M4 Max (mlx-vlm)
- Install: `uv pip install mlx-vlm torch torchvision`. torch + torchvision are CPU-only here; they are required because the Qwen2-VL processor has a soft dependency on a video processor that imports torchvision at load time, even though no video is involved.
- Model: `mlx-community/olmOCR-2-7B-1025-mlx-8bit`, 8.4 GB on disk, ~10 GB resident in unified memory during inference.
- Output: `\(...\)` / `\[...\]` notation, watermark stripped, headings preserved.
Gemini 3.1 Pro Preview (REST)
- Endpoint: `v1beta/models/gemini-3.1-pro-preview:generateContent`. We ran 8 parallel curl jobs against a 443-page book and finished in ~30 minutes wall time.
- Gotcha: pages blocked with `finishReason: RECITATION` hit Google's separate-from-safety copyright filter. Falling back to gemini-2.5-pro on the same page recovers all of them. The fallback was the difference between a complete book and a book with three holes.
- Output: `$...$` / `$$...$$` markdown LaTeX style, which is what NotebookLM and Claude Projects expect.
book-convert (Tesseract)
- Install: `brew install tesseract poppler`, then `pip install -r requirements-ocr.txt` in the cloned repo. Smooth path.
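The recitation fallback can be sketched as a small wrapper around whatever function performs the REST call. Everything named here is illustrative: call_model is a hypothetical callable that takes a model name and returns the parsed generateContent JSON for one page. The candidates / finishReason / content.parts layout follows the public response format.

```python
from typing import Callable, Optional, Tuple

PRIMARY = "gemini-3.1-pro-preview"
FALLBACK = "gemini-2.5-pro"


def finish_reason(response: dict) -> Optional[str]:
    """Return the first candidate's finishReason, if any."""
    candidates = response.get("candidates") or []
    return candidates[0].get("finishReason") if candidates else None


def extract_text(response: dict) -> str:
    """Concatenate the text parts of the first candidate."""
    candidates = response.get("candidates") or [{}]
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


def ocr_with_fallback(call_model: Callable[[str], dict]) -> Tuple[str, str]:
    """Try the primary model; retry once on the fallback if recitation-blocked."""
    response = call_model(PRIMARY)
    if finish_reason(response) == "RECITATION":
        response = call_model(FALLBACK)
        return FALLBACK, extract_text(response)
    return PRIMARY, extract_text(response)


if __name__ == "__main__":
    def fake_call(model: str) -> dict:
        # Stand-in for the real HTTP request, for demonstration only.
        if model == PRIMARY:
            return {"candidates": [{"finishReason": "RECITATION"}]}
        return {"candidates": [{"finishReason": "STOP",
                                "content": {"parts": [{"text": "## 5.3"}]}}]}

    print(ocr_with_fallback(fake_call))
```

Returning the model name alongside the text makes it easy to record in the per-page metadata which pages needed the fallback.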
We expected the cloud model with the most capacity to do the best.
It did, by inline equation count: Gemini found 116 expressions
worth wrapping in $...$; the M4 Max MLX found 94; the
Hyperion FP8 found 60. The difference isn't equations missed — it's
equations recognised as prose. A reference like n(r, t)
in the middle of a sentence can reasonably be left as prose or
promoted to math. The three models drew the line differently.
Same model, different quantisation: the mlx-8bit version was more eager to wrap inline math (94) than the FP8 version (60). It's not clear whether that's quantisation-induced or just sampling noise across two different inference stacks. Display equations — the actual numbered formulas — agreed within ±2 between MLX and FP8.
The interesting thing about a 2026-vintage OCR benchmark is that every VLM engine got the math right. Among the three, the disagreement is on typography, not transcription.
One-shot OCR of one scanned book with equations: Gemini 3.1 Pro Preview via REST. Zero install, ~$5–15 for the book, no model management. The recitation filter is annoying but recoverable.
High volume, offline, or privacy-sensitive: olmOCR-2-7B-1025-FP8 on an RTX 5090 via vLLM. 13 s/page, free, local, fits in a 12 GB GPU per AI2's recommendation. Same engine on an L40S or A100 is similar; an H100 is wasted on a 7B model.
Mac-only, no cloud: olmOCR-2-7B-1025-mlx-8bit via mlx-vlm. 17.5 s/page on an M4 Max with 128 GB. The model is small enough that anyone with 16 GB+ of unified memory can run it. No discrete GPU needed.
Tesseract / book-convert: a fine pipeline pointed at a 1980s engine. Use it for clean prose. Do not use it for anything where typography or equations matter.
The full-book Gemini extraction (all 443 pages) is 1,024,254 bytes of Markdown. It renders cleanly in NotebookLM, Claude Projects, and an Obsidian vault. The equations are LaTeX. The watermark is gone. It took 30 minutes wall time and exactly zero re-OCR runs.
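As a sanity check on the wall-time figure, the per-page timings from the sample can be extrapolated to the whole book. This is back-of-envelope arithmetic under two assumptions: the sample's seconds-per-page holds across all 443 pages, and 8 parallel jobs scale perfectly. The local engines are assumed to run one page at a time.

```python
PAGES = 443


def wall_minutes(seconds_per_page: float, workers: int = 1) -> float:
    """Estimated wall time in minutes for the whole book."""
    return PAGES * seconds_per_page / workers / 60


if __name__ == "__main__":
    runs = [
        ("Gemini 3.1 Pro Preview, 8 parallel jobs", 29.2, 8),
        ("olmOCR-2 FP8, sequential", 13.0, 1),
        ("olmOCR-2 mlx-8bit, sequential", 17.5, 1),
    ]
    for name, seconds_per_page, workers in runs:
        print(f"{name}: {wall_minutes(seconds_per_page, workers):.1f} min")
```

The Gemini estimate comes out at about 27 minutes, consistent with the ~30 minutes observed. The sequential local estimates are about 96 minutes for FP8 and about 129 minutes for MLX.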
Three artifacts live on this domain so you don't have to take our word for any of it:
The harness page has the full source of each script with copy
buttons. The reader page is the same Markdown you'd ingest into
any LLM tool, just rendered for human reading. The raw .md
is what the four-way bench produced as its canonical full-book
artifact.
Each engine's OCR script accepts image.png and writes
image.md + image.meta.json. bench.py
reads all four engine output directories, computes per-page metrics
(length, equation count by notation, watermark leaks, OCR
artifacts, wall time), pairwise diff line counts, and aggregates.
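The per-page contract (image.png in, image.md plus image.meta.json out) might look like the sketch below. The transcribe callable and the metadata field names (engine, seconds, chars) are hypothetical. The real scripts may record different fields.

```python
import json
import time
from pathlib import Path
from typing import Callable, Tuple


def run_page(image_path, transcribe: Callable[[Path], str],
             engine: str = "example-engine") -> Tuple[Path, Path]:
    """OCR one page image and write <name>.md and <name>.meta.json beside it."""
    image_path = Path(image_path)
    start = time.perf_counter()
    markdown = transcribe(image_path)  # engine-specific call, supplied by caller
    elapsed = time.perf_counter() - start

    md_path = image_path.with_suffix(".md")
    meta_path = image_path.with_suffix(".meta.json")
    md_path.write_text(markdown, encoding="utf-8")
    meta = {"engine": engine, "seconds": round(elapsed, 3), "chars": len(markdown)}
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return md_path, meta_path
```

Because every engine writes the same two files, the aggregation step only has to walk the output directories and never needs to know which engine produced which file.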
The diff-counting metric is unified-diff line count — a noisy signal for total divergence, but the right shape: when two engines agree closely, the count drops below 30 lines on a full-content page; when one engine is producing garbage relative to the other, the count climbs into the 100s.
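A diff line count of this kind can be computed with the standard library. The sketch below counts only the lines a unified diff would mark as added or removed and ignores headers and context. Whether bench.py counts it exactly this way is an assumption.

```python
import difflib


def diff_line_count(a: str, b: str) -> int:
    """Number of lines a unified diff would mark with '+' or '-'."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Removed lines from a plus added lines from b.
            count += (i2 - i1) + (j2 - j1)
    return count


if __name__ == "__main__":
    olmocr_style = "Text\n\\[\nx = y\n\\]\n"
    gemini_style = "Text\n$$\nx = y\n$$\n"
    print(diff_line_count(olmocr_style, gemini_style))
```

In the example the two pages differ only in delimiter style, and each swapped delimiter line costs two diff lines (one removed, one added). That is why delimiter conventions alone account for a noticeable share of the pairwise count.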
Before running this we expected the cloud Gemini Pro to be the clear winner — it's the largest model, the newest, the one with the trillion-token training corpus. It was the best, by inline equation density and by absolute character count. But it lost on speed (29 s/page) and on cost (real dollars if you're not on free Vertex), and the gap in output quality versus a 7B model running locally was small enough that for any high-volume use case the 7B model wins.
The 2026 OCR landscape is one where a 7B vision-language model fine-tuned on documents matches the best cloud frontier model on the task it was fine-tuned for. The interesting question is no longer which model is best but what's the right deployment. For Mac-native, MLX. For batch, vLLM on a consumer GPU. For one-shot, the cloud. For Tesseract — pick a different book.
The right move for a single scanned physics book in 2026 is a single REST call. The right move for a hundred is a 7B local model. Tesseract isn't on the map any more.