olmOCR is an open-source toolkit built for converting PDFs and other documents into plain text at scale. This isn’t your average OCR tool; it’s designed to handle complex documents (including tables, equations, and handwritten content) while maintaining the natural reading order.
Traditional OCR tools often disrupt reading order or mishandle non-standard elements, creating barriers for researchers, developers, and data analysts working with large document volumes. olmOCR addresses this by using a unique prompting technique to increase accuracy and reduce hallucinations. It has been meticulously trained on academic papers, technical documentation, and reference content.
Features
- High-throughput conversion of PDFs and documents to plain text
- Natural reading order preservation for more intuitive text output
- Support for complex elements including tables, equations, and handwriting
- Specialized training on academic papers and technical documentation
- Unique prompting technique to enhance accuracy and reduce hallucinations
- Full toolkit deployment option on your own GPUs
- Cost-effective processing at approximately $190 USD per million pages
- Multi-node/cluster support for processing millions of documents
- AWS S3 integration for coordinated large-scale document processing
- Beaker integration for efficient PDF linearization
- Comprehensive filtering by language and SEO spam removal
- Fine-tuning capabilities for Qwen2-VL and Molmo-O models
- Side-by-side evaluation toolkit for comparing pipeline versions
- Dolma viewer for examining processed PDF documents
Use Cases
- Research Data Extraction: A researcher can quickly convert a collection of scanned research papers into text for analysis.
- Technical Manual Processing: An engineer can transform complex technical manuals into easily searchable text formats.
- Large-Scale Document Archiving: Organizations with vast archives of scanned documents can use olmOCR to make their content searchable and accessible.
- Data Analysis and Mining: Data scientists can feed the extracted text into analysis tools to uncover insights.
- Content Creation from Legacy Documents: A content creator can repurpose old reports or documents by extracting and refreshing the content.
Installation
1. Ensure you have the required hardware:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100)
- 30GB of free disk space
2. Install necessary system dependencies (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
3. Set up a conda environment and install olmOCR:
conda create -n olmocr python=3.11
conda activate olmocr
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
4. Install sglang with flashinfer for GPU inference:
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
Basic Usage
For quick testing without local setup, you can use the web demo. To run locally with GPU support:
- Convert a single PDF:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
- Convert multiple PDFs:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
- View results:
- Check the extracted text in JSONL format:
cat localworkspace/results/output_*.jsonl
- View results side-by-side with the original PDFs:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Then open the generated HTML file (e.g., ./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html) in your browser.
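If you want to post-process the results in your own scripts rather than through the viewer, the output files are Dolma-style JSONL, one JSON record per line. Here is a minimal sketch of reading them in Python; the `text` field name is an assumption based on the Dolma document format, so inspect one line of your own output to confirm the schema:

```python
import glob
import json

def read_dolma_jsonl(pattern):
    """Yield the extracted text from each Dolma-style JSONL record.

    Assumes each line is a JSON object with a "text" field, as in the
    Dolma format; check your actual output to confirm the schema.
    """
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record.get("text", "")

# Preview the first 200 characters of each converted document.
for text in read_dolma_jsonl("localworkspace/results/output_*.jsonl"):
    print(text[:200])
```

This is just a starting point for feeding the text into downstream analysis tools.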
Advanced Usage
For processing millions of PDFs across multiple nodes:
1. Set up a work queue in AWS S3 on your first worker node:
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
2. Add subsequent worker nodes to process from the same queue:
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
3. For Ai2 users, use Beaker for efficient processing:
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
Pros
- Accuracy: Superior text extraction, even from difficult documents.
- Scalability: Handles large volumes of documents efficiently.
- Open Source: Full control and customization options.
- Cost-Effective: Low processing cost per page.
Cons
- GPU Requirement: Needs a relatively powerful NVIDIA GPU.
- Technical Setup: Requires some technical knowledge to install and configure.
- English Focus: Primarily trained on English documents.
Pricing
olmOCR is free to use as an open-source tool. The estimated cost of $190 per million pages refers to the infrastructure costs (GPU usage) when you deploy it yourself.
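To put that figure in perspective, a quick back-of-the-envelope sketch of what the quoted rate implies at different volumes (the $190-per-million-pages number is the project's estimate; your actual GPU costs will vary with hardware and cloud pricing):

```python
# Estimated self-hosted GPU cost quoted for olmOCR: ~$190 USD per million pages.
COST_PER_MILLION_PAGES = 190.0

def estimated_cost(pages: int) -> float:
    """Rough infrastructure cost in USD for a given page count."""
    return pages * COST_PER_MILLION_PAGES / 1_000_000

print(estimated_cost(1))          # a single page costs well under a cent
print(estimated_cost(100_000))    # 100,000 pages: about $19
print(estimated_cost(1_000_000))  # one million pages: $190
```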
Related Resources
- GitHub Repository: https://github.com/allenai/olmocr
- Web Demo: https://olmocr.allenai.org/
- sglang: https://github.com/sgl-project/sglang
- Dolma: https://github.com/allenai/dolma
- flashinfer: https://github.com/flashinfer-ai/flashinfer
FAQs
Q: What types of GPUs are supported?
A: olmOCR has been tested on NVIDIA RTX 4090, L40S, A100, and H100 GPUs.
Q: How do I view the extracted text?
A: The extracted text is stored in Dolma-style JSONL format. You can view it directly or use the dolmaviewer command for a side-by-side comparison with the original PDF.
Q: Is there a way to convert multiple PDF files at once with olmOCR?
A: Yes, you can specify a directory or use glob patterns to process multiple PDFs in a single command.
Q: How does olmOCR improve text extraction accuracy?
A: olmOCR uses a unique prompting technique to increase accuracy and decrease hallucinations. It has been specifically trained on academic papers, technical documentation, and reference content, which helps it better understand and extract text from complex documents while preserving natural reading order.
Ready to transform your document processing? Try olmOCR today! Visit the GitHub repository to get started, or test it out with the online demo. Share your experiences and feedback in the comments below!