olmOCR is an open-source toolkit built for converting PDFs and other documents into plain text at scale. This isn’t your average OCR tool; it’s designed to handle complex documents (including tables, equations, and handwritten content) while maintaining the natural reading order.
Traditional OCR tools often disrupt reading order or mishandle non-standard elements, creating barriers for researchers, developers, and data analysts working with large document volumes. olmOCR addresses this by using a unique prompting technique to increase accuracy and reduce hallucinations. It has been meticulously trained on academic papers, technical documentation, and reference content.
Features
- High-throughput conversion of PDFs and documents to plain text
- Natural reading order preservation for more intuitive text output
- Support for complex elements including tables, equations, and handwriting
- Specialized training on academic papers and technical documentation
- Unique prompting technique to enhance accuracy and reduce hallucinations
- Full toolkit deployment option on your own GPUs
- Cost-effective processing at approximately $190 USD per million pages
- Multi-node/cluster support for processing millions of documents
- AWS S3 integration for coordinated large-scale document processing
- Beaker integration for efficient PDF linearization
- Comprehensive filtering by language and SEO spam removal
- Fine-tuning capabilities for Qwen2-VL and Molmo-O models
- Side-by-side evaluation toolkit for comparing pipeline versions
- Dolma viewer for examining processed PDF documents
Use Cases
- Research Data Extraction: A researcher can quickly convert a collection of scanned research papers into text for analysis.
- Technical Manual Processing: An engineer can transform complex technical manuals into easily searchable text formats.
- Large-Scale Document Archiving: Organizations with vast archives of scanned documents can use olmOCR to make their content searchable and accessible.
- Data Analysis and Mining: Data scientists can feed the extracted text into analysis tools to uncover insights.
- Content Creation from Legacy Documents: A content creator can repurpose old reports or documents by extracting and refreshing the content.
Installation
1. Ensure you have the required hardware:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100)
- 30GB of free disk space
2. Install necessary system dependencies (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
3. Set up a conda environment and install olmOCR:
conda create -n olmocr python=3.11
conda activate olmocr
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
4. Install sglang with flashinfer for GPU inference:
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
Basic Usage
For quick testing without local setup, you can use the web demo. To run locally with GPU support:
- Convert a single PDF:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
- Convert multiple PDFs:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
- View results:
- Check the extracted text in JSONL format:
cat localworkspace/results/output_*.jsonl
- View results side-by-side with the original PDFs:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Then open the generated HTML file (e.g., ./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html) in your browser.
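If you want to post-process the results in your own scripts rather than through the viewer, the output files are Dolma-style JSONL, one JSON record per line. Here is a minimal sketch of reading them in Python; the `text` field name is an assumption based on the Dolma document format, so inspect one line of your own output to confirm the schema:

```python
import glob
import json

def read_dolma_jsonl(pattern):
    """Yield the extracted text from each Dolma-style JSONL record.

    Assumes each line is a JSON object with a "text" field, as in the
    Dolma format; check your actual output to confirm the schema.
    """
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record.get("text", "")

# Preview the first 200 characters of each converted document.
for text in read_dolma_jsonl("localworkspace/results/output_*.jsonl"):
    print(text[:200])
```

This is just a starting point for feeding the text into downstream analysis tools.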
Advanced Usage
For processing millions of PDFs across multiple nodes:
1. Set up a work queue in AWS S3 on your first worker node:
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
2. Add subsequent worker nodes to process from the same queue:
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
3. For Ai2 users, use Beaker for efficient processing:
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
Pros
- Accuracy: Superior text extraction, even from difficult documents.
- Scalability: Handles large volumes of documents efficiently.
- Open Source: Full control and customization options.
- Cost-Effective: Low processing cost per page.
Cons
- GPU Requirement: Needs a relatively powerful NVIDIA GPU.
- Technical Setup: Requires some technical knowledge to install and configure.
- English Focus: Primarily trained on English documents.
Pricing
olmOCR is free to use as an open-source tool. The estimated cost of $190 per million pages refers to the infrastructure costs (GPU usage) when you deploy it yourself.
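To put that figure in perspective, a quick back-of-the-envelope sketch of what the quoted rate implies at different volumes (the $190-per-million-pages number is the project's estimate; your actual GPU costs will vary with hardware and cloud pricing):

```python
# Estimated self-hosted GPU cost quoted for olmOCR: ~$190 USD per million pages.
COST_PER_MILLION_PAGES = 190.0

def estimated_cost(pages: int) -> float:
    """Rough infrastructure cost in USD for a given page count."""
    return pages * COST_PER_MILLION_PAGES / 1_000_000

print(estimated_cost(1))          # a single page costs well under a cent
print(estimated_cost(100_000))    # 100,000 pages: about $19
print(estimated_cost(1_000_000))  # one million pages: $190
```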
Related Resources
- GitHub Repository: https://github.com/allenai/olmocr
- Web Demo: https://olmocr.allenai.org/
- sglang: https://github.com/sgl-project/sglang
- Dolma: https://github.com/allenai/dolma
- flashinfer: https://github.com/flashinfer-ai/flashinfer
FAQs
Q: What types of GPUs are supported?
A: olmOCR has been tested on NVIDIA RTX 4090, L40S, A100, and H100 GPUs.
Q: How do I view the extracted text?
A: The extracted text is stored in Dolma-style JSONL format. You can view it directly or use the dolmaviewer command for a side-by-side comparison with the original PDF.
Q: Is there a way to convert multiple PDF files at once with olmOCR?
A: Yes, you can specify a directory or use glob patterns to process multiple PDFs in a single command.
Q: How does olmOCR improve text extraction accuracy?
A: olmOCR uses a unique prompting technique to increase accuracy and decrease hallucinations. It has been specifically trained on academic papers, technical documentation, and reference content, which helps it better understand and extract text from complex documents while preserving natural reading order.
Ready to transform your document processing? Try olmOCR today! Visit the GitHub repository to get started, or test it out with the online demo. Share your experiences and feedback in the comments below!