Nemotron OCR v2: NVIDIA's Free Multilingual OCR Model

Nemotron OCR v2 is a free, open-source, ultra-fast OCR model from NVIDIA that extracts text from images of documents, signs, charts, tables, and handwritten pages.

It currently supports six languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, and Russian) and is licensed under the NVIDIA Open Model License.

The model suits developers and teams that need OCR inside document ingestion, retrieval, RAG, or agent workflows.

It is most useful when a project needs structured OCR output, multilingual support, or local deployment on NVIDIA hardware.

Features

Detects text regions, transcribes them, and analyzes document layout and reading order in a single end-to-end pipeline.
Comes in two variants: v2_english for English-only word-level OCR (53.8M parameters) and v2_multilingual for six-language line-level OCR (83.9M parameters).
Processes both single images and batches with automatic multi-scale resizing.
Accepts RGB images in PNG or JPEG format, with float32 or uint8 pixel values.
Returns bounding box coordinates, recognized text strings, and per-region confidence scores for every detected text region.
Supports three aggregation levels for output: word, sentence, or paragraph.
Includes a detector-only inference mode that skips recognition and uses ~37% less GPU memory.
Includes a skip-relational mode that drops reading-order analysis and uses ~35% less GPU memory.
Runs on NVIDIA Ampere, Hopper, Lovelace, and Blackwell GPU architectures.

Use Cases

Feed scanned contracts, invoices, or forms into a RAG pipeline by extracting structured text and bounding boxes from each page.
Process multilingual product packaging, signage, or restaurant menus that mix Japanese, Chinese, Korean, Russian, and English text on a single image.
Convert charts, infographics, and tables in business reports into text for downstream search indexing or data analysis.
Run batch OCR across thousands of archival document scans at 34+ pages per second on a single GPU.
Extract handwritten notes from photographed pages for digitization workflows in research or education settings.

Benchmark Results

NVIDIA published benchmark numbers on two standard OCR evaluation datasets. All scores use Normalized Edit Distance (NED), where lower numbers mean better accuracy. Speed was measured on a single A100 GPU.
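To make the metric concrete, here is a minimal sketch of a normalized edit distance. It assumes the common definition (Levenshtein distance divided by the length of the longer string); the benchmarks' exact normalization is not stated here and may differ. The function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """ASSUMED definition: edit distance / length of the longer string."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, reference) / longest


print(normalized_edit_distance("Nemotron 0CR", "Nemotron OCR"))
```

Under this definition, one wrong character in a 12-character string gives 1/12, or about 0.083, which is why scores near 0.03 indicate very few character errors.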

OmniDocBench Results

This benchmark covers English, Chinese, and mixed-language documents across different backgrounds and text orientations.

Model	Pages/s	EN	ZH	Mixed	Normal	Rotate90	Rotate270
Nemotron OCR v2 (EN)	40.7	0.038	0.830	0.437	0.353	0.232	0.827
PaddleOCR v5 (server)	1.2	0.027	0.037	0.041	0.031	0.116	0.897
OpenOCR (server)	1.5	0.024	0.033	0.049	0.028	0.042	0.761
EasyOCR	0.4	0.095	0.117	0.326	0.110	0.987	0.979
Nemotron OCR v1	39.3	0.038	0.876	0.436	0.482	0.358	0.871

The multilingual v2 variant, which is not listed in the table above, is roughly 29x faster than PaddleOCR and 87x faster than EasyOCR. PaddleOCR and OpenOCR still post lower NED scores on English and Chinese text in normal orientation, so they remain more accurate per character in those categories.

Nemotron v2's advantage shows up mainly in speed. On rotated text the English-only row is mixed: at 270° it beats PaddleOCR and EasyOCR but trails OpenOCR, and at 90° both PaddleOCR and OpenOCR score lower.
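The speed gap is easier to judge as wall-clock time. The sketch below takes the pages-per-second figures from the OmniDocBench table and estimates how long a job would take; the 10,000-page job size is a hypothetical chosen for illustration, and real throughput will depend on page content and hardware.

```python
# Pages per second copied from the OmniDocBench table (single A100).
PAGES_PER_SECOND = {
    "Nemotron OCR v2 (EN)": 40.7,
    "Nemotron OCR v1": 39.3,
    "OpenOCR (server)": 1.5,
    "PaddleOCR v5 (server)": 1.2,
    "EasyOCR": 0.4,
}


def speedup(fast: str, slow: str) -> float:
    """Throughput ratio between two models in the table."""
    return PAGES_PER_SECOND[fast] / PAGES_PER_SECOND[slow]


def job_minutes(model: str, pages: int) -> float:
    """Estimated minutes to process `pages` at the benchmark throughput."""
    return pages / PAGES_PER_SECOND[model] / 60


JOB_PAGES = 10_000  # hypothetical job size
for model in PAGES_PER_SECOND:
    print(f"{model:24s} {job_minutes(model, JOB_PAGES):8.1f} min")

ratio = speedup("Nemotron OCR v2 (EN)", "PaddleOCR v5 (server)")
print(f"v2 (EN) vs PaddleOCR: {ratio:.1f}x")
```

The same arithmetic run backwards shows that 29x PaddleOCR (29 × 1.2) and 87x EasyOCR (87 × 0.4) both correspond to about 34.8 pages per second for the multilingual variant.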

SynthDoG Results

This synthetic benchmark tests per-language accuracy across six languages.

Language	Nemotron OCR v2 (multilingual)	PaddleOCR (specialized)	OpenOCR (server)	Nemotron OCR v1
English	0.069	0.096	0.105	0.078
Japanese	0.046	0.201	0.586	0.723
Korean	0.047	0.133	0.837	0.923
Russian	0.043	0.163	0.950	0.564
Chinese (Simplified)	0.035	0.054	0.061	0.784
Chinese (Traditional)	0.065	0.094	0.127	0.700

The multilingual variant posts the lowest NED for every language in this benchmark. The gap is widest for Japanese, Korean, and Russian: on Japanese, for example, it scores 0.046 compared to PaddleOCR's 0.201 and OpenOCR's 0.586.
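The per-language comparison can be checked directly from the table. This short sketch encodes the SynthDoG scores as listed and reports, for each language, which model has the lowest NED and how far behind the next-best model is. The variable names are illustrative.

```python
# NED scores copied from the SynthDoG table (lower is better).
MODELS = (
    "Nemotron OCR v2 (multilingual)",
    "PaddleOCR (specialized)",
    "OpenOCR (server)",
    "Nemotron OCR v1",
)
SCORES = {
    "English": (0.069, 0.096, 0.105, 0.078),
    "Japanese": (0.046, 0.201, 0.586, 0.723),
    "Korean": (0.047, 0.133, 0.837, 0.923),
    "Russian": (0.043, 0.163, 0.950, 0.564),
    "Chinese (Simplified)": (0.035, 0.054, 0.061, 0.784),
    "Chinese (Traditional)": (0.065, 0.094, 0.127, 0.700),
}


def best_model(scores: tuple) -> str:
    """Name of the model with the lowest NED in one table row."""
    return MODELS[scores.index(min(scores))]


for language, scores in SCORES.items():
    ours, runner_up = scores[0], min(scores[1:])
    print(f"{language:22s} best: {best_model(scores)}, "
          f"next-best NED is {runner_up / ours:.1f}x higher")
```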

How to Use Nemotron OCR v2

Table Of Contents

Try the Free Demo
System Requirements
Installation via pip
Installation via Docker
Running Inference
Constructor Parameters
Inference Modes
Model Architecture Reference
Input and Output Specification
Training Data

Try the Free Demo

The Hugging Face Spaces demo at huggingface.co/spaces/nvidia/nemotron-ocr-v2 lets you upload an image and get OCR results without any local setup.

System Requirements

Requirement	Details
Operating System	Linux amd64
GPU	NVIDIA GPU (Ampere, Hopper, Lovelace, or Blackwell)
CUDA Toolkit	Must match your PyTorch CUDA version (same major version)
Python	3.12 (requires `>=3.12,<3.13`)
Build Tools	GCC/G++ with C++17 support, CUDA headers, OpenMP
Runtime Engine	PyTorch

Supported GPU hardware: H100 PCIe/SXM, A100 PCIe/SXM, L40S, L4, A10G, H200 NVL, B200, and RTX PRO 6000 Blackwell Server Edition.
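Before installing, it can help to confirm that the environment matches the table above. The following is a minimal pre-flight sketch, not part of the Nemotron OCR package: the helper names are made up for illustration, and the GPU check only reports what PyTorch sees if PyTorch is already installed.

```python
import sys


def python_version_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter satisfies >=3.12,<3.13."""
    return (3, 12) <= tuple(version_info[:2]) < (3, 13)


def describe_gpu() -> str:
    """Report the first CUDA device PyTorch can see, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch found, but no CUDA device is visible"
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    return (f"{name} (compute capability {major}.{minor}, "
            f"PyTorch built for CUDA {torch.version.cuda})")


print("Python version OK:", python_version_ok())
print("GPU:", describe_gpu())
```

The CUDA version printed here is the one PyTorch was built against, which is the version your CUDA Toolkit needs to match at the major-version level.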

Installation via pip

Install git-lfs first, then clone the repository:

git lfs install
git clone https://huggingface.co/nvidia/nemotron-ocr-v2

Create a Python 3.12 environment and install PyTorch for your CUDA version:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

Install the Nemotron OCR package. The --no-build-isolation flag is required so the C++ CUDA extension compiles against your existing PyTorch:

cd nemotron-ocr
pip install --no-build-isolation -v .

Verify the installation:

python -c "from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2; print('OK')"

Installation via Docker

Confirm Docker can access your GPU:

docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.09-py3 nvidia-smi

Run the example directly from the repo root:

docker compose run --rm nemotron-ocr \
  bash -lc "python example.py ocr-example-input-1.png --merge-level paragraph"

This builds a container from the included Dockerfile (based on nvcr.io/nvidia/pytorch), mounts the repo at /workspace, and runs example.py. The multilingual model downloads from Hugging Face on first run. The annotated output is saved as <name>-annotated.<ext> alongside your input image.
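If you script around the example, you may want to predict where the annotated file will land. The helper below is a small sketch that mirrors the <name>-annotated.<ext> naming pattern described above. It is not taken from example.py, and `annotated_path` is a hypothetical name.

```python
from pathlib import Path


def annotated_path(input_path: str) -> Path:
    """Build <name>-annotated.<ext> next to the input image."""
    src = Path(input_path)
    return src.with_name(f"{src.stem}-annotated{src.suffix}")


print(annotated_path("ocr-example-input-1.png"))
```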

Running Inference

The main entry point is NemotronOCRV2 from nemotron_ocr.inference.pipeline_v2. The default behavior downloads and loads the multilingual variant.

from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2
# Default: multilingual v2
ocr = NemotronOCRV2()
predictions = ocr("ocr-example-input-1.png")
for pred in predictions:
    print(
        f"  - Text: '{pred['text']}', "
        f"Confidence: {pred['confidence']:.2f}, "
        f"Bbox: [left={pred['left']:.4f}, upper={pred['upper']:.4f}, "
        f"right={pred['right']:.4f}, lower={pred['lower']:.4f}]"
    )
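Downstream code usually needs to filter weak detections and map boxes back onto the image. The sketch below works on prediction dictionaries with the same keys as the loop above. It assumes the box coordinates are normalized to the [0, 1] range, which the four-decimal formatting suggests but this article does not confirm; check the model card before relying on it. The sample data is hand-written, not real model output.

```python
def to_pixels(pred: dict, width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a box to integer pixels, ASSUMING normalized [0, 1] coordinates."""
    return (
        round(pred["left"] * width),
        round(pred["upper"] * height),
        round(pred["right"] * width),
        round(pred["lower"] * height),
    )


def confident(predictions: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only predictions at or above the confidence threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


# Hand-written sample in the same shape as the predictions above.
sample = [
    {"text": "Invoice", "confidence": 0.98,
     "left": 0.10, "upper": 0.05, "right": 0.30, "lower": 0.09},
    {"text": "t0tal", "confidence": 0.31,
     "left": 0.10, "upper": 0.80, "right": 0.20, "lower": 0.84},
]

for pred in confident(sample, threshold=0.5):
    print(pred["text"], to_pixels(pred, width=1000, height=2000))
```

The 0.5 threshold is arbitrary. A sensible value depends on how costly a missed word is compared to a wrong one in your pipeline.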

Constructor Parameters

Parameter	Values	Effect
`lang=None` (default)	`None`, `"multi"`, `"multilingual"`	Loads v2 multilingual from Hugging Face Hub
`lang="en"`	`"en"`, `"english"`	Loads v2 English (word-level) from Hub
`lang="v1"`	`"v1"`, `"legacy"`	Loads v1 English-only model from `nvidia/nemotron-ocr-v1` for backward compatibility
`model_dir="./path"`	Local directory path	Loads from a local checkpoint folder containing `detector.pth`, `recognizer.pth`, `relational.pth`, and `charset.txt`. Overrides `lang` when the folder is complete
`detector_only=True`	Boolean	Runs the detector only. Returns bounding boxes with no text recognition. Uses ~37% less GPU memory and runs ~20% faster
`skip_relational=True`	Boolean	Skips the relational model. Returns per-word text with no reading-order grouping. Uses ~35% less GPU memory and runs ~8% faster
`verbose_post=True`	Boolean	Enables per-phase CUDA-synced timing in the logs (profiling mode). Requires `logging.basicConfig(level=logging.INFO)`

The model_dir parameter takes priority over lang. If you pass model_dir but the checkpoint folder is incomplete, loading falls back to Hub resolution using lang (which defaults to multilingual when set to None).
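The precedence rule is easier to follow as code. This is an illustrative sketch of the behavior described above, not the library's actual implementation: `resolve_source` and its return strings are hypothetical, and only the file names and `lang` aliases come from the parameter table.

```python
from pathlib import Path

REQUIRED_FILES = ("detector.pth", "recognizer.pth", "relational.pth", "charset.txt")
LANG_ALIASES = {
    None: "v2_multilingual", "multi": "v2_multilingual", "multilingual": "v2_multilingual",
    "en": "v2_english", "english": "v2_english",
    "v1": "v1", "legacy": "v1",
}


def resolve_source(model_dir=None, lang=None) -> str:
    """Return where weights would load from (hypothetical labels)."""
    if model_dir is not None:
        folder = Path(model_dir)
        # A complete local checkpoint folder wins over `lang`.
        if all((folder / name).is_file() for name in REQUIRED_FILES):
            return f"local:{folder}"
    # Missing or incomplete folder: fall back to Hub resolution via `lang`.
    if lang not in LANG_ALIASES:
        raise ValueError(f"unknown lang: {lang!r}")
    return f"hub:{LANG_ALIASES[lang]}"


print(resolve_source(lang="en"))
print(resolve_source(model_dir="./missing-checkpoint"))
```

One practical consequence: a typo in `model_dir` does not necessarily raise an error, so a run can silently use Hub weights instead of your local checkpoint.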

Inference Modes

Full pipeline (default): Detects text, recognizes it, and groups results by reading order. Each prediction returns text, confidence, left, right, upper, and lower.

Detector only (detector_only=True): Returns bounding boxes without running recognition. Each prediction returns confidence, left, right, upper, lower, and quad.

ocr_det = NemotronOCRV2(detector_only=True)
boxes = ocr_det("page.png")

Skip relational (skip_relational=True): Returns per-word text without grouping it into reading order. Call with merge_level="word" for word-level output.

ocr_fast = NemotronOCRV2(skip_relational=True)
words = ocr_fast("page.png", merge_level="word")

Profiling mode (verbose_post=True): Logs per-phase CUDA-synced timing.

import logging
logging.basicConfig(level=logging.INFO)
ocr_profile = NemotronOCRV2(verbose_post=True)
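Choosing a mode comes down to two questions: do you need the text, and do you need reading order? The helper below is a small sketch of that decision. `constructor_kwargs` is a hypothetical name, and the only real identifiers are the `detector_only` and `skip_relational` flags from the parameter table.

```python
def constructor_kwargs(need_text: bool, need_reading_order: bool) -> dict:
    """Map requirements to the constructor flags described above."""
    if not need_text:
        return {"detector_only": True}    # bounding boxes only
    if not need_reading_order:
        return {"skip_relational": True}  # per-word text, no grouping
    return {}                             # full pipeline (default)


kwargs = constructor_kwargs(need_text=True, need_reading_order=False)
print(kwargs)
# With the package installed, the result could be passed on as:
#   ocr = NemotronOCRV2(**kwargs)
```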

Model Architecture Reference

Both variants use a three-component architecture trained end-to-end:

Text Detector: A RegNetX-8GF convolutional backbone that localizes text regions in the image.
Text Recognizer: A pre-norm Transformer-based sequence model that transcribes detected regions.
Relational Model: A multi-layer global relational module that predicts reading order, logical groupings, and layout relationships across detected text elements.

Recognizer Spec Comparison

Spec	v2_english	v2_multilingual
Transformer layers	3	6
Hidden dimension (`d_model`)	256	512
FFN width (`dim_feedforward`)	1024	2048
Attention heads	8	8
Max sequence length	32	128
Character set size	855	14,244

Total Parameter Counts

Component	v2_english	v2_multilingual
Detector	45,445,259	45,445,259
Recognizer	6,130,657	36,119,598
Relational model	2,255,419	2,288,187
Total	53,831,335	83,853,044
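As a quick sanity check, the component counts can be summed and compared to the totals. The sketch below uses only the numbers from the table above.

```python
# Parameter counts copied from the table above.
COMPONENTS = {
    "v2_english": {
        "detector": 45_445_259,
        "recognizer": 6_130_657,
        "relational": 2_255_419,
    },
    "v2_multilingual": {
        "detector": 45_445_259,
        "recognizer": 36_119_598,
        "relational": 2_288_187,
    },
}

for variant, parts in COMPONENTS.items():
    total = sum(parts.values())
    shares = ", ".join(f"{name} {count / total:.1%}" for name, count in parts.items())
    print(f"{variant}: {total:,} parameters ({shares})")
```

The detector is identical in both variants, so almost all of the roughly 30 million extra parameters in the multilingual model sit in the recognizer.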

Input and Output Specification

Input:

Property	Value
Format	RGB image (PNG or JPEG), float32 or uint8
Dimensions	3 × H × W (single) or B × 3 × H × W (batch)
Pixel range	[0, 1] for float32 or [0, 255] for uint8 (auto-converted)
Aggregation levels	word, sentence, or paragraph
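A quick shape and dtype check before calling the model can catch the most common input mistakes, such as channels-last arrays. The sketch below validates against the input table above using plain Python values. `describe_input` is a hypothetical helper, not a function from the package.

```python
def describe_input(shape: tuple, dtype: str) -> dict:
    """Validate a tensor shape/dtype against the input table above."""
    if dtype not in ("float32", "uint8"):
        raise ValueError(f"unsupported dtype: {dtype}")
    if len(shape) == 3:
        batch, (channels, height, width) = 1, shape   # single image: 3 x H x W
    elif len(shape) == 4:
        batch, channels, height, width = shape        # batch: B x 3 x H x W
    else:
        raise ValueError(f"expected 3 or 4 dimensions, got {len(shape)}")
    if channels != 3:
        raise ValueError(f"expected 3 RGB channels first, got {channels}")
    pixel_range = (0.0, 1.0) if dtype == "float32" else (0, 255)
    return {"batch": batch, "height": height, "width": width,
            "pixel_range": pixel_range}


print(describe_input((3, 1024, 768), "uint8"))
print(describe_input((8, 3, 1024, 768), "float32"))
```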

Output:

Property	Value
Bounding boxes	1D list of coordinate tuples (floats)
Recognized text	1D list of strings
Confidence scores	1D list of floats
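Three parallel lists are awkward to pass around, so a common step is to zip them into one record per region. The sketch below assumes the three outputs are index-aligned, which the table implies but does not state. The record keys and the sample values are illustrative.

```python
def to_records(boxes: list, texts: list, confidences: list) -> list[dict]:
    """Combine index-aligned output lists into one dict per text region."""
    if not (len(boxes) == len(texts) == len(confidences)):
        raise ValueError("output lists must have equal length")
    return [
        {"bbox": tuple(box), "text": text, "confidence": conf}
        for box, text, conf in zip(boxes, texts, confidences)
    ]


# Hand-written sample values, not real model output.
records = to_records(
    boxes=[(0.10, 0.05, 0.30, 0.09), (0.10, 0.12, 0.55, 0.16)],
    texts=["Invoice", "No. 2024-001"],
    confidences=[0.98, 0.91],
)
for record in records:
    print(record)
```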

Training Data

The model was trained on approximately 12 million images: roughly 680,000 real-world images (scene text, charts, tables, handwritten pages, multilingual documents) and over 11 million synthetic rendered pages across six languages. Synthetic data includes historical document crops with degradation effects for archaic character support.

Pros

More than 20x faster than PaddleOCR and OpenOCR on the same hardware (a single A100 in NVIDIA's benchmark).
The multilingual variant handles six languages with a single model load.
Detector-only and skip-relational modes let you trade features for speed and memory savings on constrained hardware.
The relational model preserves reading order and document structure.

Cons

Runs only on NVIDIA GPUs with CUDA.
Linux-only. No Windows or macOS support.
Requires Python 3.12, a CUDA Toolkit that matches your PyTorch build, and a local C++17 build toolchain.

Related Resources

Hugging Face Model Page: Download model weights, read the full model card, and access both v2_english and v2_multilingual checkpoints.
NVIDIA Open Model License Agreement: Check commercial-use terms and redistribution rules before production rollout.
NVIDIA Build Platform: Access the model via NVIDIA’s hosted API endpoint.
OmniDocBench: Review the benchmark dataset used to evaluate Nemotron OCR v2 against other models.
SynthDoG: Explore the synthetic document generator used for the multilingual benchmark evaluation.
PyTorch Installation Guide: Match your PyTorch install to your CUDA toolkit version before installing Nemotron OCR.
