PDF OCR: Extract Text from Scanned Documents with Tesseract

What OCR Does

A scanned PDF is essentially a sequence of images. The text you see is pixels, not characters — you can't select, copy, or search it. OCR analyzes the image and classifies each glyph into a character, rebuilding a text layer that can be selected, copied, and indexed by search engines.

How Tesseract Works

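Tesseract is the most widely used open-source OCR engine. It was originally developed at HP, later developed with Google's sponsorship, and is now maintained by an open-source community. A simplified view of its pipeline: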

Binarization — convert to black/white to distinguish ink from background
Layout analysis — detect columns, paragraphs, and text lines
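Character segmentation — isolate individual glyphs (legacy engine only; the LSTM engine introduced in Tesseract 4 recognizes whole text lines without splitting them into glyphs)
Classification — map the segmented glyphs, or the whole text line in the LSTM engine, to characters using a trained model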
Post-processing — dictionary lookup and language model correction
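
To make the binarization step concrete, here is a minimal sketch of Otsu's method, a common global thresholding technique. It is an illustration in plain Python, not Tesseract's actual code; the function names otsu_threshold and binarize are made up for this example, and the input is assumed to be 8-bit grayscale values (0 = black, 255 = white).

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a flat list of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when the two groups are well separated.
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(rows, threshold):
    """Map each pixel to 0 (ink) or 255 (background)."""
    return [[0 if p <= threshold else 255 for p in row] for row in rows]


# A tiny 3x3 "scan": dark ink (about 25-40) on a light page (about 210-225).
page = [[220, 210, 35],
        [215, 30, 40],
        [225, 220, 25]]
threshold = otsu_threshold([p for row in page for p in row])
print(binarize(page, threshold))
```

A single global threshold like this works on evenly lit scans; pages with shadows usually need a locally adaptive threshold, which is one reason a clean background matters.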

Accuracy Tips

Resolution — scan at 300 DPI or higher. At lower resolutions small characters blur together and recognition errors increase.
Clean background — avoid shadows, creases, and stains in the scan
Font — printed serif and sans-serif fonts are most accurate. Handwriting and stylized fonts are harder.
Language selection — Tesseract is trained per language. Select the correct language for better accuracy.
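
A quick way to check the resolution tip is to work out the effective DPI of a page image from its pixel width and the physical page width. The sketch below is only arithmetic; the helper names effective_dpi and upscale_factor are hypothetical, and the 300 DPI target is the guideline from the list above.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Dots per inch implied by an image width and the physical page width."""
    return pixel_width / page_width_inches


def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Factor by which to resample the image to reach the target DPI (never below 1)."""
    return max(1.0, target_dpi / current_dpi)


# US Letter paper is 8.5 inches wide.
print(effective_dpi(2550, 8.5))                   # a 2550 px wide scan of a Letter page
print(upscale_factor(effective_dpi(1275, 8.5)))   # factor needed for a 1275 px wide scan
```

Resampling cannot restore detail the scanner never captured, so rescanning at a higher resolution is preferable when the original document is available.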

Supported Languages

Tesseract supports 100+ languages including English, Spanish, French, German, Chinese (Simplified/Traditional), Japanese, Arabic, Hindi, and more. Each language needs its own trained data file (for example eng.traineddata for English), which must be loaded before recognition; in-browser tools typically include the most common ones.
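
With the Tesseract command-line tool, the language is chosen with the -l option, and several languages can be combined with a plus sign. The sketch below only builds the argument list and does not run anything; the LANG_CODES table and the tesseract_command helper are illustrative names for this example, and the file names are placeholders.

```python
# Language codes used in Tesseract's trained data file names.
LANG_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Arabic": "ara",
    "Hindi": "hin",
}


def tesseract_command(image_path, output_base, languages):
    """Build a `tesseract` argument list for one page image.

    `languages` is a list of names from LANG_CODES; an unknown name
    raises KeyError rather than silently falling back to a default.
    """
    codes = [LANG_CODES[name] for name in languages]
    return ["tesseract", image_path, output_base, "-l", "+".join(codes)]


# A page that mixes English and German text.
print(tesseract_command("page.png", "page", ["English", "German"]))
```

The resulting list could be passed to subprocess.run on a machine where Tesseract and the matching language data are installed.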

Output Formats

Plain text — extracted text, copy-pasteable
Searchable PDF — original scanned image with an invisible text layer added (best for archiving)
hOCR — HTML with word bounding boxes (for developers)
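
For developers, the word boxes in hOCR output can be read with an ordinary HTML parser. The sketch below assumes the usual hOCR layout, where each word is a span with class ocrx_word and a title attribute such as "bbox 36 92 96 116; x_wconf 95". The sample markup is hand-written for illustration, and the names HocrWordParser, parse_bbox, and extract_words are made up for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Pull (x0, y0, x1, y1) out of an hOCR title string, or return None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Words without a parsable bbox are skipped.
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = (
    "<span class='ocr_line' title='bbox 36 92 210 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>"
    "</span>"
)
for text, box in extract_words(sample):
    print(text, box)
```

The coordinates are pixel positions in the page image, which is what makes it possible to highlight a search hit on top of the original scan.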

Run OCR on a PDF

Open ToolsVito's PDF OCR, select your scanned PDF and its language, and extract the text. Tesseract.js runs entirely in your browser, so the file is never uploaded to a server.