What OCR Does
A scanned PDF is essentially a sequence of page images. The text you see is pixels, not characters, so you can't select, copy, or search it. OCR (optical character recognition) analyzes each image, recognizes the characters in it, and rebuilds a text layer that can be selected, copied, and indexed by search engines.
How Tesseract Works
Tesseract is one of the most widely used open-source OCR engines. It was originally developed at Hewlett-Packard, open-sourced in 2005, and then developed for many years with Google's sponsorship. Its pipeline:
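- Binarization: convert the page to black and white to separate ink from background
- Layout analysis: detect columns, paragraphs, and text lines
- Segmentation: isolate the units to be recognized (whole text lines for the LSTM engine, individual glyphs for the legacy engine)
- Recognition: map those units to characters. The LSTM engine, the default since Tesseract 4, reads each text line as a sequence, while the legacy engine classifies one glyph at a time
- Post-processing: dictionary lookup and language-model correction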
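To make the first step concrete, here is a minimal sketch of binarization using Otsu's method, a common global thresholding technique. It is an illustration in plain Python, not Tesseract's own code; the function names otsu_threshold and binarize and the tiny sample image are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that best separates dark from light pixels.

    Otsu's method tries every level and keeps the one that maximizes the
    between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Turn rows of grey values into rows of 1 (ink) and 0 (background)."""
    flat = [p for row in image for p in row]
    threshold = otsu_threshold(flat)
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Hypothetical 3x3 greyscale patch: dark strokes on a light page.
sample = [
    [250, 245, 30],
    [240, 20, 25],
    [255, 250, 35],
]
print(binarize(sample))  # [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
```

A single global threshold like this works well on evenly lit scans. Pages with shadows or uneven lighting usually need a locally adaptive threshold instead, which is one reason a clean scan matters.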
Accuracy Tips
- Resolution: scan at 300 DPI or higher. At lower resolutions, small characters span too few pixels to be recognized reliably.
- Clean background: avoid shadows, creases, and stains in the scan
- Font: clean printed serif and sans-serif type is recognized most accurately. Handwriting and stylized fonts are much harder.
- Language selection: Tesseract's models are trained per language, so select the language the document is actually written in.
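The resolution tip can be checked with simple arithmetic: effective DPI is the image width in pixels divided by the physical page width in inches. The sketch below assumes you know the page size. The function names and the 8.5-inch US Letter width are illustrative choices, not part of any tool mentioned here.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Dots per inch implied by an image width and the physical page width."""
    return pixel_width / page_width_inches


def upscale_factor(pixel_width, page_width_inches, target_dpi=300):
    """How much to enlarge the image to reach target_dpi (never below 1.0)."""
    dpi = effective_dpi(pixel_width, page_width_inches)
    return max(1.0, target_dpi / dpi)


# A US Letter page (8.5 in wide) captured 1275 pixels wide:
print(effective_dpi(1275, 8.5))   # 150.0
print(upscale_factor(1275, 8.5))  # 2.0
```

Enlarging a low-resolution image does not restore detail that was never captured, so rescanning at a higher resolution is the better fix whenever the paper original is available.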
Supported Languages
Tesseract supports 100+ languages, including English, Spanish, French, German, Chinese (Simplified/Traditional), Japanese, Arabic, and Hindi. Each language needs its own trained data file, which must be loaded before recognition; in-browser tools typically include the most common ones.
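Tesseract identifies languages by short codes such as eng or chi_sim, and several codes can be combined with a plus sign for mixed-language documents. The sketch below maps the languages named above to their codes. The helper names language_argument and traineddata_files are hypothetical, introduced only for this example.

```python
# Tesseract language codes for the languages listed above.
LANGUAGE_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Arabic": "ara",
    "Hindi": "hin",
}


def language_codes(names):
    """Translate display names to codes, keeping order and dropping repeats."""
    codes = []
    for name in names:
        if name not in LANGUAGE_CODES:
            raise ValueError(f"no language code known for {name!r}")
        code = LANGUAGE_CODES[name]
        if code not in codes:
            codes.append(code)
    if not codes:
        raise ValueError("select at least one language")
    return codes


def language_argument(names):
    """Build the combined language string, e.g. 'eng+deu'."""
    return "+".join(language_codes(names))


def traineddata_files(names):
    """List the data files that must be available for the selected languages."""
    return [f"{code}.traineddata" for code in language_codes(names)]


print(language_argument(["English", "German"]))  # eng+deu
print(traineddata_files(["English", "German"]))  # ['eng.traineddata', 'deu.traineddata']
```

Selecting only the languages that actually appear in the document tends to work better than enabling many at once, since each extra language adds data to load and more candidate characters to confuse.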
Output Formats
- Plain text: the extracted text only, ready to copy and paste
- Searchable PDF: the original scanned image with an invisible text layer added (best for archiving)
- hOCR: HTML that records a bounding box for each recognized word (for developers)
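For developers working with the HTML output, the sketch below shows one way to pull words and their bounding boxes out of it using only Python's standard library. It assumes the common convention in which each word is a span with class ocrx_word and a title attribute of the form "bbox x0 y0 x1 y1; x_wconf N". The sample markup is hand-written for illustration, not real engine output.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from a title such as 'bbox 36 92 96 116; x_wconf 95'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every span marked as a word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # set while we are inside a word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in the assumed format.
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 109 92 181 116; x_wconf 93'>world</span>
</span>
"""

for text, bbox in extract_words(SAMPLE):
    print(text, bbox)
```

The coordinates are pixel positions in the scanned image, which is what makes it possible to highlight a search hit on the page or to position an invisible text layer over the scan.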
Run OCR on a PDF
Open ToolsVito's PDF OCR, select your scanned PDF, choose the language, and extract the text. Tesseract.js runs entirely in your browser, so the file never leaves your device.
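For readers who would rather run Tesseract locally, the sketch below assembles a command line for the tesseract program. It only builds the argument list and does not run anything. The function name and the file names are hypothetical, and it assumes the page has already been rendered to an image, since the command-line tool reads images rather than PDFs.

```python
def tesseract_command(image_path, output_base, languages="eng", dpi=300,
                      formats=("txt", "pdf")):
    """Build an argument list for the tesseract command-line program.

    formats selects the outputs to write: 'txt' (plain text),
    'pdf' (searchable PDF), and 'hocr' (HTML with bounding boxes).
    """
    allowed = {"txt", "pdf", "hocr"}
    unknown = [f for f in formats if f not in allowed]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    return [
        "tesseract", image_path, output_base,
        "-l", languages,      # language code(s), e.g. "eng" or "eng+deu"
        "--dpi", str(dpi),    # resolution hint for images without DPI metadata
        *formats,
    ]


# Hypothetical file names; the list could be passed to subprocess.run().
print(tesseract_command("page-1.png", "page-1", languages="eng+deu",
                        formats=("txt", "pdf", "hocr")))
```

With these arguments the program would write page-1.txt, page-1.pdf, and page-1.hocr next to each other, one file per requested output format.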