
PDF OCR: Extract Text from Scanned Documents with Tesseract

Recognize and extract text from scanned PDFs and images using OCR. Learn how Tesseract works, accuracy tips, supported languages, and in-browser OCR without uploads.

ToolsVito Team

What OCR Does

A scanned PDF is essentially a sequence of page images. The text you see is pixels, not characters, so you can't select, copy, or search it. OCR (optical character recognition) analyzes each image, recognizes the characters in it, and rebuilds a text layer that can be selected, copied, and searched.

How Tesseract Works

Tesseract is a widely used open-source OCR engine, originally developed at HP and later sponsored by Google; it is now maintained as a community open-source project. Its pipeline, in simplified form:

  1. Binarization — convert to black/white to distinguish ink from background
  2. Layout analysis — detect columns, paragraphs, and text lines
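  3. Segmentation — split each text line into words (the legacy engine also isolates individual glyphs)
  4. Recognition — the LSTM model (Tesseract 4 and later) reads each text line as a sequence of characters; the legacy engine instead classifies glyphs one at a time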
  5. Post-processing — dictionary lookup and language model correction
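
To make step 1 concrete, here is a minimal sketch of global thresholding with Otsu's method, a common way to binarize a grayscale scan. It is an illustration in plain Python, not Tesseract's actual code, and the function names and the tiny pixel list are made up for the example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark ink from light paper."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_dark = 0   # number of pixels at or below the candidate level
    sum_dark = 0      # sum of grey values of those pixels
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue          # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are well separated.
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_variance:
            best_variance = between
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map each pixel to 1 (ink) if it is at or below the threshold, else 0 (paper)."""
    return [1 if value <= threshold else 0 for value in pixels]


# Hypothetical 8-pixel strip: four dark (ink) values and four light (paper) values.
strip = [20, 30, 25, 200, 210, 220, 215, 30]
threshold = otsu_threshold(strip)
mask = binarize(strip, threshold)
```

A single global threshold works well on evenly lit scans; pages with shadows or uneven lighting usually need an adaptive (local) threshold instead.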

Accuracy Tips

  • Resolution — scan at 300 DPI or higher. At lower resolutions, small characters span too few pixels to be recognized reliably.
  • Clean background — avoid shadows, creases, and stains in the scan
  • Font — printed serif and sans-serif fonts are recognized most accurately. Handwriting and stylized fonts are harder.
  • Language selection — Tesseract is trained per language. Select the correct language for better accuracy.
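
The resolution tip can be checked before running OCR. The sketch below estimates the effective DPI of a page image from its pixel dimensions and the physical page size, then compares it with the 300 DPI guideline. The function names and the US Letter page size (8.5 × 11 inches) are assumptions chosen for the example.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Estimate scan resolution as pixels per inch, using the weaker of the two axes."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def meets_guideline(dpi, minimum_dpi=300):
    """Return True if the scan reaches the recommended minimum resolution."""
    return dpi >= minimum_dpi


# A US Letter page (8.5 x 11 in) scanned to 1275 x 1650 pixels is only 150 DPI.
low = effective_dpi(1275, 1650, 8.5, 11)
# The same page at 2550 x 3300 pixels reaches 300 DPI.
good = effective_dpi(2550, 3300, 8.5, 11)

print(low, meets_guideline(low))
print(good, meets_guideline(good))
```

If a scan falls short, rescanning at a higher resolution is generally more reliable than upscaling the image, since upscaling cannot restore detail that was never captured.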

Supported Languages

Tesseract supports 100+ languages, including English, Spanish, French, German, Chinese (Simplified/Traditional), Japanese, Arabic, and Hindi. Each language needs its own trained data file, which must be loaded before recognition; in-browser tools typically bundle or download the most common ones.
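
Tesseract identifies each language by a short code that matches its trained data file, and several languages can be combined with "+". As a rough sketch, the helper below maps the languages named above to their codes and assembles a command line for the tesseract CLI. The helper names and file names are hypothetical, and the command is only built here, not executed. Note that the CLI reads images, so the pages of a scanned PDF would need to be rendered to images first.

```python
# Tesseract language codes for the languages mentioned above.
LANGUAGE_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Arabic": "ara",
    "Hindi": "hin",
}


def language_argument(names):
    """Turn language names into the value for Tesseract's -l option, e.g. 'eng+deu'."""
    unknown = [name for name in names if name not in LANGUAGE_CODES]
    if unknown:
        raise ValueError(f"No language code known for: {', '.join(unknown)}")
    return "+".join(LANGUAGE_CODES[name] for name in names)


def build_command(image_path, output_base, names, output_format="txt"):
    """Build (but do not run) a tesseract command; output_format is txt, pdf, or hocr."""
    return ["tesseract", image_path, output_base,
            "-l", language_argument(names), output_format]


# Hypothetical file names: a page image with mixed English and German text.
command = build_command("page.png", "page_out", ["English", "German"], "pdf")
print(" ".join(command))
```

Passing the resulting list to `subprocess.run` would execute it on a machine where Tesseract and the matching language data are installed.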

Output Formats

  • Plain text — extracted text, copy-pasteable
  • Searchable PDF — original scanned image with an invisible text layer added (best for archiving)
  • hOCR — HTML with word bounding boxes (for developers)
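
For developers, hOCR output stores each recognized word in a span whose title attribute holds its bounding box. The sketch below pulls words and boxes out of a small hand-written hOCR fragment using only Python's standard library. The fragment and the class and function names are illustrative assumptions; real output contains more elements and attributes.

```python
import re
from html.parser import HTMLParser

# Hand-written fragment in the style of hOCR output (not produced by a real run).
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 109 92 235 116; x_wconf 93'>world</span>
</span>
"""


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None      # bounding box of the word currently being read
        self._parts = []       # text fragments of that word

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._parts = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Only a word span that is currently open gets closed here;
        # the enclosing line span ends while no word is open and is ignored.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._parts).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


words = extract_words(SAMPLE_HOCR)
for text, bbox in words:
    print(text, bbox)
```

The bounding boxes are in image pixel coordinates, which is what makes hOCR useful for highlighting search hits or placing an invisible text layer over the scan.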

Run OCR on a PDF

Open ToolsVito's PDF OCR, select your scanned PDF, choose the language, and extract the text. Tesseract.js runs entirely in your browser, so nothing is uploaded.
