How to Extract Text from a Scanned PDF — OCR Guide

A scanned PDF is a photograph of a document. The text you see on screen is just pixels — you can't click on it, search it, copy it, or edit it. OCR (Optical Character Recognition) is the technology that reads those pixels and converts them into real, selectable text. Whether you're dealing with a bank statement, a government letter, an old contract, or a faxed invoice, this guide shows you exactly how to make it searchable — entirely in your browser.

What Is OCR and How Does It Work?

OCR stands for Optical Character Recognition. At its core, it's a machine learning process that analyses the shapes of characters in an image and maps them to the corresponding letters, numbers, and symbols in a character set. Modern OCR engines like Tesseract (which powers Rifix's OCR tool) are trained on millions of document images and can recognise text in dozens of languages, even in documents with mixed fonts or slightly skewed alignment.

When you run OCR on a scanned PDF, the tool adds a hidden text layer behind the visual image. The scan looks identical, but the text is now selectable, searchable, and copyable. The original image quality is preserved — OCR doesn't modify the visible content of the PDF.


📌 When You Definitely Need OCR

You need OCR if: you try to select text in the PDF and can't, Ctrl+F (search) returns no results, copying text pastes nothing or gibberish, or the PDF was created by scanning a physical document or fax rather than by exporting from software.
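The checks above can be approximated in code. The following is a minimal sketch, assuming you already have the text that some PDF text extractor returned for a page (getting that text is outside this example). The function name needs_ocr and both thresholds are illustrative assumptions, not part of Rifix.

```python
def needs_ocr(page_text: str, min_chars: int = 20,
              min_readable_ratio: float = 0.7) -> bool:
    """Guess whether a page lacks a usable text layer.

    page_text is whatever a text extractor returned for the page.
    The thresholds are illustrative and should be tuned on real files.
    """
    stripped = "".join(page_text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        # Nothing (or almost nothing) to select or search: likely a pure scan.
        return True
    # Count characters that look like ordinary text rather than gibberish.
    readable = sum(
        ch.isalnum() or ch in ".,;:!?'\"()-%$/&@" for ch in stripped
    )
    return readable / len(stripped) < min_readable_ratio


if __name__ == "__main__":
    print(needs_ocr(""))  # empty text layer
    print(needs_ocr("Invoice total: 1,250.00 due by 30 June 2024"))
```

An empty or nearly empty result suggests a scan with no text layer, while a page full of unreadable symbols suggests a broken text layer. Either case is a candidate for OCR.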

Before You Run OCR — Clean Up the Scan First

OCR accuracy depends heavily on scan quality. A crisp, high-contrast scan of black text on white paper can achieve 98–99% character accuracy. A grey, slightly blurry, or skewed scan might drop to 85–90%, which means roughly one character in every seven to ten is wrong, enough to make the output unreliable for important documents.
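To see what these percentages mean in practice, here is a small worked calculation. The figure of 2,000 characters per page is an assumption chosen for illustration; the accuracy values are the ones quoted above.

```python
def expected_errors(char_count: int, accuracy: float) -> int:
    """Expected number of wrong characters at a given character accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(char_count * (1 - accuracy))


if __name__ == "__main__":
    chars_per_page = 2000  # assumed size of a dense page of text
    for accuracy in (0.99, 0.98, 0.90, 0.85):
        errors = expected_errors(chars_per_page, accuracy)
        print(f"{accuracy:.0%} accuracy: about {errors} wrong characters per page")
```

Under that assumption, a page at 99% accuracy has about 20 wrong characters, while a page at 85% has about 300, which is why cleanup before OCR matters for important documents.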

Before running OCR on a poor-quality scan, use Rifix Scan Cleanup to remove grey backgrounds, boost contrast, and sharpen text. This step alone can dramatically improve OCR accuracy on difficult documents. Then run the cleaned version through OCR.

For best scan quality at the source: scan at 300 DPI minimum (600 DPI for small print), ensure good lighting if photographing with a phone, and keep the document flat to avoid distortion at the edges.
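The DPI figures translate directly into pixel dimensions. The sketch below assumes an A4 page (210 × 297 mm) and shows how many pixels a scan needs at a given DPI, plus how to estimate the effective DPI of a phone photo. The function names are illustrative.

```python
MM_PER_INCH = 25.4


def pixels_for_page(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions needed to capture a page at the given DPI."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def effective_dpi(pixels_across: int, page_width_mm: float) -> float:
    """DPI actually achieved when a page of this width fills the image."""
    return pixels_across * MM_PER_INCH / page_width_mm


if __name__ == "__main__":
    print(pixels_for_page(210, 297, 300))  # A4 at the 300 DPI minimum
    print(pixels_for_page(210, 297, 600))  # A4 at 600 DPI for small print
    # A phone photo 3000 pixels wide where the A4 page fills the frame:
    print(round(effective_dpi(3000, 210)))
```

If the page only fills part of a phone photo, the effective DPI is lower than this calculation suggests, so fill the frame with the document where possible.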

Step-by-Step: Run OCR on a Scanned PDF with Rifix

1. Open Rifix OCR Scan in your browser on any device.
2. Load your scanned PDF. It stays entirely on your device, with no upload to any server.
3. Select your document language. Accuracy improves when the OCR engine knows what language to expect, especially for non-Latin scripts.
4. Click Run OCR. Processing time depends on the number of pages and your device speed, typically 2–10 seconds per page.
5. Review the output. The OCR layer is added to the PDF, keeping your original scan intact with searchable text underneath.
6. Download the searchable PDF.

What to Do with the OCR Output

Once OCR has run, your options expand significantly:

Search the document — Use Ctrl+F (or Cmd+F on Mac) in any PDF viewer to find specific words or phrases instantly across hundreds of pages.
Copy text to clipboard — Select and copy passages to paste into emails, reports, or other documents.
Convert to editable Word document — Use PDF to Word on the OCR'd PDF to get an editable DOCX with the extracted text formatted as a document.
Extract plain text — Use PDF to Text to pull all the text content out as a .txt file, useful for data processing, archiving, or importing into other systems.

💡 Accuracy Expectations by Document Type

Laser-printed text on white paper: 98–99% accuracy — very reliable.

Inkjet-printed documents: 95–98% — occasional errors on thin strokes.

Handwritten text: 60–80% — OCR is not designed for handwriting; results are unreliable.

Old faxes or photocopies: 80–90% — quality varies significantly by original resolution.

OCR for Multi-Page Documents

OCR works page by page — the accuracy on each page is independent of the others. If your document has some clearly printed pages and some poor-quality pages, the good pages will OCR accurately while the poorer pages may have more errors. You can check quality by selecting text on a processed page and comparing it visually to what's printed.
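If a visual comparison is not precise enough, you can measure accuracy on a sample. The sketch below assumes you have typed out a short passage by hand as a reference and copied the same passage from the OCR'd PDF. It computes character accuracy as one minus the edit distance divided by the reference length, which is a common definition but not necessarily the one behind the figures in this guide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Share of reference characters the OCR got right (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1 - edit_distance(reference, ocr_text) / len(reference))


if __name__ == "__main__":
    # A typical OCR slip: the digit 1 read as a lowercase L.
    print(f"{char_accuracy('Total due: 100', 'Total due: l00'):.1%}")
```

Running this on one sample paragraph from a good page and one from a poor page gives a quick sense of which pages need a rescan or cleanup.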

For very long documents (50+ pages), the processing time in a browser can be significant. Break the PDF into smaller sections using Rifix Split PDF, OCR each section, then merge them back into one searchable document. This also allows you to prioritise the sections you need most urgently.
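Planning the split can be sketched as follows. The chunk size is an assumption to tune for your device, and the time estimate simply applies the 2–10 seconds per page range mentioned in the step-by-step section. The function names are illustrative.

```python
def split_ranges(total_pages: int, chunk_size: int = 25) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) page ranges covering the document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [(first, min(first + chunk_size - 1, total_pages))
            for first in range(1, total_pages + 1, chunk_size)]


def estimate_seconds(pages: int, low: float = 2.0, high: float = 10.0) -> tuple[float, float]:
    """Rough processing time range, assuming 2-10 seconds per page."""
    return pages * low, pages * high


if __name__ == "__main__":
    for first, last in split_ranges(120, chunk_size=50):
        fastest, slowest = estimate_seconds(last - first + 1)
        print(f"pages {first}-{last}: roughly {fastest:.0f}-{slowest:.0f} seconds")
```

The ranges can be entered into a split tool one by one, and the most urgent section can be processed first.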

Privacy: Why Browser-Based OCR Matters

Scanned documents are often the most sensitive type — pay stubs, medical records, bank statements, legal letters, and identity documents. Running OCR through a cloud service means uploading those documents to someone else's server. Rifix's OCR runs entirely in your browser using the Tesseract.js engine. The characters in your bank statement are read by JavaScript running on your own device, not on a remote server in another country. For sensitive documents, this distinction matters.

The Challenge of Scanned PDF Text

A scanned PDF is fundamentally different from a digitally created PDF. When a document is printed and then scanned, the result is a PDF where each page is a photograph of paper — the "text" you see is actually pixel patterns that happen to look like letters, not real character data. You cannot select this text, copy it, or search for it. From a software perspective, the document contains no words at all — only an image. To extract readable, editable text from a scanned PDF, you need OCR (Optical Character Recognition) — software that analyses the image and attempts to identify and transcribe the characters it sees.


How OCR Works

Modern OCR uses deep learning models trained on millions of documents in many languages. The process runs in stages. First, the image is analysed to detect text regions (areas of the page containing characters, as opposed to margins, graphics, or empty space). Within each text region, individual characters are identified and classified by comparing pixel patterns to trained character models. Words are then assembled from character sequences, and lines and paragraphs are formed from word groups. Finally, the entire text is reconstructed as a machine-readable layer. OCR accuracy depends on scan quality: a clean, straight, high-resolution scan of printed text can achieve 98–99% character accuracy or better, while a low-resolution, skewed scan of handwritten notes may achieve only 60–70%.
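The later stages of that process, forming lines from recognised words, can be illustrated with a toy example. This is a simplified sketch, not how Tesseract implements layout analysis. The Word class, its fields, and the line_tolerance value are hypothetical and stand in for the positions an OCR engine would report.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word, in pixels
    y: float  # vertical centre of the word, in pixels


def assemble_lines(words: list[Word], line_tolerance: float = 10.0) -> list[str]:
    """Group words into lines by vertical position, then order left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Same line if the word sits close to the first word of the current line.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]


if __name__ == "__main__":
    recognised = [
        Word("world", 60, 101), Word("Hello", 10, 100),
        Word("Second", 10, 130), Word("line", 70, 129),
    ]
    print(assemble_lines(recognised))
```

A skewed scan breaks the assumption that words on one line share a similar vertical position, which is one reason deskewing before OCR improves results.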

Extracting Text from Scanned PDFs at rifix.xyz

Open rifix.xyz/ocr and load your scanned PDF; the file is processed in your browser rather than sent to a server. The OCR tool processes each page and creates a searchable PDF with a text layer added beneath the visible scan image. Download the result. You now have a PDF that looks identical to the original scan but contains real text data that can be selected, copied, and searched. To check it, open the OCR'd PDF and use Ctrl+F or Cmd+F to search for a word: if OCR succeeded, the word is found and highlighted. Select text and copy it to the clipboard for use in another document. The visible scan image is unchanged; the OCR text is added as an invisible, searchable layer.

Converting OCR Output to Word

If you need the extracted text in Word format for editing rather than just searchable PDF, run the OCR first, then convert the resulting searchable PDF to Word at rifix.xyz/pdf2word. This two-step process — OCR to create searchable PDF, then PDF to Word conversion — produces editable Word text from a scanned document. The accuracy of the final Word document depends on the OCR accuracy in step one. For a high-quality scan of cleanly printed text, the resulting Word document typically requires only minor correction. For poor quality scans or complex layouts, manual correction of OCR errors may be significant — compare the Word output against the original scan page by page.

Improving OCR Accuracy

Several factors affect OCR accuracy and can be improved before running OCR. Scan at higher resolution: 300 DPI is the minimum for reliable OCR, and 400–600 DPI produces better results for documents with small text. Use black and white (not colour or greyscale) for text-only documents, since this increases contrast and helps OCR identify character boundaries. Ensure pages are straight, because skewed text reduces accuracy significantly. If your scan shows these issues, use rifix.xyz/scanclean to deskew (straighten) pages and improve contrast before running OCR. For very important documents where accuracy is critical, re-scan at higher resolution rather than trying to improve a low-quality scan.
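Converting a greyscale scan to black and white means choosing a brightness threshold. One standard way to pick it automatically is Otsu's method, sketched below on a flat list of 0–255 pixel values. This illustrates the general technique only; the source does not say how Rifix's cleanup tool works internally, and the function names are illustrative.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Pick the threshold that best separates dark and light pixels (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no dark pixels yet at this threshold
        if light_count == 0:
            break     # every pixel is already in the dark class
        mean_dark = dark_sum / dark_count
        mean_light = (sum_all - dark_sum) / light_count
        # Between-class variance: larger means a cleaner split.
        between = dark_count * light_count * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Dark ink (30-40) on a greyish background (200-220).
    scan = [30] * 50 + [40] * 50 + [200] * 100 + [220] * 100
    t = otsu_threshold(scan)
    print(t, sorted(set(binarize(scan, t))))
```

A single global threshold works well for evenly lit scans. Pages with shadows or uneven lighting, such as phone photos, usually need a more local approach or a rescan.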

Languages and Character Sets

OCR tools are trained on specific languages. English is supported by all mainstream OCR engines. Major European languages (French, German, Spanish, Italian, Portuguese) are widely supported. Arabic, Chinese, Japanese, Korean, and other non-Latin scripts are supported by advanced OCR engines including those used at rifix.xyz. Mixed-language documents — a French contract with English annexes, or an international document with sections in multiple languages — are more challenging and may produce lower accuracy in the less-dominant language sections. For non-Latin script documents, verify OCR accuracy carefully by comparing a sample section of extracted text against the original.
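Tesseract identifies languages by short codes and accepts several at once joined with a plus sign (for example fra+eng), which is one way to handle mixed-language documents. The sketch below builds such a string from plain language names. The mapping covers only the languages named above, the function name is illustrative, and whether Rifix's interface exposes multi-language selection is not stated here, so treat that as an assumption.

```python
# Tesseract language codes for the languages mentioned in this section.
TESSERACT_CODES = {
    "english": "eng",
    "french": "fra",
    "german": "deu",
    "spanish": "spa",
    "italian": "ita",
    "portuguese": "por",
    "arabic": "ara",
    "chinese (simplified)": "chi_sim",
    "japanese": "jpn",
    "korean": "kor",
}


def language_spec(languages: list[str]) -> str:
    """Build a Tesseract language string, dominant language first."""
    codes: list[str] = []
    for name in languages:
        key = name.strip().lower()
        if key not in TESSERACT_CODES:
            raise ValueError(f"no code known for language: {name}")
        code = TESSERACT_CODES[key]
        if code not in codes:  # skip duplicates, keep the original order
            codes.append(code)
    return "+".join(codes)


if __name__ == "__main__":
    # A French contract with English annexes.
    print(language_spec(["French", "English"]))
```

Listing the dominant language first is a reasonable default, and adding languages that do not appear in the document tends to slow recognition without improving it.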

Handwritten Text

Handwritten text recognition (sometimes called ICR — Intelligent Character Recognition) is significantly more challenging than printed text OCR. Casual handwriting with varied letter sizes, connected characters, and non-standard letterforms challenges even advanced recognition systems. Modern AI-based handwriting recognition (like that in Microsoft Lens or Google's Document AI) handles neat handwriting reasonably well but still produces errors on unusual letterforms, cursive script, and any non-standard writing. For handwritten forms where the handwriting is neat and in block capitals, OCR accuracy may be acceptable. For cursive personal handwriting, manual transcription is typically faster than correcting OCR output.

Nowsath Rifaya · Founder, Rifix PDF Editor

Operations professional based in Singapore. Built Rifix to solve a real work problem — handling confidential PDF documents without uploading them to unknown servers. Writes from direct experience using these tools daily.
