How to Copy Text From a PDF That Won't Let You Select It
Copy text from a PDF that blocks selection by diagnosing image-based pages, copy restrictions, and OCR options.
PDFs that don't allow text selection fall into two categories: scanned image PDFs (no text layer exists) and copy-protected PDFs (a text layer exists but is locked). For image-based PDFs, OCR is the solution. For copy-protected PDFs, re-printing the file, switching viewers, or screenshotting pages and running OCR usually works, though unlocking intentionally protected PDFs may not be appropriate depending on context.
The starting point is diagnosis. The two types require different approaches, and diagnosing them takes about 10 seconds.
How to Tell Which Type of PDF You Have
Open the PDF in any viewer and try to select a word by clicking and dragging across it.
If you can highlight characters but can't copy them: The PDF has a text layer. The document creator applied copy-restriction permissions. The text exists in the file — you just can't extract it through the standard selection mechanism. This is a copy-protected PDF with an existing text layer.
If you can't select anything at all — the cursor shows a crosshair, a hand, or an arrow rather than a text I-beam: The pages are image files. There is no text layer. The document was either scanned from paper or exported from a system that renders documents as flat images. This is an image-based PDF.
The distinction matters because the solutions are different. For image-based PDFs, OCR is the only way to extract the text. For copy-protected PDFs with existing text layers, there are approaches that don't involve OCR at all.
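The cursor test can be cross-checked against the file itself. The sketch below uses a different and cruder technique than the one described above: it scans the raw bytes for a few PDF name tokens. It assumes fonts and image objects appear uncompressed in the file, which is often not the case for newer PDFs that pack objects into compressed streams, so treat the result as a hint rather than a verdict. The function name `classify_pdf_bytes` and its return labels are made up for illustration.

```python
def classify_pdf_bytes(data: bytes) -> str:
    """Rough guess at the PDF type from raw bytes (a heuristic, not a parser)."""
    has_encrypt = b"/Encrypt" in data   # a security handler is declared
    has_fonts = b"/Font" in data        # fonts usually imply a text layer
    has_images = b"/Image" in data      # image objects (/Subtype /Image)

    if has_encrypt:
        # Content is encrypted, so fonts and images cannot be inspected here.
        return "encrypted-or-restricted"
    if has_fonts:
        return "text-layer"
    if has_images:
        return "image-only"
    return "unknown"
```

To try it on a real file, read the bytes with `pathlib.Path(...).read_bytes()` and pass them in. A scanned PDF that already went through OCR will report "text-layer", because the OCR step embeds fonts.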
For Image-Based PDFs: OCR Method
This is the more common case. Many older documents, faxed materials, scanned contracts, and officially filed government documents are image-based PDFs.
Step 1: Convert pages to images
You need the pages as image files before you can run OCR on them. Options:
- Take screenshots of each page in your PDF viewer. Zoom the page to fill your screen to maximize resolution. Use Win+Shift+S (Windows), Cmd+Shift+4 (macOS), or your system's screenshot tool.
- Use your PDF viewer's export function if it has one. Adobe Acrobat and many PDF tools can export pages as PNG or JPEG directly.
- On macOS, Preview can export pages: File > Export, then choose PNG or JPEG.
For a multi-page document: decide whether you need all pages or just specific sections before committing to the full screenshot-per-page process. A 40-page contract where you need two specific clauses requires only 2 screenshots, not 40.
Step 2: Run OCR
Load the page images into the OCR tool. The tool runs Tesseract.js entirely in your browser, so the images are not sent to a server. This matters for confidential documents.
Process pages one at a time. If you need the text from multiple pages in sequence, process them in order and assemble the results.
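Assembling results in order is easy to get wrong when the screenshots are named by page, because a plain alphabetical sort puts page 10 before page 2. A minimal sketch, assuming each file name ends with its page number (names like `page-2.png` are hypothetical) and that you have already collected the OCR text for each file:

```python
import re


def page_number(filename: str) -> int:
    """Return the last run of digits in the file name as the page number."""
    digits = re.findall(r"\d+", filename)
    if not digits:
        raise ValueError(f"no page number in {filename!r}")
    return int(digits[-1])


def assemble_pages(page_texts: dict[str, str]) -> str:
    """Join per-page OCR text in numeric page order, one blank line between pages."""
    ordered = sorted(page_texts, key=page_number)
    return "\n\n".join(page_texts[name].strip() for name in ordered)
```

Calling `assemble_pages({"page-10.png": "...", "page-2.png": "..."})` places the text from page 2 first.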
Step 3: Review and clean up the output
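OCR on a scanned PDF typically achieves 95-99% accuracy for standard printed documents. That sounds high until you work out what 1% means: on a 500-word page it is about 5 wrong words if accuracy is measured per word, or roughly 30 wrong characters if it is measured per character (a 500-word page runs to about 3,000 characters). Either is enough to corrupt critical details in a contract: a name, a date, or a number.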
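A quick way to check that arithmetic for other page lengths and accuracy figures. The count depends on whether accuracy is measured per word or per character, so the sketch reports both. The figure of 6 characters per word (including the trailing space) is an assumption for English text, not a number from the OCR tool.

```python
def expected_errors(words: int, accuracy: float, chars_per_word: float = 6.0) -> dict[str, int]:
    """Estimate error counts for a page under two readings of 'accuracy'."""
    error_rate = 1.0 - accuracy
    return {
        # Accuracy measured per word: each wrong word counts once.
        "if_word_level": round(words * error_rate),
        # Accuracy measured per character: far more units, so more errors.
        "if_char_level": round(words * chars_per_word * error_rate),
    }


print(expected_errors(500, 0.99))  # {'if_word_level': 5, 'if_char_level': 30}
```

At the low end of the range, `expected_errors(500, 0.95)` gives 25 and 150 respectively.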
Review the output against the source page. The errors are systematic, not random:
- "0" (zero) misread as "O" (letter O) — common in financial documents with lots of numbers
- "1" misread as "l" or "I" — common in document IDs, addresses, part numbers
- Tables lose their structure: OCR reads left-to-right across rows, but the relationship between columns and headers is lost
- Hyphenated line breaks from typeset text appear as literal hyphens mid-word
For documents where precision matters — legal contracts, medical records, financial statements — review the full output before using it.
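Because these errors follow patterns, part of the review can be narrowed down mechanically. The sketch below is one possible helper, not part of the OCR tool: it rejoins words split by a hyphen at a line end, and it lists tokens where a digit sits next to "O", "l", or "I" so you know where to look first. Both rules are assumptions about what the output looks like. The first will wrongly merge a genuine hyphenated compound that happens to break across lines, and the second only flags candidates without correcting them.

```python
import re

# A digit directly next to a letter that OCR commonly confuses with 0 or 1.
CONFUSABLE = re.compile(r"\d[OlI]|[OlI]\d")


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words split as 'implemen-' + newline + 'tation'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def flag_suspicious_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens worth checking against the source page."""
    return [tok for tok in text.split() if CONFUSABLE.search(tok)]


raw = "Total due: 1O0 dollars for implemen-\ntation, ref INV-2024, unit l5."
cleaned = join_hyphenated_breaks(raw)
print(flag_suspicious_tokens(cleaned))  # ['1O0', 'l5.']
```

The flagged list is a starting point for the manual review, not a replacement for it.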
For Copy-Protected PDFs: Three Approaches
Copy-protected PDFs have a text layer that's flagged as non-extractable. The restriction is a set of permission flags that the viewer is expected to honor. The file opens without a password, so the viewer can still read and render the text. Several approaches work around this.
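To make "permission flags" concrete: a minimal sketch, assuming the layout used by the standard security handler in the PDF specification, where the flags are packed into one signed 32-bit integer (the `/P` entry) and bits are numbered from 1 at the low-order end. The bit positions below follow that specification. The function and dictionary names are invented for illustration, and the sketch only decodes a value you already have rather than reading it from a file.

```python
# Bit positions (1 = lowest-order bit) for the most common user permissions.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,       # copy or otherwise extract text and graphics
    "annotate": 6,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Decode a /P permissions integer into named yes/no flags."""
    # Python's >> and & treat negative ints as two's complement, so the
    # signed value can be used directly.
    return {name: bool((p >> (bit - 1)) & 1) for name, bit in PERMISSION_BITS.items()}


print(decode_permissions(-1852))
# {'print': True, 'modify': False, 'copy': False, 'annotate': False}
```

A value of -1852 is a typical "printing allowed, copying not allowed" setting, which is the situation this section describes.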
Approach 1: Print to PDF
Most operating systems and PDF viewers support printing to a virtual PDF printer. The process re-renders the document as a new PDF, and the resulting file typically has copy restrictions stripped because the new PDF is freshly generated by the print system.
On Windows: Open the PDF, press Ctrl+P, select "Microsoft Print to PDF" as the printer, and print. Open the resulting file — text selection usually works.
On macOS: Open the PDF in Preview or Chrome, press Cmd+P, click the PDF dropdown in the bottom-left of the print dialog, select "Save as PDF." Open the result and test selection.
This does not involve cracking a password; it re-renders the document. The permission flags live in the original file's security settings, and the print-to-PDF process creates a new document that inherits the rendered content but not those flags. If the original also restricts printing, a viewer that honors the flags will block this approach too.
Approach 2: Try a Different PDF Viewer
Different PDF viewers enforce permission flags with different levels of strictness. Adobe Acrobat tends to enforce copy restrictions. Google Chrome's built-in PDF viewer often allows text selection even when Acrobat doesn't, because Chrome's renderer applies its own interpretation of how strictly to honor the flags.
Open the PDF in Chrome (drag and drop onto a Chrome tab or use File > Open) and try selecting text. A substantial portion of copy-protected PDFs that Acrobat blocks are selectable in Chrome.
Approach 3: Screenshot + OCR
If the above approaches don't work, screenshot and OCR works regardless of PDF protection type. You're extracting text from the visual rendering, not from the PDF's text layer. The protection flags don't affect what the renderer displays on screen.
Screenshot each page (or just the pages you need), run them through the OCR tool, and review the output. OCR accuracy on copy-protected PDFs is very high because they were created from digital text, not from a physical scan, so the on-screen rendering is crisp, clean, and well-suited for OCR.
What You Can't Do
Open a password-locked PDF with OCR. If the PDF requires a password to open the file at all, you can't see the content, so OCR can't help. OCR extracts text from visible content. If the content isn't visible, there's nothing to extract.
Extract text directly from a PDF with DRM (Digital Rights Management). Some PDFs, particularly from content publishers, use DRM that goes beyond permission flags. The content may be encrypted so that the PDF viewer renders it but the underlying data isn't accessible, which rules out direct extraction. Screenshots plus OCR can still work technically (you're extracting from the visual rendering), but taking screenshots of DRM-protected content may violate the terms of the content license. That is a legal question, not a technical one.
The OCR Accuracy Edge Case: Tables
Tables are the most common accuracy problem in scanned PDFs. Tesseract reads in reading order: left to right across the full page width, then down. A table with three columns is read as three elements on the same horizontal line, then three elements on the next line, and so on. Values keep their left-to-right order within each row, but nothing in the output ties a value to its column header.
If you're extracting a table from a scanned PDF, you'll get the right numbers but in a flat list. Reconstructing the table means either manually reformatting the output or using a tool that supports structured table OCR specifically. For financial statements, data tables, and forms, expect to do some reformatting.
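When the flat list is complete and you know the column headers, the reformatting can be scripted. A minimal sketch, assuming every cell was recognized as exactly one item and no cell is missing. Real OCR output often breaks that assumption (a cell with two words becomes two items), which is why the function refuses uneven input instead of guessing. The function name and sample values are illustrative only.

```python
def rebuild_table(flat_cells: list[str], headers: list[str]) -> list[dict[str, str]]:
    """Regroup a row-by-row flat list of cells into records keyed by header."""
    width = len(headers)
    if width == 0 or len(flat_cells) % width != 0:
        raise ValueError("cell count does not divide evenly into rows")
    # Slice the flat list into consecutive chunks of one row each.
    rows = [flat_cells[i:i + width] for i in range(0, len(flat_cells), width)]
    return [dict(zip(headers, row)) for row in rows]


cells = ["Q1", "100", "90", "Q2", "120", "95"]
table = rebuild_table(cells, ["Quarter", "Revenue", "Cost"])
print(table[1])  # {'Quarter': 'Q2', 'Revenue': '120', 'Cost': '95'}
```

If the function raises, compare the item count against the source table to find the cell that was split or dropped.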
Multi-column page layouts (newsletter-style two-column text, academic papers) have the same issue. Tesseract may read column 1, line 1 — then column 2, line 1 — then column 1, line 2 — rather than completing column 1 before moving to column 2. Check column order carefully in the output.
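If the output does come back interleaved in exactly that pattern, the order can be restored by taking every second fragment. A sketch under a strong assumption: the fragments alternate strictly between columns, one fragment per column per line. Output that alternates only in places needs manual sorting. The function name and the fragment labels are hypothetical.

```python
def deinterleave_columns(fragments: list[str], columns: int = 2) -> list[str]:
    """Reorder line fragments so each column is read top to bottom in turn."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    ordered: list[str] = []
    for col in range(columns):
        # Every `columns`-th fragment, starting at this column's offset.
        ordered.extend(fragments[col::columns])
    return ordered


mixed = ["A1", "B1", "A2", "B2", "A3", "B3"]
print(deinterleave_columns(mixed))  # ['A1', 'A2', 'A3', 'B1', 'B2', 'B3']
```

Splitting the OCR output into per-column fragments is the hard part and is left to manual work here.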
Privacy and Confidentiality
The OCR tool on instantly.tools runs entirely in your browser. Images are processed by Tesseract.js locally — nothing is transmitted to a server. This is the relevant fact when you're extracting text from a confidential document: a contract, a medical record, financial statements, an NDA.
For standard printed text in common languages, Tesseract.js is accurate enough for everyday extraction work, provided you review the output as described above. Local processing is the correct choice for confidential documents.