CONVERTING SCANNED PDF TO TEXT MADE SIMPLER BY PYTHON OCRmyPDF
Joseph Okiro BSc., MSc., PMP®, PMI-ACP®, CISSP, CCSP, Smartsheet Proj. Mgmt. Certified.
Digital Transformation | Innovation | Cyber Security | Program & Strategy Delivery | ICT & Data Management | Financial Inclusion | Cloud Architect | ICT Consultant
OCRmyPDF is among the most feature-rich and thoroughly tested command-line tools for OCR conversion of PDFs.
OCRmyPDF is a Python 3 application and library that adds OCR layers to PDFs. OCRmyPDF uses Tesseract, the best available open-source OCR engine, to perform OCR.
OCRmyPDF is limited by what the Tesseract OCR engine, the PDF specification, and Ghostscript can do.
By default, OCRmyPDF produces archival PDFs (PDF/A), a stricter subset of PDF features designed for long-term archiving. If regular PDFs are desired, this can be disabled with the --output-type pdf option.
If a page in a PDF seems to have text, by default OCRmyPDF will exit without modifying the PDF. This is to ensure that PDFs that were previously OCRed or were “born digital” rather than scanned are not processed.
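The default behaviour described above can be detected from a script. The following is a minimal sketch, assuming that ocrmypdf is on the PATH and that it reports "file already contains text" with exit code 6 (check the documentation of your installed version). The function name and the runner parameter are hypothetical and exist only to make the sketch easy to try out.

```python
import subprocess

# Assumption: exit code 6 means the input already appears to contain text.
ALREADY_HAS_TEXT = 6


def run_default_ocr(src, dst, runner=subprocess.run):
    """Run ocrmypdf with default options and summarise what happened."""
    result = runner(["ocrmypdf", src, dst])
    if result.returncode == 0:
        return "ocr_added"
    if result.returncode == ALREADY_HAS_TEXT:
        # OCRmyPDF left the file alone because it found existing text.
        return "already_has_text"
    return f"failed ({result.returncode})"
```

Calling run_default_ocr("input.pdf", "output.pdf") on a born-digital PDF would be expected to return "already_has_text", at which point a script could retry with --skip-text.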
Some options for controlling OCR
--skip-text
No OCR will be performed on pages that already have text.
--output-type pdf
Adds an OCR layer and outputs a standard PDF instead of PDF/A.
--rotate-pages
OCRmyPDF will attempt to automatically correct the rotation of each page. This can help fix a scanning job that contains a mix of landscape and portrait pages.
-l
OCRmyPDF assumes the document is in English unless told otherwise. OCR quality may be poor if the wrong language is used. Example: -l eng+fra (English and French).
--deskew
Corrects document skew (a crooked scan).
--pages
Tells OCRmyPDF to apply OCR only to certain pages, for example --pages 2,3,13-17. Hyphens denote a range of pages and commas separate page numbers.
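To make the page syntax concrete, here is a small illustration of how a specification such as 2,3,13-17 expands into individual page numbers. It is not OCRmyPDF's own parser; expand_pages is a hypothetical helper written only to show the comma and hyphen rules.

```python
def expand_pages(spec):
    """Expand a page specification such as '2,3,13-17' into page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A hyphen denotes an inclusive range of pages.
            first, last = part.split("-", 1)
            pages.update(range(int(first), int(last) + 1))
        else:
            # A bare number is a single page.
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("2,3,13-17"))  # [2, 3, 13, 14, 15, 16, 17]
```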
--optimize
Controls how the output file is optimized. Optimization is performed even if no OCR text is found.
--optimize 0: Disables optimization.
--optimize 1: Enables lossless optimizations, such as transcoding images to more efficient formats.
--optimize 2: All of the above, and enables lossy optimizations and color quantization.
--optimize 3: All of the above, and enables more aggressive optimizations and targets lower image quality.
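The options listed above can be combined in a single command. The sketch below assembles such a command as an argument list in Python; build_ocrmypdf_command and its parameter names are hypothetical, and only the flags described in this article are covered.

```python
def build_ocrmypdf_command(src, dst, *, skip_text=False, output_type=None,
                           rotate_pages=False, deskew=False, languages=(),
                           pages=None, optimize=None):
    """Return an ocrmypdf argument list built from the options above."""
    cmd = ["ocrmypdf"]
    if skip_text:
        cmd.append("--skip-text")
    if output_type is not None:
        cmd += ["--output-type", output_type]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if languages:
        # Several languages are joined with "+", e.g. eng+fra.
        cmd += ["-l", "+".join(languages)]
    if pages is not None:
        cmd += ["--pages", pages]
    if optimize is not None:
        if optimize not in (0, 1, 2, 3):
            raise ValueError("optimize must be 0, 1, 2 or 3")
        cmd += ["--optimize", str(optimize)]
    cmd += [src, dst]
    return cmd


cmd = build_ocrmypdf_command("input.pdf", "output.pdf", skip_text=True,
                             languages=("eng", "fra"), optimize=1)
print(" ".join(cmd))
# ocrmypdf --skip-text -l eng+fra --optimize 1 input.pdf output.pdf
```

With ocrmypdf installed, the list can be passed to subprocess.run(cmd, check=True) from the standard library to perform the conversion.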
Example conversion using Jupyter Notebook (Anaconda)
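# The import is not needed by the shell command below; it only confirms the package is installed.
import ocrmypdf
# "!" runs a shell command from a notebook cell. --clean requires the unpaper program to be installed.
!ocrmypdf --skip-text --deskew --rotate-pages --clean --optimize 0 input.pdf output.pdf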
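The same conversion can also be sketched with the Python API instead of a shell command. This assumes that ocrmypdf.ocr accepts keyword arguments mirroring the command-line flags (skip_text, deskew, rotate_pages, clean, optimize); the names OCR_OPTIONS and ocr_file are hypothetical.

```python
# Keyword arguments mirroring the command-line flags used above.
OCR_OPTIONS = {
    "skip_text": True,     # --skip-text
    "deskew": True,        # --deskew
    "rotate_pages": True,  # --rotate-pages
    "clean": True,         # --clean (needs the unpaper program)
    "optimize": 0,         # --optimize 0
}


def ocr_file(src, dst):
    """Add an OCR layer to src and write the result to dst."""
    # Imported here so the sketch can be read and loaded without the package.
    import ocrmypdf

    return ocrmypdf.ocr(src, dst, **OCR_OPTIONS)
```

Calling ocr_file("input.pdf", "output.pdf") should then produce the same result as the notebook command. In a standalone script, the call is best placed under an if __name__ == "__main__": guard, because OCRmyPDF may start worker processes.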
The resulting output.pdf can then be processed by any PDF-to-text library.
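As one possible illustration, the sketch below reads the OCR text layer back out of output.pdf. It assumes the pypdf package, which this article does not name; any other PDF-to-text library would serve equally well, and pdf_to_text is a hypothetical helper.

```python
def pdf_to_text(path):
    """Return the text of every page in the PDF, joined by newlines."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Fall back to an empty string for pages where no text is found.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

For example, text = pdf_to_text("output.pdf") gives a single string that can be searched or saved to a .txt file.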
If the PDF contains tables, you can use Camelot or Tabula to extract them from output.pdf for further processing.
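A minimal sketch of the table step with Camelot follows. It assumes that Camelot is installed and that camelot.read_pdf returns table objects exposing a pandas DataFrame as .df; extract_tables is a hypothetical helper, and the page selection is only an example.

```python
def extract_tables(path, pages="1"):
    """Return the tables found on the given pages as a list of DataFrames."""
    # Assumption: the camelot package is installed, with its dependencies.
    import camelot

    tables = camelot.read_pdf(path, pages=pages)
    # Each detected table exposes its contents as a pandas DataFrame.
    return [table.df for table in tables]
```

Each returned DataFrame can then be cleaned or exported, for example with its to_csv method.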