CONVERTING SCANNED PDF TO TEXT MADE SIMPLER BY PYTHON OCRmyPDF

OCRmyPDF is a feature-rich, thoroughly tested command-line tool for adding OCR to scanned PDFs.

OCRmyPDF is a Python 3 application and library that adds OCR layers to PDFs. OCRmyPDF uses Tesseract, the best available open-source OCR engine, to perform OCR.

What OCRmyPDF can do is bounded by the Tesseract OCR engine, the PDF specification, and Ghostscript.

By default, OCRmyPDF produces archival PDFs (PDF/A), a stricter subset of PDF designed for long-term archiving. If regular PDFs are desired, this can be disabled with the --output-type pdf option.

If a page in a PDF seems to have text, by default OCRmyPDF will exit without modifying the PDF. This is to ensure that PDFs that were previously OCRed or were “born digital” rather than scanned are not processed.

Some options for controlling OCR

--skip-text

No OCR will be performed on pages that already have text.  

--output-type pdf

Adds an OCR layer and outputs a standard PDF instead of PDF/A.

--rotate-pages

OCRmyPDF will attempt to automatically correct the rotation of each page. This can help fix a scanning job that contains a mix of landscape and portrait pages.

-l

Sets the document language. OCRmyPDF assumes the document is in English unless told otherwise, and OCR quality may be poor if the wrong language is used. Multiple languages are joined with a plus sign, for example: -l eng+fra

--deskew

Corrects document skew (a crooked scan).

--pages

Tells OCRmyPDF to apply OCR only to certain pages, for example --pages 2,3,13-17. Commas separate page numbers and hyphens denote a range of pages.

--optimize

Controls how aggressively the output file is optimized. Optimization is performed even if no OCR text is found.

--optimize 0: Disables optimization.

--optimize 1: Enables lossless optimizations, such as transcoding images to more efficient formats.  

--optimize 2: All of the above, and enables lossy optimizations and color quantization.

--optimize 3: All of the above, and enables more aggressive optimizations and targets lower image quality.
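To make the --pages syntax described above concrete, here is a minimal sketch that expands a page specification into individual page numbers. The function name parse_page_spec is hypothetical and this is not OCRmyPDF's own parser; it only illustrates how commas and hyphens in a value such as 2,3,13-17 are read.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "2,3,13-17" into a sorted list of page numbers.

    Illustrative helper only (hypothetical name); it mirrors the syntax
    described in the article, not OCRmyPDF's internal implementation.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray trailing comma
        if "-" in part:
            # A hyphen denotes an inclusive range of pages.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            # A bare number is a single page.
            pages.add(int(part))
    return sorted(pages)


print(parse_page_spec("2,3,13-17"))  # [2, 3, 13, 14, 15, 16, 17]
```

A helper like this can be handy for checking, before a long OCR run, that a page specification covers the pages you expect.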

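Example conversion using Jupyter Notebook (Anaconda)

# Optional: confirms that the ocrmypdf package is installed in this environment.
import ocrmypdf

# The leading "!" runs a shell command from a notebook cell. Note that --clean requires unpaper to be installed.
!ocrmypdf --skip-text --deskew --rotate-pages --clean --optimize 0 input.pdf output.pdf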
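Outside a notebook, the same command line can be assembled and run from a plain Python script. The sketch below is one possible approach, assuming the ocrmypdf executable is on your PATH; the helper names build_ocrmypdf_command and run_ocrmypdf are hypothetical, and only flags mentioned in this article are used.

```python
import subprocess
from typing import Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    *,
    skip_text: bool = False,
    deskew: bool = False,
    rotate_pages: bool = False,
    clean: bool = False,
    optimize: Optional[int] = None,
    languages: Sequence[str] = (),
    pages: Optional[str] = None,
    output_type: Optional[str] = None,
) -> list[str]:
    """Build an argument list for the ocrmypdf CLI (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if skip_text:
        cmd.append("--skip-text")
    if deskew:
        cmd.append("--deskew")
    if rotate_pages:
        cmd.append("--rotate-pages")
    if clean:
        cmd.append("--clean")  # requires unpaper to be installed
    if optimize is not None:
        if optimize not in (0, 1, 2, 3):
            raise ValueError("optimize must be 0, 1, 2 or 3")
        cmd += ["--optimize", str(optimize)]
    if languages:
        # Multiple languages are joined with "+", e.g. eng+fra.
        cmd += ["-l", "+".join(languages)]
    if pages:
        cmd += ["--pages", pages]
    if output_type:
        cmd += ["--output-type", output_type]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocrmypdf(cmd: Sequence[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(list(cmd), check=True)


command = build_ocrmypdf_command(
    "input.pdf",
    "output.pdf",
    skip_text=True,
    deskew=True,
    rotate_pages=True,
    clean=True,
    optimize=0,
)
print(" ".join(command))
```

Calling run_ocrmypdf(command) would then perform the conversion. Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.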

The resulting output.pdf can then be processed by any PDF-to-text library.

If the PDF contains tables, you can use Camelot or Tabula to extract them from output.pdf for further processing.