
CONVERTING SCANNED PDF TO TEXT MADE SIMPLER BY PYTHON OCRmyPDF

Joseph Okiro BSc., MSc., PMP®, PMI-ACP®, CISSP, CCSP, Smartsheet Proj. Mgmt. Certified.

Digital Transformation | Innovation | Cyber Security | Program & Strategy Delivery | ICT & Data Management | Financial Inclusion | Cloud Architect | ICT Consultant

Published: May 2, 2021

OCRmyPDF is a feature-rich and thoroughly tested command-line tool for adding OCR to PDFs.

OCRmyPDF is a Python 3 application and library that adds OCR layers to PDFs. OCRmyPDF uses Tesseract, the best available open-source OCR engine, to perform OCR.

OCRmyPDF is limited by the Tesseract OCR engine, the PDF specification, and Ghostscript limitations.

By default, OCRmyPDF produces archival PDFs in the PDF/A format, a stricter subset of PDF features designed for long-term archiving. If a regular PDF is desired, PDF/A output can be disabled with the --output-type pdf option.

If a page in a PDF seems to have text, by default OCRmyPDF will exit without modifying the PDF. This is to ensure that PDFs that were previously OCRed or were “born digital” rather than scanned are not processed.

Some options for controlling OCR

--skip-text

No OCR will be performed on pages that already have text.

--output-type pdf

Adds an OCR layer and outputs a standard PDF instead of PDF/A.

--rotate-pages

OCRmyPDF will attempt to automatically correct the rotation of each page. This can help fix a scanning job that contains a mix of landscape and portrait pages.

-l

OCRmyPDF assumes the document is in English unless told otherwise. OCR quality may be poor if the wrong language is used. Several languages can be combined with a plus sign, for example: -l eng+fra

--deskew

Corrects document skew (a crooked scan).

--pages

Tells OCRmyPDF to apply OCR only to certain pages, for example --pages 2,3,13-17. Hyphens denote a range of pages and commas separate page numbers.

--optimize

Controls optimization of the output file. Optimization is performed even if no OCR text is found.

--optimize 0: Disables optimization.

--optimize 1: Enables lossless optimizations, such as transcoding images to more efficient formats.

--optimize 2: All of the above, and enables lossy optimizations and color quantization.

--optimize 3: All of the above, and enables more aggressive optimizations and targets lower image quality.
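To show how the options above combine on one command line, here is a minimal sketch in Python. The helper name build_ocrmypdf_command and its parameter names are hypothetical and chosen for illustration only; the flags it emits are the ones listed above. It only assembles the argument list and does not run OCRmyPDF itself.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, *, skip_text=False,
                           deskew=False, rotate_pages=False, languages=None,
                           pages=None, optimize=None, output_type=None):
    """Assemble an ocrmypdf argument list (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if skip_text:
        cmd.append("--skip-text")
    if deskew:
        cmd.append("--deskew")
    if rotate_pages:
        cmd.append("--rotate-pages")
    if languages:
        # Several languages are joined with "+", e.g. eng+fra.
        cmd += ["-l", "+".join(languages)]
    if pages:
        # Page spec such as "2,3,13-17" is passed through unchanged.
        cmd += ["--pages", pages]
    if optimize is not None:
        if optimize not in (0, 1, 2, 3):
            raise ValueError("optimize must be 0, 1, 2 or 3")
        cmd += ["--optimize", str(optimize)]
    if output_type:
        cmd += ["--output-type", output_type]
    # Input and output file names come last.
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    example = build_ocrmypdf_command(
        "input.pdf", "output.pdf",
        skip_text=True, deskew=True, languages=["eng", "fra"], optimize=1,
    )
    # Print the command as it would be typed in a shell.
    print(shlex.join(example))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once OCRmyPDF is installed and on the PATH. Building a list rather than a single string avoids shell quoting problems with file names that contain spaces.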

Example conversion using Jupyter Notebook (Anaconda)

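import ocrmypdf  # confirms the package is installed in this environment

# The leading "!" runs ocrmypdf as a shell command from the notebook cell.
# --clean relies on the external unpaper program being installed.
!ocrmypdf --skip-text --deskew --rotate-pages --clean --optimize 0 input.pdf output.pdf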
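The same conversion can also be attempted through the library's Python API instead of a shell command. The sketch below assumes that ocrmypdf.ocr accepts keyword arguments mirroring the command-line flags (skip_text, deskew, rotate_pages, clean, optimize); check this against the installed version. The wrapper name ocr_like_notebook_example is hypothetical.

```python
import os


def ocr_like_notebook_example(input_pdf="input.pdf", output_pdf="output.pdf"):
    """Run OCR with the same options as the notebook command (sketch)."""
    # Imported here so the module loads even where ocrmypdf is not installed.
    import ocrmypdf

    # Assumption: keyword names mirror the CLI flags used in the article.
    return ocrmypdf.ocr(
        input_pdf,
        output_pdf,
        skip_text=True,      # --skip-text
        deskew=True,         # --deskew
        rotate_pages=True,   # --rotate-pages
        clean=True,          # --clean (needs unpaper)
        optimize=0,          # --optimize 0
    )


if __name__ == "__main__":
    # Only attempt the conversion when the example input file is present.
    if os.path.exists("input.pdf"):
        ocr_like_notebook_example()
```

Keeping the call inside the __main__ guard is a cautious choice when running outside a notebook, since the library may start worker processes.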

The resulting output.pdf can then be processed by any PDF-to-text library.
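As one possible illustration, the sketch below reads the text layer of output.pdf with the pypdf library. pypdf is an assumption here, since the article does not name a specific text-extraction library, and the function name pdf_to_text is hypothetical.

```python
def pdf_to_text(pdf_path):
    """Return the text of every page, separated by newlines (sketch)."""
    # Imported here so the module loads even where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty value for pages without a text layer,
    # so fall back to an empty string for those pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


if __name__ == "__main__":
    import os

    # Only read the file when the OCRed output from the earlier step exists.
    if os.path.exists("output.pdf"):
        print(pdf_to_text("output.pdf")[:500])
```

The quality of the extracted text depends on the OCR layer, so pages that scanned poorly may still yield noisy output.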

If the PDF contains tables, you can use Camelot or tabula-py to extract them from output.pdf for further processing.
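A minimal sketch of the Camelot route follows, assuming the camelot package is installed and that its read_pdf function returns table objects exposing a pandas DataFrame as .df. The function name extract_tables and the default page selection are hypothetical.

```python
def extract_tables(pdf_path, pages="1"):
    """Return each table Camelot finds as a DataFrame (sketch)."""
    # Imported here so the module loads even where camelot is not installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Index explicitly so the code relies only on len() and item access.
    return [tables[i].df for i in range(len(tables))]


if __name__ == "__main__":
    import os

    # Only run when the OCRed output from the earlier step exists.
    if os.path.exists("output.pdf"):
        for number, frame in enumerate(extract_tables("output.pdf"), start=1):
            print(f"Table {number}: {frame.shape[0]} rows x {frame.shape[1]} columns")
```

Camelot works on the text layer of the PDF, which is why the OCR step has to come first for scanned documents.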
