Extracting Text from Image/PDF

Khushboo Gehi, MSc

Doctoral Researcher at SnT, Interdisciplinary Center for Security, Reliability and Trust

Published: February 3, 2022

PDF, or Portable Document Format, has a complex structure. Because it is a binary format, the raw text is difficult to access directly and cannot easily be edited without damaging the original file. A PDF file contains an object for every element inside it, for example a font object, a form object, a page object, and so on. Objects refer to other objects, and they are organized hierarchically in a tree-like structure that allows pages to be navigated easily and efficiently.
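To make the object tree concrete, here is a minimal sketch that reads a hand-written, uncompressed fragment of PDF syntax and lists which objects each object refers to. The fragment, the function name object_graph and the regular expressions are illustrative assumptions. Real PDF files usually compress their objects and rely on a cross-reference table, so a proper PDF library is needed for real documents.

```python
import re

# Hypothetical, simplified fragment of PDF object syntax (not a complete PDF file).
FRAGMENT = b"""
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Resources << /Font << /F1 4 0 R >> >> >>
endobj
4 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
"""

OBJ_RE = re.compile(rb"(\d+) (\d+) obj(.*?)endobj", re.S)  # "N G obj ... endobj"
REF_RE = re.compile(rb"(\d+) (\d+) R")                      # indirect reference "N G R"
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")                    # the object's /Type name


def object_graph(data):
    """Return {object number: (type name, [referenced object numbers])}."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        body = match.group(3)
        type_match = TYPE_RE.search(body)
        type_name = type_match.group(1).decode() if type_match else None
        refs = [int(ref.group(1)) for ref in REF_RE.finditer(body)]
        graph[number] = (type_name, refs)
    return graph


if __name__ == "__main__":
    for number, (type_name, refs) in object_graph(FRAGMENT).items():
        print(number, type_name, "->", refs)
```

In this fragment the catalog points to the page tree, the page tree points to a page, and the page points back to its parent and on to a font, which is the hierarchy described above.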

PDF files can contain images, figures and tables. Images inside a PDF are stored in a compressed format, using filters such as DCTDecode (JPEG) or JPXDecode (JPEG 2000), together with pixel and color information. The pixel and color information is generally not used for analysis; however, the textual content contained in the tables, images and figures is valuable. The complexity of a document depends on how the file was created.
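As a rough illustration of how an embedded image is described, the sketch below reads a hand-written image dictionary and reports its size, color space and compression filter. The dictionary text, the function name describe_image and the lookup table are assumptions made for this example. In real files the /Filter entry can also be an array of several filters, which this sketch does not handle.

```python
import re

# Hypothetical image dictionary, written by hand for illustration.
IMAGE_DICT = (
    "<< /Type /XObject /Subtype /Image /Width 640 /Height 480 "
    "/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode /Length 12345 >>"
)

# Common stream filters and the compression scheme each one stands for.
FILTER_FORMATS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "FlateDecode": "zlib/deflate",
    "CCITTFaxDecode": "CCITT fax",
}


def describe_image(dictionary):
    """Extract size, color space and compression from an image dictionary string."""
    def name_value(key):
        match = re.search(rf"/{key}\s*/(\w+)", dictionary)
        return match.group(1) if match else None

    def int_value(key):
        match = re.search(rf"/{key}\s+(\d+)", dictionary)
        return int(match.group(1)) if match else None

    filter_name = name_value("Filter")
    return {
        "width": int_value("Width"),
        "height": int_value("Height"),
        "color_space": name_value("ColorSpace"),
        "filter": filter_name,
        "format": FILTER_FORMATS.get(filter_name, "unknown"),
    }


if __name__ == "__main__":
    print(describe_image(IMAGE_DICT))
```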

Processing PDF documents programmatically usually starts with converting the files into another format. PDF to HTML/XML conversion restructures the document so that it can be used directly wherever HTML or XML is expected. Where the document content is largely text, PDF to text conversion is more useful. In all three cases, images are separated from the converted documents, so any text inside them is lost. A good option is therefore to convert the PDF document to an image format and use optical character recognition (OCR) to extract the text. Here are the steps to do this with a combination of Linux command-line utilities and Python -
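For comparison, the three direct conversions mentioned above can be sketched with the pdftotext and pdftohtml tools that ship with poppler-utils. The article does not say which converter it has in mind, so the choice of these tools, the -layout option and the function names are assumptions for this example. The commands are only built here and are run only if you call convert yourself.

```python
import shutil
import subprocess


def conversion_command(target, pdf_path, out_path):
    """Build a poppler-utils command line for a text, HTML or XML conversion."""
    if target == "text":
        # -layout tries to preserve the physical layout of the page
        return ["pdftotext", "-layout", pdf_path, out_path]
    if target == "html":
        return ["pdftohtml", pdf_path, out_path]
    if target == "xml":
        return ["pdftohtml", "-xml", pdf_path, out_path]
    raise ValueError(f"unsupported target format: {target}")


def convert(target, pdf_path, out_path):
    """Run the conversion, failing early if the tool is not installed."""
    command = conversion_command(target, pdf_path, out_path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")
    subprocess.run(command, check=True)


if __name__ == "__main__":
    print(conversion_command("text", "target.pdf", "target.txt"))
```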

1. Install the required modules -

!sudo apt-get install poppler-utils
!sudo apt install tesseract-ocr
!pip install pytesseract

poppler-utils is a collection of command-line utilities for processing PDF documents, commonly used on Linux systems. Tesseract is an optical character recognition engine that runs on various operating systems. pytesseract is a Python wrapper for Tesseract.
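Before moving on, it can help to confirm that the command-line tools are actually on the PATH. The following is a small sketch using only the standard library; the helper name find_tools is an assumption for this example.

```python
import shutil


def find_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # pdftoppm comes from poppler-utils, tesseract from tesseract-ocr
    for name, path in find_tools(["pdftoppm", "tesseract"]).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```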

2. Convert the target PDF file to image format -

!pdftoppm -jpeg -r 300 target.pdf output

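This uses the pdftoppm command-line utility from poppler-utils to convert each page of the document into a separate JPEG image at 300 DPI. The files are named with the prefix output followed by the page number, and they need to be placed in the image folder that the function in step 4 reads from.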
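The same conversion can be driven from Python, and the resulting page images can be put back into page order before OCR. The sketch below assumes the images are named with the prefix, a dash and the page number, ending in .jpg. Whether the number is zero-padded is not assumed, so the pages are sorted numerically. The function names are illustrative.

```python
import re
import subprocess


def pdftoppm_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call used above: JPEG output at the given resolution."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, prefix]


def page_number(filename):
    """Return the page number encoded in a name like 'output-12.jpg', or -1."""
    match = re.search(r"-(\d+)\.jpg$", filename)
    return int(match.group(1)) if match else -1


def pages_in_order(filenames):
    """Keep only page images and sort them numerically by page number."""
    return sorted((f for f in filenames if page_number(f) >= 0), key=page_number)


def convert_pdf(pdf_path, prefix, dpi=300):
    """Run pdftoppm; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)


if __name__ == "__main__":
    print(pages_in_order(["output-10.jpg", "output-2.jpg", "output-1.jpg"]))
```

Sorting numerically matters because a plain alphabetical sort would place page 10 before page 2 when the numbers are not padded.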

3. Import python libraries -

from PIL import Image
import pytesseract as pt
import os

4. Define the function to extract text from converted images -

def image2txt():
  # specify path for image folder
  p = "/content/images"

  # specify the output text file path
  ofp = "/content/text/output.txt"

  # create the output folder if it does not exist yet
  os.makedirs(os.path.dirname(ofp), exist_ok=True)

  # iterating the images inside the folder, sorted so the pages stay in order
  for image in sorted(os.listdir(p)):
    image_path = os.path.join(p, image)
    img = Image.open(image_path)

    # this line extracts the text from image
    text = pt.image_to_string(img, lang="eng")

    # these lines open output.txt in append mode and write the extracted text to it
    with open(ofp, "a+") as f1:
      f1.write(text + "\n")

  # these lines display the retrieved text
  with open(ofp, "r") as f2:
    print(f2.read())

5. Call the function

image2txt()

The above code can be run inside a Colab notebook.
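As a possible variation on the function above, the sketch below takes the folders as parameters, skips files that are not images, sorts the pages numerically and overwrites the output file so that re-running does not duplicate text. The names images_to_text and tesseract_ocr, the suffix list and the pluggable ocr argument are assumptions for this example. The OCR step itself still relies on PIL and pytesseract, which are imported only when it runs.

```python
import re
from pathlib import Path


def tesseract_ocr(image_path, lang="eng"):
    """Default OCR step: open the image and pass it to Tesseract."""
    # Imported lazily so the rest of the sketch works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def _page_key(path):
    """Sort by the last number in the file name, then by name."""
    digits = re.findall(r"\d+", path.stem)
    return (int(digits[-1]) if digits else 0, path.name)


def images_to_text(image_dir, out_path, ocr=tesseract_ocr,
                   suffixes=(".jpg", ".jpeg", ".png")):
    """OCR every image in image_dir in page order and write the text to out_path."""
    image_dir, out_path = Path(image_dir), Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    pages = sorted(
        (p for p in image_dir.iterdir() if p.suffix.lower() in suffixes),
        key=_page_key,
    )
    # "w" overwrites earlier output instead of appending to it
    with out_path.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(ocr(page) + "\n")
    return out_path.read_text(encoding="utf-8")
```

With the paths used in the article, the call would be print(images_to_text("/content/images", "/content/text/output.txt")).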
