Extracting Text from Image/PDF

PDF (Portable Document Format) has a complex internal structure. Because a PDF is largely a binary format, its raw text is difficult to access directly, and editing the raw file by hand risks corrupting it. A PDF file contains an object for every element inside it, for example font objects, form objects, page objects, and so on. Objects refer to other objects and are arranged hierarchically in a tree-like structure, which allows pages to be navigated easily and efficiently.
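To make the idea of objects a little more concrete, here is a minimal sketch that lists the indirect object headers (such as "3 0 obj") found in raw PDF bytes. The sample bytes are a hand-written fragment, not a complete PDF, and the function name is only illustrative. Real files often pack objects into compressed streams, which a plain scan like this will not see.

```python
import re

# Matches the header of an indirect object, e.g. b"12 0 obj"
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def list_object_ids(pdf_bytes):
    """Return (object number, generation number) pairs found in raw PDF bytes."""
    return [(int(num), int(gen)) for num, gen in OBJ_HEADER.findall(pdf_bytes)]

# Hand-written fragment for illustration: a catalog, a page tree and one page.
# Each object points at another one with an "N G R" reference.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)

print(list_object_ids(sample))  # prints [(1, 0), (2, 0), (3, 0)]
```

The catalog points to the page tree, which points to the page, and the page points back to its parent. That chain of references is the tree structure described above.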

PDF files can contain images, figures and tables. Images inside a PDF are stored in a compressed format, using filters such as DCTDecode (JPEG) or JPXDecode (JPEG 2000), along with pixel and color information. The pixel and color information generally isn't used for analysis; the textual content inside the tables, images and figures, however, is valuable. The complexity of a document depends on how the file was created.
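As a rough illustration, the filter names mentioned above appear as plain names inside image object dictionaries, so they can be counted in the raw bytes. This is only a sketch on a hand-written fragment; the function name is made up for this example, and the count can miss images that sit inside compressed object streams.

```python
def count_image_filters(pdf_bytes):
    """Count occurrences of the JPEG and JPEG 2000 filter names in raw PDF bytes."""
    filter_names = (b"/DCTDecode", b"/JPXDecode")
    return {name.decode("ascii"): pdf_bytes.count(name) for name in filter_names}

# Hand-written fragment for illustration: three image objects, stream data omitted
sample = (
    b"4 0 obj\n<< /Type /XObject /Subtype /Image /Filter /DCTDecode >>\nendobj\n"
    b"5 0 obj\n<< /Type /XObject /Subtype /Image /Filter /JPXDecode >>\nendobj\n"
    b"6 0 obj\n<< /Type /XObject /Subtype /Image /Filter /DCTDecode >>\nendobj\n"
)

print(count_image_filters(sample))  # prints {'/DCTDecode': 2, '/JPXDecode': 1}
```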

Processing PDF documents involves converting the files into another format. PDF to HTML or XML conversion changes the document structure, which is useful when the content is to be deployed directly as HTML/XML. In cases where the document content is largely text, PDF to text conversion is more useful. In all three cases, images are separated from the converted documents. A good option is to convert the PDF document to image format and use optical character recognition (OCR) to extract the text. Here are the steps to do this with a combination of Linux command-line utilities and Python -

1. Install the required modules -

!sudo apt-get install poppler-utils
!sudo apt install tesseract-ocr
!pip install pytesseract

poppler-utils is a set of command-line utilities for processing PDF documents, commonly used on Linux systems. Tesseract is an optical character recognition engine that works on various operating systems. pytesseract is a Python wrapper for the Tesseract engine.
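Before moving on, it can help to confirm that the command-line tools are actually on the PATH. The helper below is a small sketch with a made-up name; it assumes the executables are called pdftoppm (from poppler-utils) and tesseract (from tesseract-ocr).

```python
import shutil

def missing_tools(names=("pdftoppm", "tesseract")):
    """Return the tool names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

print("Missing tools:", missing_tools() or "none")
```

An empty list means both tools were found.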

2. Convert the target PDF file to image format -

!pdftoppm -jpeg -r 300 target.pdf output

This uses the pdftoppm command-line utility from poppler-utils and converts each page of the document into a separate JPEG image at 300 DPI, named with the prefix output (for example output-1.jpg).
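The same conversion can also be started from Python, which makes it easy to write the images straight into a chosen folder. This is a sketch rather than a required step: the function names are made up for this example, and /content/images is assumed to be the folder that the extraction function in step 4 reads from.

```python
import os
import subprocess

def pdftoppm_command(pdf_path, out_dir, prefix="output", dpi=300):
    """Build the pdftoppm argument list; images are named <out_dir>/<prefix>-<page>.jpg."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, os.path.join(out_dir, prefix)]

def convert_pdf(pdf_path, out_dir):
    """Create the output folder if needed and run pdftoppm (requires poppler-utils)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, out_dir), check=True)

# Show the command without running it
print(pdftoppm_command("target.pdf", "/content/images"))
```

Calling convert_pdf("target.pdf", "/content/images") would then run the conversion.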

3. Import Python libraries -

from PIL import Image
import pytesseract as pt
import os

4. Define the function to extract text from converted images -

def image2txt():
  # specify path for image folder (the JPEGs from step 2 must be in here)
  p = "/content/images"

  # specify the output text file path and make sure its folder exists
  ofp = "/content/text/output.txt"
  os.makedirs(os.path.dirname(ofp), exist_ok=True)

  # iterating the images inside the folder, in filename order
  for image in sorted(os.listdir(p)):
    img_path = os.path.join(p, image)
    img = Image.open(img_path)

    # this line extracts the text from image
    text = pt.image_to_string(img, lang="eng")

    # these lines append the text content of the image to the output.txt file
    with open(ofp, "a+") as f1:
      f1.write(text + "\n")

  # these lines display the retrieved text
  with open(ofp, "r") as f2:
    print(f2.read())
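The order in which the images are read decides the order of the text in output.txt. If page order matters and the page numbers in the file names are not zero-padded, a numeric sort key is one option. The helper below is a sketch with a made-up name; it assumes the page number is the last run of digits in the file name, as in output-10.jpg.

```python
import os
import re

def page_key(filename):
    """Sort key: the last run of digits in the file name (extension ignored), or -1."""
    stem = os.path.splitext(filename)[0]
    numbers = re.findall(r"\d+", stem)
    return int(numbers[-1]) if numbers else -1

names = ["output-10.jpg", "output-2.jpg", "output-1.jpg"]
print(sorted(names, key=page_key))  # prints ['output-1.jpg', 'output-2.jpg', 'output-10.jpg']
```

Inside the function, the loop could then iterate over sorted(os.listdir(p), key=page_key).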
        

5. Call the function

image2txt()        

The above code can be run inside a Google Colab notebook; the lines starting with ! are shell commands run from a notebook cell.
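For documents that are largely text, as mentioned earlier, OCR may not be needed at all. poppler-utils also ships pdftotext, which pulls out the embedded text directly. The following is a sketch with made-up function names; the -layout flag asks pdftotext to keep the original physical layout where it can.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext argument list."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf_path, txt_path]

def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext, failing early if poppler-utils is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)

# Show the command without running it
print(pdftotext_command("target.pdf", "output.txt"))
# prints ['pdftotext', '-layout', 'target.pdf', 'output.txt']
```

This only works when the PDF has a text layer. For scanned pages the OCR route above is still needed.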
