Extracting Text from Image/PDF
Khushboo Gehi, MSc
Doctoral Researcher at SnT, Interdisciplinary Center for Security, Reliability and Trust
PDF, or Portable Document Format, has a complex internal structure. Because much of the file is stored in binary form, the raw text is difficult to access and hard to extract without altering the original file. A PDF file contains an object for every element inside it, for example a font object, a form object, a page object, and so on. Objects refer to other objects, and they are organized hierarchically in a tree-like structure, which allows pages to be navigated easily and efficiently.
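To make the tree idea concrete, here is a minimal sketch of how a reader might walk such a hierarchy to reach a given page. It is a toy model only: the nested dictionaries stand in for PDF objects, the `Type`, `Kids` and `Count` keys mirror the names used in a PDF page tree, and `Label` and `find_page` are hypothetical names introduced for illustration.

```python
# Toy model of a PDF page tree. Intermediate "Pages" nodes store a Count of
# the leaf pages below them, so a reader can skip whole subtrees.
# These dictionaries are illustrative stand-ins, not real PDF objects.

def find_page(node, index):
    """Return the leaf page at zero-based `index` beneath `node`."""
    if node["Type"] == "Page":
        if index == 0:
            return node
        raise IndexError("page index out of range")
    for kid in node["Kids"]:
        count = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < count:
            return find_page(kid, index)
        index -= count  # the page is not in this subtree, so skip it
    raise IndexError("page index out of range")


page_tree = {
    "Type": "Pages", "Count": 3,
    "Kids": [
        {"Type": "Pages", "Count": 2, "Kids": [
            {"Type": "Page", "Label": "page 1"},
            {"Type": "Page", "Label": "page 2"},
        ]},
        {"Type": "Page", "Label": "page 3"},
    ],
}

print(find_page(page_tree, 2)["Label"])  # prints: page 3
```

Because each intermediate node knows how many pages it holds, the lookup never has to visit pages it does not need.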
PDF files can contain images, figures and tables. Images inside a PDF are stored in compressed form, using filters such as DCTDecode (JPEG) or JPXDecode (JPEG 2000), together with pixel and color information. The pixel and color information generally isn't used for analysis; however, the textual content inside the tables, images and figures is valuable. The complexity of a document depends on how the file was created.
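As a rough illustration, the filter names can sometimes be spotted by scanning the raw bytes of a file. The sketch below counts the two filter names in a byte string. The `sample` bytes are a made-up fragment (not a valid PDF) and `count_image_filters` is a hypothetical helper. Note that this approach misses dictionaries stored inside compressed object streams, so a proper PDF parser is needed for reliable results.

```python
import re

# Filter names that mark compressed image streams (JPEG and JPEG 2000).
IMAGE_FILTERS = (b"DCTDecode", b"JPXDecode")


def count_image_filters(pdf_bytes):
    """Count /DCTDecode and /JPXDecode name tokens in raw PDF bytes."""
    return {
        name.decode(): len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
        for name in IMAGE_FILTERS
    }


# Made-up fragment for illustration only; a real file would be read with
# open("target.pdf", "rb").read()
sample = (
    b"1 0 obj << /Subtype /Image /Filter /DCTDecode >> endobj\n"
    b"2 0 obj << /Subtype /Image /Filter /JPXDecode >> endobj\n"
    b"3 0 obj << /Subtype /Image /Filter [/DCTDecode] >> endobj\n"
)

print(count_image_filters(sample))  # prints: {'DCTDecode': 2, 'JPXDecode': 1}
```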
Processing PDF documents means converting the files into a format that is easier to work with. PDF to HTML or XML conversion changes the document structure so that the content can be used directly as HTML/XML. In cases where the document content is largely text, PDF to text conversion is more useful. In all three cases, images are separated from the converted documents. A good option is therefore to convert the PDF document to image format and use optical character recognition (OCR) to extract the text. Here are the steps to do this with a combination of Linux command-line utilities and Python -
1. Install the required tools -
!sudo apt-get install poppler-utils
!sudo apt install tesseract-ocr
!pip install pytesseract
poppler-utils is a collection of command-line utilities for processing PDF documents, commonly used on Linux systems. Tesseract is an optical character recognition engine that works on various operating systems. pytesseract is a Python wrapper for Tesseract.
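Before moving on, it can help to confirm that the command-line tools are actually on the PATH. The following is a small sketch using only the Python standard library; `missing_tools` is a hypothetical helper name, and `pdftoppm` and `tesseract` are the executables installed by the two packages.

```python
import shutil


def missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# An empty list means both tools were found.
print(missing_tools())
```

If the list is not empty, re-running the install commands for the named tools is the first thing to try.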
2. Convert the target PDF file to image format -
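!mkdir -p /content/images
!pdftoppm -jpeg -r 300 target.pdf /content/images/output
This uses the pdftoppm command-line utility from poppler-utils to convert each page of the document into a separate JPEG image at 300 DPI. The images are written to /content/images, the folder that the function in step 4 reads from.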
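The resolution passed with -r determines how large the resulting images are. As a back-of-the-envelope sketch, assuming the default PDF unit of 72 points per inch, the pixel size of a rendered page can be estimated as follows. `page_pixels` is a hypothetical helper, and the exact rounding applied by pdftoppm may differ by a pixel.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def page_pixels(width_pt, height_pt, dpi=300):
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
print(page_pixels(612, 792))  # prints: (2550, 3300)
```

Higher resolutions generally give the OCR engine more detail to work with, at the cost of larger files and slower processing.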
3. Import Python libraries -
from PIL import Image
import pytesseract as pt
import os
4. Define the function to extract text from the converted images -
def image2txt():
  # specify the path of the image folder
  p = "/content/images"
  # specify the output text file path
  ofp = "/content/text/output.txt"
  # create the output folder if it does not exist yet
  os.makedirs(os.path.dirname(ofp), exist_ok=True)
  # iterate over the images inside the folder, sorted by file name
  for image in sorted(os.listdir(p)):
    inp = os.path.join(p, image)
    # this line extracts the text from the image
    with Image.open(inp) as img:
      text = pt.image_to_string(img, lang="eng")
    # this line appends the text content of the image to the output.txt file
    with open(ofp, "a+") as f1:
      f1.write(text + "\n")
  # this line displays the retrieved text
  with open(ofp, "r") as f2:
    print(f2.read())
5. Call the function
image2txt()
The above code can be run inside a Colab notebook.
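One practical detail: os.listdir returns every file in the folder, in no guaranteed order. If the image folder may contain other files, or if page order matters, a helper along the following lines could be used to pick the images and sort them by page number. This is a sketch under stated assumptions: `list_page_images` is a hypothetical helper, and it assumes the page number is the last number in each file name.

```python
import os
import re
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_page_images(folder):
    """Return image paths in `folder`, ordered by the last number in each name."""
    def page_number(name):
        numbers = re.findall(r"\d+", os.path.splitext(name)[0])
        return int(numbers[-1]) if numbers else 0

    names = [n for n in os.listdir(folder) if n.lower().endswith(IMAGE_EXTENSIONS)]
    names.sort(key=lambda n: (page_number(n), n))
    return [os.path.join(folder, n) for n in names]


# Small demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as demo:
    for name in ("output-10.jpg", "output-2.jpg", "output-1.jpg", "notes.txt"):
        open(os.path.join(demo, name), "w").close()
    print([os.path.basename(path) for path in list_page_images(demo)])
    # prints: ['output-1.jpg', 'output-2.jpg', 'output-10.jpg']
```

The returned paths can be passed to Image.open one by one, in place of iterating over os.listdir directly.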