How to Segment Figures and Text Regions in a Newspaper Using Layout Parser

In this article I will demonstrate how to use the Python Layout-Parser module to segment the figures and text in a document image and extract that data from it. As an example, we will work on a newspaper page, because it is a good real-life test case for using image processing to extract data from document images.

A while ago I came across this question on Stack Overflow:

How to detect figures in newspaper image?

I then started searching for a simple way to do this in Python, and luckily I came up with an efficient solution to the problem.

First of all, what is Layout Parser?

Layout Parser is a unified toolkit for deep learning based document image analysis.

It provides the following functionality:

  • layoutparser can be used to conveniently OCR documents and convert the output into structured data.
  • With the help of deep learning, layoutparser supports the analysis of very complex documents and the processing of the hierarchical structure in their layouts.

You can check the full documentation of the project on this GitHub repository.

In the next section I will explain how to install this toolkit and use it on an example newspaper image.

Newspaper Case Study

[Image: the newspaper page used as input]

This is the newspaper page that we will process with a layout-parser deep learning model in order to segment and extract its figures and text (separating the image regions from the text regions).

Toolkit Installation Guide

It's recommended to use Jupyter Notebook on Linux or macOS, because layout-parser isn't supported on Windows. There are some workarounds for setting it up on Windows, but they are complex and time-consuming, so I don't recommend trying it there. Alternatively, you can use Google Colab (or any cloud service for running Jupyter notebooks), which is what I personally used to run the toolkit directly.

First we need to install the Python packages for layout parser

pip install layoutparser # Install the base layoutparser library
pip install "layoutparser[layoutmodels]" # Install DL layout model toolkit
pip install "layoutparser[ocr]" # Install OCR toolkit

If you are using Google Colab or any cloud service for running Jupyter notebooks, you should also install the Tesseract engine that layoutparser[ocr] depends on with this command


!sudo apt install tesseract-ocr

Then we need to install the detectron2 deep learning model backend dependencies


pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"            
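
Before moving on, it can save time to confirm that everything installed correctly. The following is a minimal sketch using only the Python standard library; the helper name check_environment is my own (hypothetical), and it only checks whether the modules can be located and whether a tesseract executable is on the PATH, not whether they actually work.

```python
import importlib.util
import shutil


def check_environment(modules=("layoutparser", "cv2", "matplotlib", "detectron2")):
    """Report which packages can be located and whether tesseract is on PATH."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    # The OCR step relies on the tesseract executable, so check PATH as well
    report["tesseract (binary)"] = shutil.which("tesseract") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:20s} {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING points to the installation command that should be run again.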

How to Use Toolkit

After successfully installing the toolkit, we import the required Python packages in our code


import layoutparser as lp
import cv2
import matplotlib.pyplot as plt

Then we read the newspaper image and initialize the layout parser model


# Read the image; cv2 loads it in BGR channel order
image = cv2.imread("test.jpg")
# Reverse the channel axis to convert BGR to RGB
image = image[..., ::-1]

# Load the deep layout model from the layoutparser API
# For all the supported models, please check the Model
# Zoo Page: https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html
model = lp.models.Detectron2LayoutModel('lp://PrimaLayout/mask_rcnn_R_50_FPN_3x/config',
                                 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7],
                                 label_map={1:"TextRegion", 2:"ImageRegion", 3:"TableRegion", 4:"MathsRegion", 5:"SeparatorRegion", 6:"OtherRegion"})
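
To make the two model options more concrete, here is a plain-Python illustration of their effect: the score threshold of 0.7 discards low-confidence detections, and the label map turns numeric class ids into readable names. This is only a sketch of the idea. The (class_id, score) pair format and the keep_confident helper are assumptions made for illustration, not the real Detectron2 output type or API.

```python
LABEL_MAP = {1: "TextRegion", 2: "ImageRegion", 3: "TableRegion",
             4: "MathsRegion", 5: "SeparatorRegion", 6: "OtherRegion"}


def keep_confident(detections, score_threshold=0.7, label_map=LABEL_MAP):
    """Drop low-score detections and translate class ids to readable names.

    `detections` is a list of (class_id, score) pairs. This stand-in format
    is an assumption for illustration only.
    """
    return [
        (label_map.get(class_id, "Unknown"), score)
        for class_id, score in detections
        if score > score_threshold
    ]


# Made-up scores, only to show which detections survive the threshold
raw = [(1, 0.98), (2, 0.91), (1, 0.42), (5, 0.69), (3, 0.75)]
kept = keep_confident(raw)
print(kept)
```

Raising the threshold gives fewer but more reliable boxes, while lowering it recovers more regions at the cost of more false detections.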

After that, we detect the figure and text layout structure in the image and draw the result


# Detect the layout of the input image
layout = model.detect(image)
   
# Show the detected layout of the input image
lp.draw_box(image, layout, box_width=3)        

[Image: the detected layout drawn over the newspaper page]

In the result image we can see that it detected all the text regions and figure regions in the newspaper document.

  • Orange boxes mark text regions.
  • White boxes mark figure regions.

A very interesting result in just a few steps using the layout parser toolkit.

Now we will store the figure and text segmentation results in separate variables to organize them and show the data extracted from the newspaper image


text_blocks = lp.Layout([b for b in layout if b.type == 'TextRegion'])
figure_blocks = lp.Layout([b for b in layout if b.type == 'ImageRegion'])

Then we plot the extracted figure images using the matplotlib module
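
Before splitting the layout, it can be useful to see how many regions of each type were detected. The sketch below counts blocks by their type attribute, the same attribute used in the filters above. The Block class is a hypothetical stand-in so that the example runs without the toolkit installed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a detected layout element."""
    type: str


def count_by_type(blocks):
    """Return how many blocks of each region type were detected."""
    return Counter(block.type for block in blocks)


demo_layout = [Block("TextRegion"), Block("ImageRegion"),
               Block("TextRegion"), Block("SeparatorRegion")]
counts = count_by_type(demo_layout)
print(counts["TextRegion"], counts["ImageRegion"])
```

Since the helper only reads the type attribute, calling count_by_type(layout) on the real detection result should work in the same way.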


fig = plt.figure(figsize=(30, 15))
i = 1
for figure in figure_blocks:
    # Adding padding to each image segment can help
    # improve robustness
    segment_image = (figure
                       .pad(left=5, right=5, top=5, bottom=5)
                       .crop_image(image))
    fig.add_subplot(3, 1, i)
    plt.imshow(segment_image)
    i += 1

[Image: the figures extracted from the newspaper page]

As we can see, the figures from the newspaper were extracted successfully.
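
Note that the plotting loop hard-codes add_subplot(3, 1, i), which only fits a page with at most three figures. For other documents the grid size can be derived from the number of blocks. The grid_shape helper below is a hypothetical name and just a small sketch of that calculation.

```python
import math


def grid_shape(n_items, n_cols):
    """Return (rows, cols) large enough to hold n_items subplots."""
    if n_items < 1 or n_cols < 1:
        raise ValueError("n_items and n_cols must be positive")
    # Round up so that a partially filled last row is still counted
    return math.ceil(n_items / n_cols), n_cols


print(grid_shape(3, 1))
print(grid_shape(39, 4))
```

With this, the loop could use rows, cols = grid_shape(len(figure_blocks), 1) and then call fig.add_subplot(rows, cols, i).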

We will also plot the first 24 extracted text regions in a 6 by 4 grid (there are 39 text blocks in total)


fig = plt.figure(figsize=(20, 10))
for i in range(24):
    # Adding padding to each image segment can help
    # improve robustness
    segment_image = (text_blocks[i]
                       .pad(left=5, right=5, top=5, bottom=5)
                       .crop_image(image))
    # Subplot positions are 1-based
    fig.add_subplot(6, 4, i + 1)
    plt.imshow(segment_image)

[Image: the first extracted text regions]

The result shows the first 24 extracted text regions. It's an amazing and very useful deep learning toolkit.
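
The pad(left=5, right=5, top=5, bottom=5) call grows each box by 5 pixels on every side before cropping, so that characters touching the box border are not cut off. The coordinate arithmetic can be sketched in plain Python as follows. The padded_box helper is hypothetical, and the clamping to the image bounds is my assumption about what a safe crop should do, not a description of the library internals.

```python
def padded_box(x1, y1, x2, y2, pad, image_width, image_height):
    """Grow a box by `pad` pixels per side, clamped to the image bounds."""
    return (
        max(0, x1 - pad),             # left edge cannot go below 0
        max(0, y1 - pad),             # top edge cannot go below 0
        min(image_width, x2 + pad),   # right edge cannot exceed the width
        min(image_height, y2 + pad),  # bottom edge cannot exceed the height
    )


# A box near the left and right borders of a 102 x 400 image
print(padded_box(2, 10, 100, 200, pad=5, image_width=102, image_height=400))
```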

After that we can use any OCR model to recognize the text in each region and convert it to actual text data.

There are different models available for optical character recognition, for example the text-recognition-resnet-fc OpenVINO model, which is available in the Open Model Zoo.

Another easy-to-use option is the Tesseract OCR Python module, which we will use to recognize the text regions of the newspaper.

In the installation guide we already installed the Python package required for running Tesseract OCR.

First we need to initialize the Tesseract OCR engine


# Initialize the tesseract ocr engine. You might need
# to install the OCR components in layoutparser:
# pip install layoutparser[ocr]
ocr_agent = lp.TesseractAgent(languages='eng')

Then we recognize text in each text block that we segmented using layout parser


for block in text_blocks:
    # Adding padding to each image segment can help
    # improve robustness
    segment_image = (block
                       .pad(left=5, right=5, top=5, bottom=5)
                       .crop_image(image))

    text = ocr_agent.detect(segment_image)
    block.set(text=text, inplace=True)

After that, we print the first 10 recognized texts from the newspaper
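
Once every block carries its recognized text, the results can be saved for later use. The sketch below assumes the data has first been copied into plain dictionaries. The record layout (type, box, text), the blocks_to_json helper, and the coordinates are hypothetical placeholders, not values from the real run.

```python
import json


def blocks_to_json(records):
    """Serialize a list of {"type", "box", "text"} dicts to a JSON string."""
    # ensure_ascii=False keeps characters such as curly quotes readable
    return json.dumps(records, ensure_ascii=False, indent=2)


# Placeholder record: the box coordinates are made up for illustration
records = [
    {"type": "TextRegion",
     "box": [40, 120, 310, 180],
     "text": "U.S. Watching\nWhile 2 Allies\nClash in Syria"},
]
payload = blocks_to_json(records)
print(payload)
```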


for i in range(10):
    print(text_blocks.get_texts()[i], end='\n---\n')

Output Text


--
“All the News
‘That's Fit to Print”

---
 

NEW YORK, TUESDAY, JANUARY 23, 2018

---
VOL.CLXVII... No. 57,851

 

---
U.S. Watching
While 2 Allies
Clash in Syria

---
Turks’ Attack on Kurds
Upsets ISIS Fight

---
By MARK LANDLER
‘and CARLOTTA GALL

---
WASHINGTON — When Presi-
ent Trump met with Turkey’,
President, Recep Tayyip Erdogan,
tthe Unlied Nations last Septem:
ber ne embraced him asa friend
and declared, “We're as close as
"We've ever been” Five months a
Turkey s waging an all-out as
‘aultagainst Syrian Kurds, mer
fas closest allies in the war
_againat the Islamic State.

---
_ The Turkish offensive, carried

out over the protestsof the United
States but withthe apparent 3s
Sent of Russia, marks a perilous
new phase in'elations between
tivo NATO alles — bringing thei
Interests nto direct conflict on the
Datleied Tt ays bare how much
leverage the United States has
lost in Syria, where its single
minded focus has been on van
quishing Islamist milants.

---
As Turkish troops advanced
Monday on the Kurdish town of
Alri, northsvest_Syria, the
White House warned Turkey not
to take its eye off the campaign
‘against the Islamic State. But it
‘Stopped short of rebuking Tires,
and” acknowledged ts. security
‘concems about the Kurds, whom
“Turkey considers terrorists and a
threat to. is territorial sover-
‘eigaty.

 

----        

Finally, we can see that the output corresponds to the text boxes of the newspaper image, although the OCR introduces some character errors (for example "Unlied Nations" instead of "United Nations"). We successfully detected the text regions and extracted their text.

Thanks for reading this article on how to use layout parser to segment and extract figures and text from a document image. For the complete notebook of this tutorial, check this live link on Google Colab.
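
The raw OCR output keeps the line breaks of the narrow newspaper columns, including words split by a hyphen at the end of a line (such as "Presi-" in the output above). A small post-processing step can rejoin them. This is a minimal sketch using only the standard library. The clean_ocr_text helper is hypothetical, and it will also merge genuine hyphenated compounds that happen to break at a line end, so treat it as a starting point.

```python
import re


def clean_ocr_text(text):
    """Join words split by end-of-line hyphens and collapse whitespace."""
    # "Presi-\ndent" -> "President": drop a hyphen that ends a line
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Replace remaining newlines and runs of spaces with a single space
    return re.sub(r"\s+", " ", text).strip()


sample = "WASHINGTON \u2014 When Presi-\ndent Trump met with Turkey's\nPresident"
print(clean_ocr_text(sample))
```

It could be applied to each string returned by text_blocks.get_texts() before printing or saving.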

Also don't forget to share it and follow me on LinkedIn for more interesting technology articles :)

Maksym Stetsenko

Ph.D. (Eng. Sc.); Data Scientist/Machine Learning Engineer; Chief Engineer

9 months ago

Thank you. Perhaps, a first real example of layout parser usage

More articles by Mohammad Oghli
