How to Segment Figures and Text Region in Newspaper using Layout Parser

Mohammad Oghli

Software Lead @ Archireef | Data Solutions | MLOps | Tech Author

Published: November 25, 2022

In this detailed article I will demonstrate how to use the Python Layout Parser module to segment the figures and text in a document image and extract them. As an example we will work on a newspaper image, because it is a good real-life test case for using image processing to extract data from scanned documents.

A while ago I came across this question on Stack Overflow:

How to detect figures in newspaper image?

I then started searching for a simple way to do this in Python and, luckily, I found an efficient solution to the problem.

First of all, what is Layout Parser?

Layout Parser is a unified toolkit for deep learning based document image analysis.

It provides the following functionality:

layoutparser can be used to conveniently OCR documents and convert the output into structured data.
With the help of deep learning, layoutparser supports the analysis of very complex documents and the processing of the hierarchical structure in their layouts.

You can find the full documentation of the project in its GitHub repository.

In the next section I will explain how to install this toolkit and use it on an example newspaper image.

Newspaper Case Study

This is the newspaper image we will process with a Layout Parser deep learning model in order to segment and extract its figures and text (that is, to separate the image regions from the text regions of the page).

Toolkit Installation Guide

I recommend using Jupyter Notebook on Linux or macOS, because Layout Parser isn't supported on Windows. There are some workarounds for setting it up on Windows, but they are complex and require many time-consuming steps, so I don't recommend trying it there. Alternatively, you can use Google Colab (or any cloud service that runs Jupyter notebooks), which is what I personally used to run the toolkit directly.

First we need to install the Python packages for Layout Parser:

pip install layoutparser # Install the base layoutparser library
pip install "layoutparser[layoutmodels]" # Install DL layout model toolkit
pip install "layoutparser[ocr]" # Install OCR toolkit

If you are using Google Colab or another cloud service for running Jupyter notebooks, you also need to install the Tesseract engine that the layoutparser[ocr] dependency relies on, with this command:


!sudo apt install tesseract-ocr

Then we need to install the dependencies of the detectron2 deep learning model backend:


pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"
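Before moving on, it can help to confirm that the installation steps above actually produced importable modules. The following is only a minimal sketch: the list of import names is my own assumption about what this tutorial ends up using (for example, `cv2` is the import name of OpenCV and `pytesseract` is the wrapper I expect the OCR extra to pull in), so adjust it to your environment.

```python
import importlib.util

# Import names (not pip package names) this tutorial is assumed to rely on.
REQUIRED_MODULES = ["layoutparser", "cv2", "matplotlib", "torch", "detectron2", "pytesseract"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat modules in an inconsistent state as missing.
            found = False
        if not found:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, rerun the matching install command before continuing.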

How to Use Toolkit

After successfully installing the toolkit, we import the required Python packages in our code:


import layoutparser as lp

import cv2

import matplotlib.pyplot as plt

Next we read the newspaper image and initialize the Layout Parser model:


# Load the image; cv2 reads it in BGR channel order
image = cv2.imread("test.jpg")
# Reverse the channel axis to convert BGR to RGB
image = image[..., ::-1]

# Load the deep layout model from the layoutparser API.
# For all the supported models, please check the Model
# Zoo page: https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html
model = lp.models.Detectron2LayoutModel('lp://PrimaLayout/mask_rcnn_R_50_FPN_3x/config',
                                 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7],
                                 label_map={1:"TextRegion", 2:"ImageRegion", 3:"TableRegion", 4:"MathsRegion", 5:"SeparatorRegion", 6:"OtherRegion"})
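The extra_config argument above is a flat list rather than a dictionary. As far as I understand it, the list is read as alternating key/value pairs, so further overrides are simply appended to the same list. The sketch below only illustrates that pairing in plain Python; `pair_overrides` is a hypothetical helper of mine, not part of layoutparser, and the second override is included purely as an illustration.

```python
def pair_overrides(flat):
    """Turn ["KEY1", v1, "KEY2", v2, ...] into {"KEY1": v1, "KEY2": v2}."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating items: key, value, key, value, ...")
    return dict(zip(flat[0::2], flat[1::2]))


extra_config = [
    "MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7,  # value used in this article
    "MODEL.ROI_HEADS.NMS_THRESH_TEST", 0.5,    # illustrative second override
]
print(pair_overrides(extra_config))
```

Here 0.7 is the minimum confidence score a detection needs in order to be kept, so raising it gives fewer but more certain regions.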

After that we detect the layout of figures and text in the image and draw the result:


# Detect the layout of the input image
layout = model.detect(image)
   
# Show the detected layout of the input image
lp.draw_box(image, layout, box_width=3)

In the resulting image we can see that the model detected all the text regions and figure regions in the newspaper document:

Orange boxes mark text regions.
White boxes mark figure regions.

That is a very interesting result for so few steps with the Layout Parser toolkit.

Now we will store the text and figure segmentation results in separate variables, to keep them organized and to show the data extracted from the newspaper image:


text_blocks = lp.Layout([b for b in layout if b.type=='TextRegion'])

figure_blocks = lp.Layout([b for b in layout if b.type=='ImageRegion'])
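Filtering on the type attribute, as done above, works for any label in the label map. To get a quick overview of what the model found before filtering, the region types can be counted. This is a minimal sketch that uses a stand-in `Block` tuple with made-up sample values so it runs without the model; with real results, `count_by_type(layout)` should work the same way, since only the `type` attribute is read.

```python
from collections import Counter, namedtuple

# Stand-in for a detected block; only the `type` attribute is used below.
Block = namedtuple("Block", ["type", "score"])


def count_by_type(blocks):
    """Count how many detected regions there are of each type."""
    return Counter(b.type for b in blocks)


# Made-up sample detections, for illustration only.
sample = [
    Block("TextRegion", 0.98),
    Block("ImageRegion", 0.91),
    Block("TextRegion", 0.85),
    Block("SeparatorRegion", 0.74),
]
counts = count_by_type(sample)
print(counts)
```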

Then we plot the extracted figure images using matplotlib:


fig = plt.figure(figsize=(30, 15))
# One subplot row per detected figure
for i, figure in enumerate(figure_blocks, start=1):
    # Padding each image segment by a few pixels can help
    # improve robustness
    segment_image = (figure
                       .pad(left=5, right=5, top=5, bottom=5)
                       .crop_image(image))
    fig.add_subplot(len(figure_blocks), 1, i)
    plt.imshow(segment_image)
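To make the padding step above more concrete, here is a minimal sketch of the idea in plain Python: grow the box by a few pixels on every side, then clamp it so it never leaves the image. `padded_crop_bounds` is a hypothetical helper written for illustration and is not how layoutparser implements pad or crop_image internally.

```python
def padded_crop_bounds(box, pad, image_width, image_height):
    """Grow (x1, y1, x2, y2) by `pad` pixels per side, clamped to the image."""
    x1, y1, x2, y2 = box
    left = max(0, int(x1) - pad)
    top = max(0, int(y1) - pad)
    right = min(image_width, int(x2) + pad)
    bottom = min(image_height, int(y2) + pad)
    return left, top, right, bottom


# A box close to the left and right edges of a 102 x 200 pixel image.
print(padded_crop_bounds((2, 10, 100, 50), pad=5, image_width=102, image_height=200))
```

With a NumPy image array, the crop would then be `image[top:bottom, left:right]`.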

As we can see, the figures were extracted from the newspaper successfully.

We will also plot the first 24 extracted text regions, which fill a 6 x 4 grid (there are 39 text blocks in total):



fig = plt.figure(figsize=(20, 10))
# A 6 x 4 grid holds 24 text regions
for i in range(24):
    # Padding each image segment by a few pixels can help
    # improve robustness
    segment_image = (text_blocks[i]
                       .pad(left=5, right=5, top=5, bottom=5)
                       .crop_image(image))
    # Subplot positions are 1-based
    fig.add_subplot(6, 4, i + 1)
    plt.imshow(segment_image)
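The grid size in the plotting code above is fixed, so it only shows part of the 39 text blocks. If you want to plot all of them, the number of rows can be derived from the number of blocks. This is a small sketch; `grid_shape` is a hypothetical helper and the default of 4 columns is my own choice.

```python
import math


def grid_shape(n_items, n_cols=4):
    """Return (rows, cols) of the smallest grid with n_cols columns that fits n_items."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    n_rows = math.ceil(n_items / n_cols)
    return n_rows, n_cols


rows, cols = grid_shape(39)
print(rows, cols)
```

The result can be passed as `fig.add_subplot(rows, cols, position)`, where position counts from 1.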

The result shows the extracted text regions plotted in a grid. It's an impressive and very useful deep learning toolkit.

After that we can use any machine learning OCR model to recognize the text in each region and convert it to actual text data.

There are different models available for optical character recognition; one example is the text-recognition-resnet-fc OpenVINO model, which is available in the Open Model Zoo.

Another easy-to-use option is the Tesseract OCR Python module, which we will use to recognize the text regions of the newspaper.

In the installation guide we already installed the Python package required for running Tesseract OCR.

First we need to initialize the Tesseract OCR engine:


# Initialize the tesseract ocr engine. You might need
# to install the OCR components in layoutparser:
# pip install layoutparser[ocr]
ocr_agent = lp.TesseractAgent(languages='eng')

Then we recognize the text in each text block that we segmented with Layout Parser:


for block in text_blocks:
    # Padding each image segment by a few pixels can help
    # improve robustness
    segment_image = (block
                       .pad(left=5, right=5, top=5, bottom=5)
                       .crop_image(image))

    # Run OCR on the cropped segment and store the text on the block
    text = ocr_agent.detect(segment_image)
    block.set(text=text, inplace=True)

After that we print the first 10 recognized texts from the newspaper:


for i in range(10):
    print(text_blocks.get_texts()[i], end='\n---\n')

Output Text


--
“All the News
‘That's Fit to Print”

---
 

NEW YORK, TUESDAY, JANUARY 23, 2018

---
VOL.CLXVII... No. 57,851

 

---
U.S. Watching
While 2 Allies
Clash in Syria

---
Turks’ Attack on Kurds
Upsets ISIS Fight

---
By MARK LANDLER
‘and CARLOTTA GALL

---
WASHINGTON — When Presi-
ent Trump met with Turkey’,
President, Recep Tayyip Erdogan,
tthe Unlied Nations last Septem:
ber ne embraced him asa friend
and declared, “We're as close as
"We've ever been” Five months a
Turkey s waging an all-out as
‘aultagainst Syrian Kurds, mer
fas closest allies in the war
_againat the Islamic State.

---
_ The Turkish offensive, carried

out over the protestsof the United
States but withthe apparent 3s
Sent of Russia, marks a perilous
new phase in'elations between
tivo NATO alles — bringing thei
Interests nto direct conflict on the
Datleied Tt ays bare how much
leverage the United States has
lost in Syria, where its single
minded focus has been on van
quishing Islamist milants.

---
As Turkish troops advanced
Monday on the Kurdish town of
Alri, northsvest_Syria, the
White House warned Turkey not
to take its eye off the campaign
‘against the Islamic State. But it
‘Stopped short of rebuking Tires,
and” acknowledged ts. security
‘concems about the Kurds, whom
“Turkey considers terrorists and a
threat to. is territorial sover-
‘eigaty.

 

----

Finally, we can see that the output corresponds to the text boxes of the newspaper image: we successfully detected the text regions and extracted their content.
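The OCR output above keeps the newspaper's narrow-column line breaks and end-of-line hyphenation. If you need flowing text, a light cleanup step can help. The following is only a sketch under my own assumptions: `clean_ocr_text` is a hypothetical helper, the sample string is made up, and the sketch does not try to correct misrecognized characters.

```python
import re


def clean_ocr_text(text):
    """Join end-of-line hyphenation and collapse whitespace in OCR output."""
    # Join words split by a hyphen at a line end: "sover-\neignty" -> "sovereignty".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse the remaining line breaks and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "threat to its territorial sover-\neignty.\n\n"
print(clean_ocr_text(raw))
```

Note that this also merges genuine compounds that happen to break at a line end (for example "single-" followed by "minded" becomes "singleminded"), so treat it as a starting point rather than a complete solution.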

Thanks for reading this article on how to use Layout Parser to segment and extract figures and text from a document image. For the complete notebook of this tutorial, check this live link on Google Colab.

Also don't forget to share it and follow me on LinkedIn for more interesting technology articles :)
