Extract Tables From PDFs With tabula-py

It is not uncommon for us to need to extract tabular data from a PDF. For small PDFs with minimal data, it's fairly straightforward to extract the data manually by simply copying and pasting the data you need. However, when the tabular data is quite big and the rows are quite long in the PDF, or the numerical values come with double precision, or you have to extract tables from a directory instead of a single file, it is definitely not a job to be done manually.

The PDF format has no internal representation of a table structure (unlike html or other markup languages where tables and their internal structure are well defined), which makes it difficult to extract tables for analysis. That is, the data inside a PDF is designed for human consumptions (reading), not for computers (analysis). Sadly, a lot of open data is stored in PDFs, which was not designed for tabular data in the first place! The best advice about parsing PDFs is don't do it unless you have to. When have to scrape the spreadsheet data in a PDF, the best advice is: don’t do it. According to my own experience: Do not hesitate to call the authors. In most of time, they are very friendly, and you may be lucky enough to get the raw data directly.

While there are a number of different python packages for extracting tabular information from pdf documents, such as pdftables and Camelot, only tabula-py basically satisfied our requirements.

In this tutorial, we will not go through all ways of scraping tables from PDFs with Python. We will only give a shot to tabula-py, as I successfully used it to scrape tables from PDFs, recently. tabula-py is a simple Python wrapper of tabula-java and it enables you to extract table into DataFrame or JSON with Python. With it, you also can extract tables from PDF into CSV, TSV or JSON file. Note, this options will only work for PDFs that are typed – not scanned-in images. (If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based) .

We also clean the data to make it more readable as the extracted result failed to completely reflect the structure of the original table. The demonstration table inside our PDF looks like

No alt text provided for this image
import tabula

import warnings
warnings.filterwarnings('ignore')

Extract the tables on the specific page

Only the first table on Page 6 is required.

infile  = "data/demo_pdf_with_complicated_table_format.pdf" 
df_data = tabula.read_pdf(infile, 
                          pages = "6",
                          multiple_tables = False, 
                          #pandas_options={'skiprows':1}
                          #pandas_options={'header': [0,1]}
                         )[0]
df_data.head()
No alt text provided for this image

You can see that tabula-py perform pretty good and basically extract the data we required. We just need do some clean work. First of all, the header is not that correct and some get into the data space. Let's use the pandas_options to skip the first row.

df_data = tabula.read_pdf(infile, 
                          pages = "6",
                          multiple_tables = False, 
                          pandas_options={'skiprows':1}
                         )[0]
df_data.head()
No alt text provided for this image

Much better! The data all get into right place. Next, let's get rid of NaN columns.

df_data.dropna(axis=1, how='any', inplace=True)
df_data.head()
No alt text provided for this image

Much much better! However, some columns tangle together. let's split them into new columns.

# new data frame with split value columns 
new = df_data["2X1 (%) X2 (%)"].str.split(" ", n = 1, expand = True) 
df_data["X01"]= new[0]   
df_data["X02"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["2X1 (%) X2 (%)"], inplace = True) 


# new data frame with split value columns 
new = df_data["X6 (%) X7 (%)"].str.split(" ", n = 1, expand = True)   
df_data["X06"]= new[0]   
df_data["X07"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["X6 (%) X7 (%)"], inplace = True) 

# new data frame with split value columns 
new = df_data["X11 (%) X12 (–)"].str.split(" ", n = 1, expand = True)   
df_data["X11"]= new[0]   
df_data["X12"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["X11 (%) X12 (–)"], inplace = True) 

# new data frame with split value columns 
new = df_data["X13 (–) X14 (–)"].str.split(" ", n = 1, expand = True)   
df_data["X13"]= new[0]   
df_data["X14"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["X13 (–) X14 (–)"], inplace = True) 

df_data.head()
No alt text provided for this image

Much much much better! Finally, let's make the column names more tidy.

df_data = df_data.rename({'Unnamed: 0':'Province', 'X3 (%)':'X03', 'X4 (km/km':'X04', 'X5 (%)':'X05', 
                          'X8 (–)':'X08', 'X9 (year)':'X09', 'X10 (USD)':'X10', 'X15 (–)':'X15'}, 
                         axis=1).set_index('Province').sort_index(axis=1).astype('float32')

df_data.head()
No alt text provided for this image

Now it looks totally like the original table and we can do further analysis. For example,

df_data.describe()

or have a visualization.

df_data.X15.plot(kind='bar', figsize=(15,7), title='X15')
No alt text provided for this image

If you are interested in this topic, it is better to also try other libraries.

References

https://pandas.pydata.org/

https://matplotlib.org/?

https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

https://medium.com/@marizu_makozi/extracting-tables-in-pdf-using-python-d520b6d8a66

https://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/

https://www.binpress.com/manipulate-pdf-python/

https://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/

https://dzone.com/articles/announcing-camelot-a-python-library-to-extract-tab


Tanvi Jain

Penultimate Year Student at NTU | Robotics Intern at Hyundai | Curation at TEDxSingapore

1 年

Hi, wonderful article! I was wondering, could this be used to extract a similarly lineless table from a pdf document if the exact column/row names and the page of the pdf on which the table is located aren't known?

回复
Rahul Krishna

Director: YARS | Consultant & Strategist | AI & Climate Sustainability | AI Model Training for various usecases, Automation Specialist | LLM Fine-tuning | Green and Nature Based Credits| Alumni: IIM Lucknow,IIT Kharagpur

2 年

what if columns data is getting cut

回复
Ayushi Garg

Data Engineer @Sikka.ai | MSCS Grad, CSU East Bay | Skilled in Scalable Data Pipelines & Software Development | Actively Seeking Full-time Software Engineer & Data Engineer Roles

2 年

Very very very helpful. Thanks a lot

回复
Gathenya James

Senior Systems Developer at Central Bank of Kenya

2 年

Very helpful. Thanks.

回复
Ali Zhadigerov

Data Scientist/ML Engineer

3 年

This solution is ok, but renaming the column names manually is odd. What If all your tables have different column names and different structure?

要查看或添加评论,请登录

社区洞察

其他会员也浏览了