登录查看更多内容

Extract Tables From PDFs With tabula-py

Chonghua Yin

Climate Scientist | Data Scientist | Developer

发布日期: 2020年9月15日

It is not uncommon for us to need to extract tabular data from a PDF. For small PDFs with minimal data, it's fairly straightforward to extract the data manually by simply copying and pasting the data you need. However, when the tabular data is quite big and the rows are quite long in the PDF, or the numerical values come with double precision, or you have to extract tables from a directory instead of a single file, it is definitely not a job to be done manually.

The PDF format has no internal representation of a table structure (unlike html or other markup languages where tables and their internal structure are well defined), which makes it difficult to extract tables for analysis. That is, the data inside a PDF is designed for human consumptions (reading), not for computers (analysis). Sadly, a lot of open data is stored in PDFs, which was not designed for tabular data in the first place! The best advice about parsing PDFs is don't do it unless you have to. When have to scrape the spreadsheet data in a PDF, the best advice is: don’t do it. According to my own experience: Do not hesitate to call the authors. In most of time, they are very friendly, and you may be lucky enough to get the raw data directly.

While there are a number of different python packages for extracting tabular information from pdf documents, such as pdftables and Camelot, only tabula-py basically satisfied our requirements.

In this tutorial, we will not go through all ways of scraping tables from PDFs with Python. We will only give a shot to tabula-py, as I successfully used it to scrape tables from PDFs, recently. tabula-py is a simple Python wrapper of tabula-java and it enables you to extract table into DataFrame or JSON with Python. With it, you also can extract tables from PDF into CSV, TSV or JSON file. Note, this options will only work for PDFs that are typed – not scanned-in images. (If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based) .

We also clean the data to make it more readable as the extracted result failed to completely reflect the structure of the original table. The demonstration table inside our PDF looks like

import tabula

import warnings
warnings.filterwarnings('ignore')

Extract the tables on the specific page

Only the first table on Page 6 is required.

infile  = "data/demo_pdf_with_complicated_table_format.pdf" 
df_data = tabula.read_pdf(infile, 
                          pages = "6",
                          multiple_tables = False, 
                          #pandas_options={'skiprows':1}
                          #pandas_options={'header': [0,1]}
                         )[0]
df_data.head()

You can see that tabula-py perform pretty good and basically extract the data we required. We just need do some clean work. First of all, the header is not that correct and some get into the data space. Let's use the pandas_options to skip the first row.

df_data = tabula.read_pdf(infile, 
                          pages = "6",
                          multiple_tables = False, 
                          pandas_options={'skiprows':1}
                         )[0]
df_data.head()

Much better! The data all get into right place. Next, let's get rid of NaN columns.

df_data.dropna(axis=1, how='any', inplace=True)
df_data.head()

Much much better! However, some columns tangle together. let's split them into new columns.

# new data frame with split value columns 
new = df_data["2X1 (%) X2 (%)"].str.split(" ", n = 1, expand = True) 
df_data["X01"]= new[0]   
df_data["X02"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["2X1 (%) X2 (%)"], inplace = True) 


# new data frame with split value columns 
new = df_data["X6 (%) X7 (%)"].str.split(" ", n = 1, expand = True)   
df_data["X06"]= new[0]   
df_data["X07"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["X6 (%) X7 (%)"], inplace = True) 

# new data frame with split value columns 
new = df_data["X11 (%) X12 (–)"].str.split(" ", n = 1, expand = True)   
df_data["X11"]= new[0]   
df_data["X12"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["X11 (%) X12 (–)"], inplace = True) 

# new data frame with split value columns 
new = df_data["X13 (–) X14 (–)"].str.split(" ", n = 1, expand = True)   
df_data["X13"]= new[0]   
df_data["X14"]= new[1]   
# Dropping old Name columns 
df_data.drop(columns =["X13 (–) X14 (–)"], inplace = True) 

df_data.head()

Much much much better! Finally, let's make the column names more tidy.

df_data = df_data.rename({'Unnamed: 0':'Province', 'X3 (%)':'X03', 'X4 (km/km':'X04', 'X5 (%)':'X05', 
                          'X8 (–)':'X08', 'X9 (year)':'X09', 'X10 (USD)':'X10', 'X15 (–)':'X15'}, 
                         axis=1).set_index('Province').sort_index(axis=1).astype('float32')

df_data.head()

Now it looks totally like the original table and we can do further analysis. For example,

df_data.describe()

or have a visualization.

df_data.X15.plot(kind='bar', figsize=(15,7), title='X15')

If you are interested in this topic, it is better to also try other libraries.

References

https://pandas.pydata.org/

https://matplotlib.org/?

https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

https://medium.com/@marizu_makozi/extracting-tables-in-pdf-using-python-d520b6d8a66

https://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/

https://www.binpress.com/manipulate-pdf-python/

https://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/

https://dzone.com/articles/announcing-camelot-a-python-library-to-extract-tab

Tanvi Jain

Penultimate Year Student at NTU | Robotics Intern at Hyundai | Curation at TEDxSingapore

1 年

Hi, wonderful article! I was wondering, could this be used to extract a similarly lineless table from a pdf document if the exact column/row names and the page of the pdf on which the table is located aren't known?

Rahul Krishna

2 年

what if columns data is getting cut

Ayushi Garg

Data Engineer @Sikka.ai | MSCS Grad, CSU East Bay | Skilled in Scalable Data Pipelines & Software Development | Actively Seeking Full-time Software Engineer & Data Engineer Roles

2 年

Very very very helpful. Thanks a lot

Gathenya James

Senior Systems Developer at Central Bank of Kenya

2 年

Very helpful. Thanks.

Ali Zhadigerov

Data Scientist/ML Engineer

3 年

This solution is ok, but renaming the column names manually is odd. What If all your tables have different column names and different structure?

1 次回应

查看更多评论

要查看或添加评论，请登录

查看全部

Extract Tables From PDFs With tabula-py

Chonghua Yin

Climate Scientist | Data Scientist | Developer

Extract the tables on the specific page

References

更多精彩文章

社区洞察

其他会员也浏览了

Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018 -Trends and Analysis

Analyzing Excel Sales Data with Python Pandas and Seaborn - Part I

Analyzing Excel Sales Data with Python Pandas and Seaborn - Part II

Mastering Data Visualization with Matplotlib: A Comprehensive Guide to Creating Powerful Plots and Charts

Cleaning Data with Pandas

Exploratory Data Analysis Using D-Tale Library

Introduction to Pandas

Integrating PyCaret time-series module into Power BI

SQL Challenge: Weekly Orders Report

Boost Your Data Analysis with These 30 Essential Pandas Tricks!

Extract the tables on the specific page

References

Direct Access to NetCDF Files in TAR Archives

2024年8月30日

Merge Overlapping Rasters Using R and Terra

2024年3月22日

Merge Overlapping Rasters Using python and rioxarray

2024年3月8日

Merge Overlapping Rasters Using Python and GDAL VRT Pixel Functions

2024年1月29日

Spatial Stats of Raster Upon Polygons - Zonal Operations

2023年12月20日

An Introduction To Data Analysis Workflow

2023年12月16日

Revisit Degree Days

2023年11月18日

Efficient Point Data Extraction from Zarr Datasets with FastAPI, Dask, and Xarray

2023年11月4日

Developing Domain Knowledge: Your Own Business

2023年10月20日

Domain Knowledge: A Distinctive Necessity for Data Scientists

2023年10月14日

社区洞察

其他会员也浏览了

Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018 -Trends and Analysis

Analyzing Excel Sales Data with Python Pandas and Seaborn - Part I

Analyzing Excel Sales Data with Python Pandas and Seaborn - Part II

Mastering Data Visualization with Matplotlib: A Comprehensive Guide to Creating Powerful Plots and Charts

Cleaning Data with Pandas

Exploratory Data Analysis Using D-Tale Library

Introduction to Pandas

Integrating PyCaret time-series module into Power BI

SQL Challenge: Weekly Orders Report

Boost Your Data Analysis with These 30 Essential Pandas Tricks!