Using Artificial Intelligence to Parse PDFs from performance.gov
Isabel Metzger
Senior Data Scientist & Lead AI Engineer in US Government | AI/ML - Natural Language Processing & Computational Linguistics
Yes, models can be trained to detect things such as sections, tables, etc. Luckily, AllenAI has an open-source model just for that.
This mini-tutorial turns PDFs and PowerPoint slides into JSON data using artificial intelligence and natural language processing, and then applies some other cool stuff on top :)
Requirements: Science Parse from AI2 (to turn the PDFs into JSON) and allennlp with allennlp-models (for the pre-trained reading comprehension and NER models used below).
As the President's Management Agenda (PMA) Launch approaches, I decided to finally write this tutorial to parse some of the various PDFs I've downloaded from PMA.
So what can AI and natural language processing do with PDFs? Let's explore~~
The model we'll use is called Science Parse, from AI2. Note: it is NOT optimized for documents other than academic papers, such as dissertations, reports, slides, etc. However, it still does pretty okay on these data strategy PDFs, and I think it illustrates what you can do with models trained to parse PDFs. Future project for y'all: train a model like Science Parse on PDF reports and slides~
Science Parse normally parses the following: the title, authors, abstract, year, sections (each with a heading and body text), and bibliography entries.
Using the command-line interface, I run the pre-trained model on the four PDFs to produce JSON representations. Basically, once you have a folder with the PDFs and a second folder where you want your JSON outputs to go, you will run the command:
RunSP -o <directory> -f <output directory>
The model was not able to parse out sections from any of the first three PDFs, but it did parse out 23 sections, with headers, from Federal-Data-Strategy-Action-Plan; selected sections are shown in the image below:
From Sharing_Quality_Services, Science Parse was able to parse the title 'Sharing Quality Services: Improving Efficiency and Effectiveness of Mission Support Services Across Government', the year "2020", and a list of authors, but unfortunately it couldn't sectionize the PDF the way it did the federal data strategy output.
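If you want to poke at one of these JSON outputs yourself, here is a minimal sketch in Python; the file path is hypothetical, and the field names (title, year, sections with heading/text) are assumptions based on what Science Parse returned above, so adjust to your own output:
import json

# hypothetical path to one of the Science Parse JSON outputs
with open("json_outputs/Federal-Data-Strategy-Action-Plan.pdf.json") as f:
    doc = json.load(f)

# assumed top-level fields, matching what Science Parse extracted above
print(doc.get("title"))
print(doc.get("year"))
for section in doc.get("sections") or []:
    # each section is assumed to carry a heading plus its body text
    print(section.get("heading"), "->", (section.get("text") or "")[:80])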
Now that we've got our PDFs into JSON format, what other natural language processing models should we experiment with?
Let's go with the AllenNLP models because they are extremely efficient and easy to use~ You can absolutely use your own reading comprehension and NER models, but I always suggest starting with pre-trained deep learning models to save on computational resources.
Above is an example that illustrates how tables are often parsed with this type of paper/PDF data. That said, it is still a sufficient way to transform a PDF into text data, although some post-processing and cleaning might be needed to parse it out perfectly. What do you all think?
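If you want to try that post-processing yourself, here is a tiny, hypothetical cleanup pass: it just re-joins words the PDF layout hyphenated across line breaks and collapses the leftover whitespace from table and column layouts:
import re

def clean_extracted_text(text):
    # re-join words hyphenated across a line break, e.g. "govern-\nment" -> "government"
    text = re.sub(r"-\s*\n\s*", "", text)
    # collapse runs of whitespace left over from tables and multi-column layouts
    text = re.sub(r"\s+", " ", text)
    return text.strip()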
I definitely recommend trying these models out, & no GPU is required unless you are retraining new models and need it.
Left: a page from the Federal Data Strategy development team. Below: the same page parsed by the Science Parse model into text format.
I still see many names quite clearly in the cropped image, e.g., Tiffany Julian, Data Scientist, National Center for Science and Engineering Statistics, National Science Foundation.
[Mini shout out to federal data scientists]
Diving into Reading Comprehension Models with two paragraphs~
Reading Comprehension visualized with the AllenNLP demo page
QUESTION 1: How many people were a part of the FDS team?
QUESTION 2: What is OMB doing in respect to artificial intelligence?
Pretty cool, right? The model is BiDAF with ELMo embeddings. The basic layout is pretty simple: encode words as a combination of word embeddings and a character-level encoder, pass the word representations through a bi-LSTM/GRU, use a matrix of attentions to put question information into the passage word representations (this is the only part that is at all non-standard), pass this through another few layers of bi-LSTMs/GRUs, and do a softmax over the span start and span end.
Visualizing it in Code
example_dictionary = {"passage": "By January 2020, OMB will establish the FDPC that will help agencies deliver on mission and effectively steward taxpayer dollars by enhancing OMB’s coordination of Federal data policy, governance, and resource considerations. OMB has statutory responsibility and coordinates many government-wide priorities and functions, many of which have a datarelated dimension. The FDPC will be a mechanism to coordinate OMB’s own data policy development and implementation activities for the Federal Government, including those necessary for the executive branch to meet existing and new legal requirements as well as addressing emerging priority data governance areas such as preparing data for use in artificial intelligence. Over time, the FDPC will also provide a forum for OMB offices to address selected data issues that cross agencies or span executive councils’ responsibilities. The FDPC is responsible for governmentwide management, governance, and resource priorities for data management standardization and use, including by contributing to the FDS’s annual action plans and align transformation efforts to reduce costs, duplication, and burden. The FDPC will be comprised of senior staff representing OMB’s statutory and programmatic areas, including offices responsible for evaluation, financial management, information technology, performance management, privacy, procurement, regulations, resource management, and statistical policy. The FDPC’s charter will specify roles and responsibilities. OMB’s approach to working across its functional areas will furthermore serve as a model for individual agencies to bridge their own functional silos.",
"question": "What is OMB doing in respect to artificial intelligence?"
}
from allennlp.predictors.predictor import Predictor
import allennlp_models.rc  # registers the reading comprehension models

# load the pre-trained BiDAF + ELMo model
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo.2021-02-11.tar.gz")
result = predictor.predict(
    passage=example_dictionary["passage"],
    question=example_dictionary["question"]
)
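The predictor returns a dictionary of outputs; for BiDAF the predicted answer string comes back under best_span_str (alongside span indices and probabilities), so you can print it directly:
# the predicted answer span as a plain string
print(result["best_span_str"])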
Named Entity Recognition Visualized
The Fine-Grained Named Entity Recognition model is a biLSTM-CRF tagger. This model identifies a broad range of 16 semantic types in the input text. It is a reimplementation of Lample et al. (2016) and uses a biLSTM with a CRF layer, character embeddings, and ELMo embeddings. Here are the entities it identified: LAW and DATE.
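If you want to run the same NER model outside the demo page, here is a minimal sketch using the pretrained-model registry in allennlp-models; the model ID tagging-fine-grained-ner is my assumption for this demo model, so check the allennlp-models model cards if it has moved:
from allennlp_models.pretrained import load_predictor

# assumed registry ID for the fine-grained NER demo model
ner_predictor = load_predictor("tagging-fine-grained-ner")
output = ner_predictor.predict(
    sentence="By January 2020, OMB will establish the FDPC that will help agencies deliver on mission."
)
# the predictor returns tokenized words with one BIO tag per word
for word, tag in zip(output["words"], output["tags"]):
    if tag != "O":
        print(word, tag)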
There are so many other great NLP models to share for various tasks beyond reading comprehension and NER, such as coreference resolution, text generation, textual entailment, caption generation, etc.
ACRONYMS
AI2 - Allen Institute for AI
FDPC - Federal Data Policy Committee
FDS - Federal Data Strategy
NER - Named Entity Recognition
NLP - Natural Language Processing
OMB - Office of Management and Budget
PMA - President's Management Agenda