Using Artificial Intelligence to Parse PDFs from performance.gov
Question & Answering - Reading Comprehension with passage from Federal Government PDF with answer visualized

Yes, models can be trained to detect things such as sections, tables, etc. Luckily, AllenAI has an open-source model just for that.

This mini-tutorial is basically about turning PDFs and PowerPoints into data in JSON format using artificial intelligence and natural language processing, and then applying some other cool stuff :)

Requirements:

  • A macOS or Linux-based system (64-bit)
  • Python 3.6.5 or higher (I used Python 3.7)
  • allennlp==2.1.0
  • allennlp-models==2.1.0

As the President's Management Agenda (PMA) launch approaches, I decided to finally write this tutorial and parse some of the PMA PDFs that I had downloaded earlier.

So what can AI and natural language processing do about PDFs? Let's explore~~

The model to use is called Science Parse, from AI2. Note that it is NOT optimized for processing non-paper academic documents such as dissertations, reports, slides, etc. However, it still does pretty okay on these data strategy PDFs, and I think it illustrates what you can do with models trained to parse PDFs. Future project for y'all: train a model like Science Parse on PDF reports and slides~

Science Parse normally parses the following.

[Image: the fields Science Parse extracts from a document]

Using the command-line interface, I ran the pre-trained model on the 4 random PDFs to turn them into JSON representations. Basically, once you have a folder with the PDFs and a second folder where you want your JSON outputs to go, you run the command:

RunSP -o <directory> -f <output directory>        

  • GSAPerformance.pdf --> GSAPerformance.json
  • Category_Management.pdf --> Category_Management.json
  • Sharing_Quality_Services.pdf --> Sharing_Quality_Services.json
  • Federal-Data-Strategy-Action-Plan.pdf --> Federal-Data-Strategy-Action-Plan.json
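
To check which PDFs actually produced an output file, something like the following could work. It is a minimal sketch that assumes each JSON file keeps the PDF's base name with a .json extension, as in the list above; the helper names and the "json_out" folder are hypothetical.

```python
from pathlib import Path


def expected_json_paths(pdf_names, output_dir):
    """Map each PDF file name to the JSON path we expect inside output_dir."""
    out = Path(output_dir)
    # Assumption: the output keeps the PDF's base name and swaps the extension.
    return {name: out / (Path(name).stem + ".json") for name in pdf_names}


def missing_outputs(pdf_names, output_dir):
    """Return the PDFs whose expected JSON file does not exist yet."""
    expected = expected_json_paths(pdf_names, output_dir)
    return [name for name, path in expected.items() if not path.exists()]


pdfs = [
    "GSAPerformance.pdf",
    "Category_Management.pdf",
    "Sharing_Quality_Services.pdf",
    "Federal-Data-Strategy-Action-Plan.pdf",
]
# "json_out" is a hypothetical output folder name.
for name, path in expected_json_paths(pdfs, "json_out").items():
    print(name, "-->", path.name)
```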

The model was not able to parse out sections from any of the first three PDFs, but it did parse out 23 sections with headers from Federal-Data-Strategy-Action-Plan. Selected sections are shown in the image below:

[Image: selected sections parsed from Federal-Data-Strategy-Action-Plan]

From Sharing_Quality_Services, the model (Science Parse) was able to parse the title, 'Sharing Quality Services: Improving Efficiency and Effectiveness of Mission Support Services Across Government', the year "2020", and a list of authors, but unfortunately it couldn't split the PDF into sections the way it did for the Federal Data Strategy output.
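
Here is a rough sketch of how you might pull those fields out of one parsed file. The key names ("metadata", "title", "year", "authors", "sections", "heading") are my assumption about the JSON layout, so check them against your own output; the sample dictionary is a hand-written stand-in, not real model output.

```python
import json


def summarize_parse(doc):
    """Pull the title, year, authors and section headings out of one parsed document."""
    meta = doc.get("metadata") or {}
    # "sections" may be missing or null when the model could not find any.
    sections = meta.get("sections") or []
    return {
        "title": meta.get("title"),
        "year": meta.get("year"),
        "authors": meta.get("authors") or [],
        "headings": [section.get("heading") for section in sections],
        "n_sections": len(sections),
    }


# Hypothetical stand-in for the contents of one output file.
sample = json.loads("""
{
  "name": "Example_Report.pdf",
  "metadata": {
    "title": "Example Report",
    "year": 2020,
    "authors": ["Example Author"],
    "sections": null
  }
}
""")

summary = summarize_parse(sample)
print(summary["title"], summary["year"], summary["n_sections"])
```

For a real file you would replace the sample with json.load(open(path)) on one of the outputs.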

Now that we've got our PDFs into JSON format, what other Natural Language Processing models should we experiment with?

Let's go with the allennlp models because they are extremely efficient and easy to use~ You can absolutely use your own reading comprehension and NER models, but I always suggest starting with pre-trained deep learning models to save on computational resources.

[Image: example of a table as parsed by Science Parse]

Above is an example that illustrates how tables are often parsed with this type of paper/PDF data. That being said, it is still a sufficient way to transform a PDF to text data, although some post-processing and cleaning might be needed in order to parse it out perfectly. What do you all think?
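
As an idea of what that post-processing could look like, here is a small cleanup sketch. The rules are my own guesses at common PDF extraction damage (hyphenated line breaks, stray newlines, repeated spaces), not something Science Parse provides, and the function name is hypothetical.

```python
import re


def clean_parsed_text(text):
    """Light cleanup for text extracted from a PDF."""
    # Rejoin words split by a hyphen at a line break: "gov-\nernance" -> "governance".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, but keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "Federal data gov-\nernance   and\nstrategy\n\nNext  paragraph"
print(clean_parsed_text(raw))
```

One caveat: the first rule also glues together words that really are hyphenated, so "data-" followed by "related" on the next line becomes "datarelated". Tables will usually need more than this.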

I definitely recommend it, and no GPU is required unless you are retraining models.

Left: A page from the federal data strategy development team. Below: Parsed from SciParse model into text format

I still see many names quite clearly in the cropped image, e.g., Tiffany Julian, Data Scientist, National Center for Science and Engineering Statistics, National Science Foundation.

[Mini shout out to federal data scientists]

Diving into Reading Comprehension Models with two paragraphs~

Reading Comprehension Visualized with AllenNLP demo page

QUESTION 1: How many people were a part of the FDS team?

QUESTION 2: What is OMB doing in respect to artificial intelligence?

  • I know there are other parts of that PDF that discussed artificial intelligence policy and research, but I love that this section focuses on preparing the data.

Pretty cool, right? The model is a BiDAF model with ELMo embeddings. The basic layout is pretty simple: encode words as a combination of word embeddings and a character-level encoder, pass the word representations through a bi-LSTM/GRU, use a matrix of attentions to put question information into the passage word representations (this is the only part that is at all non-standard), pass this through another few layers of bi-LSTMs/GRUs, and do a softmax over span start and span end.
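
To make that last step concrete, here is a toy sketch of picking an answer span from start and end scores. The scores are made up, and this is a generic way to decode a span (maximize start probability times end probability with start <= end), not necessarily exactly what AllenNLP does internally.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end, probability) maximizing p_start[i] * p_end[j] with i <= j."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, up to max_len tokens long.
        for j in range(i, min(i + max_len, len(p_end))):
            prob = ps * p_end[j]
            if prob > best[2]:
                best = (i, j, prob)
    return best


# Made-up scores for a 7-token passage fragment.
tokens = ["preparing", "data", "for", "use", "in", "artificial", "intelligence"]
start_logits = [4.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.5]
end_logits = [0.0, 1.0, 0.0, 0.5, 0.0, 1.0, 4.0]

start, end, prob = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))
```

With these scores the highest start score is on the first token and the highest end score is on the last one, so the whole fragment is selected.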

Visualizing it in Code

example_dictionary =  {"passage": "By January 2020, OMB will establish the FDPC that will help agencies deliver on mission and effectively steward taxpayer dollars by enhancing OMB’s coordination of Federal data policy, governance, and resource considerations. OMB has statutory responsibility and coordinates many government-wide priorities and functions, many of which have a datarelated dimension. The FDPC will be a mechanism to coordinate OMB’s own data policy development and implementation activities for the Federal Government, including those necessary for the executive branch to meet existing and new legal requirements as well as addressing emerging priority data governance areas such as preparing data for use in artificial intelligence. Over time, the FDPC will also provide a forum for OMB offices to address selected data issues that cross agencies or span executive councils’ responsibilities. The FDPC is responsible for governmentwide management, governance, and resource priorities for data management standardization and use, including by contributing to the FDS’s annual action plans and align transformation efforts to reduce costs, duplication, and burden. The FDPC will be comprised of senior staff representing OMB’s statutory and programmatic areas, including offices responsible for evaluation, financial management, information technology, performance management, privacy, procurement, regulations, resource management, and statistical policy. The FDPC’s charter will specify roles and responsibilities. OMB’s approach to working across its functional areas will furthermore serve as a model for individual agencies to bridge their own functional silos.",
  "question": "What is OMB doing in respect to artificial intelligence?"
}

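from allennlp.predictors.predictor import Predictor
import allennlp_models.rc  # registers the reading comprehension models and predictors

# Download and load the pre-trained BiDAF + ELMo model (a large download on the first run)
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo.2021-02-11.tar.gz")
result = predictor.predict(
    passage=example_dictionary["passage"],
    question=example_dictionary["question"]
)
print(result["best_span_str"])  # the answer span the model picked from the passage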
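
If you want to ask several questions about the same passage, a small wrapper like the one below might help. It is a sketch: the wrapper and the stand-in predictor are hypothetical, and it assumes the returned dictionary carries the answer text under "best_span_str".

```python
def answer_questions(predict_fn, passage, questions):
    """Ask each question against the same passage and collect the answer strings."""
    answers = {}
    for question in questions:
        result = predict_fn(passage=passage, question=question)
        # Assumption: the answer text lives under the "best_span_str" key.
        answers[question] = result.get("best_span_str", "")
    return answers


def fake_predict(passage, question):
    """Stand-in for predictor.predict so the sketch runs without downloading a model."""
    return {"best_span_str": passage.split(".")[0]}


demo = answer_questions(
    fake_predict,
    "OMB will establish the FDPC. It will coordinate data policy.",
    ["What will OMB establish?", "Who coordinates data policy?"],
)
for question, answer in demo.items():
    print(question, "->", answer)
```

With the real model you would pass predictor.predict in place of fake_predict, along with example_dictionary["passage"] and your list of questions.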


        
Named Entity Recognition Visualized

The Fine-Grained Named Entity Recognition model is a BiLSTM and CRF tagger. This model identifies a broad range of 16 semantic types in the input text. It is a reimplementation of Lample et al. (2016) and uses a biLSTM with a CRF layer, character embeddings, and ELMo embeddings. Here are the entities it identified: LAW and DATE.
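
A tagger like this labels every token, so to get whole entities you group the tags back together. Here is a minimal sketch that assumes BIOUL-style tags (B = begin, I = inside, L = last, U = single-token entity, O = outside); the words and tags are hand-written for illustration, not actual model output.

```python
def group_entities(words, tags):
    """Collapse per-token BIOUL tags into (entity_text, label) pairs."""
    entities, current, label = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "U":  # single-token entity
            entities.append((word, tag_label))
            current, label = [], None
        elif prefix == "B":  # first token of a multi-token entity
            current, label = [word], tag_label
        elif prefix in ("I", "L") and current and tag_label == label:
            current.append(word)
            if prefix == "L":  # last token closes the entity
                entities.append((" ".join(current), label))
                current, label = [], None
        else:  # "O" or a malformed sequence: drop any partial entity
            current, label = [], None
    return entities


# Hand-written example tokens and tags (hypothetical).
words = ["By", "January", "2020", ",", "OMB", "will", "establish", "the", "FDPC"]
tags = ["O", "B-DATE", "L-DATE", "O", "U-ORG", "O", "O", "O", "O"]
print(group_entities(words, tags))
```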

There are so many other great NLP models to share on various tasks other than reading comprehension and NER, such as coreference resolution, text generation, textual entailment, caption generation, etc.


ACRONYMS

  • NLP - Natural Language Processing
  • CRF - Conditional Random Field
  • BiLSTM - Bi-directional Long Short-Term Memory
  • ELMo - Embeddings from Language Models
  • NER - Named Entity Recognition
  • GSA - General Services Administration
  • BiLM - deep Bidirectional Language Model
  • GRU - Gated Recurrent Unit


https://www.performance.gov/





Isabel Metzger

Senior Data Scientist & Lead AI Engineer in US Government l AI/ML - Natural Language Processing & Computational Linguistics
