Using Artificial Intelligence to Parse PDFs from performance.gov
Isabel Metzger
Senior Data Scientist & Lead AI Engineer in US Government | AI/ML - Natural Language Processing & Computational Linguistics
Yes, models can be trained to detect things such as sections, tables, etc. Luckily, AllenAI has an open-source model just for that.
This mini-tutorial turns PDFs and PowerPoint slides into JSON data using artificial intelligence and natural language processing, and then applies some other cool stuff on top :)
Requirements: Science Parse from AI2 (to turn the PDFs into JSON) and allennlp with allennlp-models (for the pre-trained reading comprehension and NER models used below).
As the President's Management Agenda (PMA) Launch approaches, I decided to finally write this tutorial to parse some of the various PDFs I've downloaded from PMA.
So what can AI and natural language processing do with PDFs? Let's explore~~
The model we'll use is called Science Parse, from AI2. Note: it is NOT optimized for documents other than academic papers, such as dissertations, reports, slides, etc. However, it still does pretty okay on these data strategy PDFs, and I think it illustrates what you can do with models trained to parse PDFs. Future project for y'all: train a model like Science Parse on PDF reports and slides~
Science Parse normally parses the following: the title, authors, abstract, year, sections (each with a heading and body text), and bibliography entries.
Using the command-line interface, I run the pre-trained model on the four PDFs to produce JSON representations. Basically, once you have a folder with the PDFs and a second folder where you want your JSON outputs to go, you will run the command:
RunSP -o <directory> -f <output directory>
The model was not able to parse out sections from any of the first three PDFs, but it did parse out 23 sections, with headers, from Federal-Data-Strategy-Action-Plan; selected sections are shown in the image below:
From Sharing_Quality_Services, Science Parse was able to parse the title 'Sharing Quality Services: Improving Efficiency and Effectiveness of Mission Support Services Across Government', the year "2020", and a list of authors, but unfortunately it couldn't sectionize the PDF the way it did the federal data strategy output.
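If you want to poke at one of these JSON outputs yourself, here is a minimal sketch in Python; the file path is hypothetical, and the field names (title, year, sections with heading/text) are assumptions based on what Science Parse returned above, so adjust to your own output:
import json

# hypothetical path to one of the Science Parse JSON outputs
with open("json_outputs/Federal-Data-Strategy-Action-Plan.pdf.json") as f:
    doc = json.load(f)

# assumed top-level fields, matching what Science Parse extracted above
print(doc.get("title"))
print(doc.get("year"))
for section in doc.get("sections") or []:
    # each section is assumed to carry a heading plus its body text
    print(section.get("heading"), "->", (section.get("text") or "")[:80])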
Now that we've got our PDFs into JSON format, what other natural language processing models should we experiment with?
Let's go with the AllenNLP models because they are extremely efficient and easy to use~ You can absolutely use your own reading comprehension and NER models, but I always suggest starting with pre-trained deep learning models to save on computational resources.
Above is an example that illustrates how tables are often parsed with this type of paper/PDF data. That said, it is still a sufficient way to transform a PDF into text data, although some post-processing and cleaning might be needed to parse it out perfectly. What do you all think?
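If you want to try that post-processing yourself, here is a tiny, hypothetical cleanup pass: it just re-joins words the PDF layout hyphenated across line breaks and collapses the leftover whitespace from table and column layouts:
import re

def clean_extracted_text(text):
    # re-join words hyphenated across a line break, e.g. "govern-\nment" -> "government"
    text = re.sub(r"-\s*\n\s*", "", text)
    # collapse runs of whitespace left over from tables and multi-column layouts
    text = re.sub(r"\s+", " ", text)
    return text.strip()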
I definitely recommend trying these models out, & no GPU is required unless you are retraining new models and need it.
Left: a page from the Federal Data Strategy development team. Below: the same page parsed by the Science Parse model into text format.
I still see many names quite clearly in the cropped image, e.g., Tiffany Julian, Data Scientist, National Center for Science and Engineering Statistics, National Science Foundation.
[Mini shout out to federal data scientists]
Diving into Reading Comprehension Models with two paragraphs~
Reading Comprehension visualized with the AllenNLP demo page
QUESTION 1: How many people were a part of the FDS team?
QUESTION 2: What is OMB doing in respect to artificial intelligence?
Pretty cool, right? The model is BiDAF with ELMo embeddings. The basic layout is pretty simple: encode words as a combination of word embeddings and a character-level encoder, pass the word representations through a bi-LSTM/GRU, use a matrix of attentions to put question information into the passage word representations (this is the only part that is at all non-standard), pass this through another few layers of bi-LSTMs/GRUs, and do a softmax over the span start and span end.
Visualizing it in Code
example_dictionary = {"passage": "By January 2020, OMB will establish the FDPC that will help agencies deliver on mission and effectively steward taxpayer dollars by enhancing OMB’s coordination of Federal data policy, governance, and resource considerations. OMB has statutory responsibility and coordinates many government-wide priorities and functions, many of which have a datarelated dimension. The FDPC will be a mechanism to coordinate OMB’s own data policy development and implementation activities for the Federal Government, including those necessary for the executive branch to meet existing and new legal requirements as well as addressing emerging priority data governance areas such as preparing data for use in artificial intelligence. Over time, the FDPC will also provide a forum for OMB offices to address selected data issues that cross agencies or span executive councils’ responsibilities. The FDPC is responsible for governmentwide management, governance, and resource priorities for data management standardization and use, including by contributing to the FDS’s annual action plans and align transformation efforts to reduce costs, duplication, and burden. The FDPC will be comprised of senior staff representing OMB’s statutory and programmatic areas, including offices responsible for evaluation, financial management, information technology, performance management, privacy, procurement, regulations, resource management, and statistical policy. The FDPC’s charter will specify roles and responsibilities. OMB’s approach to working across its functional areas will furthermore serve as a model for individual agencies to bridge their own functional silos.",
"question": "What is OMB doing in respect to artificial intelligence?"
}
from allennlp.predictors.predictor import Predictor
import allennlp_models.rc  # registers the reading comprehension models

# load the pre-trained BiDAF + ELMo model
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo.2021-02-11.tar.gz")
result = predictor.predict(
    passage=example_dictionary["passage"],
    question=example_dictionary["question"]
)
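The predictor returns a dictionary of outputs; for BiDAF the predicted answer string comes back under best_span_str (alongside span indices and probabilities), so you can print it directly:
# the predicted answer span as a plain string
print(result["best_span_str"])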
Named Entity Recognition Visualized
The Fine-Grained Named Entity Recognition model is a biLSTM-CRF tagger. This model identifies a broad range of 16 semantic types in the input text. It is a reimplementation of Lample et al. (2016) and uses a biLSTM with a CRF layer, character embeddings, and ELMo embeddings. Here are the entities it identified: LAW and DATE.
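If you want to run the same NER model outside the demo page, here is a minimal sketch using the pretrained-model registry in allennlp-models; the model ID tagging-fine-grained-ner is my assumption for this demo model, so check the allennlp-models model cards if it has moved:
from allennlp_models.pretrained import load_predictor

# assumed registry ID for the fine-grained NER demo model
ner_predictor = load_predictor("tagging-fine-grained-ner")
output = ner_predictor.predict(
    sentence="By January 2020, OMB will establish the FDPC that will help agencies deliver on mission."
)
# the predictor returns tokenized words with one BIO tag per word
for word, tag in zip(output["words"], output["tags"]):
    if tag != "O":
        print(word, tag)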
There are so many other great NLP models to share for various tasks beyond reading comprehension and NER, such as coreference resolution, text generation, textual entailment, caption generation, etc.
ACRONYMS
AI2 - Allen Institute for AI
FDPC - Federal Data Policy Committee
FDS - Federal Data Strategy
NER - Named Entity Recognition
NLP - Natural Language Processing
OMB - Office of Management and Budget
PMA - President's Management Agenda