Exploring NLP Parsed Audit Documents
Learning more Python because my machine is slow
Original post and the Named Entity Explorer (click "Audits" in the upper-right navigation bar) may be found at randalmoore.me
Inspiration
This project came about for these reasons:
1. Make a positive impact on the world:
   - Enable understanding through exploration of hard data as opposed to adhering to a particular belief system.
   - A first project to gain some experience working with real-world open data.
2. Learn some of the technologies used by TravelPerk (my soon-to-be employer) so I could hit the ground running.
3. Professional development: gain experience with some of the latest technologies.
The previous version of the site included out-of-the-box Postgres text search functionality, which is already very powerful. However, only documents through mid-2014 were included in the .zip file available for download from archive.org. Searching the ~35,000 audit documents for the names of my favorite politicians was fun but not particularly insightful. The goal became how to glean information from such a vast corpus without a huge investment in building and tuning an AI system.
Final Product
After some probing I came across NLTK, a Python library that can parse raw text and reveal its structure. Of particular interest was Named Entity Recognition (see section 5). Since the context here is already fixed as audit documents, any named entity within a document is of interest. Combined with the observation that a simple frequency distribution often yields insight, a plan was hatched to explore the audit document corpus based on the frequency of named entities. After some more thought the plan was refined as follows (a small NLTK sketch follows the list):
- Extract all named entities from each audit document along with their frequency within that document.
- Across all audit documents, determine the most frequently occurring named entities. Provide the user with an ordered list of the N most frequently occurring named entities across audit documents (each document contributes a count of 1 towards a named entity if that named entity occurs more than K times in the document).
- Allow the scope to be limited to a user selected set of years so the user can see how subject matter changes across years.
- Enable exploration through refinement. After a user selects a named entity the scope is then limited only to Audit Documents that contain that named entity.
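For illustration, here is a minimal sketch of the extraction and thresholding steps above, using NLTK's standard tokenize / tag / chunk pipeline. It is not the project's actual code; the function name, the value of K, and the file path are placeholders:

from collections import Counter
import nltk

def named_entity_counts(raw_text):
    # Count how often each named entity appears in a single document.
    counts = Counter()
    for sentence in nltk.sent_tokenize(raw_text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for chunk in nltk.ne_chunk(tagged):
            # Named entities come back as subtrees labelled PERSON, ORGANIZATION, GPE, ...
            if hasattr(chunk, 'label'):
                counts[' '.join(word for word, tag in chunk.leaves())] += 1
    return counts

# A document "votes" for a named entity only if the entity occurs more than K times in it.
K = 2
votes = {entity for entity, n in named_entity_counts(open('report.txt').read()).items() if n > K}

(Requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words NLTK data packages.)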
The above exploration process allows the user to narrow down to a small number of audit documents that all contain the set of named entities the user is interested in. In practice this yields a set of documents related to a specific concern. Here is an example search across audit documents from 2017:
1. Landing page: click the "Named Entity Exploration" button.
2. Start of named entity exploration: click the "Veterans" button.
3. Documents containing "Veterans": click the "VISN" button.
4. Scope narrowed to documents also containing "VISN".
5. Final result, including:
   - Medical Center
   - OSC: acronym for the U.S. Office of Special Counsel, which investigates whistleblower complaints.
The result is a consistent set of Audit Documents investigating the performance of Medical Centers providing care for veterans across various cities.
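In code terms, each click in the walkthrough above is just another filter on the document set. A toy version of the refinement step, assuming a hypothetical doc_votes mapping from document id to the set of entities that passed the K-threshold in that document:

from collections import Counter

def refine(doc_votes, selected, top_n=20):
    # Keep only documents whose thresholded entities include everything selected so far.
    in_scope = {doc for doc, entities in doc_votes.items() if selected <= entities}
    # Each in-scope document contributes a count of 1 towards every other entity it contains.
    counts = Counter()
    for doc in in_scope:
        counts.update(doc_votes[doc] - selected)
    return in_scope, counts.most_common(top_n)

# e.g. refine(doc_votes, {"Veterans", "VISN"}) returns the matching documents
# plus the next set of buttons to offer the user.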
Offline Processing
As mentioned before, the archive at archive.org only covered documents through mid-2014. Obtaining more recent reports required using the software that created the archive in the first place. This software knows which Inspector General websites exist and how to search each one for audit documents. Each audit document is downloaded, converted to raw text (often poorly, a topic for another post), and parsed for basic metadata (title, publication date, etc.). A directory with these files is created for each report, with the overall directory structure reflecting the publication year and the website the report came from. Running the tool resulted in a total of 51,295 audit documents to process. The metadata would supply the title, publication date, and source URL, while the raw text would be parsed for named entities and indexed by Postgres for text search. Only the distilled data from this processing would be included on the site; to view a document, the end user would follow the provided origin URL.
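The exact layout produced by the scraper isn't documented here, but assuming something like <root>/<year>/<site>/<report_id>/ with one metadata file and one extracted-text file per report, the discovery step boils down to a simple directory walk (the file names below are placeholders):

import json
from pathlib import Path

def iter_reports(root):
    # Yield (metadata, raw_text) for every report directory found under root.
    for year_dir in sorted(Path(root).iterdir()):
        if not year_dir.is_dir():
            continue
        for site_dir in year_dir.iterdir():
            if not site_dir.is_dir():
                continue
            for report_dir in site_dir.iterdir():
                meta_path = report_dir / 'metadata.json'   # placeholder name
                text_path = report_dir / 'report.txt'      # placeholder name
                if meta_path.exists() and text_path.exists():
                    yield json.loads(meta_path.read_text()), text_path.read_text(errors='replace')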
Parsing each audit document with NLTK turned out to be the bottleneck: up to 10 seconds or so for a particularly large text. To finish in a reasonable amount of time it became clear that all 4 of my raging CPU cores (Phenom II X4 965) would need to be fully utilized. Python offers several options for concurrent execution, including asynchronous programming within a single thread (supported by the language itself), multithreading, and multiprocessing. In a previous article I explained the differences between these. Since these tasks are CPU bound, and taking the Python Global Interpreter Lock into consideration, the best option was multiprocessing.
This script populates the Postgres database with metadata, text-search, and named-entity data for each audit document. It walks the directory structure from a given root and processes the files associated with each audit document if that data isn't already in the database. Documents are processed in parallel, one per core of the host machine.
Writing the results to Postgres was originally done in the same task, right after processing a document, but this caused concurrency errors in the Django layer. Coordinating the database as a shared resource would have been complicated and inefficient, so I opted for a queue instead. Each document-processing task places its result into the queue, and a dedicated DB process consumes from the queue and writes each result to the database sequentially. The final code was surprisingly simple:
Top of file:
import multiprocessing

# Declared as global to be shared between processes
db_queue = None
In __main__ create the DB queue and start its process:
# Special queue provided for IPC use
db_queue = multiprocessing.Manager().Queue()
# DB dedicated process
multiprocessing.Process(target=save_to_db).start()
In process_documents() the results are placed in the queue:
db_queue.put((doc, named_entities))
In save_to_db() we loop forever, consuming from the queue until None is encountered:
while True:
    # blocks if queue is empty
    doc_tuple = db_queue.get()
    # Exit function if None found
    if doc_tuple is None:
        return
    # Save to DB
    doc, named_entities = doc_tuple
    # ... write doc and named_entities via the Django models ...
In __main__ all cores of the host machine are loaded up using a multiprocessing Pool, and once the pool is finished, None is placed in the DB queue so the DB task sees it and exits:
# By default creates as many processes as available cores
with multiprocessing.Pool() as p:
    p.map(func=process_documents, iterable=files)
db_queue.put(None)
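For reference, here is how the pieces fit together in a self-contained toy version of the pattern. It is not the project's code: the document-processing and DB-write bodies are stand-ins, and the queue is passed around explicitly rather than through a global so the sketch also works under the spawn start method:

import multiprocessing

def process_document(args):
    path, queue = args
    named_entities = {'Veterans': 3}               # stand-in for the real NLTK parsing step
    queue.put((path, named_entities))

def save_to_db(queue):
    while True:
        item = queue.get()                         # blocks until something is available
        if item is None:                           # sentinel: the producers are finished
            return
        path, named_entities = item
        print('would write', path, named_entities) # stand-in for the Django ORM write

if __name__ == '__main__':
    queue = multiprocessing.Manager().Queue()      # proxy queue, safe to share between processes
    writer = multiprocessing.Process(target=save_to_db, args=(queue,))
    writer.start()
    files = ['a.txt', 'b.txt', 'c.txt']
    with multiprocessing.Pool() as pool:           # one worker per available core by default
        pool.map(process_document, [(f, queue) for f in files])
    queue.put(None)                                # tell the writer it can stop
    writer.join()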
With this my machine was steadily maxed at 100% processor utilization, and processing all of the documents took close to a full day. A fair amount of electricity was used, but at least some neat Python tricks were explored and cool words such as "corpus" learned :)