Sampling Entity tagging with nltk, spaCy and CoreNLP using Flask
Ankur Bhattacharya
Application & Certified Cloud Solution Architect at IBM Global Business Services
Introduction
We were building a microservice to identify and tag personal information (PI) and sensitive personal information (SPI) elements in unstructured data. We came across several Natural Language Processing toolkits capable of Named Entity Recognition (NER) that are either built on Python or can be integrated with Python. (We were looking for options that could be deployed in a private cloud; we have not evaluated Watson Anywhere for this case, as it still requires ICP for Data.) So we wanted a sampling service, developed using Flask, to which we could submit unstructured text along with the toolkit to use for named entity tagging, and which would return an annotated list of the entities it could identify in that text. The goal was to check how well the pre-trained models of these NLP toolkits identify entities in a specific domain.
Steps
First we need to create a Docker container with Flask, in which we also install spaCy and nltk (for CoreNLP we provision a separate container).
FROM python:3
WORKDIR /usr/app
COPY requirements.txt ./
RUN pip install -r requirements.txt
RUN python -m spacy download en_core_web_sm
RUN python -m nltk.downloader punkt averaged_perceptron_tagger maxent_ne_chunker words
WORKDIR /usr/app
COPY app.py ./
COPY index.html ./templates/
ENTRYPOINT [ "python" ]
CMD [ "app.py" ]
Our container image is based on a Python base image, on which we install the other Python modules using the pip install command. The requirements.txt file lists the Python modules to be installed, which in our case are:
flask
nltk
numpy
spacy
Once the core modules are installed, we run the spaCy and nltk download modules with python -m to fetch the pre-trained spaCy model and the nltk resources (tokenizer, POS tagger, named entity chunker and word list). We then copy the bootstrap program and the web page for our Flask application into the required folders, after which we start the bootstrap program.
The code for our Python bootstrap program is as follows:
# flask_web/app.py
import nltk
from flask import Flask, render_template, request
from nltk.parse import CoreNLPParser
import en_core_web_sm

# Create the Flask application object used by the route decorator below.
app = Flask(__name__)

# Load the small English spaCy model once at startup.
nlp = en_core_web_sm.load()


def stanford_tag(text):
    # The CoreNLP server runs in a separate container and serves plain HTTP
    # by default; change the scheme if your server is configured for SSL.
    parser = CoreNLPParser(url='http://172.17.0.2:9000', tagtype='ner')
    return list(parser.tag(text.split()))


def nltk_tag(sent):
    # Tokenize, POS-tag, then chunk named entities into an nltk Tree.
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    sent = nltk.ne_chunk(sent)
    return sent


def spacechunk(sent):
    # Return the (text, label) pairs of the entities spaCy finds, as a string.
    doc = nlp(sent)
    chunk = str([(X.text, X.label_) for X in doc.ents])
    return chunk


@app.route('/call_form', methods=['GET', 'POST'])
def call_form():
    if request.method == 'POST':
        data_text = request.form['data']
        nerlib = request.form['nerlibs']
        if nerlib == 'stanford':
            return str(stanford_tag(data_text))
        if nerlib == 'nltk':
            return str(nltk_tag(data_text))
        if nerlib == 'spacy':
            return spacechunk(data_text)
    # GET requests (and unrecognized toolkit names) get the form page.
    return render_template('index.html')


if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')
In app.py, we first import the required libraries and then define three functions, one each for NER tagging with Stanford CoreNLP (the CoreNLP server runs in a different Docker container on port 9000), nltk and spaCy.
The fourth function is mapped to the '/call_form' route using a decorator. When a GET request is made to this URL, the function serves the HTML page named 'index.html'.
When the form in the 'index.html' page is submitted, the same function is called with a POST request. The function retrieves the raw text to be tagged and the toolkit to be used from the form elements. Depending on the toolkit selected, it calls one of the three functions defined earlier to get the annotated text.
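As a side note, the chain of if statements that picks the tagger could also be written as a lookup table. The following is only a minimal sketch of that idea, not the code used in the service: the three fake_* functions are hypothetical stand-ins for stanford_tag, nltk_tag and spacechunk so that the example runs without the NLP libraries, and tag_with is a name invented here.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for stanford_tag, nltk_tag and spacechunk, so the
# sketch runs without CoreNLP, nltk or spaCy installed.
def fake_stanford(text: str) -> str:
    return "stanford:" + text


def fake_nltk(text: str) -> str:
    return "nltk:" + text


def fake_spacy(text: str) -> str:
    return "spacy:" + text


# Keys are the toolkit names the service compares against.
TAGGERS: Dict[str, Callable[[str], str]] = {
    "stanford": fake_stanford,
    "nltk": fake_nltk,
    "spacy": fake_spacy,
}


def tag_with(nerlib: str, text: str) -> str:
    """Look up the tagger by name and reject names that are not registered."""
    tagger = TAGGERS.get(nerlib)
    if tagger is None:
        raise ValueError(f"unknown NER toolkit: {nerlib!r}")
    return tagger(text)


print(tag_with("spacy", "Hong Kong"))
```

With a table like this, adding a fourth toolkit would mean registering one more entry rather than adding another branch to the route function.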
Now let's see what the code for the index.html page looks like:
<!DOCTYPE html>
<html>
  <head>
    <title>NER Tagging</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" media="screen">
    <style>
      .container { max-width: 1000px; }
    </style>
  </head>
  <body>
    <div class="container">
      <h1>NER Tagging</h1>
      <form role="form" method='POST' action='/call_form'>
        <div class="form-group">
          <textarea name="data" class="form-control" id="data-box" placeholder="Enter text..." style="max-width: 300px;" autofocus required>Type in your text here</textarea>
        </div>
        <div class="form-group">
          <select name="nerlibs" class="form-control" style="width:25%">
            <option value="stanford">stanford</option>
            <option value="nltk">nltk</option>
            <option value="spacy">spacy</option>
          </select>
        </div>
        <button type="submit" class="btn btn-default">Submit</button>
      </form>
      <br>
      {% for error in errors %}
        <h4>{{ error }}</h4>
      {% endfor %}
    </div>
    <script src="https://code.jquery.com/jquery-2.2.1.min.js"></script>
    <script src="https://netdna.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
  </body>
</html>
You can see that the form action binds to the route '/call_form'. The textarea element collects the raw text that needs to undergo NER-based tagging. A dropdown control is used to select the NER toolkit to be used.
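Since the form is an ordinary URL-encoded POST, the service could also be called from a script instead of the browser. The sketch below shows one way to build such a request with the standard library. It assumes the Flask app is reachable at http://localhost:5000 (5000 is Flask's default port; the host and any Docker port mapping depend on your setup), and build_request and tag_remote are names made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: where the Flask app is reachable from the calling machine.
SERVICE_URL = "http://localhost:5000/call_form"


def build_request(text: str, nerlib: str) -> Request:
    """Encode the two form fields the same way the HTML form would."""
    body = urlencode({"data": text, "nerlibs": nerlib}).encode("utf-8")
    return Request(
        SERVICE_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def tag_remote(text: str, nerlib: str) -> str:
    """Send the request and return the response body as text."""
    with urlopen(build_request(text, nerlib)) as response:
        return response.read().decode("utf-8")


# Building the request needs no running server; sending it does.
req = build_request("NHK said", "spacy")
print(req.get_method(), req.data)
```

Calling tag_remote("NHK said", "spacy") would then return the same string the browser shows after submitting the form, provided the service is running at the assumed address.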
Testing it out
Now let's run the code and evaluate the output. We will submit the following news excerpt, which I have taken from the BBC World News website, and go through the tagged output from each of the three toolkits.
Original text:
Almost 300 of the 3,700 people on the Diamond Princess have been tested so far. The number of infected could rise as testing continues. The checks began after an 80-year-old Hong Kong man who had been on the ship last month fell ill with the virus. All 10 cases are in those over the age of 50, Japanese broadcaster NHK said. Four are in their 50s, four are in their 60s, one is in their 70s, and another one is in their 80s. Two of them are said to be Japanese, and none are in "serious condition", NHK said. Japanese Health Minister Katsunobu Kato said the confirmed cases were among 31 results from 273 people tested so far. "We had them [the ones who tested positive] get off the vessel and... we are sending them to medical organisations," he said at a news conference on Wednesday.
Output using Stanford CoreNLP (here the tag 'O' marks a token that is not part of any named entity):
[('Almost', 'O'), ('300', 'NUMBER'), ('of', 'O'), ('the', 'O'), ('3,700', 'NUMBER'), ('people', 'O'), ('on', 'O'), ('the', 'O'), ('Diamond', 'ORGANIZATION'), ('Princess', 'ORGANIZATION'), ('have', 'O'), ('been', 'O'), ('tested', 'O'), ('so', 'O'), ('far', 'O'), ('.', 'O'), ('The', 'O'), ('number', 'O'), ('of', 'O'), ('infected', 'O'), ('could', 'O'), ('rise', 'O'), ('as', 'O'), ('testing', 'O'), ('continues', 'O'), ('.', 'O'), ('The', 'O'), ('checks', 'O'), ('began', 'O'), ('after', 'O'), ('an', 'O'), ('80-year-old', 'DURATION'), ('Hong', 'LOCATION'), ('Kong', 'LOCATION'), ('man', 'O'), ('who', 'O'), ('had', 'O'), ('been', 'O'), ('on', 'O'), ('the', 'O'), ('ship', 'O'), ('last', 'DATE'), ('month', 'DATE'), ('fell', 'O'), ('ill', 'O'), ('with', 'O'), ('the', 'O'), ('virus', 'O'), ('.', 'O'), ('All', 'O'), ('10', 'NUMBER'), ('cases', 'O'), ('are', 'O'), ('in', 'O'), ('those', 'O'), ('over', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('50', 'NUMBER'), (',', 'O'), ('Japanese', 'MISC'), ('broadcaster', 'O'), ('NHK', 'ORGANIZATION'), ('said', 'O'), ('.', 'O'), ('Four', 'NUMBER'), ('are', 'O'), ('in', 'O'), ('their', 'O'), ('50s', 'O'), (',', 'O'), ('four', 'NUMBER'), ('are', 'O'), ('in', 'O'), ('their', 'O'), ('60s', 'O'), (',', 'O'), ('one', 'NUMBER'), ('is', 'O'), ('in', 'O'), ('their', 'O'), ('70s', 'O'), (',', 'O'), ('and', 'O'), ('another', 'O'), ('one', 'NUMBER'), ('is', 'O'), ('in', 'O'), ('their', 'O'), ('80s', 'NUMBER'), ('.', 'O'), ('Two', 'NUMBER'), ('of', 'O'), ('them', 'O'), ('are', 'O'), ('said', 'O'), ('to', 'O'), ('be', 'O'), ('Japanese', 'MISC'), (',', 'O'), ('and', 'O'), ('none', 'O'), ('are', 'O'), ('in', 'O'), ('``', 'O'), ('serious', 'O'), ('condition', 'O'), ("''", 'O'), (',', 'O'), ('NHK', 'ORGANIZATION'), ('said', 'O'), ('.', 'O'), ('Japanese', 'MISC'), ('Health', 'O'), ('Minister', 'O'), ('Katsunobu', 'PERSON'), ('Kato', 'PERSON'), ('said', 'O'), ('the', 'O'), ('confirmed', 'O'), ('cases', 'O'), ('were', 'O'), ('among', 'O'), ('31', 
'NUMBER'), ('results', 'O'), ('from', 'O'), ('273', 'NUMBER'), ('people', 'O'), ('tested', 'O'), ('so', 'O'), ('far', 'O'), ('.', 'O'), ('``', 'O'), ('We', 'O'), ('had', 'O'), ('them', 'O'), ('-LSB-', 'O'), ('the', 'O'), ('ones', 'O'), ('who', 'O'), ('tested', 'O'), ('positive', 'O'), ('-RSB-', 'O'), ('get', 'O'), ('off', 'O'), ('the', 'O'), ('vessel', 'O'), ('and', 'O'), ('...', 'O'), ('we', 'O'), ('are', 'O'), ('sending', 'O'), ('them', 'O'), ('to', 'O'), ('medical', 'O'), ('organisations', 'O'), (',', 'O'), ("''", 'O'), ('he', 'O'), ('said', 'O'), ('at', 'O'), ('a', 'O'), ('news', 'O'), ('conference', 'O'), ('on', 'O'), ('Wednesday', 'DATE'), ('.', 'O')]
Output using nltk:
(S Almost/RB 300/CD of/IN the/DT 3,700/CD people/NNS on/IN the/DT (ORGANIZATION Diamond/NNP) Princess/NNP have/VBP been/VBN tested/VBN so/RB far/RB ./. The/DT number/NN of/IN infected/VBN could/MD rise/VB as/IN testing/VBG continues/VBZ ./. The/DT checks/NNS began/VBD after/IN an/DT 80-year-old/JJ (GPE Hong/NNP Kong/NNP) man/NN who/WP had/VBD been/VBN on/IN the/DT ship/NN last/JJ month/NN fell/VBD ill/RB with/IN the/DT virus/NN ./. All/DT 10/CD cases/NNS are/VBP in/IN those/DT over/IN the/DT age/NN of/IN 50/CD ,/, (GPE Japanese/JJ) broadcaster/NN (ORGANIZATION NHK/NNP) said/VBD ./. Four/CD are/VBP in/IN their/PRP$ 50s/CD ,/, four/CD are/VBP in/IN their/PRP$ 60s/CD ,/, one/CD is/VBZ in/IN their/PRP$ 70s/CD ,/, and/CC another/DT one/CD is/VBZ in/IN their/PRP$ 80s/CD ./. Two/CD of/IN them/PRP are/VBP said/VBD to/TO be/VB (GPE Japanese/JJ) ,/, and/CC none/NN are/VBP in/IN ``/`` serious/JJ condition/NN ''/'' ,/, (ORGANIZATION NHK/NNP) said/VBD ./. (PERSON Japanese/JJ Health/NNP) Minister/NNP (PERSON Katsunobu/NNP Kato/NNP) said/VBD the/DT confirmed/VBN cases/NNS were/VBD among/IN 31/CD results/NNS from/IN 273/CD people/NNS tested/VBN so/RB far/RB ./. ``/`` We/PRP had/VBD them/PRP [/RP the/DT ones/NNS who/WP tested/VBD positive/JJ ]/NNP get/VB off/RP the/DT vessel/NN and/CC .../: we/PRP are/VBP sending/VBG them/PRP to/TO medical/JJ organisations/NNS ,/, ''/'' he/PRP said/VBD at/IN a/DT news/NN conference/NN on/IN Wednesday/NNP ./.)
Output using spaCy:
[('Almost 300', 'CARDINAL'), ('3,700', 'CARDINAL'), ('Hong Kong', 'GPE'), ('last month', 'DATE'), ('10', 'CARDINAL'), ('the age of 50', 'DATE'), ('Japanese', 'NORP'), ('NHK', 'ORG'), ('Four', 'CARDINAL'), ('four', 'CARDINAL'), ('60s', 'DATE'), ('one', 'CARDINAL'), ('70s', 'DATE'), ('80s', 'DATE'), ('Two', 'CARDINAL'), ('Japanese', 'NORP'), ('NHK', 'ORG'), ('Japanese', 'NORP'), ('Katsunobu Kato', 'PERSON'), ('31', 'CARDINAL'), ('273', 'CARDINAL'), ('Wednesday', 'DATE')]
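This gives you an idea of the output you can expect from these three toolkits using their pre-trained models. All three identified Hong Kong, NHK and Katsunobu Kato, but they differ elsewhere: CoreNLP tagged 'Diamond Princess' as an organization, nltk tagged only 'Diamond' and labelled 'Japanese Health' as a person, and spaCy did not tag the ship's name at all.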
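The three outputs come in different shapes (one tag per token for CoreNLP, a bracketed tree for nltk, entity spans for spaCy), which makes them awkward to compare side by side. A rough sketch of how the first two could be reduced to spaCy-style (text, label) pairs is shown below. The function names are invented for this example, it works on the printed output rather than on the library objects, and merging adjacent tokens with the same tag is a simplification that would wrongly join two distinct entities of the same type if they appear next to each other.

```python
import re
from itertools import groupby
from typing import List, Tuple

Entity = Tuple[str, str]


def merge_token_tags(tagged: List[Entity]) -> List[Entity]:
    """Join consecutive tokens that share a non-'O' tag into one entity."""
    entities = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


def entities_from_tree_string(tree: str) -> List[Entity]:
    """Pull (text, label) pairs out of the bracketed nltk chunk output."""
    entities = []
    # The pattern matches innermost parentheses only, i.e. the entity chunks.
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", tree):
        if label == "S":  # a sentence with no chunks would match as a whole
            continue
        # Each item looks like word/POS; keep only the word.
        words = [item.rsplit("/", 1)[0] for item in body.split()]
        entities.append((" ".join(words), label))
    return entities


# Shortened samples in the same shape as the outputs above.
corenlp_sample = [("the", "O"), ("Diamond", "ORGANIZATION"),
                  ("Princess", "ORGANIZATION"), ("have", "O"),
                  ("Hong", "LOCATION"), ("Kong", "LOCATION")]
nltk_sample = ("(S the/DT (ORGANIZATION Diamond/NNP) Princess/NNP "
               "(GPE Hong/NNP Kong/NNP) man/NN)")

print(merge_token_tags(corenlp_sample))
print(entities_from_tree_string(nltk_sample))
```

On these samples the first call yields 'Diamond Princess' and 'Hong Kong' as single entities, while the second yields 'Diamond' and 'Hong Kong', mirroring the difference visible in the full outputs.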
Caveat: the opinions expressed here are the author's own.