Sampling Entity tagging with nltk, spaCy and CoreNLP using Flask
Output generated using CoreNLP image motiz88/coreNLP

Introduction

We were trying to build a microservice to identify and tag PI (personal information) and SPI (sensitive personal information) elements in unstructured data. We came across several Natural Language Processing toolkits capable of Named Entity Recognition (NER) that are either built in Python or can be integrated with Python. We were looking for options that could be deployed in a private cloud; we have not evaluated Watson Anywhere for this case, as it still requires ICP for Data. So we wanted a sampling service, developed using Flask, to which we could submit unstructured text along with the toolkit to use for named entity tagging, and which would return an annotated list of the entities it could identify in that text. The goal was to check how well the pre-trained models of these NLP toolkits identify entities in a specific domain.

Steps

First, we create a Docker container with Flask, in which we also install spaCy and nltk (for CoreNLP we provision a separate container).

FROM python:3

WORKDIR /usr/app

COPY requirements.txt ./
RUN pip install -r requirements.txt

RUN python -m spacy download en_core_web_sm
RUN python -m nltk.downloader punkt averaged_perceptron_tagger maxent_ne_chunker words

COPY app.py ./
COPY index.html ./templates/

ENTRYPOINT [ "python" ]
CMD [ "app.py" ]

Our container image is based on a Python image, on top of which we install additional Python modules using the pip install command. The requirements.txt file lists the modules to be installed, which in our case are:

flask
nltk
numpy
spacy

Once the core modules are installed, we use the Python interpreter to download the pre-trained model for spaCy and the pre-trained models and corpora for nltk. We then copy the bootstrap program and the web page for our Flask application to the required folders. When the container starts, it runs the bootstrap program.

The code for our Python bootstrap program is as follows:

# flask_web/app.py

import nltk
from flask import Flask, render_template, request
from nltk.parse import CoreNLPParser

import en_core_web_sm

# Create the Flask application; the route decorator below depends on it
app = Flask(__name__)

# Load the pre-trained spaCy model once at startup
nlp = en_core_web_sm.load()

def stanford_tag(text):
    # The CoreNLP server runs in a separate container. It serves plain HTTP
    # by default; change the scheme if your server is set up for SSL.
    parser = CoreNLPParser(url='http://172.17.0.2:9000', tagtype='ner')
    return list(parser.tag(text.split()))

def nltk_tag(sent):
    # Tokenize, POS-tag, then chunk named entities into an nltk Tree
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    sent = nltk.ne_chunk(sent)
    return sent

def spacechunk(sent):
    # Return the (text, label) pairs of the entities spaCy finds, as a string
    doc = nlp(sent)
    chunk = str([(X.text, X.label_) for X in doc.ents])
    return chunk

@app.route('/call_form', methods=['GET', 'POST'])
def call_form():
    if request.method == 'POST':
        data_text = request.form['data']
        nerlib = request.form['nerlibs']
        if nerlib == 'stanford':
            return str(stanford_tag(data_text))
        if nerlib == 'nltk':
            return str(nltk_tag(data_text))
        if nerlib == 'spacy':
            return spacechunk(data_text)
    # GET request (or an unrecognized toolkit): serve the form
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')


In app.py, we first import all the required libraries and then define three different functions, one each for Stanford CoreNLP (the CoreNLP server is running in a different docker container on port 9000), nltk and spaCy based NER tagging.
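To make the CoreNLP call less opaque, here is a minimal sketch of what a request to the CoreNLP server looks like without the nltk wrapper, using only the standard library. The server address, the annotator list and the helper names (build_corenlp_request, tokens_with_ner, corenlp_tag) are assumptions made for illustration; it also assumes the server is reached over plain HTTP and returns its usual JSON layout of sentences containing tokens.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the CoreNLP container; adjust to your own setup
CORENLP_URL = "http://172.17.0.2:9000"

def build_corenlp_request(text, base_url=CORENLP_URL):
    """Build a POST request asking the server for NER annotations as JSON."""
    properties = {
        "annotators": "tokenize,ssplit,pos,lemma,ner",
        "outputFormat": "json",
    }
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    # The raw text travels in the request body
    return urllib.request.Request(
        f"{base_url}/?{query}", data=text.encode("utf-8"), method="POST"
    )

def tokens_with_ner(response):
    """Flatten a parsed CoreNLP JSON response into (word, tag) pairs."""
    return [
        (token["word"], token["ner"])
        for sentence in response["sentences"]
        for token in sentence["tokens"]
    ]

def corenlp_tag(text):
    """Send the text to the server and return (word, tag) pairs.

    Needs a running CoreNLP server at CORENLP_URL.
    """
    with urllib.request.urlopen(build_corenlp_request(text), timeout=30) as reply:
        return tokens_with_ner(json.loads(reply.read().decode("utf-8")))
```

With a server running, corenlp_tag("NHK said.") should return a list of (word, tag) pairs similar in shape to what stanford_tag produces.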

The fourth function is mapped to the '/call_form' route using a decorator. When a GET request is made to this URL, the function serves the HTML page named 'index.html'.

When the form in the 'index.html' page is submitted, the same function is called with a POST request. The function retrieves the raw text to be tagged and the toolkit to be used from the form elements. Depending on the selected toolkit, it calls one of the three functions defined earlier to get the annotated text.
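The toolkit selection described above can also be written as a lookup table instead of a chain of if statements. The sketch below is one possible variant, not the code used in the service: the three tagger functions are stand-ins that only echo their input, and TAGGERS and tag_text are hypothetical names.

```python
# Stand-ins for the real functions in app.py; they only echo their input
def stanford_tag(text):
    return f"stanford:{text}"

def nltk_tag(text):
    return f"nltk:{text}"

def spacechunk(text):
    return f"spacy:{text}"

# Map the values of the 'nerlibs' form field to the matching tagger
TAGGERS = {
    "stanford": stanford_tag,
    "nltk": nltk_tag,
    "spacy": spacechunk,
}

def tag_text(nerlib, text):
    """Run the tagger selected by name and return its result as a string."""
    try:
        tagger = TAGGERS[nerlib]
    except KeyError:
        raise ValueError(f"unknown toolkit: {nerlib!r}") from None
    return str(tagger(text))
```

Adding a fourth toolkit then only needs one new entry in TAGGERS, and an unknown value fails loudly instead of silently falling through to the form page.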

Now let's see what the code for the index.html page looks like:

<!DOCTYPE html>
<html>
  <head>
    <title>NER Tagging</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link  rel="stylesheet" media="screen">
    <style>
      .container {
        max-width: 1000px;
      }
    </style>
  </head>
  <body>
    <div class="container">
      <h1>NER Tagging</h1>
      <form role="form" method='POST' action='/call_form'>
        <div class="form-group">
            <textarea name="data" class="form-control" id="data-box" placeholder="Enter text..." style="max-width: 300px;" autofocus required>Type in your text here</textarea>
        </div>
        <div class="form-group">
          <select name="nerlibs" class="form-control" style="width:25%">
            <option value="stanford">stanford</option>
            <option value="nltk">nltk</option>
            <option value="spacy">spacy</option>
        </select>
        </div>
        
        <button type="submit" class="btn btn-default">Submit</button>
      </form>
      <br>
      {% for error in errors %}
        <h4>{{ error }}</h4>
      {% endfor %}
    </div>
    <script src="https://code.jquery.com/jquery-2.2.1.min.js"></script>
    <script src="https://netdna.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
  </body>
</html>

You can see that the form action binds to the route '/call_form'. The textarea element is used to collect the raw text which needs to undergo NER based tagging. A dropdown control is used to select the NER toolkit to be used.
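The form submission can also be reproduced from a script, which is handy when sampling many texts. This is a minimal sketch using only the standard library. It assumes the Flask container is reachable at localhost on port 5000 (Flask's default port); SERVICE_URL, build_tagging_request and tag_remote are hypothetical names.

```python
import urllib.parse
import urllib.request

# Assumption: the Flask container is published on Flask's default port 5000
SERVICE_URL = "http://localhost:5000/call_form"

def build_tagging_request(text, nerlib, url=SERVICE_URL):
    """Build the same POST request the HTML form sends.

    'data' and 'nerlibs' are the names of the textarea and the dropdown.
    """
    body = urllib.parse.urlencode({"data": text, "nerlibs": nerlib})
    return urllib.request.Request(url, data=body.encode("ascii"), method="POST")

def tag_remote(text, nerlib):
    """Submit the text and return the service's response body as a string.

    Needs the Flask service to be running at SERVICE_URL.
    """
    request = build_tagging_request(text, nerlib)
    with urllib.request.urlopen(request, timeout=60) as reply:
        return reply.read().decode("utf-8")
```

For example, tag_remote("NHK said.", "spacy") would return the same text the browser shows after submitting the form with spacy selected.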

Testing it out

Now let's run the code and evaluate the output. We provide the following news excerpt, taken from the BBC World News website, and look at the tagged output from each of the three toolkits.

Original text:

Almost 300 of the 3,700 people on the Diamond Princess have been tested so far. The number of infected could rise as testing continues.

The checks began after an 80-year-old Hong Kong man who had been on the ship last month fell ill with the virus.

All 10 cases are in those over the age of 50, Japanese broadcaster NHK said.

Four are in their 50s, four are in their 60s, one is in their 70s, and another one is in their 80s. Two of them are said to be Japanese, and none are in "serious condition", NHK said.

Japanese Health Minister Katsunobu Kato said the confirmed cases were among 31 results from 273 people tested so far.

"We had them [the ones who tested positive] get off the vessel and... we are sending them to medical organisations," he said at a news conference on Wednesday.

Output using Stanford CoreNLP (here the tag 'O' marks a token that is not part of any named entity):

[('Almost', 'O'), ('300', 'NUMBER'), ('of', 'O'), ('the', 'O'), ('3,700', 'NUMBER'), ('people', 'O'), ('on', 'O'), ('the', 'O'), ('Diamond', 'ORGANIZATION'), ('Princess', 'ORGANIZATION'), ('have', 'O'), ('been', 'O'), ('tested', 'O'), ('so', 'O'), ('far', 'O'), ('.', 'O'), ('The', 'O'), ('number', 'O'), ('of', 'O'), ('infected', 'O'), ('could', 'O'), ('rise', 'O'), ('as', 'O'), ('testing', 'O'), ('continues', 'O'), ('.', 'O'), ('The', 'O'), ('checks', 'O'), ('began', 'O'), ('after', 'O'), ('an', 'O'), ('80-year-old', 'DURATION'), ('Hong', 'LOCATION'), ('Kong', 'LOCATION'), ('man', 'O'), ('who', 'O'), ('had', 'O'), ('been', 'O'), ('on', 'O'), ('the', 'O'), ('ship', 'O'), ('last', 'DATE'), ('month', 'DATE'), ('fell', 'O'), ('ill', 'O'), ('with', 'O'), ('the', 'O'), ('virus', 'O'), ('.', 'O'), ('All', 'O'), ('10', 'NUMBER'), ('cases', 'O'), ('are', 'O'), ('in', 'O'), ('those', 'O'), ('over', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('50', 'NUMBER'), (',', 'O'), ('Japanese', 'MISC'), ('broadcaster', 'O'), ('NHK', 'ORGANIZATION'), ('said', 'O'), ('.', 'O'), ('Four', 'NUMBER'), ('are', 'O'), ('in', 'O'), ('their', 'O'), ('50s', 'O'), (',', 'O'), ('four', 'NUMBER'), ('are', 'O'), ('in', 'O'), ('their', 'O'), ('60s', 'O'), (',', 'O'), ('one', 'NUMBER'), ('is', 'O'), ('in', 'O'), ('their', 'O'), ('70s', 'O'), (',', 'O'), ('and', 'O'), ('another', 'O'), ('one', 'NUMBER'), ('is', 'O'), ('in', 'O'), ('their', 'O'), ('80s', 'NUMBER'), ('.', 'O'), ('Two', 'NUMBER'), ('of', 'O'), ('them', 'O'), ('are', 'O'), ('said', 'O'), ('to', 'O'), ('be', 'O'), ('Japanese', 'MISC'), (',', 'O'), ('and', 'O'), ('none', 'O'), ('are', 'O'), ('in', 'O'), ('``', 'O'), ('serious', 'O'), ('condition', 'O'), ("''", 'O'), (',', 'O'), ('NHK', 'ORGANIZATION'), ('said', 'O'), ('.', 'O'), ('Japanese', 'MISC'), ('Health', 'O'), ('Minister', 'O'), ('Katsunobu', 'PERSON'), ('Kato', 'PERSON'), ('said', 'O'), ('the', 'O'), ('confirmed', 'O'), ('cases', 'O'), ('were', 'O'), ('among', 'O'), ('31', 
'NUMBER'), ('results', 'O'), ('from', 'O'), ('273', 'NUMBER'), ('people', 'O'), ('tested', 'O'), ('so', 'O'), ('far', 'O'), ('.', 'O'), ('``', 'O'), ('We', 'O'), ('had', 'O'), ('them', 'O'), ('-LSB-', 'O'), ('the', 'O'), ('ones', 'O'), ('who', 'O'), ('tested', 'O'), ('positive', 'O'), ('-RSB-', 'O'), ('get', 'O'), ('off', 'O'), ('the', 'O'), ('vessel', 'O'), ('and', 'O'), ('...', 'O'), ('we', 'O'), ('are', 'O'), ('sending', 'O'), ('them', 'O'), ('to', 'O'), ('medical', 'O'), ('organisations', 'O'), (',', 'O'), ("''", 'O'), ('he', 'O'), ('said', 'O'), ('at', 'O'), ('a', 'O'), ('news', 'O'), ('conference', 'O'), ('on', 'O'), ('Wednesday', 'DATE'), ('.', 'O')]
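The token-by-token output above is hard to scan. One way to condense it, sketched here under the assumption that consecutive tokens with the same tag belong to the same entity, is to merge such runs and drop the 'O' tokens. The helper name group_entities is hypothetical.

```python
from itertools import groupby

def group_entities(tagged_tokens, outside="O"):
    """Merge runs of consecutive tokens that share a tag into one entity.

    Tokens tagged with the 'outside' label are dropped. Two distinct
    entities of the same type that sit next to each other would be
    merged, so treat the result as an approximation.
    """
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities
```

Applied to the excerpt [('an', 'O'), ('80-year-old', 'DURATION'), ('Hong', 'LOCATION'), ('Kong', 'LOCATION'), ('man', 'O')] from the output above, this yields [('80-year-old', 'DURATION'), ('Hong Kong', 'LOCATION')].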

Output using nltk:

(S Almost/RB 300/CD of/IN the/DT 3,700/CD people/NNS on/IN the/DT (ORGANIZATION Diamond/NNP) Princess/NNP have/VBP been/VBN tested/VBN so/RB far/RB ./. The/DT number/NN of/IN infected/VBN could/MD rise/VB as/IN testing/VBG continues/VBZ ./. The/DT checks/NNS began/VBD after/IN an/DT 80-year-old/JJ (GPE Hong/NNP Kong/NNP) man/NN who/WP had/VBD been/VBN on/IN the/DT ship/NN last/JJ month/NN fell/VBD ill/RB with/IN the/DT virus/NN ./. All/DT 10/CD cases/NNS are/VBP in/IN those/DT over/IN the/DT age/NN of/IN 50/CD ,/, (GPE Japanese/JJ) broadcaster/NN (ORGANIZATION NHK/NNP) said/VBD ./. Four/CD are/VBP in/IN their/PRP$ 50s/CD ,/, four/CD are/VBP in/IN their/PRP$ 60s/CD ,/, one/CD is/VBZ in/IN their/PRP$ 70s/CD ,/, and/CC another/DT one/CD is/VBZ in/IN their/PRP$ 80s/CD ./. Two/CD of/IN them/PRP are/VBP said/VBD to/TO be/VB (GPE Japanese/JJ) ,/, and/CC none/NN are/VBP in/IN ``/`` serious/JJ condition/NN ''/'' ,/, (ORGANIZATION NHK/NNP) said/VBD ./. (PERSON Japanese/JJ Health/NNP) Minister/NNP (PERSON Katsunobu/NNP Kato/NNP) said/VBD the/DT confirmed/VBN cases/NNS were/VBD among/IN 31/CD results/NNS from/IN 273/CD people/NNS tested/VBN so/RB far/RB ./. ``/`` We/PRP had/VBD them/PRP [/RP the/DT ones/NNS who/WP tested/VBD positive/JJ ]/NNP get/VB off/RP the/DT vessel/NN and/CC .../: we/PRP are/VBP sending/VBG them/PRP to/TO medical/JJ organisations/NNS ,/, ''/'' he/PRP said/VBD at/IN a/DT news/NN conference/NN on/IN Wednesday/NNP ./.)
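Because the service returns the nltk tree as a string, a client that only wants the entities has to pick the labelled chunks out of that text. The sketch below does this with a regular expression. It assumes the bracketed word/TAG format shown above and text that contains no literal parentheses; entities_from_tree_string is a hypothetical helper name.

```python
import re

# A chunk looks like "(LABEL word/TAG word/TAG)" and has no nested brackets,
# which is why the outer "(S ...)" wrapper never matches
CHUNK = re.compile(r"\((\w+) ([^()]+)\)")

def entities_from_tree_string(tree_text):
    """Extract (entity text, label) pairs from the printed form of the tree."""
    entities = []
    for label, body in CHUNK.findall(tree_text):
        # Drop the part-of-speech tag after the last slash of each token
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((" ".join(words), label))
    return entities
```

On the chunk "(GPE Hong/NNP Kong/NNP)" this gives ('Hong Kong', 'GPE'). Inside app.py itself, walking the nltk tree object before converting it to a string would be more robust than parsing its printed form.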

Output using spaCy:

[('Almost 300', 'CARDINAL'), ('3,700', 'CARDINAL'), ('Hong Kong', 'GPE'), ('last month', 'DATE'), ('10', 'CARDINAL'), ('the age of 50', 'DATE'), ('Japanese', 'NORP'), ('NHK', 'ORG'), ('Four', 'CARDINAL'), ('four', 'CARDINAL'), ('60s', 'DATE'), ('one', 'CARDINAL'), ('70s', 'DATE'), ('80s', 'DATE'), ('Two', 'CARDINAL'), ('Japanese', 'NORP'), ('NHK', 'ORG'), ('Japanese', 'NORP'), ('Katsunobu Kato', 'PERSON'), ('31', 'CARDINAL'), ('273', 'CARDINAL'), ('Wednesday', 'DATE')]

This gives you an idea of the output you can expect from these three toolkits when using their pre-trained models.
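To compare the three results at a glance, the named entities can be put side by side. The sketch below copies the person, place, organisation and nationality entities from the outputs above by hand (numbers and dates are left out, and adjacent CoreNLP tokens with the same tag are joined, which is an assumption about how they should be grouped). The helper names are hypothetical.

```python
# Entities copied by hand from the three outputs above
OUTPUTS = {
    "corenlp": {
        ("Diamond Princess", "ORGANIZATION"), ("Hong Kong", "LOCATION"),
        ("Japanese", "MISC"), ("NHK", "ORGANIZATION"),
        ("Katsunobu Kato", "PERSON"),
    },
    "nltk": {
        ("Diamond", "ORGANIZATION"), ("Hong Kong", "GPE"),
        ("Japanese", "GPE"), ("NHK", "ORGANIZATION"),
        ("Japanese Health", "PERSON"), ("Katsunobu Kato", "PERSON"),
    },
    "spacy": {
        ("Hong Kong", "GPE"), ("Japanese", "NORP"),
        ("NHK", "ORG"), ("Katsunobu Kato", "PERSON"),
    },
}

def entity_texts(entities):
    """Keep only the entity text, since label names differ per toolkit."""
    return {text for text, _ in entities}

def found_by_all(outputs):
    """Entity texts that every toolkit reported."""
    return set.intersection(*(entity_texts(e) for e in outputs.values()))

def found_by_some(outputs):
    """Entity texts that at least one toolkit missed or split differently."""
    everything = set.union(*(entity_texts(e) for e in outputs.values()))
    return everything - found_by_all(outputs)
```

Here found_by_all(OUTPUTS) contains 'Hong Kong', 'Japanese', 'NHK' and 'Katsunobu Kato', while found_by_some(OUTPUTS) contains 'Diamond Princess', 'Diamond' and 'Japanese Health'. The ship's name is where the toolkits disagree most: CoreNLP tags both words as an organisation, nltk tags only 'Diamond', and spaCy does not report it at all.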

Caveat: opinions expressed here are author's own.
