AI for IT: Named Entity Recognition from Unstructured Texts in IT Services
Arun Ayachitula & Rohit Khandekar
Identifying and classifying named entities in unstructured text, known as Named Entity Recognition (NER), is a central problem in natural language processing with applications such as text classification and intent analysis.
Here we discuss the NER problem in the context of analyzing unstructured texts arising in the IT services domain. Consider an IT service provider managing an IT infrastructure comprising several software and hardware components. It encounters large volumes of unstructured text from incident tickets raised by users or by monitoring systems. As an example, consider the following incident report/ticket:
“Logging in Rational Business Developer, version 12.3.23 release 34, is set to ‘verbose’ level – these logs are filling up the temp space on H01NAXPROD.”
There are several “named” entities mentioned in this ticket:
Business Application: Rational Business Developer
Application version: version 12.3.23 release 34
Host: H01NAXPROD
Identifying such named entities helps in analyzing the intent of the incident ticket, routing it to the right mitigation team for resolution, and potentially avoiding similar incidents in the future.
The named entity types we focus on in our study include hostnames, IP addresses, business applications (and their versions), middleware applications (and their versions), and operating systems (and their versions).
Method
Our approach for addressing this problem has two steps:
1. Use the IT infrastructure topology and the NIST software product catalog to identify named entities that could potentially appear in the incident ticket data, and create a labeled dataset from the identified named entities.
2. Train a machine learning model on the labeled dataset to recognize named entities in unseen examples. To this end, we fine-tune Google’s language model BERT (Bidirectional Encoder Representations from Transformers).
Dictionary-based labeled data generation
To train any machine learning algorithm, we typically need a large labeled dataset containing many examples of verified named entities and of how they appear in running text relevant to the use case.
To this end, we use the IT infrastructure topology data for each stakeholder or client. For each client, we try to obtain a comprehensive list of the hosts and IP addresses in its IT infrastructure. Similarly, we obtain a fairly comprehensive list of software products, and versions thereof, from NIST’s National Vulnerability Database (NVD).
For each incident ticket, we analyze the unstructured text present in the abstract, description, and resolution fields. We split each sentence into tokens and find mentions of known entities such as hosts, IP addresses, applications, and OS names. This yields a labeled dataset for training a machine learning model.
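Conceptually, this dictionary matching might look like the minimal Python sketch below. The dictionaries, regular expression, and function name here are illustrative assumptions, not the production pipeline, which handles tokenization and entity spans with more care.

```python
import re

# Hypothetical dictionaries; in practice these come from each client's
# topology data and from NIST's NVD software catalog.
HOSTS = {"H01NAXPROD", "H02NAXDEV"}
APPLICATIONS = {"Rational Business Developer", "WebSphere Application Server"}
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_entity_mentions(text):
    """Return (mention, entity_type) pairs found in one ticket sentence."""
    mentions = []
    # Multi-word application names are matched as whole substrings.
    for app in APPLICATIONS:
        if app in text:
            mentions.append((app, "APPLICATION"))
    # Hostnames are matched token by token, after stripping punctuation.
    for token in re.split(r"[\s,;]+", text):
        token = token.strip(".'\"")
        if token in HOSTS:
            mentions.append((token, "HOST"))
    for ip in IP_PATTERN.findall(text):
        mentions.append((ip, "IP_ADDRESS"))
    return mentions

text = ("Logging in Rational Business Developer, version 12.3.23 release 34, "
        "is set to 'verbose' level - these logs are filling up the temp "
        "space on H01NAXPROD.")
print(find_entity_mentions(text))
# [('Rational Business Developer', 'APPLICATION'), ('H01NAXPROD', 'HOST')]
```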
Training a machine learning model
While the static dictionary-based approach may give fair coverage in identifying named entities, it can never be comprehensive, primarily because both the IT infrastructure and the set of software products evolve constantly. New hosts are added to the IT infrastructure and new software products are introduced on a daily basis, and keeping the dictionaries up to date is almost impractical.
We therefore take the approach of training a model that “learns” the patterns in how named entities appear in text and predicts potential named entities not seen before.
Given many examples like the one above, the model is expected to learn the linguistic patterns in which different types of named entities appear and to generalize them to detect unseen named entities of similar types.
Bidirectional Encoder Representations from Transformers (BERT) for Sequence Labeling Task
We model the NER problem as a sequence labeling problem. In machine learning, sequence labeling is a type of pattern recognition task that involves algorithmically assigning a categorical label to each member of a sequence of observed values. We fine-tune Google’s BERT language model for the sequence labeling task. BERT is a multi-layer bidirectional Transformer encoder. Transformers are multi-head attention mechanisms (encoders and decoders) used in NLP tasks involving sequence dependencies, and they have proven more effective than Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), as well as other neural-network-based approaches such as Recurrent Neural Networks (RNNs) and LSTMs.
To train the model, we first convert the dataset into the so-called IOB format, illustrated below. Each token in each sentence is labeled with one of ‘O’ (outside any entity), ‘B-XXXX’ (beginning token of an entity of type ‘XXXX’), or ‘I-XXXX’ (inside token of an entity of type ‘XXXX’).
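For instance, a hypothetical IOB labeling of a fragment of the sample ticket above might look like this (the entity type names are illustrative):

```
Logging       O
in            O
Rational      B-APPLICATION
Business      I-APPLICATION
Developer     I-APPLICATION
version       B-VERSION
12.3.23       I-VERSION
release       I-VERSION
34            I-VERSION
...
on            O
H01NAXPROD    B-HOST
```

With data in this form, the fine-tuning step can be sketched roughly as follows. This is a minimal sketch assuming the Hugging Face transformers library and PyTorch; the label set, checkpoint name, and single-sentence “training loop” are illustrative assumptions, not the setup used in our study.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["O",
          "B-HOST", "I-HOST", "B-IP", "I-IP",
          "B-APPLICATION", "I-APPLICATION", "B-VERSION", "I-VERSION",
          "B-MIDDLEWARE", "I-MIDDLEWARE", "B-OS", "I-OS"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS))

# One labeled sentence from the dictionary-based step, for illustration.
words = ["logs", "are", "filling", "up", "temp", "space", "on", "H01NAXPROD"]
word_labels = ["O", "O", "O", "O", "O", "O", "O", "B-HOST"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# BERT's WordPiece tokenizer may split one word into several sub-tokens.
# Here every sub-token simply inherits its word's label; special tokens
# get -100 so the loss ignores them. (A common alternative labels only
# the first sub-token and masks the rest.)
label_ids = [LABELS.index(word_labels[w]) if w is not None else -100
             for w in enc.word_ids(batch_index=0)]
labels = torch.tensor([label_ids])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**enc, labels=labels).loss  # one gradient step
loss.backward()
optimizer.step()
```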
Out of 1 million sentences in the corpus of incident tickets, 241K sentences were labeled with at least one named entity. We trained the model on GPUs for 20 epochs, which took about 5 days. The model achieved an accuracy of over 99% on the sequence labeling task. More importantly, it was able to identify several unseen named entities. The Venn diagram below summarizes how the sets of named entities labeled or identified by the different methods intersect with one another.
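As a hypothetical illustration of this generalization, the fine-tuned model could be applied to a ticket mentioning a host that appears in neither the training data nor the dictionaries. The model path, pipeline usage, and example hostname below are assumptions for illustration, again using the Hugging Face transformers library.

```python
from transformers import pipeline

# "./ner-model" is an assumed path to the fine-tuned checkpoint above.
ner = pipeline("token-classification", model="./ner-model",
               aggregation_strategy="simple")

# Z99QRSPROD never appeared in training; a model that has learned the
# naming pattern and context of hosts can still tag it as a HOST.
print(ner("CPU alert raised on Z99QRSPROD after the nightly batch job."))
```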
Summary
We addressed the Named Entity Recognition (NER) problem arising in IT service incident tickets. Starting with known examples of hosts and IP addresses from the IT infrastructure topology catalog and examples of software applications from the NIST database, we created a labeled dataset of named entities. This dataset captures a rich variety of ways in which these named entities are mentioned in unstructured IT texts. We then modeled the NER problem as a sequence labeling problem and trained a deep-learning-based model by fine-tuning the BERT language model. The model was able to detect several new entities based on the linguistic features associated with the underlying entity types.
Acknowledgements: This article is based on joint work with Salim Roukos, Radu Florian, Parul Awasthy, and Navaneeth Vadapalli.