Named Entity Recognition using CRFs
A Conditional Random Field (CRF) is a probabilistic graphical model with a wide range of applications, such as named entity recognition (NER) and part-of-speech (POS) tagging. CRFs are used when information about neighboring labels is essential for calculating the label of an individual item in a sequence. This makes them a good fit for NER applications.
There are two main types of probabilistic graphical models: Bayesian Networks and Markov Random Fields. Bayesian Networks are directed acyclic graphs, whereas Markov Random Fields are undirected graphs and may be cyclic. Conditional Random Fields fall into the latter category.
A linear-chain CRF corresponds to a labeler in which the tag assigned to a word depends on the input sentence and, among the other tags, only on the tag of the immediately preceding word. Such a CRF can be used for NER, which is the task of extracting named entities from text. A named entity might be the name of a person, city, country, company, etc. The entity types are encoded as tags, usually PER (for a person), ORG (for an organization), LOC (for a location), etc., with O marking words that are not part of any entity.
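To make the tagging idea concrete, here is a minimal sketch that turns a tagged sentence into a list of named entities. It assumes a simple convention in which neighboring words with the same non-O tag belong to one entity; the function name extract_entities is made up for this illustration, and real datasets often use a richer scheme.

```python
from itertools import groupby

tokens = "The World Cup is now held in Qatar".split()
tags = ["O", "ORG", "ORG", "O", "O", "O", "O", "LOC"]

def extract_entities(tokens, tags):
    """Group consecutive tokens that share the same non-O tag into one entity.

    Assumption for this sketch: adjacent words with an identical tag form a
    single entity (no B-/I- prefixes are used).
    """
    entities = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag != "O":
            text = " ".join(token for token, _ in group)
            entities.append((text, tag))
    return entities

entities = extract_entities(tokens, tags)
print(entities)  # [('World Cup', 'ORG'), ('Qatar', 'LOC')]
```

Each word gets exactly one tag, so the tag list always has the same length as the token list.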
First, we decide on feature functions that generate features for each word of the sentence and help in recognizing a named entity. Each feature function returns either 1 (true) or 0 (false), indicating whether one specific condition holds for that word and its tags.
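A few feature functions of this kind could look like the sketch below. The three functions, their names and the argument order (sentence, position, previous tag, current tag) are illustrative assumptions, not features taken from a particular library or from a trained model.

```python
sentence = "The World Cup is now held in Qatar".split()

def f_capitalized_is_entity(sentence, i, prev_tag, tag):
    """1 if word i is capitalized, is not the first word, and is tagged as an entity."""
    return int(i > 0 and sentence[i][0].isupper() and tag != "O")

def f_after_in_is_loc(sentence, i, prev_tag, tag):
    """1 if the previous word is 'in' and the current word is tagged LOC."""
    return int(i > 0 and sentence[i - 1].lower() == "in" and tag == "LOC")

def f_org_follows_org(sentence, i, prev_tag, tag):
    """1 if both the previous and the current word are tagged ORG."""
    return int(prev_tag == "ORG" and tag == "ORG")

# Evaluate the features at a few positions of the example sentence.
print(f_after_in_is_loc(sentence, 7, "O", "LOC"))      # 1: 'Qatar' follows 'in'
print(f_org_follows_org(sentence, 2, "ORG", "ORG"))    # 1: 'World' and 'Cup' are both ORG
print(f_capitalized_is_entity(sentence, 0, None, "O")) # 0: the first word is skipped
```

Because every function looks at the current tag and at most the previous tag, features like these fit the linear-chain setting described above.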
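The CRF assigns a tag sequence y to a sentence x with probability P(y | x) = exp(\sum_j w_j \sum_{i=1}^{n} F_j(x, y)) / \sum_{y'} exp(\sum_j w_j \sum_{i=1}^{n} F_j(x, y')), where w_j is the weight of feature function F_j, i runs over the n words of the sentence, and y' ranges over every possible tag sequence. To see how this formula figures out the named entities of a sentence like "The World Cup is now held in Qatar", take the calculation of P([O ORG ORG O O O O LOC] | 'The World Cup is now held in Qatar') as an example and make the following substitutions.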
The numerator, with i running over the n = 8 words of the sentence, can be rewritten as
exp(\sum_j w_j \sum_{i=1}^{n} F_j('The World Cup is now held in Qatar', 'O ORG ORG O O O O LOC')).
The denominator can be rewritten as
exp(\sum_j w_j \sum_{i=1}^{n} F_j('The World Cup is now held in Qatar', 'O O O O O O O O')) + exp(\sum_j w_j \sum_{i=1}^{n} F_j('The World Cup is now held in Qatar', 'LOC ORG O PER ORG O PER LOC')) + exp(\sum_j w_j \sum_{i=1}^{n} F_j('The World Cup is now held in Qatar', 'ORG O PER ORG PER ORG ORG ORG')) + ... (and so on, cycling through all the possible tag combinations).
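The numerator and denominator above can be computed by brute force for this short sentence. The sketch below does so under stated assumptions: the five feature functions and their weights are hand-picked for illustration (a trained CRF would learn the weights from data), and the tag set is limited to O, PER, ORG and LOC.

```python
from itertools import product
from math import exp

TAGS = ["O", "PER", "ORG", "LOC"]
sentence = "The World Cup is now held in Qatar".split()

# Hypothetical feature functions F_j(sentence, position, previous tag, tag) -> 0 or 1.
def first_word_o(s, i, prev, tag):
    return int(i == 0 and tag == "O")

def lowercase_o(s, i, prev, tag):
    return int(s[i].islower() and tag == "O")

def capitalized_org(s, i, prev, tag):
    return int(i > 0 and s[i][0].isupper() and tag == "ORG")

def after_in_loc(s, i, prev, tag):
    return int(i > 0 and s[i - 1] == "in" and s[i][0].isupper() and tag == "LOC")

def org_after_org(s, i, prev, tag):
    return int(prev == "ORG" and tag == "ORG")

# Hand-picked weights w_j (assumption); training would normally set these.
WEIGHTED_FEATURES = [
    (first_word_o, 2.0),
    (lowercase_o, 2.0),
    (capitalized_org, 1.0),
    (after_in_loc, 3.0),
    (org_after_org, 0.5),
]

def score(s, tags):
    """Sum over features j and positions i of w_j * F_j for one tag sequence."""
    total = 0.0
    for feature, weight in WEIGHTED_FEATURES:
        for i, tag in enumerate(tags):
            prev = tags[i - 1] if i > 0 else None
            total += weight * feature(s, i, prev, tag)
    return total

target = ("O", "ORG", "ORG", "O", "O", "O", "O", "LOC")

# Score every possible tag sequence: 4 tags over 8 words gives 4**8 candidates.
scores = {y: score(sentence, y) for y in product(TAGS, repeat=len(sentence))}

numerator = exp(scores[target])
denominator = sum(exp(value) for value in scores.values())
probability = numerator / denominator
print(f"{len(scores)} candidate sequences, P(target | sentence) = {probability:.4f}")
```

Enumerating every sequence only works for toy inputs, because the number of candidates grows exponentially with sentence length.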
The probability P([O ORG ORG O O O O LOC] | 'The World Cup is now held in Qatar') should be the highest among all possible tag sequences if the CRF is trained well. In that case, the model tags Qatar as a location and the World Cup as an organization!
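Finding the highest-scoring sequence does not require listing every combination. A common approach for linear-chain models is the Viterbi algorithm, which is not described in this article and is shown here only as a sketch. The features and weights inside local_score are hand-picked assumptions for illustration, not values from a trained CRF.

```python
TAGS = ["O", "PER", "ORG", "LOC"]
sentence = "The World Cup is now held in Qatar".split()

def local_score(s, i, prev, tag):
    """Weighted sum of hypothetical features at position i (weights are assumptions)."""
    capitalized = s[i][0].isupper()
    total = 0.0
    total += 2.0 * (i == 0 and tag == "O")                 # first word is usually O
    total += 2.0 * (s[i].islower() and tag == "O")         # lowercase words are usually O
    total += 1.0 * (i > 0 and capitalized and tag == "ORG")
    total += 3.0 * (i > 0 and capitalized and s[i - 1] == "in" and tag == "LOC")
    total += 0.5 * (prev == "ORG" and tag == "ORG")        # ORG tends to continue
    return total

def viterbi(s, tags):
    """Return the tag sequence with the highest total score."""
    # best[i][t] = best score of any tag prefix that ends with tag t at position i
    best = [{t: local_score(s, 0, None, t) for t in tags}]
    back = []  # back[i - 1][t] = best previous tag when position i has tag t
    for i in range(1, len(s)):
        scores, pointers = {}, {}
        for t in tags:
            prev_best = max(tags, key=lambda p: best[-1][p] + local_score(s, i, p, t))
            scores[t] = best[-1][prev_best] + local_score(s, i, prev_best, t)
            pointers[t] = prev_best
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to the start.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

decoded = viterbi(sentence, TAGS)
print(decoded)  # ['O', 'ORG', 'ORG', 'O', 'O', 'O', 'O', 'LOC']
```

Since the denominator is the same for every candidate, the sequence with the highest score is also the one with the highest probability, so decoding can skip the normalization entirely.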