Data - Extraction, Engineering, Extrapolation - How ???

How Information Extraction (IE) Works ???

The basic data model of IE consists of annotations and spans. Each unit of structured information extracted from a document is called an annotation. Annotations include not only the final outputs of extraction but also lower-level intermediate outputs. Annotations are usually associated with one or more regions of text called spans.
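As a rough illustration only (a minimal sketch in Python; the class and field names are my own, not taken from any particular IE system), the annotation/span data model can be pictured like this:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Span:
        # A region of text, identified by character offsets into the source document.
        begin: int
        end: int
        text: str

    @dataclass
    class Annotation:
        # One unit of structured information, e.g. a "Person" or "PhoneNumber" annotation.
        # It can be a low-level feature or a final extraction result, and it points back
        # to the span(s) of text it was derived from.
        label: str
        spans: List[Span] = field(default_factory=list)

    doc = "John Merker can be reached at 555-0123"
    start = doc.index("555-0123")
    phone = Annotation("PhoneNumber", [Span(start, start + len("555-0123"), "555-0123")])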

High-level View: IE proceeds through three phases

Feature Selection takes raw text as input and identifies low-level entities called features. Features can be things like capitalized words, sequences of numbers, or the names of Fortune 500 companies.

Identification uses features to build more complex entities and the relationships among them, including sentiment and events. For example, a common first name followed by a capitalized word might resolve into a person’s full name.

Resolution involves cleaning up ambiguities that arise in the output of the “Identification” step (that is, the entities, relationships, events, and sentiment identified in the text). For example, a document may use several different strings – first name, full name, a pronoun – to identify the same person.

The first two phases are usually done one document at a time, but Resolution is often performed at the collection level, looking across many documents to find global entities. The final output of this process is a collection of clean, well-organized, structured information that can serve as input to downstream analytics or business intelligence software.
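In outline (a sketch only; the function names below are placeholders for whatever primitives a real system would use, not an actual API):

    def select_features(doc):
        # Phase 1 placeholder: would apply regular expressions, dictionaries, POS tagging, etc.
        return []

    def identify(features):
        # Phase 2 placeholder: would combine features into entities, relationships, events.
        return []

    def resolve(per_doc_results):
        # Phase 3 placeholder: would merge and clean entities across the whole collection.
        return [e for doc_entities in per_doc_results for e in doc_entities]

    def extract(documents):
        # Phases 1 and 2 run one document at a time; phase 3 runs over the collection.
        per_doc = [identify(select_features(d)) for d in documents]
        return resolve(per_doc)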

Feature Selection involves finding simple “atomic” entities that serve as the raw inputs to other stages of extraction. Typically, these features are produced by low-level primitive text operations such as exhaustive dictionaries of terms, character-level regular expressions, or part-of-speech tagging. The software that implements these primitives is a combination of off-the-shelf morphological analysis software and many relatively simple rules. Feature quality usually determines how well the end-to-end extraction pipeline works. (It’s not unusual for feature selection to account for 70-80% of the effort in solving a given extraction problem.)

For example, the word “John” could turn into a “common first name” feature, while “Merker” might become a “capitalized word” feature. The numbers at the end of the line might become a “sequence of numbers” feature, and the phrase “cell #” could turn into a “phone number type ID” feature. These features would not yet be combined into full names or phone numbers; that step would occur during Entity Identification.
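A minimal sketch of that kind of feature selection (the tiny dictionary, the input line, and the feature names are illustrative assumptions, not part of any specific product):

    import re

    FIRST_NAMES = {"john", "mary", "anil"}            # stand-in for an exhaustive dictionary
    text = "John Merker cell # 555-0123"

    features = []
    for m in re.finditer(r"[A-Z][a-z]+", text):       # capitalized words
        label = "CommonFirstName" if m.group().lower() in FIRST_NAMES else "CapitalizedWord"
        features.append((label, m.span(), m.group()))
    for m in re.finditer(r"\d[\d\-]+\d", text):       # sequences of numbers
        features.append(("NumberSequence", m.span(), m.group()))
    for m in re.finditer(r"cell\s*#|phone|fax", text, re.IGNORECASE):   # phone number type IDs
        features.append(("PhoneTypeID", m.span(), m.group()))

    # e.g. [('CommonFirstName', (0, 4), 'John'), ('CapitalizedWord', (5, 11), 'Merker'), ...]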

In Identification, low-level features are combined to produce entities, relationships, events, sentiment, and co-references that are close to what the actual business problem requires.

For example, the features “John” and “Merker” become a “Person” entity, and the sequence of numbers at the end of the first line becomes a “Phone Number” entity. These entities, in turn, are linked into a person-phone relationship. Similarly, the features in the second line of text turn into a person-phone relationship tagged with a phone number type. In addition, co-reference resolution techniques may be applied to determine that the two “Person” entities actually refer to the same person.
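A sketch of how such features might be combined (hypothetical rules written directly in Python; real systems usually express them declaratively, and the features list is the one produced in the feature-selection sketch above):

    features.sort(key=lambda f: f[1][0])              # order features by start offset

    entities = []
    for i in range(len(features) - 1):
        label_a, span_a, text_a = features[i]
        label_b, span_b, text_b = features[i + 1]
        # Rule: common first name immediately followed by a capitalized word => Person.
        if label_a == "CommonFirstName" and label_b == "CapitalizedWord":
            entities.append(("Person", f"{text_a} {text_b}"))
        # Rule: phone-type marker followed by a number sequence => typed PhoneNumber.
        if label_a == "PhoneTypeID" and label_b == "NumberSequence":
            entities.append(("PhoneNumber", text_b, {"type": text_a}))

    # A further rule could link the Person and PhoneNumber entities from the same
    # line into a person-phone relationship.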

Resolution requires complex analyses that look across collections of entities and relationships to clean and organize the output of Identification. For example, Entity Resolution may merge together multiple extracted entities that refer to the same person or thing. Furthermore, Entity Resolution rules may join the extracted entities with additional structured data from external sources. It’s important to note that the output of the first two phases of IE is often the input to an Entity Resolution engine.

In this example, Person Resolution rules bring in external information in the form of an office phone number directory that validates individual phone numbers. In this way, Resolution can detect and fix typographical errors.
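A rough sketch of that kind of resolution step (the office directory and the one-character typo rule below are invented purely for illustration):

    # Hypothetical external directory used to validate extracted phone numbers.
    OFFICE_DIRECTORY = {"John Merker": "555-0123"}

    def resolve_person_phone(person, extracted_phone):
        # If the extracted number differs from the directory entry by at most one
        # character, treat it as a typo and return the validated number instead.
        official = OFFICE_DIRECTORY.get(person)
        if official is None or len(official) != len(extracted_phone):
            return extracted_phone
        mismatches = sum(a != b for a, b in zip(official, extracted_phone))
        return official if mismatches <= 1 else extracted_phone

    print(resolve_person_phone("John Merker", "555-0128"))   # -> "555-0123"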

Basic Components of Information Extraction

Regular expressions allow an extractor to represent patterns within text as a compact formula that can be evaluated efficiently.

Example:

\p{Upper}\p{Lower}+(\s+\p{Upper}\p{Lower}+){0,2}

This matches a capitalized word, followed by up to two more whitespace-separated capitalized words – that is, a sequence of one to three capitalized words.

In Information Extraction, regular expressions are useful for identifying basic character patterns in the text.

Examples: numbers, capitalized words, IP addresses, URLs

They are especially important for extraction over semi-structured data (XML, HTML, etc.) or system log data.

Regular expression match semantics for IE applications:

  • Identify the locations of all matches within the document (as opposed to simply reporting whether a given line or file matches)
  • (usually) shorter matches that overlap a longer match are suppressed
  • (often) start and end on a token boundary
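A small sketch of these semantics in Python (note that the built-in re module does not support the \p{...} classes used above, so equivalent ASCII character classes are substituted here):

    import re

    # Equivalent of \p{Upper}\p{Lower}+(\s+\p{Upper}\p{Lower}+){0,2} for ASCII text.
    pattern = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+){0,2}")

    text = "Contact John Merker or Mary at the office"
    for m in pattern.finditer(text):
        # finditer reports the location of every non-overlapping match,
        # not just whether the string matches.
        print(m.start(), m.end(), m.group())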

A Dictionary (Gazetteer) is an exhaustive list of terms – a very important primitive for IE, since many important concepts in text can be approximated with a finite list of terms. Most IE systems incorporate dedicated matching engines for dictionaries that find the locations of all matches that start and end on a token boundary.
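A minimal sketch of dictionary matching on token boundaries (the word list is a placeholder, and real engines use far more efficient multi-term matching than this loop):

    import re

    COMPANIES = {"acme corp", "globex", "initech"}    # stand-in for a Fortune 500 list

    def dictionary_matches(text, terms):
        matches = []
        # \b anchors ensure each match starts and ends on a token boundary.
        for term in terms:
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                matches.append((m.start(), m.end(), m.group()))
        return sorted(matches)

    print(dictionary_matches("Initech merged with Globex last year.", COMPANIES))
    # -> [(0, 7, 'Initech'), (20, 26, 'Globex')]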

Shallow Parsing / morphological analysis breaks text into tokens and labels each token with lexical information: part of speech, lemmatized form (“would have been” → “be”, “mine” → “I”), and (sometimes) small grammatical units like noun or prepositional phrases.

  • Often integrated with regular expression and dictionary evaluation.
  • Uses only local information – a small window of text around each token
  • Relatively high throughput, but prone to mistakes – about an order of magnitude slower than dictionary/regular expression evaluation. Even in formal text like news articles, 10-15% of labels are not completely correct.
  • Does not produce “deep” information about sentence structure – will identify that a word is a transitive verb, but won't identify the verb's direct object.
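As an illustration (this sketch assumes the third-party spaCy library and its small English model are installed; any comparable tokenizer / POS tagger would serve the same purpose):

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("The phones would have been mine.")
    for token in doc:
        # Each token gets a part-of-speech tag and a lemmatized form,
        # decided from a small window of local context.
        print(token.text, token.pos_, token.lemma_)

    # Noun chunks are the kind of small grammatical unit shallow parsing can produce.
    print([chunk.text for chunk in doc.noun_chunks])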

Deep Parsing processes entire sentences at once and produces detailed syntactic information about every token in the sentence: the overall structure of the sentence, the role of each word within it, the relationships between words, and highly accurate part-of-speech information. It is very expensive – 2-3 orders of magnitude slower than the other extraction primitives. There are two main types of parse tree – nested phrases (constituency structure) and dependencies between words (dependency structure).
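A sketch of inspecting a dependency parse (again using spaCy purely as an illustrative stand-in; its parser produces dependency structure, not constituency structure):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John called the supplier yesterday.")

    for token in doc:
        # dep_ is the dependency relation; token.head is the word it attaches to.
        # For example, "supplier" is typically labelled dobj (direct object) of "called".
        print(f"{token.text:<10} {token.dep_:<10} head={token.head.text}")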

Types of IE Systems

Two dominant types – Rule-Based and Machine Learning-Based – the distinction is based on how Entity Identification is performed.

Rule-based History of IE in the NLP Community

  • 1978-1997: MUC (Message Understanding Conference) era – DARPA competitions ran from 1987 to 1997 (FRUMP [DeJong82], FASTUS [Appelt93], TextPro, PROTEUS)
  • 1998: Common Pattern Specification Language (CPSL) standard [Appelt98] for subsequent Rule-Based systems
  • 1999-2010: Commercial products, GATE

Machine Learning History

  • At first: simple techniques like Naive Bayes
  • 1990's: Learning rules (AUTOSLOG [Riloff93], CRYSTAL [Soderland98], SRV [Freitag98])
  • 2000's: More specialized models – Hidden Markov Models [Leek97], Maximum Entropy Markov Models [McCallum00], Conditional Random Fields [Lafferty01], automatic feature expansion.

PROS & CONS

Rule-Based Pros:

Declarative, Easy to comprehend & maintain, Easy to incorporate domain-knowledge, Easy to debug

Machine Learning-Based Pros:

Trainable, Adaptable, Reduces manual effort

Rule-Based Cons: Heuristic & Requires tedious manual labor

Machine Learning-Based Cons:

  • Requires labelled data and re-training for domain adaptation.
  • Requires ML expertise to use and maintain, and the resulting models are opaque (not transparent).

In reality, most systems combine Rules and Machine Learning, because complex rules can be used as features for a Machine Learning model. Machine Learning, in turn, can identify basic features that can be used in Rules – for example:

  • Building blocks such as dictionaries and NER.
  • Machine Learning models can pre-process or clean noisy text (for example, on Twitter) before it moves into a Rule-based IE system. Finally, Rules can serve as targets for Machine Learning.

Evaluating Quality: If a given text corpus contains mentions of a concept (entity, relation, event, sentiment, and so on) to be extracted, we would ideally like for the extractor to extract all the correct occurrences and nothing else. In practice, the extractor may make two kinds of errors:

  • False positive: A false positive is a mention not belonging to the concept but extracted.
  • False negative: A false negative is a mention belonging to the concept but not extracted.

To complete the terminology, we also define:

  • True positive: A true positive is a mention belonging to the concept and extracted.
  • True negative: A true negative is a mention not belonging to the concept and not extracted.

The quality of the extractor is measured on two metrics: Precision and Recall.

  • Precision (P): the percentage of extracted mentions that are correct. Calculated as follows: Precision = Total number of True positives / (Total number of True positives + Total number of False positives)
  • Recall (R): the percentage of correct mentions that the extractor actually extracts. Calculated as follows: Recall = Total number of True positives / (Total number of True positives + Total number of False negatives)

Neither precision nor recall alone is sufficient to characterize the quality of the extractor.

For example, an extractor that extracts a single correct mention and nothing else will have P = 1.00 and very small R.

An extractor that extracts everything will have R=1.00 but very small P. In practice, there is a trade-off between achieving high precision and achieving high recall. A quantity that takes into account both is the F-measure.

F-measure evaluates quality as a single number. It is a weighted harmonic mean of P and R: Fβ = (1 + β²) × P × R / (β² × P + R). A harmonic mean (as opposed to an arithmetic average) is used because it penalizes the lower of P and R. The weight β lets us put more emphasis on P or R: for example, the F2 measure weights R higher than P, whereas F0.5 weights P higher than R.

The balanced F-measure (also called the F1 measure) weights P and R equally: F1 = 2 × P × R / (P + R).
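A small sketch of these calculations in plain Python, with illustrative counts:

    def precision_recall_f(tp, fp, fn, beta=1.0):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision == 0.0 and recall == 0.0:
            return precision, recall, 0.0
        # Weighted harmonic mean: beta > 1 emphasizes recall, beta < 1 emphasizes precision.
        f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
        return precision, recall, f

    # Example: 80 true positives, 20 false positives, 40 false negatives.
    print(precision_recall_f(80, 20, 40))              # balanced F1
    print(precision_recall_f(80, 20, 40, beta=2.0))    # F2: recall weighted higher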

Metrics for measuring Run-time Performance

Measures of a system's scalability and run-time efficiency are based on:

Throughput – how much raw text the system processes per unit of time, usually measured in KB/sec/core

Memory footprint – how much memory the system needs to achieve near-peak throughput

These are typically not a focus in the IE research community, but they are extremely important in practice: they directly influence the hardware cost of the compute resources required by the system, and whether the extraction is feasible at all at the required scale. Rule-based IE systems generally have better run-time performance than machine learning-based systems.

Measure Quality and Run-time Performance

There is a spectrum of throughput: the more complex the extraction, the slower the performance. There are wide differences between IE systems – NER can achieve 1 MB/sec/core with a rule-based system but can be two orders of magnitude slower with ML-based systems.

Measuring Quality In Practice

Test set-based: If you have labelled data (commonly referred to as a Gold Standard), you can measure quality in practice using a test set-based method.

The Gold Standard consists of a test collection of documents that have been labelled with all mentions of interest, and only those mentions. To evaluate, the extractor is run over the test documents, and its results are compared with the Gold Standard mentions to calculate the Precision and Recall metrics.
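A sketch of that comparison, assuming each mention is represented as a (begin, end, label) triple so that exact-span matching can be used (real evaluations sometimes also give credit for partial overlaps):

    def compare_to_gold(extracted, gold):
        extracted, gold = set(extracted), set(gold)
        tp = len(extracted & gold)      # correct mentions that were extracted
        fp = len(extracted - gold)      # extracted mentions not in the Gold Standard
        fn = len(gold - extracted)      # Gold Standard mentions the extractor missed
        return tp, fp, fn

    gold = {(0, 11, "Person"), (30, 38, "PhoneNumber")}
    extracted = {(0, 11, "Person"), (30, 38, "PhoneNumber"), (15, 22, "Person")}
    print(compare_to_gold(extracted, gold))    # -> (2, 1, 0)

The resulting counts can then be fed into the precision/recall/F-measure calculation sketched earlier.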

Pipeline-based: If there is no labelled data, you can still compute some metrics. Calculate Precision: Run the extractor on sample data and measure Precision manually; that is, look at every result and mark it as correct (true positive) or incorrect (false positive). Based on this, you can now calculate Precision. You cannot report Recall for lack of labelled data to compare against. However, you can report a measure of the Volume of the results; that is, how many mentions are extracted by the system. Assuming Precision remains the same, a higher Volume effectively translates to higher Recall, although how close Recall is to 1.00 remains unknown.

Train, Validation and Test Data-sets: The goal is to build an IE system that delivers good quality on new (unseen) data. Quality on the training set is a bad predictor of performance on new data – it only indicates that the system has learned what it was supposed to learn. Therefore, we need to use at least two data sets:

  • Build the system on the training-set (seen data)
  • Measure system performance on the test-set (unseen data)
  • Sometimes, we also need a validation set to TUNE the system – quality is monitored on the validation set while tuning, but final results are not reported on it (otherwise the validation set effectively acts as a training set)
  • As we train the system, the quality on the validation set goes up; stop training when the quality on the validation set starts to degrade
  • The validation-set cannot be used for testing (because it is not unseen)

All data sets should be representative of the data that the system will be applied to.
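A sketch of the early-stopping idea described above (the train_one_epoch and evaluate functions are placeholders for whatever trainable extractor is being built):

    def train_with_early_stopping(train_set, validation_set, train_one_epoch, evaluate, patience=3):
        best_quality, best_model, epochs_without_gain = 0.0, None, 0
        model = None
        while epochs_without_gain < patience:
            model = train_one_epoch(model, train_set)       # build the system on seen data
            quality = evaluate(model, validation_set)       # tune against the validation set
            if quality > best_quality:
                best_quality, best_model, epochs_without_gain = quality, model, 0
            else:
                epochs_without_gain += 1
        # Final quality must still be reported on a separate, unseen test set.
        return best_model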

Obtaining Train / Test Data; Cross Validation

Facts: The more training data, the higher the quality of the system. The more test data, the more accurate the error estimate.

Problem-1: Creating labelled-data is labor intensive and time-consuming

Solution: Obtain a limited Labelled-data-set and randomly split it into Training & Test sets – usually between 10% and 33% is reserved for Testing.

Problem 2: Sampling doesn't work well on small data-sets

Solution: Maximize the use of labelled-data using k-fold cross-validation.

  • Divide data randomly into k-folds (subsets) of equal size
  • train the model on k-1 folds, use one fold for testing
  • Repeat this process k times so that every fold is used for testing once
  • Compute the average performance over the k test sets

Cross-validation is applicable only to ML-based systems.
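A minimal sketch of k-fold cross-validation (the train and evaluate functions are placeholders for the ML-based extractor being assessed):

    import random

    def cross_validate(data, k, train, evaluate, seed=0):
        data = data[:]                                  # copy before shuffling
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]          # k folds of (nearly) equal size
        scores = []
        for i in range(k):
            test_fold = folds[i]
            train_folds = [x for j, f in enumerate(folds) if j != i for x in f]
            model = train(train_folds)                  # train on k-1 folds
            scores.append(evaluate(model, test_fold))   # test on the held-out fold
        return sum(scores) / k                          # average over the k test sets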

Let me help you find out more ... I am just an email away ...

Dom Fernandez - Consultant

defining, designing and delivering solutions that enable businesses to achieve results - efficiently and effectively
