
Unsupervised Deep Parsing

Erik Tromp

Future Club makes AI work for you

Published: January 31, 2016

As I have written before, it is not always easy to grasp what current-day text analytics algorithms can do: much more is possible than we are actually aware of. One of the more difficult tasks in text analysis is deep parsing, a term encompassing tasks such as constituent parsing and semantic role labeling. These forms of deep parsing yield so-called parse trees that reveal syntactic or semantic structures in text in a hierarchical way.

Deep parsing is something that we can do reasonably well, especially nowadays with deep learning methods. The work by Collobert et al., captured in their academic open-source software SENNA, is a good example of this. There is one big challenge with deep parsing, however.

Despite being one of the harder problems of text analysis, deep parsing is particularly useful. Having a parse tree available, be it syntactic or semantic, tells us a great deal about the structure or meaning of text. While often not a goal in its own right, this knowledge aids tremendously in solving other tasks, such as sentiment analysis or topic detection. Deep parsing is hence one of the most vital building blocks of text analysis.

The English Problem

While deep parsing is a problem that has been tackled pretty well, most of the work is highly focused on dominant languages, English in particular, and this holds for many other text analysis problems as well. One of the most accurate ways of performing deep parsing is to train a language model on labeled data: corpora with sentences that human experts have annotated with parse trees. We often call these corpora treebanks.

While treebank corpora exist for many languages, only a few are elaborate or specific enough to support deep parsing tasks. Apart from that, the majority of these corpora are not free or are available for research only. This makes commercial application of deep parsing particularly difficult in languages other than a few such as English.
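To make the notion of a treebank entry concrete, here is a minimal sketch of reading one bracketed annotation into a nested structure. The example sentence and its part-of-speech labels are made up for illustration and are not taken from any real treebank; the helper names `parse_bracketed` and `leaves` are hypothetical as well.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Children are either plain strings (words) or nested tuples (subtrees).
    """
    # Put spaces around brackets so a simple split yields all symbols.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of a parsed tree in sentence order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Illustrative annotation, invented for this sketch.
example = "(S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))) (. .))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

A treebank is, in essence, a large collection of such annotated sentences; producing them by hand is what makes these resources expensive.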

A Structural Solution

Using the deep learning approaches to text analysis mentioned earlier has led us to a breakthrough in deep parsing. With deep learning, we can create language models that ‘know’ about the structure in a language, without us explicitly outlining it or handcrafting the model using expert knowledge.

We applied this mechanism to deep parsing as follows. We learn the structure of a language for which deep parsing resources are plentiful, such as English. We do the same for a specific target language for which such resources are scarce or not available at all, and we use the induced structure of both models to jointly learn how to perform deep parsing in the target language.

This approach to deep parsing has two major advantages.

We do not need a manually annotated treebank in our target language to be able to perform deep parsing.
To train a model for our target language, all we have to do is induce its structure, something that takes little time and very little human intervention.

Using this approach, we have been able to create deep parsing models (constituency parsing and semantic role labeling) for Dutch, French and German in just a matter of hours, without ever seeing a single manually annotated parse tree in any of these languages.

What About Performance?

Very well, you might say, what about performance? This may be a nice story, but if the results don’t add up, no one will ever care. Evaluating the performance of these models is not always straightforward, since high-quality treebanks are not always available for these target languages. Luckily, for a couple of these languages they are. For German, for example, there is the Negra corpus, which is known to be of decent quality, although it is available for research only.

We conducted a small experiment on 100 sentences of the Negra corpus and compared our induced parse trees with those of the corpus. We found a high exact-match rate: for over 70 of the 100 sentences, our method produced exactly the same parse tree as the Negra corpus. In a little over 20 other sentences, the errors made by our approach were solely due to a flipped order in parsing small subtrees, for example picking one noun phrase over the other first. The remaining errors, which involved more structural discrepancies, were not completely off but, for example, assigned verbs to the wrong prepositions.
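The exact-match comparison described above can be sketched as follows. This is only an illustration of the metric, not the evaluation code used in the experiment; the function names are hypothetical and the toy trees are invented, since the Negra data itself cannot be reproduced here.

```python
def normalize(tree_str):
    """Make bracketed trees comparable regardless of whitespace layout."""
    spaced = tree_str.replace("(", " ( ").replace(")", " ) ")
    return " ".join(spaced.split())


def exact_match_rate(predicted, gold):
    """Fraction of sentences whose predicted tree equals the gold tree."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot evaluate on an empty set of sentences")
    matches = sum(
        normalize(p) == normalize(g) for p, g in zip(predicted, gold)
    )
    return matches / len(gold)


# Toy data, invented for this sketch: one identical tree, one mismatch.
predicted_trees = ["(S (NP a) (VP b))", "(S (NP a))"]
gold_trees = ["(S(NP a)(VP b))", "(S (VP a))"]
print(exact_match_rate(predicted_trees, gold_trees))
```

Exact match is a strict measure: a tree with a single misplaced bracket counts as entirely wrong, which is why the nature of the remaining errors is worth inspecting separately.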

Of course, further evaluation and investigation are required to draw more definitive conclusions. Given these promising results, however, we are already incorporating unsupervised deep parsing into our advanced text analysis solutions at UnderstandLing, combining it for example with Recursive Auto-Encoders to perform state-of-the-art text classification tasks such as topic detection or sentiment analysis in a rapidly growing number of European languages.

Tiny Teaser

I can imagine you are interested in seeing some of this work in action. Without spoiling too much of its mystery, let me give you a single example of a parse tree induced on the Dutch sentence ‘Dit is een test in het Nederlands.’ (English: ‘This is a test in Dutch.’), in Penn Treebank notation:

(S1(S(NP*)(VP*(NP(NP**)(PP*(NP**))))*))
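To read this tree more easily, the placeholders can be filled with the words of the sentence. The sketch below assumes, which the text above does not state explicitly, that each `*` stands for exactly one token in sentence order and that the final period is a token of its own; the function name `fill_placeholders` is hypothetical.

```python
def fill_placeholders(template, tokens):
    """Replace each '*' in a bracketed template with the next token."""
    remaining = iter(tokens)
    pieces = []
    for char in template:
        if char == "*":
            try:
                pieces.append(" " + next(remaining))
            except StopIteration:
                raise ValueError("more placeholders than tokens") from None
        else:
            pieces.append(char)
    # Every token must have been consumed by a placeholder.
    if next(remaining, None) is not None:
        raise ValueError("more tokens than placeholders")
    return "".join(pieces)


template = "(S1(S(NP*)(VP*(NP(NP**)(PP*(NP**))))*))"
# Assumed tokenization: split on spaces, period as a separate token.
tokens = ["Dit", "is", "een", "test", "in", "het", "Nederlands", "."]
print(fill_placeholders(template, tokens))
```

Under these assumptions the result is `(S1(S(NP Dit)(VP is(NP(NP een test)(PP in(NP het Nederlands)))) .))`: ‘Dit’ is the subject noun phrase, and ‘een test in het Nederlands’ forms a noun phrase containing a prepositional phrase.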

Did this get you excited? Are you interested in our approach? Contact us at [email protected] and let us show you how our solutions can help you perform real-time and highly accurate customer experience monitoring.

