Unsupervised Deep Parsing
[Image: an English parse tree, courtesy of MIT]

As I have written before, it is not always easy to grasp what current-day text analytics algorithms are capable of: much more can be done than most of us are aware of. Among the more difficult tasks of text analysis is deep parsing, a term encompassing tasks such as constituency parsing and semantic role labeling. These forms of deep parsing yield so-called parse trees that reveal the syntactic or semantic structure of text in a hierarchical way.
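To make this concrete, here is a hand-written constituency parse of the English sentence ‘This is a test’, rendered with NLTK (used here purely for illustration; it is not the toolkit discussed in this post):

    # pip install nltk -- the bracket string below is written by hand
    from nltk import Tree

    parse = Tree.fromstring(
        "(S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))))"
    )
    parse.pretty_print()   # draws the hierarchy as ASCII art
    print(parse.leaves())  # ['This', 'is', 'a', 'test']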

Deep parsing is something we can do reasonably well, especially nowadays with deep learning methods. The work by Collobert et al., captured in their academic open-source system SENNA, is a good example of this. There is, however, one big challenge with deep parsing.

Despite being one of the harder problems of text analysis, deep parsing is particularly useful. Having a parse tree available, be it syntactic or semantic, tells us a great deal about the structure or meaning of a text. While rarely a goal on its own, this knowledge helps tremendously in solving other tasks, such as sentiment analysis or topic detection. Deep parsing is hence one of the most vital building blocks of text analysis.

The English Problem

While deep parsing is a problem that has been tackled pretty well, most of the work, and this holds for many other text analysis problems as well, is highly focused on dominant languages, English in particular. One of the most accurate ways of performing deep parsing is to build a model on labeled data: corpora in which sentences have been annotated with parse trees by human experts. We often call these corpora treebanks.
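As an illustration of what such expert annotations look like (unrelated to our own data), NLTK ships a small sample of the English Penn Treebank:

    # pip install nltk; the WSJ sample must be downloaded once
    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank", quiet=True)

    tree = treebank.parsed_sents()[0]  # first hand-annotated WSJ sentence
    print(tree.leaves()[:8])           # the first few words
    tree.pretty_print()                # the full constituency structure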

While treebank corpora exist for many languages, only a few are elaborate or specific enough to support deep parsing tasks. Apart from that, the majority of these corpora are not free or are available for research only. This makes commercial application of deep parsing particularly difficult in languages other than a few such as English.

A Structural Solution

Using the deep learning approaches to text analysis mentioned earlier has led us to a breakthrough in deep parsing. With deep learning, we can create language models that ‘know’ about the structure of a language without us explicitly outlining it or handcrafting the model using expert knowledge.
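As a toy illustration of that general idea (and explicitly not the model alluded to here), even a simple distributional model such as word2vec picks up regularities from raw, unannotated text:

    # pip install gensim (>= 4.0); toy corpus only, real models need far more text
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["a", "cat", "chased", "a", "dog"],
    ]

    # No labels anywhere: the model learns purely from co-occurrence statistics
    model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50)
    print(model.wv.most_similar("cat", topn=3))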

We applied this mechanism to deep parsing as follows: we learn the structure of a language in which deep parsing resources are plentiful, such as English, and do the same for a specific target language in which such resources are scarce or absent. We then use the induced structure of both models to jointly learn how to perform deep parsing in the target language.
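The exact mechanics are beyond the scope of this post, but a minimal sketch of the general flavour of such cross-lingual bootstrapping, here in the form of simple annotation projection over a word alignment, with all labels and alignments made up, could look as follows (an illustration only, not our actual implementation):

    # Illustrative only: carry per-token labels from a parsed source sentence
    # over to an aligned target sentence. The alignment is hand-made here;
    # in practice it would come from a statistical or neural word aligner.

    def project_labels(src_labels, alignment, tgt_len):
        """src_labels: {src_index: label}; alignment: {src_index: tgt_index}."""
        tgt_labels = ["O"] * tgt_len
        for src_i, label in src_labels.items():
            tgt_i = alignment.get(src_i)
            if tgt_i is not None:
                tgt_labels[tgt_i] = label
        return tgt_labels

    # English: "This is a test"  ->  Dutch: "Dit is een test"
    src_labels = {0: "NP", 1: "VP", 2: "NP", 3: "NP"}  # toy per-token labels
    alignment  = {0: 0, 1: 1, 2: 2, 3: 3}              # monotone one-to-one here
    print(project_labels(src_labels, alignment, tgt_len=4))
    # ['NP', 'VP', 'NP', 'NP']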

This approach to deep parsing has two major advantages.

  1. We do not need a manually annotated treebank in our target language to be able to perform deep parsing.
  2. To train a model for our target language, all we have to do is induce its structure, something that takes little time and very little human intervention.

Using this approach, we have been able to create deep parsers (constituency parsing and semantic role labeling) for Dutch, French and German in a matter of hours, without ever seeing a single manually annotated parse tree in any of these languages.

What About Performance?

Very well, you might say, but what about performance? This may be a nice story, but if the results don’t add up, no one will ever care. Evaluating the performance of these models is not always straightforward, since high-quality treebanks are not always available for these target languages. Luckily, for a couple of these languages they are. For German, for example, there is the Negra corpus, which is known to be of decent quality, although it is available for research only.

We conducted a small experiment on 100 sentences of the Negra corpus and compared our induced parse trees with the Negra annotations. Over 70 of the 100 sentences resulted in exactly the same parse tree, a remarkably high exact-match rate. In a little over 20 other sentences, the errors made by our approach were solely due to flipping the order in which small subtrees were parsed, for example picking one noun phrase before the other. The remaining errors, which showed more structural discrepancies, were not completely off either, but for example attached verbs to the wrong prepositions.
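For the curious, the exact-match comparison itself is straightforward; a minimal sketch (not our actual evaluation code, and with made-up trees, since the Negra data is licensed for research use) could look like this:

    # Count how many predicted bracketed trees match the gold trees exactly,
    # after collapsing whitespace so cosmetic differences do not count as errors.

    def normalise(tree_str):
        return "".join(tree_str.split())

    def exact_match_accuracy(predicted, gold):
        assert len(predicted) == len(gold)
        hits = sum(normalise(p) == normalise(g) for p, g in zip(predicted, gold))
        return hits / len(gold)

    pred = ["(S (NP *) (VP *))", "(S (VP *) (NP *))"]   # made-up examples
    gold = ["(S(NP *)(VP *))",   "(S (NP *) (VP *))"]
    print(exact_match_accuracy(pred, gold))  # 0.5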

Of course, further evaluation and investigation are required before drawing more definitive conclusions, but given these promising results we are already embedding unsupervised deep parsing in our advanced text analysis solutions at UnderstandLing, combining it, for example, with Recursive Auto-Encoders to perform state-of-the-art text classification tasks such as topic detection and sentiment analysis in a rapidly growing number of European languages.

Tiny Teaser

I can imagine you are interested in seeing some of this work in action. Without spoiling too much of its mystery, let me give you a single example: a parse tree induced for the Dutch sentence ‘Dit is een test in het Nederlands.’ (English: ‘This is a test in Dutch.’), in Penn Treebank notation:

(S1(S(NP*)(VP*(NP(NP**)(PP*(NP**))))*))
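If the flat bracket string is hard to read, a few lines of Python (a little helper added here for readability only; the ‘*’ placeholders stand in for the Dutch tokens) indent it into a more tree-like shape:

    # Indent a Penn-Treebank-style bracket string so the nesting is visible

    def indent_brackets(s, step=2):
        out, depth = [], 0
        for ch in s:
            if ch == "(":
                out.append("\n" + " " * (depth * step) + "(")
                depth += 1
            elif ch == ")":
                depth -= 1
                out.append(")")
            else:
                out.append(ch)
        return "".join(out).lstrip("\n")

    print(indent_brackets("(S1(S(NP*)(VP*(NP(NP**)(PP*(NP**))))*))"))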

Did this get you excited? Are you interested in our approach? Contact us at [email protected] and let us show you how our solutions can help you perform real-time and highly accurate customer experience monitoring.
