Models of meaning: unsupervised keyphrase extraction for web pages
Maaike van der Post
People & Culture Strategist | Building Trust and Creating Inclusive, Sustainable Workplaces for Remote and Global Teams - maaikepost.com
Data Scientist Tim Haarman from the Department of Artificial Intelligence at the University of Groningen was at Dataprovider.com to complete his research into ‘Unsupervised Keyphrase Extraction for Web Pages’. We got a chance to talk to him about it.
Hello Tim, what was the goal of your research?
“We are trying to find the best method for extracting keywords from a web page. This plays an important part in the field of Natural Language Processing (NLP), but it is not straightforward: web pages are relatively unexplored as a medium, and structural elements and the fragmentation of text play a huge part in understanding what is important in a given web text. Add to this that we researched unsupervised keyphrase extraction methods…”
Which is more difficult than supervised methods?
“It is a different approach to the problem. Supervised methods can be very effective, but they require extremely large datasets that are laborious to create. Even though there is limited research on unsupervised keyphrase extraction, we decided to look into this method due to its large potential – provided it is accurate enough, of course.”
Why is that?
“Our model understands which words do not contribute to the message of the web page. Keyphrases such as ‘contact us’ and ‘read more’ score very low, despite appearing often. And that is what is most interesting about our research: generating a model that actually understands what message the content of a web page is trying to convey. We did this by taking existing techniques and optimizing them to read web pages.”
Are there other applications that can use this technique?
“Look at it this way: if you are searching online for ‘electric car’, you will find the results that match this specific term. However, should you not also be able to find pages that speak of ‘electric vehicle’? We use synonyms all the time. Perhaps this model will, with further research, enable loose keyword searching – which is quite exciting!”
What was the most difficult thing about your research?
“It was the analysis of our data. Because behind all the calculations and models, the central question is still: what is a good keyphrase? I could not decide that myself, and there are no datasets or studies that answer this question for web pages in particular. That is why we created our own benchmark to evaluate the quality of the model. Dataprovider.com has a database of over 250 million hostnames, of which 105 were randomly selected – after applying some filters to ensure the pages were of sufficient quality. Three annotators went through the pages and wrote down their findings. These were used in evaluating the accuracy of the different models.”
And what was the best moment?
“That would have to be the moment during the research when you realize that the model can actually extract meaning from the web pages. Our model outperformed all other unsupervised methods when finding quality keywords for the web pages in the set. The best part is knowing that it is indeed possible to accurately model the meanings of words and how they relate to one another, even on unstructured web pages. It’s exciting to think about what is possible with this.”
---
The article ‘Unsupervised Keyphrase Extraction for Web Pages’ was recently published in the international, scientific, peer-reviewed, open access journal Multimodal Technologies and Interaction. The link to the open access article can be found in the first comment.
---
Text & photography: Coen Berkhout, Edwart Visser.
https://www.mdpi.com/2414-4088/3/3/58