Detecting Semantic Change using Word Embedding

Word embeddings capture lexical properties from the co-occurrence statistics of large text corpora. They act as high-quality semantic vectors that can be employed in various NLP tasks such as syntactic parsing, sentiment analysis, and semantic relatedness. The representation captured via embeddings is distributional in nature, and it has applications beyond serving as features for everyday natural language processing tasks. One such novel application is detecting changes in the meaning of words.

Languages evolve. This evolution can take the form of lexical, phonological, syntactic, and semantic change. Lexical change reflects the ongoing influx of new words or word forms into the language. Phonological change is associated with the concept of sound change, covering both phonetic and phonological developments, whereas syntactic change refers to structural changes in the use of language. Semantic change refers to the evolution of word usage; more specifically, it is a change in one of the meanings of a word. The marked semantic distance between the new meanings a word takes on reflects this change. For instance, the word 'Android' was more synonymous with a humanoid talking robot in the late 90s, but now it is commonly associated with a mobile phone operating system. One may attribute this phenomenon to the emergence of popular topics/senses that persist in an era. However, we believe every word possesses a variety of senses and connotations, which can be added, removed, or altered over time, often to the extent that cognates across space and time come to have very different meanings.

We studied semantic changes that occur over years and decades in selected text corpora. The distributional hypothesis is used to capture semantic change at the lexical level over a time period. To put it differently, we test whether a word has changed its meaning by observing `the company it keeps' (as per Firth) using its word vectors. To do this, we first created word vector models from a large news and magazine corpus, partitioned by era of publication. Thereafter, we introduce a simple yet effective method to project one word vector model onto another for comparison. We tracked the semantic evolution of the same word using these projected models.
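The article does not spell out the projection step, but one common way to realize such a projection is orthogonal Procrustes alignment: find the rotation that best maps the shared vocabulary of one model onto the other. The sketch below is a minimal illustration of that idea, not necessarily the exact method used in this work; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def align_embeddings(source, target):
    """Rotate `source` (n x d) onto `target` (n x d).

    Rows of both matrices must correspond to the same shared-vocabulary
    words. Returns the rotated copy of `source`.
    """
    # Orthogonal Procrustes: argmin_R ||source @ R - target||_F with R
    # orthogonal, solved via the SVD of source^T @ target.
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation
```

Restricting the projection to an orthogonal rotation preserves cosine similarities within each model, so any residual deviation between a word's aligned vectors can be attributed to a shift in usage rather than to the alignment step itself.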

The representation of a word depends on the format and domain of the text from which the co-occurrence statistics are built. Canberra Times (https://www.canberratimes.com.au/) was crawled for data between 1980 and 2010, accounting for 2.42 million words. To make the co-occurrence statistics more robust, data from Wikipedia (https://en.wikipedia.org/, accessed on 4 September 2013), containing 1.6 billion words, was added to the Canberra Times corpus.

We divided the Canberra Times dataset into three epochs, each spanning a decade. We trained word embeddings for each epoch using Mikolov's word vector package (https://code.google.com/p/word2vec/) and projected the models from one epoch onto another. Words deviating by a wide margin were selected and inspected for semantic change.
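As a concrete illustration, the snippet below trains one model per epoch using gensim's reimplementation of word2vec (the original work used Mikolov's C package; the file names and hyperparameters here are illustrative assumptions, not the settings used in this study).

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One plain-text file per decade, one tokenized sentence per line.
epochs = {"1980s": "canberra_1980s.txt",
          "1990s": "canberra_1990s.txt",
          "2000s": "canberra_2000s.txt"}

models = {}
for name, path in epochs.items():
    sentences = LineSentence(path)  # streams the file, so large corpora fit
    models[name] = Word2Vec(sentences, vector_size=200, window=5,
                            min_count=10, workers=4)
```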

Deviation is measured in terms of cosine similarity, and this deviation represents the shift in meaning. Words similar to the word in question, drawn from word vector models belonging to different epochs, are also captured. These neighbours give a crude sense of what the word actually means in each epoch. The figure below shows some of the words that the system detected with large deviation.
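Building on the earlier sketches (the `models` dictionary and `align_embeddings` are assumptions carried over from them), the deviation scoring and neighbour look-up might look like this:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

old, new = models["1980s"], models["2000s"]
shared = [w for w in old.wv.index_to_key if w in new.wv]

# Stack the shared-vocabulary vectors and align the old space to the new.
source = np.stack([old.wv[w] for w in shared])
target = np.stack([new.wv[w] for w in shared])
aligned = align_embeddings(source, target)

# Score each word by how far its aligned vector drifted across epochs.
deviation = {w: 1.0 - cosine(aligned[i], target[i])
             for i, w in enumerate(shared)}

for w in sorted(deviation, key=deviation.get, reverse=True)[:20]:
    # Nearest neighbours in each epoch give a crude gloss of the meaning.
    print(w,
          old.wv.most_similar(w, topn=5),
          new.wv.most_similar(w, topn=5))
```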

At present, these words are detected when the divergence between their representations from different eras exceeds a threshold. Along with true examples of words exhibiting semantic change, many words that do not show the phenomenon are also captured. The false positive rate is high because detecting semantic change from raw divergence is noisy. More reliable methods, such as change point detection algorithms, could be employed to reduce false positive detections.
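A hedged sketch of what such a filter could look like, using the `ruptures` change point detection library: it assumes finer-grained (e.g., yearly) models that yield a per-word series of cross-epoch similarities, which goes beyond the three decade-sized epochs actually used in this work.

```python
import numpy as np
import ruptures as rpt

def has_change_point(similarity_series, penalty=3.0):
    """Return True if a word's yearly cross-epoch similarity series
    contains at least one detected change point."""
    signal = np.asarray(similarity_series).reshape(-1, 1)
    algo = rpt.Pelt(model="rbf", min_size=2).fit(signal)
    breakpoints = algo.predict(pen=penalty)
    # `predict` always returns the series length as the final index,
    # so more than one entry means a real change point was found.
    return len(breakpoints) > 1
```

A word whose similarity series drifts noisily around a constant level would then be discarded, while a word with a sustained level shift in its series would be kept as a genuine candidate for semantic change.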

The distributional representation behind word embeddings has a multitude of applications that go beyond feature engineering for deep learning. At present, a lot of research is being conducted in this direction.

Acknowledgment: This is part of joint work with Prachetos Sadhukhan, Vasudevan N, Benoit Favre, and Fredric Bechet, carried out by the author during his time at the LIF Lab, CNRS, Marseille.
