Fundamentals -4: Lemmatization
Dheeraj RP
Data Engineer | Spark | Python | Hive | SQL | AWS: S3 | Lambda | Glue | GCP: BigQuery
MasterNLP extends a warm welcome to everyone. This week, we will be learning about Lemmatization in NLP.
What is Lemmatization?
Lemmatization is the process in NLP of reducing a word to its base or dictionary form, known as the lemma. It identifies that form by taking into account the word's context (for example, its part of speech) and its morphology.
What does lemmatization do?
The goal of lemmatization is to reduce inflected words to their base form so that they can be analyzed more easily.
For instance, the words "am," "are," and "is" can all be lemmatized to the base form "be." Similarly, the words "go," "going," and "went" can be lemmatized to the base form "go."
By reducing words to their base form, we reduce the number of unique words in a corpus, which can improve the accuracy of downstream NLP tasks such as information retrieval, sentiment analysis, and machine translation.
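To make the vocabulary reduction concrete, here is a minimal sketch in plain Python. The token list and the small LEMMA_TABLE are made up for illustration (they are assumptions, not data from a real lemmatizer); the point is only that mapping inflected forms to one lemma shrinks the set of unique words.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be", "going": "go", "went": "go"}

# Hypothetical toy corpus, already tokenized and lowercased
tokens = ["i", "am", "going", "you", "are", "going", "she", "is", "here", "he", "went"]

# Replace each token with its lemma; unknown tokens are kept unchanged
lemmas = [LEMMA_TABLE.get(token, token) for token in tokens]

vocab_before = len(set(tokens))
vocab_after = len(set(lemmas))
print(vocab_before, vocab_after)        # 10 7
print(Counter(lemmas).most_common(2))   # [('be', 3), ('go', 3)]
```

After lemmatization, "am", "are", and "is" are all counted as "be", so a frequency count reflects the underlying word rather than each surface form.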
How does lemmatization differ from stemming?
Lemmatization is different from stemming, another process used in NLP to reduce words to a common form. Stemming is simpler: it strips suffixes from a word using heuristic rules, without consulting a dictionary. As a result, stemming may not always produce valid words, whereas lemmatization returns dictionary forms (a dictionary-based lemmatizer typically leaves a word it does not recognize unchanged).
Key Differences between stemming and lemmatization
Stemming: Studies ------------> Studi
Lemmatization: Studies ------------> Study
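The contrast above can be sketched with two toy functions. Both are simplified stand-ins written for this example: toy_stem is not the Porter stemmer, and LEMMA_TABLE is a hypothetical three-entry dictionary rather than WordNet. The sketch only shows the difference in approach: rule-based suffix stripping versus dictionary lookup.

```python
# Suffixes tried in this order (an assumption made for this toy example)
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written dictionary standing in for a real lexical resource
LEMMA_TABLE = {"studies": "study", "studying": "study", "went": "go"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; return it unchanged if it is missing."""
    return LEMMA_TABLE.get(word, word)

for word in ("studies", "studying", "went"):
    print(word, toy_stem(word), toy_lemmatize(word))
# studies studi study
# studying study study
# went went go
```

The stemmer produces the non-word "studi" and cannot relate "went" to "go", because it only looks at the surface form. The lookup handles both, but only for words its dictionary contains.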
Pythonic Implementation
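import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Download the required resources (names can vary by NLTK version; newer
# releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng')
for resource in ('wordnet', 'punkt', 'averaged_perceptron_tagger'):
    nltk.download(resource)

def get_wordnet_pos(tag):
    # Map a Penn Treebank tag to a WordNet POS; default to noun
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}.get(tag[0], wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
# Example sentence
sentence = "the boys are going by the car"
# Tokenize sentence into words
words = nltk.word_tokenize(sentence)
# Lemmatize each word using its part-of-speech tag
lemmatized_words = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(pos_tag)) for word, pos_tag in nltk.pos_tag(words)]
print(lemmatized_words)
# Output: ['the', 'boy', 'be', 'go', 'by', 'the', 'car']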
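We can also remove stopwords before lemmatization. If the remaining words are then lemmatized without part-of-speech tags (WordNetLemmatizer treats every word as a noun by default), the output is:
['boy', 'going', 'car']
Here "going" is left unchanged because it is reduced to "go" only when it is tagged as a verb.
Stopword removal was covered in a previous edition. Please check out the stopwords edition for more.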
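The effect of the part-of-speech tag on this result can be sketched without any downloads. Everything in the block below is a hand-written stand-in: STOPWORDS, LEMMAS, and POS_TAGS are hypothetical tables for this one sentence, not NLTK's stopword list, WordNet, or a real tagger.

```python
# Hypothetical stand-ins for a stopword list, a lemma dictionary, and a POS tagger
STOPWORDS = {"the", "are", "by"}
LEMMAS = {("boys", "n"): "boy", ("going", "v"): "go", ("are", "v"): "be"}
POS_TAGS = {"going": "v"}  # every other word is treated as a noun

def lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return LEMMAS.get((word, pos), word)

tokens = "the boys are going by the car".split()

# Step 1: drop stopwords
content_words = [t for t in tokens if t not in STOPWORDS]

# Step 2a: lemmatize with the default (noun) part of speech
noun_default = [lemmatize(t) for t in content_words]

# Step 2b: lemmatize with a part-of-speech tag for each word
pos_aware = [lemmatize(t, POS_TAGS.get(t, "n")) for t in content_words]

print(noun_default)  # ['boy', 'going', 'car']
print(pos_aware)     # ['boy', 'go', 'car']
```

With the noun default, "going" has no matching entry and passes through unchanged. Supplying the verb tag is what lets the lookup reduce it to "go".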
Lemmatization can also be customized to specific domains or languages by using domain-specific dictionaries or language-specific algorithms.
For example, the spaCy library for Python provides language-specific lemmatization algorithms for several languages, including English, German, Spanish, and French.
As discussed in one of the earlier articles, every upcoming article will include one interview question.
Interview Question:
Question: What is NLP, and how does it differ from other areas of AI?
Answer: NLP is a subfield of AI that focuses on enabling computers to understand, interpret, and generate human language. NLP aims to bridge the gap between human language and computer language, enabling machines to understand and respond to natural language inputs such as text and speech.
NLP differs from other areas of AI in its focus on language and communication. While other areas of AI, such as computer vision or robotics, may also deal with natural language input and output, their primary focus is on perception and action in the physical world. NLP, on the other hand, is concerned primarily with the meaning and structure of language, and how it can be processed and analyzed by computers.
Another important aspect that sets NLP apart from other areas of AI is the complexity and variability of natural language. Unlike computer programming languages or other formal languages, natural language is highly ambiguous, context-dependent, and constantly evolving. NLP researchers and practitioners must contend with issues like synonymy (different words with the same meaning), polysemy (the same word with multiple meanings), and homonymy (different words with the same spelling or pronunciation).
Subscribe to MasterNLP for more conceptual content every week.
Best regards,
#MasterNLP Newsletter
Have a great week ahead.