Data Preprocessing using NLTK for NLP

What is the Python NLTK package?

The Natural Language Toolkit (NLTK) is a Python library for writing programs that work with natural-language text. It provides a user-friendly interface to datasets (corpora and lexical resources) and supports operations such as tokenization, stemming, classification, tagging, and semantic reasoning. At the time of writing, the latest version was NLTK 3.3. NLTK is free and open source, and it runs on Windows, macOS, and Linux.

Tokenization: Tokenization is the process of splitting text into smaller pieces called tokens. Words, numbers, punctuation marks, and other units can each be treated as a single token. Here the split() function is used to split the input text into tokens:
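
The original screenshot is not available, so the following is only a minimal sketch of whitespace tokenization with Python's built-in str.split(). The sample sentence and the regular-expression variant are assumptions added for illustration; the regex is a simple stand-in for a punctuation-aware tokenizer, not NLTK's own implementation.

```python
import re

text = "Hello, world! NLP is fun."

# str.split() with no arguments splits on any run of whitespace.
# Punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()
print(whitespace_tokens)

# Illustrative alternative: treat each punctuation mark as its own token.
# \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)
print(punct_tokens)
```

Note that split() alone yields tokens such as "Hello," with the comma attached. If punctuation should count as separate tokens, as the definition above allows, a punctuation-aware tokenizer such as NLTK's word_tokenize is usually a better fit.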

Stemming: Stemming is the process of reducing words to their stem, base, or root form, for example friendship:friend, books:book, etc. Here we use two main algorithms: the Porter stemming algorithm, which removes common morphological and inflectional endings from words, and the Lancaster stemming algorithm, which is more aggressive.
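
As a rough sketch of how the two stemmers can be compared side by side, the snippet below uses NLTK's PorterStemmer and LancasterStemmer classes. The word list and the helper name compare_stemmers are assumptions chosen for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()


def compare_stemmers(words):
    """Return (word, porter_stem, lancaster_stem) for each input word."""
    return [(word, porter.stem(word), lancaster.stem(word)) for word in words]


# Hypothetical sample words, chosen only for illustration.
for word, porter_stem, lancaster_stem in compare_stemmers(
    ["books", "running", "friendship", "studies"]
):
    print(f"{word:<12} porter={porter_stem:<12} lancaster={lancaster_stem}")
```

The two algorithms do not always agree, and a stem is not guaranteed to be a dictionary word. The Porter stemmer, for instance, maps "studies" to "studi".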

Lemmatization: Lemmatization reduces inflected words to a root form that is guaranteed to be a valid word in the language. In lemmatization, this root word is called the lemma.
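
One possible way to try this out is NLTK's WordNetLemmatizer, sketched below. The helper name lemmatize_tokens and the sample words are assumptions for illustration. The lemmatizer depends on the WordNet corpus, so the sketch attempts a one-time download (which needs network access) and falls back to a message if the data is unavailable.

```python
import nltk
from nltk.stem import WordNetLemmatizer


def lemmatize_tokens(tokens, pos="n"):
    """Return the WordNet lemma of each token for the given part of speech.

    pos uses WordNet codes: "n" noun, "v" verb, "a" adjective, "r" adverb.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=pos) for token in tokens]


# One-time data download; "omw-1.4" is fetched as well on the assumption
# that some NLTK releases expect it alongside WordNet.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

try:
    print(lemmatize_tokens(["books", "geese", "feet"]))   # treated as nouns
    print(lemmatize_tokens(["running", "was"], pos="v"))  # treated as verbs
except LookupError:
    print("WordNet data is unavailable; run nltk.download('wordnet') and retry.")
```

The part-of-speech argument matters: the default is noun, so a verb form such as "running" is only reduced to "run" when pos="v" is passed.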

Part-of-speech tagging: Part-of-speech tagging assigns a part of speech (such as noun, verb, or adjective) to each word of a given text, based on the word's definition and its context.
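
A minimal sketch of tagging with nltk.pos_tag follows. The helper name tag_tokens and the sample sentence are assumptions for illustration. The tagger needs a pretrained model whose resource name differs between NLTK releases, so both names are requested here, and the sketch falls back to a message if neither can be loaded.

```python
import nltk


def tag_tokens(tokens):
    """Return a list of (token, tag) pairs from NLTK's default English tagger."""
    return nltk.pos_tag(tokens)


# The model's resource name depends on the NLTK release, so try both
# (an unknown name is simply reported as a failed download).
for resource in ("averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

try:
    # Whitespace tokenization keeps the example free of extra tokenizer data.
    sentence = "The quick brown fox jumps over the lazy dog"
    for word, tag in tag_tokens(sentence.split()):
        print(f"{word:<6} {tag}")
except LookupError:
    print("Tagger model is unavailable; download it with nltk.download() and retry.")
```

The default tagger returns Penn Treebank tags, for example NN for a singular noun and VBZ for a third-person singular present-tense verb.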

CONCLUSION: This article described the main steps of data preprocessing for NLP using NLTK (Natural Language Toolkit): tokenization, stemming, lemmatization, and part-of-speech tagging.

