Data Preprocessing using NLTK for NLP
What is the Python nltk package?
Natural Language Toolkit (NLTK) is a Python library for writing programs that work with natural language. It provides a user-friendly interface to datasets such as text corpora and lexical resources. The library supports operations such as tokenization, stemming, classification, tagging, and semantic reasoning. At the time of writing, the latest version was NLTK 3.3. It is a free, open-source library and is available for Windows, Mac OS, and Linux.
Tokenization : Tokenization is the process of splitting text into smaller pieces called tokens. Words, numbers, punctuation marks, and other units can each be treated as a single token. The simplest approach is Python's built-in split() method, which breaks the input text into tokens wherever whitespace occurs.
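As a minimal sketch of whitespace tokenization with split(), the snippet below uses a made-up sample sentence (the sentence and variable names are illustrative assumptions, not taken from the original article). Because split() only looks at whitespace, punctuation stays attached to the neighbouring word.

```python
# Sample text chosen only for illustration.
text = "NLTK makes it easy to work with text. It has 3 stemmers!"

# str.split() with no arguments splits on any run of whitespace.
tokens = text.split()

print(tokens)
# Punctuation is not separated: "text." and "stemmers!" remain single tokens.
```

If punctuation should become separate tokens, NLTK's word_tokenize() function is a common alternative, though it needs extra tokenizer data to be downloaded first.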
Stemming : Stemming is the process of reducing words to their stem, base, or root form, for example friendship:friend and books:book. Here we use two main algorithms: the Porter stemming algorithm, which removes common morphological and inflectional endings from words, and the Lancaster stemming algorithm, which is more aggressive.
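The two stemmers can be compared side by side with a short sketch like the one below. It assumes NLTK is installed and uses the PorterStemmer and LancasterStemmer classes from nltk.stem; the word list is an illustrative assumption. The exact stems depend on each algorithm's rules, so the two columns will not always agree, and neither is guaranteed to produce a dictionary word.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Illustrative words; any list of lowercase tokens would work.
words = ["books", "running", "friendship", "studies"]

# Stem every word with both algorithms.
porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

# Print the results in three columns for comparison.
for word, p_stem, l_stem in zip(words, porter_stems, lancaster_stems):
    print(f"{word:12} porter={p_stem:12} lancaster={l_stem}")
```

Neither stemmer needs downloaded corpus data, so this runs with a plain NLTK installation.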
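Lemmatization : Lemmatization reduces inflected words to their root form while ensuring that the root word belongs to the language. Unlike a stem, which may not be a real word, the result of lemmatization is a valid dictionary word. In lemmatization, the root word is called the lemma.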
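A rough sketch of lemmatization with NLTK's WordNetLemmatizer follows. The helper name lemmatize_words is hypothetical, introduced only for this example. The lemmatizer relies on the WordNet corpus, which has to be downloaded once (for example with nltk.download("wordnet")); some NLTK versions may ask for additional resources.

```python
from nltk.stem import WordNetLemmatizer


def lemmatize_words(words, pos="n"):
    """Return the WordNet lemma of each word.

    pos is the WordNet part-of-speech code: "n" noun, "v" verb,
    "a" adjective, "r" adverb. The lemmatizer treats every word
    as having this part of speech.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]


# Example usage (requires the WordNet data to be installed):
#   import nltk
#   nltk.download("wordnet")
#   lemmatize_words(["books", "geese"])      # nouns: "book", "goose"
#   lemmatize_words(["better"], pos="a")     # adjective: "good"
```

Passing the right part of speech matters: with the default noun setting, a word such as "better" is returned unchanged.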
Part of speech tagging : Part-of-speech tagging assigns a part of speech (such as noun, verb, or adjective) to each word of a given text, based on both its definition and its context.
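Tagging can be sketched with nltk.pos_tag(), which takes a list of tokens and returns (token, tag) pairs. The helper name tag_tokens is hypothetical, and the snippet assumes the tagger model has already been downloaded; the resource name passed to nltk.download() depends on the NLTK version (for example "averaged_perceptron_tagger" in older releases).

```python
import nltk


def tag_tokens(tokens):
    """Return a list of (token, part-of-speech tag) pairs."""
    if not tokens:
        # Nothing to tag; this also avoids loading the tagger model.
        return []
    return nltk.pos_tag(tokens)


# Example usage (requires the tagger model to be installed):
#   tokens = "The quick brown fox jumps over the lazy dog".split()
#   for token, tag in tag_tokens(tokens):
#       print(token, tag)
```

By default the tags come from the Penn Treebank tag set, where for instance NN marks a singular noun and JJ an adjective.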
CONCLUSION : In this article we described the main steps of data preprocessing, such as normalization, tokenization, stemming, lemmatization, and part-of-speech tagging, using NLTK (Natural Language Toolkit).