NLP: The Basics
Natural language processing (NLP) is aimed at the processing and understanding of human language by machines. Since machines don't see words the way humans do, words are first converted into some numerical representation. Various techniques are available to do so, and I provide a simple introduction to a few of them in this article.
For the purpose of this introduction, the text corpus used is some content from my previous article. A corpus is assumed to be composed of multiple documents (just a simple sentence in this case).
Preprocessing Techniques -
These are the techniques used to clean up the data corpus before it can be processed.
- Tokenization -
Tokenization is the process of breaking up a document into individual tokens (or a sentence into words in this case). I use the word tokenizer from the NLTK library for this demonstration.
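Here is a minimal sketch of tokenization with NLTK's word tokenizer. Since the original corpus isn't reproduced here, the sentence below is a hypothetical stand-in.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

# Hypothetical stand-in sentence for the corpus
document = "Stemming reduces the differences between related words."

tokens = word_tokenize(document)
print(tokens)
# ['Stemming', 'reduces', 'the', 'differences', 'between', 'related', 'words', '.']
```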
- Stemming -
Stemming is the process of reducing derived words to their common stem. It is somewhat crude, in that it simply cuts off word prefixes and suffixes with the aim of achieving the shortest possible representation.
The most common stemming algorithm is Porter’s algorithm, which is supported by NLTK as shown below. The word ‘differences’ becomes ‘differ’ and ‘possibilities’ becomes ‘possibl’. Notice the spelling of ‘possibl’: this happens with stemming because it just follows rules to trim the suffix. Also note that NLTK’s Porter stemmer only removes suffixes.
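A minimal sketch of Porter stemming with NLTK, reproducing the two examples above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["differences", "possibilities"]:
    print(word, "->", stemmer.stem(word))
# differences -> differ
# possibilities -> possibl
```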
- Lemmatization -
Lemmatization reduces words derived from a common lemma to the same form, i.e. it converts words into their dictionary form. Consider the example below, where ‘differences’ becomes ‘difference’ and ‘possibilities’ becomes ‘possibility’.
Unlike stemming, lemmatization doesn't simply cut off prefixes and suffixes; it does so with knowledge of the vocabulary and of which part of speech we are referring to (say, noun or adjective).
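A minimal sketch using NLTK's WordNet lemmatizer; its default part of speech is noun, which is what the two examples above need:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lemma dictionary, downloaded once

lemmatizer = WordNetLemmatizer()

# The default part of speech is noun; pass pos="v" for verbs, etc.
for word in ["differences", "possibilities"]:
    print(word, "->", lemmatizer.lemmatize(word))
# differences -> difference
# possibilities -> possibility
```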
- Stop word removal -
Stop words are words like ‘a’, ‘and’, ‘the’, etc. which serve mainly as syntactic elements of grammar. These words are common to all documents and don't really convey any special meaning. Furthermore, they can actually decrease an NLP model’s performance. Stop word removal is the technique of removing these words from documents.
NLTK’s English stop word list consists of 127 such common words. Stop word removal applied to our dataset gives the result shown below -
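A minimal sketch of stop word removal with NLTK, again on a hypothetical stand-in sentence (note that the exact size of the stop word list varies across NLTK versions):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))

# Hypothetical stand-in sentence
document = "Stemming reduces the differences between related words."
tokens = word_tokenize(document)

# Keep only tokens that are not stop words ('the' and 'between' are dropped)
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
# ['Stemming', 'reduces', 'differences', 'related', 'words', '.']
```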
Feature extraction techniques -
- Count vectorization -
Count vectorization is a technique where the occurrences of each unique word are counted, and the document (or sentence in this case) is encoded based on those counts.
For example, the encoding of this corpus using the count vectorizer from sklearn is shown below.
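A minimal sketch with sklearn's CountVectorizer on a hypothetical two-document corpus (the feature-name accessor shown assumes sklearn 1.0 or later):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for the original
corpus = [
    "natural language processing is interesting",
    "processing language opens interesting possibilities",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix of counts

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())  # one row of word counts per document
```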
- Hashing vectorizer -
The hashing vectorizer converts words to indices using a hashing function and then updates the count stored at each of these indices. As a direct result, it is not possible to recover the feature (word) names from the encoding. On the other hand, because any word can be hashed to a fixed set of indices, this technique can handle a potentially unlimited vocabulary.
In this specific example using sklearn’s hashing vectorizer, the counts are further normalized using an l2 norm. I have also limited the number of features to 16.
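A minimal sketch with sklearn's HashingVectorizer using the settings described above; alternate_sign=False is an extra choice made here so the values are plain counts before normalization:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical two-document corpus standing in for the original
corpus = [
    "natural language processing is interesting",
    "processing language opens interesting possibilities",
]

# 16 hash buckets and l2-normalized rows, matching the example in the text
vectorizer = HashingVectorizer(n_features=16, norm="l2", alternate_sign=False)
X = vectorizer.transform(corpus)  # no fit step: the hash function is stateless

print(X.toarray())  # feature names cannot be recovered from these columns
```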
- Tf-idf vectorization technique -
Tf-idf stands for ‘term frequency-inverse document frequency’. As the name suggests, the value of a word feature is proportional to its frequency in a document and to its inverse frequency across the other documents. That is, a word with a high tf-idf score should be a characteristic identifier of a document: frequent in that document, while rare in all the others. The idea behind this is that common words like ‘a’, ‘and’, ‘the’, etc. get a low score, as they don’t really reflect an interesting or characteristic feature of a document.
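In its textbook form, the score of a term t in a document d is tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t. A minimal sketch with sklearn's TfidfVectorizer, which by default applies a smoothed idf and l2-normalizes each row, on the same hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus standing in for the original
corpus = [
    "natural language processing is interesting",
    "processing language opens interesting possibilities",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Terms shared by both documents (e.g. 'language', 'processing') receive a
# lower idf weight than terms unique to a single document.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(3))
```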
Next time we'll take a look at fancier feature extraction techniques like Word2Vec.
Thank you for reading this article; I hope you found it useful. Find the link to the Jupyter notebook here. Also, please let me know if you have anything to add.