NLP : The basics

Natural language processing (NLP) is aimed at the processing and understanding of human language by machines. Since machines don't see words the way humans do, the words must be converted into some numerical representation. Various techniques are available to do so, and I provide a simple introduction to a few of them in this article.

For the purpose of this introduction, the text corpus is some content from my previous article. A corpus is assumed to consist of multiple documents (here, just a single sentence).


Preprocessing Techniques -

These are the techniques used to clean up the data corpus before it can be processed.

  • Tokenization -

Tokenization is the process of breaking up a document into individual tokens (or a sentence into words in this case). I use the word tokenizer from the NLTK library for this demonstration.

  • Stemming - 

Stemming is the process of reducing derived words to their common stem. It is a somewhat “raw” technique in that it simply cuts off word prefixes and suffixes, aiming for the shortest possible representation.

The most common stemming algorithm is Porter’s algorithm, which NLTK supports. The word ‘differences’ becomes ‘differ’ and ‘possibilities’ becomes ‘possibl’. Notice the spelling of ‘possibl’: stemming just follows rules to strip the suffix, so the result need not be a real word. Also note that the Porter stemmer in NLTK only removes suffixes.

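A quick sketch with NLTK's PorterStemmer, reproducing the two words discussed above plus one more for illustration:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["differences", "possibilities", "running"]:
    print(word, "->", stemmer.stem(word))
# differences -> differ
# possibilities -> possibl
# running -> run
```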
  • Lemmatization - 

Lemmatization reduces words derived from a common lemma to the same form, i.e. it converts words into their dictionary form. For example, ‘differences’ becomes ‘difference’ and ‘possibilities’ becomes ‘possibility’.

Lemmatization doesn't simply cut off prefixes and suffixes; it uses knowledge of the vocabulary and of the part of speech in question (say, a noun or an adjective).

  • Stop word removal -

Stop words are words like ‘a’, ‘and’, ‘the’, etc., which serve mainly as grammatical glue. These words are common to all documents and don't really convey a special meaning; worse, they can actually degrade an NLP model’s performance. Stop word removal is the technique of filtering these words out of documents.

NLTK’s English stop word list consists of 127 such common words. Applying stop word removal to our dataset filters these words out of each document.


Feature extraction techniques -

  • Count vectorization - 

Count vectorization is a technique where the occurrences of each unique word are counted, and the document (or sentence in this case) is encoded as a vector of those counts.

For example, the corpus can be encoded using the count vectorizer from sklearn.

  • Hashing vectorizer -

The hashing vectorizer converts words to indices using a hashing function, and then updates the count at each index. As a direct result, it is not possible to recover the feature (word) names from an encoding. Since the hash function maps words into a fixed number of indices, this technique can handle a potentially unlimited vocabulary, at the cost of occasional collisions where two different words map to the same index.

In this specific example using sklearn’s hashing vectorizer, the counts are further normalized using the l2 norm, and I have limited the number of features to 16.

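A sketch with sklearn's HashingVectorizer on the same hypothetical corpus, mirroring the settings described above (16 features, l2 normalization). Note that sklearn's implementation also applies an alternating sign to hashed values, so entries can be negative:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical two-document corpus
corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = HashingVectorizer(n_features=16, norm="l2")
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (2, 16)
# After l2 normalization every row has unit length
print(np.linalg.norm(X.toarray(), axis=1))  # [1. 1.]
```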
  • Tf-idf vectorization technique -

Tf-idf stands for ‘term frequency-inverse document frequency’. As the name suggests, a word’s score is proportional to its frequency in a document and to its inverse frequency across the other documents. That is, a word with a high tf-idf score should be a characteristic identifier of a document: frequent in that document while having low frequencies in all other documents. The idea behind this is that common words like ‘a’, ‘and’, ‘the’, etc. get a low score, as they don’t really reflect an interesting or characteristic feature of any particular document.


Next time, we'll take a look at fancier feature extraction techniques like Word2Vec.

Thank you for reading this article; I hope you found it useful. You can find the link to the Jupyter notebook here. Please let me know if you have anything to add.
