Stopwords: Important for the Language, Not So Much in NLP

What is NLP?

Language is as important to humans as food. Hence, with the growth of Artificial Intelligence (AI), one of its components - Natural Language Processing (NLP) - is also growing rapidly, as the field enables a computer program to understand human language as it is spoken.

NLP applications are difficult to develop because computers expect to be addressed in a programming language, which is precise and unambiguous. Human speech, however, is not always precise -- it is often ambiguous, and its linguistic structure can depend on many complex variables, including slang, regional dialects and social context.

NLP can be used to interpret free text and make it analysable. There is a tremendous amount of information stored in free text files, like patients' medical records, for example. This information was inaccessible to computer-assisted analysis and could not be analysed in any kind of systematic way. But NLP allows analysts to sift through massive troves of free text to find relevant information in the files.

Stopwords


When dealing with a text problem in NLP, it is the words that make the text valuable that need to be evaluated.

Before introducing stopwords, let's look at an example. Consider a sentence-

“Jawaharlal Nehru was the first prime minister of India.”

Now consider another sentence-

“Jawaharlal Nehru first prime minister India.”

These two sentences have the same intent, but the latter is missing words like ‘was’, ‘the’ and ‘of’. Text processing invariably requires that some words in the source corpus be removed before moving on to more complex tasks (such as keyword extraction, summarization and topic modelling).

Stopwords are common words that are present in the text but generally do not contribute to the meaning of a sentence. They hold almost no importance for the purposes of information retrieval and natural language processing. They can safely be ignored without sacrificing the meaning of the sentence. For example – ‘the’ and ‘a’.

What to remove and what not to remove?

Stopwords are usually thought of as "the most common words in a language". It clearly makes sense to consider eliminating stop words if the task is based on word frequencies.
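This is easy to see by counting token frequencies. The following sketch uses Python's collections.Counter on a made-up sentence (the text and counts are purely illustrative):

```python
from collections import Counter

# Count word frequencies in a short sample text (invented for illustration)
text = ("the cat sat on the mat and the dog sat by the door "
        "and the cat and the dog stared at the wall")
counts = Counter(text.split())

# The most frequent tokens are stop words ('the', 'and'),
# not the content words ('cat', 'dog', 'mat')
print(counts.most_common(3))
```

Stop words swamp the top of any raw frequency count, which is exactly why frequency-based tasks usually drop them first.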

If the concern is the context of the text (e.g. sentiment analysis), it might make sense to treat such words differently. For example, ‘not’ is usually included as a stop word, but when the context of the text matters, negation changes the so-called valence of the text. This needs to be treated carefully and is usually not trivial.

Considering example for ‘not’-

1). LinkedIn is helpful => LinkedIn helpful
2). LinkedIn is not helpful => LinkedIn helpful

The second sentence loses its meaning when ‘not’ is removed. So here, removing ‘not’ is harmful.
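The collapse can be sketched in a few lines of plain Python; the tiny stop list and the strip_stops helper below are hypothetical, for illustration only:

```python
# A tiny hand-picked stop list (for illustration only; real lists are far longer)
stop_words = {'is', 'not', 'the', 'a'}

def strip_stops(sentence, stops):
    """Lowercase, whitespace-tokenize, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in stops]

pos = "LinkedIn is helpful"
neg = "LinkedIn is not helpful"

# With 'not' in the stop list, the two sentences become indistinguishable
print(strip_stops(pos, stop_words))
print(strip_stops(neg, stop_words))

# Keeping 'not' preserves the negation
print(strip_stops(neg, stop_words - {'not'}))
```

Both sentences reduce to the same tokens unless ‘not’ is excluded from the stop list.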

How to remove?

In Python, there are packages that can be used to remove stopwords from text, such as “NLTK”, “spaCy” and “Stanford NLP”.

If the task is something like sentiment analysis, one needs to build a custom corpus of the words to be removed, called a stop list, and then remove those words from the text.
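As a minimal sketch of such a custom stop list, assuming a hand-copied fragment of a typical English list (not the full NLTK list), negation words can simply be subtracted from the standard set:

```python
# A hand-copied subset of a typical English stop list (illustrative only)
standard_stops = {'i', 'me', 'the', 'a', 'is', 'was', 'not', 'no', 'nor'}

# For sentiment-style tasks, keep negations: subtract them from the standard list
negations = {'not', 'no', 'nor'}
sentiment_stops = standard_stops - negations

tokens = ['this', 'movie', 'is', 'not', 'good']
filtered = [t for t in tokens if t not in sentiment_stops]
print(filtered)
```

The negation survives filtering, so a downstream sentiment model can still see that the review is negative.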

Removing using ‘NLTK’ -


Installation:

pip install nltk

Importing Library:

import nltk

nltk.download('stopwords')

Check pre-defined stop words:

nltk_stopwords = nltk.corpus.stopwords.words('english')

print('First ten stop words: %s' % list(nltk_stopwords)[:10])

Output-

First ten stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

One can change the language from English to another. To see the languages NLTK offers, the commands are -

from nltk.corpus import stopwords
print(stopwords.fileids())

Output-

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

Remove stop words

To remove stopwords, the text first needs to be tokenized using a tokenizer from the NLTK package.

But what is meant by tokenization?

Tokenization means splitting your text into minimal meaningful units. It is a mandatory step before any kind of processing.

Consider a sentence-

sentence = 'Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list.'

Now tokenizing the sentence-

tokens = nltk.tokenize.word_tokenize(sentence)
print(tokens)

Output-

['Though', "''", 'stop', 'words', "''", 'usually', 'refers', ‘to’, ‘the’, 'common', 'words', ‘in’, ‘a’, 'language', ',', ‘there’, ‘is’, ‘no’, 'single', 'universal', 'list', ‘of’, 'stop', 'words', 'used', ‘by’, ‘all’, 'natural', 'language', 'processing', 'tools', ',', 'indeed', ‘not’, ‘all’,  'tools', 'even', 'use', ‘such’, ‘a’, 'list', '.']

Removing the stopwords from the text-

tokens = [token for token in tokens if token not in nltk_stopwords]

print(tokens)

Output-

['Though', "''", 'stop', 'words', "''", 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'used', 'natural', 'language', 'processing', 'tools', ',', 'indeed', 'tools', 'even', 'use', 'list', '.']

We can see that words such as ‘to’, ‘the’ and ‘all’ are removed because they appear in the NLTK stop word list. Note that the comparison is case-sensitive and NLTK’s stop words are all lowercase, so capitalised tokens like ‘Though’ survive.
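To avoid the case-sensitivity pitfall, one can lowercase each token before the membership test. A minimal sketch, using a hand-copied fragment of the stop list (an assumption, not NLTK's full list):

```python
# NLTK's stop words are lowercase, so compare lowercased tokens.
# stop_subset is a hand-copied fragment of a typical English list (assumption).
stop_subset = {'though', 'the', 'to', 'in', 'a', 'is'}

tokens = ['Though', 'stop', 'words', 'The', 'list', 'is', 'short']
filtered = [t for t in tokens if t.lower() not in stop_subset]
print(filtered)
```

Here both ‘Though’ and ‘The’ are dropped despite their capital letters, which the plain membership test above would have missed.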

Removing using ‘spaCy’



Installation-

pip install spacy

Import library-

import spacy

spacy_nlp = spacy.load('en_core_web_sm')

spaCy also supports many languages (see the documentation: https://spacy.io/usage).

Check pre-defined stop words-

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Output-

First ten stop words: ['in', 'yourself', 'becoming', 'never', 'something', 'ten', 'ca', 'they', 'used', 'everyone']

Remove stop words-

doc = spacy_nlp(sentence)

tokens = [token.text for token in doc if not token.is_stop]

print(tokens)

Output-

['Though', '"', 'stop', 'words', '"', 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'natural', 'language', 'processing', 'tools', ',', 'tools', 'use', 'list', '.']

We can see that words such as ‘used’, ‘indeed’ and ‘even’ are removed, since every token whose is_stop attribute is True is filtered out.

Removing using ‘Stanford NLP’


Stanford NLP provides a stop word list which can be used as a corpus with NLTK, after which stopwords can be removed as above. The list can be viewed here-

https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt
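Since the Stanford list is a plain text file with one word per line, it can be loaded into a Python set and used directly for filtering. The sketch below simulates the file's contents with io.StringIO; the three entries are invented, and in practice one would open the downloaded stopwords.txt instead:

```python
import io

# Simulated contents of a one-word-per-line stop word file such as
# Stanford's stopwords.txt (these three entries are just an example)
fake_file = io.StringIO("a\nan\nthe\n")

# In practice: with open('stopwords.txt') as f: stanford_stops = {...}
stanford_stops = {line.strip() for line in fake_file if line.strip()}

tokens = ['the', 'quick', 'brown', 'fox']
filtered = [t for t in tokens if t not in stanford_stops]
print(filtered)
```

Loading the list into a set gives constant-time membership checks, so filtering stays fast even on large corpora.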

Conclusion

To conclude, stop word elimination is a simple but important aspect of many text mining applications as it has the following advantages:

Reduces memory overhead (since fewer words need to be stored and processed)
Reduces noise and false positives (since we are focusing on the more important terms)
Can potentially improve power of prediction (this is dependent on the application)

A point to note is that, while many applications benefit from removing stop words, some see no added advantage beyond the fact that removal makes analysis or lookup faster and reduces overall memory requirements.

End Notes-

I hope this article has been useful for building an understanding of what stopwords are, when to remove them and when not to, how to remove them, and why they matter.

Did you find this article helpful? Please share your opinions/thoughts in the comments.

Regards!!!
