Stopwords: Important for the Language, Not So Much in NLP
Sunakshi Mamgain
Senior Manager Content Strategy at Great Learning | Ex-upGrad | Data Science | NLP
What is NLP?
Language is as important to humans as food. Hence, with the growth of Artificial Intelligence (AI), one of its components, Natural Language Processing (NLP), is also growing rapidly, since the field enables a computer program to understand human language as it is spoken.
NLP applications are difficult to develop because computers expect a programming language, which is precise and unambiguous. Human speech, however, is not always precise -- it is often ambiguous, and its linguistic structure can depend on many complex variables, including slang, regional dialects and social context.
NLP can be used to interpret free text and make it analysable. A tremendous amount of information is stored in free text files, such as patients' medical records. This information used to be inaccessible to computer-assisted analysis and could not be analysed in any systematic way, but NLP now allows analysts to sift through massive troves of free text to find the relevant information in the files.
Stopwords
When dealing with a text problem in NLP, it is the words that make the text valuable that need to be identified.
Before formally introducing stopwords, let's see an example. Consider the sentence-
“Jawaharlal Nehru was the first prime minister of India.”
Now consider another sentence-
“Jawaharlal Nehru first prime minister India.”
These two sentences have the same intent, but the latter is missing words like ‘was’, ‘the’ and ‘of’. Text processing invariably requires that some words in the source corpus be removed before moving on to more complex tasks (such as keyword extraction, summarization and topic modelling).
Stopwords are common words that are present in the text but generally do not contribute to the meaning of a sentence. They hold almost no importance for the purposes of information retrieval and natural language processing. They can safely be ignored without sacrificing the meaning of the sentence. For example – ‘the’ and ‘a’.
What to remove and what not to remove?
Stopwords are usually thought of as "the most common words in a language". It clearly makes sense to consider eliminating stopwords if the task is based on word frequencies.
If, however, the concern is with the context of the text (e.g. sentiment analysis), it might make sense to treat some of these words differently. For example, ‘not’ is included in most stop word lists, but negation changes the so-called valence of a text. It needs to be treated carefully, and this is usually not trivial.
Consider an example with ‘not’-
1). LinkedIn is helpful => LinkedIn helpful
2). LinkedIn is not helpful => LinkedIn helpful
By removing ‘not’, the second sentence loses its meaning entirely. So here removing ‘not’ is not useful.
How to remove?
In Python, there are packages that can be used to remove stopwords from text, such as NLTK, spaCy and Stanford NLP.
If the task is something like sentiment analysis, one needs to build a separate corpus of the words that are actually safe to remove, called a stop list, and then remove only those words from the text.
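As a quick sketch of such a custom stop list (the variable names here are just for illustration), one could start from NLTK's English list but keep negation words, so the sentiment-bearing structure of the text survives:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Keep negations out of the stop list, since they carry sentiment
negations = {'not', 'no', 'nor'}
custom_stopwords = set(stopwords.words('english')) - negations
text = 'LinkedIn is not helpful'
print([w for w in text.lower().split() if w not in custom_stopwords])
# ['linkedin', 'not', 'helpful'] - 'is' is dropped but 'not' is preserved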
Removing using ‘NLTK’ -
Installation:
pip install nltk
Importing Library:
import nltk
nltk.download('stopwords')
Check pre-defined stop words:
nltk_stopwords = nltk.corpus.stopwords.words('english')
print('First ten stop words: %s' % list(nltk_stopwords)[:10])
Output-
First ten stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
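One can also check the total size of the list (the exact count depends on the NLTK version) and test membership directly:
print('Number of stop words: %d' % len(nltk_stopwords))
print('not' in nltk_stopwords)   # True - note that negations are included by default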
One can also change the language from English to another language. To see which languages NLTK offers stopword lists for, the command is -
nltk.corpus.stopwords.fileids()
Output-
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
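Any of these works the same way; for example, to load the French list instead (shown here only as a sketch):
french_stopwords = nltk.corpus.stopwords.words('french')
print('First ten French stop words: %s' % french_stopwords[:10])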
Remove stop words
To remove stopwords, the text first needs to be tokenized using a tokenizer from the NLTK package.
But what is meant by tokenization?
Tokenization means splitting your text into minimal meaningful units. It is a mandatory step before any kind of processing.
Consider a sentence-
sentence = 'Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list.'
Now tokenizing the sentence-
nltk.download('punkt')   # tokenizer model used by word_tokenize
tokens = nltk.tokenize.word_tokenize(sentence)
print(tokens)
Output-
['Though', '``', 'stop', 'words', "''", 'usually', 'refers', 'to', 'the', 'most', 'common', 'words', 'in', 'a', 'language', ',', 'there', 'is', 'no', 'single', 'universal', 'list', 'of', 'stop', 'words', 'used', 'by', 'all', 'natural', 'language', 'processing', 'tools', ',', 'and', 'indeed', 'not', 'all', 'tools', 'even', 'use', 'such', 'a', 'list', '.']
(Note that NLTK's tokenizer converts the opening and closing double quotes into '``' and "''".)
Removing the stopwords from the text-
tokens = [token for token in tokens if token not in nltk_stopwords]
print(tokens)
Output-
['Though', '``', 'stop', 'words', "''", 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'used', 'natural', 'language', 'processing', 'tools', ',', 'indeed', 'tools', 'even', 'use', 'list', '.']
We can see that ‘is’, ‘all’, ‘to’ etc. have been removed by checking whether each token appears in the NLTK stop word list.
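One caveat: the NLTK list is all lowercase, so a capitalised word such as ‘Though’ or ‘The’ slips through the check above. A minimal variant that lowercases each token first:
example = nltk.tokenize.word_tokenize('The cat sat on the mat')
print([t for t in example if t.lower() not in nltk_stopwords])
# ['cat', 'sat', 'mat'] - 'The' no longer survives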
Removing using ‘spaCy’
Installation-
pip install spacy
python -m spacy download en_core_web_sm
Import library-
import spacy
spacy_nlp = spacy.load('en_core_web_sm')
spaCy also supports many other languages (see the documentation: https://spacy.io/usage)
Check pre-defined stop words-
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('First ten stop words: %s' % list(spacy_stopwords)[:10])
Output-
First ten stop words: ['in', 'yourself', 'becoming', 'never', 'something', 'ten', 'ca', 'they', 'used', 'everyone']
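Note that STOP_WORDS is an unordered set, so the "first ten" printed above may differ between runs and spaCy versions; its length gives the total count:
print('Number of stop words: %d' % len(spacy_stopwords))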
Remove stop words-
doc = spacy_nlp(sentence)
tokens = [token.text for token in doc if not token.is_stop]
print(tokens)
Output-
['Though', '"', 'stop', 'words', '"', 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'natural', 'language', 'processing', 'tools', ',', 'tools', 'use', 'list', '.']
We can see that ‘is’, ‘used’, ‘to’ etc. are removed, this time by checking whether each token's is_stop attribute is True.
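spaCy also lets you extend the list with your own stop words. A small sketch using spaCy's API for this (‘btw’ is just an example token):
# Mark a custom word as a stop word, both in the defaults and in the vocabulary
spacy_nlp.Defaults.stop_words.add('btw')
spacy_nlp.vocab['btw'].is_stop = True
doc = spacy_nlp('btw spaCy is helpful')
print([token.text for token in doc if not token.is_stop])   # ['spaCy', 'helpful']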
Removing using ‘Stanford NLP’
Stanford NLP provides a stop word list of its own, which can be downloaded and used as a plain-text corpus for NLTK so that stopwords can be removed in the same way.
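A minimal sketch of how such a plain-text list could be used once downloaded, assuming one word per line in a hypothetical file named stanford_stopwords.txt:
# Load a plain-text stop word list (one word per line); the filename is hypothetical
with open('stanford_stopwords.txt') as f:
    stanford_stopwords = set(line.strip() for line in f if line.strip())
tokens = nltk.tokenize.word_tokenize(sentence)
print([token for token in tokens if token.lower() not in stanford_stopwords])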
Conclusion
To conclude, stop word elimination is a simple but important aspect of many text mining applications, as it has the following advantages:
Reduces memory overhead (since a portion of the words under consideration is eliminated)
Reduces noise and false positives (since the focus shifts to the more important terms)
Can potentially improve predictive power (though this depends on the application)
A point to note is that, while many applications benefit from the removal of stop words, some see no added advantage beyond the fact that it makes analysis or lookup much faster and reduces overall memory requirements.
End Notes-
I hope this article has been useful for building an understanding of what stopwords are, when to remove them and when not to, how to remove them, and why they matter.
Did you find this article helpful? Please share your opinions/thoughts in the comments.
Regards!!!