Text preprocessing with Natural Language Processing (NLP)

Text preprocessing is an essential step in natural language processing (NLP) that involves transforming raw text into a more suitable format for analysis and modeling. In this blog post, I will show you some common text preprocessing techniques and how to implement them in Python using the popular NLTK library.

Why do we need text preprocessing?

Text data can come from various sources, such as websites, social media, emails, books, etc. Each source may have different characteristics, such as language, style, format, spelling, grammar, etc. Moreover, text data often contains noise, such as punctuation, numbers, symbols, emoticons, abbreviations, etc. These factors can make text data messy and inconsistent, which can affect the performance and accuracy of NLP models.

Text preprocessing aims to remove or reduce the noise and variability in text data and make it more uniform and structured. This can help NLP models to focus on the meaningful and relevant information in the text and improve their efficiency and effectiveness.

What are some common text preprocessing techniques?

There are many text preprocessing techniques that can be applied depending on the type and purpose of the text data. Some of the most common ones are:

  • Tokenization: This is the process of breaking down text into smaller units called tokens. Tokens can be words, sentences, paragraphs, etc. Tokenization helps to split text into meaningful segments that can be easily processed by NLP models.
  • Normalization: This is the process of converting text into a standard or common form. One of its simplest forms is case conversion, the process of changing all letters in the text to lower (or upper) case, which reduces variability and makes the text more consistent.
  • Stemming: This is the process of reducing words to their root or base form by removing suffixes. For example, "running", "runs", and "ran" can be stemmed to "run". Stemming helps to reduce the number of words in text and simplify the vocabulary.
  • Lemmatization: This is the process of reducing words to their canonical or dictionary form by considering their part of speech and context. For example, "is", "are", and "were" can be lemmatized to "be". Lemmatization is similar to stemming but more accurate and sophisticated.
  • Stopword removal: This is the process of removing words that are very common and do not add much meaning or information to the text. For example, "the", "a", "and", etc. Stopword removal helps to reduce the noise and size of text and focus on the important words.
  • Punctuation removal: This is the process of removing punctuation marks from text, such as commas, periods, question marks, etc. Punctuation removal helps to eliminate unnecessary symbols and make text more clean and simple.
  • Spelling correction: This is the process of correcting spelling errors or typos in text. Spelling correction helps to improve the quality and readability of text and avoid confusion or misunderstanding. A minimal example follows this list.
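
As a quick illustration of the last point, here is a minimal spelling-correction sketch using TextBlob (which we import below). The sample sentence is made up for illustration, and TextBlob's correct() method is frequency-based, so it can over-correct domain-specific terms; treat this as a starting point rather than a production solution.

# Spelling correction with TextBlob (a minimal sketch)
from textblob import TextBlob

blob = TextBlob("Tokenizaton is an esential preprocesing step.")
print(blob.correct())  # attempts to replace each misspelled word with its most likely correction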

How to implement text preprocessing in Python using NLTK?

First, we need to import some libraries that will help us with text processing. We will use NLTK, a popular NLP library for Python, and spaCy, a newer and faster NLP library that also supports neural network models. We will also use Gensim, a library for topic modeling and text summarization, and TextBlob, a library for sentiment analysis and text translation.

# Importing libraries
import pandas as pd
import nltk
import spacy
import gensim
import textblob
from nltk.stem import WordNetLemmatizer
from textblob import Word, TextBlob
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')        
# Downloading NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('vader_lexicon', quiet=True)        

After downloading the NLTK data, the output is as follows:

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /usr/share/nltk_data...
[nltk_data]   Package words is already up-to-date!
True        

Next, we need to load some text data that we want to process. For this example, I will use the abstract of one of my scientific papers on inulinase production (the full reference is given below the code). You can use any text data you want, as long as it is in English.

# Loading the text data
text = """Inulinases are used for the production of high-fructose syrup and fructooligosaccharides, and are widely utilized in food and pharmaceutical industries. In this study, different carbon sources were screened for inulinase production by Aspergillus niger in shake flask fermentation. Optimum working conditions of the enzyme were determined. Additionally, some properties of produced enzyme were determined [activation (Ea)/inactivation (Eia) energies, Q10 value, inactivation rate constant (kd), half-life (t1/2), D value, Z value, enthalpy (ΔH), free energy (ΔG), and entropy (ΔS)]. Results showed that sugar beet molasses (SBM) was the best in the production of inulinase, which gave 383.73 U/mL activity at 30 °C, 200 rpm and initial pH 5.0 for 10 days with 2% (v/v) of the prepared spore solution. Optimum working conditions were 4.8 pH, 60 °C, and 10 min, which yielded 604.23 U/mL, 1.09 inulinase/sucrase ratio, and 2924.39 U/mg. Additionally, Ea and Eia of inulinase reaction were 37.30 and 112.86 kJ/mol, respectively. Beyond 60 °C, Q10 values of inulinase dropped below one. At 70 and 80 °C, t1/2 of inulinase was 33.6 and 7.2 min; therefore, inulinase is unstable at high temperatures, respectively. Additionally, t1/2, D, ΔH, ΔG values of inulinase decreased with the increase in temperature. Z values of inulinase were 7.21 °C. Negative values of ΔS showed that enzymes underwent a significant process of aggregation during denaturation. Consequently, SBM is a promising carbon source for inulinase production by A. niger. Also, this is the first report on the determination of some properties of A. niger A42 (ATCC 204,447) inulinase."""        

The text used for the variable above is the abstract of one of my scientific papers: Germec, M., & Turhan, I. (2019). Evaluation of carbon sources for the production of inulinase by Aspergillus niger A42 and its characterization. Bioprocess and Biosystems Engineering, 42, 1993-2005.

Standardization of letters

The following code snippet standardizes letter case. With a single line, every character in the text is converted to lowercase, so that 'S' and 's', or 'T' and 't', become the same character. It may look like a small change, but it is a big step toward consistency: without it, 'Inulinase' and 'inulinase' would be counted as two different words.

# standardization of letters
text = text.lower()
text        

The output of the above code is as follows:

'inulinases are used for the production of high-fructose syrup and fructooligosaccharides, and are widely utilized in food and pharmaceutical industries. in this study, different carbon sources were screened for inulinase production by aspergillus niger in shake flask fermentation. optimum working conditions of the enzyme were determined. additionally, some properties of produced enzyme were determined [activation (ea)/inactivation (eia) energies, q10 value, inactivation rate constant (kd), half-life (t1/2), d value, z value, enthalpy (δh), free energy (δg), and entropy (δs)]. results showed that sugar beet molasses (sbm) was the best in the production of inulinase, which gave 383.73 u/ml activity at 30 °c, 200 rpm and initial ph 5.0 for 10 days with 2% (v/v) of the prepared spore solution. optimum working conditions were 4.8 ph, 60 °c, and 10 min, which yielded 604.23 u/ml, 1.09 inulinase/sucrase ratio, and 2924.39 u/mg. additionally, ea and eia of inulinase reaction were 37.30 and 112.86 kj/mol, respectively. beyond 60 °c, q10 values of inulinase dropped below one. at 70 and 80 °c, t1/2 of inulinase was 33.6 and 7.2 min; therefore, inulinase is unstable at high temperatures, respectively. additionally, t1/2, d, δh, δg values of inulinase decreased with the increase in temperature. z values of inulinase were 7.21 °c. negative values of δs showed that enzymes underwent a significant process of aggregation during denaturation. consequently, sbm is a promising carbon source for inulinase production by a. niger. also, this is the first report on the determination of some properties of a. niger a42 (atcc 204,447) inulinase.'        

Punctuation

Next, we remove punctuation using a regular expression. The pattern [^\w\s] matches any character that is neither a word character (letter, digit, or underscore) nor whitespace, and re.sub replaces each match with an empty string. The result is a cleaner text without commas, periods, parentheses, and other symbols; note, however, that this also strips meaningful marks such as the degree sign and the slash in 'U/mL'.

import re
text = re.sub(r'[^\w\s]', '', text)
text        

The output of the above code is as follows:

'inulinases are used for the production of highfructose syrup and fructooligosaccharides and are widely utilized in food and pharmaceutical industries in this study different carbon sources were screened for inulinase production by aspergillus niger in shake flask fermentation optimum working conditions of the enzyme were determined additionally some properties of produced enzyme were determined activation eainactivation eia energies q10 value inactivation rate constant kd halflife t12 d value z value enthalpy δh free energy δg and entropy δs results showed that sugar beet molasses sbm was the best in the production of inulinase which gave 38373 uml activity at 30 c 200 rpm and initial ph 50 for 10 days with 2 vv of the prepared spore solution optimum working conditions were 48 ph 60 c and 10 min which yielded 60423 uml 109 inulinasesucrase ratio and 292439 umg additionally ea and eia of inulinase reaction were 3730 and 11286 kjmol respectively beyond 60 c q10 values of inulinase dropped below one at 70 and 80 c t12 of inulinase was 336 and 72 min therefore inulinase is unstable at high temperatures respectively additionally t12 d δh δg values of inulinase decreased with the increase in temperature z values of inulinase were 721 c negative values of δs showed that enzymes underwent a significant process of aggregation during denaturation consequently sbm is a promising carbon source for inulinase production by a niger also this is the first report on the determination of some properties of a niger a42 atcc 204447 inulinase'        

Numbers

Digits can be removed in the same way. The pattern \d matches any digit, and re.sub deletes every occurrence, leaving only letters, whitespace, and the remaining symbols. Whether this is a good idea depends on your data; as the note after the output explains, the numbers in this particular text are meaningful.

text = re.sub(r'\d', '', text)
text        

The output of the above code is as follows. A small note: the numbers in this text are significant (activities, temperatures, pH values), so think twice before removing digits from scientific text. A gentler alternative is sketched after the output.

'inulinases are used for the production of highfructose syrup and fructooligosaccharides and are widely utilized in food and pharmaceutical industries in this study different carbon sources were screened for inulinase production by aspergillus niger in shake flask fermentation optimum working conditions of the enzyme were determined additionally some properties of produced enzyme were determined activation eainactivation eia energies q value inactivation rate constant kd halflife t d value z value enthalpy δh free energy δg and entropy δs results showed that sugar beet molasses sbm was the best in the production of inulinase which gave  uml activity at  c  rpm and initial ph  for  days with  vv of the prepared spore solution optimum working conditions were  ph  c and  min which yielded  uml  inulinasesucrase ratio and  umg additionally ea and eia of inulinase reaction were  and  kjmol respectively beyond  c q values of inulinase dropped below one at  and  c t of inulinase was  and  min therefore inulinase is unstable at high temperatures respectively additionally t d δh δg values of inulinase decreased with the increase in temperature z values of inulinase were  c negative values of δs showed that enzymes underwent a significant process of aggregation during denaturation consequently sbm is a promising carbon source for inulinase production by a niger also this is the first report on the determination of some properties of a niger a atcc  inulinase'        
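
A gentler alternative (a sketch, not part of the pipeline above) is to replace each number with a placeholder token instead of deleting it, so a model still knows that a quantity was present. Apply it before the digit-removal step; the input string here is just a fragment of the abstract used for illustration.

# Replace numbers with a '<num>' placeholder instead of deleting them (hypothetical variant)
sample = 'which gave 383.73 uml activity at 30 c'
print(re.sub(r'\d+(?:\.\d+)?', '<num>', sample))
# 'which gave <num> uml activity at <num> c'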

Rare words

The next snippet removes rare words. It tokenizes the text, counts the frequency of each token with collections.Counter, and keeps only the tokens whose frequency meets a chosen threshold. This shrinks the vocabulary, but be aware that it also discards infrequent content words: with threshold=2, every word that appears only once is dropped.

from collections import Counter

def remove_rare_words(text, threshold=5):
    # Split the text into word tokens
    words = nltk.word_tokenize(text)
    # Count how many times each token occurs
    word_freq = Counter(words)
    # Keep only tokens that occur at least `threshold` times
    filtered_words = [word for word in words if word_freq[word] >= threshold]
    return ' '.join(filtered_words)

text = remove_rare_words(text, threshold=2)
text        

The output of the above code is as follows:

'are for the production of and and are in and in this carbon were for inulinase production by niger in optimum working conditions of the enzyme were determined additionally some properties of enzyme were determined eia q value t d value z value δh δg and δs showed that sbm was the in the production of inulinase which uml at c and ph for with of the optimum working conditions were ph c and min which uml and additionally and eia of inulinase were and respectively c q values of inulinase at and c t of inulinase was and min inulinase is at respectively additionally t d δh δg values of inulinase with the in z values of inulinase were c values of δs showed that a of sbm is a carbon for inulinase production by a niger this is the the of some properties of a niger a inulinase'        

Tokenization

With the cleaned text in hand, let's look at tokenization in more detail. Tokenization is the process of breaking down the text into smaller units called tokens, which can be words, punctuation marks, numbers, or symbols. Tokenization prepares the text for further analysis by splitting it into units that can be processed individually.

There are different ways to tokenize text in Python, but one of the simplest ways is to use the word_tokenize function from NLTK. This function splits the text into tokens based on whitespace and punctuation.

# Tokenize text using NLTK
tokens = nltk.word_tokenize(text)
print(tokens)        

The output of the above code is as follows:

['are', 'for', 'the', 'production', 'of', 'and', 'and', 'are', 'in', 'and', 'in', 'this', 'carbon', 'were', 'for', 'inulinase', 'production', 'by', 'niger', 'in', 'optimum', 'working', 'conditions', 'of', 'the', 'enzyme', 'were', 'determined', 'additionally', 'some', 'properties', 'of', 'enzyme', 'were', 'determined', 'eia', 'q', 'value', 't', 'd', 'value', 'z', 'value', 'δh', 'δg', 'and', 'δs', 'showed', 'that', 'sbm', 'was', 'the', 'in', 'the', 'production', 'of', 'inulinase', 'which', 'uml', 'at', 'c', 'and', 'ph', 'for', 'with', 'of', 'the', 'optimum', 'working', 'conditions', 'were', 'ph', 'c', 'and', 'min', 'which', 'uml', 'and', 'additionally', 'and', 'eia', 'of', 'inulinase', 'were', 'and', 'respectively', 'c', 'q', 'values', 'of', 'inulinase', 'at', 'and', 'c', 't', 'of', 'inulinase', 'was', 'and', 'min', 'inulinase', 'is', 'at', 'respectively', 'additionally', 't', 'd', 'δh', 'δg', 'values', 'of', 'inulinase', 'with', 'the', 'in', 'z', 'values', 'of', 'inulinase', 'were', 'c', 'values', 'of', 'δs', 'showed', 'that', 'a', 'of', 'sbm', 'is', 'a', 'carbon', 'for', 'inulinase', 'production', 'by', 'a', 'niger', 'this', 'is', 'the', 'the', 'of', 'some', 'properties', 'of', 'a', 'niger', 'a', 'inulinase']        
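
As noted in the technique list at the top, tokens can also be sentences. NLTK's sent_tokenize splits text on sentence boundaries, but it relies on the punctuation we removed earlier, so this sketch runs it on a short punctuated example instead of our cleaned text.

# Sentence tokenization (a sketch; requires text that still has punctuation)
sample = "Inulinases are widely used. In this study, carbon sources were screened."
print(nltk.sent_tokenize(sample))
# ['Inulinases are widely used.', 'In this study, carbon sources were screened.']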

SpaCy

Another way to tokenize text in Python is to use the spaCy library. SpaCy has a more advanced tokenizer that can handle complex cases such as contractions, hyphenated words, and emojis. To use spaCy, we need to load a language model first. For this example, I will use the en_core_web_sm model, which is a small English model that supports basic NLP tasks.

# Loading spaCy model
nlp = spacy.load('en_core_web_sm')

# Tokenize text using spaCy
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)        

The output of the above code is as follows:

['are', 'for', 'the', 'production', 'of', 'and', 'and', 'are', 'in', 'and', 'in', 'this', 'carbon', 'were', 'for', 'inulinase', 'production', 'by', 'niger', 'in', 'optimum', 'working', 'conditions', 'of', 'the', 'enzyme', 'were', 'determined', 'additionally', 'some', 'properties', 'of', 'enzyme', 'were', 'determined', 'eia', 'q', 'value', 't', 'd', 'value', 'z', 'value', 'δh', 'δg', 'and', 'δs', 'showed', 'that', 'sbm', 'was', 'the', 'in', 'the', 'production', 'of', 'inulinase', 'which', 'uml', 'at', 'c', 'and', 'ph', 'for', 'with', 'of', 'the', 'optimum', 'working', 'conditions', 'were', 'ph', 'c', 'and', 'min', 'which', 'uml', 'and', 'additionally', 'and', 'eia', 'of', 'inulinase', 'were', 'and', 'respectively', 'c', 'q', 'values', 'of', 'inulinase', 'at', 'and', 'c', 't', 'of', 'inulinase', 'was', 'and', 'min', 'inulinase', 'is', 'at', 'respectively', 'additionally', 't', 'd', 'δh', 'δg', 'values', 'of', 'inulinase', 'with', 'the', 'in', 'z', 'values', 'of', 'inulinase', 'were', 'c', 'values', 'of', 'δs', 'showed', 'that', 'a', 'of', 'sbm', 'is', 'a', 'carbon', 'for', 'inulinase', 'production', 'by', 'a', 'niger', 'this', 'is', 'the', 'the', 'of', 'some', 'properties', 'of', 'a', 'niger', 'a', 'inulinase']        

As you can see, the tokens are mostly the same as the ones from NLTK, except for some minor differences in punctuation handling.

Stopwords

After tokenizing the text, we can apply some other techniques to normalize the tokens. Normalization is the process of transforming the tokens into a standard form that is easier to compare and analyze. For example, we can convert all the tokens to lowercase, remove stopwords, stem the tokens, or lemmatize the tokens.

Stopwords are words that are very common and do not carry much meaning, such as 'the', 'a', 'and', etc. Removing stopwords can help reduce the noise and size of the text data. To remove stopwords, we can use the stopwords set from NLTK, which contains a list of English stopwords.

# Import stopwords from NLTK
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from tokens
filtered_tokens = [token for token in tokens if token not in stop_words]
print(filtered_tokens)        

The output of the above code is as follows:

['production', 'carbon', 'inulinase', 'production', 'niger', 'optimum', 'working', 'conditions', 'enzyme', 'determined', 'additionally', 'properties', 'enzyme', 'determined', 'eia', 'q', 'value', 'value', 'z', 'value', 'δh', 'δg', 'δs', 'showed', 'sbm', 'production', 'inulinase', 'uml', 'c', 'ph', 'optimum', 'working', 'conditions', 'ph', 'c', 'min', 'uml', 'additionally', 'eia', 'inulinase', 'respectively', 'c', 'q', 'values', 'inulinase', 'c', 'inulinase', 'min', 'inulinase', 'respectively', 'additionally', 'δh', 'δg', 'values', 'inulinase', 'z', 'values', 'inulinase', 'c', 'values', 'δs', 'showed', 'sbm', 'carbon', 'inulinase', 'production', 'niger', 'properties', 'niger', 'inulinase']        

Stemming

Stemming is the process of reducing the tokens to their root forms, which are not necessarily valid words. For example, the words 'running', 'runs', and 'run' can be stemmed to the root form 'run'. Stemming can help group together words that have similar meanings but different forms. To perform stemming, we can use the PorterStemmer class from NLTK, which implements a widely used stemming algorithm.

# Importing PorterStemmer from NLTK
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# Stem tokens using PorterStemmer
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)        

The output of the above code is as follows:

['product', 'carbon', 'inulinas', 'product', 'niger', 'optimum', 'work', 'condit', 'enzym', 'determin', 'addit', 'properti', 'enzym', 'determin', 'eia', 'q', 'valu', 'valu', 'z', 'valu', 'δh', 'δg', 'δs', 'show', 'sbm', 'product', 'inulinas', 'uml', 'c', 'ph', 'optimum', 'work', 'condit', 'ph', 'c', 'min', 'uml', 'addit', 'eia', 'inulinas', 'respect', 'c', 'q', 'valu', 'inulinas', 'c', 'inulinas', 'min', 'inulinas', 'respect', 'addit', 'δh', 'δg', 'valu', 'inulinas', 'z', 'valu', 'inulinas', 'c', 'valu', 'δs', 'show', 'sbm', 'carbon', 'inulinas', 'product', 'niger', 'properti', 'niger', 'inulinas']        
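
Porter is not the only stemmer in NLTK. The Snowball stemmer (also known as Porter2) and the more aggressive Lancaster stemmer are common alternatives; this small sketch compares all three on a few words from our text so you can see how the stems differ.

# Comparing NLTK stemmers (a sketch)
from nltk.stem import SnowballStemmer, LancasterStemmer
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
for word in ['production', 'conditions', 'additionally']:
    print(word, stemmer.stem(word), snowball.stem(word), lancaster.stem(word))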

Lemmatization

Lemmatization is similar to stemming, but it produces valid words that are the base forms of the tokens. For example, the word 'better' can be lemmatized to the base form 'good'. Lemmatization can also take into account the part-of-speech of the tokens, which can affect their meanings. For example, the word 'saw' can be lemmatized to either 'see' or 'saw' depending on whether it is a verb or a noun. To perform lemmatization, we can use the WordNetLemmatizer class from NLTK, which uses a lexical database called WordNet to find the base forms of words. In the code below, however, I use spaCy's lemmatizer, which assigns lemmas based on the part-of-speech tags from its language model; a WordNet-based sketch follows the output for comparison.

# Loading the English language model
nlp = spacy.load("en_core_web_sm")

# Join the list into a sentence
sentence = " ".join(filtered_tokens)

# Process the sentence with spaCy
doc = nlp(sentence)

# Lemmatize and print the lemmatized tokens
lemmatized_tokens = [token.lemma_ for token in doc]
print(lemmatized_tokens)
        

The output of the above code is as follows:

['production', 'carbon', 'inulinase', 'production', 'niger', 'optimum', 'working', 'condition', 'enzyme', 'determine', 'additionally', 'property', 'enzyme', 'determine', 'eia', 'q', 'value', 'value', 'z', 'value', 'δh', 'δg', 'δs', 'show', 'sbm', 'production', 'inulinase', 'uml', 'c', 'ph', 'optimum', 'working', 'condition', 'ph', 'c', 'min', 'uml', 'additionally', 'eia', 'inulinase', 'respectively', 'c', 'q', 'value', 'inulinase', 'c', 'inulinase', 'min', 'inulinase', 'respectively', 'additionally', 'δh', 'δg', 'value', 'inulinase', 'z', 'value', 'inulinase', 'c', 'value', 'δs', 'show', 'sbm', 'carbon', 'inulinase', 'production', 'niger', 'property', 'niger', 'inulinase']        

As you can see, some of the tokens are unchanged, while some are changed to their base forms.
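
The paragraph above mentioned NLTK's WordNetLemmatizer, while the code I ran uses spaCy. For comparison, here is a minimal WordNet-based sketch; note that WordNetLemmatizer treats words as nouns unless you pass a part-of-speech hint, such as 'v' for verb.

# Lemmatization with NLTK's WordNetLemmatizer (a minimal sketch)
lemmatizer = WordNetLemmatizer()  # imported at the top of the post
print(lemmatizer.lemmatize('values'))           # 'value' (treated as a noun by default)
print(lemmatizer.lemmatize('determined', 'v'))  # 'determine' (treated as a verb)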

Part-of-speech tagging (Pos_Tag)

Part-of-speech tagging is the process of assigning a grammatical category to each token, such as noun, verb, adjective, etc. Part-of-speech tagging can help us understand the structure and meaning of the text. To perform part-of-speech tagging, we can use the pos_tag function from NLTK, which uses a pre-trained model to tag the tokens.

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Tag each token with its part-of-speech
tags = nltk.pos_tag(tokens)

# Chunk the tagged tokens into named entities
chunks = nltk.ne_chunk(tags)
df = pd.DataFrame(chunks)
df.columns = ['Word', 'IOB Notation']
df.head()        

The output of the above code is as follows:

(Image in the original post: IOB notation of words, showing the first rows of the dataframe.)

As you can see, NLTK chunks the tagged tokens into named entities based on the IOB notation, which stands for Inside-Outside-Beginning: a token is inside a named entity (I), outside any named entity (O), or at the beginning of one (B). Strictly speaking, the second column of the dataframe above holds the part-of-speech tag; the IOB label only appears once the chunk tree is flattened, which the sketch below demonstrates.
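
NLTK can flatten the chunk tree into (word, POS tag, IOB tag) triples with tree2conlltags; a short sketch:

# Flatten the chunk tree into explicit (word, pos, iob) triples
from nltk.chunk import tree2conlltags
iob_tags = tree2conlltags(chunks)
print(iob_tags[:5])  # e.g. [('are', 'VBP', 'O'), ('for', 'IN', 'O'), ...]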

Sentiment analysis

One of the most challenging and interesting text processing tasks is sentiment analysis, which is the process of determining the attitude or emotion expressed in a text, such as positive, negative, neutral, angry, happy, sad, etc. Sentiment analysis can help us understand the opinions and feelings of people towards various topics, products, events, etc. To perform sentiment analysis in Python, we can use the SentimentIntensityAnalyzer class from NLTK:

# Create a sentiment intensity analyzer object
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Analyze the sentiment of the sentence
scores = sia.polarity_scores(text)
scores        

The output of the above code is as follows:

{'neg': 0.0, 'neu': 0.813, 'pos': 0.187, 'compound': 0.9738}        

As you can see, NLTK analyzes the sentiment of the sentence using four scores: negative (neg), neutral (neu), positive (pos), and compound (compound). The compound score is a normalized value that ranges from -1 (most negative) to +1 (most positive) and represents the overall sentiment of the sentence. The other scores indicate the proportion of each sentiment in the sentence.
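
To see the scores move, you can run the analyzer on a sentence with obvious polarity (a made-up example); the compound score should come out clearly negative rather than positive as above.

# A quick check with an obviously negative sentence (hypothetical example)
print(sia.polarity_scores('The results were terrible and the method failed badly.'))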

All together in one function

Let's combine all of the code above into a single function!

def text_preprocessing(text):
    # Download NLTK data
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
    nltk.download('vader_lexicon', quiet=True)

    # standardization of letters
    standardized_text = text.lower()
    
    # punctuation removal
    text_punctuation_removed = re.sub(r'[^\w\s]', '', standardized_text)
    
    # number removal
    text_numbers_removed = re.sub(r'\d', '', text_punctuation_removed)
    
    # rare words
    def remove_rare_words(text, threshold=5):
        words = nltk.word_tokenize(text)
        word_freq = Counter(words)
        filtered_words = [word for word in words if word_freq[word] >= threshold]
        return ' '.join(filtered_words)

    text_removed_from_rare_words = remove_rare_words(text_numbers_removed, threshold=2)
    
    # tokenization
    tokens = nltk.word_tokenize(text_removed_from_rare_words)
    
    # spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text_removed_from_rare_words)
    tokens_spacy = [token.text for token in doc]
    
    # stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    
    # lemmatization
    sentence = " ".join(filtered_tokens)
    doc = nlp(sentence)
    lemmatized_tokens = [token.lemma_ for token in doc]
    
    # Part-of-speech tagging (Pos_Tag)
    pos_tag_tokens = nltk.word_tokenize(text_removed_from_rare_words)
    tags = nltk.pos_tag(pos_tag_tokens)
    chunks = nltk.ne_chunk(tags)
    df = pd.DataFrame(chunks)
    df.columns = ['Word', 'IOB Notation']
    
    # sentiment_analysis
    sia = SentimentIntensityAnalyzer()
    scores = sia.polarity_scores(text_removed_from_rare_words)
    
    return standardized_text, text_punctuation_removed, text_numbers_removed, text_removed_from_rare_words, tokens, tokens_spacy, filtered_tokens, stemmed_tokens, lemmatized_tokens, df, scores

text = """Inulinases are used for the production of high-fructose syrup and fructooligosaccharides, and are widely utilized in food and pharmaceutical industries. In this study, different carbon sources were screened for inulinase production by Aspergillus niger in shake flask fermentation. Optimum working conditions of the enzyme were determined. Additionally, some properties of produced enzyme were determined [activation (Ea)/inactivation (Eia) energies, Q10 value, inactivation rate constant (kd), half-life (t1/2), D value, Z value, enthalpy (ΔH), free energy (ΔG), and entropy (ΔS)]. Results showed that sugar beet molasses (SBM) was the best in the production of inulinase, which gave 383.73 U/mL activity at 30 °C, 200 rpm and initial pH 5.0 for 10 days with 2% (v/v) of the prepared spore solution. Optimum working conditions were 4.8 pH, 60 °C, and 10 min, which yielded 604.23 U/mL, 1.09 inulinase/sucrase ratio, and 2924.39 U/mg. Additionally, Ea and Eia of inulinase reaction were 37.30 and 112.86 kJ/mol, respectively. Beyond 60 °C, Q10 values of inulinase dropped below one. At 70 and 80 °C, t1/2 of inulinase was 33.6 and 7.2 min; therefore, inulinase is unstable at high temperatures, respectively. Additionally, t1/2, D, ΔH, ΔG values of inulinase decreased with the increase in temperature. Z values of inulinase were 7.21 °C. Negative values of ΔS showed that enzymes underwent a significant process of aggregation during denaturation. Consequently, SBM is a promising carbon source for inulinase production by A. niger. Also, this is the first report on the determination of some properties of A. niger A42 (ATCC 204,447) inulinase."""

standardized_text, text_punctuation_removed, text_numbers_removed, text_removed_from_rare_words, tokens, tokens_spacy, filtered_tokens, stemmed_tokens, lemmatized_tokens, df, scores = text_preprocessing(text)
# run the following codes one by one
standardized_text
text_punctuation_removed
text_numbers_removed
text_removed_from_rare_words
print(tokens)
print(tokens_spacy)
print(filtered_tokens)
print(stemmed_tokens)
print(lemmatized_tokens)
df.head()
scores        

Conclusions

This post offers a comprehensive overview of text preprocessing techniques for Natural Language Processing (NLP) in Python, focusing on the NLTK and spaCy libraries. Text preprocessing is crucial for preparing raw text data for analysis and modeling in NLP applications. The post emphasizes the importance of preprocessing given the varied nature of text sources, such as social media, books, and emails, which can introduce noise and inconsistencies into the data.

The code snippets in this post demonstrate the preprocessing steps one by one:

Standardization of Letters: This initial code snippet converts all the text to lowercase, ensuring uniformity and consistency. It showcases how a small change can yield significant improvements in data clarity.

Punctuation Removal: A regular expression removes punctuation marks from the text, leaving a cleaner version for the downstream steps.

Numbers Removal: A second regular expression removes numerical digits. As noted in the post, digits can carry essential information in scientific text, so apply this step with care.

Rare Words Removal: Using the Counter class, this step identifies and removes rare words based on a specified frequency threshold, shrinking the vocabulary at the cost of discarding infrequent terms.

Tokenization: This code snippet showcases tokenization, breaking down text into smaller units called tokens using NLTK's word_tokenize function. It emphasizes how tokenization prepares text for analysis by separating it into meaningful segments.

SpaCy: The code presents tokenization using the spaCy library, highlighting its advanced features such as handling contractions and emojis. It's compared to NLTK's tokenization, showcasing slight differences in handling punctuation.

Stopwords Removal: The code removes common stopwords from the text using NLTK's stopwords set. This step is described as reducing noise and improving data quality by eliminating irrelevant words.

Stemming: The PorterStemmer from NLTK reduces tokens to their root forms, which helps group words with similar meanings, although the resulting stems are not always valid words.

Lemmatization: Lemmatization, demonstrated here with spaCy (with NLTK's WordNetLemmatizer shown as an alternative), transforms tokens into their base forms. Its advantage over stemming is that it produces valid dictionary words.

Part-of-Speech Tagging: The code showcases NLTK's pos_tag function to assign grammatical categories to tokens, demonstrating the tagging of tokens with part-of-speech labels.

Sentiment Analysis: The SentimentIntensityAnalyzer from NLTK is utilized to analyze the sentiment of the text, producing scores for negative, neutral, positive, and compound sentiment. The code emphasizes how sentiment analysis aids in understanding emotions and opinions expressed in text.

Finally, we collected all of the code above into a single function, text_preprocessing(), which produces the same results.

In summary, this post walks through the main text preprocessing techniques using NLP libraries in Python, giving readers a practical understanding of how to prepare text data for analysis and modeling.
