Text Mining


Text mining is the process of exploring and analyzing large amounts of unstructured text data, aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. According to Wikipedia: "Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing), deriving patterns within the structured data, and finally evaluation and interpretation of the output."

For more information, refer to this link: https://en.wikipedia.org/wiki/Text_mining

Text mining data flow:

Step 1: Information Retrieval: This is the first step in the text mining process. A search engine or similar tool is used to gather a collection of texts, known as a corpus, which may need some conversion. These texts should also be brought together in a consistent format that is easy for users to work with.

Step 2: Natural Language Processing: This step performs grammatical analysis of each sentence so the system can read the text, and analyzes the text's linguistic structure.

Step 3: Information Extraction: In this third stage, the text is marked up in order to identify its meaning: metadata about the text, such as names or locations, is added to the database. This metadata lets the search engine retrieve information and find relationships between texts.

Step 4: Data Mining: The final stage is data mining using different tools. This step finds similarities between pieces of information that have the same meaning, which would otherwise be difficult to spot. Text mining thus speeds up the research process and helps to test queries.
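
As a rough illustration, here is a minimal sketch of the four stages on a tiny invented corpus; the texts and the capitalized-word heuristic below are toy assumptions, not a real pipeline:

#Step 1: Information retrieval -- gather a corpus of texts
import re
from collections import Counter
corpus = ["Apple opened a new office in London.",
          "Google reported strong earnings in California."]

#Step 2: Natural language processing -- tokenize each text
tokens = [re.findall(r'\w+', doc.lower()) for doc in corpus]

#Step 3: Information extraction -- tag naive entity candidates as metadata
entities = [re.findall(r'[A-Z][a-z]+', doc) for doc in corpus]

#Step 4: Data mining -- look for patterns, e.g. term frequencies
term_counts = Counter(word for doc in tokens for word in doc)
print(entities)
print(term_counts.most_common(5))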


Text Mining techniques:

Typical text mining tasks include:

  • Text Categorization: Cataloguing texts into predefined categories (a minimal example follows this list)
  • Text Clustering: Grouping automatically retrieved texts into a list of meaningful categories
  • Concept/Entity Extraction: Locating and classifying elements in text into predefined categories such as persons, organizations, locations, monetary values, etc.
  • Granular Taxonomies: Organizing or classifying information as a set of objects displayed as a taxonomy
  • Sentiment Analysis: Identifying and extracting subjective information from source materials (e.g., emotions, beliefs)
  • Document Summarization: Creating a shortened version of a text containing its most important elements
  • Entity Relation Modeling: Automatically learning relationships between named entities
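
As a taste of the first task, here is a minimal text categorization sketch using scikit-learn; the tiny training set and its labels are invented purely for illustration:

#Toy text categorization: bag-of-words features + Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great match and a brilliant goal",
         "stocks fell as markets reacted",
         "the team won the championship",
         "the central bank raised interest rates"]
labels = ["sports", "finance", "sports", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["the striker scored twice"]))  #-> ['sports']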

So far we have discussed a lot of theory about text mining. Now let's try something practical. Before starting the project, let's collect the important packages we will need.

The most commonly used Python packages for text mining are as follows:

  1. NLTK (Natural Language Toolkit): The 'mother' of all NLP libraries. Excellent for educational purposes and the de facto standard for many NLP tasks.
  2. TextBlob: Definitely one of my favorite libraries and my personal go-to for prototyping or implementing common NLP tasks. It can be considered a modern, multi-purpose NLP toolset that is great for fast and easy development.
  3. Gensim: The go-to library for semantic analysis and topic modelling in NLP and text mining. It's fast, scalable, and very efficient.
  4. Scikit-learn: A Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
  5. Polyglot: Primarily designed for multilingual applications.
  6. PyNLPl: Useful for basic tasks such as extracting n-grams and frequency lists, and for building a simple language model. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
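
Before running the project below, these packages and the corpora they rely on need to be installed; a typical one-time setup looks like this (package names as published on PyPI):

#Shell commands (run once):
#  pip install nltk textblob gensim scikit-learn pandas numpy
#  python -m textblob.download_corpora
import nltk
nltk.download('stopwords')  #used for stopword counting/removal below
nltk.download('wordnet')    #used by Word(...).lemmatize() below
nltk.download('punkt')      #tokenizer models used by TextBlob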

Project: Twitter Sentiment Analysis

In this project we'll try to understand the sentiment of tweets using text mining techniques in Python.

#Data Extraction
import pandas as pd
import numpy as np  #used later for the IDF calculation

train = pd.read_csv('train_E6oV3lV.csv')

#Counting number of words in each tweet
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train[['tweet','word_count']].head()

#Finding length of tweet
train['char_count'] = train['tweet'].str.len() ## this also includes spaces
train[['tweet','char_count']].head()

#Computing the average word length in each tweet
def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words)/len(words))

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))
train[['tweet','avg_word']].head()

#Counting stopwords in each tweet
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()

#Counting number of hashtags
train['hashtags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['tweet','hashtags']].head()

#Counting number of numerics
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['tweet','numerics']].head()

#Counting number of uppercase words
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['tweet','upper']].head()

#Converting tweets to lowercase
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

#Removing punctuation
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)

#Removing stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))



#Removing the 10 most frequent words
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

#Inspecting the 10 rarest words
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

#Removing the 10 rarest words
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

#Spelling correction with TextBlob (slow, so only on the first five tweets)
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

#Tokenization refers to dividing the text into a sequence of words or sentences. In our example, we have used the textblob library to first transform our tweets into a blob and then converted them into a series of words.

TextBlob(train['tweet'][1]).words

#Stemming refers to the removal of suffixes, like "ing", "ly", "s", etc. by a simple rule-based approach. For this purpose, we will use PorterStemmer from the NLTK library.

from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

#Lemmatization is often more effective than stemming because it converts the word to its actual root word rather than just stripping suffixes,
#so we usually prefer lemmatization over stemming.

from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()

#N-grams are the combination of multiple words used together. Ngrams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

TextBlob(train['tweet'][0]).ngrams(2)

#TF = (Number of times term T appears in the particular row) / (number of terms in that row)

tf1 = (train['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1

#IDF = log(N/n), where, N is the total number of rows and n is the number of rows in which the word was present.

for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))

#TF-IDF is the multiplication of the TF and IDF which we calculated above.

tf1['tfidf'] = tf1['tf'] * tf1['idf']

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
                        stop_words='english', ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['tweet'])
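
#To sanity-check the vectorizer, inspect the matrix shape and the learned
#vocabulary (an assumption here: get_feature_names_out requires scikit-learn
#1.0+; older versions use get_feature_names instead)
print(train_vect.shape)
print(tfidf.get_feature_names_out()[:10])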

#Bag of Words (BoW) refers to the representation of text which describes the presence of words within the text data.

from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer='word')
train_bow = bow.fit_transform(train['tweet'])

#Let's check the sentiment of the first few tweets. TextBlob's sentiment is a
#(polarity, subjectivity) pair; polarity ranges from -1 (negative) to 1 (positive).

train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['tweet','sentiment']].head()

#Word Embedding is the representation of text in the form of vectors.

from gensim.scripts.glove2word2vec import glove2word2vec
#Convert pre-trained GloVe vectors into word2vec format for gensim.
#glove.6B.100d.txt comes from glove.6B.zip at https://nlp.stanford.edu/projects/glove/
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

from gensim.models import KeyedVectors # load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
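
#Once loaded, the vectors support similarity and analogy queries, for example:
print(model.most_similar('tweet', topn=5))
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))  #~ 'queen'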

Text Mining Model:

[Figure: text mining model diagram; image unavailable]

Text mining and NLP are both trending technologies and are used everywhere. A few applications of text mining and NLP are given below:

  • Social media data analysis
  • Risk management
  • Cyber crime prevention
  • Customer care service
  • Fraud detection through claims investigation
  • Contextual Advertising
  • Business intelligence
  • Content enrichment
  • Spam filtering
  • Knowledge management

Conclusion:

In this article, I tried to explain what text mining is, the text mining data flow, text mining techniques, commonly used Python packages for text mining, a project on Twitter sentiment analysis, and a text mining model.

I hope you like this article.

Through this article, I would also like to thank everyone who has read, liked, clapped, and commented on my articles. This is the sole motivation that encourages me to write.

Keep reading and I’ll keep writing.
