Text Mining


Text mining is the process of exploring and analyzing large amounts of unstructured text data, aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. According to Wikipedia: "Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing), deriving patterns within the structured data, and finally evaluation and interpretation of the output."

For more information, refer to this link: https://en.wikipedia.org/wiki/Text_mining

Text mining data flow:

Step 1: Information Retrieval: This is the first step in the text mining process. A search engine or similar tool is used to gather a collection of texts, known as a corpus, which may need some conversion. These texts should also be brought together in a consistent format that is easy for users to work with.

Step 2: Natural Language Processing: This step performs grammatical analysis of each sentence so the system can read the text, and analyzes the text's linguistic structure.

Step 3: Information Extraction: In this third stage, the text is marked up in order to identify its meaning: metadata about the text, such as names or locations, is added to the database. This metadata lets the search engine retrieve information and find relationships between texts.

Step 4: Data Mining: The final stage is data mining using different tools. This step finds similarities between pieces of information that have the same meaning, which would otherwise be difficult to spot. Text mining thus speeds up the research process and helps to test queries.
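
As a rough illustration, here is a minimal sketch of the four stages on a tiny invented corpus; the texts and the capitalized-word heuristic below are toy assumptions, not a real pipeline:

#Step 1: Information retrieval -- gather a corpus of texts
import re
from collections import Counter
corpus = ["Apple opened a new office in London.",
          "Google reported strong earnings in California."]

#Step 2: Natural language processing -- tokenize each text
tokens = [re.findall(r'\w+', doc.lower()) for doc in corpus]

#Step 3: Information extraction -- tag naive entity candidates as metadata
entities = [re.findall(r'[A-Z][a-z]+', doc) for doc in corpus]

#Step 4: Data mining -- look for patterns, e.g. term frequencies
term_counts = Counter(word for doc in tokens for word in doc)
print(entities)
print(term_counts.most_common(5))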


Text Mining techniques:

Typical text mining tasks include:

  • Text Categorization: Cataloguing texts into predefined categories (a minimal example follows this list)
  • Text Clustering: Grouping automatically retrieved texts into a list of meaningful categories
  • Concept/Entity Extraction: Locating and classifying elements in text into predefined categories such as persons, organizations, locations, monetary values, etc.
  • Granular Taxonomies: Organizing or classifying information as a set of objects displayed as a taxonomy
  • Sentiment Analysis: Identifying and extracting subjective information from source materials (e.g., emotions, beliefs)
  • Document Summarization: Creating a shortened version of a text containing its most important elements
  • Entity Relation Modeling: Automatically learning relationships between named entities
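
As a taste of the first task, here is a minimal text categorization sketch using scikit-learn; the tiny training set and its labels are invented purely for illustration:

#Toy text categorization: bag-of-words features + Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great match and a brilliant goal",
         "stocks fell as markets reacted",
         "the team won the championship",
         "the central bank raised interest rates"]
labels = ["sports", "finance", "sports", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["the striker scored twice"]))  #-> ['sports']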

So far we have discussed a lot of theory about text mining. Now let's try something practical. Before starting the project, let's collect the important packages we will need.

The most commonly used Python packages for text mining are as follows:

  1. NLTK (Natural Language Toolkit): The 'mother' of all NLP libraries. Excellent for educational purposes and the de facto standard for many NLP tasks.
  2. TextBlob: Definitely one of my favorite libraries and my personal go-to for prototyping or implementing common NLP tasks. It can be considered a modern, multi-purpose NLP toolset that is great for fast and easy development.
  3. Gensim: The go-to library for semantic analysis and topic modelling in NLP and text mining. It's fast, scalable, and very efficient.
  4. Scikit-learn: A Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
  5. Polyglot: Primarily designed for multilingual applications.
  6. PyNLPl: Useful for basic tasks such as extracting n-grams and frequency lists, and for building a simple language model. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
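
Before running the project below, these packages and the corpora they rely on need to be installed; a typical one-time setup looks like this (package names as published on PyPI):

#Shell commands (run once):
#  pip install nltk textblob gensim scikit-learn pandas numpy
#  python -m textblob.download_corpora
import nltk
nltk.download('stopwords')  #used for stopword counting/removal below
nltk.download('wordnet')    #used by Word(...).lemmatize() below
nltk.download('punkt')      #tokenizer models used by TextBlob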

Project: Twitter Sentiment Analysis

In this project we'll try to understand the sentiment of tweets using text mining techniques in Python.

#Data Extraction
import pandas as pd
import numpy as np  #used later for the IDF calculation

train = pd.read_csv('train_E6oV3lV.csv')

#Counting number of words in each tweet
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train[['tweet','word_count']].head()

#Finding length of tweet
train['char_count'] = train['tweet'].str.len() ## this also includes spaces
train[['tweet','char_count']].head()

#Computing the average word length in each tweet
def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words)/len(words))

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))
train[['tweet','avg_word']].head()

#Counting stopwords in each tweet
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()

#Counting number of hashtags
train['hashtags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['tweet','hashtags']].head()

#Counting number of numerics
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['tweet','numerics']].head()

#Counting number of uppercase words
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['tweet','upper']].head()

#Converting tweets to lowercase
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

#Removing punctuation
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)

#Removing stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))



#Removing the 10 most frequent words
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

#Inspecting the 10 rarest words
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

#Removing the 10 rarest words
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

#Spelling correction with TextBlob (slow, so only on the first five tweets)
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

#Tokenization refers to dividing the text into a sequence of words or sentences. In our example, we have used the textblob library to first transform our tweets into a blob and then converted them into a series of words.

TextBlob(train['tweet'][1]).words

#Stemming refers to the removal of suffixes, like "ing", "ly", "s", etc. by a simple rule-based approach. For this purpose, we will use PorterStemmer from the NLTK library.

from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

#Lemmatization is often more effective than stemming because it converts the word to its actual root word rather than just stripping suffixes,
#so we usually prefer lemmatization over stemming.

from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()

#N-grams are the combination of multiple words used together. Ngrams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

TextBlob(train['tweet'][0]).ngrams(2)

#TF = (Number of times term T appears in the particular row) / (number of terms in that row)

tf1 = (train['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1

#IDF = log(N/n), where, N is the total number of rows and n is the number of rows in which the word was present.

for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))

#TF-IDF is the multiplication of the TF and IDF which we calculated above.

tf1['tfidf'] = tf1['tf'] * tf1['idf']

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
                        stop_words='english', ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['tweet'])
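
#To sanity-check the vectorizer, inspect the matrix shape and the learned
#vocabulary (an assumption here: get_feature_names_out requires scikit-learn
#1.0+; older versions use get_feature_names instead)
print(train_vect.shape)
print(tfidf.get_feature_names_out()[:10])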

#Bag of Words (BoW) refers to the representation of text which describes the presence of words within the text data.

from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer='word')
train_bow = bow.fit_transform(train['tweet'])

#Let's check the sentiment of the first few tweets. TextBlob's sentiment is a
#(polarity, subjectivity) pair; polarity ranges from -1 (negative) to 1 (positive).

train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['tweet','sentiment']].head()

#Word Embedding is the representation of text in the form of vectors.

from gensim.scripts.glove2word2vec import glove2word2vec
#Convert pre-trained GloVe vectors into word2vec format for gensim.
#glove.6B.100d.txt comes from glove.6B.zip at https://nlp.stanford.edu/projects/glove/
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

from gensim.models import KeyedVectors # load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
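
#Once loaded, the vectors support similarity and analogy queries, for example:
print(model.most_similar('tweet', topn=5))
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))  #~ 'queen'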

Text Mining Model:

[Figure: text mining model diagram; image unavailable]

Text mining and NLP are both trending technologies and are used everywhere. A few applications of text mining and NLP are given below:

  • Social media data analysis
  • Risk management
  • Cyber crime prevention
  • Customer care service
  • Fraud detection through claims investigation
  • Contextual Advertising
  • Business intelligence
  • Content enrichment
  • Spam filtering
  • Knowledge management

Conclusion:

In this article, I tried to explain what text mining is, the text mining data flow, text mining techniques, commonly used Python packages for text mining, a project on Twitter sentiment analysis, and a text mining model.

I hope you like this article.

Through this article, I would also like to thank everyone who has read, liked, clapped, and commented on my articles. This is the sole motivation that encourages me to write.

Keep reading and I’ll keep writing.
