Unlocking the Power of Text
In the realm of Natural Language Processing, text representation techniques are the magic behind converting words into numbers. These numeric formats serve as the foundation for machine learning and statistical analysis. From sentiment analysis to machine translation and text classification, these techniques enable NLP to make sense of the language we use.
1. One-Hot Encoding: Transforming Categories into Code
One-hot encoding is like a secret code for categorical variables! It converts each category into a unique binary vector of 0s and 1s. This technique is a must-know in the world of machine learning, as it takes categorical data and turns it into a numerical format that algorithms can understand. Each category becomes a binary feature, with only one being "hot" (set to 1) for a given data point, indicating its presence. All others stay "cold" (set to 0), showing the absence of those categories.
* The original data had a "Fruit Type" column with categorical values.
* After applying one-hot encoding, new columns are created for each category (Apple, Banana, Orange), with "1" indicating the presence of that category and "0" indicating the absence.
* Each row corresponds to the original data, and in each row, only one of the new columns has a "1," indicating the fruit type for that row (see the sketch below).
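As a concrete sketch of this fruit example, pandas' get_dummies can produce the same one-hot columns. The "Fruit Type" column and its values here are illustrative, not taken from an actual dataset:
import pandas as pd

# Hypothetical "Fruit Type" column, as described in the example above
df = pd.DataFrame({"Fruit Type": ["Apple", "Banana", "Orange", "Banana"]})

# get_dummies creates one binary column per category (Apple, Banana, Orange)
one_hot = pd.get_dummies(df, columns=["Fruit Type"], dtype=int)
print(one_hot)
Each resulting column ("Fruit Type_Apple", "Fruit Type_Banana", "Fruit Type_Orange") holds a 1 only in the rows where that fruit appears.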
import numpy as np

# Vocabulary and a sentence to encode
vocabulary = ['I', 'love', 'natural', 'language', 'processing']
sentence = "I love natural language processing"
words = sentence.split()

# One row per word in the sentence, one column per vocabulary entry
one_hot_encoding = np.zeros((len(words), len(vocabulary)))
for i, word in enumerate(words):
    if word in vocabulary:
        index = vocabulary.index(word)
        one_hot_encoding[i, index] = 1

print(one_hot_encoding)
# output
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
2. Bag of Words (BoW): It transforms text documents into numerical vectors by constructing a fixed-size vector whose dimension equals the size of the vocabulary, where the value in each dimension is the frequency of the corresponding vocabulary word in the document.
Let's dive into the Bag of Words (BoW) technique with a practical example.
Suppose we're working with a small corpus of three concise documents:
Document 1: "I love machine learning."
Document 2: "Machine learning is fascinating."
Document 3: "NLP and machine learning are related."
Here's how we apply the BoW model:
Step 1: Tokenization. We break down each document into individual words (treating "Machine" and "machine" as the same word):
Document 1: ["I", "love", "machine", "learning"]
Document 2: ["machine", "learning", "is", "fascinating"]
Document 3: ["NLP", "and", "machine", "learning", "are", "related"]
Step 2: Vocabulary Creation. We construct a vocabulary by compiling all unique words from the corpus:
Vocabulary: ["I", "love", "machine", "learning", "is", "fascinating", "NLP", "and", "are", "related"]
Step 3: Vectorization. Now, we represent each document numerically based on word frequency in the vocabulary. We count how often each vocabulary word appears in each document:
Document 1: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
Document 3: [0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
In these vectors, each position corresponds to a word in the vocabulary, and the value at each position represents the frequency of that word in the respective document.
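The three steps above can be reproduced in a few lines of plain Python. The following is a minimal sketch for the same three-document corpus; the tokenizer simply lowercases, strips the period, and splits on spaces, so it is illustrative rather than production-grade:
# Corpus from the worked example above
docs = [
    "I love machine learning.",
    "Machine learning is fascinating.",
    "NLP and machine learning are related.",
]

# Step 1: Tokenization (lowercase, drop the period, split on whitespace)
tokenized = [doc.lower().replace(".", "").split() for doc in docs]

# Step 2: Vocabulary creation, preserving first-seen order
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 3: Vectorization: count how often each vocabulary word appears per document
bow_vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
Running this prints the same three count vectors shown above (with the vocabulary in lowercase). scikit-learn's CountVectorizer, applied to a different sample corpus below, automates exactly these steps.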
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And one this is the this third this one.",
    "Is this the first document?"
]

# Build the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

X_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()

print("Feature Names:", feature_names)
print("Bag of Words Representation:")
print(X_array)
#result
Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Bag of Words Representation:
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 2 0 1 1 3]
[0 1 1 1 0 0 1 0 1]]
3. TF-IDF (Term Frequency-Inverse Document Frequency): It combines two measures, term frequency and inverse document frequency, to score how important a word is in a document relative to the whole corpus.
Term Frequency (TF): This part of TF-IDF measures how often a word appears in a document, normalized by the total number of terms in that document. It's like giving each word a "popularity score" within the document: TF = (number of times the term appears in the document) / (total number of terms in the document).
Let's dive into calculating TF (Term Frequency) with a practical example.
Suppose we're working with a small corpus of three concise documents:
Document 1: "I love machine learning."
Document 2: "Machine learning is fascinating."
Document 3: "NLP and machine learning are related."
Step 1: Tokenization. Tokenize each document into individual words, exactly as in the BoW example above.
Step 2: Vocabulary Creation. We construct a vocabulary by compiling all unique words from the corpus:
Vocabulary: ["I", "love", "machine", "learning", "is", "fascinating", "NLP", "and", "are", "related"]
Step 3: TF (Term Frequency) Representation
Now, we represent each document as a numerical vector based on the TF values of words in the vocabulary. In this TF representation, each cell measures how often a word appears in a document relative to the total number of words in that document.
Here's the TF (Term Frequency) representation of the documents as fractions:
Document 1: [1/4, 1/4, 1/4, 1/4, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 1/4, 1/4, 1/4, 1/4, 0, 0, 0, 0]
Document 3: [0, 0, 1/6, 1/6, 0, 0, 1/6, 1/6, 1/6, 1/6]
Inverse Document Frequency (IDF): It measures how important a term is across the entire corpus of documents. It is calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term.
The formula for IDF is typically: IDF = log(N / n),
where N is the total number of documents in the corpus, and n is the number of documents containing the term.
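For example, in the three-document corpus above, "machine" appears in all three documents, so IDF = log(3/3) = 0, while "love" appears in only one document, so IDF = log(3/1) ≈ 1.10 with the natural logarithm (≈ 0.48 with log base 10). Note that libraries such as scikit-learn use a smoothed variant of this formula, so their exact values differ slightly.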
TF-IDF: The superhero of NLP. The TF-IDF score for a term in a document is obtained by multiplying the TF and IDF values for that term:
TF-IDF = TF * IDF
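Below is a minimal sketch that applies these formulas directly (TF as the fraction of words in the document, IDF = log(N / n) with the natural logarithm) to the three-document corpus used earlier. It is meant only to illustrate the arithmetic; its numbers will not match the TfidfVectorizer output further down, because scikit-learn uses a smoothed IDF and L2-normalizes each document vector:
import math

# Corpus from the worked TF example above
docs = [
    "I love machine learning.",
    "Machine learning is fascinating.",
    "NLP and machine learning are related.",
]
tokenized = [doc.lower().replace(".", "").split() for doc in docs]

# Vocabulary in first-seen order
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

N = len(docs)  # total number of documents in the corpus

# IDF = log(N / n), where n is the number of documents containing the term
idf = {word: math.log(N / sum(1 for tokens in tokenized if word in tokens))
       for word in vocabulary}

# TF = term count / document length, then TF-IDF = TF * IDF
tfidf_vectors = [[(tokens.count(word) / len(tokens)) * idf[word] for word in vocabulary]
                 for tokens in tokenized]

for vector in tfidf_vectors:
    print([round(value, 3) for value in vector])
Terms like "machine" and "learning", which occur in every document, get a TF-IDF of 0 here, which is exactly the behaviour IDF is designed to produce.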
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And one this is the this third this one.",
    "Is this the first document?"
]

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

tfidf_matrix_array = tfidf_matrix.toarray()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

print("TF-IDF Feature Names:", tfidf_feature_names)
print("TF-IDF Representation:")
print(tfidf_matrix_array)
#result
TF-IDF Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
TF-IDF Representation:
[[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.6876236 0. 0.28108867 0. 0.53864762
0.28108867 0. 0.28108867]
[0.33341663 0. 0. 0.17399063 0.66683325 0.
0.17399063 0.33341663 0.52197188]
[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]]
Text representations are the building blocks of Natural Language Processing, empowering machines to understand and process human language. From One-Hot Encoding to Bag of Words (BoW) to TF-IDF, these techniques have paved the way for groundbreaking applications like sentiment analysis, language translation, and content classification.