Unlocking the Power of Text
In the realm of Natural Language Processing, text representation techniques are the magic behind converting words into numbers. These numeric formats serve as the foundation for machine learning and statistical analysis. From sentiment analysis to machine translation and text classification, these techniques enable NLP to make sense of the language we use.
1. One-Hot Encoding: Transforming Categories into Code
One-hot encoding is like a secret code for categorical variables! It converts each category into a unique binary vector of 0s and 1s. This technique is a must-know in the world of machine learning, as it takes categorical data and turns it into a numerical format that algorithms can understand. Each category becomes a binary feature, with only one being "hot" (set to 1) for a given data point, indicating its presence. All others stay "cold" (set to 0), showing the absence of those categories.
* The original data had a "Fruit Type" column with categorical values.
* After applying one-hot encoding, new columns are created for each category (Apple, Banana, Orange), with "1" indicating the presence of that category and "0" indicating the absence.
* Each row corresponds to the original data, and in each row, only one of the new columns has a "1," indicating the fruit type for that row (see the sketch below).
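As a concrete sketch of this fruit example, pandas' get_dummies can produce the same one-hot columns. The "Fruit Type" column and its values here are illustrative, not taken from an actual dataset:
import pandas as pd

# Hypothetical "Fruit Type" column, as described in the example above
df = pd.DataFrame({"Fruit Type": ["Apple", "Banana", "Orange", "Banana"]})

# get_dummies creates one binary column per category (Apple, Banana, Orange)
one_hot = pd.get_dummies(df, columns=["Fruit Type"], dtype=int)
print(one_hot)
Each resulting column ("Fruit Type_Apple", "Fruit Type_Banana", "Fruit Type_Orange") holds a 1 only in the rows where that fruit appears.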
import numpy as np

# Vocabulary and a sentence to encode
vocabulary = ['I', 'love', 'natural', 'language', 'processing']
sentence = "I love natural language processing"
words = sentence.split()

# One row per word in the sentence, one column per vocabulary entry
one_hot_encoding = np.zeros((len(words), len(vocabulary)))
for i, word in enumerate(words):
    if word in vocabulary:
        index = vocabulary.index(word)
        one_hot_encoding[i, index] = 1

print(one_hot_encoding)
# output
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
2. Bag of Words (BoW): It transforms text documents into numerical vectors by constructing a fixed-size vector whose dimension equals the size of the vocabulary, where the value in each dimension is the frequency of the corresponding vocabulary word in the document.
Let's dive into the Bag of Words (BoW) technique with a practical example.
Suppose we're working with a small corpus of three concise documents:
Document 1: "I love machine learning."
Document 2: "Machine learning is fascinating."
Document 3: "NLP and machine learning are related."
Here's how we apply the BoW model:
Step 1: Tokenization. We break down each document into individual words (treating "Machine" and "machine" as the same word):
Document 1: ["I", "love", "machine", "learning"]
Document 2: ["machine", "learning", "is", "fascinating"]
Document 3: ["NLP", "and", "machine", "learning", "are", "related"]
Step 2: Vocabulary Creation. We construct a vocabulary by compiling all unique words from the corpus:
Vocabulary: ["I", "love", "machine", "learning", "is", "fascinating", "NLP", "and", "are", "related"]
Step 3: Vectorization. Now, we represent each document numerically based on word frequency in the vocabulary. We count how often each vocabulary word appears in each document:
Document 1: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
Document 3: [0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
In these vectors, each position corresponds to a word in the vocabulary, and the value at each position represents the frequency of that word in the respective document.
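The three steps above can be reproduced in a few lines of plain Python. The following is a minimal sketch for the same three-document corpus; the tokenizer simply lowercases, strips the period, and splits on spaces, so it is illustrative rather than production-grade:
# Corpus from the worked example above
docs = [
    "I love machine learning.",
    "Machine learning is fascinating.",
    "NLP and machine learning are related.",
]

# Step 1: Tokenization (lowercase, drop the period, split on whitespace)
tokenized = [doc.lower().replace(".", "").split() for doc in docs]

# Step 2: Vocabulary creation, preserving first-seen order
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 3: Vectorization: count how often each vocabulary word appears per document
bow_vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
Running this prints the same three count vectors shown above (with the vocabulary in lowercase). scikit-learn's CountVectorizer, applied to a different sample corpus below, automates exactly these steps.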
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And one this is the this third this one.",
    "Is this the first document?"
]

# Build the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

X_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()

print("Feature Names:", feature_names)
print("Bag of Words Representation:")
print(X_array)
#result
Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Bag of Words Representation:
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 2 0 1 1 3]
[0 1 1 1 0 0 1 0 1]]
3. TF-IDF (Term Frequency-Inverse Document Frequency): It combines two measures, term frequency and inverse document frequency, to score how important a word is in a document relative to the whole corpus.
Term Frequency (TF): This part of TF-IDF measures how often a word appears in a document, normalized by the total number of terms in that document. It's like giving each word a "popularity score" within the document: TF = (number of times the term appears in the document) / (total number of terms in the document).
Let's dive into calculating TF (Term Frequency) with a practical example.
Suppose we're working with a small corpus of three concise documents:
Document 1: "I love machine learning."
Document 2: "Machine learning is fascinating."
Document 3: "NLP and machine learning are related."
Step 1: Tokenization. Tokenize each document into individual words, exactly as in the BoW example above.
Step 2: Vocabulary Creation. We construct a vocabulary by compiling all unique words from the corpus:
Vocabulary: ["I", "love", "machine", "learning", "is", "fascinating", "NLP", "and", "are", "related"]
Step 3: TF (Term Frequency) Representation
Now, we represent each document as a numerical vector based on the TF values of words in the vocabulary. In this TF representation, each cell measures how often a word appears in a document relative to the total number of words in that document.
Here's the TF (Term Frequency) representation of the documents as fractions:
Document 1: [1/4, 1/4, 1/4, 1/4, 0, 0, 0, 0, 0, 0]
Document 2: [0, 0, 1/4, 1/4, 1/4, 1/4, 0, 0, 0, 0]
Document 3: [0, 0, 1/6, 1/6, 0, 0, 1/6, 1/6, 1/6, 1/6]
Inverse Document Frequency (IDF): It measures how important a term is across the entire corpus of documents. It is calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term.
The formula for IDF is typically: IDF = log(N / n),
where N is the total number of documents in the corpus, and n is the number of documents containing the term.
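For example, in the three-document corpus above, "machine" appears in all three documents, so IDF = log(3/3) = 0, while "love" appears in only one document, so IDF = log(3/1) ≈ 1.10 with the natural logarithm (≈ 0.48 with log base 10). Note that libraries such as scikit-learn use a smoothed variant of this formula, so their exact values differ slightly.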
TF-IDF: The superhero of NLP. The TF-IDF score for a term in a document is obtained by multiplying the TF and IDF values for that term:
TF-IDF = TF * IDF
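Below is a minimal sketch that applies these formulas directly (TF as the fraction of words in the document, IDF = log(N / n) with the natural logarithm) to the three-document corpus used earlier. It is meant only to illustrate the arithmetic; its numbers will not match the TfidfVectorizer output further down, because scikit-learn uses a smoothed IDF and L2-normalizes each document vector:
import math

# Corpus from the worked TF example above
docs = [
    "I love machine learning.",
    "Machine learning is fascinating.",
    "NLP and machine learning are related.",
]
tokenized = [doc.lower().replace(".", "").split() for doc in docs]

# Vocabulary in first-seen order
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

N = len(docs)  # total number of documents in the corpus

# IDF = log(N / n), where n is the number of documents containing the term
idf = {word: math.log(N / sum(1 for tokens in tokenized if word in tokens))
       for word in vocabulary}

# TF = term count / document length, then TF-IDF = TF * IDF
tfidf_vectors = [[(tokens.count(word) / len(tokens)) * idf[word] for word in vocabulary]
                 for tokens in tokenized]

for vector in tfidf_vectors:
    print([round(value, 3) for value in vector])
Terms like "machine" and "learning", which occur in every document, get a TF-IDF of 0 here, which is exactly the behaviour IDF is designed to produce.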
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And one this is the this third this one.",
    "Is this the first document?"
]

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

tfidf_matrix_array = tfidf_matrix.toarray()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

print("TF-IDF Feature Names:", tfidf_feature_names)
print("TF-IDF Representation:")
print(tfidf_matrix_array)
#result
TF-IDF Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
TF-IDF Representation:
[[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.6876236 0. 0.28108867 0. 0.53864762
0.28108867 0. 0.28108867]
[0.33341663 0. 0. 0.17399063 0.66683325 0.
0.17399063 0.33341663 0.52197188]
[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]]
Text representations are the building blocks of Natural Language Processing, empowering machines to understand and process human language. From One-Hot Encoding to Bag of Words (BoW) to TF-IDF, these techniques have paved the way for groundbreaking applications like sentiment analysis, language translation, and content classification.