Introduction to Natural Language Processing: Byte Pair Encoding (BPE) and Natural Language Toolkit (NLTK)
RISHABH SINGH
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics. It focuses on enabling computers to understand, interpret, and generate human language in ways that are both meaningful and useful. The primary goal of NLP is to bridge the gap between human communication and computer understanding.
Key aspects of NLP include:
- Tokenization: splitting raw text into sentences and words
- Part-of-speech tagging: assigning grammatical categories to words
- Parsing: analyzing grammatical structure
- Named entity recognition: identifying people, organizations, and locations
- Sentiment analysis: detecting opinion and emotion in text
- Stemming and lemmatization: reducing words to their root or base forms
Byte Pair Encoding (BPE) Explained with an Example
Byte Pair Encoding (BPE) is a data compression technique that has been adapted for subword tokenization in NLP. It helps in breaking down words into smaller units called subword tokens.
Why BPE is Useful in NLP
- It handles rare and out-of-vocabulary words by splitting them into known subword units instead of mapping them to a single unknown token.
- It keeps the vocabulary compact: frequent words remain whole tokens, while rare words decompose into smaller pieces.
- It underlies the tokenizers of many modern language models, including the GPT family.
How BPE Works: Algorithm Steps
1. Split every word in the corpus into characters and append an end-of-word symbol </w>.
2. Count the frequency of every pair of adjacent symbols.
3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
4. Repeat steps 2-3 for a fixed number of merges, or until the vocabulary reaches a target size.
Let’s walk through a simple example.
Corpus:
low
lowest
lower
newest
newer
Step 1: Initialization
Break each word into a sequence of characters with an end-of-word symbol </w>.
l o w </w>
l o w e s t </w>
l o w e r </w>
n e w e s t </w>
n e w e r </w>
Vocabulary:
Initial vocabulary is all unique symbols:
{l, o, w, e, s, t, n, r, </w>}
Step 2: Iterative Merging
Iteration 1:
Count Pairs:
('l', 'o'): 3 times
('o', 'w'): 3 times
('w', '</w>'): 1 time
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
and so on.
Most Frequent Pair: ('l', 'o') occurs 3 times (tied with ('o', 'w'); ties are broken arbitrarily, and we merge one pair per iteration).
Merge ‘l o’ → ‘lo’
Update Corpus:
lo w </w>
lo w e s t </w>
lo w e r </w>
n e w e s t </w>
n e w e r </w>
Update Vocabulary: Add lo to the vocabulary.
Iteration 2:
Count Pairs:
('lo', 'w'): 3 times
('w', '</w>'): 1 time
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
and so on.
Most Frequent Pair: 'lo w' occurs 3 times.
Merge ‘lo w’ → ‘low’
Update Corpus:
low </w>
low e s t </w>
low e r </w>
n e w e s t </w>
n e w e r </w>
Update Vocabulary: Add low to the vocabulary.
Iteration 3:
Count Pairs:
('low', '</w>'): 1 time
('low', 'e'): 2 times
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
Most Frequent Pair: ('e', 's') occurs 2 times (several pairs are tied at 2; we pick ('e', 's')).
Merge ‘e s’ → ‘es’
Update Corpus:
low </w>
low es t </w>
low e r </w>
n e w es t </w>
n e w e r </w>
Update Vocabulary: Add es to the vocabulary.
Iteration 4:
Count Pairs:
('low', '</w>'): 1 time
('low', 'es'): 1 time
('es', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
Most Frequent Pair: ('es', 't') occurs 2 times.
Merge ‘es t’ → ‘est’
Update Corpus:
low </w>
low est </w>
low e r </w>
n e w est </w>
n e w e r </w>
Continue Iterations
Proceed with the merging process, each time updating the corpus and vocabulary.
Final Vocabulary:
After several iterations, the vocabulary may include tokens like low, est, er, ne, and new.
Encoding New Words
When a new word appears (e.g., "lowest"), it is segmented by applying the learned merges, in the order they were learned, to its character sequence, producing tokens already in the vocabulary (here, low + est), as sketched below.
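A minimal sketch of this encoding step (the helper encode_word and the hard-coded merge list are illustrative, not part of any standard library):

def encode_word(word, merges):
    # Start from individual characters plus the end-of-word marker.
    symbols = list(word) + ['</w>']
    # Apply each learned merge, in the order it was learned.
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Merges learned in the walkthrough above.
merges = [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
print(encode_word('lowest', merges))  # ['low', 'est', '</w>']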
Python Implementation of BPE:
import collections
import re

def get_vocab(corpus):
    # Represent each word as space-separated symbols ending in </w>.
    vocab = collections.defaultdict(int)
    for word in corpus:
        symbols = ' '.join(word) + ' </w>'
        vocab[symbols] += 1
    return vocab

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair,
    # weighted by how often the word occurs.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the pair into a single symbol.
    # The lookarounds keep the match aligned to symbol boundaries,
    # so ('e', 's') cannot match inside a longer symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    new_vocab = {}
    for word, freq in vocab.items():
        new_vocab[pattern.sub(replacement, word)] = freq
    return new_vocab

corpus = ['low', 'lowest', 'lower', 'newest', 'newer']
num_merges = 10

vocab = get_vocab(corpus)
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f'Step {i + 1}: Merge {best}')
    print(f'Vocabulary: {list(vocab.keys())}\n')
Explanation of the Code
get_vocab represents each corpus word as space-separated symbols ending in </w>. get_stats counts how often each adjacent symbol pair occurs, weighted by word frequency. merge_vocab rewrites every word with the chosen pair fused into a single symbol. The main loop then repeatedly finds the most frequent pair and merges it, for up to num_merges iterations.
Sample Output
Step 1: Merge ('l', 'o')
Vocabulary: ['lo w </w>', 'lo w e s t </w>', 'lo w e r </w>', 'n e w e s t </w>', 'n e w e r </w>']
Step 2: Merge ('lo', 'w')
Vocabulary: ['low </w>', 'low e s t </w>', 'low e r </w>', 'n e w e s t </w>', 'n e w e r </w>']
Step 3: Merge ('low', 'e')
Vocabulary: ['low </w>', 'lowe s t </w>', 'lowe r </w>', 'n e w e s t </w>', 'n e w e r </w>']
From Step 3 onward several pairs are tied at count 2, so the merge order among them depends on how ties are broken; the manual walkthrough above picked ('e', 's') instead.
Natural Language Toolkit (NLTK) with Examples
NLTK is a powerful Python library for NLP tasks. Here are some key features explained with examples.
1. Access to Text Corpora
NLTK provides interfaces to various corpora.
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
# List available books
books = gutenberg.fileids()
print("Available Books:", books)
# Load text of a book
text = gutenberg.raw('austen-emma.txt')
print("First 500 characters:\n", text[:500])
2. Tokenization Tools
Splitting text into words or sentences.
nltk.download('punkt')  # models used by the word and sentence tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize
# Word Tokenization
words = word_tokenize("Hello, world! How are you?")
print("Word Tokens:", words)
# Sentence Tokenization
sentences = sent_tokenize("Hello, world! How are you?")
print("Sentence Tokens:", sentences)
3. Part-of-Speech Tagging
Assigning grammatical tags to words.
nltk.download('averaged_perceptron_tagger')
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)
4. Named Entity Recognition
Identifying named entities like person names, organizations, locations.
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Barack Obama was the 44th President of the United States."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
entities = nltk.ne_chunk(pos_tags)
print("Named Entities:", entities)
5. Sentiment Analysis Tools
NLTK ships sentiment resources rather than a single end-to-end pipeline: labeled corpora such as movie_reviews for training classifiers, and the rule-based VADER analyzer.
# Using NLTK's movie_reviews corpus for sentiment analysis
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
# Example: Loading positive and negative reviews
positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')
print("Number of positive reviews:", len(positive_reviews))
print("Number of negative reviews:", len(negative_reviews))
6. Parsing Tools
Analyzing grammatical structure.
from nltk import CFG, ChartParser
# Define a simple grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'sat'
""")
parser = ChartParser(grammar)
sentence = "the dog chased a cat".split()
for tree in parser.parse(sentence):
    print(tree)
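This grammar licenses exactly one parse:
(S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N cat))))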
7. Stemming and Lemmatization
Reducing words to their root form.
Stemming: Process of reducing words to their root form by removing suffixes.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
word_list = ["running", "ran", "runs", "easily", "fairly"]
stems = [stemmer.stem(word) for word in word_list]
print("Stems:", stems)
Output:
Stems: ['run', 'ran', 'run', 'easili', 'fairli']
Lemmatization: Converts words to their base form using vocabulary and morphological analysis.
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in word_list]
print("Lemmas:", lemmas)
Output:
Lemmas: ['running', 'ran', 'run', 'easily', 'fairly']
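By default, WordNetLemmatizer treats every word as a noun, which is why "running" comes back unchanged. Passing a part-of-speech tag changes the result:

# Lemmatize as verbs; words with no verb form are returned unchanged.
verb_lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_list]
print("Verb Lemmas:", verb_lemmas)
Output:
Verb Lemmas: ['run', 'run', 'run', 'easily', 'fairly']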