Introduction to Natural Language Processing: Byte Pair Encoding (BPE) and Natural Language Toolkit (NLTK)
RISHABH SINGH
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics. It focuses on enabling computers to understand, interpret, and generate human language in ways that are both meaningful and useful. The primary goal of NLP is to bridge the gap between human communication and computer understanding.
Key aspects of NLP include:
- Tokenization: splitting raw text into sentences and words
- Part-of-speech tagging: assigning grammatical categories to words
- Parsing: analyzing grammatical structure
- Named entity recognition: identifying people, organizations, and locations
- Sentiment analysis: detecting opinion and emotion in text
- Stemming and lemmatization: reducing words to their root or base forms
Byte Pair Encoding (BPE) Explained with an Example
Byte Pair Encoding (BPE) is a data compression technique that has been adapted for subword tokenization in NLP. It helps in breaking down words into smaller units called subword tokens.
Why BPE is Useful in NLP
- It handles rare and out-of-vocabulary words by splitting them into known subword units instead of mapping them to a single unknown token.
- It keeps the vocabulary compact: frequent words remain whole tokens, while rare words decompose into smaller pieces.
- It underlies the tokenizers of many modern language models, including the GPT family.
How BPE Works: Algorithm Steps
1. Split every word in the corpus into characters and append an end-of-word symbol </w>.
2. Count the frequency of every pair of adjacent symbols.
3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
4. Repeat steps 2-3 for a fixed number of merges, or until the vocabulary reaches a target size.
Let’s walk through a simple example.
Corpus:
low
lowest
lower
newest
newer
Step 1: Initialization
Break each word into a sequence of characters with an end-of-word symbol </w>.
l o w </w>
l o w e s t </w>
l o w e r </w>
n e w e s t </w>
n e w e r </w>
Vocabulary:
Initial vocabulary is all unique symbols:
{l, o, w, e, s, t, n, r, </w>}
Step 2: Iterative Merging
Iteration 1:
Count Pairs:
('l', 'o'): 3 times
('o', 'w'): 3 times
('w', '</w>'): 1 time
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
and so on.
Most Frequent Pair: ('l', 'o') occurs 3 times (tied with ('o', 'w'); ties are broken arbitrarily, and we merge one pair per iteration).
Merge ‘l o’ → ‘lo’
Update Corpus:
lo w </w>
lo w e s t </w>
lo w e r </w>
n e w e s t </w>
n e w e r </w>
Update Vocabulary: Add lo to the vocabulary.
Iteration 2:
Count Pairs:
('lo', 'w'): 3 times
('w', '</w>'): 1 time
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
and so on.
Most Frequent Pair: 'lo w' occurs 3 times.
Merge ‘lo w’ → ‘low’
Update Corpus:
low </w>
low e s t </w>
low e r </w>
n e w e s t </w>
n e w e r </w>
Update Vocabulary: Add low to the vocabulary.
Iteration 3:
Count Pairs:
('low', '</w>'): 1 time
('low', 'e'): 2 times
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
Most Frequent Pair: ('e', 's') occurs 2 times (several pairs are tied at 2; we pick ('e', 's')).
Merge ‘e s’ → ‘es’
Update Corpus:
low </w>
low es t </w>
low e r </w>
n e w es t </w>
n e w e r </w>
Update Vocabulary: Add es to the vocabulary.
Iteration 4:
Count Pairs:
('low', '</w>'): 1 time
('low', 'es'): 1 time
('es', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times
Most Frequent Pair: ('es', 't') occurs 2 times.
Merge ‘es t’ → ‘est’
Update Corpus:
low </w>
low est </w>
low e r </w>
n e w est </w>
n e w e r </w>
Continue Iterations
Proceed with the merging process, each time updating the corpus and vocabulary.
Final Vocabulary:
After several iterations, the vocabulary may include tokens like low, est, er, ne, and new.
Encoding New Words
When a new word appears (e.g., "lowest"), it is segmented by applying the learned merges, in the order they were learned, to its character sequence, producing tokens already in the vocabulary (here, low + est), as sketched below.
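A minimal sketch of this encoding step (the helper encode_word and the hard-coded merge list are illustrative, not part of any standard library):

def encode_word(word, merges):
    # Start from individual characters plus the end-of-word marker.
    symbols = list(word) + ['</w>']
    # Apply each learned merge, in the order it was learned.
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Merges learned in the walkthrough above.
merges = [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
print(encode_word('lowest', merges))  # ['low', 'est', '</w>']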
Python Implementation of BPE:
import collections
import re

def get_vocab(corpus):
    # Represent each word as space-separated symbols ending in </w>.
    vocab = collections.defaultdict(int)
    for word in corpus:
        symbols = ' '.join(word) + ' </w>'
        vocab[symbols] += 1
    return vocab

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair,
    # weighted by how often the word occurs.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the pair into a single symbol.
    # The lookarounds keep the match aligned to symbol boundaries,
    # so ('e', 's') cannot match inside a longer symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    new_vocab = {}
    for word, freq in vocab.items():
        new_vocab[pattern.sub(replacement, word)] = freq
    return new_vocab

corpus = ['low', 'lowest', 'lower', 'newest', 'newer']
num_merges = 10

vocab = get_vocab(corpus)
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f'Step {i + 1}: Merge {best}')
    print(f'Vocabulary: {list(vocab.keys())}\n')
Explanation of the Code
get_vocab represents each corpus word as space-separated symbols ending in </w>. get_stats counts how often each adjacent symbol pair occurs, weighted by word frequency. merge_vocab rewrites every word with the chosen pair fused into a single symbol. The main loop then repeatedly finds the most frequent pair and merges it, for up to num_merges iterations.
Sample Output
Step 1: Merge ('l', 'o')
Vocabulary: ['lo w </w>', 'lo w e s t </w>', 'lo w e r </w>', 'n e w e s t </w>', 'n e w e r </w>']
Step 2: Merge ('lo', 'w')
Vocabulary: ['low </w>', 'low e s t </w>', 'low e r </w>', 'n e w e s t </w>', 'n e w e r </w>']
Step 3: Merge ('low', 'e')
Vocabulary: ['low </w>', 'lowe s t </w>', 'lowe r </w>', 'n e w e s t </w>', 'n e w e r </w>']
From Step 3 onward several pairs are tied at count 2, so the merge order among them depends on how ties are broken; the manual walkthrough above picked ('e', 's') instead.
Natural Language Toolkit (NLTK) with Examples
NLTK is a powerful Python library for NLP tasks. Here are some key features explained with examples.
1. Access to Text Corpora
NLTK provides interfaces to various corpora.
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
# List available books
books = gutenberg.fileids()
print("Available Books:", books)
# Load text of a book
text = gutenberg.raw('austen-emma.txt')
print("First 500 characters:\n", text[:500])
2. Tokenization Tools
Splitting text into words or sentences.
nltk.download('punkt')  # models used by the word and sentence tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize
# Word Tokenization
words = word_tokenize("Hello, world! How are you?")
print("Word Tokens:", words)
# Sentence Tokenization
sentences = sent_tokenize("Hello, world! How are you?")
print("Sentence Tokens:", sentences)
3. Part-of-Speech Tagging
Assigning grammatical tags to words.
nltk.download('averaged_perceptron_tagger')
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)
4. Named Entity Recognition
Identifying named entities like person names, organizations, locations.
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Barack Obama was the 44th President of the United States."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
entities = nltk.ne_chunk(pos_tags)
print("Named Entities:", entities)
5. Sentiment Analysis Tools
NLTK ships sentiment resources rather than a single end-to-end pipeline: labeled corpora such as movie_reviews for training classifiers, and the rule-based VADER analyzer.
# Using NLTK's movie_reviews corpus for sentiment analysis
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
# Example: Loading positive and negative reviews
positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')
print("Number of positive reviews:", len(positive_reviews))
print("Number of negative reviews:", len(negative_reviews))
6. Parsing Tools
Analyzing grammatical structure.
from nltk import CFG, ChartParser
# Define a simple grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'sat'
""")
parser = ChartParser(grammar)
sentence = "the dog chased a cat".split()
for tree in parser.parse(sentence):
    print(tree)
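This grammar licenses exactly one parse:
(S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N cat))))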
7. Stemming and Lemmatization
Reducing words to their root form.
Stemming: Process of reducing words to their root form by removing suffixes.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
word_list = ["running", "ran", "runs", "easily", "fairly"]
stems = [stemmer.stem(word) for word in word_list]
print("Stems:", stems)
Output:
Stems: ['run', 'ran', 'run', 'easili', 'fairli']
Lemmatization: Converts words to their base form using vocabulary and morphological analysis.
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in word_list]
print("Lemmas:", lemmas)
Output:
Lemmas: ['running', 'ran', 'run', 'easily', 'fairly']
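By default, WordNetLemmatizer treats every word as a noun, which is why "running" comes back unchanged. Passing a part-of-speech tag changes the result:

# Lemmatize as verbs; words with no verb form are returned unchanged.
verb_lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_list]
print("Verb Lemmas:", verb_lemmas)
Output:
Verb Lemmas: ['run', 'run', 'run', 'easily', 'fairly']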