Introduction to Natural Language Processing: Byte Pair Encoding (BPE) and Natural Language Toolkit (NLTK)

Natural language processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics. It focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful. The primary goal of NLP is to bridge the gap between human communication and computer understanding.

Key aspects of NLP include:

  1. Text analysis: Breaking down and understanding the structure of written text.
  2. Speech recognition: Converting spoken language into written text.
  3. Machine translation: Translating text from one language to another.
  4. Sentiment analysis: Determining the emotional tone behind a piece of text.
  5. Named entity recognition: Identifying and classifying named entities (e.g., person names, organizations) in text.
  6. Question answering: Developing systems that can automatically answer questions posed in natural language.

Byte Pair Encoding (BPE) Explained with an Example

Byte Pair Encoding (BPE) is a data compression technique that has been adapted for subword tokenization in NLP. It helps in breaking down words into smaller units called subword tokens.

Why BPE is Useful in NLP

  • Handling Out-of-Vocabulary Words: By breaking words into subwords, models can understand and generate words not seen during training.
  • Morphological Representation: Captures prefixes, suffixes, and root words, helping in understanding word structures.
  • Efficiency: Reduces the vocabulary size, making the model more efficient without losing significant information.

How BPE Works: Algorithm Steps

  1. Split every word in the corpus into characters and append an end-of-word symbol </w>; the initial vocabulary is the set of unique characters.
  2. Count the frequency of every adjacent pair of symbols across the corpus.
  3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
  4. Repeat steps 2 and 3 for a fixed number of merges, or until a target vocabulary size is reached.

A runnable Python implementation of these steps appears after the worked example below.

Let’s walk through a simple example.

Corpus:

low
lowest
lower
newest
newer        

Step 1: Initialization

Break each word into a sequence of characters with an end-of-word symbol </w>.

l o w </w>
l o w e s t </w>
l o w e r </w>
n e w e s t </w>
n e w e r </w>        
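These rows can be generated in a couple of lines of Python (using the five-word corpus above):

for word in ['low', 'lowest', 'lower', 'newest', 'newer']:
    print(' '.join(word) + ' </w>')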

Vocabulary:

Initial vocabulary is all unique symbols:

{l, o, w, e, s, t, n, r, </w>}        

Step 2: Iterative Merging

Iteration 1:

Count Pairs:

('w', 'e'): 4 times
('l', 'o'): 3 times
('o', 'w'): 3 times
('e', 's'): 2 times
('s', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('w', '</w>'): 1 time
and so on.

Most Frequent Pair: 'w e' occurs 4 times (once each in lowest, lower, newest, and newer).

Merge 'w e' → 'we'

Update Corpus:

l o w </w>
l o we s t </w>
l o we r </w>
n e we s t </w>
n e we r </w>

Update Vocabulary: Add we to the vocabulary.
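You can verify these pair counts with a few lines of Python (a quick check using collections.Counter on the Step 1 corpus):

import collections

step1_corpus = ['l o w </w>', 'l o w e s t </w>', 'l o w e r </w>',
                'n e w e s t </w>', 'n e w e r </w>']
pair_counts = collections.Counter()
for word in step1_corpus:
    symbols = word.split()
    pair_counts.update(zip(symbols, symbols[1:]))

print(pair_counts.most_common(3))
# [(('w', 'e'), 4), (('l', 'o'), 3), (('o', 'w'), 3)]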

Iteration 2:

Count Pairs:

('l', 'o'): 3 times
('o', 'we'): 2 times
('we', 's'): 2 times
('s', 't'): 2 times
('we', 'r'): 2 times
('n', 'e'): 2 times
('e', 'we'): 2 times
and so on.

Most Frequent Pair: 'l o' occurs 3 times.

Merge 'l o' → 'lo'

Update Corpus:

lo w </w>
lo we s t </w>
lo we r </w>
n e we s t </w>
n e we r </w>

Update Vocabulary: Add lo to the vocabulary.

Iteration 3:

Count Pairs:

('lo', 'w'): 1 time
('lo', 'we'): 2 times
('we', 's'): 2 times
('s', 't'): 2 times
('we', 'r'): 2 times
('n', 'e'): 2 times
('e', 'we'): 2 times
and so on.

Most Frequent Pair: several pairs now tie at 2 occurrences. Tie-breaking is implementation-dependent; the code below takes the first maximal pair in iteration order, which is ('lo', 'we').

Merge 'lo we' → 'lowe'

Update Corpus:

lo w </w>
lowe s t </w>
lowe r </w>
n e we s t </w>
n e we r </w>

Update Vocabulary: Add lowe to the vocabulary.

Iteration 4:

('low', '</w>'): 1 time
('low', 'es'): 2 times
('es', 't'): 2 times
('e', 'r'): 2 times
('n', 'e'): 2 times
('e', 'w'): 2 times        

Most Frequent Pair: ('es', 't') occurs 2times.

Merge ‘es t’ → ‘est’

Update Corpus:

low </w>
low est </w>
low e r </w>
n e w est </w>
n e w e r </w>        

Continue Iterations

Proceed with the merging process, each time updating the corpus and vocabulary.

Final Vocabulary:

With the implementation below (ten merges, ties broken by iteration order), the learned tokens are we, lo, lowe, st, st</w>, r</w>, ne, newe, low, and low</w>. A different tie-breaking rule would produce a different, equally valid vocabulary.

Encoding New Words

When a word that was not in the training corpus appears (e.g., “slower”), it is segmented by replaying the learned merges over its character sequence:

  • “slower” → s, lowe, r</w>
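To make this step concrete, here is a minimal sketch of the encoding procedure (the function bpe_encode and the hard-coded merge list are illustrative, taken from the merges learned above):

def bpe_encode(word, merges):
    # Replay the learned merges, in the order they were learned,
    # over the word's character sequence
    symbols = list(word) + ['</w>']
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the pair in place
            else:
                i += 1
    return symbols

# Merge list from the walkthrough above
merges = [('w', 'e'), ('l', 'o'), ('lo', 'we'), ('s', 't'),
          ('st', '</w>'), ('r', '</w>')]
print(bpe_encode('slower', merges))  # ['s', 'lowe', 'r</w>']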

Python Implementation of BPE:

import collections
import re

def get_vocab(corpus):
    # Split each word into characters and append the end-of-word symbol
    vocab = collections.defaultdict(int)
    for word in corpus:
        symbols = ' '.join(word) + ' </w>'
        vocab[symbols] += 1
    return vocab

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the pair, matching only at symbol
    # boundaries so that e.g. merging ('o', 'w') cannot corrupt the
    # token 'lo' in 'lo w'
    new_vocab = {}
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    for word in vocab:
        new_word = pattern.sub(replacement, word)
        new_vocab[new_word] = vocab[word]
    return new_vocab

corpus = ['low', 'lowest', 'lower', 'newest', 'newer']
num_merges = 10
vocab = get_vocab(corpus)

for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    # Most frequent pair; ties are broken by iteration order
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f'Step {i+1}: Merge {best}')
    print(f'Vocabulary: {list(vocab.keys())}\n')

Explanation of the Code

  • get_vocab(corpus): Prepares the initial vocabulary by splitting words into individual characters and adding an end-of-word symbol.
  • get_stats(vocab): Counts frequency of adjacent symbol pairs in the vocabulary.
  • merge_vocab(pair, vocab): Merges every occurrence of the chosen pair into a single symbol, matching only at symbol boundaries so existing tokens are not corrupted.
  • Main Loop: Performs a specified number of merges, updating the vocabulary each time.

Sample Output

Step 1: Merge ('w', 'e')
Vocabulary: ['l o w </w>', 'l o we s t </w>', 'l o we r </w>', 
'n e we s t </w>', 'n e we r </w>']

Step 2: Merge ('l', 'o')
Vocabulary: ['lo w </w>', 'lo we s t </w>', 'lo we r </w>', 
'n e we s t </w>', 'n e we r </w>']

Step 3: Merge ('lo', 'we')
Vocabulary: ['lo w </w>', 'lowe s t </w>', 'lowe r </w>', 
'n e we s t </w>', 'n e we r </w>']

Later merges continue with ('s', 't'), ('st', '</w>'), and so on; when several pairs tie, max() returns the first maximal pair in iteration order.

Natural Language Toolkit (NLTK) with Examples

NLTK is a powerful Python library for NLP tasks. Here are some key features explained with examples.

1. Access to Text Corpora

NLTK provides interfaces to various corpora.

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

# List available books
books = gutenberg.fileids()
print("Available Books:", books)

# Load text of a book
text = gutenberg.raw('austen-emma.txt')
print("First 500 characters:\n", text[:500])        

2. Tokenization Tools

Splitting text into words or sentences.

nltk.download('punkt')  # tokenizer models required by word_tokenize and sent_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize

# Word Tokenization
words = word_tokenize("Hello, world! How are you?")
print("Word Tokens:", words)

# Sentence Tokenization
sentences = sent_tokenize("Hello, world! How are you?")
print("Sentence Tokens:", sentences)        

3. Part-of-Speech Tagging

Assigning grammatical tags to words.

nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)        

4. Named Entity Recognition

Identifying named entities like person names, organizations, locations.

nltk.download('maxent_ne_chunker')
nltk.download('words') 

sentence = "Barack Obama was the 44th President of the United States."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
entities = nltk.ne_chunk(pos_tags)
print("Named Entities:", entities)        

5. Sentiment Analysis Tools

NLTK ships a ready-made sentiment analyzer (VADER) as well as labeled corpora, such as movie_reviews, for training your own classifier.

# Using NLTK's movie_reviews corpus for sentiment analysis
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Example: Loading positive and negative reviews
positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')

print("Number of positive reviews:", len(positive_reviews))
print("Number of negative reviews:", len(negative_reviews))        

6. Parsing Tools

Analyzing grammatical structure.

from nltk import CFG, ChartParser

# Define a simple grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'sat'
""")

parser = ChartParser(grammar)
sentence = "the dog chased a cat".split()
for tree in parser.parse(sentence):
    print(tree)        

7. Stemming and Lemmatization

Reducing words to their root form.

Stemming: Process of reducing words to their root form by removing suffixes.

  • “Studies” → “studi”
  • “Studying” → “studi”
  • Pros: Simple and fast.
  • Cons: May produce non-real words (“studi” instead of “study”).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word_list = ["running", "ran", "runs", "easily", "fairly"]
stems = [stemmer.stem(word) for word in word_list]
print("Stems:", stems)        

Output:

Stems: ['run', 'ran', 'run', 'easili', 'fairli']        

Lemmatization: Converts words to their base form using vocabulary and morphological analysis.

  • “Studies” → “study”
  • “Studying” → “study”
  • Pros: Produces actual words.
  • Cons: Slower, requires more resources.

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in word_list]
print("Lemmas:", lemmas)        

Output:

Lemmas: ['running', 'ran', 'run', 'easily', 'fairly']        
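By default, lemmatize() treats every word as a noun, which is why "running" and "ran" pass through unchanged above. Passing a part-of-speech hint changes the result (a small illustration using the same word_list):

# pos='v' lemmatizes each word as a verb; the default pos is 'n' (noun)
verb_lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_list]
print("Verb Lemmas:", verb_lemmas)
# Verb Lemmas: ['run', 'run', 'run', 'easily', 'fairly']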

