Preface
With the arrival of ChatGPT in late 2022 and GPT-4 in early 2023, interest in natural language processing (NLP), including large language models (LLMs), has surged. You will find this book very helpful if you are picking it up hoping to get started with NLP, to learn and build with the NLP techniques that have matured over the past few decades, or to understand the differences between pre-LLM and LLM techniques. Over four decades of NLP development, many commercial NLP products have been built on pre-LLM techniques, such as Word2Vec, Doc2Vec, Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Ensemble LDA.
With the help of this book, you will not only get started building NLP models but also acquire background knowledge of LLMs. We believe the concepts covered in this book form a necessary bridge for anyone who is new to NLP, wants to build NLP products, or wants to learn about LLMs.
Why read this book?
To help you learn fundamental NLP concepts and build your own NLP applications, we start with the concepts and techniques that power commercial NLP applications. This guide covers both theory and code practice, and it presents NLP topics so that beginners and experienced data scientists alike can benefit from it.
Many of the techniques mentioned earlier, such as Word2Vec, Doc2Vec, LSA, LDA, and Ensemble LDA, are included in the Python Gensim library. Gensim is an open source Python library widely used by NLP researchers and developers, often alongside other open source NLP modules, including NLTK, Scikit-learn, and spaCy. We will learn how to build models using these modules. In addition, you will learn about the Transformer-based topic modeling technique BERTopic in a separate chapter, and see a BERTopic use case in the final chapter on NLP use cases.
You will also practice implementing your models for scoring and prediction. This implementation perspective enables you to work closely with data engineers on model deployment. We conclude the book with a study of selected large-scale NLP use cases, which we believe will inspire you to build your own NLP applications.
What is Gensim?
New NLP learners will find the Gensim library cited in many tutorials. Gensim is an open source Python library for processing unstructured text with unsupervised machine learning algorithms. It was first created by Radim Řehůřek in 2011 and is now continually developed and maintained by more than 400 contributors. It has been used in over 2,000 research papers and student theses.
One of Gensim’s merits is its fast execution speed. Gensim attributes this advantage to its use of low-level BLAS libraries through NumPy, highly optimized Fortran/C code, and multithreading under the hood. Memory independence is also one of its design objectives: Gensim supports data streaming, so large corpora can be processed without loading the whole training corpus into RAM.
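As a minimal sketch of that streaming idiom (the file name and tokenization here are placeholders), a Gensim corpus can be any iterable that yields one tokenized document at a time:
from gensim import corpora

class StreamingCorpus:
    """Yields one tokenized document at a time; the full corpus never sits in RAM."""
    def __iter__(self):
        with open("my_corpus.txt") as f:   # placeholder: one document per line
            for line in f:
                yield line.lower().split()

dictionary = corpora.Dictionary(StreamingCorpus())                   # one pass to build the vocabulary
bow_stream = (dictionary.doc2bow(doc) for doc in StreamingCorpus()) # lazy Bag-of-Words stream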
Who this book is for
This book does not assume prior knowledge of linguistics or NLP techniques, so it is suitable for anyone who wants to learn NLP. Data scientists and professionals who want to develop NLP applications will also find it helpful. If you are an NLP practitioner, you can use this book as a code reference while working on your projects. Students taking an upper-level NLP course can also use this book.
What this book covers
Chapter 1, Introduction to NLP, is an introductory chapter that traces the development from Natural Language Understanding (NLU) and Natural Language Generation (NLG) to NLP. It outlines the core techniques, including text pre-processing, LSA/LSI, Word2Vec, Doc2Vec, LDA, Ensemble LDA, and BERTopic, and presents the open source NLP modules Gensim, Scikit-learn, and spaCy.
Chapter 2, Text Representation, starts with the basic step of text representation. It explains the progression from one-hot encoding to Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). It demonstrates how to perform BoW and TF-IDF with Gensim, Scikit-learn, and NLTK.
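As a taste of what that looks like in Gensim, here is a minimal sketch with a toy two-document corpus:
from gensim import corpora, models

texts = [["human", "machine", "interface"], ["graph", "survey", "trees"]]  # toy corpus
dictionary = corpora.Dictionary(texts)               # maps each word to an integer id
bow = [dictionary.doc2bow(text) for text in texts]   # Bag-of-Words vectors
tfidf = models.TfidfModel(bow)                       # reweights raw counts by TF-IDF
print(tfidf[bow[0]])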
Chapter 3, Text Wrangling and Preprocessing, presents the essential text pre-processing tasks: (a) tokenization, (b) lowercase conversion, (c) stop words removal, (d) punctuation removal, (e) stemming, and (f) lemmatization. It guides you to perform the pre-processing tasks with Gensim, spaCy, and NLTK.
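As a preview, Gensim bundles a default pipeline that covers most of these steps in one call (a minimal sketch):
from gensim.parsing.preprocessing import preprocess_string

# The default filters strip tags and punctuation, lowercase, remove stop words,
# drop short tokens and numbers, and stem what remains.
print(preprocess_string("The striped bats are hanging on their feet."))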
Chapter 4, Latent Semantic Analysis with scikit-learn, presents the theory of LSA/LSI. This chapter introduces Singular Value Decomposition (SVD), Truncated SVD, and Truncated SVD’s application to LSA/LSI. It uses Scikit-learn to illustrate explicitly how Truncated SVD becomes LSA/LSI.
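A minimal sketch of that pipeline in scikit-learn (the documents are toy placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell on trade news"]           # toy documents
X = TfidfVectorizer().fit_transform(docs)      # TF-IDF document-term matrix
lsa = TruncatedSVD(n_components=2)             # Truncated SVD applied to TF-IDF is LSA
doc_topics = lsa.fit_transform(X)              # documents in the latent semantic space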
Chapter 5, Cosine Similarity, is dedicated to explaining this fundamental measure in NLP. Cosine similarity, alongside other metrics such as Euclidean distance and Manhattan distance, measures the similarity between embedded data in the vector space. This chapter also covers applications of cosine similarity to image comparison and querying.
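The measure itself is just the cosine of the angle between two vectors; a minimal NumPy sketch:
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| ||b||); 1 means same direction, 0 means orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([2.0, 4.0, 6.0])))   # 1.0: parallel vectors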
Chapter 6, Latent Semantic Indexing with Gensim, builds an LSA/LSI model with Gensim. This chapter introduces the coherence score, which determines the optimal number of topics. It shows how to score new documents using cosine similarity, building toward an information retrieval tool.
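A minimal sketch of those two steps, assuming the dictionary, BoW corpus bow, and tokenized texts from the earlier sketches:
from gensim.models import LsiModel, CoherenceModel

lsi = LsiModel(bow, id2word=dictionary, num_topics=5)   # fit LSI on the BoW corpus
coherence = CoherenceModel(model=lsi, texts=texts,
                           dictionary=dictionary, coherence="c_v")
print(coherence.get_coherence())   # higher usually indicates better-separated topics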
Chapter 7, Using Word2Vec, introduces the milestone Word2Vec technique and its two neural network architectural variations: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). It illustrates the concept and mechanics of word embedding in the vector space. It guides you through building a Word2Vec model and preparing it as part of an information retrieval tool, and it visualizes the word vectors of a Word2Vec model with t-SNE and TensorBoard (by TensorFlow). The chapter ends by comparing Word2Vec with Doc2Vec, GloVe, and FastText.
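A minimal Word2Vec sketch with toy sentences (sg=1 selects Skip-Gram; sg=0 selects CBOW):
from gensim.models import Word2Vec

sentences = [["king", "queen", "palace"],
             ["cat", "dog", "pet"]]                        # toy corpus
model = Word2Vec(sentences, window=5, min_count=1, sg=1)   # sg=1: Skip-Gram
print(model.wv.most_similar("king", topn=2))               # nearest words in vector space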
Chapter 8, Doc2Vec with Gensim, presents the evolution from Word2Vec to Doc2Vec. It details the two neural network architectural variations: Paragraph Vector with Distributed Bag-of-Words (PV-DBOW) and Paragraph Vector with Distributed Memory (PV-DM). It guides you through building a Doc2Vec model and preparing it as part of an information retrieval tool.
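A minimal Doc2Vec sketch with toy documents (dm=0 selects PV-DBOW; dm=1 selects PV-DM):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["cats", "chase", "mice"], tags=[0]),
        TaggedDocument(words=["dogs", "chase", "cats"], tags=[1])]
model = Doc2Vec(docs, dm=0, min_count=1, epochs=20)    # dm=0: PV-DBOW
vector = model.infer_vector(["dogs", "and", "cats"])   # embed an unseen document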
Chapter 9, Understanding Discrete Distributions, introduces the discrete distribution family, including the Bernoulli, binomial, multinomial, beta, and Dirichlet distributions. Because the complex distributions generalize the simple ones, this sequence helps you understand the Dirichlet distribution. The fact that ‘Dirichlet’ appears in the name of LDA tells us its significance; this chapter prepares you to understand LDA in the next chapter.
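To make the Dirichlet distribution concrete: a single draw from it is itself a probability vector, which is why LDA uses it as a prior over topic mixtures. A minimal NumPy sketch:
import numpy as np

sample = np.random.dirichlet(alpha=[0.1] * 5)  # 5 non-negative numbers summing to 1
print(sample, sample.sum())                    # small alpha concentrates mass on few components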
Chapter 10, Latent Dirichlet Allocation, presents the LDA algorithm, including the structural design of LDA, generative modeling, and Variational Expectation-Maximization.
Chapter 11, LDA Modeling, demonstrates how to build an LDA model, perform hyperparameter tuning, and determine the optimal number of topics. You will learn the steps to apply an LDA model to score new documents as part of an information retrieval tool.
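A minimal LDA sketch, again assuming the dictionary and BoW corpus bow from the earlier sketches:
from gensim.models import LdaModel

lda = LdaModel(bow, id2word=dictionary, num_topics=10,
               passes=10, random_state=42)            # fixed seed for reproducibility
print(lda.print_topics(num_topics=3, num_words=5))    # top words per topic
new_bow = dictionary.doc2bow(["graph", "trees"])      # score an unseen document
print(lda[new_bow])                                   # its topic distribution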
Chapter 12, LDA Visualization, presents visualization for LDA. The chapter starts with a design-thinking approach to the rich content of a topic model, then shows how to use pyLDAvis for visualization.
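A minimal sketch, assuming the LDA model lda, corpus bow, and dictionary from the previous sketch (the Gensim helper module is named pyLDAvis.gensim in older pyLDAvis releases and pyLDAvis.gensim_models from version 3.x onward):
import pyLDAvis
import pyLDAvis.gensim_models   # pyLDAvis.gensim in older releases

panel = pyLDAvis.gensim_models.prepare(lda, bow, dictionary)  # interactive topic map
pyLDAvis.save_html(panel, "lda_visualization.html")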
Chapter 13, The Ensemble LDA for Model Stability, investigates the root causes of LDA’s instability. It explains the ensemble approach to LDA and the use of Checkback DBSCAN, a clustering algorithm, to deliver a stable set of topics.
Chapter 14, LDA and BERTopic, presents the BERTopic modeling technique, which uses the LLM-based BERT algorithm for word embeddings, UMAP for dimensionality reduction of the embeddings, HDBSCAN for topic clustering, c-TF-IDF for representing topics with words, and MMR to fine-tune the word representations of topics. It guides you through BERTopic modeling, visualization, and scoring new documents for topics.
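A minimal BERTopic sketch (docs is a placeholder list of raw document strings; BERTopic needs a reasonably large corpus to cluster, and it downloads a sentence-transformer embedding model on first use):
from bertopic import BERTopic

# docs: a list of raw strings, e.g. news articles
topic_model = BERTopic()                   # UMAP + HDBSCAN + c-TF-IDF under the hood
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())        # one row per discovered topic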
Chapter 15, Real-World Use Cases, presents seven NLP projects in the healthcare, medical, legal, finance, and social media domains. By studying these NLP solutions, you will be ready to apply the code notebooks of this book to similar jobs or to your future applications.
Download the example code files
The Python notebooks are available for download at https://github.com/PacktPublishing/The-Handbook-of-NLP-with-Gensim. If there’s an update to the code, it will be updated in the GitHub repository. You are encouraged to use Google Colab, a free Jupyter Notebook environment that runs entirely in the cloud and comes with popular machine learning libraries such as pandas, NumPy, TensorFlow, Keras, and OpenCV pre-installed.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Data for this book
The AG’s corpus of news articles, made public by A. Gulli, is a collection of more than 1 million news articles from more than 2,000 news sources. Zhang, Zhao, and LeCun sampled news articles from the “World”, “Sports”, “Business”, and “Science” categories. The resulting dataset, ag_news, is frequently used and is available on Kaggle, PyTorch, Hugging Face, and TensorFlow. There are 120,000 news articles in the training sample and 7,600 in the testing sample. This dataset is used throughout the book.
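For instance, here is a minimal sketch of loading it through the Hugging Face datasets library, one of the sources listed above:
from datasets import load_dataset

ag_news = load_dataset("ag_news")   # 120,000 training and 7,600 testing articles
print(ag_news["train"][0])          # a dict with 'text' and a 'label' from 0 to 3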
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”
A block of code is set as follows:
from gensim.summarization import keywords
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold.
Any command-line input or output is written as follows:
pip install gensim==3.8.3
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Without natural language processing (NLP) tools, the marketing team can only do basic operations with these text messages and data.”
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.