Preface
With the arrival of ChatGPT in late 2022 and GPT-4 in early 2023, interest in natural language processing (NLP), including large language models (LLMs), has surged. You will find this book very helpful if you are picking it up hoping to get started with NLP, to learn and build with the NLP techniques that have matured over the past few decades, or to understand the differences between pre-LLM and LLM techniques. Over four decades of NLP development, many commercial NLP products have been built on pre-LLM techniques, such as Word2Vec, Doc2Vec, Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Ensemble LDA.
With the help of this book, you will not only get started building NLP models but also acquire background knowledge of LLMs. We believe the concepts covered in this book form a necessary bridge for anyone who is new to NLP, wants to build NLP products, or wants to learn about LLMs.
Why read this book?
To help you learn fundamental NLP concepts and build your own NLP applications, we start with the concepts and techniques that power commercial NLP applications. This guide covers both theory and code practice, and it presents NLP topics so that beginners and experienced data scientists alike can benefit from it.
Many of the techniques mentioned earlier, such as Word2Vec, Doc2Vec, LSA, LDA, and Ensemble LDA, are included in the Python Gensim library. Gensim is an open source Python library widely used by NLP researchers and developers, often alongside other open source NLP modules, including NLTK, Scikit-learn, and spaCy. We will learn how to build models using these modules. In addition, you will learn about the Transformer-based topic modeling technique BERTopic in a separate chapter, and see a BERTopic use case in the final chapter on NLP use cases.
You will also practice implementing your models for scoring and prediction. This implementation perspective enables you to work closely with data engineers on model deployment. We conclude the book with a study of selected large-scale NLP use cases, which we believe will inspire you to build your own NLP applications.
What is Gensim?
New NLP learners will find the Gensim library cited in many tutorials. Gensim is an open source Python library for processing unstructured text with unsupervised machine learning algorithms. It was first created by Radim Řehůřek in 2011 and is now continually developed and maintained by more than 400 contributors. It has been used in over 2,000 research papers and student theses.
One of Gensim’s merits is its fast execution speed. Gensim attributes this advantage to its use of low-level BLAS libraries through NumPy, highly optimized Fortran/C code, and multithreading under the hood. Memory independence is also one of its design objectives: Gensim supports data streaming, so large corpora can be processed without loading the whole training corpus into RAM.
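As a minimal sketch of that streaming idiom (the file name and tokenization here are placeholders), a Gensim corpus can be any iterable that yields one tokenized document at a time:
from gensim import corpora

class StreamingCorpus:
    """Yields one tokenized document at a time; the full corpus never sits in RAM."""
    def __iter__(self):
        with open("my_corpus.txt") as f:   # placeholder: one document per line
            for line in f:
                yield line.lower().split()

dictionary = corpora.Dictionary(StreamingCorpus())                   # one pass to build the vocabulary
bow_stream = (dictionary.doc2bow(doc) for doc in StreamingCorpus()) # lazy Bag-of-Words stream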
Who this book is for
This book does not assume prior knowledge of linguistics or NLP techniques, so it is suitable for anyone who wants to learn NLP. Data scientists and professionals who want to develop NLP applications will also find it helpful. If you are an NLP practitioner, you can use this book as a code reference while working on your projects. Students taking an upper-level NLP course can also use this book.
What this book covers
Chapter 1, Introduction to NLP, is an introductory chapter that traces the development from Natural Language Understanding (NLU) and Natural Language Generation (NLG) to NLP. It outlines the core techniques, including text pre-processing, LSA/LSI, Word2Vec, Doc2Vec, LDA, Ensemble LDA, and BERTopic, and presents the open source NLP modules Gensim, Scikit-learn, and spaCy.
Chapter 2, Text Representation, starts with the basic step of text representation. It explains the progression from one-hot encoding to Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). It demonstrates how to perform BoW and TF-IDF with Gensim, Scikit-learn, and NLTK.
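As a taste of what that looks like in Gensim, here is a minimal sketch with a toy two-document corpus:
from gensim import corpora, models

texts = [["human", "machine", "interface"], ["graph", "survey", "trees"]]  # toy corpus
dictionary = corpora.Dictionary(texts)               # maps each word to an integer id
bow = [dictionary.doc2bow(text) for text in texts]   # Bag-of-Words vectors
tfidf = models.TfidfModel(bow)                       # reweights raw counts by TF-IDF
print(tfidf[bow[0]])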
Chapter 3, Text Wrangling and Preprocessing, presents the essential text pre-processing tasks: (a) tokenization, (b) lowercase conversion, (c) stop words removal, (d) punctuation removal, (e) stemming, and (f) lemmatization. It guides you to perform the pre-processing tasks with Gensim, spaCy, and NLTK.
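As a preview, Gensim bundles a default pipeline that covers most of these steps in one call (a minimal sketch):
from gensim.parsing.preprocessing import preprocess_string

# The default filters strip tags and punctuation, lowercase, remove stop words,
# drop short tokens and numbers, and stem what remains.
print(preprocess_string("The striped bats are hanging on their feet."))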
Chapter 4, Latent Semantic Analysis with scikit-learn, presents the theory of LSA/LSI. This chapter introduces Singular Value Decomposition (SVD), Truncated SVD, and Truncated SVD’s application to LSA/LSI. It uses Scikit-learn to illustrate explicitly how Truncated SVD becomes LSA/LSI.
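A minimal sketch of that pipeline in scikit-learn (the documents are toy placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell on trade news"]           # toy documents
X = TfidfVectorizer().fit_transform(docs)      # TF-IDF document-term matrix
lsa = TruncatedSVD(n_components=2)             # Truncated SVD applied to TF-IDF is LSA
doc_topics = lsa.fit_transform(X)              # documents in the latent semantic space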
Chapter 5, Cosine Similarity, is dedicated to explaining this fundamental measure in NLP. Cosine similarity, alongside other metrics such as Euclidean distance and Manhattan distance, measures the similarity between embedded data in the vector space. This chapter also covers applications of cosine similarity to image comparison and querying.
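The measure itself is just the cosine of the angle between two vectors; a minimal NumPy sketch:
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| ||b||); 1 means same direction, 0 means orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([2.0, 4.0, 6.0])))   # 1.0: parallel vectors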
Chapter 6, Latent Semantic Indexing with Gensim, builds an LSA/LSI model with Gensim. This chapter introduces the coherence score, which determines the optimal number of topics. It shows how to score new documents using cosine similarity, building toward an information retrieval tool.
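A minimal sketch of those two steps, assuming the dictionary, BoW corpus bow, and tokenized texts from the earlier sketches:
from gensim.models import LsiModel, CoherenceModel

lsi = LsiModel(bow, id2word=dictionary, num_topics=5)   # fit LSI on the BoW corpus
coherence = CoherenceModel(model=lsi, texts=texts,
                           dictionary=dictionary, coherence="c_v")
print(coherence.get_coherence())   # higher usually indicates better-separated topics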
Chapter 7, Using Word2Vec, introduces the milestone Word2Vec technique and its two neural network architectural variations: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). It illustrates the concept and mechanics of word embedding in the vector space. It guides you through building a Word2Vec model and preparing it as part of an information retrieval tool, and it visualizes the word vectors of a Word2Vec model with t-SNE and TensorBoard (by TensorFlow). The chapter ends by comparing Word2Vec with Doc2Vec, GloVe, and FastText.
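A minimal Word2Vec sketch with toy sentences (sg=1 selects Skip-Gram; sg=0 selects CBOW):
from gensim.models import Word2Vec

sentences = [["king", "queen", "palace"],
             ["cat", "dog", "pet"]]                        # toy corpus
model = Word2Vec(sentences, window=5, min_count=1, sg=1)   # sg=1: Skip-Gram
print(model.wv.most_similar("king", topn=2))               # nearest words in vector space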
Chapter 8, Doc2Vec with Gensim, presents the evolution from Word2Vec to Doc2Vec. It details the two neural network architectural variations: Paragraph Vector with Distributed Bag-of-Words (PV-DBOW) and Paragraph Vector with Distributed Memory (PV-DM). It guides you through building a Doc2Vec model and preparing it as part of an information retrieval tool.
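A minimal Doc2Vec sketch with toy documents (dm=0 selects PV-DBOW; dm=1 selects PV-DM):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["cats", "chase", "mice"], tags=[0]),
        TaggedDocument(words=["dogs", "chase", "cats"], tags=[1])]
model = Doc2Vec(docs, dm=0, min_count=1, epochs=20)    # dm=0: PV-DBOW
vector = model.infer_vector(["dogs", "and", "cats"])   # embed an unseen document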
Chapter 9, Understanding Discrete Distributions, introduces the discrete distribution family, including the Bernoulli, binomial, multinomial, beta, and Dirichlet distributions. Because the complex distributions generalize the simple ones, this sequence helps you understand the Dirichlet distribution. The fact that ‘Dirichlet’ appears in the name of LDA tells us its significance; this chapter prepares you to understand LDA in the next chapter.
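To make the Dirichlet distribution concrete: a single draw from it is itself a probability vector, which is why LDA uses it as a prior over topic mixtures. A minimal NumPy sketch:
import numpy as np

sample = np.random.dirichlet(alpha=[0.1] * 5)  # 5 non-negative numbers summing to 1
print(sample, sample.sum())                    # small alpha concentrates mass on few components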
Chapter 10, Latent Dirichlet Allocation, presents the LDA algorithm, including the structural design of LDA, generative modeling, and Variational Expectation-Maximization.
Chapter 11, LDA Modeling, demonstrates how to build an LDA model, perform hyperparameter tuning, and determine the optimal number of topics. You will learn the steps to apply an LDA model to score new documents as part of an information retrieval tool.
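A minimal LDA sketch, again assuming the dictionary and BoW corpus bow from the earlier sketches:
from gensim.models import LdaModel

lda = LdaModel(bow, id2word=dictionary, num_topics=10,
               passes=10, random_state=42)            # fixed seed for reproducibility
print(lda.print_topics(num_topics=3, num_words=5))    # top words per topic
new_bow = dictionary.doc2bow(["graph", "trees"])      # score an unseen document
print(lda[new_bow])                                   # its topic distribution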
Chapter 12, LDA Visualization, presents visualization for LDA. The chapter starts with a design-thinking approach to the rich content of a topic model, then shows how to use pyLDAvis for visualization.
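A minimal sketch, assuming the LDA model lda, corpus bow, and dictionary from the previous sketch (the Gensim helper module is named pyLDAvis.gensim in older pyLDAvis releases and pyLDAvis.gensim_models from version 3.x onward):
import pyLDAvis
import pyLDAvis.gensim_models   # pyLDAvis.gensim in older releases

panel = pyLDAvis.gensim_models.prepare(lda, bow, dictionary)  # interactive topic map
pyLDAvis.save_html(panel, "lda_visualization.html")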
Chapter 13, The Ensemble LDA for Model Stability, investigates the root causes of LDA’s instability. It explains the ensemble approach to LDA and the use of Checkback DBSCAN, a clustering algorithm, to deliver a stable set of topics.
Chapter 14, LDA and BERTopic, presents the BERTopic modeling technique, which uses the LLM-based BERT algorithm for word embeddings, UMAP for dimensionality reduction of the embeddings, HDBSCAN for topic clustering, c-TF-IDF for representing topics with words, and MMR to fine-tune the word representations of topics. It guides you through BERTopic modeling, visualization, and scoring new documents for topics.
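A minimal BERTopic sketch (docs is a placeholder list of raw document strings; BERTopic needs a reasonably large corpus to cluster, and it downloads a sentence-transformer embedding model on first use):
from bertopic import BERTopic

# docs: a list of raw strings, e.g. news articles
topic_model = BERTopic()                   # UMAP + HDBSCAN + c-TF-IDF under the hood
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())        # one row per discovered topic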
Chapter 15, Real-World Use Cases, presents seven NLP projects in the healthcare, medical, legal, finance, and social media domains. By studying these NLP solutions, you will be ready to apply the code notebooks of this book to similar jobs or to your future applications.
Download the example code files
The Python notebooks are available for download at https://github.com/PacktPublishing/The-Handbook-of-NLP-with-Gensim. If there’s an update to the code, it will be updated in the GitHub repository. You are encouraged to use Google Colab, a free Jupyter Notebook environment that runs entirely in the cloud and comes with popular machine learning libraries such as pandas, NumPy, TensorFlow, Keras, and OpenCV pre-installed.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Data for this book
The AG’s corpus of news articles, made public by A. Gulli, is a collection of more than 1 million news articles from more than 2,000 news sources. Zhang, Zhao, and LeCun sampled news articles from the “World”, “Sports”, “Business”, and “Science” categories. The resulting dataset, ag_news, is frequently used and is available on Kaggle, PyTorch, Hugging Face, and TensorFlow. There are 120,000 news articles in the training sample and 7,600 in the testing sample. This dataset is used throughout the book.
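For instance, here is a minimal sketch of loading it through the Hugging Face datasets library, one of the sources listed above:
from datasets import load_dataset

ag_news = load_dataset("ag_news")   # 120,000 training and 7,600 testing articles
print(ag_news["train"][0])          # a dict with 'text' and a 'label' from 0 to 3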
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”
A block of code is set as follows:
from gensim.summarization import keywords
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold.
Any command-line input or output is written as follows:
pip install gensim==3.8.3
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Without natural language processing (NLP) tools, the marketing team can only do basic operations with these text messages and data.”
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.