Course: Deep Learning: Getting Started

Spam classification problem

- [Instructor] How do we do deep learning with unstructured text data? We will learn through an example in this chapter. First, let's take a look at the problem we are going to solve. Spam classification is a popular use case for filtering emails and chat messages. In this use case, we have an SMS data set that contains two variables. The feature variable is the SMS message, which is unstructured text of one or two lines. The target variable is the message type, classified as either ham or spam. The goal for this example is to build an ANN (artificial neural network) that can classify text messages as ham or spam. This is a simple use case, but it demonstrates the key steps needed to classify unstructured data with a deep learning model. The key difference between classifying structured and unstructured data with Keras is the pre-processing needed to prepare the data. For text, we need to clean it, remove stop words, and lemmatize the remaining words before converting them into numeric representations. We will use TF-IDF in this example; word embeddings could also be used for this purpose. The code for this example is available in the notebook code_05 XX Spam classification example. Before we begin, we need to install the Python libraries that the text pre-processing depends on. Let's run the code in section 5.1. Installation may take some time, depending on what is already installed on your system.
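To make the pre-processing idea concrete, here is a minimal pure-Python sketch of cleaning text, removing stop words, and computing TF-IDF vectors. It is not the notebook's code: the toy messages, the tiny stop-word list, and the helper names `preprocess` and `tf_idf` are assumptions made for illustration. Lemmatization is left out because it needs a dictionary-backed library, and the weighting is the plain textbook form (term frequency times log of inverse document frequency), which may differ from what a given library computes.

```python
import math
import re
from collections import Counter

# Hypothetical toy messages; the real SMS data set is not reproduced here.
messages = [
    "Free prize! Call now to claim your free prize",
    "Are we meeting for lunch today?",
    "Call me when you are free",
]

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "are", "for", "me", "the", "to", "we", "when", "you", "your"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


vocab, vectors = tf_idf(messages)
```

Each vector has one entry per vocabulary term, so every message becomes a fixed-length list of numbers, which is the kind of input a dense network expects. Terms that appear in many messages get lower weights than terms concentrated in a few.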
