Course: Complete Guide to NLP with R

Understanding corpora and sources - R Tutorial


- [Instructor] Natural language processing revolves around several data objects: corpora, documents, tokens, and DTMs, or document-term matrices. A corpus is a collection of documents, and those documents come from various sources. The tm package provides for the import of many sources, so let's take a minute to examine how sources are brought into corpora.

A corpus is an R object, much like a data frame or a list. It contains documents in a consistent structure that simplifies manipulating and performing research on the text. Think of a corpus like an egg carton: the eggs are documents and the egg carton is the corpus. When eggs are placed in a carton, it is easy to confirm their shape, size, and condition. When they're loose on the countertop, it is more difficult to assess each egg individually. Using a corpus to contain text works the same way: when documents are contained in a corpus, it is easier to confirm each document's size, content, and format.

The tm package includes different corpus types, and more can be added with plugin packages. Here is a table showing their basic capabilities and how they are installed into tm. I've included this table with the example files.

Natural language processing can import several types of documents with special functions called sources. Different sources are used for different types of documents, and tm provides the getSources command to list the available sources. Here's a table briefly describing these document sources and the corpora able to use them. This table is also included in the example files for this chapter. Each of these sources is used to handle a different type of document. In the next lessons, we will dive deeper into each of these sources and corpora.
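As a minimal sketch of the workflow described above, the snippet below lists the available sources with getSources and then builds a small in-memory corpus from a character vector using VectorSource and VCorpus. The two sample sentences are made up for illustration; the exact list that getSources returns depends on your tm installation.

```r
# Assumes the tm package is installed: install.packages("tm")
library(tm)

# List the document sources available in this tm installation
getSources()

# Build a small corpus from a character vector,
# treating each element as one document
docs <- c("The first document.",
          "A second, longer document about eggs.")
corpus <- VCorpus(VectorSource(docs))

# Inspect the corpus: a summary of the collection,
# then the full text of the first document
print(corpus)
inspect(corpus[[1]])
```

The same VCorpus call accepts other sources in place of VectorSource, such as DirSource for a folder of text files, which is why the lessons that follow look at each source in turn.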
