Course: Complete Guide to NLP with R
Working with document metadata
- [Instructor] The tm package for R provides a lot of different metadata options. Metadata is information about the data, the content of the corpus that you're working with. Metadata can include things like authors or the date of publication, or it can contain anything that you think is useful. tm provides three different types of metadata, and it can be a bit confusing as to what appears when and why you might want to use one or the other type of metadata. Let's spend a couple of sessions talking about metadata, why it's important, and how to use it effectively.
Now, to demonstrate metadata, of course, I'll need to bring in the tm library, and I can do that in line one. Before you go any further, it's important to check your working directory. The working directory needs to point to a directory called poetry, and if you look in the lower right-hand corner you'll see what that directory looks like. It's in the exercise files. I'm going to use RStudio's Set As Working Directory to set this as the current working directory. Now, I can run line six to bring in a new VCorpus, which I'll use for demonstration purposes.
Once I've done that, let's take a quick look at the different types of metadata, which I've outlined in lines 10, 11, and 12. I'll clear the console and then I'll select line 10 and run that line. This is providing me with something called corpus-level metadata. And you can see that right now, it doesn't include very much. In line 11, I ask for the local metadata. The difference between these two is that with the corpus metadata, the metadata is stored as part of the corpus. So it could be things like who created this corpus or when they created it. With local metadata, information is stored at the document level, and let's take a quick look at that. In this case, you can see that I have one set of local metadata for each document. In this case, we're looking at poetry_9622.txt, which is one of the poetry files in the poetry directory. I don't have an author.
I do have a date stamp, which is the time that I actually imported it, and I have an ID, which is poetry_9622.txt. And we'll take a look at how to change that and update that in a second.
The third line, line 12, pulls up something called indexed metadata. I'm going to clear the console and run line 12, and you'll see that I have a data frame with zero columns and 26 rows. The indexed metadata is actually stored as a data frame. You can add metadata to a corpus by simply inserting information into this data frame. And again, we'll talk a bit more about that particular type of metadata. One note: SimpleCorpus, which is one of the types of corpus provided by the tm package, does not have local metadata, only indexed and corpus.
Now let's move on. In lines 17 and 18, we're going to create some corpus metadata. Look at line 17. It's meta, and then it declares what corpus we're working with. In this case, new VCorpus. The tag for the metadata is going to be mnrMeta and the type is going to be corpus. Into it, I'm putting, well, a random string called My Stuff. We'll run that line. And then let's take a look at the corpus metadata. Now, you'll remember the bottom two lines, attr(,"class") and "CorpusMeta", but the top two lines, $mnrMeta, are set to My Stuff. Well, that's obviously new, and you can see exactly how that was placed into the corpus by line 17.
In line 22, we can create some local metadata, and you'll see that it says meta, that's the command, and we're going to affect new VCorpus. The tag is due date. The type is local. What we're going to put into it is a sequence of dates. I'm using seq.Date starting at 2023-01-13. I'm going to set these due dates every week, and the number of dates that I'm creating is equal to the length of this new VCorpus. Let's run line 22. And before we check all of this data, let's take a look at something else. In line 26, I'm using the meta command again, and you'll notice that I'm using bracket bracket one.
Bracket bracket, of course, in R refers to the content of an indexed object. So I'm placing metadata into the contents of the first object of new VCorpus. The tag is going to be comment and the value is going to be great writing. Let's run line 26. Then we will run line 27, which does pretty much the same thing but in the second document, then I'll run line 28, which does the same thing but in the third document.
I'm going to clear the console. And now let's take a look at line 30, which will show us all of this metadata we've added. And it shows us this because we're not assigning anything. So I've run line 30. Let's scroll up to the top first of all. Poetry_1020.txt is the first document, and you'll note that at the bottom it says, Comment, "Great writing." And that comes from line 26 of the code. Likewise, in poetry_12031.txt, the comment is "another story." In the third, you can see the due date has been added and changed, but it also has a description that says, "A pirate story." This comes from line 28. Now getting back to these due dates, you'll notice that these are all new and were of course added by line 22 above. These dates are all in a sequence, one week apart. So that's local metadata. And again, you can see that I asked for that when I ran line 30, where the type was local.
The third type of metadata we've been talking about is indexed metadata. In line 34, I've said meta and the tag is random letter. The type is indexed, and into this particular tag, I'm going to place random letters. I'm going to place the same number as the length of the new VCorpus. Let's run line 34 and then use line 36 to examine the indexed metadata. Well, no surprise, what you'll see here is 26 values. The column is labeled random letter. And this actually is a data frame that's stored within new VCorpus.
So those are the three types of metadata we've been talking about: indexed, local, and corpus. Let's take a look at how to actually use this metadata in a real-world application.
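The exercise file itself isn't reproduced in this transcript, so here is a minimal sketch of the three metadata types described above, not the instructor's actual script. It assumes the tm package is installed. The three short texts stand in for the poetry directory so the sketch runs on its own, and the spellings newVCorpus, dueDate, and randomLetter are guesses at names that are only spoken in the video.

```r
library(tm)

# Stand-in for the course's poetry files (assumption): three short texts,
# so this runs without the exercise directory.
texts <- c("The sea was calm tonight.",
           "Another story about the moon.",
           "A pirate sailed at dawn.")
newVCorpus <- VCorpus(VectorSource(texts))

# Corpus-level metadata: one value stored with the corpus itself.
meta(newVCorpus, tag = "mnrMeta", type = "corpus") <- "My Stuff"

# Local metadata: one value per document, here due dates one week apart.
# The tag name dueDate is an assumed spelling.
meta(newVCorpus, tag = "dueDate", type = "local") <-
  seq.Date(from = as.Date("2023-01-13"), by = "week",
           length.out = length(newVCorpus))

# Local metadata for a single document, reached through [[ ]].
meta(newVCorpus[[1]], tag = "comment") <- "great writing"

# Indexed metadata: a column in the corpus's data frame, one row per document.
# The tag name randomLetter is an assumed spelling.
meta(newVCorpus, tag = "randomLetter", type = "indexed") <-
  sample(letters, size = length(newVCorpus), replace = TRUE)

# With no assignment, meta() just reports what is stored.
meta(newVCorpus, type = "corpus")
meta(newVCorpus, type = "local")
meta(newVCorpus, type = "indexed")
```

With the real exercise files, the corpus would presumably be built from the poetry directory instead, for example with VCorpus(DirSource(".")) once the working directory is set, and the indexed data frame would then have 26 rows rather than 3.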