Course: Complete Guide to NLP with R

Remove lines from a corpus

Quanteda provides a way to remove unnecessary documents from a corpus. It's called corpus_trim. Let's take a look at how to use it. In line three, I bring in the quanteda library, then line six retrieves a sampleCorpus for us to experiment with. The summary of sampleCorpus shows us that it's a fairly simple corpus, but it's got types, tokens, and sentences, plus some document variables called someInfo. Now, let's suppose that we don't want short sentences in our corpus. I've set this up in line 11, where it says corpus_trim. We're trimming the sampleCorpus, and what we're trimming by is sentences, with a minimum token count of six. So it will drop sentences shorter than six tokens and keep the rest. Let's run that. What we see is that we now have only two documents within our sampleCorpus. That's because the remaining lines are each at least six tokens long. Likewise, we can remove sentences that match a pattern. In line 17…
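The script itself isn't shown in the transcript, so here is a minimal sketch of the trimming step described above. The texts in sampleCorpus are a hypothetical stand-in for the course's sample data; corpus_trim, with its what and min_ntoken arguments, is quanteda's own function.

```r
library(quanteda)

# Hypothetical stand-in for the course's sampleCorpus (the real one is not shown).
sampleCorpus <- corpus(c(
  doc1 = "Short one. This sentence clearly has more than six tokens in it.",
  doc2 = "Tiny.",
  doc3 = "Here is another reasonably long sentence for the demonstration."
))

summary(sampleCorpus)

# Keep only sentences with at least six tokens; shorter sentences are dropped.
trimmed <- corpus_trim(sampleCorpus, what = "sentences", min_ntoken = 6)

summary(trimmed)
ndoc(trimmed)
```

Comparing the two summaries should show which sentences, and which documents, were dropped by the trim.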