Practical Guide to Text Mining and Feature Engineering in R
The ability to work with text data is one of the important skills a data scientist must possess. With the advent of social media, forums, review sites, and web crawlers, companies now have access to massive amounts of behavioural data about their customers.
In fact, companies often hold more textual data than numerical data. This data is undoubtedly messy, but beneath the mess lies a rich source of information and insights that can help companies grow their businesses.
That is why natural language processing (NLP), a.k.a. text mining, is growing rapidly as a technique and is used extensively by data scientists.
In this tutorial, you'll learn about text mining from scratch. We'll follow a stepwise approach to understand text mining concepts. Later, we'll work on a data set from a current Kaggle competition to gain practical experience, followed by two practice exercises.
For this tutorial, the programming language used is R. However, the techniques explained below can be implemented in any programming language.
Make sure you've finished the regular expression tutorial before starting with text mining.
Table of Contents
- What are Regular Expressions? When do you use them?
- What is String Manipulation?
- List of String Manipulation Functions
- List of Regular Expression Commands
  - Metacharacters
  - Sequences
  - Quantifiers
  - Character Classes
  - POSIX Character Classes
- Practice Examples on Regular Expressions
Feel free to share your suggestions, experiences, or any new techniques you've used while dealing with string variables in a data set.