Clean data: A prerequisite for data science
A quick read of my previous article "Machine Learning in one week" should give you a good idea of the Machine Learning (ML) pipeline and of the courses and tools that can help you get hands-on with ML.
Last week, I picked up the well-known text/sentiment analysis problem ... but this time using a bit of R and Python with tweets hashtagged #makeinindia. However, I very soon realized that a fair amount of the usual groundwork needs to be done to get the data into a "good" shape on which one can do any meaningful analysis.
Why this article? While artificial intelligence will change the way developers write code, I wanted to affirm that software architects still have a role in determining the technology pieces that chain together a performant, secure and easy-to-use ML pipeline before a data scientist can do any analysis. It also reassures folks like me that we can still contribute positively to the much-hyped data science revolution (also termed the 4th industrial revolution) with the knowledge gained over the past years as software developers/architects and the knowledge we will gain as data scientists in the future. It is commonly said that 80% of the time is spent cleaning and manipulating data, and only 20% actually analyzing it.
Getting tweets: Log in to Twitter and go to https://apps.twitter.com/ where you need to [1] create your app, [2] get access tokens, and [3] get API keys.
Write a Python script (to see the code, navigate to the file SearchTweets.ipynb) that calls the Twitter search API using the tokens and API keys and fetches the tweets.
As the first data cleansing act, retweets are filtered out while the query itself is made. This is done to avoid noise in the data; in other words, it prevents a single tweet from dominating our analysis just because it has been retweeted many times.
The object returned by the API call is named query; it is a dictionary object of type twitter.api.TwitterDictResponse.
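As a rough sketch (the actual code lives in SearchTweets.ipynb; the hashtag, count, and variable names here are my assumptions), the search call using the Python `twitter` package could look like this:

```python
from twitter import Twitter, OAuth

# Placeholder credentials from https://apps.twitter.com/ (replace with your own)
CONSUMER_KEY = 'xxx'
CONSUMER_SECRET = 'xxx'
ACCESS_TOKEN = 'xxx'
ACCESS_TOKEN_SECRET = 'xxx'

api = Twitter(auth=OAuth(ACCESS_TOKEN, ACCESS_TOKEN_SECRET,
                         CONSUMER_KEY, CONSUMER_SECRET))

# '-filter:retweets' drops re-tweets at query time, the first cleansing step
query = api.search.tweets(q='#makeinindia -filter:retweets', count=100)

# query is a twitter.api.TwitterDictResponse; the individual tweets sit under 'statuses'
print(len(query['statuses']))
```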
Selecting columns: Since the dictionary returns a host of information that is not of interest for our analysis, one needs to walk through the dictionary object in Python, extract only the columns that are necessary for the analysis (text and user), build numpy 1-D arrays from them, and then combine those into a numpy 2-D array. A sketch of that step follows.
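This is a minimal sketch, not the notebook's exact code; it assumes the query object from the search call above and the field names of the Twitter v1.1 search response:

```python
import numpy as np

# Each status in the response carries the tweet text and the tweeting user
texts = np.array([status['text'] for status in query['statuses']])
users = np.array([status['user']['screen_name'] for status in query['statuses']])

# Stack the two 1-D arrays into a single 2-D array: one row per tweet
tweets = np.column_stack((users, texts))
```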
Removing duplicates: The numpy matrix is then transformed into a pandas dataframe, and a de-duplication call removes duplicate tweets.
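For example (the column names here are my own choice, not necessarily those in the notebook):

```python
import pandas as pd

df = pd.DataFrame(tweets, columns=['user', 'text'])
# Keep only one copy of tweets whose text is identical
df = df.drop_duplicates(subset='text')
```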
Writing to a file: The pandas dataframe provides a method to write to a CSV file. The sketch below shows how the de-duplicated tweets in the dataframe can be written out.
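A sketch of that write, assuming the dataframe df from the previous step and a hypothetical file name:

```python
# By default pandas also writes the row index and the column header row,
# which is why the extra cleanup steps in Azure ML Studio below are needed
df.to_csv('makeinindia_tweets.csv')
```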
Importing tweets: Once the CSV file is prepared, move on to Azure ML Studio (https://studio.azureml.net), where you have to do a few things:
(1) select only the relevant columns, since the CSV file now also contains the row index and column header of the pandas dataframe
(2) clean the data by removing rows where column data is missing
(3) remove the row that contains the pandas dataframe column header
(4) rename the columns with meaningful names
Only after all those steps is the data ready for analysis.
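If you prefer to do this cleanup in pandas before uploading, rather than with the Studio modules, a rough equivalent might be (file and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv('makeinindia_tweets.csv')
df = df[['user', 'text']]                       # (1) keep only the relevant columns
df = df.dropna(subset=['user', 'text'])         # (2) drop rows with missing values
df = df.rename(columns={'user': 'TweetedBy',    # (4) give the columns meaningful names
                        'text': 'TweetText'})
df.to_csv('makeinindia_tweets_clean.csv', index=False)  # no extra index column this time
```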
I could then (1) chart the top users tweeting, (2) remove white space, (3) transform words to lower case, (4) stem words, (5) remove stop words, and (6) create a word cloud, all using R scripts. The same can be achieved with the "Preprocess Text" module in Azure ML Studio.
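For readers who would rather stay in Python, a roughly equivalent preprocessing sketch using NLTK and the wordcloud package (my substitutes for the article's R scripts, not the original code, and assuming the renamed dataframe from the previous sketch) could be:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud

nltk.download('stopwords')                 # one-time download of the stop-word list
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # lower-case, split on whitespace, drop stop words, stem the rest
    tokens = text.lower().split()
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Top users tweeting, then a word cloud of the cleaned tweet text
print(df['TweetedBy'].value_counts().head(10))
corpus = ' '.join(preprocess(t) for t in df['TweetText'])
WordCloud(width=800, height=400).generate(corpus).to_file('makeinindia_wordcloud.png')
```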
The last step, sentiment analysis, is the artificial intelligence piece that adds the icing to the cake.
Which goes to show that the icing on the cake is of no use if the cake (in this case, the data) is not clean and in a form that is consumable by a data scientist. Clean data is the foundation of data science.
Hence, picking up skills to clean data is an important aspect of machine learning.
Happy learning!