Clean data: A prerequisite for data science
A quick read of my previous article "Machine Learning in one week" should give you a good idea of the Machine Learning (ML) pipeline and of the courses and tools that can help you get hands-on with ML.
Last week, I picked up the well-known text/sentiment analysis problem ... but this time using a bit of R and Python with tweets hashtagged #makeinindia. However, I very soon realized that a fair amount of the usual groundwork needs to be done to get the data into a "good" shape on which one can do any meaningful analysis.
Why this article? While artificial intelligence will change the way developers write code, I wanted to affirm that software architects still have a role in determining the technology pieces that chain together a performant, secure and easy-to-use ML pipeline before a data scientist can do any analysis. It also reassures folks like me that we can still contribute positively to the much-hyped data science revolution (also termed the 4th industrial revolution) with the knowledge gained over the past years as software developers/architects and the knowledge we will gain as data scientists in the future. It is commonly said that 80% of the time is spent cleaning and manipulating data, and only 20% actually analyzing it.
Getting tweets: Log in to Twitter and go to https://apps.twitter.com/ where you need to [1] create your app, [2] get access tokens, and [3] get API keys.
Write a Python script (to see the code, navigate to the file SearchTweets.ipynb) that calls the Twitter search API using the tokens and API keys and fetches the tweets.
As the first data cleansing act, retweets are filtered out while the query itself is made. This is done to avoid noise in the data; in other words, it prevents a single tweet from dominating our analysis just because it has been retweeted many times.
The object returned by the API call is named query; it is a dictionary object of type twitter.api.TwitterDictResponse.
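As a rough sketch (the actual code lives in SearchTweets.ipynb; the hashtag, count, and variable names here are my assumptions), the search call using the Python `twitter` package could look like this:

```python
from twitter import Twitter, OAuth

# Placeholder credentials from https://apps.twitter.com/ (replace with your own)
CONSUMER_KEY = 'xxx'
CONSUMER_SECRET = 'xxx'
ACCESS_TOKEN = 'xxx'
ACCESS_TOKEN_SECRET = 'xxx'

api = Twitter(auth=OAuth(ACCESS_TOKEN, ACCESS_TOKEN_SECRET,
                         CONSUMER_KEY, CONSUMER_SECRET))

# '-filter:retweets' drops re-tweets at query time, the first cleansing step
query = api.search.tweets(q='#makeinindia -filter:retweets', count=100)

# query is a twitter.api.TwitterDictResponse; the individual tweets sit under 'statuses'
print(len(query['statuses']))
```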
Selecting columns: Since the dictionary returns a host of information that is not of interest for our analysis, one needs to walk through the dictionary object in Python, extract only the columns that are necessary for the analysis (text and user), build numpy 1-D arrays from them, and then combine those into a numpy 2-D array. A sketch of that step follows.
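This is a minimal sketch, not the notebook's exact code; it assumes the query object from the search call above and the field names of the Twitter v1.1 search response:

```python
import numpy as np

# Each status in the response carries the tweet text and the tweeting user
texts = np.array([status['text'] for status in query['statuses']])
users = np.array([status['user']['screen_name'] for status in query['statuses']])

# Stack the two 1-D arrays into a single 2-D array: one row per tweet
tweets = np.column_stack((users, texts))
```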
Removing duplicates: The numpy matrix is then transformed into a pandas dataframe, and a de-duplication call removes duplicate tweets.
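For example (the column names here are my own choice, not necessarily those in the notebook):

```python
import pandas as pd

df = pd.DataFrame(tweets, columns=['user', 'text'])
# Keep only one copy of tweets whose text is identical
df = df.drop_duplicates(subset='text')
```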
Writing to a file: The pandas dataframe provides a method to write to a CSV file. The sketch below shows how the de-duplicated tweets in the dataframe can be written out.
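A sketch of that write, assuming the dataframe df from the previous step and a hypothetical file name:

```python
# By default pandas also writes the row index and the column header row,
# which is why the extra cleanup steps in Azure ML Studio below are needed
df.to_csv('makeinindia_tweets.csv')
```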
Importing tweets: Once the CSV file is prepared, move on to Azure ML Studio (https://studio.azureml.net), where you have to do a few things:
(1) select only the relevant columns, since the CSV file now also contains the row index and column header of the pandas dataframe
(2) clean the data by removing rows where column data is missing
(3) remove the row that contains the pandas dataframe column header
(4) rename the columns with meaningful names
Only after all those steps is the data ready for analysis.
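If you prefer to do this cleanup in pandas before uploading, rather than with the Studio modules, a rough equivalent might be (file and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv('makeinindia_tweets.csv')
df = df[['user', 'text']]                       # (1) keep only the relevant columns
df = df.dropna(subset=['user', 'text'])         # (2) drop rows with missing values
df = df.rename(columns={'user': 'TweetedBy',    # (4) give the columns meaningful names
                        'text': 'TweetText'})
df.to_csv('makeinindia_tweets_clean.csv', index=False)  # no extra index column this time
```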
I could then (1) chart the top users tweeting, (2) remove white space, (3) transform words to lower case, (4) stem words, (5) remove stop words, and (6) create a word cloud, all using R scripts. The same can be achieved with the "Preprocess Text" module in Azure ML Studio.
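For readers who would rather stay in Python, a roughly equivalent preprocessing sketch using NLTK and the wordcloud package (my substitutes for the article's R scripts, not the original code, and assuming the renamed dataframe from the previous sketch) could be:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud

nltk.download('stopwords')                 # one-time download of the stop-word list
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # lower-case, split on whitespace, drop stop words, stem the rest
    tokens = text.lower().split()
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Top users tweeting, then a word cloud of the cleaned tweet text
print(df['TweetedBy'].value_counts().head(10))
corpus = ' '.join(preprocess(t) for t in df['TweetText'])
WordCloud(width=800, height=400).generate(corpus).to_file('makeinindia_wordcloud.png')
```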
The last step, sentiment analysis, is the artificial intelligence piece that adds the icing to the cake.
Which goes to show that the icing on the cake is of no use if the cake (in this case, the data) is not clean and in a form that is consumable by a data scientist. Clean data is the foundation of data science.
Hence, picking up skills to clean data is an important aspect of machine learning.
Happy learning!