登录查看更多内容

A Beginner's Guide to the Data Science Pipeline

Randy Lao ??

Manager, Affiliate Marketing @ Coursera | Free resources at my site ClaoudML.com

发布日期: 2017年11月6日

"Believe it or not, you are no different than Data. Put yourself into Data's shoes and you'll see why."

Once upon a time there was a boy named Data. Throughout his life, he was always trying to understand what his purpose was. What values do I have? What impact can I make on this world? Where does Data come from? See any similarities between you and Data? These questions were always in his mind and fortunately, through sheer luck, Data finally came across a solution and went through a great transformation.

It all started as Data was walking down the rows when he came across a weird, yet interesting, pipe. On one end was a pipe with an entrance and at the other end an exit. The pipe was also labeled with five distinct letters: "O.S.E.M.N.". Curious as he was, Data decided to enter the pipeline. Long story short... in came data and out came insight.

Reminder: This article will cover briefly a high-level overview of what to expect in a typical data science pipeline. From framing your business problem to creating actionable insights. Don't worry this will be an easy read!

Data Science is OSEMN

You're awesome. I'm awesome. Data Science is OSEMN. Why is data science "awesome" you may ask? Well, as the aspiring data scientist you are, you're given the opportunity to hone your powers of both a wizard and a detective. By wizard, I mean having the powers to predict things automagically! And by detective, it's having the ability to find unknown patterns and trends in your data!

Understanding the typical work flow on how the data science pipeline works is a crucial step towards business understanding and problem solving. If you are intimidated about how the data science pipeline works, say no more. This article is for you! I found a very simple acronym from Hilary Mason and Chris Wiggins that you can use throughout your data science pipeline. That is O.S.E.M.N.

OSEMN Pipeline

O - Obtaining our data
S - Scrubbing / Cleaning our data
E - Exploring / Visualizing our data will allow us to find patterns and trends
M - Modeling our data will give us our predictive power as a wizard
N - Interpreting our data

Business Question

So before we even begin the OSEMN pipeline, the most crucial and important step that we must take into consideration is understanding what problem we're trying to solve. Let's say this again. Before we even begin doing anything with "Data Science", we must first take into consideration what problem we're trying to solve. If you have a small problem you want to solve, then at most you'll get a small solution. If you have a BIG problem to solve, then you'll have the possibility of a BIG solution.

Ask yourself:

How can we translate data into dollars?
What impact do I want to make with this data?
What business value does our model bring to the table?
What will save us lots of money?
What can be done to make our business run more efficiently?

"Show me the money!"

Knowing this fundamental concept will bring you far and lead you to greater steps in being successful towards being a "Data Scientist" (from what I believe... sorry I'm not one!) But nonetheless, this is still a very important step you must do! No matter how well your model predicts, no matter how much data you acquire, and no matter how OSEMN your pipeline is... your solution or actionable insight will only be as good as the problem you set for yourself.

"Good data science is more about the questions you pose of the data rather than data munging and analysis" — Riley Newman

Obtain Your Data

You cannot do anything as a data scientist without even having any data. As a rule of thumb, there are some things you must take into consideration when obtaining your data. You must identify all of your available datasets (which can be from the internet or external/internal databases). You must extract the data into a usable format (.csv, json, xml, etc..)

Skills Required:

Database Management: MySQL, PostgresSQL,MongoDB
Querying Relational Databases
Retrieving Unstructured Data: text, videos, audio files, documents
Distributed Storage: Hadoops, Apache Spark/Flink

Scrubbing / Cleaning Your Data

Clean up on column 5! This phase of the pipeline should require the most time and effort. Because the results and output of your machine learning model is only as good as what you put into it. Basically, garbage in garbage out.

Objective:

Examine the data: understand every feature you're working with, identify errors, missing values, and corrupt records
Clean the data: throw away, replace, and/or fill missing values/errors

Skills Required:

Scripting language: Python, R, SAS
Data Wrangling Tools: Python Pandas, R
Distributed Processing: Hadoop, Map Reduce / Spark

"The man who is prepared has his battle half fought" — Miguel de Cervantes

Exploring (Exploratory Data Analysis)

Now during the exploration phase, we try to understand what patterns and values our data has. We'll be using different types of visualizations and statistical testings to back up our findings. This is where we will be able to derive hidden meanings behind our data through various graphs and analysis. Go out and explore!

“A ship in harbor is safe — but that is not what ships are built for.” — John A. Shedd.

Objective:

Find patterns in your data through visualizations and charts
Extract features by using statistics to identify and test significant variables

Skills Required:

Python: Numpy, Matplotlib, Pandas, Scipy
R: GGplot2, Dplyr
Inferential statistics
Experimental Design
Data Visualization

Tip: Have your "spidey senses" tingling when doing analysis. Have the sense to spot weird patterns or trends. Always be on the lookout for an interesting findings!

Design consideration: Most of the time people just go straight to the visual “lets get it done”. It’s all about the end user who will be interpreting it. Focus on your audience.

Modeling (Machine Learning)

Now comes the fun part. Models are general rules in a statistical sense.Think of a machine learning model as tools in your toolbox. You will have access to many algorithms and use them to accomplish different business goals. The better features you use the better your predictive power will be. After cleaning your data and finding what features are most important, using your model as a predictive tool will only enhance your business decision making.

Predictive Analytics is emerging as a game-changer. Instead of looking backward to analyze "what happened?" Predictive analytics help executives answer "What's next?" and "What should we do about it?” (Forbes Magazine, April 1, 2010)

Predictive Power Example: One great example can be seen in Walmart's supply chain. Walmart was able to predict that they would sell out all of their Strawberry Pop-tarts during the hurricane season in one of their store location. Through data mining, their historical data showed that the most popular item sold before the event of a hurricane was Pop-tarts. As crazy it sounds, this is a true story and brings up the point on not to underestimate the power of predictive analytics.

Objective:

In-depth Analytics: create predictive models/algorithms
Evaluate and refine the model

Skills Required:

Machine Learning: Supervised/Unsupervised algorithms
Evaluation methods
Machine Learning Libraries: Python (Sci-kit Learn) / R (CARET)
Linear algebra & Multivariate Calculus

"Models are opinions embedded in mathematics" — Cathy O'Neil

Interpreting (Data Storytelling)

It's story time! The most important step in the pipeline is to understand and learn how to explain your findings through communication. Telling the story is key, don't underestimate it. It's about connecting with people, persuading them, and helping them. The art of understanding your audience and connecting with them is one of the best part of data storytelling.

"I believe in the power of storytelling. Stories open our hearts to a new place, which opens our minds, which often leads to action" —Melinda Gates

Emotion plays a big role in data storytelling. People aren’t going to magically understand your findings. The best way to make an impact is telling your story through emotion. We as humans are naturally influenced by emotions. If you can tap into your audiences’ emotions, then you my friend, are in control. When you’re presenting your data, keep in mind the power of psychology. The art of understanding your audience and connecting with them is one of the best part of data storytelling.

Best Practice: A good practice that I would highly suggest to enhance your data storytelling is to rehearse it over and over. If you're a parent then good news for you. Instead of reading the typical Dr. Seuss books to your kids before bed, try putting them to sleep with your data analysis findings! Because if a kid understands your explanation, then so can anybody, especially your Boss!

“If you can't explain it to a six year old, you don't understand it yourself.” — Albert Einstein

Objective:

Identify business insights: return back to business problem
Visualize your findings accordingly: keep it simple and priority driven
Tell a clear and actionable story: effectively communicate to non-technical audience

Skills Required:

Business Domain Knowledge
Data Visualization Tools: Tablaeu, D3.JS, Matplotlib, GGplot, Seaborn
Communication: Presenting/Speaking & Reporting/Writing

Updating Your Model

Don't worry your story doesn't end here. As your model is in production, its important to update your model periodically, depending on how often you receive new data. The more data you receive the more frequent the update. Lets say you’re Amazon and you introduced a new feature for customers to buy “footwear feature”. You’re old model doesn’t have this and now you must update the model that includes this feature. If not, your model will degrade over time and won’t perform as good, leaving your business to degrade as well. The introduction to new features will alter the model performance either through different variations or possibly correlations to other features.

Conclusion

In Summary: form your business problem, obtain your data, clean your data, explore your data with visualizations, model your data with different machine learning algorithms, interpret your data by evaluation, and update your model.

Remember, we're no different than Data. We both have values, a purpose, and a reason to exist in this world.

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning god, most of the impact will come from great features, not great machine learning algorithms. So, the basic approach is:

Make sure your pipeline is solid end to end
Start with a reasonable objective
Understand your data intuitively
Make sure that your pipeline stays solid

This approach will hopefully make lots of money and/or make lots of people happy for a long period of time.

So... the next time someone asks you what is data science. Tell them:

"Data Science is OSEMN"

I hope you guys learned something today! If there is anything that you guys would like to add to this article, feel free to leave a message and don’t hesitate! Any sort of feedback is truly appreciated. Don’t be afraid to share this! Thanks!

[Click to continue reading...]

Sardor Abdirayimov

Machine Learning Engineer | Kaggle Expert | IoT Enthusiast | YouTuber | Martial Artist

2 年

Well written article ?? very informative

Zahir Sikkandar Kasim

Telecom Professional | 3G, LTE, 5G. Network Performance Management in Circuit Core, Packet Core, and Deep Packet Inspection (DPI)

2 年

Simple and easy to understand

Adil R.

Data Scientist | Data Soothsayer | Industry 4.0

2 年

Wonderful read.Kudos to you but will love to know the list of books which we should read in order to understand more of what you put up there.

Zhenyu Zhang

PhD, interested in knowledge graph and artificial intelligence.

3 年

Interesting to read, easy to understand!

1 次回应

Parin M.

MBA, Data Scientist, Mechanical Engineer

3 年

Thanks much for this post. I admire the conceptual clarity of this article very much and I believe every beginner should explore this before beginning advanced learnings in Data Science. Therefore, I added this article as a learning resource in my article here: https://www.dhirubhai.net/pulse/beginners-basic-learning-resources-begin-data-science-parin-mansuri

1 次回应

查看更多评论

要查看或添加评论，请登录

Randy Lao ??的更多文章

7 Machine Learning Algorithms Made Easy

2018年11月16日

7 Machine Learning Algorithms Made Easy

My mommy always said machine learning was like a box of chocolates. You never know what you're going to get.

83 条评论
Data Science Connect | Microsoft

2018年10月23日

Data Science Connect | Microsoft

Data Science Connect is a free webinar for everything data. Our main goal is to help as many aspiring data scientists…

10 条评论
Data Science Connect | LinkedIn

2018年10月1日

Data Science Connect | LinkedIn

Data Science Connect is a free webinar for everything data. Our main goal is to help as many aspiring data scientists…

5 条评论
A Beginner's Guide to Machine Learning

2018年7月1日

A Beginner's Guide to Machine Learning

Should I learn now… or later? Learning is a universal skill/trait that is acquired by any living organism on this…

63 条评论
Machine Learning | Accuracy Paradox

2017年10月24日

Machine Learning | Accuracy Paradox

Imagine a little boy sitting all day in your room. Let's call him Daniel.

37 条评论
Reasons Why Employees Quit | Human Resources Analytics

2017年9月21日

Reasons Why Employees Quit | Human Resources Analytics

"Yeah, they all said that to me..

8 条评论
Learning Data Science with Kaggle's House Prices: Advanced Regression Techniques

2017年7月7日

Learning Data Science with Kaggle's House Prices: Advanced Regression Techniques

Moe Data Moe Problems? This is my second attempt of learning data science using another one of Kaggle's data sets. This…

10 条评论
Learning Data Science with Kaggle's Titantic: Machine Learning from Disaster

2017年6月25日

Learning Data Science with Kaggle's Titantic: Machine Learning from Disaster

Table of Contents: Setup Environment / Extract Data / Loading Data Data Wrangling 2.1 Missing Data Imputation (1)…

See all articles

A Beginner's Guide to the Data Science Pipeline

Randy Lao ??

Manager, Affiliate Marketing @ Coursera | Free resources at my site ClaoudML.com

Data Science is OSEMN

OSEMN Pipeline

Business Question

Obtain Your Data

Scrubbing / Cleaning Your Data

Exploring (Exploratory Data Analysis)

Modeling (Machine Learning)

Interpreting (Data Storytelling)

Updating Your Model

Conclusion

Randy Lao ??的更多文章

社区洞察

其他会员也浏览了

Data Pioneers vs. Data Settlers (Startup Pt. 13)

Why Large Corporations Embrace Data Science for a Competitive Advantage

My Top Five 'Not so Typical' Data Science Reads of?2018

What Do You Do with Data?

Data Science: A Game-Changer for Small Business Owners

What is “The Art of Thinking Like a Data Scientist” Workbook and Why It Matters

Top Seven Focuses to Develop a Winning Data Science Strategy

Mastery of Data Scie-nce: A Practical Guide to Impleme-ntation

Demystifying Data Science