The Data Science Process, Rediscovered

Here is a popular recent post by KDnuggets Editor Matthew Mayo.  

Last week, KDnuggets' top tweet was a Quora answer to the question "What is the work flow or process of a data scientist?" This answer, written by Ryan Fox Squire, a self-described "Neuroscientist Turned Data Scientist," used The Data Science Process to describe such a workflow.

The Data Science Process is a framework for approaching data science tasks, crafted by Joe Blitzstein and Hanspeter Pfister of Harvard's CS 109. The goal of CS 109, per Blitzstein himself, is to introduce students to the overall process of data science investigation, a goal that offers some insight into the framework itself.

The following is a sample application of Blitzstein & Pfister's framework, regarding skills and tools at each stage, as given by Ryan Fox Squire in his answer:

Stage 1: Ask A Question

  • Skills: science, domain expertise, curiosity
  • Tools: your brain, talking to experts, experience

Stage 2: Get the Data

  • Skills: web scraping, data cleaning, querying databases, CS stuff
  • Tools: python, pandas

Stage 3: Explore the Data

  • Skills: Get to know data, develop hypotheses, patterns? anomalies?
  • Tools: matplotlib, numpy, scipy, pandas, mrjob

Stage 4: Model the Data

  • Skills: regression, machine learning, validation, big data
  • Tools: scikit-learn, pandas, mrjob, mapreduce

Stage 5: Communicate the Data

  • Skills: presentation, speaking, visuals, writing
  • Tools: matplotlib, adobe illustrator, powerpoint/keynote
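As a rough illustration, the five stages above can be sketched as a tiny end-to-end script. This is only a toy under stated assumptions: the question, the CSV data, and every function name here are made up for illustration, and the sketch deliberately uses Python's standard library (plus a hand-written least-squares fit) in place of the pandas and scikit-learn tools named in the list, so that it runs without extra installs.

```python
import csv
import io
from statistics import mean

# Stage 1: Ask a question (hypothetical example question).
QUESTION = "Do hours studied predict exam score?"

# Stage 2: Get the data. This made-up CSV stands in for web scraping
# or a database query.
RAW_CSV = """hours,score
1,52
2,55
3,61
4,64
5,70
"""


def get_data(text):
    """Parse CSV text into a list of (hours, score) float pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["hours"]), float(row["score"])) for row in reader]


def explore(rows):
    """Stage 3: compute simple summaries to get to know the data."""
    hours, scores = zip(*rows)
    return {
        "n": len(rows),
        "mean_hours": mean(hours),
        "mean_score": mean(scores),
    }


def fit_line(rows):
    """Stage 4: ordinary least-squares fit, score = slope * hours + intercept."""
    hours, scores = zip(*rows)
    mean_h, mean_s = mean(hours), mean(scores)
    sxx = sum((h - mean_h) ** 2 for h in hours)
    sxy = sum((h - mean_h) * (s - mean_s) for h, s in rows)
    slope = sxy / sxx
    intercept = mean_s - slope * mean_h
    return slope, intercept


def communicate(question, summary, slope, intercept):
    """Stage 5: turn the numbers into a sentence a reader can act on."""
    return (
        f"{question} In {summary['n']} observations, each extra hour is "
        f"associated with about {slope:.1f} more points "
        f"(baseline {intercept:.1f})."
    )


rows = get_data(RAW_CSV)
summary = explore(rows)
slope, intercept = fit_line(rows)
print(communicate(QUESTION, summary, slope, intercept))
```

In a real project the stages would rarely run once in this order: what turns up in `explore` or `fit_line` typically sends you back to refine the question or fetch more data.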

Squire then (rightfully) concludes that the data science workflow is a non-linear, iterative process, and that many skills and tools are required to cover the full data science process. He is also fond of the Data Science Process because it stresses both the importance of asking questions to guide your workflow and the importance of iterating on your questions and research as you gain familiarity with your data.

The Data Science Process is an innovative framework for approaching data science problems. Isn't it?

Next, we look at CRISP-DM.

As a comparison to the Data Science Process put forth by Blitzstein & Pfister and elaborated upon by Squire, we take a quick look at the Cross Industry Standard Process for Data Mining (CRISP-DM), the de facto official (yet unquestionably falling out of fashion) data mining framework, which has since been extended to data science problems. Though the standard is no longer actively maintained, it remains a popular framework for navigating data science projects.

Read the rest of the post on KDnuggets: 

The Data Science Process, Rediscovered

https://www.kdnuggets.com/2016/03/data-science-process-rediscovered.html

Kenneth Bassett

2024 Goal - Sustainability

8 years ago

I like that it is iterative, as any scientific process should be. As an engineer it bugs me that it doesn't have more action, as in solving a problem, but I suppose you would argue that in proving or disproving hypotheses, you are fulfilling your duties as a scientist.
