The Data Science Process, Rediscovered

Gregory Piatetsky-Shapiro

Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.

Published: March 14, 2016

Here is a popular recent post by KDnuggets Editor Matthew Mayo.

Last week, the top KDnuggets tweet was a Quora answer to the question "What is the work flow or process of a data scientist?" The answer, written by Ryan Fox Squire, a self-described "Neuroscientist Turned Data Scientist," used The Data Science Process to describe such a workflow.

The Data Science Process is a framework for approaching data science tasks, crafted by Joe Blitzstein and Hanspeter Pfister of Harvard's CS 109. The goal of CS 109, as per Blitzstein himself, is to introduce students to the overall process of data science investigation, a goal which should provide some insight into the framework itself.

The following is a sample application of Blitzstein & Pfister's framework, listing the skills and tools used at each stage, as given by Ryan Fox Squire in his answer:

Stage 1: Ask A Question

Skills: science, domain expertise, curiosity
Tools: your brain, talking to experts, experience

Stage 2: Get the Data

Skills: web scraping, data cleaning, querying databases, CS stuff
Tools: python, pandas

Stage 3: Explore the Data

Skills: Get to know data, develop hypotheses, patterns? anomalies?
Tools: matplotlib, numpy, scipy, pandas, mrjob

Stage 4: Model the Data

Skills: regression, machine learning, validation, big data
Tools: scikit-learn, pandas, mrjob, mapreduce

Stage 5: Communicate the Data

Skills: presentation, speaking, visuals, writing
Tools: matplotlib, adobe illustrator, powerpoint/keynote

Squire then (rightfully) concludes that the data science workflow is a non-linear, iterative process, and that many skills and tools are required to cover the full data science process. Squire also says he is fond of the Data Science Process because it stresses both the importance of asking questions to guide your workflow and the importance of iterating on your questions and research as you gain familiarity with your data.
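To make the five stages concrete, here is a minimal sketch of one pass through them. It is an illustration only, not code from Squire's answer or from CS 109: the question, the records, and every function name are made up for the example, and it uses only the Python standard library as a stand-in for the pandas and scikit-learn tools listed above.

```python
from statistics import mean

# Stage 1: Ask a question (hypothetical question for illustration).
QUESTION = "Do more study hours go with higher exam scores?"

# Stage 2: Get the data. These records are invented; in practice they
# would come from web scraping or a database query.
RAW_ROWS = [
    {"hours": "1", "score": "52"},
    {"hours": "2", "score": "55"},
    {"hours": "3", "score": ""},  # incomplete record, dropped during cleaning
    {"hours": "4", "score": "61"},
    {"hours": "5", "score": "64"},
]


def get_data(rows):
    """Clean the raw rows: drop incomplete ones, convert strings to floats."""
    return [
        (float(r["hours"]), float(r["score"]))
        for r in rows
        if r["hours"] and r["score"]
    ]


def explore(data):
    """Stage 3: Explore the data with a few summary statistics."""
    xs, ys = zip(*data)
    return {"n": len(data), "mean_hours": mean(xs), "mean_score": mean(ys)}


def fit_line(data):
    """Stage 4: Model the data with one-variable least squares.

    Returns (slope, intercept).
    """
    xs, ys = zip(*data)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def communicate(question, summary, slope, intercept):
    """Stage 5: Communicate the result as one plain sentence."""
    return (
        f"{question} In {summary['n']} usable records, each extra hour is "
        f"associated with about {slope:.1f} more points "
        f"(baseline {intercept:.1f})."
    )


data = get_data(RAW_ROWS)
summary = explore(data)
slope, intercept = fit_line(data)
report = communicate(QUESTION, summary, slope, intercept)
print(report)
```

In a real project the pass would rarely be this linear: what turns up during exploration or modeling usually sends you back to refine the question or fetch more data, which is the iteration Squire emphasizes.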

The Data Science Process is an innovative framework for approaching data science problems. Isn't it?

Next, we look at CRISP-DM.

As a comparison to the Data Science Process put forth by Blitzstein & Pfister and elaborated upon by Squire, we take a quick look at the Cross Industry Standard Process for Data Mining (CRISP-DM), the de facto official, yet unquestionably falling out of fashion, data mining framework, which has since been extended to data science problems. Though the standard is no longer actively maintained, it remains a popular framework for navigating data science projects.

Read the rest of the post on KDnuggets:

The Data Science Process, Rediscovered

https://www.kdnuggets.com/2016/03/data-science-process-rediscovered.html

Kenneth Bassett

2024 Goal - Sustainability

8 years ago

I like that it is iterative, as any scientific process should be. As an engineer it bugs me that it doesn't have more action, as in solving a problem, but I suppose you would argue that in proving or disproving hypotheses, you are fulfilling your duties as a scientist.


