INTRO TO DATA SCIENCE… A PROJECT-BASED APPROACH

[Cross-posted at DrChrisDrew.com]

I work at a pretty cool company where a lot of really talented folks are pretty accessible. Recently, while trying to figure out a dataset upload error in R that had me pounding my head against the desk, I meandered up to our Chief Data Scientist to lament my n00b-iness. He just so happened to be temporarily stuck himself, pounding his fists over an RStudio error that was giving him fits.

Witnessing that in the wild, from a bona fide data scientist no less, gave me a great sense of relief. It was true: even the pros get stuck on seemingly simple data science tasks!

During the Stanford SCI 01 course this past spring, our instructor, Mohammad, preached constantly about how frequently even he hit errors while working through problems with various datasets. (Though when you're sitting in class watching a data science pro code, debug, and re-code in real time as students hurl questions and challenges his way, that proclamation is tough to swallow.) The diligence of working through those errors in the context of a project is exactly where the learning happens. It's one of the many small stepping stones to becoming a data scientist.

Below I share the project my team and I built and presented as our course final. Before doing so, let me pause to say:

  1. Data Science is hard
  2. Data Science is still a relatively nascent field requiring a whole host of interdisciplinary skills
  3. I am not (yet) a data scientist

We spent 8 intense weeks diving deep into Correlation Analysis, Predictive Modeling, Multiple Linear Regression, Prediction Accuracy, Predictive Modeling Flow, Feature Selection, Classification, Distance Measures, Clustering, Web Scraping, Association Rule Mining, and much more! Even with such a rich syllabus and deep coverage, we still have a ways to go...

If you spend any time reading up on how to become a data scientist, you'll quickly find at least two things. One of them is: DO DATA SCIENCE PROJECTS.

And that is exactly what this class was all about. Each week we had lab work, and the culmination of the course was presenting our team project. I was lucky enough to team up with two wicked-smart women, Jamie Castro and Nay Mintin. In the slides below I've tried to focus primarily on my contributions to our project. I've included 2-3 slides from Jamie and Nay's work - those are labeled in the slide notes. Below I provide some additional context about the course and the project. But, first, here is an updated version of what our final presentation looked like:

Course Content:

For the final project, each team had to select a dataset to work with. We had been equipped with all of the tools necessary to do linear regression analyses and k-means clustering, along with practice and feedback on leveraging ggplot2 and plot.ly. Armed with this technology, our team selected a school shooting dataset.
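
To give a flavor of those two workhorses, here's a generic sketch run on R's built-in mtcars data - to be clear, this is illustrative, not our actual project code:

    # A generic sketch of the two workhorse techniques from class,
    # run on R's built-in mtcars data rather than our project dataset.
    library(ggplot2)

    # Multiple linear regression: model fuel efficiency from weight and horsepower
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    summary(fit)  # coefficients, R-squared, p-values

    # k-means clustering on the same two features (scaled so neither dominates)
    set.seed(42)  # k-means depends on random starting centers
    km <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 3)

    # Visualize the clusters with ggplot2
    ggplot(mtcars, aes(wt, hp, color = factor(km$cluster))) +
      geom_point(size = 3) +
      labs(color = "Cluster")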

Datasets:

One of the first things we quickly realized was: not all datasets are created equal. We likely could have chosen a cleaner (easier?) dataset to work with. Alas, we had made our bed and had to lie in it. The tough thing about the dataset we selected was that it simply wasn't clean. It required a ton of cleanup before we could even make use of it. This is one of the dirty little secrets of data science: there are selection biases, data and tooling limitations, and a plethora of other indirect inputs that affect the outputs of the model(s). During our presentation we didn't have much time to discuss this. But cleanup was probably the most intensive and time-consuming part of our entire project, coding errors included!
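
To give a concrete (and heavily simplified) taste of that cleanup, here is the kind of dplyr/tidyr pass we'd run. The file name and column names here are illustrative, not the actual schema of our dataset:

    library(dplyr)
    library(tidyr)

    # Hypothetical raw export of the data
    shootings_raw <- read.csv("school_shootings.csv", stringsAsFactors = FALSE)

    shootings_clean <- shootings_raw %>%
      # Drop rows missing the fields we need downstream
      drop_na(date, weapon_type) %>%
      # Normalize inconsistent text values before counting them
      mutate(weapon_type   = tolower(trimws(weapon_type)),
             weapon_source = tolower(trimws(weapon_source))) %>%
      # Parse date strings into proper Date objects
      mutate(date = as.Date(date, format = "%m/%d/%Y"))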

Coding:

Throughout the class we focused entirely on working in R, which is hands down the friendliest of the data science programming languages. The number of open-source packages and capabilities available in the R toolbox is pretty astounding. Our instructor's approach of requiring lab work and sheer repetition of tasks made grasping the syntax, libraries, etc. relatively quick.
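
For anyone new to R, that whole ecosystem really is just a package install away. A trivial sketch with the packages we leaned on most in class (note that plot.ly's R package is published on CRAN as "plotly"):

    # One-time install from CRAN, then load for the session
    install.packages(c("ggplot2", "dplyr", "tidyr", "plotly"))
    library(ggplot2)
    library(dplyr)
    library(tidyr)
    library(plotly)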


The Final Project:

Our team name was 4Madison. We chose this name - my daughter's name - because in the days prior to the start of class there was a school shooting scare near her school. It turned out to be an empty threat from a mentally unstable person. Beyond freaking me the hell out, it was also the impetus for our group project:

Could we predict the next school shooting?

This was an audacious - if not totally macabre - grand tour question for us to pursue. But pursue it we did. In the Google Slides below you can flip through some of our findings. Our approach was to divide and conquer the dataset: my focus was on three features (weapon_type, weapon_source, and state), Nay focused on time, date, and school demographic data, and Jamie focused on state legislation.
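
My slice of the work was mostly frequency analysis over those three features. A sketch of the idea, building on the hypothetical shootings_clean frame above (column names again illustrative):

    library(dplyr)
    library(ggplot2)

    # Simple frequency counts over my three features
    shootings_clean %>% count(weapon_type, sort = TRUE)
    shootings_clean %>% count(weapon_source, sort = TRUE)
    shootings_clean %>% count(state, sort = TRUE)

    # Incidents by state, most frequent at the top
    shootings_clean %>%
      count(state, sort = TRUE) %>%
      ggplot(aes(x = reorder(state, n), y = n)) +
      geom_col() +
      coord_flip() +
      labs(x = "State", y = "Incidents")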

So as not to bury the lede, our predictive profile of the next school shooting is this:

The next school shooting is likely to happen on a [Tuesday] at around [11:00am] using a [handgun] that the shooter will have procured from their [parents].
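
In fairness, a profile like that is less a prediction than a descriptive summary: each bracketed value is just the mode (most frequent value) of a feature. A sketch of how it falls out - my illustrative reconstruction with hypothetical column names, not necessarily the exact code we ran:

    # Helper: the most frequent value (mode) of a vector
    most_common <- function(x) names(which.max(table(x)))

    # Assumes hypothetical day_of_week and hour columns derived during cleanup
    data.frame(
      day    = most_common(shootings_clean$day_of_week),
      hour   = most_common(shootings_clean$hour),
      weapon = most_common(shootings_clean$weapon_type),
      source = most_common(shootings_clean$weapon_source)
    )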

You can find our code in the slides and in some of the slide notes.

While there remains a ton of work to do on this dataset, there are more comprehensive, higher-quality school shooting dataset analyses out there (e.g. this one). If I could do it over again, I likely would not have selected the scrubbed Washington Post version of the school shooting data to work with.

It's a pretty sad dataset to work with. While it was a great learning experience, I'll be turning my attention to a new dataset with new challenges.


