Data Science

Ferrell Carr

Published: December 10, 2018

The 5 questions data science answers

Question 1: Is this A or B? uses classification algorithms

This family of algorithms is called two-class classification.

It's useful for any question that has just two possible answers.

For example:

Will this tire fail in the next 1,000 miles: Yes or no?
Which brings in more customers: a $5 coupon or a 25% discount?

This question can also be rephrased to include more than two options: Is this A or B or C or D, etc.? This is called multiclass classification and it's useful when you have several — or several thousand — possible answers. Multiclass classification chooses the most likely one.
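As a rough illustration of "Is this A or B?", here is a minimal sketch of a nearest-centroid classifier for the tire question. The features (tread depth and tire age), the data values, and the function names are all assumptions made up for this example; nearest-centroid is just one simple classification technique among many.

```python
# Hypothetical training data: (tread depth in mm, tire age in years) -> did it fail?
training = [
    ((7.5, 1.0), "no"),
    ((6.8, 2.0), "no"),
    ((7.0, 1.5), "no"),
    ((2.1, 6.0), "yes"),
    ((1.8, 7.0), "yes"),
    ((2.5, 5.5), "yes"),
]

def centroids(examples):
    """Average the feature vectors of each class."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        total = [t + f for t, f in zip(total, features)]
        sums[label] = (total, count + 1)
    return {label: [t / n for t in total] for label, (total, n) in sums.items()}

def squared_distance(p, q):
    """Squared straight-line distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify(features, class_centroids):
    """Return the label whose centroid is closest to the given features."""
    return min(
        class_centroids,
        key=lambda label: squared_distance(features, class_centroids[label]),
    )

model = centroids(training)
prediction = classify((2.0, 6.5), model)  # a worn, old tire
print(prediction)
```

Because `classify` simply picks the closest of however many centroids exist, the same sketch handles the multiclass case when the training data contains more than two labels.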

Question 2: Is this weird? uses anomaly detection algorithms

The next question data science can answer is: Is this weird? This question is answered by a family of algorithms called anomaly detection.

If you have a credit card, you’ve already benefited from anomaly detection. Your credit card company analyzes your purchase patterns, so that they can alert you to possible fraud. Charges that are "weird" might be a purchase at a store where you don't normally shop or buying an unusually pricey item.

This question can be useful in lots of ways. For instance:

If you have a car with pressure gauges, you might want to know: Is this pressure gauge reading normal?
If you're monitoring the internet, you’d want to know: Is this message from the internet typical?

Anomaly detection flags unexpected or unusual events or behaviors. It gives clues where to look for problems.
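One simple way to ask "Is this weird?" is to check how far a new reading sits from the readings seen so far. The sketch below uses a three-standard-deviation rule; the pressure values and the threshold are assumptions chosen for illustration, and real anomaly detection systems often use more robust methods.

```python
from statistics import mean, stdev

# Hypothetical pressure gauge readings (psi) recorded during normal operation.
normal_readings = [32.1, 31.8, 32.4, 32.0, 31.9, 32.2, 32.3, 31.7]

def is_weird(reading, history, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(reading - mu) > threshold * sigma

print(is_weird(32.0, normal_readings))  # a typical reading
print(is_weird(24.5, normal_readings))  # a sudden drop in pressure
```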

Question 3: How much? or How many? uses regression algorithms

Machine learning can also predict the answer to How much? or How many? The algorithm family that answers this question is called regression.

Regression algorithms make numerical predictions, such as:

What will the temperature be next Tuesday?
What will my fourth quarter sales be?

They help answer any question that asks for a number.

Question 4: How is this organized? uses clustering algorithms

Now the last two questions are a bit more advanced.

Sometimes you want to understand the structure of a data set - How is this organized? For this question, you don’t have examples that you already know outcomes for.

There are a lot of ways to tease out the structure of data. One approach is clustering. It separates data into natural "clumps," for easier interpretation. With clustering, there is no one right answer.

Common examples of clustering questions are:

Which viewers like the same types of movies?
Which printer models fail the same way?

By understanding how data is organized, you can better understand - and predict - behaviors and events.
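To make the idea of "clumps" concrete, here is a minimal sketch of k-means clustering on a single column of numbers. The printer data and the starting centers are invented for this example, and k-means is only one of several clustering approaches.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: thousands of pages each printer produced before failing.
pages_before_failure = [9, 10, 11, 12, 48, 50, 52, 55]
centers, clusters = k_means_1d(pages_before_failure, centers=[9, 55])
print(centers)
print(clusters)
```

The result depends on how many centers you ask for and where they start, which is one reason there is no single right answer with clustering.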

Question 5: What should I do now? uses reinforcement learning algorithms

The last question – What should I do now? – uses a family of algorithms called reinforcement learning.

Reinforcement learning was inspired by how the brains of rats and humans respond to punishment and rewards. These algorithms learn from outcomes, and decide on the next action.

Typically, reinforcement learning is a good fit for automated systems that have to make lots of small decisions without human guidance.

Questions it answers are always about what action should be taken - usually by a machine or a robot. Examples are:

If I'm a temperature control system for a house: Adjust the temperature or leave it where it is?
If I'm a self-driving car: At a yellow light, brake or accelerate?
For a robot vacuum: Keep vacuuming, or go back to the charging station?

Reinforcement learning algorithms gather data as they go, learning from trial and error.
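The trial-and-error loop can be sketched with an epsilon-greedy strategy: mostly take the action that has paid off best so far, but occasionally try something else. This is the simplest form of the idea (a so-called multi-armed bandit, with no notion of changing situations), and the actions and reward values below are invented for illustration.

```python
import random

def learn_best_action(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over a fixed set of actions."""
    rng = random.Random(seed)
    actions = list(rewards)
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def average(action):
        return totals[action] / counts[action]

    for step in range(steps):
        if step < len(actions):
            action = actions[step]              # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(actions)        # explore: pick at random
        else:
            action = max(actions, key=average)  # exploit the best estimate so far
        reward = rewards[action]()              # observe the outcome
        totals[action] += reward
        counts[action] += 1
    return max(actions, key=average)

# Hypothetical robot vacuum with a low battery; the rewards are made up.
rewards = {
    "keep vacuuming": lambda: -1.0,     # risks running out of power
    "return to charger": lambda: 1.0,   # gets to recharge
}
best = learn_best_action(rewards)
print(best)
```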

Transcript: Is your data ready for data science?

Welcome to "Is your data ready for data science?" the second video in the series Data Science for Beginners.

Before data science can give you the answers you want, you have to give it some high-quality raw materials to work with. Just like making a pizza, the better the ingredients you start with, the better the final product.

Criteria for data

In data science, the data you pull together must meet certain criteria. It must be:

Relevant
Connected
Accurate
Enough to work with

Is your data relevant?

So the first ingredient - you need data that's relevant.

On the left, the table presents the blood alcohol level of seven people tested outside a Boston bar, the Red Sox batting average in their last game, and the price of milk in the nearest convenience store.

This is all perfectly legitimate data. Its only fault is that it isn’t relevant. There's no obvious relationship between these numbers. If someone gave you the current price of milk and the Red Sox batting average, there's no way you could guess their blood alcohol content.

Now look at the table on the right. This time each person’s body mass was measured as well as the number of drinks they’ve had. The numbers in each row are now relevant to each other. If I gave you my body mass and the number of Margaritas I've had, you could make a guess at my blood alcohol content.
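One rough way to check relevance is to measure how strongly each column moves together with the thing you want to predict. The sketch below computes a Pearson correlation by hand; all of the numbers are invented (the tables from the video are not reproduced here), and correlation only picks up straight-line relationships, so treat it as a first look rather than a verdict.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements for seven people.
blood_alcohol = [0.02, 0.04, 0.05, 0.08, 0.09, 0.12, 0.14]
drinks = [1, 2, 3, 4, 5, 6, 7]
milk_price = [3.49, 3.29, 3.59, 3.39, 3.49, 3.29, 3.59]

r_drinks = pearson(drinks, blood_alcohol)    # expected to be strong
r_milk = pearson(milk_price, blood_alcohol)  # expected to be weak
print(round(r_drinks, 2), round(r_milk, 2))
```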

Do you have connected data?

The next ingredient is connected data.

Here is some relevant data on the quality of hamburgers: grill temperature, patty weight, and rating in the local food magazine. But notice the gaps in the table on the left.

Most data sets are missing some values. It's common to have holes like this and there are ways to work around them. But if there's too much missing, your data begins to look like Swiss cheese.

If you look at the table on the left, there's so much missing data, it's hard to come up with any kind of relationship between grill temperature and patty weight. This example shows disconnected data.

The table on the right, though, is full and complete - an example of connected data.
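A quick way to see how much "Swiss cheese" a table has is to count the missing values in each column and the rows that are fully filled in. The hamburger rows and column names below are hypothetical, with `None` standing in for a blank cell.

```python
# Hypothetical hamburger table; None marks a missing value.
burgers = [
    {"grill_temp_f": 450, "patty_weight_oz": None, "rating": 4.0},
    {"grill_temp_f": None, "patty_weight_oz": 6.0, "rating": None},
    {"grill_temp_f": 500, "patty_weight_oz": 8.0, "rating": 4.5},
    {"grill_temp_f": None, "patty_weight_oz": None, "rating": 3.0},
]

def missing_fraction(rows):
    """Fraction of rows in which each column is missing."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is None for row in rows) / len(rows) for col in columns
    }

def complete_rows(rows):
    """Rows that have a value in every column."""
    return [row for row in rows if all(v is not None for v in row.values())]

print(missing_fraction(burgers))
print(len(complete_rows(burgers)), "of", len(burgers), "rows are complete")
```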

Is your data accurate?

The next ingredient is accuracy. Here are four targets to hit.

Look at the target in the upper right. There is a tight grouping right around the bull's-eye. That, of course, is accurate. Oddly, in the language of data science, performance on the target right below it is also considered accurate.

If you mapped out the center of these arrows, you'd see that it's very close to the bull's-eye. The arrows are spread out all around the target, so they're considered imprecise, but they're centered around the bull's-eye, so they're considered accurate.

Now look at the upper-left target. Here the arrows hit very close together, a tight grouping. They're precise, but they're inaccurate because the center is way off the bull's-eye. The arrows in the bottom-left target are both inaccurate and imprecise. This archer needs more practice.
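The four targets can be summarized with two numbers: how far the center of the group is from the bull's-eye (accuracy) and how spread out the group is (precision). Here is a minimal sketch in one dimension; the shot positions and the tolerance of 1.0 are assumptions chosen to mimic the four cases.

```python
from statistics import mean, stdev

def describe(shots, target=0.0, tolerance=1.0):
    """Label a group of shots by its bias (accuracy) and spread (precision)."""
    bias = abs(mean(shots) - target)  # distance from the group's center to the target
    spread = stdev(shots)             # scatter of the shots around their own center
    accuracy = "accurate" if bias < tolerance else "inaccurate"
    precision = "precise" if spread < tolerance else "imprecise"
    return accuracy, precision

# Hypothetical horizontal offsets from the bull's-eye for four archers.
upper_right = [-0.2, 0.1, 0.0, 0.2, -0.1]  # tight and centered
lower_right = [-4.0, 3.5, -3.0, 4.0, -0.5]  # scattered but centered
upper_left = [5.8, 6.1, 6.0, 6.2, 5.9]      # tight but off center
lower_left = [2.0, 9.5, 3.0, 10.0, 5.5]     # scattered and off center

for name, shots in [
    ("upper right", upper_right),
    ("lower right", lower_right),
    ("upper left", upper_left),
    ("lower left", lower_left),
]:
    print(name, describe(shots))
```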

Do you have enough data to work with?

Finally, ingredient #4 is sufficient data.

Think of each data point in your table as being a brush stroke in a painting. If you have only a few of them, the painting can be fuzzy - it's hard to tell what it is.

If you add some more brush strokes, then your painting starts to get a little sharper.

When you have barely enough strokes, you only see enough to make some broad decisions. Is it somewhere I might want to visit? It looks bright, that looks like clean water – yes, that’s where I’m going on vacation.

As you add more data, the picture becomes clearer and you can make more detailed decisions. Now you can look at the three hotels on the left bank. You can notice the architectural features of the one in the foreground. You might even choose to stay on the third floor because of the view.

With data that's relevant, connected, accurate, and enough, you have all the ingredients needed to do some high-quality data science.

Ask a sharp question

We've talked about how data science is the process of using names (also called categories or labels) and numbers to predict an answer to a question. But it can't be just any question; it has to be a sharp question.

A vague question doesn't have to be answered with a name or a number. A sharp question must.

Imagine you found a magic lamp with a genie who will truthfully answer any question you ask. But it's a mischievous genie, and he'll try to make his answer as vague and confusing as he can get away with. You want to pin him down with a question so airtight that he can't help but tell you what you want to know.

If you were to ask a vague question, like "What's going to happen with my stock?", the genie might answer, "The price will change". That's a truthful answer, but it's not very helpful.

But if you were to ask a sharp question, like "What will my stock's sale price be next week?", the genie can't help but give you a specific answer and predict a sale price.

Examples of your answer: Target data

Once you formulate your question, check to see whether you have examples of the answer in your data.

If our question is "What will my stock's sale price be next week?" then we have to make sure our data includes the stock price history.

If our question is "Which car in my fleet is going to fail first?" then we have to make sure our data includes information about previous failures.

These examples of answers are called a target. A target is what we are trying to predict about future data points, whether it's a category or a number.

If you don't have any target data, you'll need to get some. You won't be able to answer your question without it.

Reformulate your question

Sometimes you can reword your question to get a more useful answer.

The question "Is this data point A or B?" predicts the category (or name or label) of something. To answer it, we use a classification algorithm.

The question "How much?" or "How many?" predicts an amount. To answer it we use a regression algorithm.

To see how we can transform these, let's look at the question, "Which news story is the most interesting to this reader?" It asks for a prediction of a single choice from many possibilities - in other words "Is this A or B or C or D?" - and would use a classification algorithm.

But, this question may be easier to answer if you reword it as "How interesting is each story on this list to this reader?" Now you can give each article a numerical score, and then it's easy to identify the highest-scoring article. This is a rephrasing of the classification question into a regression question or How much?
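Once each story has a numerical score, picking the most interesting one is just a matter of sorting. In this sketch the headlines and scores are invented; in practice the scores would come from a regression model trained on the reader's history.

```python
# Hypothetical "How interesting?" scores predicted for one reader.
predicted_interest = {
    "Local election results": 0.42,
    "New telescope images": 0.87,
    "Transfer rumors": 0.15,
}

# Rank the stories from highest to lowest predicted score.
ranked = sorted(predicted_interest, key=predicted_interest.get, reverse=True)
most_interesting = ranked[0]
print(most_interesting)
```

A side benefit of the regression framing is that you get the full ranking, not just the single top choice.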

How you ask a question is a clue to which algorithm can give you an answer.

You'll find that certain families of algorithms - like the ones in our news story example - are closely related. You can reformulate your question to use the algorithm that gives you the most useful answer.

But, most important, ask that sharp question - the question that you can answer with data. And be sure you have the right data to answer it.

A model is a simplified story about our data. I'll show you what I mean.

Collect relevant, accurate, connected, enough data

Say I want to shop for a diamond. I have a ring that belonged to my grandmother with a setting for a 1.35 carat diamond, and I want to get an idea of how much it will cost. I take a notepad and pen into the jewelry store, and I write down the price of all of the diamonds in the case and how much they weigh in carats. Starting with the first diamond - it's 1.01 carats and $7,366.

Now I go through and do this for all the other diamonds in the store.

Notice that our list has two columns. Each column has a different attribute - weight in carats and price - and each row is a single data point that represents a single diamond.

We've actually created a small data set here - a table. Notice that it meets our criteria for quality:

The data is relevant - weight is definitely related to price
It's accurate - we double-checked the prices that we wrote down
It's connected - there are no blank spaces in either of these columns
And, as we'll see, it's enough data to answer our question

Ask a sharp question

Now we'll pose our question in a sharp way: "How much will it cost to buy a 1.35 carat diamond?"

Our list doesn't have a 1.35 carat diamond in it, so we'll have to use the rest of our data to get an answer to the question.

Plot the existing data

The first thing we'll do is draw a horizontal number line, called an axis, to chart the weights. The range of the weights is 0 to 2, so we'll draw a line that covers that range and put ticks for each half carat.

Next we'll draw a vertical axis to record the price and connect it to the horizontal weight axis. This will be in units of dollars. Now we have a set of coordinate axes.

We're going to take this data now and turn it into a scatter plot. This is a great way to visualize numerical data sets.

For the first data point, we eyeball a vertical line at 1.01 carats. Then, we eyeball a horizontal line at $7,366. Where they meet, we draw a dot. This represents our first diamond.

Now we go through each diamond on this list and do the same thing. When we're through, this is what we get: a bunch of dots, one for each diamond.

Draw the model through the data points

Now if you look at the dots and squint, the collection looks like a fat, fuzzy line. We can take our marker and draw a straight line through it.

By drawing a line, we created a model. Think of this as taking the real world and making a simplistic cartoon version of it. Now the cartoon is wrong - the line doesn't go through all the data points. But, it's a useful simplification.

The fact that the line doesn't go exactly through all the dots is OK. Data scientists explain this by saying that there's the model - that's the line - and then each dot has some noise or variance associated with it. There's the underlying perfect relationship, and then there's the gritty, real world that adds noise and uncertainty.

Because we're trying to answer the question How much? this is called a regression. And because we're using a straight line, it's a linear regression.

Use the model to find the answer

Now we have a model and we ask it our question: How much will a 1.35 carat diamond cost?

To answer our question, we eyeball 1.35 carats and draw a vertical line. Where it crosses the model line, we eyeball a horizontal line to the dollar axis. It hits right at 10,000. Boom! That's the answer: A 1.35 carat diamond costs about $10,000.
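The same draw-a-line-and-read-it-off procedure can be sketched in code with an ordinary least-squares fit. Only the first diamond (1.01 carats, $7,366) comes from the walkthrough above; the other five data points are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Weight in carats and price in dollars; all but the first pair are made up.
carats = [1.01, 0.50, 0.75, 1.20, 1.50, 1.80]
prices = [7366, 3000, 6200, 8100, 12100, 13000]

slope, intercept = fit_line(carats, prices)      # "draw" the model line
predicted_price = slope * 1.35 + intercept       # read off the price at 1.35 carats
print(round(predicted_price))
```

With these made-up numbers the estimate lands close to $10,000, in line with the eyeballed answer.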

Create a confidence interval

It's natural to wonder how precise this prediction is. It's useful to know whether the 1.35 carat diamond will be very close to $10,000, or a lot higher or lower. To figure this out, let's draw an envelope around the regression line that includes most of the dots. This envelope is called our confidence interval: We're pretty confident that prices fall within this envelope, because in the past most of them have. We can draw two more horizontal lines from where the 1.35 carat line crosses the top and the bottom of that envelope.

Now we can say something about our confidence interval: We can say confidently that the price of a 1.35 carat diamond is about $10,000 - but it might be as low as $8,000 and it might be as high as $12,000.
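The envelope can be approximated in code by measuring how far the dots typically fall from the line. The sketch below uses a band of two residual standard deviations on either side of the line; that width is an assumption standing in for "most of the dots", not a formal statistical interval, and all data except the first diamond is invented.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Weight in carats and price in dollars; all but the first pair are made up.
carats = [1.01, 0.50, 0.75, 1.20, 1.50, 1.80]
prices = [7366, 3000, 6200, 8100, 12100, 13000]

slope, intercept = fit_line(carats, prices)

# Vertical distance of each dot from the model line.
residuals = [y - (slope * x + intercept) for x, y in zip(carats, prices)]
# Typical size of those distances (two fitted parameters, hence n - 2).
residual_sd = sqrt(sum(r ** 2 for r in residuals) / (len(residuals) - 2))

half_width = 2 * residual_sd
estimate = slope * 1.35 + intercept
low, high = estimate - half_width, estimate + half_width

inside = sum(abs(r) <= half_width for r in residuals)
print(round(low), round(estimate), round(high))
print(inside, "of", len(residuals), "diamonds fall inside the envelope")
```

With this invented data the band runs from roughly $8,400 to $11,600 around an estimate near $10,000.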

We're done, with no math or computers

We did what data scientists get paid to do, and we did it just by drawing:

We asked a question that we could answer with data
We built a model using linear regression
We made a prediction, complete with a confidence interval

And we didn't use math or computers to do it.

Now if we'd had more information, like...

the cut of the diamond
color variations (how close the diamond is to being white)
the number of inclusions in the diamond

...then we would have had more columns. In that case, math becomes helpful. If you have more than two columns, it's hard to draw dots on paper. The math lets you fit that line or that plane to your data very nicely.

Also, if instead of just a handful of diamonds we had two thousand or two million, a computer could do that work much faster.

Find examples in the Azure AI Gallery

Microsoft has a cloud-based service called Azure Machine Learning Studio that you're welcome to try for free. It provides you with a workspace where you can experiment with different machine learning algorithms, and, when you've got your solution worked out, you can launch it as a web service.

Part of this service is something called the Azure AI Gallery. It contains resources, including a collection of Azure Machine Learning experiments, or models, that people have built and contributed for others to use. These experiments are a great way to leverage the thought and hard work of others to get you started on your own solutions. Everyone is welcome to browse through it.

If you click Experiments at the top, you'll see a number of the most recent and popular experiments in the gallery. You can search through the rest of the experiments by clicking Browse All at the top of the screen, and there you can enter search terms and choose search filters.

Find and use a clustering algorithm example

So, for instance, let's say you want to see an example of how clustering works, so you search for "clustering sweep" experiments.

Here's an interesting one that someone contributed to the gallery.

Click on that experiment and you get a web page that describes the work that this contributor did, along with some of their results.

Notice the link that says Open in Studio.

I can click on that and it takes me right to Azure Machine Learning Studio. It creates a copy of the experiment and puts it in my own workspace. This includes the contributor's dataset, all the processing that they did, all of the algorithms that they used, and how they saved out the results.

And now I have a starting point. I can swap out their data for my own and do my own tweaking of the model. This gives me a running start, and it lets me build on the work of people who really know what they’re doing.

Find experiments that demonstrate machine learning techniques

There are other experiments in the Azure AI Gallery that were contributed specifically to provide how-to examples for people new to data science. For instance, there's an experiment in the gallery that demonstrates how to handle missing values (Methods for handling missing values). It walks you through 15 different ways of substituting empty values, and talks about the benefits of each method and when to use it.

Azure AI Gallery is a place to find working experiments that you can use as a starting point for your own solutions.
