Data Science

The 5 questions data science answers

Question 1: Is this A or B? uses classification algorithms

This family of algorithms is called two-class classification.

It's useful for any question that has just two possible answers.

For example:

  • Will this tire fail in the next 1,000 miles: Yes or no?
  • Which brings in more customers: a $5 coupon or a 25% discount?

This question can also be rephrased to include more than two options: Is this A or B or C or D, etc.? This is called multiclass classification and it's useful when you have several — or several thousand — possible answers. Multiclass classification chooses the most likely one.
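
To make the idea concrete, here is a minimal sketch of a classifier for the tire question. It uses a nearest-centroid rule, which is just one simple technique among many, and the tread-depth and age numbers are invented for illustration. The same code works unchanged with more than two labels, which is the multiclass case.

```python
from statistics import fmean

# Hypothetical training data: (tread depth in mm, tire age in years) -> did it fail?
# All numbers are invented for illustration.
training = [
    ((7.5, 1.0), "no"),
    ((6.8, 2.0), "no"),
    ((7.0, 1.5), "no"),
    ((2.1, 6.0), "yes"),
    ((1.8, 7.0), "yes"),
    ((2.5, 5.5), "yes"),
]

def centroids(examples):
    """Average the feature vectors of each class."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def classify(features, centers):
    """Return the label whose centroid is closest to the features."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centers, key=lambda label: squared_distance(centers[label]))

centers = centroids(training)
print(classify((2.0, 6.5), centers))  # a worn, old tire
print(classify((7.2, 1.2), centers))  # a nearly new tire
```

The classifier picks the most likely label by distance alone. A production model would also need to be checked against examples it was not trained on.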

Question 2: Is this weird? uses anomaly detection algorithms

The next question data science can answer is: Is this weird? This question is answered by a family of algorithms called anomaly detection.

If you have a credit card, you’ve already benefited from anomaly detection. Your credit card company analyzes your purchase patterns, so that they can alert you to possible fraud. Charges that are "weird" might be a purchase at a store where you don't normally shop or buying an unusually pricey item.

This question can be useful in lots of ways. For instance:

  • If you have a car with pressure gauges, you might want to know: Is this pressure gauge reading normal?
  • If you're monitoring the internet, you’d want to know: Is this message from the internet typical?

Anomaly detection flags unexpected or unusual events or behaviors. It gives clues where to look for problems.
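
As a rough sketch of the pressure gauge question, one simple approach is to learn what normal looks like from past readings and flag anything far from it. The readings and the three-standard-deviation threshold below are assumptions chosen for illustration, not values from a real car.

```python
from statistics import fmean, stdev

# Hypothetical tire-pressure readings (psi) recorded during normal driving.
normal_readings = [32.1, 31.8, 32.4, 32.0, 31.9, 32.2, 32.3, 31.7]

mean = fmean(normal_readings)
spread = stdev(normal_readings)

def is_weird(reading, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    return abs(reading - mean) / spread > threshold

print(is_weird(32.1))  # a typical reading
print(is_weird(24.5))  # a sharp drop in pressure
```

A flag here is only a clue about where to look, not proof that something is wrong.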

Question 3: How much? or How many? uses regression algorithms

Machine learning can also predict the answer to How much? or How many? The algorithm family that answers this question is called regression.


Regression algorithms make numerical predictions, such as:

  • What will the temperature be next Tuesday?
  • What will my fourth quarter sales be?

They help answer any question that asks for a number.

Question 4: How is this organized? uses clustering algorithms

Now the last two questions are a bit more advanced.

Sometimes you want to understand the structure of a data set - How is this organized? For this question, you don’t have examples that you already know outcomes for.

There are a lot of ways to tease out the structure of data. One approach is clustering. It separates data into natural "clumps," for easier interpretation. With clustering, there is no one right answer.


Common examples of clustering questions are:

  • Which viewers like the same types of movies?
  • Which printer models fail the same way?

By understanding how data is organized, you can better understand - and predict - behaviors and events.
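
Here is a small sketch of clustering on a single column of numbers, using a basic version of k-means. The printer data is invented, and the starting centers (the smallest and largest values) are an assumption that happens to suit two clumps. Notice that no row carries a label saying which group it belongs to.

```python
from statistics import fmean

def kmeans_1d(values, centers, rounds=10):
    """Group numbers around the given starting centers (a basic k-means)."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [fmean(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers

# Hypothetical data: pages printed before each printer's first failure (thousands).
pages = [4, 5, 6, 5, 48, 52, 50, 47]
groups, centers = kmeans_1d(pages, centers=[min(pages), max(pages)])
print(groups)
print(centers)
```

Choosing a different number of starting centers would give a different, equally defensible grouping, which is what "no one right answer" means in practice.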

Question 5: What should I do now? uses reinforcement learning algorithms

The last question – What should I do now? – uses a family of algorithms called reinforcement learning.

Reinforcement learning was inspired by how the brains of rats and humans respond to punishment and rewards. These algorithms learn from outcomes, and decide on the next action.

Typically, reinforcement learning is a good fit for automated systems that have to make lots of small decisions without human guidance.

Questions it answers are always about what action should be taken - usually by a machine or a robot. Examples are:

  • If I'm a temperature control system for a house: Adjust the temperature or leave it where it is?
  • If I'm a self-driving car: At a yellow light, brake or accelerate?
  • For a robot vacuum: Keep vacuuming, or go back to the charging station?

Reinforcement learning algorithms gather data as they go, learning from trial and error.
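
The trial-and-error loop can be sketched with one of the simplest reinforcement learning setups, an epsilon-greedy learner choosing between two actions. The success chances below are invented stand-ins for the real world, and the learner never reads them directly. It only sees the reward that follows each action.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical environment for a robot vacuum with a low battery:
# chance that each action turns out well (the learner never sees these).
success_chance = {"keep_vacuuming": 0.2, "return_to_charger": 0.8}

value = {action: 0.0 for action in success_chance}  # estimated reward per action
tries = {action: 0 for action in success_chance}

for _ in range(1000):
    if rng.random() < 0.1:
        # Explore: try a random action 10% of the time.
        action = rng.choice(list(value))
    else:
        # Exploit: take the action with the best estimate so far.
        action = max(value, key=value.get)
    reward = 1.0 if rng.random() < success_chance[action] else 0.0
    tries[action] += 1
    # Running average: nudge the estimate toward the observed reward.
    value[action] += (reward - value[action]) / tries[action]

best = max(value, key=value.get)
print(best, value)
```

The 10% exploration rate is a design choice. Without some exploration, the learner could settle on the first action that ever paid off.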

Transcript: Is your data ready for data science?

Welcome to "Is your data ready for data science?" the second video in the series Data Science for Beginners.

Before data science can give you the answers you want, you have to give it some high-quality raw materials to work with. Just like making a pizza, the better the ingredients you start with, the better the final product.

Criteria for data

In data science, the data you pull together needs certain qualities. It must be:

  • Relevant
  • Connected
  • Accurate
  • Enough to work with

Is your data relevant?

So the first ingredient - you need data that's relevant.


On the left, the table presents the blood alcohol level of seven people tested outside a Boston bar, the Red Sox batting average in their last game, and the price of milk in the nearest convenience store.

This is all perfectly legitimate data. Its only fault is that it isn't relevant. There's no obvious relationship between these numbers. If someone gave you the current price of milk and the Red Sox batting average, there's no way you could guess a person's blood alcohol content.

Now look at the table on the right. This time each person’s body mass was measured as well as the number of drinks they’ve had. The numbers in each row are now relevant to each other. If I gave you my body mass and the number of Margaritas I've had, you could make a guess at my blood alcohol content.
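
One rough way to check for relevance is to measure how strongly two columns move together. The sketch below computes a Pearson correlation by hand. The seven rows are invented to mirror the tables described above, and correlation only captures straight-line relationships, so treat it as a first look rather than a verdict.

```python
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation: near +1 or -1 is a strong linear link, near 0 is none."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Hypothetical measurements for seven people (invented for illustration).
blood_alcohol = [0.02, 0.05, 0.08, 0.03, 0.10, 0.06, 0.01]
drinks = [1, 2, 4, 1, 5, 3, 0]
milk_price = [3.02, 2.95, 2.98, 3.03, 3.02, 3.00, 3.00]

print(f"drinks vs. blood alcohol:     {pearson(drinks, blood_alcohol):.2f}")
print(f"milk price vs. blood alcohol: {pearson(milk_price, blood_alcohol):.2f}")
```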

Do you have connected data?

The next ingredient is connected data.


Here is some relevant data on the quality of hamburgers: grill temperature, patty weight, and rating in the local food magazine. But notice the gaps in the table on the left.

Most data sets are missing some values. It's common to have holes like this and there are ways to work around them. But if there's too much missing, your data begins to look like Swiss cheese.

If you look at the table on the left, there's so much missing data, it's hard to come up with any kind of relationship between grill temperature and patty weight. This example shows disconnected data.

The table on the right, though, is full and complete - an example of connected data.
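
A quick way to see how much Swiss cheese you have is to count the holes. This sketch uses an invented hamburger table in which None marks a missing value. The column names are assumptions based on the description above.

```python
# Hypothetical hamburger table; None marks a missing value.
rows = [
    {"grill_temp": 450, "patty_weight": None, "rating": 4},
    {"grill_temp": None, "patty_weight": 0.25, "rating": None},
    {"grill_temp": 400, "patty_weight": 0.33, "rating": 3},
    {"grill_temp": None, "patty_weight": None, "rating": 5},
]

def missing_fraction(table):
    """Return the share of missing values in each column."""
    columns = table[0].keys()
    return {
        name: sum(row[name] is None for row in table) / len(table)
        for name in columns
    }

def complete_rows(table):
    """Keep only rows with no gaps, the ones usable for relating columns."""
    return [row for row in table if None not in row.values()]

print(missing_fraction(rows))
print(len(complete_rows(rows)), "of", len(rows), "rows are complete")
```

With only one complete row out of four, there is nothing to compare, which is the disconnected case.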

Is your data accurate?

The next ingredient is accuracy. Here are four targets to hit.

Look at the target in the upper right. There is a tight grouping right around the bulls eye. That, of course, is accurate. Oddly, in the language of data science, performance on the target right below it is also considered accurate.

If you mapped out the center of these arrows, you'd see that it's very close to the bulls eye. The arrows are spread out all around the target, so they're considered imprecise, but they're centered around the bulls eye, so they're considered accurate.

Now look at the upper-left target. Here the arrows hit very close together, a tight grouping. They're precise, but they're inaccurate because the center is way off the bulls eye. The arrows in the bottom-left target are both inaccurate and imprecise. This archer needs more practice.
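
The difference between accurate and precise can be put into numbers. In this sketch, accuracy is how far the center of the group sits from the bullseye, and precision is how far the arrows scatter around their own center. The arrow positions are invented, and using the average distance as the spread is one reasonable choice among several.

```python
from math import hypot
from statistics import fmean

def describe(arrows):
    """Return (offset of the group's center from the bullseye, average spread)."""
    cx = fmean(x for x, _ in arrows)
    cy = fmean(y for _, y in arrows)
    offset = hypot(cx, cy)  # the bullseye is at (0, 0)
    spread = fmean(hypot(x - cx, y - cy) for x, y in arrows)
    return offset, spread

# Hypothetical arrow positions, in centimeters from the bullseye.
precise_but_inaccurate = [(10.0, 10.0), (10.5, 9.5), (9.5, 10.5), (10.0, 10.0)]
accurate_but_imprecise = [(8.0, 0.0), (-8.0, 0.0), (0.0, 8.0), (0.0, -8.0)]

for name, arrows in [("tight, off-center", precise_but_inaccurate),
                     ("spread, centered", accurate_but_imprecise)]:
    offset, spread = describe(arrows)
    print(f"{name}: offset={offset:.2f}, spread={spread:.2f}")
```

A small offset means accurate, and a small spread means precise. The two are measured independently.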

Do you have enough data to work with?

Finally, ingredient #4 is sufficient data.

Think of each data point in your table as being a brush stroke in a painting. If you have only a few of them, the painting can be fuzzy - it's hard to tell what it is.

If you add some more brush strokes, then your painting starts to get a little sharper.

When you have barely enough strokes, you only see enough to make some broad decisions. Is it somewhere I might want to visit? It looks bright, that looks like clean water – yes, that’s where I’m going on vacation.

As you add more data, the picture becomes clearer and you can make more detailed decisions. Now you can look at the three hotels on the left bank. You can notice the architectural features of the one in the foreground. You might even choose to stay on the third floor because of the view.

With data that's relevant, connected, accurate, and enough, you have all the ingredients needed to do some high-quality data science.

Ask a sharp question

We've talked about how data science is the process of using names (also called categories or labels) and numbers to predict an answer to a question. But it can't be just any question; it has to be a sharp question.

A vague question doesn't have to be answered with a name or a number. A sharp question must.

Imagine you found a magic lamp with a genie who will truthfully answer any question you ask. But it's a mischievous genie, and he'll try to make his answer as vague and confusing as he can get away with. You want to pin him down with a question so airtight that he can't help but tell you what you want to know.

If you were to ask a vague question, like "What's going to happen with my stock?", the genie might answer, "The price will change". That's a truthful answer, but it's not very helpful.

But if you were to ask a sharp question, like "What will my stock's sale price be next week?", the genie can't help but give you a specific answer and predict a sale price.

Examples of your answer: Target data

Once you formulate your question, check to see whether you have examples of the answer in your data.

If our question is "What will my stock's sale price be next week?" then we have to make sure our data includes the stock price history.

If our question is "Which car in my fleet is going to fail first?" then we have to make sure our data includes information about previous failures.

These examples of answers are called a target. A target is what we are trying to predict about future data points, whether it's a category or a number.

If you don't have any target data, you'll need to get some. You won't be able to answer your question without it.

Reformulate your question

Sometimes you can reword your question to get a more useful answer.

The question "Is this data point A or B?" predicts the category (or name or label) of something. To answer it, we use a classification algorithm.

The question "How much?" or "How many?" predicts an amount. To answer it we use a regression algorithm.

To see how we can transform these, let's look at the question, "Which news story is the most interesting to this reader?" It asks for a prediction of a single choice from many possibilities - in other words "Is this A or B or C or D?" - and would use a classification algorithm.

But this question may be easier to answer if you reword it as "How interesting is each story on this list to this reader?" Now you can give each article a numerical score, and then it's easy to identify the highest-scoring article. This rephrases the classification question as a regression question: How much?
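
The rewording can be sketched in a few lines. The story titles and interest scores here are invented. In practice a regression model would predict the scores, and that step is assumed rather than shown.

```python
# Hypothetical interest scores (0 to 1) that a regression model might
# predict for one reader; the values are invented for illustration.
predicted_interest = {
    "Local election results": 0.42,
    "New telescope images": 0.87,
    "Quarterly earnings roundup": 0.15,
}

# Regression view: every story gets a number, so the list can be ranked.
ranking = sorted(predicted_interest, key=predicted_interest.get, reverse=True)

# Classification view: the single most interesting story is the top of the ranking.
most_interesting = ranking[0]
print(ranking)
print(most_interesting)
```

The scores answer the original question and also give a full ordering, which a single classification answer would not.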

How you ask a question is a clue to which algorithm can give you an answer.

You'll find that certain families of algorithms - like the ones in our news story example - are closely related. You can reformulate your question to use the algorithm that gives you the most useful answer.

But, most important, ask that sharp question - the question that you can answer with data. And be sure you have the right data to answer it.

A model is a simplified story about our data. I'll show you what I mean.

Collect relevant, accurate, connected, enough data

Say I want to shop for a diamond. I have a ring that belonged to my grandmother with a setting for a 1.35 carat diamond, and I want to get an idea of how much it will cost. I take a notepad and pen into the jewelry store, and I write down the price of all of the diamonds in the case and how much they weigh in carats. Starting with the first diamond - it's 1.01 carats and $7,366.

Now I go through and do this for all the other diamonds in the store.

Notice that our list has two columns. Each column has a different attribute - weight in carats and price - and each row is a single data point that represents a single diamond.

We've actually created a small data set here - a table. Notice that it meets our criteria for quality:

  • The data is relevant - weight is definitely related to price
  • It's accurate - we double-checked the prices that we wrote down
  • It's connected - there are no blank spaces in either of these columns
  • And, as we'll see, it's enough data to answer our question

Ask a sharp question

Now we'll pose our question in a sharp way: "How much will it cost to buy a 1.35 carat diamond?"

Our list doesn't have a 1.35 carat diamond in it, so we'll have to use the rest of our data to get an answer to the question.

Plot the existing data

The first thing we'll do is draw a horizontal number line, called an axis, to chart the weights. The range of the weights is 0 to 2, so we'll draw a line that covers that range and put ticks for each half carat.

Next we'll draw a vertical axis to record the price and connect it to the horizontal weight axis. This will be in units of dollars. Now we have a set of coordinate axes.

We're going to take this data now and turn it into a scatter plot. This is a great way to visualize numerical data sets.

For the first data point, we eyeball a vertical line at 1.01 carats. Then, we eyeball a horizontal line at $7,366. Where they meet, we draw a dot. This represents our first diamond.

Now we go through each diamond on this list and do the same thing. When we're through, this is what we get: a bunch of dots, one for each diamond.

Draw the model through the data points

Now if you look at the dots and squint, the collection looks like a fat, fuzzy line. We can take our marker and draw a straight line through it.

By drawing a line, we created a model. Think of this as taking the real world and making a simplistic cartoon version of it. Now the cartoon is wrong - the line doesn't go through all the data points. But, it's a useful simplification.

It's OK that the line doesn't go exactly through all the dots. Data scientists explain this by saying that there's the model - that's the line - and then each dot has some noise or variance associated with it. There's the underlying perfect relationship, and then there's the gritty, real world that adds noise and uncertainty.

Because we're trying to answer the question How much? this is called a regression. And because we're using a straight line, it's a linear regression.
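
The same line can be fit with arithmetic instead of by eye. This is a minimal sketch of ordinary least squares, the usual way to fit a linear regression. Only the first diamond (1.01 carats, $7,366) comes from the notes above. The other rows are invented so that the code has something to work with.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical notes from the jewelry store; only the first pair is from the text.
carats = [1.01, 0.5, 0.75, 1.2, 1.5, 1.8]
prices = [7366, 3200, 5100, 8600, 11400, 13900]

slope, intercept = fit_line(carats, prices)
print(f"price = {slope:.0f} * carats + {intercept:.0f}")
```

The slope says how many dollars each extra carat adds under this simplified model. The intercept is where the line crosses the price axis.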

Use the model to find the answer

Now we have a model and we ask it our question: How much will a 1.35 carat diamond cost?

To answer our question, we eyeball 1.35 carats and draw a vertical line. Where it crosses the model line, we eyeball a horizontal line to the dollar axis. It hits right at 10,000. Boom! That's the answer: A 1.35 carat diamond costs about $10,000.

Create a confidence interval

It's natural to wonder how precise this prediction is. It's useful to know whether the 1.35 carat diamond will be very close to $10,000, or a lot higher or lower. To figure this out, let's draw an envelope around the regression line that includes most of the dots. This envelope is called our confidence interval: We're pretty confident that prices fall within this envelope, because in the past most of them have. We can draw two more horizontal lines from where the 1.35 carat line crosses the top and the bottom of that envelope.

Now we can say something about our confidence interval: We can say confidently that the price of a 1.35 carat diamond is about $10,000 - but it might be as low as $8,000 and it might be as high as $12,000.
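
Here is a rough sketch of drawing that envelope with numbers. It fits the line, measures how far each dot sits from it, and uses two standard deviations of those distances as the half-width of the band. That width is an assumption chosen so the band covers most dots. Statisticians would call a band like this closer to a prediction interval than a formal confidence interval. All diamonds except the first are invented.

```python
from statistics import fmean, pstdev

# Hypothetical (carats, price in dollars) pairs; only the first pair is from the text.
diamonds = [(1.01, 7366), (0.5, 3200), (0.75, 5100),
            (1.2, 8600), (1.5, 11400), (1.8, 13900)]
xs, ys = zip(*diamonds)

# Fit the line by ordinary least squares.
mean_x, mean_y = fmean(xs), fmean(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in diamonds)
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Vertical distance of each dot from the line.
residuals = [y - (slope * x + intercept) for x, y in diamonds]
half_width = 2 * pstdev(residuals)  # assumption: a band of two standard deviations

estimate = slope * 1.35 + intercept
low, high = estimate - half_width, estimate + half_width
print(f"about ${estimate:,.0f}, likely between ${low:,.0f} and ${high:,.0f}")
```

Because the data is invented, the printed range will not match the $8,000 to $12,000 read off the hand-drawn chart.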

We're done, with no math or computers

We did what data scientists get paid to do, and we did it just by drawing:

  • We asked a question that we could answer with data
  • We built a model using linear regression
  • We made a prediction, complete with a confidence interval

And we didn't use math or computers to do it.

Now if we'd had more information, like...

  • the cut of the diamond
  • color variations (how close the diamond is to being white)
  • the number of inclusions in the diamond

...then we would have had more columns. In that case, math becomes helpful. If you have more than two columns, it's hard to draw dots on paper. The math lets you fit that line or that plane to your data very nicely.

Also, if instead of just a handful of diamonds we had two thousand or two million, a computer could do that work much faster.

Find examples in the Azure AI Gallery

Microsoft has a cloud-based service called Azure Machine Learning Studio that you're welcome to try for free. It provides you with a workspace where you can experiment with different machine learning algorithms, and, when you've got your solution worked out, you can launch it as a web service.

Part of this service is something called the Azure AI Gallery. It contains resources, including a collection of Azure Machine Learning experiments, or models, that people have built and contributed for others to use. These experiments are a great way to leverage the thought and hard work of others to get you started on your own solutions. Everyone is welcome to browse through it.

If you click Experiments at the top, you'll see a number of the most recent and popular experiments in the gallery. You can search through the rest of the experiments by clicking Browse All at the top of the screen, and there you can enter search terms and choose search filters.

Find and use a clustering algorithm example

So, for instance, let's say you want to see an example of how clustering works, so you search for "clustering sweep" experiments.

Here's an interesting one that someone contributed to the gallery.

Click on that experiment and you get a web page that describes the work that this contributor did, along with some of their results.


Notice the link that says Open in Studio.

I can click on that and it takes me right to Azure Machine Learning Studio. It creates a copy of the experiment and puts it in my own workspace. This includes the contributor's dataset, all the processing that they did, all of the algorithms that they used, and how they saved out the results.


And now I have a starting point. I can swap out their data for my own and do my own tweaking of the model. This gives me a running start, and it lets me build on the work of people who really know what they’re doing.

Find experiments that demonstrate machine learning techniques

There are other experiments in the Azure AI Gallery that were contributed specifically to provide how-to examples for people new to data science. For instance, there's an experiment in the gallery that demonstrates how to handle missing values (Methods for handling missing values). It walks you through 15 different ways of substituting empty values, and talks about the benefits of each method and when to use it.

Azure AI Gallery is a place to find working experiments that you can use as a starting point for your own solutions.
