A simple approach to solving a data science problem
In this blog post, I examine a simple approach and methodology for developing analytics solutions. In my early days of analyzing data, I used many spreadsheets but did not follow a good methodology for approaching problems. There’s only so much that you can sort, filter, pivot, and script when working with a single data set in a spreadsheet. You can spend extensive time diving into the data, slicing and dicing, pivoting one way or the other, only to find that the best you can do is show the biggest and smallest data points. You don’t gain any real insights. The sheets full of data that you end up with are far more interesting to you than they are to the managers you share them with.
Analytics solutions look at data to uncover stories about what is happening now and what may happen next. To be effective in a data science role, you must improve your storytelling game. The same results can be presented in many different ways, and your success depends on making the audience see what YOU are seeing.
People have biases that influence how they receive your results. You need to find a way to make your results relevant to each of them, or at least to the key stakeholders. You have two primary tasks. The first is to make your findings interesting to non-technical people. You can do this with statistics, top-n reporting, a good storyline, and visualization. I call this the “BI/BA of analytics,” or simple descriptive analytics. Business intelligence (BI) and business analytics (BA) dashboards are a useful form of data presentation, but they characteristically rely on the viewer to find insights.
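Top-n reporting, for example, can be as simple as counting and ranking. Here is a minimal sketch in pandas; the trouble-ticket data and its “category” column are hypothetical stand-ins for whatever you are summarizing:

```python
# A minimal top-n rollup with pandas; the ticket data is made up.
import pandas as pd

tickets = pd.DataFrame({
    "category": ["wifi", "vpn", "wifi", "dns", "wifi", "vpn", "routing"],
})

# The most frequent ticket categories: a classic descriptive rollup
# that non-technical audiences grasp immediately.
top_n = tickets["category"].value_counts().head(5)
print(top_n)
```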
This descriptive approach has value, but it is usually limited to attractive visualizations that I call “Sesame Street Analytics.”
The PBS show Sesame Street used to have a segment that taught children to recognize differences between images, with the musical tagline “One of these things is not like the others.” Visualizations that identify anomalies in contrasting colors help the audience immediately see how “one of these things is not like the others.” If you can show this properly, you do not need a story; people simply look at your visualization and understand what you are trying to convey.
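As a sketch of the idea, the snippet below plots synthetic data and colors anything more than three standard deviations from the mean in a contrasting color; the data and threshold are illustrative assumptions, not a prescription:

```python
# "One of these things is not like the others": highlight outliers
# in a contrasting color. Synthetic data; swap in your own series.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=7)
values = rng.normal(loc=100, scale=10, size=200)
values[42] = 180  # plant an obvious anomaly

mean, std = values.mean(), values.std()
is_outlier = np.abs(values - mean) > 3 * std

plt.scatter(np.arange(len(values)), values,
            c=np.where(is_outlier, "red", "steelblue"))
plt.title("One of these things is not like the others")
plt.xlabel("Observation")
plt.ylabel("Value")
plt.show()
```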
Your second task is to make the data interesting to your peers: the technical people and your new data science friends. You can do this with models and analytics, and your visualization and storytelling must operate at a whole new level. If you only present “Sesame Street Analytics” to a technical audience, you can expect to hear “That’s just visualization; I want to know why it is an outlier.” You need real algorithms and analytics to impress this audience.
This blog post sets you on the path toward impressing both audiences.
Analytics methodology and approach
One of the key factors determining how successful your solution will be is how you approach the analytics problem. You can use two broad approaches, or methodologies, to get to insightful solutions, and your background will predispose you toward one of them. The ultimate goal is to convert data into value for your company; you get to that value by finding insights that solve technical or business problems. The two broad approaches, shown in Figure 1, are the “explore the data” approach and the “solve the business problem” approach.
Figure 1: The two approaches to developing analytics solutions
These are the two main approaches that I use, and you will find literature about many granular, systematic methodologies that support some variation of each. Most of the analytics literature guides you toward the problem-centric approach. If you know your data well but are not sure how to use it to solve problems, you may find yourself starting in the statistically centered exploratory data analysis (EDA) space most closely associated with the statistician John Tukey. This approach frequently yields quick wins along the way, as the data rollups and visualizations used to explore the data often surface statistical value on their own.
Most domain data experts favor starting with EDA because it helps you understand the data and deliver the quick wins that keep stakeholders satisfied while you get into the more time-consuming parts of the analysis. Your stakeholders often have hypotheses (and some biases) about the data. Early findings from this side often sound like “Issue X is highly correlated with condition Y in the environment; hence, you should address condition Y to reduce the number of times you see issue X.”
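A quick-win finding like that often starts as a one-line correlation check. The sketch below assumes hypothetical daily counts of issue X and a measurement of condition Y:

```python
# A minimal EDA sketch: describe the data, then check whether
# issue X moves with condition Y. All numbers are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "issue_x_count": [3, 7, 2, 9, 4, 8, 1, 6],
    "condition_y":   [20, 45, 15, 60, 28, 52, 10, 41],
})

print(df.describe())
print(df["issue_x_count"].corr(df["condition_y"]))  # near 1.0 here
```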
Figure 2 shows these processes side by side. There is no right or wrong side to start on; depending on your analysis goals, either direction is viable. Note that this model comprises data acquisition, data transport, data storage, sharing or streaming, and secure access to that data, all of which you must consider if the model is to be implemented on a production data flow, or “operationalized.” The previous, simpler model showing just the data and data science combination (refer to Figure 1) still applies when you are exploring a static data set, or a stream that you can play back and analyze using offline tools.
Figure 2: Exploratory Data Versus Problem Approach Comparison
Common Approach Walkthrough
Though it is a common belief that analytics is done only by math PhDs and statisticians, general analysts and industry subject matter experts (SMEs) now routinely use software to explore, predict, and preempt business and technical problems in their respective areas of expertise. You and other “citizen data scientists” have a variety of software packages at your disposal today to enable you to find interesting insights and build useful models.
You can start from either side once you understand that both approaches are valid. The important thing to remember is that many of the people you work with may be starting at the other end of the spectrum. Be aware of this as you start sharing your insights with a wider audience, so that when either audience asks, “What problem does this solve for us?” you can present the relevant findings in the relevant manner.
Let’s begin on the data side
While building models, you skip over the transport, store, and secure phases: you take a batch of useful data, based on your assumptions, and test some hypothesis about it. For example, consider analyzing data from different networks. Through some grouping and clustering of your trouble ticket data, you may see many issues on network routers running a specific version of software.
In this case, you can build an analysis that supports your theory that the problems are in fact related to the version of software running on the suspect routers. Even with the data-first approach, you need to decide which problems you want to solve, using the data to guide you toward what is possible based on your knowledge of the environment.
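The grouping itself can start small. This sketch assumes a hypothetical ticket export with “router_model” and “sw_version” columns; a spike in one group is the signal that a specific software version may be the culprit:

```python
# Count trouble tickets per (model, software version) pair.
# The ticket data and column names are hypothetical.
import pandas as pd

tickets = pd.DataFrame({
    "router_model": ["R1", "R1", "R2", "R1", "R2", "R1"],
    "sw_version":   ["12.4", "12.4", "15.1", "12.4", "15.1", "12.4"],
})

counts = (tickets.groupby(["router_model", "sw_version"])
                 .size()
                 .sort_values(ascending=False))
print(counts)  # (R1, 12.4) dominates in this toy data
```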
What do you need in the example of the suspect routers? Clearly, you must obtain data about the routers that showed the issue, as well as data about the same types of routers that have not exhibited it. You need both to find the underlying factors that may or may not have contributed to the issue you are researching. Discovering these factors is a form of inference: you want to infer something about all of your routers by comparing a set of devices that exhibit the issue with a set that does not. You will later use the same analytics model for prediction.
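One simple way to test whether such a factor matters is a chi-square test on a contingency table of affected versus unaffected devices. The counts below are invented for illustration:

```python
# Does the issue rate differ by software version? A chi-square
# test on hypothetical counts of routers with/without the issue.
from scipy.stats import chi2_contingency

#                  with_issue  without_issue
# version 12.4:        40            60
# version 15.1:         5            95
table = [[40, 60],
         [5, 95]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
# A small p-value suggests the issue is not independent of the
# software version, i.e., a factor worth investigating further.
```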
You can normally skip the “production data” acquisition and transport parts of the model-building phase. Although in this case you already have a data set to work with, consider how you would automate the acquisition of data, how you would transport it, and where it would live if you plan to put your model into a fully automated production state that can notify you of devices in the network that meet these criteria. On the other hand, a full production state is not always essential. Sometimes you can simply take a batch of data and run it against something on your own machine to get insights; this is valid and common. If you can accumulate enough data about a problem to solve it, you can obtain the insight without ever employing a full production system.
In a diametrically opposite manner, a common analyst approach is to begin with a known problem and determine what data is necessary to solve it. You often have to seek out things you don’t yet know to look for. Think about this example: maybe you have customers with service-level agreements (SLAs), and you realize that you are giving them discounts because they are having voice issues over the network and you are not meeting the SLAs. This is costing your company money. You delve into what you need to analyze in order to understand why this happens, perhaps using voice drop and latency data from your environment. When you finally get the data, you build a model which identifies that higher latency, combined with particular versions of software on network routers, is common on devices in the network path of customers who are asking for refunds.
You then deploy the model to flag these “SLA suckers” in your production systems and confirm that it is effective when the SLA issues go away. In this case, deploy means that your model examines your daily inventory data, searching for devices that match the parameters you have seen to be problematic. What may have been a very complex model has a simple deployment.
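A minimal sketch of that kind of deployment, with a hypothetical inventory feed and thresholds learned from the earlier analysis, might look like this:

```python
# Flag devices whose software version and latency match the
# profile the analysis found problematic. All data is made up.
import pandas as pd

RISKY_VERSIONS = {"12.4"}      # learned from the analysis
LATENCY_THRESHOLD_MS = 150     # learned from the analysis

inventory = pd.DataFrame({     # stand-in for the daily feed
    "device_id":      ["rtr-01", "rtr-02", "rtr-03"],
    "sw_version":     ["12.4", "15.1", "12.4"],
    "avg_latency_ms": [180, 95, 120],
})

flagged = inventory[
    inventory["sw_version"].isin(RISKY_VERSIONS)
    & (inventory["avg_latency_ms"] > LATENCY_THRESHOLD_MS)
]
print(flagged)  # only rtr-01 matches both criteria here
```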
Whether you start at the data or at a business problem, solving the problem is ultimately what represents value to your company and to you as an analyst. Both approaches follow many of the same steps on the analytics journey, but they frequently use different terminology. They are both about turning data into value, irrespective of starting point, direction, or approach. Figure 3 provides a more detailed perspective, illustrating that the two approaches can work in the same environment, on the same data, and on the exact same problem statement. In either case, all of the work and due diligence must be done to arrive at a fully operational (with models built, tested, and deployed), end-to-end use case that provides real, continuous value.
Figure 3: Detailed Comparison of Data Versus Problem Approaches
The industry today offers a wide range of detailed approaches and frameworks, such as CRISP-DM (Cross-Industry Standard Process for Data Mining) and SEMMA (Sample, Explore, Modify, Model, and Assess), and they all usually follow these same principles. Select something that matches your style and roll with it. Regardless of your approach, the primary goal is to create useful solutions in your problem space by combining the data you have with data science techniques to develop use cases that bring insights to the forefront.
Distinction Between the Use Case and the Solution
Before we go further, let’s simplify a few terms. Basically, a use case is a description of a problem that you solve by combining data and data science and applying analytics. The underlying algorithms and models constitute the actual analytics solution. Taking the case of Amazon as an example, the use case is getting you to spend more money. Amazon does this by showing you what other people have also purchased along with the item that you are purchasing. The thought behind this is that you will buy more things because other people like you needed those things when they bought the same item that you did. The model is there to uncover that and convey to you that you may also need to purchase those other things. Quite helpful, right?
From the exploratory data approach, Amazon might want to utilize the data it has about what people are buying online. It can mine that data for frequently occurring sets of items that are commonly purchased together. Then, for purchases that match a pattern but are missing just a few items, Amazon might assume that those people simply “forgot” to buy something they needed, because everyone else purchased the complete “item set” found in the data. Amazon can then use software to find the people who “forgot” and remind them that they might need the other common items, and validate the model’s effectiveness by tracking purchases of the items it suggested.
From a business problem approach, Amazon might want to increase sales, and it might assume, or find research suggesting, that if people are reminded about the common companion items to the items they are currently viewing or have added to their shopping carts, they often purchase those items. To execute this, Amazon might gather buying-pattern data to find those companion items, then propose them to shoppers, and validate the effectiveness by tracking purchases of the suggested items.
Do you see how both these approaches reach the same final solution?
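Either way, the underlying mechanic is the same: count which items co-occur in past orders and suggest the most common companions. Here is a minimal sketch with invented orders (a toy co-occurrence count, not Amazon’s actual recommender):

```python
# "People who bought X also bought Y" via simple pair counting.
# Orders and item names are hypothetical.
from collections import Counter
from itertools import combinations

orders = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "sd_card", "tripod"},
    {"laptop", "mouse"},
]

pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

def suggest(item, top=3):
    """Return the most common companions of an item."""
    companions = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            companions[b] += n
        elif b == item:
            companions[a] += n
    return companions.most_common(top)

print(suggest("camera"))  # [('sd_card', 3), ('tripod', 2)]
```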
The Amazon case is about increasing sales of items. In predictive analytics, the use case may be about predicting home values or car values; simply put, the use case may be the ability to predict a continuous number from historical numbers. No matter the use case, you can basically view analytics as the application of data and data science to the problem domain. You can choose how you want to approach finding and building the solutions: either by using the data as a guide or by dissecting the stated problem.
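For the continuous-prediction flavor of use case, the model is often as simple as a regression. This toy example, with invented square-footage and price numbers, shows the shape of it:

```python
# Predict a continuous value (price) from historical numbers.
# The training data is invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

sqft = np.array([[850], [1100], [1400], [1800], [2200]])
price = np.array([120_000, 155_000, 195_000, 240_000, 290_000])

model = LinearRegression().fit(sqft, price)
print(model.predict(np.array([[1600]])))  # estimated price, 1600 sqft
```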
Originally appeared on Edvancer.