AN APPROACH TO ANALYZING DATA
R Ravi Shankar
Director Data Science @ Honeywell | M Tech (Data Science), MBA, PMP, Six Sigma Green Belt
For someone who loves analyzing data, there are opportunities in any data that is around. Often, though, working out how to approach the analysis can make the head spin, and it can take several iterations before the analysis delivers any value. Even when the customer has defined the problem statement, there is still merit in exploring what other insights the data might offer. This led me to think of a generic approach that I could employ when facing a set of data. To make this article meaningful, I have taken the following data set as an example.
1. Start by defining the customers – A look at the data can tell us who the prospective customers might be. In the case of the airlines data, the customers could be the airlines, the OEMs of the planes, the airport authorities and the passengers.
2. List out the potential pain points of each customer and/or their areas of interest – For example:
a. Airlines – Delays by airport, time of day, month of year, day of week, causes (weather, carrier, security etc.)
b. OEMs of planes – Delay by plane type
c. Airport authorities – Delays due to security across months and days of week
d. Passenger – Delay prediction when booking a ticket
3. Examine the data thoroughly – Going one column at a time, understand:
a. The meaning of each column heading
b. The type of data in each
c. The source of data (which system does it come from, accuracy of data etc.)
d. How many rows of data are missing and what to do about them?
For example, where does the delay due to weather come from? Is it entered manually, and by whom? How much credence should it be given? What does a delay due to the National Aviation System mean? This is a very important step for data scientists as it also helps build an understanding of the domain. I believe a good data scientist is one who has sufficient breadth of knowledge to get a perspective, and who is smart enough to gain the necessary depth of domain knowledge in a short time.
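The column-by-column examination in step 3 can be partly automated. The following is a minimal sketch using only the Python standard library. The sample data, the column names (`airport`, `month`, `weather_delay`, `security_delay`) and the list of placeholders treated as missing are all assumptions made for illustration; they are not taken from the actual data set. The questions about meaning and source of each column still need a person to answer them.

```python
import csv
import io

# Hypothetical sample standing in for the airline data. The column names
# and values are invented for illustration only.
SAMPLE_CSV = """airport,month,weather_delay,security_delay
ATL,1,120,5
ORD,1,,3
JFK,2,45,
ATL,2,NA,0
"""

# Assumed placeholders for "no value"; adjust to the conventions of the source system.
MISSING_TOKENS = {"", "na", "n/a", "null", "none"}


def is_missing(value):
    """Treat None, empty strings and common placeholders as missing."""
    return value is None or value.strip().lower() in MISSING_TOKENS


def infer_type(values):
    """Return 'int', 'float' or 'text' for a list of non-missing strings."""
    if not values:
        return "unknown"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "text"


def profile_columns(rows):
    """Summarize each column: inferred type, missing count, distinct count."""
    if not rows:
        return {}
    profile = {}
    for column in rows[0].keys():
        raw = [row[column] for row in rows]
        present = [value.strip() for value in raw if not is_missing(value)]
        profile[column] = {
            "type": infer_type(present),
            "missing": len(raw) - len(present),
            "distinct": len(set(present)),
        }
    return profile


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column, summary in profile_columns(rows).items():
    print(column, summary)
```

A summary like this only flags where to look. Whether a missing weather delay means "no delay" or "not recorded" is a question for whoever owns the source system.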
4. Think about what types of graphs need to be plotted to visualize each customer persona's pain points/areas of interest (descriptive analysis). For the airlines, bar graphs of delays categorized by airport and cause, and trend lines of delays by time of day, day of the month and month of the year, might be enough to begin with. For the passenger, we need to be able to show a predicted delay based on the ticket being booked. Working backwards, we would need a prediction model whose parameters are fitted on training data and whose accuracy is then checked on a separate test set.
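As a rough illustration of step 4, the sketch below computes the average delay per airport (the numbers a bar graph would show) and then reuses those averages as a very simple baseline predictor, fitted on a training split and checked on a held-out split. The records are randomly generated and the airports, typical delays, split fraction and choice of mean absolute error are all assumptions for illustration. A real passenger-facing model would use far more features than the airport alone.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical records of (airport, delay in minutes), generated at random.
# The airports and typical delays are invented for illustration only.
random.seed(0)
TYPICAL_DELAY = {"ATL": 20.0, "ORD": 35.0, "JFK": 28.0}
records = [
    (airport, max(0.0, random.gauss(typical, 5.0)))
    for airport, typical in TYPICAL_DELAY.items()
    for _ in range(50)
]


def mean_delay_by_group(rows):
    """Average delay per airport: the values a bar graph would display."""
    groups = defaultdict(list)
    for airport, delay in rows:
        groups[airport].append(delay)
    return {airport: mean(delays) for airport, delays in groups.items()}


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(model, fallback, rows):
    """Average absolute gap between predicted and observed delay."""
    errors = [abs(model.get(airport, fallback) - delay) for airport, delay in rows]
    return mean(errors)


train, test = train_test_split(records)
model = mean_delay_by_group(train)            # fitted on training rows only
overall = mean(delay for _, delay in train)   # fallback for unseen airports
print("mean delay by airport:", model)
print("held-out MAE (minutes):", mean_absolute_error(model, overall, test))
```

Keeping the test rows out of the fitting step is what makes the reported error an honest estimate of how the prediction would behave on a ticket that has not been booked yet.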
5. The data will point to areas of interest; the next step is to deep dive into possible causes and remedial actions. For example, if the delays due to security peak in certain months, and those months also see a larger number of planes flying, it would be good to get additional data on the number of passengers. If the delays due to security are observed to correlate with the number of passengers, perhaps augmenting security staff at times of peak forecasted demand could help mitigate the issue.
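The correlation check described in step 5 can be sketched with a hand-written Pearson coefficient. The monthly figures below are invented for illustration and do not come from the airline data; the same function could be applied to passenger counts once that additional data is available.

```python
from math import sqrt
from statistics import mean

# Hypothetical monthly totals (January to December), invented for illustration.
security_delay_minutes = [310, 290, 350, 420, 480, 560, 640, 610, 450, 380, 400, 520]
flights = [4100, 3900, 4400, 4700, 5000, 5600, 6100, 6000, 4900, 4500, 4600, 5400]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


r = pearson(security_delay_minutes, flights)
print(f"correlation between security delays and flights: {r:.2f}")
```

A coefficient close to 1 would support looking further into volume as a driver, but it does not by itself show that more flights or passengers cause the delays, which is why the staffing idea above is framed as something to test rather than a conclusion.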
A logical approach as above will make one feel less overwhelmed when faced with a large data set and not knowing exactly what to do with it. Do you have any other ways that have worked for you? Feel free to comment.