Data Analytics Literacy, a Lecture for Brazilian workers at a Truck manufacturer company
Photo credit: https://unsplash.com/@adeolueletu

I was invited to talk about Data Analytics Literacy to a group of Logistics workers at Scania Brazil, part of the major Swedish manufacturer of heavy commercial vehicles. This lecture series aims to explain new technologies and motivate workers to adopt them. Analytics is not new to this audience: they are already literate in creating reports and dashboards with Power BI, so the lecture was an excellent opportunity to show them the next steps.

Data Literacy is a comprehensive concept that includes reading, understanding, building, and communicating data as information. It is definitely not only Analytics, because it involves broader abilities and competencies for working with data. Still, Analytics can improve decision-making and be part of a literacy program: it is used to answer business questions progressively and is key for anyone pursuing data literacy.

There are four stages of business questions that can be enabled by Analytics: descriptive, diagnostic, predictive, and prescriptive.

The first two are entirely based on past data to explain what happened and why it happened. The last two are focused on the future. They need historical (past) data to identify a pattern, predict what probably will happen, and prescribe how to act to avoid undesirable situations or recommend actions to reinforce desirable results.

With that in mind, I built a simple Power BI report and a Python notebook to demonstrate how easy it is to put this concept into practice. My goal was to prove that it is not rocket science and does not require fancy Artificial Intelligence knowledge. They (and you) can start this literacy right now with the data you are currently using in any report or dashboard.

My first step was to find a dataset that would make sense for Logistics workers, and I found one on Kaggle. With minimal effort, I built a Power BI page to explain what happened, the first stage of Analytics literacy.

[Image: Power BI page showing delivery status, the descriptive stage]

Based on the analysis above, I found that the number of late deliveries is quite high and would require further investigation. To understand why it happened, I built a new visualization.

[Image: Power BI diagnostic view of late deliveries by Destination, Origin, Supplier, and Customer]

Based on this analysis, we can get some insights regarding the Destinations, Origins, Suppliers, and Customers with the highest number of late deliveries. I could decide to avoid some destinations, work with customers to understand whether the problem is on their side, or even change the origins to see if the problem lies in my Distribution Center locations.

This is what can be done when you are looking at past data. The idea is that we can go a step further and use a statistical algorithm to help us predict when a delivery will be late, or what will happen. At this moment, it is crucial to have a business question in mind. Mine is: which trips have a higher probability of experiencing delays, and how can I prevent them?

After I imported the spreadsheet and made some fundamental transformations, I had a dataset ready for building my predictive model.
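A minimal sketch of that preparation step in pandas, assuming hypothetical column names (the real dataset comes from Kaggle, so Origin, ScheduledDays, and ActualDays here are illustrative stand-ins):

```python
import pandas as pd

# Small hypothetical sample standing in for the Kaggle logistics dataset
df = pd.DataFrame({
    "Origin": ["SP", "RJ", "MG", "SP"],
    "Destination": ["RJ", "MG", "SP", "MG"],
    "ScheduledDays": [3, 2, 4, 3],
    "ActualDays": [4, 2, 6, 3],
})

# Fundamental transformation: derive the binary target the models will predict
df["Late"] = (df["ActualDays"] > df["ScheduledDays"]).astype(int)
```

The key point is that the target column (late or not) is derived from data you already have in your reports.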

[Image: the prepared dataset in the Python notebook]


Then come some fundamental statistical concepts: keep part of the dataset to train the model and another part to test its accuracy, and understand that, depending on your business problem, you will try a group of candidate algorithms. With that, I built five predictive models to see which one could tell me whether a specific trip would be late. For this particular goal, I used Classification algorithms, because the answer is binary: yes (1) or no (0).
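That workflow can be sketched with scikit-learn. The dataset here is synthetic (`make_classification` stands in for the logistics data), and XGBoost is omitted from the sketch because it is a separate library, but it follows the same fit/predict pattern:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the prepared logistics dataset (features X, late flag y)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out part of the data to measure accuracy on trips the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

Each candidate is trained and scored the same way, which is what makes comparing several algorithms so cheap.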

[Image: notebook cells training and testing the five classification models]


I used Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and XGBoost. As you can see above, it is easy to import the algorithm and fit the model with the training data (Xtrain and ytrain). Then I tested the model by predicting the test data (Xtest) and comparing it to the actual results (ytest). In the example above, XGBoost made correct predictions in 88% of the cases. Because I will use this model to predict new results, I re-trained it using all available data (train and test), performed the prediction on all data (X), and compared it to the actual results (y). Naturally, since the model has already seen this data, the result is better: now I got almost 93% of correct predictions. I added one column for each algorithm and saved them into a new spreadsheet to analyze and compare the algorithms' performance.
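A sketch of that final step, again on synthetic data: re-train on everything, then store one prediction column per algorithm for side-by-side comparison (only the Decision Tree is shown; the others would add columns the same way):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Re-train on all available data (train and test) before scoring new trips
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# One prediction column per algorithm makes the comparison easy to eyeball
results = pd.DataFrame({"actual": y, "DecisionTree": model.predict(X)})
accuracy = (results["actual"] == results["DecisionTree"]).mean()

# results.to_csv("predictions.csv", index=False)  # export for the comparison
```

Note that scoring a model on data it was trained on overstates its accuracy, which is exactly why the held-out test score came first.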

[Image: accuracy comparison of the five algorithms]

As you can see, all algorithms performed very well. From now on, I can select one of them and start predicting what will happen.

The last stage of Analytics will use the late trip predictions to recommend alternatives to avoid them. Recommender systems can be very complex, and my goal is not to explain how they work extensively, but they are easy to understand when compared to the recommendations we receive all the time and probably do not even notice. If you are a Netflix subscriber, you see recommended films and series. If you use any e-commerce platform, you receive recommendations. That is what we will do now, naturally limiting the scope of the recommendation.

I needed an algorithm to rank the possibility of getting different outcomes when the prediction indicates that the trip would be late. I selected the Nearest Neighbors algorithm. It is simple, straightforward, and does not require much computational power to produce good alternatives. Perfect for my goal of showing you how easy it is.

First, I built some fake data based on random numbers. It is far from ideal for testing, but I do not have any new data to use or compare. Remember that by using random numbers, I can get combinations of Origins, Destinations, Suppliers, and Customers that would never happen in reality. But even in this not-so-good environment, I hope the algorithm will give me good insights.
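Generating that kind of fake data takes a few lines of NumPy; the category values below (state codes, supplier and customer IDs) are illustrative assumptions, not the article's actual data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20

# Random combinations of the categorical features;
# some origin/destination pairs may never occur in the real operation
fake = pd.DataFrame({
    "Origin": rng.choice(["SP", "RJ", "MG", "PR"], size=n),
    "Destination": rng.choice(["SP", "RJ", "MG", "PR"], size=n),
    "Supplier": rng.integers(1, 6, size=n),
    "Customer": rng.integers(1, 11, size=n),
})
```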

[Image: randomly generated test data with the model predictions]


As you can see above, the Decision Tree prediction flagged some rows as late deliveries, represented by 1 in the Predict column. Now I will prepare a matrix with the available data. I am far from saying this is the best approach, but remember, my goal is only to show you it is not rocket science; basic statistical concepts will do the trick. I built a matrix of Origins and Destinations (I assume the root cause of late deliveries can be explained by those columns) and set the distance as the recommendation measure. Returning to the Netflix example, this measure would be the number of likes each user gives each film (movie and user for Netflix play the role of origin and destination in this model).
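Building such a matrix is one pandas call. A sketch, assuming a hypothetical trip table with a binary Late column (here each cell holds the late rate for an origin/destination lane, a simplification of the article's measure):

```python
import pandas as pd

trips = pd.DataFrame({
    "Origin":      ["SP", "SP", "RJ", "MG", "MG"],
    "Destination": ["RJ", "MG", "MG", "SP", "RJ"],
    "Late":        [1, 0, 1, 0, 1],
})

# Origin x Destination matrix; each cell is the late rate for that lane,
# with 0 for lane combinations that never occurred
matrix = trips.pivot_table(
    index="Origin", columns="Destination", values="Late", aggfunc="mean"
).fillna(0)
```

Each row of this matrix becomes one origin's "profile", which is what the Nearest Neighbors step compares.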

[Image: the Origin x Destination matrix]


After building the matrix, I applied the Nearest Neighbors algorithm and asked the model to return the nearest neighbor for the predictions flagged as late.
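A minimal sketch of that query with scikit-learn's NearestNeighbors, using a tiny hand-made matrix (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows = origins, columns = destinations; values = historical late rate per lane
matrix = np.array([
    [0.0, 0.8, 0.1],   # origin A
    [0.1, 0.7, 0.2],   # origin B
    [0.9, 0.1, 0.0],   # origin C
])

nn = NearestNeighbors(n_neighbors=2).fit(matrix)

# For a trip flagged late from origin A, find the origin with the most
# similar lane profile as a candidate replacement
distances, indices = nn.kneighbors(matrix[0].reshape(1, -1))
nearest_other = int(indices[0][1])  # index 0 is origin A itself
```

Here origin B is returned as the closest alternative to origin A, since their late-rate profiles differ the least.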

[Image: Nearest Neighbors recommendations for the trips flagged as late]


As you can see below, it was not perfect, but it changed 50% of the predictions! If this held in practice (and I am not claiming it does, since we are working with a public dataset and random data), who would not change the trip origin to avoid 50% of the late trips?

[Image: predictions before and after applying the recommended origins]


This last analysis shows that it is easy to get a recommendation to avoid undesirable outcomes, and I hope it shows that you can start using Analytics to its full potential. Do you need some help with statistics, Python, or self-service BI? That is precisely the goal of a Data Literacy program. Understanding people's needs, not just the technology, will make us comfortable using our data to unlock possibilities, improve business results, and turn the organization's Data Culture into a data-driven one.

You can find my Python notebook and Power BI here.
