Machine Learning From Scratch [Part 2]
This is part two of Machine Learning from Scratch. You're about to follow a short, straightforward tutorial on plotting a bar chart with Python and Pyplot, using a concept from statistics called the decile.
This article, along with many others, is also available on my personal blog:
https://brunocamps.com/2019/11/machine-learning-from-scratch-part-2/
In this lesson, you'll learn how to:
- Work with the collections module and its Counter class
- Work with bucketed lists and deciles
- Plot bar charts at an advanced level with histograms
- Generate a line chart (X and Y axis) from the lists
- Generate a bar chart
We'll keep studying data visualization with Pyplot. Visualizing data is a big part of the work of a data scientist or machine learning engineer. The data by itself is not that valuable - we must be smart enough to analyze it and display it in an understandable way.
As we saw in Part 1, Pyplot makes it easy and fast to plot your data, but it certainly has its limitations.
Now, let's jump straight into our next task.
Let's declare a list of grades that will be our data this time, and import the Counter class from the collections module.
from collections import Counter
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
We also need to import Pyplot. Assuming you're using the Jupyter notebook from the previous lesson, you just need to run the cell where you imported the module. The plotting code below expects it to be available as plt (typically imported with: from matplotlib import pyplot as plt).
Now, let's build our histogram using Counter. We'll bucket all grades by decile and put 100 in with the 90s. We'll also print the histogram variable to check its contents.
A decile is a concept from descriptive statistics: it "is any of the nine values that divide the sorted data into ten equal parts so that each part represents 1/10 of the sample or population".
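To make the quoted definition concrete, here is a minimal sketch that computes the nine cut points for our grades. It assumes Python 3.8 or later, where the standard library's statistics.quantiles function is available; that function is not part of the original walkthrough, and the name decile_cuts is only illustrative. Note that these cut points depend on the data, while the buckets we build in this article are fixed ten-point score ranges.

```python
from statistics import quantiles

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

# n=10 asks for the nine cut points that split the sorted data into ten groups.
# (The default method is "exclusive"; other methods give slightly different cuts.)
decile_cuts = quantiles(grades, n=10)

print(len(decile_cuts))  # nine cut points
print(decile_cuts)
```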
To count the grades in each bucket, we'll use Counter, which is a dict subclass for counting hashable items. It stores the elements as dictionary keys and their counts as dictionary values.
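If Counter is new to you, this small sketch shows the idea on a toy list before we apply it to the grades. The letters list is made up purely for illustration.

```python
from collections import Counter

# Count how many times each element appears in an iterable.
letters = Counter(["a", "b", "a", "c", "a", "b"])

print(letters)       # Counter({'a': 3, 'b': 2, 'c': 1})
print(letters["a"])  # 3
print(letters["z"])  # 0: a missing key counts as zero instead of raising KeyError
```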
# Bucket grades by decile, but put 100 in with the 90s
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)
print(histogram)
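For each grade, grade // 10 * 10 rounds down to the nearest multiple of 10: // is floor division, so it keeps only the integer part of the division (83 // 10 is 8, and 8 * 10 is 80). Wrapping that in min(..., 90) caps the result at 90, which is how 100 ends up in the 90s bucket.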
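The same expression can be pulled apart into a small helper to see each step. The function name bucket is hypothetical and only used here for illustration; the one-liner in the article does exactly the same thing inline.

```python
def bucket(grade):
    """Round a grade down to its ten-point bucket, capping the result at 90."""
    lower_bound = grade // 10 * 10  # floor division drops the last digit: 83 -> 80
    return min(lower_bound, 90)     # 100 would otherwise get its own bucket

for grade in (83, 67, 0, 100):
    print(grade, "->", bucket(grade))
```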
Running the cell prints our histogram:
Counter({80: 4, 90: 3, 70: 3, 0: 2, 60: 1})
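Each key is the lower bound of a ten-point bucket, and each value is the number of grades that fall in it. That is what our grades look like once bucketed by decile.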
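Counter prints its entries from most to least common, which is not the order we'll want on the x-axis. As a quick sanity check, this sketch reads the same counts in bucket order and confirms that no grade was lost along the way.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

# Sort by key to read the buckets from lowest to highest.
for lower_bound, count in sorted(histogram.items()):
    print(f"{lower_bound:>3}: {count}")

# Every grade lands in exactly one bucket, so the counts add up to the class size.
assert sum(histogram.values()) == len(grades)

print(histogram.most_common(1))  # [(80, 4)]
```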
Now, let's plot our histogram and see what it looks like.
plt.bar([x + 5 for x in histogram.keys()],  # Shift bars right by 5
        histogram.values(),                 # Give each bar its correct height
        10,                                 # Give each bar a width of 10
        edgecolor=(0, 0, 0))
# x-axis from -5 to 105
# y-axis from 0 to 5
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])  # x-axis labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()
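This is what our distribution of grades looks like: bars of height 2, 1, 3, 4, and 3 over the 0-10, 60-70, 70-80, 80-90, and 90-100 ranges, with an empty gap from 10 to 60.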
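If you want to double-check the bar heights without a plotting backend, a rough text rendering of the same counts works too. This is only a sketch: the helper name text_bars is hypothetical, and it assumes the same ten-point buckets from 0 to 90 that we used above.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

def text_bars(counts, width=10):
    """Return one line per bucket from 0 to 90, with one '#' per student."""
    lines = []
    for lower_bound in range(0, 100, width):
        label = f"{lower_bound:>3}-{lower_bound + width:<3}"
        # A Counter returns 0 for buckets that have no grades in them.
        lines.append(f"{label} | {'#' * counts[lower_bound]}")
    return lines

print("\n".join(text_bars(histogram)))
```

Empty buckets show up as lines with no marks, which matches the gap between 10 and 60 in the chart.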
Statistics play a significant role in machine learning. Sometimes, pure statistics will satisfy your project's objective. There is a long-running debate about whether statistical tools count as machine learning or not, but that is only a debate about labels.
We should focus on concrete goals for our machine learning projects, no matter what you call the field (AI, Data Science, Statistics…). Whether you're running a basic linear regression or a hardcore deep learning framework, you must deliver practical results.
In this article, you got more practice handling data in Python and presenting it visually with Pyplot. In the next article (Part 3), we'll jump into NumPy, which is widely used for numerical computing.
If you missed Part 1, here it is:
https://www.dhirubhai.net/pulse/machine-learning-from-scratch-part-1-bruno-campos/