I did machine learning in C
Introduction
First of all, I want to clarify that all the code I wrote in C to perform machine learning was done entirely for self-teaching purposes. None of it was requested as an academic assignment.
The idea is to craft a machine learning model that performs binary classification with a single neuron, built entirely from scratch in the C programming language. In the future I intend to keep improving this model into a neural network, and beyond that a deep neural network, so it can perform a wider variety of classifications. For now, my model will classify the MNIST dataset, which consists of handwritten digits, and its task is to identify those digits. Since this is binary classification, I'm only training the model on 0s and 1s.
How different is it from coding ML models in Python? Machine learning involves a lot of mathematics, and Python encourages that by performing under-the-hood operations like memory management, leaving the developer free to focus on the logic of their software. With that in mind, we can all agree that coding machine learning in C is a clear disadvantage when it comes to logic and math, because I actually have to take care of the memory and rebuild all those Python under-the-hood operations myself. Even though the meta is to code ML in Python, I wanted to challenge myself and reach some conclusions with this project.
Why wouldn't I use C++, since it allows classes and therefore objects, just as in Python? The reason is mainly personal preference. C and C++ are on par when it comes to performance, but C++'s biggest disadvantage for me is its syntax. I already have to deal with a lot of memory management without also wrestling with C++'s tougher syntax. And even though C has no built-in classes, I don't really need them in this case.
Procedures
So, how do I do it? The main goal is to replicate an already written Python program that only uses NumPy to classify MNIST. The plan is to recreate from scratch all the operations I used from NumPy, figure out a way to load the MNIST dataset in C, and then put all the operations together into a model that can be trained. The source code of that earlier project is here.
The dataset
The first goal is to obtain this dataset for C. I only need a small portion of the dataset, the part that contains 0s and 1s. Since MNIST contains a large number of images (each one a matrix), I need a way to handle all of that data in multi-dimensional arrays, which means allocating and freeing memory for each image and then splitting the images into training and validation sets in order to train the model.
At first I thought of using an already existing solution for importing the dataset into C, but the ones I found were either hard to adapt for binary classification or involved a lot of extra processing just for the dataset. The second solution was to just use an existing binary MNIST and write a parser for my own code. For this I picked Python and stored the dataset with NumPy's savetxt method. This is the only part of the project where I used a language other than C, and I did it because the dataset already exists and I really needed a simple format adapted to my own needs. This is the definition of my binary MNIST loader in C (the declaration of the matrix struct comes further down in this article):
matrix matLoadTxt(char *filename)
{
    matrix parsed;
    FILE *f = NULL;
    size_t sizeBuffer = 4096;
    char *buffer = malloc(sizeBuffer);
    char *token = NULL;
    int shapeFound = 0; /* Flag to skip the shape parsing after the first line */
    int i, j = 0, k;

    f = fopen(filename, "r");
    if (!f)
    {
        fprintf(stderr, "Error: failed to open file %s\n", filename);
        exit(1);
    }
    while (getline(&buffer, &sizeBuffer, f) != -1)
    {
        token = strtok(buffer, " #(,)\n");
        if (!token)
            continue; /* skip blank lines */
        if (shapeFound == 0 && !(strcmp(token, "Shape:")))
        {
            /* Read the two dimensions, then allocate the matrix */
            parsed.shape = malloc(sizeof(int) * 2);
            for (i = 0; i < 2; i++)
            {
                token = strtok(NULL, " #(,)\n");
                parsed.shape[i] = atoi(token);
            }
            parsed.mat = malloc(sizeof(double *) * parsed.shape[0]);
            for (i = 0; i < parsed.shape[0]; i++)
            {
                parsed.mat[i] = malloc(sizeof(double) * parsed.shape[1]);
            }
            shapeFound = 1;
        }
        else
        {
            /* Fill row j with the values of the current line */
            for (k = 0; k < parsed.shape[1] && token; k++)
            {
                parsed.mat[j][k] = strtod(token, NULL);
                token = strtok(NULL, " #(,)\n");
            }
            j++;
        }
    }
    free(buffer);
    fclose(f);
    printf("%s loaded correctly\n", filename);
    return parsed;
}
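For reference, the parser above expects text files that look roughly like this (an illustrative sample I made up, assuming the shape was written into the np.savetxt header; the real files are of course much wider):

# Shape: (2, 3)
0.12 0.0 0.98
0.0 0.45 0.33

The first line carries the dimensions, and every following line is one row of space-separated doubles.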
With this, I'm now able to easily manipulate the dataset.
The math
Machine learning has a lot to do with linear algebra and calculus: most of the operations the software computes come from these two subjects, and they are what let the model make predictions and learn from them.
NumPy takes an approach in which you can easily manage arrays: their shape, product, addition, subtraction, and so on are all reachable in a really 'pythonic' way. Since I have to implement a convenient way to work with arrays, and more specifically with matrices, I have to begin by establishing what a matrix will be. For this, I declared a structure called matrix that stores an array of arrays of doubles, plus an array of integers that stores the shape of the matrix. The reasoning: doubles provide better precision, and the shape array determines the dimensions of the matrix. This comes in handy because C arrays of numbers need an extra variable carrying the length of each array, for instance when walking through one of these arrays with a for loop.
This is the declaration of the matrix structure:
typedef struct matrix
{
    double **mat;
    int *shape;
} matrix;
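Since every matrix owns heap memory, a matching destructor is just as important as the constructor. This is a minimal sketch of what delete_matrix can look like; the version in the repo may differ in details:

/* Sketch: free each row, then the row-pointer array, then the shape. */
void delete_matrix(matrix mat)
{
    int i;

    if (mat.mat)
    {
        for (i = 0; i < mat.shape[0]; i++)
            free(mat.mat[i]);
        free(mat.mat);
    }
    free(mat.shape);
}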
From this I can already start to code the methods I used from NumPy to create my model. So I reviewed my older code, looked for each method I used, and recreated it. I'm not going to explain all of them, since the matrix_operations file is the longest in my repository at roughly 500 lines of code. These are all the function prototypes instead:
double random_normal();
void print_matrix(matrix mat);
matrix create_random_normal_matrix(int rows, int cols);
void delete_matrix(matrix mat);
matrix T(matrix mat);
matrix dot(matrix mat1, matrix mat2);
double sum(matrix a);
matrix whereMoreThanPointFive(matrix mat);
matrix matLog(matrix mat);
matrix matAdd(matrix mat, double n);
matrix matSub(matrix mat, double n);
matrix matSubLeft(double n, matrix mat);
matrix matMul(matrix mat, double n);
matrix matDiv(matrix mat, double n);
matrix matDivLeft(double n, matrix mat);
matrix matExp(matrix mat);
matrix matAddOfMatrices(matrix mat1, matrix mat2);
matrix matMulElementWise(matrix mat1, matrix mat2);
matrix matSubElementWise(matrix mat1, matrix mat2);
matrix matIsEqualElementWise(matrix mat1, matrix mat2);
matrix matLoadTxt(char *filename);
You can check out each of the functions in the repo.
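To give a flavor of what these look like, here is a minimal sketch of a matrix product in this style, using the matrix struct above; the actual implementation in the repo may differ in details:

/* Sketch of dot: returns a newly allocated
   (mat1.shape[0] x mat2.shape[1]) matrix product. */
matrix dot(matrix mat1, matrix mat2)
{
    matrix out;
    int i, j, k;

    if (mat1.shape[1] != mat2.shape[0])
    {
        fprintf(stderr, "Error: incompatible shapes for dot product\n");
        exit(1);
    }
    out.shape = malloc(sizeof(int) * 2);
    out.shape[0] = mat1.shape[0];
    out.shape[1] = mat2.shape[1];
    out.mat = malloc(sizeof(double *) * out.shape[0]);
    for (i = 0; i < out.shape[0]; i++)
    {
        out.mat[i] = malloc(sizeof(double) * out.shape[1]);
        for (j = 0; j < out.shape[1]; j++)
        {
            out.mat[i][j] = 0;
            for (k = 0; k < mat1.shape[1]; k++)
                out.mat[i][j] += mat1.mat[i][k] * mat2.mat[k][j];
        }
    }
    return out;
}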
To understand a little better what supervised learning is all about, picture yourself actually learning. Imagine you are learning something entirely new, something you had no clue about: that is the whole idea behind supervised learning, and that's why the training process is so important. A computer does know what 0 and 1 are, but what it doesn't know is what they look like. What goes on internally must be entirely the model's decision; the machine must draw its own path and determine which characteristics of a matrix make half of the examples mean zero and the other half mean one. That is why a freshly created model must start with random values: we developers can't set up any kind of pattern in those numbers, because this is where the machine steps in and adjusts them toward whatever it believes will give better results. Supervised learning consists of telling a machine what it has and what it has to obtain and, when it fails, letting it keep trying to improve its performance. In other words, it is trial and error.
The architecture
Having the dataset and all the required math functions, I could proceed to design the actual neuron. What I need is a structure that stores a matrix for the weights, a matrix for the activated output (the predictions the neuron produces), and a double for the bias:
typedef struct neuron
{
    matrix W;
    double b;
    matrix A;
} neuron;
The weights are the parameters the neuron uses to process the inputs and produce outputs. The bias is another parameter that is combined with the weighted inputs before the result is sent to the activation function. The activated output A stores the neuron's predictions after the activation function is applied. All of these values will be updated as the neuron is trained. At first, the bias will be 0.0 and the activated output will be an empty matrix. The weights, on the other hand, must be initialized with random values so the model can configure itself more freely; I draw them from a random normal distribution to prevent symmetry and values that are too high or too low. I also had to declare how many input features there will be, so I used my random normal matrix generator for the neuron:
neuron create_neuron(int nx)
{
    neuron n;

    if (nx < 1)
    {
        fprintf(stderr, "Error: nx must be a positive integer\n");
        exit(1);
    }
    n.W = create_random_normal_matrix(1, nx);
    n.b = 0;
    n.A.shape = malloc(sizeof(int) * 2);
    n.A.shape[0] = 0;
    n.A.shape[1] = 0;
    n.A.mat = NULL;
    return n;
}
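The prototypes above include random_normal, which the matrix generator calls for every entry. One common way to get normally distributed samples from rand() is the Box-Muller transform; this is a sketch of one possible implementation, not necessarily the repo's (assuming stdlib.h and math.h are included):

/* Box-Muller transform: turns two uniform samples in (0, 1)
   into one sample from a standard normal distribution. */
double random_normal()
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* shifted to avoid log(0) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);

    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979 * u2);
}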
Now with this, I had to create the functions that make this model make predictions and learn, starting with a Sigmoid function:
matrix sigmoid(matrix x)
{
    return matDivLeft(1, matAdd(matExp(matMul(x, -1)), 1));
}
The Sigmoid function is used as the activation function, meaning it is the last stage of the model, the one that produces the predictions. Sigmoid is commonly used for binary classification because its outputs are squeezed into the range (0, 1), so they can be read as probabilities. The Sigmoid function looks like this:
sigmoid(x) = 1 / (1 + e^-x)
With this, I could make the model actually produce predictions (even though at first they will be totally random). For this, I had to create a forward propagation function. What forward propagation does, in essence, is combine all the parameters of the model with the inputs and pass the result through the Sigmoid function: W is (1, nx) and X is (nx, m), so the output is one prediction per example. Just imagine it as the model processing the inputs and trying to understand them.
matrix forward_prop(neuron *n, matrix X)
{
    matrix x = matAdd(dot(n->W, X), n->b); /* Z = W . X + b */

    delete_matrix(n->A); /* drop the previous activation */
    n->A = sigmoid(x);   /* A = sigmoid(Z) */
    delete_matrix(x);    /* free the intermediate Z */
    return n->A;
}
As I said, at first the model will do everything randomly, because I haven't given it a way to learn yet. First off, I have to measure the cost (or loss). Cost is basically the difference between the predicted output and the desired output; with this value, the model will be able to improve by taking failure and success into consideration. Since this is binary classification, I used the logistic regression cost (binary cross-entropy). This is my implementation in C:
double cost(matrix Y, matrix A)
{
    /*
     * 1.0000001 - A instead of 1 - A,
     * to avoid taking log(0)
     */
    matrix y = matMulElementWise(Y, matLog(A));
    matrix x = matLog(matSubLeft(1.0000001, A));

    x = matMulElementWise(matSubLeft(1, Y), x);
    double c = sum(matAddOfMatrices(y, x));
    return (c * -1) / Y.shape[1];
}
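For reference, what this computes over the m = Y.shape[1] training examples is the standard binary cross-entropy:

cost(Y, A) = -(1/m) * sum(Y * log(A) + (1 - Y) * log(1.0000001 - A))

where * is element-wise multiplication.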
Now I can monitor the progress of the model through training. All that remains before building the train function is a function that allows the model to actually learn and improve on each training iteration. For this, I used the gradient descent optimization algorithm, which updates the bias and the weights of the neuron. It is actually quite interesting how gradient descent and other optimization algorithms work; I wrote an article where I covered them more deeply here.
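For a single sigmoid neuron with the cost above, the gradients have a well-known closed form, and these are exactly the quantities the function below computes (m is the number of examples, alpha the learning rate):

dZ = A - Y
dW = (X . dZ^T) / m
db = sum(dZ) / m
W = W - alpha * dW^T
b = b - alpha * db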
This is the implementation of gradient descent:
void gradient_descent(neuron *n, matrix X, matrix Y, matrix A, double alpha)
{
    int m = Y.shape[1];
    matrix dZ = matSubElementWise(A, Y);        /* dZ = A - Y */
    matrix dW = matMul(dot(X, T(dZ)), 1.0 / m); /* dW = (X . dZ^T) / m */
    double db = sum(dZ) / m;                    /* db = sum(dZ) / m */
    matrix dW_scaled = matMul(dW, alpha);
    matrix updated_W = matSubElementWise(n->W, T(dW_scaled));

    delete_matrix(n->W);
    n->W = updated_W; /* W = W - alpha * dW^T */
    n->b -= alpha * db;
    /* free the per-call temporaries */
    delete_matrix(dZ);
    delete_matrix(dW);
    delete_matrix(dW_scaled);
}
And with this, my model was ready to begin training. I designed the train function to show the progress across as many iterations as desired. Overall, the train function is not that impressive; it mainly chains the other functions together in a sequence that obtains the predictions, obtains the cost, and makes the model learn from that, until the last iteration is reached:
void train(neuron *n, matrix X, matrix Y, int iterations, double alpha, int verbose, int step)
{
    int i;
    matrix A;
    double c;

    for (i = 0; i <= iterations; i++)
    {
        /* A aliases n->A; forward_prop frees the previous activation itself */
        A = forward_prop(n, X);
        c = cost(Y, A);
        gradient_descent(n, X, Y, A, alpha);
        if (verbose && (i % step == 0 || i == iterations))
            printf("Cost after %i iterations: %.16lf\n", i, c);
    }
}
I also declared a couple of less relevant helper functions for evaluation; with those I can obtain data like the accuracy, the cost, etc. Here's the definition of my main to test the model:
#include "neuron.h"

int main()
{
    neuron n = create_neuron(784);
    matrix X_train = matLoadTxt("X_train");
    matrix Y_train = matLoadTxt("Y_train");
    matrix X_dev = matLoadTxt("X_dev");
    matrix Y_dev = matLoadTxt("Y_dev");

    train(&n, X_train, Y_train, 1000, 0.5, 1, 100);
    matrix A = evaluatePrediction(&n, X_dev);
    double c = evaluateCost(&n, X_dev, Y_dev);

    printf("Validation accuracy: %lf\n", (sum(matIsEqualElementWise(A, Y_dev)) / Y_dev.shape[1] * 100));
    printf("Validation cost: %lf\n", c);
    delete_matrix(X_train);
    delete_matrix(Y_train);
    delete_matrix(X_dev);
    delete_matrix(Y_dev);
    return 0;
}
So the model has 784 input features and is trained on the X_train and Y_train matrices: X contains the input images and Y the expected outputs. I set the model to run 1000 iterations and display the cost every 100 steps. It is then evaluated against the unseen X_dev and Y_dev validation matrices.
Results
With 1000 iterations, this is what I obtained:
X_train loaded correctly
Y_train loaded correctly
X_dev loaded correctly
Y_dev loaded correctly
Cost after 0 iterations: 3.4321249616198397
Cost after 100 iterations: 0.0271270106558991
Cost after 200 iterations: 0.0191987197872680
Cost after 300 iterations: 0.0156183696853705
Cost after 400 iterations: 0.0134611493548037
Cost after 500 iterations: 0.0119901158387433
Cost after 600 iterations: 0.0108992398517611
Cost after 700 iterations: 0.0100466880474873
Cost after 800 iterations: 0.0093587575537691
Cost after 900 iterations: 0.0087922181561509
Cost after 1000 iterations: 0.0083187004998270
Validation accuracy: 99.952719
Validation cost: 0.001466
The validation accuracy was 99.95% and the validation cost was 0.001. 'Validation' means the model is predicting data it hasn't seen during training, which means it adapts correctly to new data and is able to make quite good predictions.
Was it worth it?
Before starting this project I was already aware that the model would exist just to prove to myself that I could build something like this. It was really fun to revisit all these interesting low-level programming concepts and to tackle something as far from the low-level paradigm as machine learning. So I believe it was completely worth it as a learning experience. Of course, I recommend it to anyone who, like me, enjoys these fields of computer science, but I don't think it's a good idea for people who are just getting started with either low-level programming or machine learning. A lot of Python's built-in functions and the methods NumPy brings just made me appreciate Python a little more. Don't get me wrong, I still like C a lot, but it is clearly not worth implementing something like this except for learning. What I mean is that if I had to deliver a project involving machine learning, I probably wouldn't use C, or C++, or anything that doesn't make the math as easy as Python does.
Overall, I would say machine learning is really easy when Python and NumPy provide the tooling, and quite a bit harder when you have to build the tooling yourself.