Machine Learning and Primality Testing
Machine Learning is everywhere. I just came back from the Node.js Interactive conference in Vancouver, Canada, and one of the talks was about using Machine Learning in the browser via TensorFlow.js (TFJS), which can now even be used by Service Workers, so ML-powered applications can keep working offline, right in the browser. Yes, chances are the JavaScript code running in your browser is running some convoluted ML model.
Machine Learning, to some degree, is a commodity nowadays. It definitely helps if you understand the concepts deeply, especially if you work for a heavyweight tech company producing Machine Learning technology, like Microsoft or Google, but in general, even if you're a casual developer, Machine Learning is becoming just a couple of APIs at your disposal, no matter which programming language you're using.
What developers (ML developers or not) need to understand is that Machine Learning is only as good as your training data. They need to pay 10x more attention to the quality, modeling, and vectorization of their training data than to understanding the mathematics of each ML algorithm. Schematically, think about the difference between traditional programs and ML models this way: a traditional program takes Input plus hand-written Rules and produces Output, whereas an ML system takes Input plus Output and produces the Rules.
Hence the result of an ML system is the set of Rules learned from your training data, which is nothing but the Input and Output of the problem you're trying to model.
To illustrate how simple it is, I built a system to check whether a number is prime or not. The goal was to build something better than random guessing (a low bar, I know), so mission accomplished would be an accuracy > 50% (well, actually there are smarter ways to do guesswork for primality tests). Here are the steps I took:
- My training data will be 30,000 numbers: 1/3 of them will be primes and the other 2/3 won't. I'm also working with large numbers only (10+ digits). To quickly check whether a number is prime I'm using the Miller-Rabin primality test, which is ridiculously fast (its running time is polynomial in the number of digits of the input). It is probabilistic rather than 100% accurate, but the error can be made so small (99.9999...% accuracy) that for all practical purposes it is deterministic. A sketch of the test is included after this list.
- I'm modeling the training data in binary, which makes it easier to vectorize (a necessary step to feed the data into the ML model). I represent each decimal digit of the input number in binary; for example, if the number I'm working with is "42", I represent 2 in binary (0010) and 4 in binary (0100) and then concatenate both strings, hence Representation(42) = 00100100. An encoding sketch follows after this list.
- The last value in each training row is the label: 1 (prime) or 0 (not prime).
- The next step was to find a nice ML library. Well, I'm a C# dev, so I was looking for a .NET library. I found (guess what) ML.NET, ideal for what I needed: https://www.microsoft.com/net/apps/machinelearning-ai
- If you follow the ML.NET tutorial, you will notice that you need to write some (not a lot of) code to model your data in the right format for the models.
- When it actually comes time to choose among the different models, the library offers many of them, from fast trees to support vector machines. In general, you should study which models are best for which task (classification, ranking, dense data, sparse data, etc.) and try them. For some of these models you can also adjust a few parameters, such as the number of layers in a deep neural network (DNN), but most of the time they do some inner optimizations by themselves. I decided to go with a Logistic Regression classifier for my example here; a training sketch appears after this list.
- So what were the results? Not that great, although the accuracy is high at 85%! :) The confusion matrix is in the code, and there is a lot of room for improvement. But this was just a way to get my hands into ML and to prove that with < 500 lines of code one can build a system, from scratch, that creates the training data for a very hard problem (primality testing of very large numbers), models the input, vectorizes it, picks a model, trains the model, tests the model, and reports the stats. From this point on, the grind work begins: find new features, find the right proportion of positive/negative samples in your training data, model it differently, try different models, tweak your models, and so on.
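To make the data-generation step concrete, here is a minimal C# sketch of the Miller-Rabin test mentioned above. It is not the code from the repository linked below, just an illustration; the number of rounds and the helper names (IsProbablyPrime, RandomInRange) are my own choices.

```csharp
using System;
using System.Numerics;

static class MillerRabin
{
    // Probabilistic Miller-Rabin test: for a composite n, each random witness
    // reports "composite" with probability >= 3/4, so 'rounds' witnesses give
    // an error probability of at most 4^-rounds.
    public static bool IsProbablyPrime(BigInteger n, int rounds = 20)
    {
        if (n < 2) return false;
        if (n == 2 || n == 3) return true;
        if (n % 2 == 0) return false;

        // Write n - 1 = d * 2^r with d odd.
        BigInteger d = n - 1;
        int r = 0;
        while (d % 2 == 0) { d /= 2; r++; }

        var rng = new Random();
        for (int i = 0; i < rounds; i++)
        {
            BigInteger a = RandomInRange(rng, 2, n - 2);   // random witness
            BigInteger x = BigInteger.ModPow(a, d, n);
            if (x == 1 || x == n - 1) continue;

            bool isComposite = true;
            for (int j = 0; j < r - 1; j++)
            {
                x = BigInteger.ModPow(x, 2, n);
                if (x == n - 1) { isComposite = false; break; }
            }
            if (isComposite) return false;                 // definitely composite
        }
        return true;                                       // probably prime
    }

    // Rough uniform random BigInteger in [min, max]; good enough for picking witnesses.
    static BigInteger RandomInRange(Random rng, BigInteger min, BigInteger max)
    {
        var bytes = (max - min).ToByteArray();
        rng.NextBytes(bytes);
        bytes[bytes.Length - 1] &= 0x7F;                   // keep it non-negative
        return min + new BigInteger(bytes) % (max - min + 1);
    }
}
```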
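The binary modeling described above (each decimal digit becomes 4 bits) can be turned into a fixed-length feature vector roughly like this. The digit order follows the post's example (42 becomes 0010 then 0100), and the 12-digit vector length is an assumption on my part, since the post only says the numbers have 10+ digits.

```csharp
using System.Numerics;

static class DigitEncoder
{
    // Encode each decimal digit of n as 4 bits (least significant digit first,
    // matching Representation(42) = 00100100) and pad to a fixed number of
    // digits so every sample has the same vector length.
    public static float[] ToBitVector(BigInteger n, int maxDigits = 12)
    {
        string s = n.ToString();
        var features = new float[maxDigits * 4];
        for (int i = 0; i < maxDigits; i++)
        {
            int digit = i < s.Length ? s[s.Length - 1 - i] - '0' : 0;
            for (int b = 0; b < 4; b++)
                features[i * 4 + b] = (digit >> (3 - b)) & 1;  // MSB of the digit first
        }
        return features;
    }
}
```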
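And here is roughly what the training and evaluation step looks like. This is a sketch against the current ML.NET API (Microsoft.ML 1.x, where logistic regression is the LbfgsLogisticRegression trainer); the real code in the repository below may use a different version of the library, and the NumberSample class with its 48-element vector (12 digits x 4 bits) is an assumption tied to the encoding sketch above.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// One training row: the bit-encoded digits plus the prime/not-prime label.
public class NumberSample
{
    [VectorType(48)]                 // 12 digits x 4 bits
    public float[] Features { get; set; }
    public bool Label { get; set; }  // true = prime, false = not prime
}

public static class PrimeTrainer
{
    public static void TrainAndEvaluate(NumberSample[] samples)
    {
        var ml = new MLContext(seed: 0);

        // Load the in-memory samples and hold out 20% for testing.
        IDataView data = ml.Data.LoadFromEnumerable(samples);
        var split = ml.Data.TrainTestSplit(data, testFraction: 0.2);

        // Logistic regression binary classifier.
        var trainer = ml.BinaryClassification.Trainers.LbfgsLogisticRegression(
            labelColumnName: "Label", featureColumnName: "Features");

        var model = trainer.Fit(split.TrainSet);

        // Accuracy and confusion matrix on the held-out set.
        var metrics = ml.BinaryClassification.Evaluate(
            model.Transform(split.TestSet), labelColumnName: "Label");
        Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine(metrics.ConfusionMatrix.GetFormattedConfusionTable());
    }
}
```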
That's it for now! Thanks, Marcelo
Code and output are here: https://github.com/marcelodebarros/dailycodingproblem/blob/master/MLPrimalityTest.cs