Machine Learning and Primality Test

Machine Learning is everywhere. I just came back from the Node.js Interactive conference in Vancouver, Canada, and one of the talks was about using Machine Learning in the browser via TensorFlow.js (TFJS), which can now even be used by Service Workers, so applications keep working offline, right in the browser. Yes, chances are the JavaScript code running in your browser is running some convoluted ML model.

Machine Learning is, to some degree, a commodity nowadays. It definitely helps if you understand the concepts deeply, especially if you work for a heavyweight tech company producing Machine Learning technology, like Microsoft or Google, but in general, even if you're a casual developer, Machine Learning is becoming just a couple of APIs at your disposal, no matter which programming language you use.

What developers (ML developers or not) need to understand is that Machine Learning is only as good as your training data. They should pay 10x more attention to the quality, modeling, and vectorization of their training data than to the mathematics of each ML algorithm. Schematically, think about the difference between traditional programs and ML models this way:

    Traditional programming: Input + Rules  -> Output
    Machine Learning:        Input + Output -> Rules

Hence the result of an ML system is the Rules, based on your training data, which is nothing but the Input and Output of the problem you're trying to model.

To illustrate how simple it is, I built a system to check whether a number is prime or not. The goal was to build something better than random guessing (a low bar, I know), so mission accomplished would be an accuracy above 50% (well, actually there are smarter ways to do guesswork for primality tests, such as ruling out even numbers). Here are the steps I took:

  1. My training data is 30,000 numbers: 1/3 of them primes, the other 2/3 not. I'm also working with large numbers only (10+ digits). To quickly check whether a number is prime, I use the Miller-Rabin primality test, which is ridiculously fast (its running time is polynomial in the number of digits of the input), although not 100% accurate. But honestly, each round of the test has an error probability of at most 1/4, so after enough rounds the odds don't really matter: for all practical purposes, it is deterministic.
  2. I model the training data in binary, which makes it easier to vectorize (a necessary step to feed the data into the ML model). I represent each decimal digit of the input number in binary. For example, if the number I'm working with is "42", I represent 2 in binary (0010) and 4 in binary (0100) and then concatenate both strings, hence Representation(42) = 00100100.
  3. The last digit of each input row is the label: 1 (prime) or 0 (not prime).
  4. The next step was to find a nice ML library. Well, I'm a C# dev, so I was looking for a .NET library, and I found (guess what) ML.NET, ideal for what I needed (https://www.microsoft.com/net/apps/machinelearning-ai).
  5. If you follow the ML.NET tutorial, you will notice that you need to write some (though not a lot of) code to shape your data into the format the models expect.
  6. When it actually comes time to choose a model, the library offers many different ones, from fast trees to support vector machines. In general, you should study which models are best for which task (classification, ranking, dense data, sparse data, etc.) and try them. For some of these models you can also adjust a few parameters, such as the number of layers in a deep neural network (DNN), but most of the time they do some inner optimization by themselves. I decided to go with a Logistic Regression classifier for my example here.
  7. So what were the results? Not that great, although the accuracy is high at 85%! :) The confusion matrix is in the code and there is a lot of room for improvement. But this was just a way to get my hands into ML and to prove that with fewer than 500 lines of code one can build a system, from scratch, that creates the training data for a very hard problem (primality testing of very large numbers), models the input, vectorizes it, picks a model, trains the model, tests the model, and gets the stats about it. From this point on, the grind work begins: find new features, find the right proportion of positive/negative samples in your training data, model the data differently, try different models, tweak your models, and so on.
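
Step 1 above labels the training data with Miller-Rabin. My actual code is C# (linked below); purely as an illustration, here is a minimal Python sketch of the test, where the function name and the round count are my choices, not from the repo:

```python
import random

def is_probably_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test.

    Each round is fooled by a composite with probability at most 1/4,
    so 40 rounds bound the error by 4**-40: deterministic in practice."""
    if n < 2:
        return False
    # Quick exit for small primes and their multiples.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as 2^s * d with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)          # modular exponentiation: fast even for huge n
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False          # a witnesses that n is composite
    return True
```

The speed comes from three-argument `pow`, which does modular exponentiation in time polynomial in the number of digits, so 10+ digit inputs are no problem.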
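
The digit-by-digit encoding from steps 2 and 3 can be sketched as follows (again Python for illustration; the helper names are hypothetical, not from the linked C# code). Following the Representation(42) = 00100100 example, digits are encoded least-significant first, 4 bits each, with the 0/1 label appended as the last field:

```python
def encode(n):
    """Encode each decimal digit as 4 bits, least-significant digit first,
    so encode(42) == "0010" + "0100" (digit 2, then digit 4)."""
    return "".join(format(int(d), "04b") for d in str(n)[::-1])

def make_example(n, is_prime):
    """One training row: the bit features plus the label as the last field."""
    return [int(b) for b in encode(n)] + [1 if is_prime else 0]
```

This is the vectorization step: each number becomes a fixed-alphabet sequence of 0/1 features that any classifier can consume directly.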
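
Steps 6 and 7 use ML.NET's Logistic Regression trainer and its evaluation output. As a rough sketch of what that trainer and the reported metrics do under the hood (a plain-Python stand-in, assuming simple stochastic gradient descent; none of this is the ML.NET implementation), consider:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=200):
    """Toy logistic regression trained by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            g = p - yi                        # gradient of the log-loss
            b -= lr * g
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, xi):
    """Classify by the sign of the linear score (threshold at p = 0.5)."""
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 if z >= 0 else 0

def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows = actual (0, 1), columns = predicted (0, 1)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def accuracy(m):
    """Fraction of correct predictions: the diagonal over the total."""
    return (m[0][0] + m[1][1]) / sum(sum(row) for row in m)
```

The accuracy and confusion matrix here are the same metrics step 7 reports: accuracy is the single headline number, while the confusion matrix shows where the 15% of errors land (false positives vs. false negatives), which is what guides the grind work that follows.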

That's it for now! Thanks, Marcelo

Code and output are here: https://github.com/marcelodebarros/dailycodingproblem/blob/master/MLPrimalityTest.cs
