Keras Neural Networks to Win NVIDIA Titan X
Abhishek Thakur
In late 2015, NVIDIA partnered with Codalab on an ongoing AutoML competition. The competition ended on May 1st, 2016, and the prize for each of the top 3 teams was an NVIDIA Titan X.
In this post, I describe how I secured a place in the top 3 and thus won a Titan X. The results have not been finalized yet, and I hope nobody kicks me out of the top 3 when they are final ;)
AutoML (Automatic Machine Learning)
The AutoML competition lasted almost a year and had 6 phases. With some background in applied machine learning and my in-house AutoML framework, I was able to win the first phase of the competition. A month ago, I decided to spend a few hours on the final phase, which consisted of both a CPU track and a GPU track (CPU for AutoML, GPU for GPU-based algorithms competing for the Titan X).
In every phase of the competition, we were provided with 5 different datasets. The datasets were fully anonymized, with no information about what the features actually represented. We were, however, given some information, for example:
- The type of features (binary, numeric, categorical, mixed)
- The type of dataset (dense or sparse)
- Total number of features
- Total number of samples for training, validation and test
- Missing features, if any
- Type of target (categorical, numeric)
- Number of target variables
- The evaluation metric
The Datasets
For the final phase, the datasets were named as follows:
- Evita
- Flora
- Helena
- Tania
- Yolanda
Since all the datasets were different, the same model could not be used for all of them. I decided to follow a few simple rules to build neural networks with Keras.
In case you don't know about Keras: it is a Python library for deep learning and neural networks, built on top of Theano and TensorFlow. Keras provides very simple APIs for building your own complicated neural networks for a wide variety of tasks. More about Keras can be found on its website: https://www.keras.io
Neural networks need a lot of tuning and dataset preprocessing, and a lot of time goes into optimizing their many hyper-parameters. Some people say there is no rule of thumb for choosing the size of a neural network and its parameters. Well, after reading this post, you might change your opinion. Even if there are no strict rules of thumb, there are certain guidelines you can follow to build and optimize neural networks easily. Thanks to Keras, we don't need to spend time writing our own neural network code in Theano, which can be a very tedious task.
The sparse datasets were treated differently during preprocessing, since Keras doesn't support sparse arrays at the moment. This could be changed by digging into the Keras code, but I decided to be lazy and follow a different preprocessing route for the sparse datasets. In the end, the laziness paid off and the results were much better than those of traditional machine learning algorithms.
For the aforementioned datasets, I followed simple rules, described by the chart below:
If a dataset is sparse, we should go for decomposition methods such as Singular Value Decomposition (SVD) or Non-negative Matrix Factorization (NMF). I didn't use NMF at all; it was all SVD. How does one select the number of components for SVD? After playing around with hundreds of datasets, I have seen that 100-200 components perform best. Increasing the number of components beyond that doesn't give any substantial improvement in results and is much slower. Starting with 100 components, testing a few models, and increasing the number to 200 or maybe 300 in some cases should be the "rule of thumb" for sparse datasets.
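For illustration, here is a minimal sketch of this step using scikit-learn's TruncatedSVD, which works directly on scipy sparse matrices. The matrices and the component count below are placeholders, not the exact competition setup.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for an anonymized sparse competition dataset.
X_train_sparse = sparse_random(1000, 5000, density=0.01, random_state=0, format="csr")
X_test_sparse = sparse_random(200, 5000, density=0.01, random_state=1, format="csr")

# Reduce to ~100-200 dense components that Keras can consume as a regular numpy array.
svd = TruncatedSVD(n_components=120, random_state=42)
X_train_dense = svd.fit_transform(X_train_sparse)  # fit on training data only
X_test_dense = svd.transform(X_test_sparse)        # reuse the same projection
```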
In the case of a dense dataset, Z-score scaling works best with all kinds of neural networks. It was, however, observed that for one particular dataset, scaling the features between 0 and 1 gave slightly better results than Z-score scaling.
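A minimal sketch of the two scaling options with scikit-learn (the array X_dense here is just a stand-in for one of the dense competition datasets):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_dense = np.random.randn(1000, 50)  # stand-in for a dense dataset

# Z-score scaling: zero mean, unit variance per feature (the usual default).
X_zscore = StandardScaler().fit_transform(X_dense)

# 0-1 scaling: occasionally works slightly better, as observed on one dataset.
X_minmax = MinMaxScaler().fit_transform(X_dense)
```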
After preprocessing, it's all about building neural networks. I decided on what kind of architecture I needed (figure above) and then started on the parameters. It has been my observation that batch normalization always works! Batch normalization with a dropout of 0.2-0.3 was fixed initially. All my neural networks used a PReLU activation for every layer except the final one.
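As a sketch of that recipe in Keras (assuming the Keras 2-style API; the layer size, input dimension and dropout rate below are illustrative defaults, not the exact competition settings):

```python
from keras.models import Sequential
from keras.layers import BatchNormalization, Dense, Dropout, PReLU

# One hidden block following the recipe: Dense -> BatchNorm -> PReLU -> Dropout.
model = Sequential()
model.add(Dense(400, input_dim=120))  # 120 = number of SVD components / input features
model.add(BatchNormalization())
model.add(PReLU())
model.add(Dropout(0.2))
```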
We also need to choose the output layer activation, which is a pretty easy task:
- Multilabel classification – Sigmoid
- Multiclass classification – Softmax
- Regression – Linear
Finally, we need a loss function and an optimizer. For binary/multiclass/multilabel classification tasks, we chose categorical cross-entropy, also known as multiclass log loss, as the loss function. For regression tasks, mean squared error was the best choice. For optimizers, the best choices are SGD (takes a lot of time to converge) and Adam (very fast convergence). Due to the time limit, I chose Adam. Some experimentation was also done with SGD, but Adam gave the best performance.
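Continuing the sketch above for a multiclass problem, the output layer and compile step would look roughly like this (num_classes is a placeholder; swap in a sigmoid output for multilabel problems or a linear output with mean squared error for regression):

```python
num_classes = 10  # placeholder for the number of target classes

# Continuing the `model` built in the sketch above.
model.add(Dense(num_classes, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy")
```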
Rules for choosing the neural network parameters:
Start low. In the case of neural networks, it's better to start with a single-layer network with 300-500 neurons, measure the performance, then add one more layer with the same number of neurons and observe the performance again.
Increase high. If the low number of neurons and two layers don't work, increase the number of neurons to 1200-1500 and observe the performance again. If the performance doesn't get better, increase the dropout for both layers to 0.4-0.5.
Go very high. If the above configuration doesn't work either, go for a big neural network with two layers, the same architecture as above, 8000-10000 neurons per layer and a dropout of 0.8-0.9.
If everything fails, go for feature engineering, a different kind of normalization of the dataset, remove batch normalization and repeat the steps above. If there is still no improvement in performance, go for gradient boosting (xgboost) or random forests.
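One way to make those rules quick to iterate on is to parameterize the whole network in a small builder function. The helper below is my own illustration of that idea (the function name, defaults and loss are assumptions, not the author's actual code):

```python
from keras.models import Sequential
from keras.layers import BatchNormalization, Dense, Dropout, PReLU

def build_mlp(input_dim, output_dim, n_layers=1, n_neurons=400, dropout=0.2,
              output_activation="softmax", loss="categorical_crossentropy"):
    """Hypothetical helper that follows the 'start low, then go higher' rules."""
    model = Sequential()
    for i in range(n_layers):
        if i == 0:
            model.add(Dense(n_neurons, input_dim=input_dim))
        else:
            model.add(Dense(n_neurons))
        model.add(BatchNormalization())
        model.add(PReLU())
        model.add(Dropout(dropout))
    model.add(Dense(output_dim, activation=output_activation))
    model.compile(optimizer="adam", loss=loss)
    return model

# Start low, then scale up only if performance stalls.
small_net = build_mlp(input_dim=120, output_dim=10, n_layers=1, n_neurons=400)
big_net = build_mlp(input_dim=120, output_dim=10, n_layers=2,
                    n_neurons=1200, dropout=0.4)
```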
Final Network Architectures
The final architectures are described below. It should be noted that it took only a couple of hours to find well-performing networks after following the rules mentioned above.
Evita Net:
Evita was the only dataset for which I added two new features, namely the count of 0s and the count of 1s per sample.
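A small sketch of how such count features could be added with NumPy (the binary matrix X is a stand-in; this is my reconstruction, not the original code):

```python
import numpy as np

X = np.random.randint(0, 2, size=(1000, 300))  # stand-in binary feature matrix

zeros_per_row = (X == 0).sum(axis=1)  # count of 0s per sample
ones_per_row = (X == 1).sum(axis=1)   # count of 1s per sample

# Append the two counts as extra columns.
X_extended = np.hstack([X, zeros_per_row[:, None], ones_per_row[:, None]])
```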
Flora Net:
A basic network, very similar to Evita's, with 200 SVD components and a linear output layer.
Helena Net:
A very small network with standard-scaled inputs.
Tania Net:
One of the most interesting networks with great performance.
Yolanda Net:
Again, a small but effective neural net.
As you can see, very simple neural networks were used and gave very good performance. With the rules described above, it becomes easy to build neural networks that perform well. I will soon be releasing the code and an advanced write-up on tuning parameters for deep networks using Keras.
Since the results are not final yet, I shouldn't get my hopes up. Once they are final, I get a Titan X and won't have to spend money on Amazon GPU instances, which, by the way, aren't good enough!