Start with Deep Learning - simply
Preamble
Today’s topic is Deep Learning. If you have not done so already, I invite you to read a similar article on how to start with data science, since deep learning can be regarded as a sub-domain of data science and knowledge of the former helps with the latter. More information on similar topics can also be found on the following Facebook page, which I invite people to follow.
To understand some basic concepts about deep learning, no specialised background is needed. I have decided to explain neural networks (and consequently deep learning) as simply as I possibly can, so that even non-technical people can understand – at least the basic concepts. This might help to bridge the gap and make people more aware of deep learning – in other words, demystify it. I will then share what I think is a good pipeline to learn this, from novice to master-kaggler level, in a series of articles.
I will focus only on the types of networks I commonly use – feel free to expand your knowledge base with other types. If you find the intro too basic, just skip it and move on to the programming languages. Following the examples and links past the “what is deep learning” point requires some basic data science knowledge and basic coding skills. All the tools I will be discussing in this article are open source, as my objective is for deep learning to be accessible to anyone. There are many different deep learning structures, each suited to different tasks (for example image classification, text analysis etc.), and I will write a separate article to cover each of these. The current article focuses on a basic understanding of deep learning and a specific kind of neural network called the multi-layer perceptron.
Why deep learning?
Deep learning is becoming (if it has not become already) a big thing in the data science world. By this point people should know that deep learning has immensely helped in predictive modelling, especially in tasks like image classification, sound classification, information retrieval and many, many other data-related problems. It is also famous for being the main force behind the algorithmic solution of AlphaGo – the AI that beat the human champion at Go.
Chatbots, robots and other AI systems are being built with deep learning at their centre. I have used deep learning models thousands of times in my work and in (kaggle) predictive modelling competitions with (some) success, and I have found them very useful for winning or, more generally, for building very accurate predictive models. However I am by no means specialized (or even ‘good’) in this area. I see this as an advantage while writing this article.
What is deep learning?
Background
Before I start posting the pipeline regarding deep learning, allow me to naively explain what a simple neural network is. Deep learning models are basically neural networks. They have re-surfaced in recent years thanks to the computational power of modern GPUs (which favour parallelism), allowing them to become deeper in structure and faster to train. The GPU is the main processing power behind a computer’s graphics card (thank you, video games!). You may have a look here for such cards. Any of these cards (even the cheaper ones) can make a neural network train 50-100 times faster than on a typical CPU-only computer.
Linear Regression as a simple neural network
There are many neural network structures, but let’s start with a simple one, the multi-layer perceptron. I assume you are a bit familiar with regression; I will not get into its technicalities.
Let’s assume you have three variables, recorded historically for many employees across some companies:
- Age (X1)
- Years of working experience (X2)
- Number of kids (X3)
You also know their income (y) and you would like to associate the variables with the income – your target, or (as it is often referred to) the y variable. In very simple terms, the result of a linear regression is the following:
To estimate the income of a person, you multiply Age (X1) by a weight W1, years of working experience (X2) by W2 and number of kids (X3) by W3.
Regression will find the weights (W1, W2, W3) such that, when they are multiplied with the variables, the sum gives a value close to the actual income. For example:
- W1 could be 1000
- W2 could be 100 and
- W3 could be 10
So if a person is 30 years old, with 5 years of working experience and 1 kid, his income would be:
Income = 1000×30 + 100×5 + 10×1 + 5000 = 35,510$.
You may have noticed I put a constant term of 5000 at the end. This is also called the bias in neural networks. The constant/bias is independent of the other variables and you may regard it as the minimum protected salary in this scenario – it is not affected by other factors; everyone should get at least this amount, young or old, experienced or not, with kids or no kids. A mathematical optimisation method takes place to find these weights W1, W2, W3 (and the constant) based on the known income values. A common optimisation method for finding these weights is called Gradient Descent – you can learn more about it in the following video.
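To make the arithmetic concrete, here is a tiny Python sketch of this prediction step – the weights and the bias are simply the made-up values from the example above, not something a trained model produced:
# A naive linear model: a weighted sum of the features plus a constant/bias.
weights = [1000, 100, 10]   # multipliers for age, experience and number of kids
bias = 5000                 # the "minimum protected salary" constant

def predict_income(age, experience, kids):
    features = [age, experience, kids]
    return sum(w * x for w, x in zip(weights, features)) + bias

print(predict_income(30, 5, 1))   # prints 35510, as in the example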
1-layer neural network
Back to the example. Here we have a vector of values (let’s call them weights) that serve as multipliers for our variables (age, experience, kids) to derive the income. What if we had more than one set of multipliers? In the previous example, if someone were 100 years old, his base income (before the other terms) would be:
Income = 1000×100 + … = 100,000 + (the rest)
which is unlikely, as most people are retired by that age and their income is lower. What if we had a new set of weights that penalised age and boosted years of experience? Let’s assume the new set of weights/multipliers for X1, X2, X3 is found by the optimisation method to be (-1000, 3000, 1000), with the constant again at 5000$. If the elderly person has 40 years of working experience and 2 kids, his salary would be:
Income = -1000×100 + 3000×40 + 1000×2 + 5000 = 27,000$.
With the old set of weights he would have had 109,020$!
An average of the 2 scores is likely to give a better estimate than using only the first set of weights. Likewise, in other scenarios the first set of weights will give more sensible results than using only the second.
Let’s recap: we have 2 sets of weights:
1) [1000, 100, 10, 5000]
2) [-1000, 3000, 1000, 5000]
(age, experience, kids, bias)
If we multiply these with the variables, we get two estimates for the income, and we can take their average (or, in other situations, their sum). Repetition is the mother of learning – the weights are found through an optimisation method that tries to make the prediction as close to the actual income as possible.
What I have explained so far is, in fact, a neural network (not deep at all, however!). The structure could be illustrated as:
Input data → multiplied with 2 sets of weights → produce 2 scores → we can average/sum them
This can be considered a neural network with 1 hidden layer and 2 neurons. The 2 neurons in this case are the 2 sets of weights. The number of neurons is not defined through strict criteria; people normally try different numbers of hidden neurons (or sets of weights) to get a better estimate of the target. The concept of a layer becomes clearer once we add one more.
2-layer neural network
In the previous example our 2 neurons (or sets of weights) produce 2 scores. We could have one more set of weights that attempts to find a good way to combine these 2 scores. For example:
- the weight for neuron 1 could be 0.7
- the weight for neuron 2 could be 0.3 and
- the constant is 500
In other words, if neuron 1 yields a score of 109,020 and neuron 2 a score of 27,000, the income would be:
Income = 0.7×109,020 + 0.3×27,000 + 500 = 84,914$.
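The whole two-layer calculation for that 100-year-old person can be written as a few lines of Python – again, every weight below is a made-up number from the example, not a learned value:
# Hidden layer: two neurons, i.e. two sets of weights, each with a bias.
neuron1 = {"weights": [1000, 100, 10], "bias": 5000}      # the original set of weights
neuron2 = {"weights": [-1000, 3000, 1000], "bias": 5000}  # penalises age, boosts experience

def score(neuron, features):
    return sum(w * x for w, x in zip(neuron["weights"], features)) + neuron["bias"]

person = [100, 40, 2]          # age, experience, kids
s1 = score(neuron1, person)    # 109020
s2 = score(neuron2, person)    # 27000

# Second layer: one more set of weights (0.7, 0.3) plus a constant of 500
# that combines the two scores into the final estimate.
income = 0.7 * s1 + 0.3 * s2 + 500
print(s1, s2, income)          # 109020 27000 84914.0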
The following graph portrays the network's architecture along with the weights for each neuron.
Output layer
The neural network commonly has a final layer that allows you to do one last transformation on that last score (e.g. the 84,914$ we computed above) before concluding the prediction. In many situations it stays the same (so we just multiply by 1). The following image demonstrates the scoring outputs of each neuron, given a person with the following characteristics:
1) 100 years old
2) 40 years of working experience
3) 2 kids
So we could understand the multi-layer perceptron as different sets of multipliers applied to our features/variables (like age or kids) at different levels.
Disclaimer:
This representation of a neural network is very simplistic, yet useful in illustrating how it works. It should be noted that there are extra transformations that take place after the multiplications, like taking the logarithm of the score. These transformations (often called activations) help the optimisation. Typical neural networks may have many layers and thousands of hidden neurons, so the model becomes deep and these equations become difficult to track, but you should feel reassured that they are there to help you get a better score (assuming your testing method is sound, which we will discuss later on).
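As a tiny illustration, the Rectifier (or Relu) activation that comes up later in this article is about as simple as these transformations get – it just clips negative neuron scores to zero:
def relu(score):
    # Rectifier / Relu: keep positive scores, clip negative ones to zero
    return max(0.0, score)

print(relu(27000))   # 27000.0 – positive scores pass through unchanged
print(relu(-5000))   # 0.0 – negative scores are clipped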
Learn the theory about deep learning
Previously, I naively explained what the output of a typical neural network looks like and described it as a set of multipliers applied at various levels. I also briefly mentioned that some optimisation takes place to find the weights that make the predictions as accurate as possible. From now on I will list a few resources that provide a deeper (and more technical) understanding of deep learning.
If you are a bit familiar with coding and you want to understand the main optimisation method behind neural networks and generally how they work, look no further than this blog. This PhD from Stanford explains it with limited maths and simple examples. Although the illustrated examples are written in JavaScript, which is not very commonly used for deep learning, some of the examples have been re-written in python.
What seems to be extremely hot right now is Andrew Ng’s courses about deep learning on coursera. Many people have claimed it was very useful and although I have not thoroughly looked at it, I am pretty sure it will be top quality. Another good option from Coursera is Geoffrey Hinton’s course on neural networks.
If you prefer books – for a deeper (ha!) and more mathematical treatment of neural networks, you may try the Goodfellow et al. book about deep learning in pdf format. Additionally, there is a collection of nice videos regarding deep learning that people may find useful, although some of them are more advanced than others. A nice and visual overview of how deep learning is used in practice, along with some theory, can be viewed in these slides.
I personally started with neural nets (which are fundamentally what deep learning is built on) in Java using encog – a package for neural networks. If you are familiar with Java, I need to state that the examples, tutorials and book were really insightful and helped me a lot to understand how to construct neural networks in an object-oriented way.
I would expect that either the course or the 1st book would be enough to give you a good understanding of deep learning. You may use the rest as supplementary sources to dive deeper or to validate what you’ve learnt.
Multi-layer perceptron models
These are the most common neural networks used in standard regression and classification tasks with tabular (or Excel-type) data and have the structure described at the beginning. There are different options for R and python.
For R
You may find details on how to install R here and a book to get you started with R here.
H2O Deep learning
The most efficient and easiest-to-use choice with solid performance (especially if you don’t have a GPU) is the deep learning implementation from H2O (which, by the way, is my favourite native H2O algorithm). H2O is open source. To install H2O for R, follow the instructions. The official deep learning tutorials for R can be found here. I myself started with the H2O deep learning package through the following deep learning benchmark from Arno Candel in the Africa Soil competition. This tutorial was also part of my team's 5th place in that competition. The slides associated with the tutorial can be found here and the actual code here.
Let’s break down the call that builds the deep learning model – it is actually quite similar in most deep learning packages, and I won’t spend as much time on other packages since many things are just repeated:
m3 <- h2o.deeplearning(
training_frame=sampled_train,
validation_frame=valid,
x=predictors,
y=response,
activation="RectifierWithDropout",
overwrite_with_best_model=F,
hidden=c(100,50),
hidden_dropout_ratios=c(0.5,0.5),
epochs=10,
adaptive_rate=F,
rate=0.01,
rate_annealing=2e-6,
momentum_start=0.2,
momentum_stable=0.4,
momentum_ramp=1e7,
l1=1e-5,
l2=1e-5)
First of all, this list may seem daunting to begin with; however, most of these are just values you need to keep changing in order to get better predictions. Sometimes it is far better to know which parameters are important for improving your predictions than to know thoroughly what they do. I have seen that in competitive challenges too: the people who create the tools are often not able to beat those who use them and exhaust all possible values for these parameters.
Before I break down the list above, I would like you to recall that the weights in deep learning models are found through an optimization method. This method is iterative. Initially the algorithm sets some random values for these weights (for example, we could multiply age by 122341.123, experience by 0.34235 and number of kids by 3 – randomly) and then, in every iteration, the optimisation slightly changes these values until it reaches sensible values that give good estimates of the target variable (let’s say income). We can control how quickly this learning happens by increasing/decreasing the number of iterations (often called epochs) or by changing something called the learning rate. The latter states how much 'trust' to put in every new update of the weights (after each iteration). The optimum values for these parameters are found by scoring a different set – a validation set. That set is there to tell us whether the model is learning in the right manner or whether it needs more/fewer epochs. We normally monitor metrics that show how large the error of our predictions is at the end of each iteration/epoch, in order to decide where to stop the optimisation process.
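Purely as an illustration of that iterative idea (and not of what H2O does internally), a training loop with epochs, a learning rate and a validation check could be sketched in Python roughly like this:
import random

def train(rows, targets, valid_rows, valid_targets, epochs=10, rate=0.01):
    # start from random weights, as described above
    n_features = len(rows[0])
    weights = [random.uniform(-1, 1) for _ in range(n_features)]
    bias = 0.0
    for epoch in range(epochs):
        for x, y in zip(rows, targets):
            pred = sum(w * v for w, v in zip(weights, x)) + bias
            error = pred - y
            # nudge each weight in the direction that reduces the error;
            # the learning rate controls how much 'trust' we put in the update
            weights = [w - rate * error * v for w, v in zip(weights, x)]
            bias -= rate * error
        # monitor the error on the validation set to decide when to stop
        val_error = sum(
            (sum(w * v for w, v in zip(weights, x)) + bias - y) ** 2
            for x, y in zip(valid_rows, valid_targets)
        ) / len(valid_rows)
        print(epoch, val_error)
    return weights, bias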
Breaking down the list of parameters:
1. training_frame is the name of your training data – when you import the data into R, you give the dataset a name.
2. validation_frame is the name of some other data we use to measure how good our predictions are. In other words we use 2 different datasets, one to create the deep learning model (and estimate the weights) and another one to validate/test it.
3. x contains the columns you want to use from the training frame to estimate weights for. For example, we may have many columns, but we only want Age, Experience and number of kids. In this case 'predictors' will be a list containing these 3 names.
4. y is the name of the target variable. In our case it could be income.
5. epochs is the number of iterations the optimization algorithm runs for. No data scientist ever knows the "correct" number of iterations in advance. You just need to keep changing this and comparing the performance (e.g. the accuracy of predictions) on the validation data. The same applies to most other parameters. So if you don't know which values to pick – you are not alone; I too just keep changing values and comparing the performance.
6. rate is the learning rate that controls how much each iteration should affect the weights.
7. hidden is a list that describes the deep learning model's architecture. The (100,50) says that there are 2 hidden layers, where the 1st one has 100 hidden units/neurons (or 100 sets of weights) and the second layer has 50 hidden units. In the first example we only had 2 sets of weights (one to capture elderly people more accurately and one to capture younger people more accurately) and then 1 more to combine them (with 0.7 and 0.3, if you remember). That architecture would have been (2,1). The best performing architecture – which could be (100,50,25) or (1000,500,500) or (128,64) or something else – is found based on the performance on the validation data. You cannot know the best performing structure in advance; you have to try and see what works best.
8. hidden_dropout_ratios (the dropouts) has the same structure as hidden (i.e. one value per layer) and shows the proportion of times a neuron will be ignored while training. Following the naive example I gave at the top, a 0.5 (or 50%) dropout in the 1st layer means that 1 out of 2 neurons/units may not be scored. Imagine you had the elderly person (from the example) to score and the second neuron (or set of weights) – the one that would have given the more reasonable estimate – gets ignored (or dropped out) for that iteration. The model will then see how off/wrong the first neuron was and make harder adjustments to fix it. In general, dropout helps make the model more generalisable and prevents it from relying too heavily on any single neuron. The best values are, again, found through validation.
9. l1 and l2 are referred to as regularization values. They are there to impose a penalty on the possible values the weights can take. The optimization method will try to find a route to solve this problem by converging to a good set of weights. However this path is by no means a cakewalk. The algorithm might lose its way and get stuck in certain places, or go completely "off the road". This can happen when an update pushes the weights to very big values. It is almost like trying to make a jump to cover more ground while climbing a mountain, but ending up falling down. The l1 and l2 values ensure that this does not happen – or control the extent to which it can happen – by making the weights' updates more conservative. You may regard these regularization penalties as strapping some weights around your legs to control the marching ahead.
10. activation shows the transformation (which I briefly referenced before) that takes place after the score is summed at each neuron/layer. If you don't know what to put, leave it as it is (Rectifier) – most of the time it works well. Rectifier is also often referred to as Relu. The part stating "WithDropout" means we allow for dropout (as explained above) and, from my experience, this should always be there – some dropout, even very little, seems to help in most problems.
11. adaptive_rate, rate_annealing, momentum_start, momentum_stable and momentum_ramp are all values you can change to control the rate at which the algorithm learns. It is not important to know exactly what they are, nor have I found them extremely useful – most of the time the default values seem fine, but you may try different values and see how the performance on the validation data changes.
12. overwrite_with_best_model just specifies, with T(rue) or F(alse), whether we would like to keep only the best model – in other words the iteration/epoch that had the best score on the validation data.
You can find more about the H2O deep learning parameters in the documentation.
From my experience, I have found the following useful:
- 2 hidden layers of 400 and 200 neurons to start with, and then move up or down based on performance
- dropouts between 0.4-0.6
- rate at 0.01
- l2 regularization at 0.00001
- Almost always Rectifier With Dropout.
- epochs 10-50
- adaptive_rate to True
- I also tend to do bagging – as in running many different neural networks (with the same architecture) and averaging their predictions. I normally bag 10-20 times.
- I also scale the data before running it. In its most simple form, scaling means dividing your data by a value to ensure there are no huge numbers that could potentially force the algorithm to make 'extreme leaps'. For example, if we are considering Age, scaling could be dividing all values by 100; someone who is 50 years old will then have a scaled value of 0.5. Let's just say the optimisation method favours smaller values (a short sketch follows right after this list).
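A minimal sketch of that kind of scaling – here just dividing each column by a fixed value such as a sensible maximum – could look like this:
# Scale each column by dividing with a chosen value, so everything ends up
# roughly between 0 and 1 (e.g. age 50 becomes 0.5 when dividing by 100).
def scale_columns(rows, divisors):
    return [[value / d for value, d in zip(row, divisors)] for row in rows]

people = [[30, 5, 1], [100, 40, 2]]           # age, experience, kids
scaled = scale_columns(people, [100, 50, 10])
print(scaled)                                 # [[0.3, 0.1, 0.1], [1.0, 0.8, 0.2]]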
These parameters are fairly standard in any package/application, so I will not spend as much time explaining them for other tools.
For additional information about deep learning with R, there is a series of videos giving more insight:
There is another nice video explaining H2O Flow in general.
Opening a small parenthesis at this point.
People might say that because I now (as of June 2017) work for H2O, I am keen to advertise their algorithms; however, this would be wrong. I have always liked the H2O package and I have specifically highlighted their deep learning implementation MANY times in all the presentations/talks I have given regarding kaggle and the tools I have found useful. For example, look at this from 2015 (bottom), where I referenced the same tutorial as I did above. Also my slides regarding tools that helped me in kaggle competitions in 2016 (slide 29), and pretty much every other time I was involved in a similar talk/discussion/presentation, like these slides from March 2017 (slide 25) and this presentation a few months ago in May.
I am now closing the parenthesis.
Keras for deep learning
I will not spend much time on Keras in R as I mostly use it with Python – besides, the R interface is fairly new and I have not tested it myself. Theoretically speaking, though, it should be pretty powerful.
For Python
You may read how to start with python here or better look at this article.
Keras
Keras is a python library dedicated to deep learning. I can proudly say that its name comes from a Greek word, and it is one of my favourite machine learning packages, which I use very often. Before I start with some good resources regarding the multi-layer perceptron, I need to state that the main computations behind keras are carried out by some other libraries.
In other words Keras is the front-end and the back-end is either one of these libraries:
- Theano
- Tensorflow from Google
- CNTK from Microsoft
I personally use either Theano or Tensorflow as Keras’ backend, depending on the task. For multi-layer perceptron models like the ones we have discussed so far, I prefer Theano. CNTK has very good credentials, but it is quite new to Keras and I have not tried it yet.
The most difficult thing while setting up keras is actually installing these back-ends – especially if you want to run them with GPU support. Understanding deep learning is easy compared to installing them properly on all operating systems :D. It has taken me hours or even days to install these, and I need to warn you that it is no easy task, depending on the operating system you are running. I will share a few resources that might help you install these, but you might need to research more – this could easily have been a separate article in itself (i.e. how to install them)! The most painful by far has been Windows…
Before I share tips for installing the backends, it is vital to know whether you have a GPU-capable graphics card. If you do, it can make the training of your models up to 100 times faster and it is worth the effort of going through the installation.
- On Windows you can easily test this by right-clicking on the desktop and checking whether the Nvidia panel appears in the menu.
- On Linux, running ‘nvidia-smi’ in the terminal should return something.
- On a Mac you may open System Information → Graphics/Displays. If you see the word ‘Nvidia’ in the graphics card name in the Hardware section, then you probably have a suitable GPU.
Once you verify you have a GPU, you basically need to do 2 things:
- Install CUDA. To install the CUDA package, consider the following link.
- Then you need to install CUDNN. You normally need to create an account with NVIDIA and then you will be able to download this for free.
Once CUDA is installed (assuming you have a GPU), you may proceed with installing the packages. If you do not have a GPU (so CPU only), you can skip all the above and start from here.
- To install Theano, consider this link based on your operating system. In most cases a “pip install” in the command line will do the trick.
- For Tensorflow a “pip install” could work, but in most cases I have found that insufficient. Your best source is probably this, based on your operating system. For Windows consider this, but it might not work (as it did not work for me).
Also consider the following links that attempt to do the installations of all the above in different operating systems to get an idea of problems/solutions:
- Install keras with tensorflow, keras in linux with Gpu support.
- Install keras with Theano and Tensorflow, GPU or CPU. And another blog that does the same. And another one.
- Install keras with GPU on mac
Once installed, keras creates a ‘.keras’ folder that normally gets saved in your main user directory and contains a small configuration file with the information about its backend. For instance, this is where you would specify whether you want Theano or Tensorflow.
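For reference, that configuration file (keras.json inside the ‘.keras’ folder) typically looks something like the snippet below – the exact fields can vary slightly between keras versions, and switching backend is just a matter of changing the “backend” value to “theano”, “tensorflow” or “cntk”:
{
    "backend": "theano",
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32"
}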
If you managed to install keras in python on your operating system, you are ready to test it! If you managed to do so including all back-ends plus GPU support on all operating systems – then you are already a legend – feel good about yourself. This might as well be the most important achievement of your life so far, so make certain you enjoy this moment before you proceed to the tutorials – you have earned it and no one will ever be able to take it away from you!
A really neat tutorial to start with keras is the following one. It includes some data preprocessing steps, data visualisation and typical model-building snippets of code. This short pdf explains the multi-layer perceptron using keras with Theano and Tensorflow as back-ends. This step-by-step tutorial in python is clean and provides a good start for any beginner. Many resources about keras can be found in the main GitHub repo. A few different network architectures may be viewed here. Another excellent git-based tutorial with installation instructions for all the aforementioned packages, sample datasets and very informative python notebooks can be found here. You don't need to go beyond chapter 1 for now. Ultimately there can be no better option to learn keras than a book written by its creator – François Chollet.
I could list many more resources, but at this point I think it is more useful to show you a typical architecture that has helped me a lot in many kaggle contests. The architecture is similar to the one we saw above in H2O and in many of the tutorials listed right above. We will scrutinise all the elements in order to get familiar with Keras’ syntax. The following few lines of code have helped me many times:
from keras.layers.normalization import BatchNormalization
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras import regularizers
model = Sequential()
model.add(Dense(400, input_dim=3, kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(0.00001)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(200, kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(0.00001)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Dense(1, kernel_initializer='lecun_uniform'))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X,y,epochs=10,batch_size=64,shuffle=True)
And now translation:
- The first 4 lines, which start with ‘from keras…’, just tell python to import these sub-packages from keras – consider them default lines you need to put there.
- model = Sequential() tells keras that we are going to build the deep learning network sequentially, by adding one element at a time.
- The model.add() statement normally adds a layer to the network.
- A Dense (fully connected) layer is similar to the ones we have seen so far. The concept is easier to appreciate once we examine other layer types (like convolutional ones) in other deep learning tasks (like image classification). For now, consider it the standard layer.
- Inside the Dense layer, we can specify certain things:
- The first argument (400, what used to be called output_dim): this is the number of hidden neurons in the layer. In our naive example at the beginning we had only 2 in the first layer.
- input_dim is the number of features/columns in our data. In our naive example we had 3 (Age, experience and kids).
- kernel_initializer: this is something we see for the first time. Previously I mentioned that the weights are initialised with some random values, and then the optimisation algorithm starts changing these random values until they become more sensible. I metaphorically compared this optimisation algorithm to climbing a mountain – it takes a certain effort and many steps to get to the top, but if you are not careful you might end up at the bottom in no time! The kernel initializer attempts to give more sensible starting weights, so that when the algorithm starts climbing the mountain it does not start from the bottom but from somewhere higher. Depending on the problem, different initializers may work best. You may find most of the available initializers here. For me 'lecun_uniform' normally works best.
- kernel_regularizer is the regularisation term we have already seen.
- Activation refers to the transformation of each layer. You may find more information about available transformations here. Relu (or Rectifier) is the one I use the most.
- BatchNormalization is a semi-new term. When the optimisation algorithm runs, it can update the weights either one row at a time or after processing multiple rows together. Using the same (boring) mountain example, instead of reassessing whether a move was good or not after each step, we can wait for the climber to go a bit further in order to have a more holistic view of the progress (and determine how good or bad it was). The number of steps (or rows) processed before updating the weights is called a batch, and batching can help make more sensible weight updates. However, making the batch too big (e.g. many, many rows together) could make it difficult to tell which part of the route was good and which was bad, and might slow the convergence of the algorithm. Normalisation refers to scaling (the activations of) this group of rows to ensure they don't take extreme values – again, it aids convergence.
- Dropout is the same as we have seen before.
- At this point the first layer is complete. We can go ahead and add another one that may or may not have the same characteristics. For example we may choose not to add batch normalization or to add different activations or to completely remove regularization. Up to you – there is no correct answer – you need to try and find what works best against some validation data.
- Once we finish with the intermediate layers, we may add one more (output) layer to specify the prediction. This is illustrated by model.add(Dense(1, kernel_initializer='lecun_uniform')). The prediction is only one column. In a classification problem, we would instead set the output size to the number of different possible classes. For instance, if we were predicting whether an animal is a dog, a cat or a turtle, we would have to put ‘Dense(3,…’
- After this we finalise the structure through compiling (e.g. model.compile() ):
- loss refers to what kind of error we want to minimise. Returning to my dumb mountain example: how do we determine the success criteria when we climb the mountain? What does the best climb look like? Is it the one that gets us to the top fastest? Is it the one that gets us exactly to the top? Likewise, there are many ways to determine what counts as a prediction error. For example, there are losses/metrics that cannot forgive very big errors, while others tend to give equal weight to any error. To learn more about the available losses, you may have a read here. In practice you need to understand which kind of error is more important for the problem you are trying to solve.
- optimizer refers to the type of learning. Adam is an adaptive-rate optimizer (which means each weight update takes into account how much the weights have been updated so far). More available optimizers can be found here.
- Once the model architecture is finished we can train the model and estimate the weights through model.fit():
- X contains the features (like Age, Experience, kids)
- y is the target variable (like income)
- epochs are the number of iterations to train the model.
- batch_size refers to how many rows we group together when updating the weights.
- Shuffle means we mix the data in every new iteration/epoch. It helps the optimization to explore slightly different routes.
It may be a lot to digest, but this architecture has helped me many times to create powerful models. I normally start with these values and then slowly deviate from them based on performance on some validation data.
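As a quick aside, if the task were classification rather than regression (say dog/cat/turtle, as in the example above), only the last few lines would change. A minimal sketch, assuming the targets have been one-hot encoded into 3 columns:
from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()
model.add(Dense(400, input_dim=3, kernel_initializer='lecun_uniform'))
model.add(Activation('relu'))
# output layer: one unit per class (dog, cat, turtle) with a softmax activation,
# so the outputs can be read as probabilities that sum to 1
model.add(Dense(3, kernel_initializer='lecun_uniform'))
model.add(Activation('softmax'))
# for classification we minimise a classification loss instead of mean squared error
model.compile(loss='categorical_crossentropy', optimizer='adam')
# model.fit(X, y_one_hot, epochs=10, batch_size=64, shuffle=True)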
PyTorch
Although I have not used PyTorch myself, I am obliged to mention it as I have heard very good things about it. If anything, take into account that with PyTorch you don't need to install all the back-ends you saw above, and you may get similar performance. You may find a couple of tutorials here. Also, at first sight this tutorial seems very clean and similar enough to keras that I think I understand it quite well – without having played with the code myself. There is a nice video too.
H2O deep learning
Similar to before. Find details on how to install h2o for python here. However I really recommend this post from my colleague Lauren – even though it assumes you are a MacBook user, it is still very useful and contains many resources. The syntax is pretty similar to R's. There is a notebook and code in the examples section on git.
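A minimal sketch of the equivalent call in Python – the file and column names below are made up purely to show the shape of the API:
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
# hypothetical csv files and column names, matching the naive example
train = h2o.import_file("train.csv")
valid = h2o.import_file("valid.csv")
predictors = ["age", "experience", "kids"]
response = "income"

model = H2ODeepLearningEstimator(hidden=[100, 50],
                                 activation="RectifierWithDropout",
                                 hidden_dropout_ratios=[0.5, 0.5],
                                 epochs=10,
                                 l1=1e-5,
                                 l2=1e-5)
model.train(x=predictors, y=response,
            training_frame=train, validation_frame=valid)
print(model.model_performance(valid))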
Conclusion
Deep learning is an emerging field in machine learning that has received (and will continue to receive) much attention. It comes in many forms, packages, programming languages and tasks. In this article we examined some basic concepts behind deep learning and we looked at a commonly-used network type called the multi-layer perceptron. We demonstrated a few packages that are able to build very competitive deep learning models in python and R. Obviously there are many other kinds of networks suited to different tasks. In the next article I will focus solely on image classification and the types of networks commonly used for this task. Ultimately, I invite people to follow my work on GitHub, as it is fairly relevant to this topic, as well as my Facebook page for more information regarding open source data science!