Auto Encoders
Data compression is a big topic used in computer vision, computer networks, and many other fields. Data compression means converting our input into a smaller representation that we can recreate to some degree of quality; the smaller representation is what gets passed around, and when anyone needs the original, they reconstruct it from the smaller representation.
Auto Encoders are unsupervised neural networks that use machine learning to do this compression. The goal of an Auto Encoder is to learn a compressed, distributed representation for the given data, typically for the purpose of dimensionality reduction.
We already have principal component analysis (PCA), so why do we need Auto Encoders?
An Auto Encoder can learn non-linear transformations, unlike PCA, thanks to its non-linear activation functions and multiple layers. It also does not have to use dense layers: it can use convolutional layers instead, which can be better for image, video, and sequential data. It may be more efficient, in terms of model parameters, to learn several layers with an Auto Encoder rather than one huge transformation with PCA. An Auto Encoder also gives a representation as the output of each layer, and having multiple representations of different dimensions is always useful. Finally, an Auto Encoder can make use of pre-trained layers from another model, applying transfer learning to prime the encoder or the decoder.
Two main applications of Auto Encoders
1. Data denoising
2. Dimensionality reduction
For data visualization with appropriate dimensionality constraints, Auto Encoders can learn data projections that are more interesting than PCA.
Auto Encoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error; this means we want the output to be as close to the input as possible. An Auto Encoder neural network is basically an unsupervised machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs.
Key features of Auto Encoders
An Auto Encoder is an unsupervised machine learning algorithm that is similar to PCA and minimizes the same kind of reconstruction objective. It is a neural network whose target output is its input. Although Auto Encoders are quite similar to PCA, they are more flexible: an Auto Encoder can represent both linear and non-linear transformations in its encoding, whereas PCA can only perform linear transformations.
Components of Auto Encoders:
Three main layers of Auto encoders are:
1. Encoder
2. Code
3. Decoder
Encoder Layer: The encoder is the part of the network that compresses the input into a latent-space representation. It encodes the input image as a compressed representation in a reduced dimension; the compressed image typically looks garbled, nothing like the original image.
Code Layer: This component represents the latent space; the code is the part of the network that holds the compressed input which is fed to the decoder.
Decoder Layer: This layer decodes the encoded image back to the original dimension. The decoded image is a lossy reconstruction of the original image, reconstructed from the latent-space representation.
Properties of Auto Encoders:
- Data-specific: Auto Encoders are only able to compress data similar to what they have been trained on.
- Lossy: The decompressed outputs will be degraded compared to the original inputs.
- Learned automatically from examples: It is easy to train specialized instances of the algorithm that will perform well on a specific type of input.
Training of an auto encoder:
There are four hyperparameters that we need to set before training an Auto Encoder.
1. Code Size
2. Number of Layers
3. Loss Function
4. Number of nodes per layer
Code size: The code size is the number of nodes in the middle layer; a smaller size results in more compression.
Number of layers: The Auto Encoder can be as deep as we want it to be; we can have two or more layers in both the encoder and the decoder, not counting the input and the output.
Loss function: We use either mean squared error or binary cross-entropy. If the input values are in the range 0 to 1, we typically use cross-entropy; otherwise we use mean squared error.
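As a minimal, hedged sketch of where these four hyperparameters appear (assuming Keras and 784-dimensional inputs scaled to [0, 1]; these choices are assumptions, not part of the text):

```python
from tensorflow.keras import layers, Model

code_size = 32        # 1. code size: number of nodes in the middle layer
hidden_size = 128     # 4. number of nodes per layer

inputs = layers.Input(shape=(784,))
x = layers.Dense(hidden_size, activation="relu")(inputs)   # 2. encoder layer(s)
code = layers.Dense(code_size, activation="relu")(x)
x = layers.Dense(hidden_size, activation="relu")(code)     # 2. decoder layer(s)
outputs = layers.Dense(784, activation="sigmoid")(x)

autoencoder = Model(inputs, outputs)
# 3. loss function: binary cross-entropy, since the inputs lie in [0, 1]
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```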
Architecture of an auto encoder
In an Auto Encoder we add a couple of layers between the input and output, and the sizes of these layers are smaller than the input layer. Let's say the input vector has a dimensionality of N, which means that the output will also have a dimensionality of N. We make the input go through a layer of size P, where P is less than N, and ask the network to reconstruct the input. The Auto Encoder receives unlabeled input, which is then encoded in order to reconstruct the input. One important part of Auto Encoders is the bottleneck. The bottleneck approach is an elegant approach to representation learning, specifically for deciding which aspects of the observed data are relevant information and which aspects can be thrown away. It does this by balancing two criteria: the compactness of the representation, measured as its compressibility (the number of bits needed to store the representation), and the information the representation retains about some behaviourally relevant variables. It assumes we know what the behaviourally relevant variables are and how they are related to the observed data, or at least that we have data from which to learn or approximate the joint distribution between the observed and relevant variables.
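Written compactly (standard notation, not taken from the original text), with input x in R^N, code z in R^P, P < N, encoder f, decoder g, and m training examples, the setup and training objective are:

\[
z = f(x) = \sigma(Wx + b), \qquad
\hat{x} = g(z) = \sigma'(W'z + b'), \qquad
\min_{W,b,W',b'} \; \frac{1}{m} \sum_{i=1}^{m} \big\lVert x^{(i)} - \hat{x}^{(i)} \big\rVert^{2}
\]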
Encoder:
The encoder is a neural network. Its input is a data point X, its output is a hidden representation Z, and it has weights and biases θ. To be concrete, let's say X is a 28-by-28-pixel photo of a handwritten number. The encoder encodes the 784-dimensional data into a latent representation space Z of much fewer than 784 dimensions. This is typically referred to as a bottleneck, because the encoder must learn an efficient compression of the data into this lower-dimensional space.
Decoder: The decoder is another neural network. Its input is the representation Z, it outputs the parameters of the probability distribution of the data, and it has weights and biases φ. Continuing with the handwritten-digit example, let's say the photos are black and white and each pixel is represented as 0 or 1. The probability distribution of a single pixel can then be represented using a Bernoulli distribution. The decoder gets the latent representation of a digit Z as input and outputs 784 Bernoulli parameters, one for each of the 784 pixels in the image. The decoder decodes the real-valued numbers in Z into 784 real-valued numbers between 0 and 1. Information is lost because it goes from a smaller to a larger dimensionality.
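In standard notation (assuming, as above, that the decoder outputs Bernoulli parameters p_1, ..., p_784 for the 784 pixels), the decoder's likelihood and log-likelihood are:

\[
p_{\phi}(x \mid z) = \prod_{j=1}^{784} p_j^{\,x_j}\,(1-p_j)^{1-x_j}, \qquad
\log p_{\phi}(x \mid z) = \sum_{j=1}^{784} \big[\, x_j \log p_j + (1-x_j)\log(1-p_j) \,\big]
\]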
How do we find out how much information is lost? The loss function helps us measure this: it tells us how effectively the decoder has learned to reconstruct an input image X given its latent representation Z. The loss function of the variational Auto Encoder is the negative log-likelihood with a regularizer. Because there are no global representations shared by all data points, we can decompose the loss function into terms that each depend on a single data point. The first term is the reconstruction loss, or expected negative log-likelihood of the i-th data point, where the expectation is taken with respect to the encoder's distribution over the representations. This term encourages the decoder to learn to reconstruct the data: if the decoder's output does not reconstruct the data well, it incurs a large cost in the loss function.
The second term is a regularizer: the Kullback-Leibler divergence between the encoder's distribution and a standard normal distribution. This divergence measures how much information is lost when using Q to represent P; it is one measure of how close Q is to P. If the encoder outputs representations Z that differ from those of a standard normal distribution, it receives a penalty in the loss. This regularizer term means: keep the representation Z of each digit sufficiently diverse. If we didn't include the regularizer, the encoder could learn to cheat and give each data point a representation in a different region of Euclidean space.
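Putting the two terms together (the standard variational Auto Encoder loss, written here with encoder parameters θ and decoder parameters φ as in the text, and a standard normal prior p(z) = N(0, I)), the loss for the i-th data point is:

\[
l_i(\theta, \phi) =
-\,\mathbb{E}_{z \sim q_{\theta}(z \mid x_i)}\!\big[\log p_{\phi}(x_i \mid z)\big]
\;+\; \mathrm{KL}\!\big(q_{\theta}(z \mid x_i)\,\big\Vert\, p(z)\big)
\]

and the total loss is the sum of l_i over all data points.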
Types of auto-encoders:
1. Convolution Auto Encoders
2. Sparse Auto Encoders
3. Deep Auto Encoders
4. Contractive Auto Encoders
Convolution Auto Encoders
The convolution operator allows filtering an input signal in order to extract some parts of its content. Auto Encoders in their traditional formulation do not take into account the fact that a signal can be seen as a sum of other signals. Convolutional Auto Encoders use the convolution operator to exploit this observation: they learn to encode the input as a set of simple signals and then try to reconstruct the input from them, modifying the geometry or the reflectance of the image. The encoder consists of three convolutional layers; the number of feature maps changes from 1 (the input data) to 16 for the first convolutional layer, then from 16 to 32 for the second layer, and finally from 32 to 64 for the final convolutional layer. While moving from one convolutional layer to the next, the shape undergoes compression. The decoder consists of three de-convolution layers arranged in sequence; for each de-convolution operation we reduce the number of feature maps to obtain an image that must be the same size as the original image, so in addition to reducing the number of features, de-convolution involves a shape transformation of the images. In the continuous case, a convolution is defined as the integral of the product of two functions after one is reversed and shifted; as a result, a convolution produces a new function, and it is a commutative operation. In discrete space, the convolution operation is defined by the following equation:
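(The equation itself is not reproduced in the source; the standard definition of the discrete convolution is:)

\[
(f * g)[n] \;=\; \sum_{m=-\infty}^{\infty} f[m]\, g[n-m] \;=\; \sum_{m=-\infty}^{\infty} f[n-m]\, g[m]
\]

The architecture described above can be sketched in code. This is a minimal, hedged sketch assuming Keras and 32-by-32 grayscale inputs (the input size is an assumption, chosen so the stride-2 layers divide evenly); the feature-map progression 1 → 16 → 32 → 64 and the three de-convolution layers follow the description above.

```python
from tensorflow.keras import layers, Model

# Encoder: three convolutional layers, feature maps 1 -> 16 -> 32 -> 64.
# Each stride-2 convolution halves the spatial size (the shape "compression").
inputs = layers.Input(shape=(32, 32, 1))
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
encoded = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)

# Decoder: three transposed-convolution (de-convolution) layers that reduce
# the number of feature maps back to 1 and restore the original 32x32 shape.
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(encoded)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
decoded = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(x)

conv_autoencoder = Model(inputs, decoded)
conv_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```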
Use cases of convolution Auto Encoders:
- Image Reconstruction
- Image Colorization
Use Case 1: Image Reconstruction
The first one is image reconstruction. Convolutional Auto Encoders learn to remove noise from a picture or to reconstruct missing parts, so the noisy input version becomes a clean output version; the network also fills gaps in the image.
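As a hedged usage sketch of the denoising idea, reusing the conv_autoencoder model from the previous sketch and a hypothetical array of clean images scaled to [0, 1]: corrupt the inputs with noise and train the network to map the noisy version back to the clean one.

```python
import numpy as np

# Hypothetical stand-in data: 256 clean 32x32 grayscale images scaled to [0, 1].
clean_images = np.random.rand(256, 32, 32, 1).astype("float32")

# Add Gaussian noise to create the corrupted inputs.
noisy_images = np.clip(clean_images + 0.3 * np.random.normal(size=clean_images.shape),
                       0.0, 1.0).astype("float32")

# Noisy images in, clean images as the target: the network learns to denoise.
# 'conv_autoencoder' is the model defined in the earlier sketch.
conv_autoencoder.fit(noisy_images, clean_images, epochs=10, batch_size=32)
```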
Use Case 2: Image Colorization
Convolutional Auto Encoders can map the circles and squares in an image to the same image, but coloured red and blue respectively. Purple is sometimes formed because of a blend of colours, where the network hesitates between circle and square.
Sparse Auto Encoders:
Sparse Auto Encoders offer us an alternative method for introducing an information bottleneck without requiring a reduction in the number of nodes in our hidden layers. Instead, we construct a loss function that penalizes activations within a layer, so that for any given observation we encourage the network to learn an encoding and decoding which relies on activating only a small number of neurons. This is a different approach to regularization, as we normally regularize the weights of a network, not the activations.
There are two main ways by which we can impose this sparsity constraint. Both involve measuring the hidden-layer activations for each training batch and adding some term to the loss function in order to penalize excessive activations. The first is L1 regularization:
We can add a term to our loss function that penalizes the absolute value of the vector of activations a in layer h for observation i, scaled by a tuning parameter λ. The second is the KL divergence. In essence, the KL divergence is a measure of the difference between two probability distributions. We can define a sparsity parameter ρ which denotes the desired average activation of a neuron over a collection of samples; the observed average activation can be calculated as in the following equation.
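(The referenced equations are not reproduced in the source; in standard notation, the L1 activation penalty and the average activation of hidden unit j over m samples are:)

\[
\mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \lambda \sum_{i} \big|\, a_i^{(h)} \big|,
\qquad
\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(h)}\!\big(x^{(i)}\big)
\]

In Keras, for example, the L1 variant can be sketched by regularizing the activations of the hidden layer directly (λ = 1e-5 is an arbitrary assumption):

```python
from tensorflow.keras import layers, regularizers

# Penalize the hidden activations (not the weights) with an L1 activity regularizer.
sparse_hidden = layers.Dense(64, activation="relu",
                             activity_regularizer=regularizers.l1(1e-5))
```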
The KL divergence between two Bernoulli distributions can then be written as shown below; this penalty is smallest when the average activation matches the ideal target, for example ρ = 0.2.
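(In standard notation, with n_h hidden units, target sparsity ρ, and average activations ρ̂_j as defined above:)

\[
\sum_{j=1}^{n_h} \mathrm{KL}\big(\rho \,\big\Vert\, \hat{\rho}_j\big)
= \sum_{j=1}^{n_h} \Big[\, \rho \log\frac{\rho}{\hat{\rho}_j}
+ (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_j} \,\Big]
\]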
Deep Auto Encoders:
The deep Auto Encoder is an extension of the simple Auto Encoder; the only difference from its simpler counterpart is the number of hidden layers. The additional hidden layers enable the Auto Encoder to learn mathematically more complex underlying patterns in the data. The first layer of the deep Auto Encoder may learn first-order features in the raw input; the second layer may learn second-order features corresponding to patterns in the appearance of first-order features; deeper layers tend to learn even higher-order features. To put everything together: we need the additional layers to be able to handle more complex data, such as the data we use in collaborative filtering.
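A minimal sketch of the idea, assuming Keras and 784-dimensional inputs (e.g. flattened 28x28 images); the specific layer widths are illustrative assumptions, not prescribed by the text:

```python
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(784,))
x = layers.Dense(256, activation="relu")(inputs)   # first-order features
x = layers.Dense(64, activation="relu")(x)         # higher-order features
code = layers.Dense(32, activation="relu")(x)      # deepest, most abstract code

x = layers.Dense(64, activation="relu")(code)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(784, activation="sigmoid")(x)

deep_autoencoder = Model(inputs, outputs)
deep_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```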
Use cases of Deep Auto Encoders
- Image Search
- Data Compression
- Topic Modeling & Information Retrieval (IR)
Use Case 1: Image Search
Deep Auto Encoders are capable of compressing images into 30-number vectors. Image search therefore becomes a matter of uploading an image, which the search engine then compresses to 30 numbers and compares that vector to all the others in its index; vectors containing similar numbers are returned for the search query and translated into their matching images.
Use Case 2: Data Compression
A more general case of image compression is data compression, and deep Auto Encoders are useful for semantic hashing.
Use Case 3: Topic Modeling & Information Retrieval (IR)
Deep Auto Encoders can be used for statistically modeling abstract topics that are distributed across a collection of documents. In brief, each document in the collection is converted to a bag of words, and those word counts are scaled to decimals between 0 and 1, which may be thought of as the probability of a word occurring in the document. For example, one document could be a question and others could be the answers; the software would find matches using vector-space measurements.
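A hedged end-to-end sketch of that pipeline, assuming scikit-learn and Keras; the documents, the code size of 8, and the training settings are all illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tensorflow.keras import layers, Model

# Hypothetical document collection.
docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock prices fell sharply today"]

# Bag-of-words counts scaled into [0, 1] per document, as described above.
counts = CountVectorizer().fit_transform(docs).toarray().astype("float32")
scaled = counts / counts.max(axis=1, keepdims=True)

# Small dense autoencoder that compresses each document vector into a short code.
vocab_size = int(scaled.shape[1])
inputs = layers.Input(shape=(vocab_size,))
code = layers.Dense(8, activation="relu")(inputs)
outputs = layers.Dense(vocab_size, activation="sigmoid")(code)

doc_autoencoder = Model(inputs, outputs)
doc_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
doc_autoencoder.fit(scaled, scaled, epochs=50, verbose=0)

# Retrieval: compare documents by the similarity of their learned codes.
encoder = Model(inputs, code)
codes = encoder.predict(scaled, verbose=0)
print(cosine_similarity(codes))   # pairwise vector-space similarities
```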
Contractive Auto Encoders:
A contractive Auto Encoder is an unsupervised deep learning technique that helps a neural network encode unlabeled training data. This is accomplished by constructing a loss term which penalizes large derivatives of the hidden-layer activations with respect to the input training examples, essentially penalizing instances where a small change in the input leads to a large change in the encoding space.
With Auto Encoders, one would expect that for very similar inputs the learned encoding would also be very similar. We can explicitly train our model to make this the case by requiring that the derivative of the hidden-layer activations be small with respect to the input; in other words, for small changes to the input we should still maintain a very similar encoded state. This is quite similar to a denoising Auto Encoder, in the sense that these small perturbations to the input are essentially considered noise and we would like our model to be robust against noise. We accomplish this by constructing a loss term which penalizes large derivatives of the hidden-layer activations with respect to the input training examples, essentially penalizing instances where a small change in the input leads to a large change in the encoding space. In more formal terms, we craft a regularization loss term as the squared Frobenius norm of the Jacobian matrix J of the hidden-layer activations with respect to the input observations. The Frobenius norm is essentially an L2 norm for a matrix, and the Jacobian matrix simply represents all first-order partial derivatives of a vector-valued function. For m observations and n_h hidden nodes, we can calculate the values using the following equations:
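(The referenced equations are not reproduced in the source; in standard notation, for an n-dimensional input x, n_h hidden activations h_j, and m observations, the squared Frobenius norm of the Jacobian and the resulting loss are:)

\[
\big\lVert J_h(x) \big\rVert_F^{2}
= \sum_{i=1}^{n} \sum_{j=1}^{n_h}
\left( \frac{\partial h_j(x)}{\partial x_i} \right)^{2},
\qquad
\mathcal{L} = \sum_{k=1}^{m}
\Big[ \mathcal{L}_{\text{reconstruction}}\big(x^{(k)}, \hat{x}^{(k)}\big)
+ \lambda \,\big\lVert J_h\big(x^{(k)}\big) \big\rVert_F^{2} \Big]
\]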