The ANN-cillary of Deep Learning

Due to the impressive capabilities of GPT, there is considerable interest in Large Language Models (LLMs), also known as Foundation Models. LLMs are an application of Deep Learning, a sub-domain of Machine Learning. As technologists, we are venturing into a space that is well understood by only a few experts. If we want to commoditise this phenomenal technology and deploy it in real use cases, we need to understand it in greater depth. In my quest for an introductory understanding of Deep Learning, I devoured several white papers and research papers, but most of the content is still very technical and not easily digestible. Before delving into the inner workings of LLMs (which I hope to cover in my next article), I believe we should first learn a set of terms that will lay the path for the more specific topics.

An Artificial Neural Network (ANN) is the primary technology behind LLMs and is inspired by biological neural networks. An ANN comprises artificial neurons that loosely replicate the neurons in the human brain. Just as the neurons in our brains are connected by a network of synapses, artificial neurons are connected by links (implemented in software or hardware) that transmit signals between them. Biologically, as our learning evolves and we build an understanding of something over time, our brain adds and adjusts 'weights'. Connections with weights close to zero have less influence on later decisions than connections with larger weights. Along with weights, there is a 'bias', which helps control the output of a neuron independently of its input, in effect shifting the threshold at which it fires. Lastly, there is an 'activation function' which, based on the combination of inputs, weights, and bias, decides whether a neuron fires and what output it produces. This is a simplistic view of a biological neural network.
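
To make weights, bias and activation function concrete, here is a minimal sketch in Python (with illustrative numbers of my own choosing, not from any real model) of a single artificial neuron:

```python
import numpy as np

def sigmoid(x):
    # Activation function: squashes any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Three inputs to the neuron (e.g. three features of a data point)
inputs = np.array([0.5, -1.2, 3.0])

# Weights decide how much each input matters; the bias shifts the firing threshold
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2

# Weighted sum of inputs plus bias, passed through the activation function
output = sigmoid(np.dot(inputs, weights) + bias)
print(output)  # a value between 0 and 1 - the neuron's "firing strength"
```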

An ANN replicates these concepts in software and hardware. As the number of parameters (the tunable weights and biases) increases, the model can capture more complex patterns and relationships within the data, but this comes at the cost of heavy computational resources. GPT-3 has 175 billion parameters, and OpenAI and Google Brain are targeting trillions of parameters in their next models, a scale often compared to the connectivity of the human brain.

[Figure: parameter size of each model]


Training an ANN model

By now, you have a basic understanding of neural networks and their elements. Now let's look at a few terms related to training an ANN:

Feature: A feature is an individual measurable property or characteristic of the data used for training. For example, if the training data describes types of trees in the UK, the features could be the number of stems, crown size, leaf type, surrounding area and so on. Identifying the right features in a dataset is the most fundamental part of training a model. For a machine to work with these features, they are converted into a 'feature vector' through a process called 'feature engineering'. With recent advances in feature learning, machines are increasingly capable of identifying new features in a dataset and learning patterns about them on their own.
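
As a small illustrative sketch (the feature names and values below are hypothetical, not from a real tree dataset), turning raw attributes into a feature vector might look like this:

```python
import numpy as np

# Hypothetical raw record for one tree (illustrative feature names only)
tree = {"stems": 1, "crown_width_m": 4.5, "leaf_type": "broadleaf"}

# Simple feature engineering: encode the categorical leaf type as 0/1 flags
leaf_types = ["broadleaf", "needle"]
leaf_one_hot = [1.0 if tree["leaf_type"] == t else 0.0 for t in leaf_types]

# The resulting feature vector is what the model actually consumes
feature_vector = np.array([tree["stems"], tree["crown_width_m"]] + leaf_one_hot)
print(feature_vector)  # e.g. [1.  4.5 1.  0. ]
```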

One such technique for feature engineering is word embedding, which represents each word as a vector with several dimensions. The real-numbered vector encodes the meaning of the word, which helps the machine identify words with similar meanings as nearby points in vector space. Embedding projector - visualization of high-dimensional data (tensorflow.org) is an excellent tool for demonstrating the proximity of each word to its synonyms and semantic relatives; it shows the cosine or Euclidean distance between words. For example, if you search for 'Hitler', it will show high closeness to other dictators such as Stalin and Mussolini. There are many free, open-source pre-trained word embedding models, e.g. word2vec and GloVe, and you can also train these models on your own data source.
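
As a rough sketch of how such embeddings can be trained, assuming the open-source gensim library (version 4 or later) is installed, a tiny word2vec model could be built like this; the corpus below is a made-up toy, whereas real models are trained on billions of words:

```python
from gensim.models import Word2Vec  # assumes gensim >= 4.0 is installed

# A tiny toy corpus - real word2vec models are trained on billions of words
corpus = [
    ["kings", "rule", "kingdoms"],
    ["queens", "rule", "kingdoms"],
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
]

# Each word becomes a 50-dimensional vector learned from its surrounding context
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=200)

# Words used in similar contexts end up close together (by cosine similarity)
print(model.wv.most_similar("kings", topn=3))
print(model.wv["kings"][:5])  # first few dimensions of the embedding vector
```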

Backpropagation: Backpropagation allows the neural network to adjust its parameters (weights and biases) to minimise the discrepancy between the predicted output and the actual output during training. This helps the network learn and improve its performance over time, ultimately making more accurate predictions on unseen data. When the same process is applied to a model that has already been pre-trained, in order to adapt it to a specific task or dataset, it is commonly called 'fine-tuning', an approach widely used in recent years.
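
In a deep network, backpropagation applies the chain rule layer by layer; the toy example below (my own illustrative numbers, a single neuron fitting y = 2x + 1) shows the essential loop of measuring the error, computing gradients and nudging the weight and bias to reduce it:

```python
import numpy as np

# Toy data: the relationship we want the neuron to learn is y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

weight, bias = 0.0, 0.0   # parameters start with no knowledge
learning_rate = 0.05

for step in range(2000):
    prediction = weight * x + bias
    error = prediction - y                 # discrepancy between predicted and actual output

    # "Backward pass": gradients of the mean squared error w.r.t. each parameter
    grad_weight = 2 * np.mean(error * x)
    grad_bias = 2 * np.mean(error)

    # Nudge the parameters in the direction that reduces the error
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(round(weight, 2), round(bias, 2))  # approaches 2.0 and 1.0
```

Fine-tuning runs exactly this kind of loop, but starting from the weights of an already pre-trained model rather than from scratch.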

One-shot (1S), few-shot (FS) and zero-shot (ZS): The drawback of methods like fine-tuning is that they require a large training dataset (labelled or unlabelled) to train the model and adjust its parameters. To counter this, newer techniques train with zero, one or only a few examples. This replicates the human tendency to learn and synthesise new categories of objects from previously learned categories, and it also reduces the biases a machine can pick up from a narrowly distributed large dataset. 1S is closer to the way tasks are communicated to humans: you show them once and they understand what needs to be done. FS provides the machine with a very limited number of examples, typically between 10 and 100. ZS is the newest approach, where no examples are given at all, just a natural-language instruction describing what needs to be done. The figure below compares ZS, 1S and FS with the traditional fine-tuning method, and a small prompt sketch follows it. An important point to note is that in fine-tuning there are no instructions about the task; the machine is fed a big dataset to learn from without understanding the ask. In the new approaches, the machine is told what needs to be done and then given a few examples - clearly a more flexible, self-directed mechanism, though it comes with its own challenges.

[Figure: 1S, FS and ZS in comparison to fine-tuning]
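
To make the difference concrete, here is a minimal sketch with made-up prompt text showing how zero-shot, one-shot and few-shot prompts for a translation task might be laid out; the model's parameters are never updated, only the text it is shown changes:

```python
# Zero-shot: only a natural-language instruction, no examples
zero_shot = "Translate English to French:\ncheese =>"

# One-shot: the instruction plus a single worked example
one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

# Few-shot: the instruction plus a handful of examples (typically 10-100 in practice)
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

# Any of these strings would be sent as the prompt to a pre-trained language model,
# which simply completes the text - no fine-tuning of parameters is involved.
print(few_shot)
```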

Well-known ANN architectures:

ANNs have evolved over time into various architectures. Some important ones are covered here:

Feedforward neural network (FNN): As the name suggests, it is a simple, unidirectional input-to-output node structure, with weights between the nodes calibrated so that the input is mapped to the required output. Its mathematical roots go back roughly two centuries to linear regression, a simple predictive model devised as a means of finding a good-enough linear fit to a set of points.
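
As a minimal sketch (random illustrative weights, not a trained model), a two-layer feedforward pass is just matrix multiplications with an activation in between, and data flows strictly forward:

```python
import numpy as np

def relu(x):
    # A common activation function: keep positive values, zero out negatives
    return np.maximum(0, x)

x = np.array([0.5, -1.0, 2.0])          # one input sample with three features

# Layer 1: 3 inputs -> 4 hidden neurons (weights chosen randomly here for illustration)
W1 = np.random.randn(3, 4)
b1 = np.zeros(4)

# Layer 2: 4 hidden neurons -> 1 output
W2 = np.random.randn(4, 1)
b2 = np.zeros(1)

hidden = relu(x @ W1 + b1)              # information flows forward only
output = hidden @ W2 + b2
print(output)
```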

Recurrent neural network (RNN): The RNN is a major advancement over the FNN in which nodes feed their outputs back into the network (see the fully connected RNN below). This gives the network its own 'memory', as it learns from some of its prior inputs to influence its current input and output. RNNs are best suited to applications involving sequential data, such as language translation or speech recognition. They work well when the number of elements in a sequence is not very large and there are not many long-term dependencies between the current input and something that occurred in the distant past. As the sequence grows, the training signal weakens or blows up over the many steps, which hampers the algorithm's ability to learn; further tuning of its parameters then has little or no impact. This problem is called the 'vanishing gradient' or 'exploding gradient' problem (the gradient being the signal that tells the network how to adjust its parameters to reduce the difference between its output and the correct output). To counter this, the LSTM was devised (covered below).
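
A minimal sketch of the recurrence (random toy weights, not a trained model) shows how the hidden state carries a 'memory' of the sequence from one step to the next:

```python
import numpy as np

# The same weights are reused at every position in the sequence,
# and the hidden state carries a summary ("memory") of everything seen so far.
np.random.seed(0)
W_in = np.random.randn(3, 5) * 0.1      # input (3 features) -> hidden (5 units)
W_hidden = np.random.randn(5, 5) * 0.1  # previous hidden state -> new hidden state
b = np.zeros(5)

sequence = [np.random.randn(3) for _ in range(4)]   # a toy sequence of 4 inputs
hidden = np.zeros(5)                                 # memory starts empty

for x_t in sequence:
    # New memory mixes the current input with the previous memory
    hidden = np.tanh(x_t @ W_in + hidden @ W_hidden + b)

print(hidden)  # the final state summarises the whole sequence
```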


Long short-term memory (LSTM): The idea behind the LSTM is to replicate the human habit of keeping a compact memory of what occurred in the past (whether distant or recent), instead of trying to remember every minute detail and overloading our memory. Hence, alongside the memory of an RNN, each cell was enhanced with 'gates' of three kinds: an input gate, an output gate and a forget gate, which effectively decide what can come in, what goes out and what can be forgotten because it is no longer relevant to the context. This significantly enhanced the performance of RNNs, especially when dealing with long-term dependencies. But LSTMs are complex architectures and still do not resolve very long-term dependencies completely. The RNN and its LSTM variant have had an amazing run within the ML space, being used extensively in NLP, robotics and speech recognition applications. With the advent of newer architectures such as the transformer, used by platforms like GPT, that may change.
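
For intuition, here is a rough single-step sketch of the three gates (random toy weights, biases omitted for brevity); it is not a production LSTM implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(1)
n_in, n_hidden = 3, 4
x_t = np.random.randn(n_in)          # current input
h_prev = np.zeros(n_hidden)          # previous output (short-term memory)
c_prev = np.zeros(n_hidden)          # previous cell state (long-term memory)

# One weight matrix per gate, acting on [input, previous output]
concat = np.concatenate([x_t, h_prev])
W_f, W_i, W_o, W_c = (np.random.randn(n_in + n_hidden, n_hidden) * 0.1 for _ in range(4))

forget_gate = sigmoid(concat @ W_f)        # what to erase from long-term memory
input_gate = sigmoid(concat @ W_i)         # what new information to let in
output_gate = sigmoid(concat @ W_o)        # what to expose as output
candidate = np.tanh(concat @ W_c)          # proposed new content

c_t = forget_gate * c_prev + input_gate * candidate    # updated long-term memory
h_t = output_gate * np.tanh(c_t)                       # updated short-term output
print(h_t)
```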

Transformers: RNNs and LSTMs are predominantly sequential processing models, i.e. they take one input at a time and generate one output at a time, which is slower and limits what they can learn. Transformers parallelise this by consuming all inputs at once and then doing something really clever with a combination of two main techniques, 'positional encoding' and 'self-attention'. When we give the input 'I love reading books on science, please show me some examples', the transformer consumes all the words in one go and derives an embedding vector for each, using the word embedding technique covered above; this helps the machine understand the words semantically. It then adds a positional encoding so that each word's representation also reflects where it sits within the sentence, giving the model the overall context. Finally, the self-attention mechanism derives attention weights that indicate where the model needs to focus, as in 'I need to pay more attention to...'. So, in the above example, the machine may focus on 'books', 'science', 'show' and 'examples' to help derive a response. This is a very high-level view of the model; if you are curious to go deeper, watch this video - Transformer Neural Networks - EXPLAINED! (Attention is all you need)
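
As a rough illustrative sketch (random toy vectors rather than a trained model), scaled dot-product self-attention can be written in a few lines: every word scores every other word and builds a weighted mix of the whole sentence:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(2)
n_words, d_model = 5, 8
# Toy word embeddings (positional encodings would normally be added to these first)
X = np.random.randn(n_words, d_model)

# Learned projections produce queries, keys and values (random here for illustration)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Each word scores every other word, scaled to keep the softmax well behaved
scores = Q @ K.T / np.sqrt(d_model)
attention_weights = softmax(scores)        # rows sum to 1: "where should I focus?"

# Each word's new representation is a weighted mix of all words' values
output = attention_weights @ V
print(attention_weights.round(2))
```

In a real transformer, this step is repeated across many attention 'heads' and layers, which is what lets the model juggle several kinds of context at once.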

Hope this has been helpful. This topic is highly technical and involves an understanding of core mathematical concepts, but I have tried to simplify it into plain language to make it more understandable.

A self-declaration: the title of this article was suggested by ChatGPT when I fed it this article.
