Understanding Neural Networks by Building a Language Model from Scratch

Chat-based generative AI is remarkable to interact with, and understanding how it works under the hood is even more rewarding. To better understand the core concepts behind this technology, I broke it down to basics and implemented a lightweight but functional language model of my own with just a few Java classes. This article describes the classes necessary to create a rudimentary language model, with all code open-sourced and available on GitHub (https://github.com/jkanalakis/LLMed). To be clear, my implementation is far less sophisticated than those from OpenAI and Anthropic and is only intended to explain how the core components work. But it is functional, and it highlights many of the challenges of production LLMs and how much sophistication goes into productizing them.

My implementation of a basic language model applies a layered neural network architecture consisting of only a few classes. These classes include the NeuralNetwork class for setting up the overall architecture, the DenseLayer class for building individual layers of the network, the TextCorpus class for handling the source textual data, and a few other support components like TextGenerator, ModelTrainer, and ModelManager.

So What is a Language Model?

A language model is essentially an algorithm that predicts the likelihood of a sequence of words. In my implementation, the language model learns to predict the next word in a sentence based on the context provided by the previous words. For example, if a sentence begins with “A car has four,” the language model should predict the next word to be “wheels”.
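
To make the idea concrete, here is a minimal, hypothetical sketch (not taken from the repository) of what "predicting the next word" boils down to: the network produces a probability for every word in the vocabulary, and the most likely one is chosen.

```java
// Hypothetical illustration: pick the most probable next word from the
// network's output distribution over the vocabulary.
public class NextWordExample {
    public static void main(String[] args) {
        String[] vocabulary = {"wheels", "doors", "windows", "engines"};
        // Pretend these probabilities came from a trained network given "A car has four".
        double[] probabilities = {0.71, 0.14, 0.10, 0.05};

        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        System.out.println("Predicted next word: " + vocabulary[best]); // "wheels"
    }
}
```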

This is accomplished with a neural network, a learning architecture loosely inspired by how the human brain functions. The NeuralNetwork class in my project creates the high-level structure of the entire network. That structure is composed of vertical layers of neurons, the basic computational units, each capable of performing only very simple calculations. Data flows into the input layer, passes through multiple hidden layers, and then exits through the output layer. When these neurons work together across layers, they can identify complex patterns in data, like images or the structure of a language.

The DenseLayer class represents each of these vertical layers, a set of neurons. Each neuron is connected to every neuron in the previous and following layers. These connections have different weights that adjust during the training process so the network gets better at predicting the next word in a sequence.

The specific type of neural network I implemented is known as a feedforward neural network. This means that data flows in only one direction: forward from the input nodes, through the hidden layers, to the output nodes. There are no cycles or loops in the data flow. This keeps the architecture straightforward and works well for a simple next-word predictor, since the words in a sentence are read in a single direction.
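
A minimal sketch of that one-way flow appears below. The Layer interface here is only a stand-in for the project's DenseLayer, and forward() is an assumed method that computes one layer's activations.

```java
import java.util.List;

// Sketch of a feedforward pass: the input vector is transformed by each layer
// in order, with no loops or cycles in the data flow.
public class FeedforwardSketch {
    interface Layer {
        double[] forward(double[] inputs);
    }

    public static double[] predict(double[] input, List<Layer> layers) {
        double[] activations = input;
        for (Layer layer : layers) {
            activations = layer.forward(activations); // one layer's output feeds the next
        }
        return activations; // final layer output, e.g. scores over the vocabulary
    }
}
```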

The model training process is known as supervised learning. The ModelTrainer class orchestrates the learning process by adjusting the weights of each neuron across the entire neural network to minimize the difference between the predicted next word and the actual next word in the training data. This iterative process is repeated many times over the training data (each full pass is referred to as an epoch), gradually improving the model's ability to predict text.
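
The "difference between predicted and actual" is measured by a loss function. The sketch below uses a cross-entropy-style loss as an illustration; it is not necessarily the exact loss used in the repository.

```java
// Sketch of the supervised-learning idea: the lower the probability the model
// assigns to the actual next word, the larger the error to be minimized.
public class SupervisedLearningSketch {
    static double loss(double[] predictedProbabilities, int actualNextWordIndex) {
        return -Math.log(predictedProbabilities[actualNextWordIndex]);
    }

    public static void main(String[] args) {
        double[] prediction = {0.7, 0.2, 0.1};   // probabilities over a toy 3-word vocabulary
        System.out.println(loss(prediction, 0)); // small error: the right word was likely
        System.out.println(loss(prediction, 2)); // large error: the right word was unlikely
    }
}
```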

Exploring the Code

The code for the language model essentially comes down to just a few classes that manage the language source text, define the network, train the network, and generate new text.

Managing Source Text with TextCorpus

The TextCorpus class is responsible for handling the textual data that the language model learns from. It is essentially the starting point of the language model. This class loads the text data, in my case several public domain books, and preprocesses it for the neural network.

The first books I trained the model with were Victorian era novels, since they were freely accessible. But they weren't good sources for training because that's not the style I wanted the model to produce. So the first lesson learned was that you need really good reference text to train a language model, and you need a lot of it. Training with one book kind of worked, and training with nine books worked a lot better. The downside is that more source text means longer training time.

This class is also responsible for preprocessing the source text. Preprocessing involves tasks such as stripping out invalid characters, breaking text into individual words (tokenization), building up a vocabulary of all unique words, and converting those words into numerical indices that the neural network can process. This step is critical because neural networks fundamentally understand numbers, not text. When you look at the code, you'll note that I only apply very basic preprocessing rules, like replacing hyphens with spaces and expanding contractions into separate words (for example, "can't" becomes "can not").
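
A minimal sketch of the kind of preprocessing described above (hypothetical code, not copied from TextCorpus): lowercase the text, expand a contraction, split on whitespace, and map each unique word to a numeric index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical preprocessing sketch: tokenize text and build a word-to-index vocabulary.
public class PreprocessSketch {
    public static void main(String[] args) {
        String text = "The car can't start. The car has four wheels.";

        // Very basic cleanup: expand a contraction, drop punctuation, lowercase.
        String cleaned = text.toLowerCase()
                .replace("can't", "can not")
                .replaceAll("[^a-z\\s]", " ");

        // Tokenization: split the text into individual words.
        String[] tokens = cleaned.trim().split("\\s+");

        // Build a vocabulary of unique words mapped to numeric indices.
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        List<Integer> indices = new ArrayList<>();
        for (String token : tokens) {
            vocabulary.putIfAbsent(token, vocabulary.size());
            indices.add(vocabulary.get(token));
        }

        System.out.println("Vocabulary: " + vocabulary);
        System.out.println("Encoded text: " + indices);
    }
}
```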

Orchestrating Data Flow with NeuralNetwork

The NeuralNetwork class sits at the heart of the language model. It represents the entire neural network architecture, which includes initializing and managing the layers of neurons (via the DenseLayer class), learning via forward and backward propagation of data, and handling word embeddings (how words or phrases are represented as numbers). The NeuralNetwork class manages the flow of data through the network. It initializes the network with a specified number of layers and a specified number of nodes per layer, which sets the stage for the complexity and depth of the model. It takes some experimentation to pinpoint how many layers, and how many neurons within each layer, are necessary to get the language model to produce coherent results. Each layer within the network serves a purpose, which is managed through the DenseLayer class.
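
A sketch of how a network like this might be assembled from a list of layer sizes. LayerSpec is only a stand-in for the project's DenseLayer; the point is that each layer's input width is the previous layer's output width.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of network construction: one layer per entry in hiddenSizes, plus an
// output layer with one score per word in the vocabulary.
public class NetworkSketch {
    record LayerSpec(int inputs, int neurons) {}

    static List<LayerSpec> buildLayers(int embeddingSize, int[] hiddenSizes, int vocabularySize) {
        List<LayerSpec> layers = new ArrayList<>();
        int inputSize = embeddingSize;                        // width of the vector entering the first layer
        for (int size : hiddenSizes) {
            layers.add(new LayerSpec(inputSize, size));       // hidden layer
            inputSize = size;
        }
        layers.add(new LayerSpec(inputSize, vocabularySize)); // output layer
        return layers;
    }

    public static void main(String[] args) {
        System.out.println(buildLayers(150, new int[]{500, 500}, 12000));
    }
}
```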

Packing Neurons with DenseLayer

The DenseLayer class represents a single layer of neurons in the neural network. As mentioned, each neuron within a dense layer is connected to every neuron in the previous and next layers, hence the term 'dense'. This class is essential for the neural network's learning process since this is where the actual computations happen. The neurons in these layers apply an activation function to their inputs to introduce non-linear properties to the model, which is important since language data is inherently non-linear. A classic activation function for neural networks is the sigmoid function. Each DenseLayer maintains its own set of weights and biases, and these parameters are adjusted during the training process. It's these iterative adjustments that define the 'learning' in machine learning. The layer's methods handle tasks like calculating the output of the neurons, applying activation functions, and updating weights during backpropagation.
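
A minimal sketch of what a dense layer's forward computation looks like, using the sigmoid activation mentioned above. The field layout and names here are illustrative, not the repository's actual fields.

```java
// Sketch of a dense layer's forward pass: weighted sum of inputs plus a bias,
// squashed through the sigmoid activation to introduce non-linearity.
public class DenseLayerSketch {
    private final double[][] weights; // [neuron][input]
    private final double[] biases;    // one bias per neuron

    public DenseLayerSketch(double[][] weights, double[] biases) {
        this.weights = weights;
        this.biases = biases;
    }

    public double[] forward(double[] inputs) {
        double[] outputs = new double[biases.length];
        for (int n = 0; n < biases.length; n++) {
            double sum = biases[n];
            for (int i = 0; i < inputs.length; i++) {
                sum += weights[n][i] * inputs[i]; // every neuron sees every input: "dense"
            }
            outputs[n] = sigmoid(sum);
        }
        return outputs;
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}
```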

Sending Text through the Neural Network with ModelTrainer

Training a neural network is a complex task, and the ModelTrainer class abstracts this complexity. It ties together the source text corpus, the neural network, and the training process. The class uses forward propagation to feed input data through the network, calculates errors with a loss function, and then adjusts the weights of the network via backpropagation to minimize those errors. This training is repeated over multiple iterations, known as epochs. With each epoch, the model ideally gets better at predicting the next word of text. It's also worth noting that the ModelTrainer class manages mini-batch training, a technique where the training data is divided into smaller batches and the weights are updated once per batch. This approach is more computationally efficient than updating after every single example and can lead to faster convergence, the point at which the model has learned about as much as it can from the training data.
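
A rough sketch of mini-batch training is below. The Trainable interface stands in for the project's NeuralNetwork, and all method names are illustrative: gradients are accumulated over a small batch, then a single weight update is applied per batch.

```java
import java.util.List;

// Sketch of mini-batch training: accumulate gradients over a batch of examples,
// then apply one weight update per batch.
public class MiniBatchSketch {
    interface Trainable {
        void zeroGradients();
        void accumulateGradients(int[] contextWordIndices, int actualNextWordIndex);
        void applyGradients(double learningRate);
    }

    record Example(int[] contextWordIndices, int actualNextWordIndex) {}

    static void train(Trainable network, List<Example> data,
                      int epochs, int batchSize, double learningRate) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int start = 0; start < data.size(); start += batchSize) {
                List<Example> batch = data.subList(start, Math.min(start + batchSize, data.size()));
                network.zeroGradients();
                for (Example example : batch) {
                    network.accumulateGradients(example.contextWordIndices(),
                                                example.actualNextWordIndex());
                }
                network.applyGradients(learningRate); // one weight update per mini-batch
            }
        }
    }
}
```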

Creating Output with TextGenerator

Once the neural network is trained, the TextGenerator class uses the trained model to generate new text or complete given text fragments. It takes an input string, encodes it into a format the neural network can understand, feeds it through the network, and decodes the output back into human-readable text. The TextGenerator handles all of the nuances around text generation, like managing the randomness (temperature) of the generated text. A higher temperature leads to more creative, but potentially less coherent, text. A lower temperature results in more predictable text.
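
A minimal sketch of temperature-based sampling over the network's raw output scores (logits): dividing by the temperature before the softmax sharpens the distribution when the temperature is low and flattens it when the temperature is high. This is a common approach, not necessarily the exact code in TextGenerator.

```java
import java.util.Random;

// Sketch of temperature-based sampling over the network's output scores.
public class TemperatureSamplingSketch {
    public static int sample(double[] logits, double temperature, Random random) {
        // Scale the scores: low temperature sharpens, high temperature flattens.
        double[] scaled = new double[logits.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < logits.length; i++) {
            scaled[i] = logits[i] / temperature;
            max = Math.max(max, scaled[i]);
        }

        // Softmax (shifted by the max score for numerical stability).
        double sum = 0.0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(scaled[i] - max);
            sum += probs[i];
        }

        // Draw one word index according to the resulting distribution.
        double r = random.nextDouble() * sum;
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (r <= cumulative) {
                return i;
            }
        }
        return probs.length - 1;
    }
}
```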

Saving Time with the ModelManager

Finally, the ModelManager class simply persists the trained model to disk and loads it back when required. I added this when training the model with more and more text began taking hours to process. So this functionality is essential for practical applications and enables the reuse of trained models without the need to retrain them each time. The class just uses Java’s serialization functions to save the entire state of a NeuralNetwork object to a file and load it back.
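
A sketch of what that persistence boils down to with plain Java serialization. The model object (the project's NeuralNetwork, in this case) is assumed to implement Serializable.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of persisting a trained model with Java serialization.
public class ModelPersistenceSketch {
    public static void save(Serializable model, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(model); // writes the entire object graph, weights and all
        }
    }

    public static Object load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return in.readObject(); // caller casts back to the model's class
        }
    }
}
```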

Pulling Everything Together

These classes work together to create, train, and use a complete language model. The process typically begins with the TextCorpus class, which loads and preprocesses the text data. The NeuralNetwork class, with its DenseLayers, then takes this data to learn the patterns that words form within the text. The ModelTrainer class trains the network by adjusting weights based on the error between the network's predictions and the actual source data. After training, the TextGenerator class uses the model to generate new text or complete some initial text. When the language model is tuned and the output looks good, the ModelManager class saves the fully trained state of the model so it can be reloaded later.
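
Put together, the flow looks roughly like the sketch below. All constructor and method signatures here are illustrative, not the repository's actual APIs; the values mirror the settings discussed later in this article.

```java
// Hypothetical end-to-end flow tying the classes together.
public static void main(String[] args) throws Exception {
    TextCorpus corpus = new TextCorpus();
    corpus.loadAndPreprocess("books/");                 // load and tokenize the source text

    NeuralNetwork network =
            new NeuralNetwork(150, new int[]{500, 500}, corpus.vocabularySize());

    ModelTrainer trainer = new ModelTrainer(network, corpus);
    trainer.train(20, 8, 0.006);                        // epochs, batch size, learning rate

    TextGenerator generator = new TextGenerator(network, corpus);
    System.out.println(generator.generate("A car has four", 0.5)); // 0.5 = temperature

    ModelManager.save(network, "model.bin");            // persist the trained model for reuse
}
```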

Tuning the Language Model

Fine-tuning the language model is a balance between underfitting, where the model never produces the results you want, and overfitting. Overfitting means the model performs well on its training data but does not perform well with new, unseen data. It's like memorizing the answers to specific questions for an exam without understanding the core concepts needed to answer different questions. Fine-tuning the language model effectively requires understanding and adjusting several key variables. These variables significantly impact the model's learning process, efficiency, and the quality of its output. The Main class presents all of the following variables at the very top of the code for easy experimentation (a sketch of such constants appears after this list).

Epochs: An epoch represents one complete pass through the entire training dataset. More epochs allow the model to learn from the data repeatedly, which can improve its accuracy. However, too many epochs can lead to overfitting, where the model performs well on training data but poorly on new, unseen data. I started with 5 for faster development but landed on 20 after watching the error measurement level off.

Embedding Size: Embeddings are essentially the numerical representations of your source data in a neural network. In coding terms, it’s the size of the vector space in which words are represented. Larger embeddings can capture more detailed relationships between words but also take longer to process and could result in overfitting. I looked for an embedding size that balanced accuracy with processing time, mostly in the range between 50 and 300.

Layer Sizes: Layer sizes define the number of neurons within each layer of the neural network. More neurons in a layer can increase the model's capacity to learn complex patterns, but also increase the risk of overfitting and take more time to process. I started with a small layer size, about 25 neurons, to quickly train the model and see immediate output within seconds, but that output was incoherent. So I landed on a pretty large layer size (closer to 500 neurons) to get better output, even though it takes a couple of hours to train the model. Deep learning models often work best with many layers that each have a higher number of neurons.

Learning Rate: The learning rate controls how much to change the model based on the estimated error each time the model weights are updated. A smaller learning rate may lead to more precise convergence but slows down the training process. I started with a higher rate at 0.01 and eventually brought it down to about 0.006 for better precision. A better, more complicated solution is to implement a learning rate scheduler or adaptive learning rate solution to adjust the learning rate dynamically during training for better results.

Batch Size: The batch size determines how many training samples to work through before updating the model’s internal parameters. A smaller batch size provides a more frequent update with a higher computational effort, whereas a larger batch size offers more stable but less frequent updates. Larger batch sizes can be more efficient on GPUs, but consume a lot of memory. A batch size of 64 would be great, but my laptop could really only handle around 8.

Temperature: In language models, the temperature sets the randomness of predictions. A higher temperature value increases diversity but can reduce coherency, while a lower temperature value produces more predictable and conservative text. Since I'm more interested in consistent responses, I landed on a value of 0.5.
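
For reference, here is a hypothetical sketch of how these tuning constants might look at the top of a Main class. The values mirror the settings discussed above, not necessarily the repository's exact defaults.

```java
// Hypothetical tuning constants, mirroring the values discussed above.
public class Main {
    static final int EPOCHS = 20;              // passes through the full training set
    static final int EMBEDDING_SIZE = 150;     // dimensionality of word vectors (tried 50-300)
    static final int LAYER_SIZE = 500;         // neurons per hidden layer
    static final double LEARNING_RATE = 0.006; // step size for weight updates
    static final int BATCH_SIZE = 8;           // training samples per weight update
    static final double TEMPERATURE = 0.5;     // randomness of generated text
}
```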

Fine-tuning these variables significantly changes the language model's performance. It took some experimentation with different settings and monitoring of the results to find the optimal configuration for my use case. There's no one-size-fits-all setting; the optimal configuration depends on the nature of the data and the specific requirements of the model.

Language Model Takeaways

I found that developing a robust language model involved navigating several unexpected challenges, from data preprocessing and parameter tuning to training time and text output accuracy.

I learned that having a lot of source text, and effective preprocessing of that text, is paramount. I have a whole new appreciation for the amount of training text that goes into the large commercial LLMs that OpenAI and Anthropic have created. In my case, the TextCorpus class only loads a few public domain books, so that's all the model knows about the English language. A richer, more varied dataset drastically improves the model's learning capability, yet requires more processing time and more sophisticated data handling techniques.

I also learned that the shape of the NeuralNetwork and its DenseLayers is essential. Determining the optimal number of layers and neurons was a balancing act. Overly complex models led to overfitting, while simpler models simply didn't perform. Parameter tuning, including learning rates and batch sizes, significantly impacted training time and model accuracy.

Text generation accuracy and coherence are what really matter. In the end, an LLM needs to produce output that makes sense to the user. Achieving high accuracy and coherence is challenging. It took multiple iterations and fine-tuning before the output started to make sense.

Overall, building a language model from scratch was both enlightening and challenging. The code available on GitHub demonstrates the key parts of a working language model, from text corpus processing to building the neural network, training, and text generation. It illustrates the core elements of machine learning with a basic implementation that shows how layers of neurons work together to process and generate human-like language. I hope you find the project useful for understanding the internals of neural networks and LLMs.

Link to source code: https://github.com/jkanalakis/LLMed

