How to Train Large Language Models: A Survey of the Latest Techniques

1. Introduction

This article surveys the major methods used to train Large Language Models (LLMs) and the inference techniques employed with these models. LLMs are driving groundbreaking advances in Artificial Intelligence (AI). Built on the powerful Transformer architecture and trained on vast amounts of data, these models have demonstrated remarkable capabilities in language understanding, generation, and reasoning. Let's take a closer look.

1.1. Transformer Architecture: Core Building Blocks

The purpose of the Transformer architecture is to convert an input sequence into an output sequence. The architecture contains two main blocks: an encoder and a decoder.

The encoder is tasked with mapping an input sequence to a sequence of continuous representations, which is then fed into the decoder.

The decoder takes the output of the encoder, in combination with the decoder output from the previous time step, to generate an output sequence (https://machinelearningmastery.com/the-transformer-model/).
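As a concrete illustration, the following minimal PyTorch sketch wires an encoder and a decoder together using the library's built-in nn.Transformer module. All dimensions and the random inputs are arbitrary assumptions chosen for demonstration, not values from any production LLM.

```python
# A minimal sketch of the encoder-decoder flow using PyTorch's built-in
# nn.Transformer module. All dimensions and the random inputs below are
# arbitrary choices for demonstration, not values from any production LLM.
import torch
import torch.nn as nn

d_model = 512  # embedding size per token
transformer = nn.Transformer(
    d_model=d_model, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
)

# By default, nn.Transformer expects (sequence_length, batch, d_model).
src = torch.rand(10, 1, d_model)  # encoder input: a 10-token source sequence
tgt = torch.rand(7, 1, d_model)   # decoder input: the 7 tokens produced so far

# The encoder maps src to continuous representations; the decoder combines
# those representations with its own previous outputs to produce the output.
out = transformer(src, tgt)
print(out.shape)  # torch.Size([7, 1, 512])
```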

1.2. Attention Mechanism

A key feature of the Transformer architecture is the attention mechanism, which prioritizes essential elements of the input while down-weighting non-critical ones.

The most widely used attention mechanism in the Transformer architecture is scaled dot-product attention. This mechanism, in turn, consists of two parts (a minimal sketch of both follows the list below):

1. Self-attention: This part automatically adjusts the weight of different elements of the input sequence according to their influence on the output. This is particularly important in language processing tasks, where the meaning of words in a sentence can change based on the context.

2. Multi-head attention: This part runs several attention operations in parallel, each over a different learned projection of the input, allowing the model to attend to information from different representation subspaces.
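The NumPy sketch below illustrates both parts: scaled dot-product self-attention, computed as softmax(QK^T / sqrt(d_k))V, and a simple multi-head wrapper. All shapes, random seeds, and projection matrices are toy assumptions for illustration only.

```python
# A minimal NumPy sketch of scaled dot-product self-attention and a
# multi-head wrapper. Shapes and values are toy assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # pairwise similarities
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 2
x = rng.normal(size=(seq_len, d_model))   # one toy input sequence

# Self-attention: Q, K, and V are all projections of the same input x.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Multi-head attention: split d_model into n_heads subspaces, attend in
# each head independently, then concatenate the results.
d_head = d_model // n_heads
heads = [
    scaled_dot_product_attention(
        Q[:, h * d_head:(h + 1) * d_head],
        K[:, h * d_head:(h + 1) * d_head],
        V[:, h * d_head:(h + 1) * d_head],
    )
    for h in range(n_heads)
]
out = np.concatenate(heads, axis=-1)      # shape: (4, 16)
print(out.shape)
```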

In addition, recent research is exploring integrating external knowledge into attention mechanisms. Techniques such as “knowledge-enhanced attention mechanisms” or “graph attention networks” have been proposed to leverage external knowledge graphs or structured data to augment the model's understanding and improve its performance on specific tasks.

2. Model Pre-Training and Fine-Tuning

LLMs are built on a deep learning neural network architecture known as the Auto-Regressive Transformer. These models undergo a two-step training process over extensive pre-processed text data. The initial phase employs self-supervised learning to predict unobserved or hidden parts of the input from the observed parts. This is followed by aligning the model with human preferences.
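To make the self-supervised objective concrete, the toy snippet below shows how next-token prediction pairs are formed from a single sequence. The token ids are hypothetical, invented purely for demonstration.

```python
# A toy illustration of the self-supervised objective: each "hidden" next
# token must be predicted from the observed prefix. The token ids below are
# hypothetical, invented purely for demonstration.
tokens = [101, 2054, 2003, 1037, 2312, 2653, 2944, 102]

# Each position yields one training pair: (observed prefix, token to predict).
for i in range(1, len(tokens)):
    observed, hidden = tokens[:i], tokens[i]
    print(f"input={observed} -> predict {hidden}")
```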

2.1. Pretraining

During the pre-training phase, LLMs are trained on a diverse mixture of datasets spanning various domains, including English CommonCrawl, C4 (preprocessed Common Crawl), Wikipedia, Gutenberg and Books3, ArXiv, and Stack Exchange. These datasets go through a tokenization process [1], which breaks sentences into tokens.
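The sketch below shows the idea of tokenization with a tiny made-up vocabulary. Real LLM tokenizers use learned subword schemes such as byte-pair encoding, which this word-level lookup does not implement.

```python
# A toy sketch of tokenization: a sentence is mapped to integer token ids via
# a small made-up vocabulary. Real LLM tokenizers use learned subword schemes
# such as byte-pair encoding; this word-level lookup only shows the idea.
vocab = {"<unk>": 0, "large": 1, "language": 2, "models": 3,
         "learn": 4, "from": 5, "text": 6}

def tokenize(sentence):
    # Split on whitespace and look each word up, falling back to <unk>.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("Large language models learn from text"))  # [1, 2, 3, 4, 5, 6]
```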

Since LLMs are trained on internet text, which can contain content conflicting with human preferences [2], the model undergoes further training on smaller, task-specific datasets relevant to the intended tasks. This additional training phase, referred to as model fine-tuning, aims to bring the model closer to human values.

2.2. Model Fine-Tuning

Following pre-training, fine-tuning takes place to enhance the model's performance on specific tasks, allowing the model to specialize and become more effective at the target task [3].

Reinforcement Learning from Human Feedback (RLHF) is a method used to refine the LLM’s responses, aligning it more closely with human values, intentions, and expectations. RLHF aims to improve the response quality in terms of accuracy, relevance, coherence, and comprehensiveness. This entails harmonizing the responses with human values, mitigating harmful and undesirable behaviors, and personalizing the responses.

Using RLHF, the model learns from the feedback of human reviewers, enhancing its performance with each iteration. This enables it to produce better responses and to improve continually. Nevertheless, it is essential to understand that the model is not flawless: there are occasions when it makes mistakes or produces unanticipated results. Continuous efforts are made to refine the process and reduce such occurrences.
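The toy sketch below illustrates only the shape of the RLHF feedback loop: a policy samples a response, a stand-in reward model supplies a human-preference-style score, and a REINFORCE-style update reinforces the preferred behavior. Real RLHF trains a learned reward model on human preference data and optimizes the full LLM, typically with an algorithm such as PPO; every name and number below is an assumption made for illustration.

```python
# A heavily simplified, toy sketch of the RLHF feedback loop. The "policy"
# chooses between two canned responses, the reward function stands in for a
# reward model trained on human preference labels, and a REINFORCE-style
# update shifts probability toward the preferred response.
import numpy as np

rng = np.random.default_rng(0)
responses = ["helpful, accurate answer", "rude, evasive answer"]
logits = np.zeros(2)                     # toy policy parameters

def reward_model(index):
    # Stand-in for a learned reward model reflecting human preferences.
    return 1.0 if index == 0 else -1.0

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.5
for step in range(50):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)           # sample a response from the policy
    r = reward_model(a)                  # human-preference-style feedback
    # REINFORCE gradient: d log pi(a) / d logits = onehot(a) - probs
    grad = -probs
    grad[a] += 1.0
    logits += lr * r * grad              # reinforce rewarded behavior

print(softmax(logits))  # probability mass shifts toward the preferred answer
```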

2.3. In-Context Learning

In-Context Learning (ICL) [4,5,6] refers to the ability of LLMs to use the context of the current conversation or task, together with only a few examples, to inform their responses or actions, without any additional training or weight adjustments after initial training.

ICL allows LLMs to perform specific tasks and generate more relevant and appropriate responses, maintain a coherent dialogue by remembering previous turns of the conversation, and adapt responses to the user's style and preferences. More importantly, ICL allows LLMs to decipher implicit clues and respond appropriately. ICL encompasses various learning types, including zero-shot learning, few-shot learning, demonstration learning, and prompting.
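A minimal sketch of few-shot prompting, one form of ICL, is shown below. The examples are made up, and call_llm is a hypothetical placeholder for a model call, not a real API.

```python
# A minimal sketch of few-shot in-context learning: task examples are placed
# directly in the prompt, and the model is expected to continue the pattern
# with no weight updates. `call_llm` is a hypothetical placeholder.
few_shot_examples = [
    ("The movie was fantastic.", "positive"),
    ("I want my money back.", "negative"),
    ("An instant classic.", "positive"),
]

def build_prompt(query):
    lines = ["Classify the sentiment of each review."]
    for review, label in few_shot_examples:
        lines.append(f"Review: {review}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt("The plot made no sense at all.")
print(prompt)
# response = call_llm(prompt)  # hypothetical model call; expected: "negative"
```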

2.4. Chain of Thought Generation

Chain of thought is a term used to describe the logical progression of ideas or reasoning, where one thought leads to another in a sequence or pattern. The main idea is to break down complex problems into intermediate steps. In the context of Large Language Models, chain of thought generation refers to the ability of these models to generate coherent responses by following a logical sequence of reasoning, elicited by prompts such as "let's think step by step", "show your reasoning", or "explain your steps" [7,8].

Example: Consider the arithmetic word problem: "How many keystrokes are needed to type the numbers from 1 to 500?" Using chain-of-thought reasoning, an LLM might answer:

"There are 9 one-digit numbers from 1 to 9."

"There are 90 two-digit numbers from 10 to 99."

"There are 401 three-digit numbers from 100 to 500."

"9 + 90(2) + 401(3) = 1392. The answer is 1392."

Recent developments in chain-of-thought reasoning within LLMs have shown promising results in complex reasoning tasks, such as solving mathematical problems. This approach does not require large training datasets and enables a single model to perform various tasks by learning the underlying patterns from a few examples [7,8]. This ability of LLMs to mimic human-like sequential reasoning underscores the remarkable adaptability and potential of these models in various applications.

3. Emergent Behaviors

Emergent behavior occurs when a complex entity has properties or behaviors that its parts do not have on their own (Wikipedia, 2023). These behaviors emerge only when the components interact within the broader whole. In the context of LLMs, emergent behaviors refer to unexpected and novel capabilities that the models were not explicitly trained for. These behaviors arise due to the models' increased scale, vast training data, and intricate architectures, leading to unique patterns of language understanding, reasoning, and even mathematical problem-solving.

For example, an LLM might demonstrate the ability to solve a math problem it was never explicitly trained on, by deducing the solution through patterns it learned during training. Such instances represent emergent behaviors within the model.

4. AGI Manifestations

Artificial General Intelligence (AGI) refers to the ability of a machine to perform any intellectual task that a human being can [9]. Unlike narrow or specialized AI, which is designed for specific tasks, AGI would have understanding, consciousness, and general intelligence comparable to human levels [2].

In the context of LLMs, manifestations of AGI-like behaviors have been observed, as highlighted in the Microsoft paper "Sparks of Artificial General Intelligence: Early Experiments with GPT-4" [2]. These behaviors present noteworthy observations in the context of AGI, demonstrating remarkable capabilities across a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more.

5. Conclusion

This survey provided a comprehensive overview of the methods used to train Large Language Models (LLMs) and to perform inference with them. The Transformer architecture, particularly its attention mechanism, forms the foundation of these models. Diverse training techniques, especially reinforcement learning from human feedback (RLHF), have enhanced their adaptability and efficiency. Furthermore, the capability of in-context learning (ICL) in LLMs, including zero-shot, few-shot, and demonstration learning, highlights their ability to adapt based on limited examples. Chain of thought generation highlights the models' capacity to reason logically and break down complex problems. As LLMs continue to evolve, understanding their training and inference mechanisms remains crucial for their effective application in various domains.

About TCS Intelligent Urban Exchange™

TCS Intelligent Urban Exchange™ is an enterprise software solution from TCS Digital Software & Solutions. It is an advanced AI- and ML-powered solution that delivers comprehensive insights, recommendations, and metrics for sustainable, environmentally clean organizational and value chain operations. The aggregate system-wide impact of TCS Intelligent Urban Exchange™ for sustainability results in substantial emissions reduction, cost savings, and resource conservation, while also advancing corporate environmental stewardship, compliance, and social responsibility.

About Tata Consultancy Services

Tata Consultancy Services (TCS) is an IT services, consulting and business solutions organization that has been partnering with many of the world’s largest businesses in their transformation journeys for over 50 years. TCS’ proactive stance on climate change and award-winning work with communities across the world have earned it a place in leading sustainability indices such as the MSCI Global Sustainability Index.


References

  1. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  2. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
  3. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004.
  4. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
  5. Dilmegani, C. (2023). Large language model training in 2023. Retrieved from https://research.aimultiple.com/large-language-model-training/
  6. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
  7. Parisotto, E., & Salakhutdinov, R. (2017). Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360.
  8. Li, J., Monroe, W., Ritter, A., Jurafsky, D., & Galley, M. (2016). Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1606.00930.
  9. Korbak, T., Shi, K., Chen, A., Bhalerao, R. V., Buckley, C., Phang, J., ... & Perez, E. (2023, July). Pretraining language models with human preferences. In International Conference on Machine Learning (pp. 17506-17533). PMLR.


Learn More:

Visit the TCS Intelligent Urban Exchange™ page on https://www.tcs.com

Email Us: [email protected]


About the Authors:

Dr. Arup Acharya, PhD

Arup is the Head of Research & Innovation at TCS Digital Software and Solutions and leads both the Architecture & Design and Research teams in DS&S. He received his PhD from Rutgers University and his B.Tech from IIT Kharagpur, both in Computer Science. He has 40+ patents issued and is well published in conferences and journals on leading-edge technology topics. Prior to TCS, Arup worked at IBM Research and NEC Research.


Dr. Yibei Ling, PhD

Yibei Ling is a Senior Data Scientist at TCS Digital Software and Solutions and works on energy-aware, no-code AutoML frameworks and machine learning models for sentiment analysis, face recognition, and time-series analysis. Prior to TCS, Yibei was with the research labs at Bellcore (Telcordia), working on DARPA projects including sensor and distributed networks. He has published more than 30 papers in IEEE and ACM Transactions covering fault-tolerant and distributed computing and network security. Yibei has been granted 21 US patents and is a senior IEEE member and a reviewer for Mathematical Reviews and IEEE Transactions publications.


Dr. Guillermo Rangel, PhD

Guillermo Rangel, PhD, has expertise in text analytics built over a decade of working on natural language modeling (NLP/NLG/NLU). Guillermo has previously worked in verticals such as banking, retail, telecom, and gaming, serving as a solution advisor and data science consultant for companies including Bloomberg LP, Vodafone, Blizzard Entertainment, and The Home Depot, among others. Guillermo holds a PhD in Physics from the University of California, Davis.
