"Surprise" as an aspect of learning
Ever since I joined #Capgemini I have kept this ("learning how machines learn") as my profile quote. My view is that #ChatGPT is such a sensation because of the "surprise" factor it provides, which makes people go back and try it out again and again. In the next few articles I want to discuss some aspects of #ChatGPT as well as how large language models (LLMs) are trained.
It is well known from research that novelty triggers dopamine release in humans; this is one of the key insights into human learning. Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with its environment and receiving feedback in the form of rewards or penalties. The agent's goal is to maximize the cumulative reward it receives over time. In both supervised machine learning and RL, the conventional wisdom was to "teach" or "critique" the model by giving it feedback, so that the agent or model could improve its performance. When I first encountered LLMs, my first question was about the training objective: how did they get such large amounts of labeled data? It was surprising to learn that there was no labeled data. This is actually self-supervised (often described as unsupervised) learning, where the model is trained on a large corpus of text without any explicit labels or supervision; the next token in the text itself serves as the training target. The model learns patterns and relationships in the data, and uses that knowledge to generate new text that is similar to the training data.
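A minimal sketch of why no human labels are needed: in language-model pretraining, the "labels" are just the input text shifted by one token, so the text supervises itself. The toy word-level tokens below are made up for illustration; real models work over subword tokens and learn a probability distribution rather than printing pairs.

```python
# Toy illustration of the next-token-prediction objective:
# each training example is (context so far, next token to predict).
tokens = ["the", "cat", "sat", "on", "the", "mat"]

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
# The first pair is (["the"], "cat"): given "the", predict "cat".
```

No annotator ever wrote these pairs down; they fall out of the raw text, which is what lets pretraining scale to web-sized corpora.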
OpenAI researchers Yuri Burda and team did a curious piece of research in which they removed the extrinsic reward function from the agent and used curiosity as an intrinsic reward, with the agent's own prediction error serving as the reward signal. The most surprising result was that the agent did very well across the 54 RL benchmark environments tested. This work has since been extended to many RL domains.
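A hedged toy sketch of the idea, not the paper's actual architecture (which uses learned neural predictors over pixel observations): the agent keeps a simple predictive model of what it will observe next, and its reward is its own prediction error, so novel transitions pay more than familiar ones. All names and numbers here are illustrative.

```python
# Curiosity as intrinsic reward: reward = prediction error ("surprise").
def make_curiosity_module():
    predictions = {}  # state -> running-mean prediction of the next observation

    def intrinsic_reward(state, next_obs, lr=0.5):
        pred = predictions.get(state, 0.0)
        error = abs(next_obs - pred)                 # surprise = prediction error
        predictions[state] = pred + lr * (next_obs - pred)  # improve the model
        return error                                 # no extrinsic reward at all

    return intrinsic_reward

reward = make_curiosity_module()
first = reward("s0", 1.0)   # novel transition: large error, large reward
second = reward("s0", 1.0)  # seen again: model has adapted, smaller reward
```

The key property is visible even in this toy: repeating the same transition shrinks the reward, which pushes the agent toward states it cannot yet predict.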
OpenAI has been betting on scale and on RL for fine-tuning its models. GPT itself is a fairly simple architecture; scale, training data, and optimizations are where the focus has been. Using RLHF (reinforcement learning from human feedback) they have fine-tuned the models. They expect people to "hack" the system and provide the "surprise" that is required to make the model more robust. It is well known that people will try to break ChatGPT (Facebook's Galactica survived 3 days, Microsoft's Tay survived 16 hours!). It's very impressive how much use and abuse ChatGPT is surviving.
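To make the RLHF mention concrete, here is a hedged sketch of the core reward-model objective commonly used in RLHF (a Bradley-Terry pairwise loss), not OpenAI's exact implementation: given a human preference between two model responses, the training loss pushes the scalar reward of the preferred response above that of the rejected one. The reward scores below are made-up placeholders.

```python
import math

def pairwise_loss(r_preferred, r_rejected):
    # -log(sigmoid(r_preferred - r_rejected)): small when the preferred
    # response already scores clearly higher, large when it does not.
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

close = pairwise_loss(1.0, 0.9)   # barely separated rewards -> higher loss
apart = pairwise_loss(2.0, -1.0)  # clearly separated rewards -> lower loss
```

The resulting reward model then scores new responses, and the language model is fine-tuned with RL to maximize that score, which is where human feedback (including the "abuse" of early users) feeds back into robustness.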