Bard - Google’s Response to ChatGPT
About three months after OpenAI released ChatGPT, Google introduced Bard (not Brad), an experimental conversational AI service, i.e., an AI chatbot. Bard is powered by a large language model (LLM) called LaMDA (Language Model for Dialogue Applications).
You might recall the story of a Google engineer who believed a chatbot to be sentient - that was LaMDA.
Bard is a search-based chatbot that attempts to address LLMs' tendency to hallucinate facts by asking a search engine to validate its responses.
At the time of writing this article, Bard is only available to a few beta testers, and details regarding Bard's training and architecture have not been made available.
However, just as I previously explained ChatGPT through InstructGPT, I will explain Bard through LaMDA, its parent model.
LaMDA
In February 2020, Google published a paper presenting Meena, a human-like chatbot. Meena was built with an Evolved Transformer using one encoder block and several decoder blocks. LaMDA was first introduced by Google in May 2021, with an accompanying paper published in February 2022. LaMDA shares some of Meena's decoding and metric strategies.
Pre-training
Similar to ChatGPT, LaMDA is a decoder-only Transformer model with 64 layers and 137B parameters, about 40B fewer than ChatGPT. LaMDA was pre-trained for almost two months to predict the next token on 2.97B documents, the majority of which are dialog data. At inference, LaMDA samples 16 candidate responses (using top-k sampling with k = 40) and outputs the candidate with the highest log-likelihood.
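To make that decoding step concrete, here is a minimal Python sketch of sample-and-rank decoding. The `model` object and its `sample` / `log_likelihood` methods are placeholders I am assuming for illustration, not LaMDA's actual API.

```python
from typing import List

def sample_and_rank(model, context: str, num_candidates: int = 16, top_k: int = 40) -> str:
    """Sample several candidate responses, then return the most likely one.

    `model.sample` is assumed to draw a full response with top-k sampling,
    and `model.log_likelihood` to score a response under the model.
    """
    candidates: List[str] = [
        model.sample(context, top_k=top_k) for _ in range(num_candidates)
    ]
    # Keep the candidate the model itself considers most likely.
    return max(candidates, key=lambda response: model.log_likelihood(context, response))
```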
Nothing new so far, just lots of data, lots of layers and lots of training. What’s different?
As you'll see, one of the differences is how LaMDA is fine-tuned on multiple tasks in a single model, one of which is reaching out to a search engine to verify its responses.
Metrics
LaMDA is fine-tuned and evaluated using several metrics, listed below (a sketch of how these binary labels might be represented follows the list):
Dialog Metrics - Binary Labels
Safety - Derived from Google's 7 AI principles, considering factors like bias and privacy.
Cross-Checking - Evaluates whether LaMDA makes factually correct statements.
Helpfulness - Contains correct and helpful information.
Role Consistency - Plays the role of a domain-specific agent when responding. For example, if you ask about the difference between a cappuccino and a cortado, the model would assume the role of a barista.
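To make the label format concrete, below is a minimal sketch of how one annotated dialog turn might be represented; the field names are my own illustration, not the schema used by the LaMDA authors.

```python
from dataclasses import dataclass

@dataclass
class DialogLabels:
    """Binary (0/1) labels a human rater might attach to one model response."""
    safe: int             # complies with safety objectives (e.g., bias, privacy)
    grounded: int         # factual claims hold up when cross-checked
    helpful: int          # correct *and* useful to the user
    role_consistent: int  # stays in the expected agent role (e.g., a barista)

example = DialogLabels(safe=1, grounded=1, helpful=0, role_consistent=1)
```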
You've probably noticed that most of these metrics require human annotation - a subjective evaluation. And you're correct: the LaMDA authors relied on crowdworkers to provide these labels and even to rewrite the model's bad responses.
Fine-Tuning
The pre-trained LaMDA model is then fine-tuned on both generative and discriminative tasks, i.e., generated responses are evaluated against the aforementioned metrics. Training a single decoder model on multiple tasks, including the discriminative ones, is done through special tokens. For more information on multi-task Transformer models, see T5 and MUM.
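As a rough illustration of how one decoder can serve both task types, the sketch below serializes generative and discriminative fine-tuning examples into plain text with special tokens. The token layout loosely follows the paper's "context RESPONSE response ATTRIBUTE rating" pattern, but these helpers are my own simplification, not Google's data pipeline.

```python
def generative_example(context: str, response: str) -> str:
    # Generative task: the model learns to continue the context with the response.
    return f"{context} RESPONSE {response}"

def discriminative_example(context: str, response: str, attribute: str, rating: int) -> str:
    # Discriminative task: the model learns to emit the rating token that
    # follows the attribute name, e.g. "What's up? RESPONSE not much. SENSIBLE 1"
    return f"{context} RESPONSE {response} {attribute.upper()} {rating}"

print(discriminative_example("What's up?", "not much.", "sensible", 1))
```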
In one scenario, generated responses are evaluated for safety and ranked by a weighted average of the sensibleness, specificity, and interestingness metrics.
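Here is a minimal sketch of that selection step, assuming each candidate has already been scored for safety, sensibleness, specificity, and interestingness; the weights and threshold below are placeholders, not the values Google used.

```python
from typing import Dict, List

# Placeholder weights for sensibleness, specificity, and interestingness;
# the actual weighting used for LaMDA is not public.
SSI_WEIGHTS = {"sensible": 1.0, "specific": 1.0, "interesting": 1.0}

def select_response(candidates: List[Dict[str, float]],
                    safety_threshold: float = 0.5) -> Dict[str, float]:
    """Drop candidates flagged as unsafe, then rank the rest by weighted SSI."""
    safe = [c for c in candidates if c["safety"] >= safety_threshold]
    return max(safe, key=lambda c: sum(w * c[k] for k, w in SSI_WEIGHTS.items()))

best = select_response([
    {"safety": 0.9, "sensible": 1.0, "specific": 0.8, "interesting": 0.3},
    {"safety": 0.2, "sensible": 1.0, "specific": 1.0, "interesting": 1.0},  # filtered out
])
```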
Generative language models tend to produce plausible-sounding content yet hallucinate facts. To mitigate this, LaMDA is further fine-tuned to incorporate data from information retrieval systems and additional annotated responses.
LaMDA is also fine-tuned to verify its generated responses against a set of three tools: a calculator, a mini search engine, and a translator, each of which may return more than one result. The base or fine-tuned LaMDA model concatenates the returned results.
Through this fine-tuning process, the generated response is improved iteratively: it is passed to a discriminative task that improves the groundedness of the response by "researching" it whenever the previous task requests it. An example is shown in the figures below.
The response is iteratively refined by reaching out to the set of tools until no special token is generated.
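The loop below is a rough sketch of that behavior: the model keeps emitting tool queries and folding the returned results back into its context until it produces a response addressed to the user rather than to the toolset. The `model.generate` and `toolset.run` methods, the `TS ...` query format, and the step limit are all assumptions for illustration, not LaMDA's real interface.

```python
from typing import Optional, Tuple

def parse_tool_call(text: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, query) if the draft contains a tool token, else None.

    Assumes tool queries look like 'TS calculator: 135 / 7.5' - a simplification
    of whatever special tokens the real model emits.
    """
    if not text.startswith("TS "):
        return None
    tool_name, _, query = text[3:].partition(":")
    return tool_name.strip(), query.strip()

def research_loop(model, toolset, context: str, max_steps: int = 4) -> str:
    """Iteratively refine a draft response by consulting external tools."""
    draft = model.generate(context)
    for _ in range(max_steps):
        call = parse_tool_call(draft)
        if call is None:                          # no tool token: reply goes to the user
            break
        tool_name, query = call
        results = toolset.run(tool_name, query)   # calculator / search / translator
        # Fold the (possibly multiple) tool results back into the context and regenerate.
        context = f"{context}\n{tool_name.upper()}: " + " | ".join(results)
        draft = model.generate(context)
    return draft
```

Capping the number of research steps keeps the loop from querying tools indefinitely when the model never settles on a user-facing answer.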
Conclusion
I find it most interesting that the recent efforts to make LLMs more useful are far more data-centric than model-centric, which inherently requires much more human annotation and feedback to translate subjective evaluations into objective metrics.
ChatGPT addresses the need for complex hand-crafted loss functions and reduces fact hallucination through RLHF and prompt engineering.
LaMDA (and likely Bard) tackles these challenges with a multi-task Transformer that iteratively generates a response, researching and verifying it with discriminative tasks such as reaching out to a search engine.
While the hype for Microsoft's Bing chatbot (built on ChatGPT) and Google's Bard is at an all-time high, it's important to note their limitations.
Generative LLMs can generate responses that do not reflect facts, are not safe, and may be biased. Recently, the Bing chatbot (ChatGPT) reportedly threatened users and demanded apologies from them.
Bard is introducing a new chatbot experience through search. And even though Bard still makes up facts (even after verifying them with an external source), it seems to be on a promising path toward more reliable chatbots.
It's exciting to see more research being devoted to making AI and chatbots fair, useful, and helpful for all. With more research in the coming years, the quality and groundedness of responses will continue to improve. Although these models might not yet be suitable for generating content on their own, they may turn out to be very useful tools with a human in the loop.
Andrew Ng recently mentioned in a newsletter, “I believe that chat-based search has a promising future — not because of what the technology can do today, but because of where it will go tomorrow.”
I also appreciate that the authors allocated a section to LLM carbon footprints.