747-2 bet,234win bet app.REGISTER NOW GET FREE 888 PESOS REWARDS!

Hi everyone,

This is the last edition of Continual Learnings for 2022. Thanks for learning alongside us this year!

I’m not sure if I had more time to read this week, or everyone is trying to ship their latest thing by the end of the year, but we have a particularly long edition this time. There are a lot of fascinating articles, papers, and repos below — hope it makes for some fun holiday reading!

What are we reading this week

Building ML-powered products from the trenches

What building “Copilot for X” really takes: This is a fascinating look into the nitty-gritty details of building a LLM-enabled AI-powered product.

Copilot internals: The authors of this post reverse-engineered copilot and then explain how it works. This is a must-read if you’re building with language models.?

New models to try

OPT-IML: This new open source language model from Meta uses instruction fine-tuning, which is one of the techniques driving the rapid advancements in language model capabilities of late (as seen in Google’s?FLAN).

New and Improved OpenAI Embedding Model: Better and cheaper. What’s not to like. I built a demo with it this week, and it was super easy and worked quite well out of the box.?

Point-E: a System for Generating 3D Point Clouds from Complex Prompts: What GPT-3 did for text generation and DallE2 / Stable diffusion did for image generation will happen to the generation of 3D shapes soon as well. This isn’t quite at the level to break the internet yet though.

New LLM capabilities

Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models: We all know by now that language models tend to make up the answer when they don’t know it. This paper proposes a way to have LLMs generate attributions alongside their answers.

Controllable Text Generation with Language Constraints: This paper introduces a benchmark on constrained language generation — the task of generating text while avoiding things that the model maintainer doesn’t want to include in the response. They also present a baseline approach that performs better than off-the-shelf language models.?

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions: Chain-of-thought prompting is one of the most consistently helpful tools in the prompt engineering toolkit. This paper shows how to extend it to knowledge retrieval as well.

From prompt hacking to prompt engineering

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters: I’ve remarked in the past that “prompt engineering” today is really more like prompt hacking because we have little understanding of what makes a good or bad prompt — just what works. This paper takes a step toward understanding chain-of-thougt prompting. One surprising finding is that it works even if the reasoning demonstrations provided are invalid.

?Learn Prompting: I’m really trying to avoid this becoming an LLM newsletter, but what can you do when there’s so much interesting stuff happening in that field right now. Anyway, this looks like a great resource if you’re trying to get started in prompt hacking engineering.

Prompt engineering guide: This is a good complement to Learn Prompting: it contains links to a range of resources on this emerging field.

Reinforcement learning from human feedback

RL4LMs: Another powerful-looking library for fine-tuning language models based on human preferences to consider as an alternative to?TRLX.

?Coadaptive Harness for Effective Evaluation, Steering, & Enhancement (Cheese): This new library attempts to make it easier to build human feedback collection UIs.

Continual Learning for Instruction Following from Realtime Feedback: Most reinforcement learning from human feedback techniques are performed in batches offline. This is an example of how to do it online as human users interact.

Constitutional AI: Rather than fine-tuning language models from human feedback directly, this paper proposes RLAIF: RL from AI feedback. Humans just provide rules, and a model uses those rules to generate a “critique” that can be used as a reinforcement learning signal.

Production ML papers to know

In this series, we cover important papers to know if you build ML-powered product.

MLOps at Industrial-Scale: Lessons from Google?

Have you ever wondered what it's like to do MLOps at Google-scale?

A new paper shines a light on how Google deploys, maintains, and improves on an “industrial scale” ML system to predict click-through rate - and it is eye opening.

It's a world where “many dozens of engineers” undertake R&D to drive improvement on a system that supports over 100,000 queries per second.

And despite the scale, there are technologies, techniques and advice for everyone trying to do MLOps. So let’s jump in.

Why CTR prediction is hard

Click-through rate (CTR) prediction is valuable because it’s a primary signal of the usefulness of ads. It feeds directly into the cost per click that advertisers pay.

Google’s CTR prediction model “consists of billions of weights, trains on more than one hundred billion examples, and is required to perform inference at well over one hundred thousand requests per second.” This isn’t a set-and-forget model either. Google is constantly trying to improve its performance without adding training / serving costs or undue complexity.

The paper covers techniques Google uses to improve accuracy, efficiency, reproducibility, calibration, and credit attribution.

We’ll cover their approach here, much of which is applicable for smaller scale systems too. But, first, we’ll describe the model itself.

Model architecture

The paper does not describe the full model architecture, but it reveals some interesting details.

First, the Google team found that the text of the query and ad headlines are critical context for the model, but, for performance reasons, they forgo representing them with a LLM in favor of a smaller model that uses classical text features like n-grams.

Beyond that, the baseline model is pretty standard — the remaining features are embedded, the embeddings are concatenated, and the model is trained using AdaGrad, log loss, and ReLUs. Google CTR engineers are just like you and me.

The rest of the paper describes ways they improve on this baseline.

Reducing costs through efficiency

As committed as Google is to ML, even for them any gain from ML needs to be weighed against cost: not just cost of training, but also “long-term cost to future R&D.” This frequently leads to killing ideas that improve performance but are deemed not worth the cost.

So, a parallel aim to improving accuracy is improving efficiency.?That means models are evaluated by two metrics: 1) Does accuracy go up when training cost is flat? 2) Is training cheaper if model capacity is lowered until accuracy is neutral?

Here are some techniques Google uses to improve efficiency.

Bottlenecks. Neural networks with wider layers are more accurate but slower and more costly. The Google team found that you can use bottleneck layers (low-rank matrices) to get some of the benefits of larger layers without a massive const increase.
AutoML. CTR models benefit from tuning tons of hyperparameters: embedding widths, layer widths, etc. That’s what AutoML does well, but standard AutoML isn’t cost effective. Instead, the team uses a variant of neural architecture search (NAS) based on weight sharing. In particular, the team tunes the algorithm to evaluate against different constraints (e.g., cost at no more than 85%/90%/95% of the baseline cost). Recently, this AutoML technique led to a reduced time per training step of 16% without reducing accuracy.
Data sampling. The CTR model exhibits diminishing returns in performance the size of the training set increase, so sampling is another effective cost reduction measure. The team improves on random sampling in a few ways: (1) restricting to more recent data, which is more relevant for CTR prediction, (2) oversampling “clicked” examples, which are rarer and more important, and undersampling “non-clicked” ones, and (3) sampling data with low log-loss or advertisements that were unlikely to be seen by the user.

Improving accuracy

The paper also discusses techniques aimed at improving accuracy.

Loss engineering. Improving log loss doesn’t always improve key business metrics in a production setting. The Google team put a ton of effort into designing auxiliary loss functions to better align online and offline metrics. Some ideas they employed include:

- Ranknet loss, which aims to make sure the set of candidate ads are properly ranked relative to one another.

- Distillation, which trains a smaller model called the “student” to match the predictions of a larger “teacher” model. A surprising discovery in modern deep learning is that knowledge distillation often leads to “student” models that are more capable than training the smaller model from scratch. In Google’s case, this lets them train a larger “teacher” model than would normally be computationally feasible in production.

- Loss curriculum, which borrows from curriculum learning by gradually introducing the more complicated loss functions throughout the course of training.

Second-order optimization. The common wisdom in deep learning is that “second order methods don’t work; just use momentum-based SGD”. Not so. The Google team found that using a recent algorithm called Distributed Shampoo led to significant improvements in accuracy with only a 10% increase in training time.
Deep and Cross Networks. The paper outlines the use of a DCNv2 variant using bottlenecks to learn the effective feature crosses that are critical for recommender systems. This Deep & Cross Network is added between the embedding layer described earlier, and the DNN, and resulted in an accuracy improvement of 0.18% with a minimal increase in training cost of 3%.

Increasing reproducibility

Perhaps one of the most fascinating sections of this paper is on reproducibility.

Training runs for these models are rarely reproducible due to factors like random initialization, non-determinism stemming from distributed compute, numerical errors, hardware, and more.

Irreproducibility is hard to detect in training metrics, and may impact downstream R&D - model deployment leads to further divergence, as predictions from deployed models become part of subsequent training and research.

To combat this, Google uses the metric Relative Prediction Difference (PD), which measures the absolute point-wise difference in model predictions for a pair of models. PDs are “as high as 20% for deep models”, and methods such as fixed initialization, regularization, dropout, data augmentation don’t make much of a difference. Ensemble techniques help, but introduce their own forms of technical debt.

Experimentation showed that ReLUs were a contributing factor because the “gradient discontinuity at 0 induces a highly non-convex loss landscape.” Moving to the Smooth ReLU (SmeLU) activation function led to a PD less than 10%, and also improved accuracy by 0.1%.

Generalizing Across UI Treatments

CTR performance of an ad is impacted by the UI it belongs to, so it’s important to be able to tease apart the contributions of the two. To do so, the Google team replaces the single CTR model with ??(??·??), composed of a transfer function ?? and separable models ?? and ?? that output vectorized representations of the Quality and the UI and are combined using an inner-product.

The upshot

The practical advice provided in this paper is well worth understanding, even if you, like more or less every other machine learning team today, operate at a much smaller scale than Google’s CTR model.

You might want to read it alongside another paper we recently summarized on MLOps best practices from a wider range of companies.

Check out the paper here.

Thanks for reading!

Feel free to get in touch if you have any questions: you can message us on socials or simply reply to this email.

You can also find previous issues on?our blog?and on?twitter.?

The Gantry team

MLOps at Industrial-Scale: Lessons from Google

Gantry

Continuously improve your ML-powered products

Hi everyone,

What are we reading this week

Building ML-powered products from the trenches

New models to try

New LLM capabilities

From prompt hacking to prompt engineering

Reinforcement learning from human feedback

Production ML papers to know

MLOps at Industrial-Scale: Lessons from Google?

Why CTR prediction is hard

领英推荐

Model architecture

Reducing costs through efficiency

Improving accuracy

Increasing reproducibility

Generalizing Across UI Treatments

The upshot

Thanks for reading!

Continual Learnings

1,227 位关注者

更多精彩文章

社区洞察

其他会员也浏览了

GPT-3: A Mind-Blowing AI That Can Code, Write Songs & Poems, Create Interview Questions!

Exploring Llama 2: Open-Source LLM Advancements & Applications

Introducing GPT-4: The Revolutionary Language Model Setting New Standards!

Tame GPT 4

?? OpenAI’s GPT-3.5 Turbo: What’s New & Why It Matters

Unlocking the Potential: Exploring the Future of Work with GPT-4o

HOW TO FINE-TUNE LLAMA 2 AND UNLOCK ITS FULL POTENTIAL

Head-to-Head: LLaMA 3, GPT-4, and Gemini

GPT-5 release rumors

Assessing GPT-4 on Reasoning; Mathematical Perspective On Transformers; Family Of Multimodal Models; Why Small LMs Are The Next Thing; and More.

Hi everyone,

What are we reading this week

Building ML-powered products from the trenches

New models to try

New LLM capabilities

From prompt hacking to prompt engineering

Reinforcement learning from human feedback

Production ML papers to know

MLOps at Industrial-Scale: Lessons from Google?

Why CTR prediction is hard

领英推荐

Model architecture

Reducing costs through efficiency

Improving accuracy

Increasing reproducibility

Generalizing Across UI Treatments

The upshot

Thanks for reading!

Continual Learnings

1,227 位关注者

How to teach an old model new tricks

2023年3月14日

Putting Responsible AI into Practice

2023年2月23日

How to measure language model performance

2023年2月16日

Why do ML Projects Fail?

2023年1月26日

Monolith: The Recommendation System Behind TikTok

2023年1月10日

From prompt magic to prompt engineering?

2022年12月14日

How do people actually operationalize ML in 2022?

2022年12月7日

Do You Really Need a Feature Store?

2022年11月30日

社区洞察

其他会员也浏览了

GPT-3: A Mind-Blowing AI That Can Code, Write Songs & Poems, Create Interview Questions!

Exploring Llama 2: Open-Source LLM Advancements & Applications

Introducing GPT-4: The Revolutionary Language Model Setting New Standards!

Tame GPT 4

?? OpenAI’s GPT-3.5 Turbo: What’s New & Why It Matters

Unlocking the Potential: Exploring the Future of Work with GPT-4o

HOW TO FINE-TUNE LLAMA 2 AND UNLOCK ITS FULL POTENTIAL

Head-to-Head: LLaMA 3, GPT-4, and Gemini

GPT-5 release rumors

Assessing GPT-4 on Reasoning; Mathematical Perspective On Transformers; Family Of Multimodal Models; Why Small LMs Are The Next Thing; and More.