First, the tech innovations. How did DeepSeek get so close to the leading models at roughly 1/30 of the cost?
Four significant innovations and a bunch of smaller ones. First, the biggest ones:
- They distilled it from an existing model, most likely Meta’s Llama 3, though it’s possible they used or got access to OpenAI’s 4o or Anthropic’s Claude. Distillation is simply the process of using an existing model to train a new one. OpenAI does this with models like GPT-4 Turbo: it’s distilled from GPT-4, which lets them offer pretty-good output at dramatically lower cost because the training data is provided by the earlier model. Distilling from OpenAI or Anthropic definitely violates their terms of service, but it’s difficult and maybe impossible to prove. Not needing to generate a starting training set is a huge advantage in speed and cost. However, it also means you can’t easily leapfrog the leading models, since your starting point is always their previous release. (A rough sketch of how distillation works follows this list.)
- Instead of one large dense model, DeepSeek divided its model into what’s called a Mixture of Experts. LLMs have traditionally loaded and updated the entire model during training and inference. DeepSeek used a learned routing network to determine which parts of the model each token actually needed, and only ran and trained those parts (see the routing sketch after this list). From the thread by @wordgrammer: “They need 95% fewer GPUs than Meta because for each token, they only trained 5% of their parameters.”
- The inference step (where the model makes predictions on new data, like your chats) is dramatically cheaper. This is what makes the cost of running DeepSeek, locally or in the cloud, far lower than the leading models. The breakthrough here, which was announced a while ago, is compression of the key-value cache the model draws on during inference (see the cache sketch after this list). This is a neat trick! But it likely would have been discovered by others in the near term, and in any event the technique is now available to every AI company because it was published openly.
- Reasoning. DeepSeek didn’t just create a model that competes with the best LLMs from OpenAI and Anthropic; they also built a reasoning model on par with OpenAI’s o1. Reasoning models leverage LLMs and add chain of thought and other strategies we generally associate with intelligence. They can correct their own mistakes and draw conclusions that purely predictive models generally can’t. Reasoning models like o1 are created by applying a reinforcement-learning layer on top of the LLM, and OpenAI used human feedback to guide the model’s decision making during development. DeepSeek’s innovation was to… just not do that. From the Stratechery article: “DeepSeek gave the model a set of math, code, and logic questions, and set two reward functions: one for the right answer, and one for the right format that utilized a thinking process. Moreover, the technique was a simple one: instead of trying to evaluate step-by-step (process supervision), or doing a search of all possible answers, DeepSeek encouraged the model to try several different answers at a time and then graded them according to the two reward functions.” What emerged is a model that developed reasoning and chains-of-thought on its own. (A sketch of those two rewards follows this list.)
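To make the distillation idea concrete, here’s a minimal sketch of the classic logit-matching form, where a student model is trained to match a frozen teacher’s output distribution. (When you only have API access to the teacher, distillation usually means fine-tuning on its sampled outputs instead; the vocabulary size and temperature here are placeholders, not anyone’s actual setup.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: the student learns to match the teacher's
    output distribution instead of one-hot ground-truth labels."""
    # Softening both distributions with a temperature exposes the teacher's
    # relative confidence across the whole vocabulary, not just the top token.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence from student to teacher; scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 token positions over a 50k-token vocabulary.
teacher_logits = torch.randn(4, 50_000)                      # frozen teacher
student_logits = torch.randn(4, 50_000, requires_grad=True)  # student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```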
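For the Mixture of Experts point, here’s a minimal sketch of top-k expert routing, the mechanism that lets only a small slice of the parameters run (and receive gradients) for any given token. The dimensions, expert count, and top-k value are illustrative, not DeepSeek’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse Mixture of Experts: a small router picks the top-k experts
    for each token, so most parameters sit idle on any given token."""
    def __init__(self, dim=512, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # learned gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                 # x: (tokens, dim)
        gate_logits = self.router(x)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts execute and receive gradients -- this is
        # why each token trains only a fraction of the total parameters.
        for slot in range(self.top_k):
            for e in indices[:, slot].unique():
                mask = indices[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

layer = MoELayer()
print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```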
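On the inference-cost point: the published technique is DeepSeek’s multi-head latent attention, which caches a small low-rank latent per token and reconstructs keys and values from it on the fly. This is a heavily simplified sketch of just the compression idea, with made-up dimensions and none of the surrounding attention machinery.

```python
import torch
import torch.nn as nn

class CompressedKVCache(nn.Module):
    """Low-rank KV-cache compression: store a small latent per token and
    decompress it into keys/values when attention needs them."""
    def __init__(self, dim=4096, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim, bias=False)  # compress
        self.up_k = nn.Linear(latent_dim, dim, bias=False)  # rebuild keys
        self.up_v = nn.Linear(latent_dim, dim, bias=False)  # rebuild values
        self.cache = []  # holds latents only, never full K/V tensors

    def append(self, hidden):  # hidden: (batch, dim), one decoded token
        # Caching one 512-dim latent instead of separate 4096-dim K and V
        # tensors is a 16x memory saving per token in this toy configuration.
        self.cache.append(self.down(hidden))

    def keys_values(self):
        latents = torch.stack(self.cache, dim=1)  # (batch, seq, latent_dim)
        return self.up_k(latents), self.up_v(latents)

cache = CompressedKVCache()
for _ in range(10):                  # simulate decoding 10 tokens
    cache.append(torch.randn(1, 4096))
k, v = cache.keys_values()
print(k.shape, v.shape)              # (1, 10, 4096) each
```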
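And for the reasoning bullet, here’s a minimal sketch of the two rule-based rewards the Stratechery excerpt describes: one for the correct answer, one for the thinking format. The tag names, regexes, and scores are illustrative; DeepSeek’s actual grading and the reinforcement-learning update around it are more involved.

```python
import re

def format_reward(completion: str) -> float:
    """Reward for showing a thinking process before the answer."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, expected: str) -> float:
    """Reward for a verifiably correct final answer."""
    match = re.search(r"<answer>(.+?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == expected else 0.0

# The model tries several answers per question; each gets graded and the
# combined scores drive the policy update.
samples = [
    "<think>2 + 2 = 4</think><answer>4</answer>",  # right answer, right format
    "The answer is 4",                             # untagged: both rewards miss
    "<think>guessing</think><answer>5</answer>",   # right format, wrong answer
]
for s in samples:
    print(format_reward(s) + accuracy_reward(s, "4"))  # 2.0, 0.0, 1.0
```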
So what are the implications here? First off, it’s a very clear reminder that trying to compete on regulation instead of innovation isn’t the right move. But perhaps more importantly, DeepSeek’s architecture supports a dramatically lower-cost model for AI, both in dollars and in energy consumption. Also important: everything released by DeepSeek (except the underlying model’s training data) is open source and permissively licensed. So this benefits everyone in the industry, even the incumbents.
There are still plenty of viable businesses to be built on top of the tech, and once Product Managers get smarter about incorporating this stuff than just slapping a Big Fat AI Button all over their products, we’ll start to see some real gains. Models, however, are quickly becoming a commodity, and chipmakers (okay, really just NVIDIA) don’t have as big a moat as we thought a few weeks ago.
Find me on Discord: https://discord.gg/learnmutiny