Au revoir Llama 2 et GPT-3.5, bonjour Mixtral 8x7B!
Eugène Delacroix's "Liberty Leading the People" in steampunk style, done in Midjourney. France is back, big-time!

It looks like the newest brainchild of Mistral AI is Hugging Face's Christmas 2023 party crasher.

Mixtral 8x7B Instruct and Mixtral 8x7B, released on December 11th, 2023, aren't just another addition to the LLM toolkit. If the assertions made by the model's authors are true (and the first tests show that this is indeed the case, more on that below), then we are looking at a potential paradigm shift in the capabilities that open-source, permissively licensed large language models can deliver.

Mixtral 8x7B's performance is amazing

According to MT-Bench benchmarks, Mixtral 8x7B outperforms Llama 2 70B, Claude 2.1, and Zephyr 7B, and it matches GPT-3.5. It also hallucinates less and shows much less bias than Llama 2 70B. An independent LLM leaderboard on Hugging Face puts it on par with models such as Llama 2 70B, Falcon 40B, and Intel NeuralChat 7B.

In practical terms, Mixtral brings the following capabilities, as listed in Mistral AI's announcement:

- a 32k-token context window;
- fluency in English, French, Italian, German, and Spanish;
- strong performance in code generation;
- an instruction-tuned variant, Mixtral 8x7B Instruct, which reaches a score of 8.3 on MT-Bench.

More details about the model specs and benchmark outputs are available here.

How is Mixtral different from other models?

According to Mistral AI's blog post, Mixtral 8x7B effectively packs 8 expert models into one, combined with a technique called Mixture of Experts (MoE). This is not to be ignored: MoE has been known for some time (and if you trust Reddit, GPT-4 also uses a MoE architecture), but with the arrival of Mixtral it has an opportunity to enter public awareness.

MoE is a machine learning concept in which multiple "expert" models, each specializing in different aspects of the task at hand, are combined within a larger overarching model. In the context of large language models, MoE introduces an architecture that combines the strengths of diverse specialized sub-networks, known as "experts," to improve the overall performance of the system. Each expert is an individual neural network module trained to excel at specific aspects of the language understanding or generation task.

MoE employs a gating mechanism that decides which expert(s) to activate based on the input data. The gate determines the expertise required for a particular context within the input sequence and can dynamically select the most suitable expert(s) to process or generate specific parts of the text, enabling the model to adaptively leverage the strengths of its different components.
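To make the routing idea concrete, here is a minimal, illustrative sketch of a sparse MoE layer with top-2 gating in PyTorch. This is not Mixtral's actual implementation; the class, layer sizes, and expert design are made up for the example, but it shows how a gate scores 8 experts per token and mixes the outputs of the 2 best ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse Mixture-of-Experts layer: a router picks the top-k
    expert feed-forward networks for every token and mixes their outputs."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward "expert" per slot (8 in Mixtral's case).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # The gating network scores every expert for every token.
        self.gate = nn.Linear(dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> flatten tokens for routing
        tokens = x.view(-1, x.shape[-1])
        scores = self.gate(tokens)                          # (tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)                # normalise their mixing weights
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.view_as(x)

# Quick smoke test: 4 tokens of width 16 pass through 8 experts, 2 active per token.
layer = SparseMoELayer(dim=16, hidden_dim=64)
print(layer(torch.randn(1, 4, 16)).shape)  # torch.Size([1, 4, 16])
```

Because only 2 of the 8 experts run for each token, such a layer touches only a fraction of its parameters per forward pass, which is where the efficiency gain comes from.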

It comes as no surprise that Mistral AI applied a MoE architecture to its new model, as such architectures are designed to handle complex tasks more efficiently by distributing the workload among specialized experts. This approach can improve scalability and performance, allowing Mixtral 8x7B to handle diverse tasks more effectively on consumer hardware.

For a deep-dive into MoE, feel free to read this blog post.

The elephant in the room: a permissive licensing model

Mixtral 8x7B is delivered under the Apache 2.0 license, which makes it more compelling to businesses and more accessible to startups and tinkerers.

From the business perspective, open-source Apache 2.0 models allow companies seeking commercialization to modify and customize the models to suit their specific needs, making them adaptable to diverse industry requirements. Closed-source models might limit customization options, while models under more restrictive licenses may not provide the freedom to alter the code or functionality.

The Apache 2.0 license can also significantly reduce costs associated with licensing fees, enabling businesses to allocate resources elsewhere. Closed-source models often involve substantial licensing fees, and restrictively licensed models may come with usage restrictions that increase operational costs.

Most importantly, this licensing model contributes to transparency and trust while accelerating community collaboration to introduce improvements, fix bugs, and suggest enhancements. Closed-source models lack transparency and restrict community involvement, limiting opportunities for collective improvement and innovation.

Running Mixtral 8x7B on local hardware

Well, here's the catch. Despite its name, Mixtral 8x7B should rather be called Mixtral 45B, as Hugging Face calculated in their blog post, so it requires a lot of RAM and VRAM: the base model needs around 86GB of memory to run. The good news is that TheBloke has released quantized versions of both Mixtral 8x7B and Mixtral 8x7B Instruct. The first reports from the field show that it is possible to squeeze a quantized version of the model into 32GB of RAM and 12GB of VRAM on a single NVIDIA RTX 3090, though napkin math suggests that a reasonably capable Q4 quantization would rather need double that. There are also reports of the model being launched successfully on MacBooks.
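For the adventurous, here is a minimal sketch of what loading Mixtral in 4-bit via Hugging Face transformers with bitsandbytes could look like. This is an assumption-laden example rather than a tested recipe: the prompt, generation settings, and memory figures in the comments are rough illustrations, and the hub id mistralai/Mixtral-8x7B-Instruct-v0.1 is the official instruct checkpoint rather than one of TheBloke's GPTQ/GGUF builds.

```python
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Napkin math: ~46.7B parameters at ~0.5 byte/param in 4-bit is roughly 24-26 GB,
# versus ~90 GB in float16. That still exceeds a single 24 GB consumer GPU,
# so device_map="auto" lets accelerate spill layers into CPU RAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Mixtral Instruct uses the [INST] ... [/INST] prompt format.
prompt = "[INST] Explain the Mixture of Experts architecture in two sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```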

That sounds like a perfect pretext for asking Santa to bring one of those nice and shiny MacBook Pros with an M3 Max and 128GB of RAM for Christmas, doesn't it?

Conclusion

The application of MoE in Mixtral 8x7B represents a promising direction in AI research. To me, it looks like Mistral AI is addressing the limitations of single, monolithic models by harnessing the collective intelligence of specialized modules, ultimately leading to more adaptable, accurate, and efficient language understanding and generation systems.

The Apache 2.0 license under which Mixtral 8x7B is made available offers a learning playground for developers to explore, experiment, enhance their skills, contribute to projects, and gain hands-on experience. I therefore consider Mistral AI's recent move a friendly approach towards innovation and freedom of experimentation.

Excellent travail, Mistral AI !


Jeremy Prasetyo

World Champion turned Cyberpreneur | Building an AI SaaS company to $1M ARR and sharing my insights along the way | Co-Founder & CEO, TRUSTBYTES

11 months ago

Haha, you just convinced me to ask Santa for an M3 Max MacBook Pro with 128GB of RAM, Maciek Jędrzejczyk

Paweł Zawadzki

Senior Solutions Architect at AWS, ex-Oracle, ex-Sun, ex-IBM, CCSP, AI/ML Enthusiast

11 months ago

quantized Mixtral already available on HF by TheBloke :) 3bit, 4bit & 8bit https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ/tree/main

Paweł Zawadzki

Senior Solutions Architect at AWS, ex-Oracle, ex-Sun, ex-IBM, CCSP, AI/ML Enthusiast

11 months ago

Rumor has it that GPT-4 is just such a set of many specialized models (but they don't disclose it, so we don't know for sure). For me, MoE resembles the functioning of the human brain. We have areas responsible for technical understanding, feelings, memory, receiving impulses from nerve endings in the eyes, ears, skin... groups of specialized neurons. Interesting where this is heading.

Merci Mistral AI! Père Noël, n'oublie pas le #MacBook Pro
