New Open-source LLM: Google Gemma
Intro
Google has recently released its newest open-source AI models: Gemma 2B and 7B. These are competitors to other open-source models like Llama-2 and Mistral, and they appear superior when compared via benchmarks. Google has also made sure to provide many options for deploying these models, including a special partnership with Nvidia.
Google released the model in two sizes: a 2B and a 7B parameter version. On benchmarks, the 7B Gemma beats Llama-2 13B. Overall, Gemma appears clearly superior to models with similar parameter counts.
Deployment
Unlike Llama and Mistral, however, Google seems to be focusing much more on the deployment of these models, whether in the cloud or locally. Gemma is already available on Google Cloud via Vertex AI or Kubernetes (GKE). There are also pre-made Kaggle and Colab notebooks with a Gemma quickstart.
When it comes to local deployment, Google teamed up with Nvidia to bring Gemma to the "Chat with RTX" app, which lets users of Nvidia's 30- and 40-series GPUs run Gemma locally with minimal setup. Gemma is also supported in TensorRT-LLM and the NeMo framework if you want to run it on Nvidia GPUs more directly.
Using Gemma
As mentioned before, there are a couple of ways to actually use Gemma. Since it's such a small model, I think its most practical use is going to be running it locally. This is extremely easy to do with Hugging Face's transformers library:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, time

# Load the instruction-tuned 2B checkpoint in bfloat16
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)

input_text = "Write me a poem"
input_ids = tokenizer(input_text, return_tensors="pt")  # add .to("cuda") here (and move the model) to run on GPU

t1 = time.time()
outputs = model.generate(**input_ids, max_length=200)
t2 = time.time()

print(tokenizer.decode(outputs[0]))
print("Time: ", t2 - t1, "s")
print("Time per token: ", (t2 - t1) / len(outputs[0]), "s/token")
print("Tokens per second: ", len(outputs[0]) / (t2 - t1), "token/s")
On a CPU, generation came out to around 2 tokens per second and used only about 2 gigabytes of RAM. This will obviously vary with whatever hardware you have.
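Since google/gemma-2b-it is the instruction-tuned checkpoint, you can also format prompts with the tokenizer's built-in chat template instead of passing raw text. Here is a minimal sketch, reusing the tokenizer and model loaded above:

# Format a user message with the model's chat template (instruction-tuned checkpoint)
messages = [{"role": "user", "content": "Write me a poem"}]
chat_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
chat_outputs = model.generate(chat_ids, max_new_tokens=200)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))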
Another great option is to grab a tool like LM Studio that sets all of this up for you. It provides a clean front end and uses the llama.cpp project on the backend. llama.cpp is an incredibly optimized library for running large language models: it supports quantized models, a wide variety of LLMs, and splitting work between CPU and GPU.
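If you would rather script against llama.cpp yourself instead of using LM Studio's UI, the llama-cpp-python bindings offer a similar workflow. This is a rough sketch; the GGUF filename is just a placeholder for whichever quantized Gemma file you download:

from llama_cpp import Llama

# Load a quantized Gemma GGUF file (placeholder filename - use whatever
# quantization you downloaded, e.g. a 4-bit variant to keep memory low)
llm = Llama(
    model_path="./gemma-2b-it.Q4_K_M.gguf",
    n_ctx=2048,       # context window
    n_gpu_layers=0,   # set > 0 to offload some layers to the GPU if you have one
)

output = llm("Write me a poem", max_tokens=200)
print(output["choices"][0]["text"])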
Conclusion
Google has made some groundbreaking moves in the AI field. They were already at the forefront of AI research, so it's natural that they are coming to the forefront of commercial AI as well. There have been, however, several controversies around their image generation. Companies like Google pay very close attention to bias within their training data in order to keep that bias from being reflected in the model. Those anti-bias measures ended up backfiring when their models started generating images of Black Nazis and other inaccurate depictions. Although eliminating bias is a noble cause, these kinds of historical inaccuracies are practically a form of racism themselves. Especially with open-source models, this is going to be an increasingly important issue that these companies need to address.