New Open-source LLM: Google Gemma

Intro

Google has recently released its newest open-source AI models: Gemma 2B and 7B. These compete with other open-source models like Llama-2 and Mistral, and appear to outperform them on benchmarks. Google has also provided many options for deploying these models, including a special partnership with Nvidia.

Google released two differently sized models: a 2B and a 7B parameter version. On benchmarks, Gemma 7B beats even Llama-2 13B. Overall, Gemma appears clearly superior to models with similar parameter counts.

Deployment

Unlike Llama and Mistral, however, Google is focusing much more on the deployment of these models, whether through the cloud or locally. Gemma is already available on Google Cloud via Vertex AI or Google Kubernetes Engine. There are also pre-made Kaggle and Colab notebooks with a quickstart for Gemma.

When it comes to local deployment, Google teamed up with Nvidia to bring Gemma to the "Chat with RTX" app, which lets users of Nvidia's 30- and 40-series GPUs easily run Gemma locally. Gemma is also supported by TensorRT-LLM and the NeMo framework if you want to run it on Nvidia GPUs yourself.

Using Gemma

As mentioned before, there are a couple of ways to actually use Gemma. Since it's such a tiny model, I think its most practical use is going to be local inference. This is extremely easy to do with Hugging Face's Transformers API:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, time

# Load the instruction-tuned 2B model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)

input_text = "Write me a poem"
inputs = tokenizer(input_text, return_tensors="pt")  # add .to("cuda") to run on a GPU

# Time the generation to measure throughput
t1 = time.time()
outputs = model.generate(**inputs, max_length=200)
t2 = time.time()

print(tokenizer.decode(outputs[0]))
print("Time: ", t2 - t1, "s")
print("Time per token: ", (t2 - t1) / len(outputs[0]), "s/token")
print("Tokens per second: ", len(outputs[0]) / (t2 - t1), "token/s")

On a CPU, the speed came out to be around 2 tokens per second, and the model only took about 2 gigabytes of RAM. This will obviously vary with whatever hardware you have.
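If you have an Nvidia GPU but limited VRAM, you can also load Gemma in 4-bit quantized form. Here is a minimal sketch using Transformers' BitsAndBytesConfig, assuming the bitsandbytes and accelerate packages are installed and a CUDA GPU is available:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization roughly quarters the memory footprint of the weights
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package; places layers on the GPU
)

inputs = tokenizer("Write me a poem", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0]))

Quantization trades a little output quality for a much smaller memory footprint, which matters more for the 7B model than for the 2B.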

Another great option is to use software like LM Studio, which sets all of this up for you. It provides a clean front end and uses the llama.cpp project on the backend. llama.cpp is an incredibly optimized library for running large language models; it supports quantized models, a variety of different LLMs, and CPU-GPU interop.
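If you would rather script against llama.cpp directly instead of going through LM Studio's interface, the llama-cpp-python bindings expose the same backend from Python. A minimal sketch, assuming you have downloaded a GGUF-quantized build of Gemma (the file path below is a placeholder):

from llama_cpp import Llama

# Load a quantized GGUF build of Gemma; the file path here is hypothetical
llm = Llama(model_path="./gemma-2b-it.Q4_K_M.gguf", n_ctx=2048)

output = llm("Write me a poem", max_tokens=200)
print(output["choices"][0]["text"])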

Conclusion

Google has made some groundbreaking excursions into the AI field. They were already at the forefront of AI research, so it's natural that they are coming to the forefront of commercial AI as well. There have been, however, many controversies around their image generation. Companies like Google pay very close attention to bias within their training data in order to keep that bias from being reflected in the model. However, these anti-bias measures backfired when their models started generating images of Black Nazis and other historically inaccurate representations. Although eliminating bias is a noble cause, these kinds of historical inaccuracies are practically racist themselves. Especially with open-source models, this is going to be an increasingly important issue that these kinds of companies need to address.
