No GPU? No problem. Google localllm lets you develop gen AI apps on local CPUs
In the latest edition of Data Pills, tailored for tech-savvy data aficionados, we're diving into a cutting-edge solution that is reshaping the GenAI development landscape. The primary obstacle in this realm has been the demand for GPUs, which are essential for working with large language models (LLMs) yet often scarce and expensive. Our spotlight this time is on an ingenious workaround that leverages the robust CPU and memory capabilities of Google Cloud's Cloud Workstations, a top-tier managed development environment. We've explored models available on Hugging Face, particularly those in the repository maintained by "The Bloke", which are quantized so they can run on CPUs or lower-power GPUs. This strategy not only removes the dependency on GPUs but also paves the way for streamlined and effective AI application development. By combining quantized models, Cloud Workstations, the new open-source tool localllm (see the GitHub repository at https://github.com/googlecloudplatform/localllm), and a spectrum of readily available resources, developers can build AI-driven applications on an advanced development workstation, enhancing traditional processes and workflows.
Boosting Productivity with Quantized Models and Cloud Workstations
Quantized models streamline AI development because they are optimized for devices with limited processing capability, improving memory and power efficiency. Run on Cloud Workstations, they offer several practical benefits for development.
Integrating these models with Cloud Workstations combines the latter's scalability and cost-effectiveness with the models' operational benefits. This approach sidesteps the latency, the security risks, and the reliance on external GPU services that come with remote, third-party inference setups, offering a more secure, efficient, and controllable development environment.
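To put the memory-efficiency point in concrete terms, here is a rough back-of-envelope calculation in Python (illustrative only; real memory usage also depends on context length, KV cache, and runtime overhead) comparing the approximate weight footprint of a 13B-parameter model at 16-bit precision versus 4-bit quantization:
# Back-of-envelope estimate of model weight memory (illustrative only).
# Real usage also includes the KV cache, activations, and runtime overhead.
params = 13e9                  # ~13 billion parameters, as in Llama-2-13B
bytes_fp16 = params * 2        # 16-bit floats: 2 bytes per parameter
bytes_q4 = params * 0.5        # 4-bit quantization: ~0.5 bytes per parameter
print(f"fp16 weights : ~{bytes_fp16 / 1e9:.0f} GB")   # ~26 GB
print(f"4-bit weights: ~{bytes_q4 / 1e9:.1f} GB")     # ~6.5 GB
At roughly a quarter of the footprint, a 13B model fits comfortably in the RAM of the e2-standard-32 workstation (128 GB) used later in this walkthrough.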
localllm
Today, we're unveiling localllm, a comprehensive suite of tools and libraries designed to simplify access to HuggingFace's quantized models via a command-line interface. This toolset is poised to revolutionize how developers utilize large language models (LLMs) by removing the barrier of GPU necessity. Localllm offers an all-in-one solution to operate LLMs directly on CPUs and memory within the Google Cloud Workstation environment. It also supports running these models on any local machine or system with adequate CPU resources. By bypassing the need for GPUs, localllm empowers developers to fully harness the capabilities of LLMs for their application development projects, enhancing flexibility and efficiency.
Getting started with localllm
To get started with localllm, visit the GitHub repository at https://github.com/googlecloudplatform/localllm. The repository provides detailed documentation, code samples, and step-by-step instructions for setting up and using LLMs locally on CPU and memory within the Google Cloud environment. You can explore the repository, contribute to its development, and leverage its capabilities to enhance your application development workflows.
Once you've cloned the repo locally, the following simple steps run localllm with a quantized model of your choice from "The Bloke" repository on HuggingFace and then execute an initial sample prompt query. In this example we use Llama 2.
# Install the tools
pip3 install openai
pip3 install ./llm-tool/.
# Download and run a model (serving it on port 8000)
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000
# Try out a query
./querylocal.py
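The repository's querylocal.py script issues the sample query for you. For reference, here is a minimal sketch of what such a query can look like, assuming the model is served locally on port 8000 behind an OpenAI-compatible API; the endpoint path and model identifier below are assumptions, so check the repo's script and the live /docs page for the authoritative details.
# Minimal sketch: query a locally served model through an OpenAI-compatible API.
# Assumes `llm run ... 8000` is serving on localhost:8000; details are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the local server, not api.openai.com
    api_key="not-needed-locally",         # placeholder; the local server ignores it
)
response = client.completions.create(
    model="TheBloke/Llama-2-13B-Ensemble-v5-GGUF",  # assumed model identifier
    prompt="Write a haiku about developer productivity.",
    max_tokens=128,
)
print(response.choices[0].text)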
Creating a localllm-enabled Cloud Workstation
To get started with localllm and Cloud Workstations, you'll need a Google Cloud project and the gcloud CLI installed. First, build a Cloud Workstations container image that includes localllm, then use it as the base for your developer workstation (which also comes pre-equipped with VS Code).
gcloud config set project $PROJECT_ID
# Enable needed services
gcloud services enable \
cloudbuild.googleapis.com \
workstations.googleapis.com \
container.googleapis.com \
containeranalysis.googleapis.com \
containerscanning.googleapis.com \
artifactregistry.googleapis.com
# Create AR Docker repository
gcloud artifacts repositories create localllm \
--location=us-central1 \
--repository-format=docker
Next, submit a Cloud Build job that builds the Dockerfile and pushes the resulting image to Artifact Registry.
gcloud builds submit .
The published image is named
us-central1-docker.pkg.dev/$PROJECT_ID/localllm/localllm.
The next step is to create and launch a workstation using our custom image. We suggest using a machine type of e2-standard-32 (32 vCPU, 16 core and 128 GB memory), an admittedly beefy machine.
The following example uses gcloud to set up a cluster, a workstation configuration, and a workstation based on our custom image with the llm tool installed. Replace $CLUSTER with your desired cluster name; the command below creates a new one (which takes ~20 minutes).
gcloud workstations clusters create $CLUSTER \
--region=us-central1
The next steps create the workstation and start it up. They take roughly 10 minutes to run.
# Create workstation configuration
gcloud workstations configs create localllm-workstation \
--region=us-central1 \
--cluster=$CLUSTER \
--machine-type=e2-standard-32 \
--container-custom-image=us-central1-docker.pkg.dev/$PROJECT_ID/localllm/localllm
# Create the workstation
gcloud workstations create localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
# Grant access to the default Cloud Workstation Service Account
gcloud artifacts repositories add-iam-policy-binding \
localllm \
--location=us-central1 \
--member=serviceAccount:service-$PROJECT_NUM@gcp-sa-workstationsvm.iam.gserviceaccount.com \
--role=roles/artifactregistry.reader
# Start the workstation
gcloud workstations start localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
You can connect to the workstation using ssh (shown below), or interactively in the browser.
gcloud workstations ssh localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
After serving a model (via the llm run command with the port of your choice), you can interact with it by visiting the live OpenAPI documentation page. This process applies to any model listed in "The Bloke's" repo on HuggingFace; Llama was used in this scenario as an example. First, get the hostname of the workstation using:
gcloud workstations describe localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
Then, in the browser, visit https://$PORT-$HOSTNAME/docs.
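If you prefer to query the model from code rather than the interactive docs page, the sketch below shows one possibility, run from inside the workstation (for example, over the ssh session opened above) so it can reach the server on localhost. The /v1/completions path and payload fields are assumptions modeled on a typical OpenAI-compatible server; consult the live /docs page for the actual schema.
# Illustrative sketch: call the locally served model over HTTP from inside
# the workstation. Verify the exact path and schema on the live /docs page.
import requests

PORT = 8000  # the port passed to `llm run`
payload = {
    "prompt": "Summarize the benefits of quantized models in one sentence.",
    "max_tokens": 96,
}
resp = requests.post(f"http://localhost:{PORT}/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])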
Conclusion
In summary, the integration of localllm with Cloud Workstations marks a significant advancement in AI application development, enabling large language models (LLMs) to run directly on CPU and memory within the Google Cloud platform. This innovative approach circumvents the limitations imposed by GPU scarcity while still tapping into the vast capabilities of LLMs. Localllm brings about a new era of productivity, cost-effectiveness, and enhanced data security, simplifying the creation of cutting-edge applications. Dive into the future of application development at https://cloud.google.com/blog/products/application-development/new-localllm-lets-you-develop-gen-ai-apps-locally-without-gpus
Embracing the future of AI development without limitations!