No GPU? No problem. Google localllm lets you develop gen AI apps on local CPUs
In the latest edition of Data Pills, tailored for tech-savvy data aficionados, we're diving into a cutting-edge solution that is reshaping the GenAI development landscape. The primary obstacle in this realm has been the demand for GPUs, which are essential for working with large language models (LLMs) yet often scarce and expensive. Our spotlight this time is on an ingenious workaround that leverages the robust CPU and memory capabilities of Google Cloud's Cloud Workstations, a top-tier managed development environment. We've explored models available on Hugging Face, particularly those in the repository maintained by "The Bloke", which are quantized so they can run on CPUs or lower-power GPUs. This strategy not only removes the dependency on GPUs but also paves the way for streamlined and effective AI application development. By combining quantized models, Cloud Workstations, the new open-source tool localllm (see the GitHub repository at https://github.com/googlecloudplatform/localllm), and a spectrum of readily available resources, developers can build AI-driven applications on an advanced development workstation, enhancing traditional processes and workflows.
Boosting Productivity with Quantized Models and Cloud Workstations
Quantized models streamline AI development because they are optimized for devices with limited processing capability, improving memory and power efficiency. Run on Cloud Workstations, they offer several practical benefits for development.
Integrating these models with Cloud Workstations combines the latter's scalability and cost-effectiveness with the models' operational benefits. This approach sidesteps the latency, the security risks, and the reliance on external GPU services that come with remote, third-party inference setups, offering a more secure, efficient, and controllable development environment.
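To put the memory-efficiency point in concrete terms, here is a rough back-of-envelope calculation in Python (illustrative only; real memory usage also depends on context length, KV cache, and runtime overhead) comparing the approximate weight footprint of a 13B-parameter model at 16-bit precision versus 4-bit quantization:
# Back-of-envelope estimate of model weight memory (illustrative only).
# Real usage also includes the KV cache, activations, and runtime overhead.
params = 13e9                  # ~13 billion parameters, as in Llama-2-13B
bytes_fp16 = params * 2        # 16-bit floats: 2 bytes per parameter
bytes_q4 = params * 0.5        # 4-bit quantization: ~0.5 bytes per parameter
print(f"fp16 weights : ~{bytes_fp16 / 1e9:.0f} GB")   # ~26 GB
print(f"4-bit weights: ~{bytes_q4 / 1e9:.1f} GB")     # ~6.5 GB
At roughly a quarter of the footprint, a 13B model fits comfortably in the RAM of the e2-standard-32 workstation (128 GB) used later in this walkthrough.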
localllm
Today, we're unveiling localllm, a comprehensive suite of tools and libraries designed to simplify access to HuggingFace's quantized models via a command-line interface. This toolset is poised to revolutionize how developers utilize large language models (LLMs) by removing the barrier of GPU necessity. Localllm offers an all-in-one solution to operate LLMs directly on CPUs and memory within the Google Cloud Workstation environment. It also supports running these models on any local machine or system with adequate CPU resources. By bypassing the need for GPUs, localllm empowers developers to fully harness the capabilities of LLMs for their application development projects, enhancing flexibility and efficiency.
Getting started with localllm
To get started with localllm, visit the GitHub repository at https://github.com/googlecloudplatform/localllm. The repository provides detailed documentation, code samples, and step-by-step instructions for setting up and using LLMs locally on CPU and memory within the Google Cloud environment. You can explore the repository, contribute to its development, and leverage its capabilities to enhance your application development workflows.
Once you've cloned the repo locally, the following simple steps run localllm with a quantized model of your choice from "The Bloke" repository on HuggingFace and then execute an initial sample prompt query. In this example we use Llama 2.
# Install the tools
pip3 install openai
pip3 install ./llm-tool/.
# Download and run a model (serving it on port 8000)
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000
# Try out a query
./querylocal.py
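The repository's querylocal.py script issues the sample query for you. For reference, here is a minimal sketch of what such a query can look like, assuming the model is served locally on port 8000 behind an OpenAI-compatible API; the endpoint path and model identifier below are assumptions, so check the repo's script and the live /docs page for the authoritative details.
# Minimal sketch: query a locally served model through an OpenAI-compatible API.
# Assumes `llm run ... 8000` is serving on localhost:8000; details are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the local server, not api.openai.com
    api_key="not-needed-locally",         # placeholder; the local server ignores it
)
response = client.completions.create(
    model="TheBloke/Llama-2-13B-Ensemble-v5-GGUF",  # assumed model identifier
    prompt="Write a haiku about developer productivity.",
    max_tokens=128,
)
print(response.choices[0].text)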
Creating a localllm-enabled Cloud Workstation
To get started with localllm and Cloud Workstations, you'll need a Google Cloud project and the gcloud CLI installed. First, build a Cloud Workstations container image that includes localllm, then use it as the base for your developer workstation (which also comes pre-equipped with VS Code).
gcloud config set project $PROJECT_ID
# Enable needed services
gcloud services enable \
cloudbuild.googleapis.com \
workstations.googleapis.com \
container.googleapis.com \
containeranalysis.googleapis.com \
containerscanning.googleapis.com \
artifactregistry.googleapis.com
# Create AR Docker repository
gcloud artifacts repositories create localllm \
--location=us-central1 \
--repository-format=docker
Next, submit a Cloud Build job that builds the Dockerfile and pushes the resulting image to Artifact Registry.
gcloud builds submit .
The published image is named
us-central1-docker.pkg.dev/$PROJECT_ID/localllm/localllm.
The next step is to create and launch a workstation using our custom image. We suggest using a machine type of e2-standard-32 (32 vCPU, 16 core and 128 GB memory), an admittedly beefy machine.
The following example uses gcloud to set up a cluster, a workstation configuration, and a workstation based on our custom image with the llm tool installed. Replace $CLUSTER with your desired cluster name; the command below creates a new one (which takes ~20 minutes).
gcloud workstations clusters create $CLUSTER \
--region=us-central1
The next steps create the workstation and start it up. They take roughly 10 minutes to run.
# Create workstation configuration
gcloud workstations configs create localllm-workstation \
--region=us-central1 \
--cluster=$CLUSTER \
--machine-type=e2-standard-32 \
--container-custom-image=us-central1-docker.pkg.dev/$PROJECT_ID/localllm/localllm
# Create the workstation
gcloud workstations create localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
# Grant access to the default Cloud Workstation Service Account
gcloud artifacts repositories add-iam-policy-binding \
localllm \
--location=us-central1 \
--member=serviceAccount:service-$PROJECT_NUM@gcp-sa-workstationsvm.iam.gserviceaccount.com \
--role=roles/artifactregistry.reader
# Start the workstation
gcloud workstations start localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
You can connect to the workstation using ssh (shown below), or interactively in the browser.
gcloud workstations ssh localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
After serving a model (via the llm run command with the port of your choice), you can interact with it by visiting the live OpenAPI documentation page. This process applies to any model listed in "The Bloke's" repo on HuggingFace; Llama was used in this scenario as an example. First, get the hostname of the workstation using:
gcloud workstations describe localllm-workstation \
--cluster=$CLUSTER \
--config=localllm-workstation \
--region=us-central1
Then, in the browser, visit https://$PORT-$HOSTNAME/docs.
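If you prefer to query the model from code rather than the interactive docs page, the sketch below shows one possibility, run from inside the workstation (for example, over the ssh session opened above) so it can reach the server on localhost. The /v1/completions path and payload fields are assumptions modeled on a typical OpenAI-compatible server; consult the live /docs page for the actual schema.
# Illustrative sketch: call the locally served model over HTTP from inside
# the workstation. Verify the exact path and schema on the live /docs page.
import requests

PORT = 8000  # the port passed to `llm run`
payload = {
    "prompt": "Summarize the benefits of quantized models in one sentence.",
    "max_tokens": 96,
}
resp = requests.post(f"http://localhost:{PORT}/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])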
Conclusion
In summary, the integration of localllm with Cloud Workstations marks a significant advancement in AI application development, enabling large language models (LLMs) to run directly on CPU and memory within the Google Cloud platform. This innovative approach circumvents the limitations imposed by GPU scarcity while still tapping into the vast capabilities of LLMs. Localllm brings about a new era of productivity, cost-effectiveness, and enhanced data security, simplifying the creation of cutting-edge applications. Dive into the future of application development at https://cloud.google.com/blog/products/application-development/new-localllm-lets-you-develop-gen-ai-apps-locally-without-gpus
Embracing the future of AI development without limitations!