When to use TPUs on ML Workloads
Guillermo Perasso
Senior Director, Training at Jellyfish - Google Cloud Certified Professional Google Cloud trainer of the year 2023 North America
This is the first article in a series on taking the Google Cloud Professional Machine Learning Certification exam. This text is intended to aid those who are preparing for the certification by presenting some key topics that may be included in the 50 questions on the exam.
Today, we will look into tensor processing units (TPUs). I will provide a quick overview of this technology created by Google, followed by a brief analysis of the ML use cases in which TPUs would yield a better solution in terms of performance and, potentially, cost. We will focus primarily on the training stage of large deep learning models built on neural network algorithms.
We also present some hints to pick up in an exam question to determine whether an option involving TPUs would be the best choice, or whether a CPU or GPU alternative better addresses the requirements described (or, more likely, implied) in the question.
This text is mostly based on content available here.
Introduction to TPUs:
TPUs (Tensor Processing Units) are specialised hardware accelerators created by Google for machine learning workloads. They are designed to deliver high performance and energy efficiency for training and deploying ML models. TPUs excel at matrix operations, making them ideal for deep learning tasks such as image recognition, natural language processing, and speech recognition.
TPUs are composed of multiple processing cores, each optimised for performing specific ML operations. These cores are interconnected through a high-speed network, allowing for efficient data transfer and communication. TPUs also feature a large on-chip memory, enabling them to store and process vast amounts of data. This architecture enables TPUs to achieve exceptional performance and throughput for ML workloads.
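To see why matrix operations dominate these workloads, consider a single dense layer of a neural network: its forward pass is essentially one matrix multiplication plus a bias add, which is exactly the kind of computation TPU cores are built to accelerate. Here is a minimal NumPy sketch (illustrative only; all shapes and values are made up):

```python
import numpy as np

# A toy dense layer: 128 input features -> 64 output units.
rng = np.random.default_rng(0)
weights = rng.standard_normal((128, 64))
bias = np.zeros(64)

def dense_forward(batch):
    """Forward pass: one matrix multiply plus a bias add.

    On a TPU, this matmul is what the matrix units execute;
    larger batches turn into larger, more efficient matrix products.
    """
    return np.maximum(batch @ weights + bias, 0.0)  # ReLU activation

# A batch of 32 examples becomes a single (32, 128) x (128, 64) matmul.
activations = dense_forward(rng.standard_normal((32, 128)))
print(activations.shape)  # (32, 64)
```

A deep model is, to a first approximation, a long chain of such matrix products, which is why hardware optimised for them pays off so directly.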
Programming Models for TPUs:
TPUs are supported by several programming models and frameworks, including:
TensorFlow, through high-level APIs such as Keras and its distribution strategies.
PyTorch, via the PyTorch/XLA integration.
JAX, which compiles numerical programs to TPUs through the XLA compiler.
TPUs are ideally suited for applications that require:
Large-Scale Training: TPUs excel at training complex ML models with extensive datasets, enabling faster convergence and improved accuracy. Training processes that are expected to take several days or weeks to complete are good candidates for TPUs.
High-Throughput Inference: TPUs are well-suited for deploying ML models in production environments, where they can handle large volumes of inference requests with low latency.
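The high-throughput inference point above can be illustrated with a hedged sketch: batching many incoming requests into one tensor turns many small computations into a single large matrix product, which is the access pattern accelerators like TPUs are optimised for. (Pure NumPy stand-in; no real TPU or serving stack is involved, and the model is invented.)

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.standard_normal((256, 10))  # toy model: 256 features -> 10 classes

def predict_one(x):
    # Unbatched: one small vector-matrix product per request.
    return x @ weights

def predict_batch(xs):
    # Batched: all pending requests fused into a single large matmul.
    return xs @ weights

requests = rng.standard_normal((500, 256))  # 500 pending inference requests
batched = predict_batch(requests)
unbatched = np.stack([predict_one(x) for x in requests])
assert np.allclose(batched, unbatched)  # same answers, one fused computation
```

Real serving systems add request queuing and latency budgets on top, but the underlying win is the same: bigger batches, bigger matrix products, better accelerator utilisation.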
Google Cloud offers various TPU options to meet diverse user requirements:
TPU Pods are pre-configured systems that provide a complete hardware and software environment for running ML workloads on TPUs.
TPU VMs are virtual machines that include TPUs, allowing users to create custom configurations and leverage existing tooling and frameworks.
Cloud TPU is a fully managed service that provides access to TPUs without the need for hardware setup and maintenance.
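As a rough sketch of the TPU VM option, provisioning typically looks like the following gcloud commands. The VM name, zone, accelerator type, and runtime version below are illustrative placeholders; check the current Cloud TPU documentation for supported values.

```shell
# Create a TPU VM (all values below are illustrative placeholders).
gcloud compute tpus tpu-vm create my-tpu-vm \
  --zone=us-central1-b \
  --accelerator-type=v3-8 \
  --version=tpu-vm-tf-2.13.0

# SSH into the TPU VM to run training jobs with your framework of choice.
gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-central1-b

# Delete the TPU VM when finished to stop incurring charges.
gcloud compute tpus tpu-vm delete my-tpu-vm --zone=us-central1-b
```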
TPUs do not always provide cost or performance benefits when training ML models. In some situations, you might want to use GPUs or CPUs on Compute Engine instances to run your machine learning workloads. In general, you can decide which hardware is best for your workload based on the following guidelines:
Less expensive, more widely available CPUs may be the right choice when working with:
Quick prototyping that requires maximum flexibility.
Simple models that do not take long to train.
Small models with small effective batch sizes.
Models dominated by custom operations written in C++.
Models limited by available I/O or by the networking bandwidth of the host system.
GPUs (Graphics Processing Units) are suitable for:
Models with a significant number of custom operations that must run at least partially on CPUs.
Models with operations that are not available on Cloud TPU.
Medium-to-large models with larger effective batch sizes.
These are the scenarios where TPUs clearly provide benefits over other processors:
Models dominated by matrix computations.
Models with no custom operations inside the main training loop.
Models that train for weeks or months.
Large and very large models with very large effective batch sizes.
And finally, these are some workloads where Cloud TPUs are not expected to perform well:
Workloads that require frequent branching or are dominated by element-wise algebra.
Workloads that access memory in a sparse manner.
Workloads that require high-precision arithmetic (for example, double-precision).
Neural network workloads that contain custom operations in the main training loop.
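The guidelines above can be condensed into a toy decision helper. This is not an official Google rubric, just a sketch encoding the rules of thumb from this article; every class, field, and function name here is invented for illustration, and real decisions should be backed by benchmarking.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Invented fields summarising the decision factors discussed above."""
    dominated_by_matmul: bool       # mostly matrix computations?
    has_custom_ops: bool            # custom ops in the main training loop?
    needs_double_precision: bool    # high-precision arithmetic required?
    expected_training_weeks: float  # rough wall-clock training estimate
    large_batches: bool             # large effective batch sizes?

def pick_hardware(w: Workload) -> str:
    """Rule-of-thumb hardware choice based on this article's guidelines."""
    if w.needs_double_precision or not w.dominated_by_matmul:
        return "CPU"  # TPUs target low-precision matrix math
    if w.has_custom_ops:
        return "GPU"  # custom ops generally cannot run on Cloud TPU
    if w.expected_training_weeks >= 1 or w.large_batches:
        return "TPU"  # long-running, matmul-heavy, large-batch training
    return "GPU"      # medium-sized jobs without TPU-specific wins

# Example: a large model trained for a month with big batches.
big_job = Workload(True, False, False, 4.0, True)
print(pick_hardware(big_job))  # TPU
```

The branch order matters: disqualifying factors (double precision, non-matrix workloads, custom ops) are checked before the factors that favour TPUs, mirroring how you should eliminate options in an exam question.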
Conclusion
As you study and prepare for the Google Cloud PMLE certification exam, you need to be able to identify the aspects of an ML workload where provisioning TPUs would provide clear benefits in performance and cost. In general, the following factors are key: the size of the training dataset, the batch size, and the model complexity (the larger, the better); the use of standard TensorFlow operations (no custom operations) in the model; the use of low-precision arithmetic; and, last but not least, the ability to run the workload on Google infrastructure, the only platform where TPU resources are available.
For instance, here is a question you can find if you check the “sample exam questions” on the Google Cloud Professional Machine Learning Engineer site.
[Sample exam question image not reproduced here.]
You can see clearly in this sample question how the decision factors introduced in this article come into play when selecting the right answer.