Geek Out Time: Simulating Distributed Training on TPU & GPU in Google Colab

(Also on Constellar tech blog https://medium.com/the-constellar-digital-technology-blog/geek-out-time-simulating-distributed-training-on-tpu-gpu-in-google-colab-2b693a342724)

Introduction

Distributed training is essential for scaling deep learning models, allowing training to be spread across multiple hardware accelerators. Google Colab's free tier provides access to both a TPU v2-8 (8 cores) and a single NVIDIA T4 GPU.

In this Geek Out Time, I play with distributed training on both TPU and GPU in Google Colab to explore how they handle distributed workloads.

What is Distributed Training?

Distributed training refers to dividing computations across multiple devices (e.g., TPUs, GPUs, or CPUs).

  • On TPU, TPUStrategy() replicates the model across all 8 cores and trains them in sync.
  • On GPU, MirroredStrategy() does the same across multiple GPUs. Since Colab Free provides only one GPU, I am limited to simulating a multi-GPU setup (a quick device check is sketched below).
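
Before picking a strategy, it helps to confirm what the runtime actually exposes. Here is a minimal check, assuming a standard Colab runtime with TensorFlow preinstalled:

import tensorflow as tf

# GPUs are visible immediately; TPU cores typically only appear after the
# TPU system has been initialized (see Step 2 of Experiment 1 below).
print("GPUs visible:", tf.config.list_logical_devices("GPU"))
print("TPU cores visible:", tf.config.list_logical_devices("TPU"))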

Experiment Setup

We train a Fashion MNIST classifier using a simple feedforward neural network. The same architecture, per-replica batch size (32), and optimizer settings are used for both TPU and GPU; only the global batch size changes, because it is scaled by the number of replicas.

Experiment 1: Distributed Training on TPU

This experiment uses TPUStrategy() to enable training across 8 TPU cores.

# Step 1: Import Libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import time

# Step 2: Initialize TPU
def setup_tpu():
    try:
        # Detect the TPU, connect to it, and initialize all of its cores
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
        print(f"TPU detected with {strategy.num_replicas_in_sync} cores")
        return strategy
    except Exception:
        print("No TPU detected. Using CPU/GPU strategy")
        return tf.distribute.MirroredStrategy()

strategy = setup_tpu()
# Step 3: Load and Preprocess Dataset
BATCH_SIZE = 32 * strategy.num_replicas_in_sync  # 32 per replica (256 globally on an 8-core TPU)
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0  # scale pixels to [0, 1]
    if image.shape.rank == 2:  # TFDS already returns (28, 28, 1); only expand if needed
        image = tf.expand_dims(image, axis=-1)
    return image, label

def prepare_dataset(dataset, shuffle=False):
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

dataset, info = tfds.load("fashion_mnist", as_supervised=True, with_info=True)
train_data, test_data = dataset["train"], dataset["test"]
train_data = prepare_dataset(train_data, shuffle=True)
test_data = prepare_dataset(test_data)
# Step 4: Define Model Inside TPU Scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["sparse_categorical_accuracy"]
    )
print("Model compiled and ready for TPU training")
# Step 5: Train the Model
start_time = time.time()
history = model.fit(train_data, epochs=10, validation_data=test_data)
training_time = time.time() - start_time
print(f"TPU Training completed in {training_time:.2f} seconds")
# Step 6: Save the Model
model.save("fashion_mnist_tpu_distributed.keras")
print("Model saved successfully")        

Experiment 2: Simulated Distributed Training on GPU

Since Colab Free provides only one GPU, MirroredStrategy() runs with a single replica here, but it exercises the same code path that would distribute training across several GPUs. (If you actually want more than one replica on a single card, see the logical-GPU sketch below; I did not use it for the timed run.)
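
As a side note, TensorFlow can split one physical GPU into several logical GPUs, which lets MirroredStrategy run more than one replica on a single T4. This is a minimal sketch only, not part of the experiment below; it assumes the memory limits fit on the card and must run before any other GPU work in the session:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Split the first physical GPU into two logical GPUs of ~5 GB each
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=5120),
         tf.config.LogicalDeviceConfiguration(memory_limit=5120)],
    )
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas:", strategy.num_replicas_in_sync)  # expect 2 logical GPUs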

# Step 1: Import Libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import time

# Step 2: Initialize MirroredStrategy for GPU
def setup_gpu():
    # MirroredStrategy() does not raise when no GPU is present, so check explicitly
    if tf.config.list_physical_devices("GPU"):
        strategy = tf.distribute.MirroredStrategy()
        print(f"Running on {strategy.num_replicas_in_sync} GPU(s)")
        return strategy
    print("No GPU detected. Falling back to the default (CPU) strategy")
    return tf.distribute.get_strategy()

strategy = setup_gpu()
# Step 3: Load and Preprocess Dataset
BATCH_SIZE = 32 * strategy.num_replicas_in_sync  # 32 per replica; 32 globally with a single GPU
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0  # scale pixels to [0, 1]
    if image.shape.rank == 2:  # TFDS already returns (28, 28, 1); only expand if needed
        image = tf.expand_dims(image, axis=-1)
    return image, label

def prepare_dataset(dataset, shuffle=False):
    # Redefined here because switching the Colab runtime to GPU resets the session
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

dataset, info = tfds.load("fashion_mnist", as_supervised=True, with_info=True)
train_data, test_data = dataset["train"], dataset["test"]
train_data = prepare_dataset(train_data, shuffle=True)
test_data = prepare_dataset(test_data)
# Step 4: Define Model Inside GPU Scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["sparse_categorical_accuracy"]
    )
print("Model compiled and ready for simulated multi-GPU training")
# Step 5: Train the Model
start_time = time.time()
history = model.fit(train_data, epochs=10, validation_data=test_data)
training_time = time.time() - start_time
print(f"GPU Training completed in {training_time:.2f} seconds")
# Step 6: Save the Model
model.save("fashion_mnist_gpu_distributed.keras")
print("Model saved successfully")
        

Running both notebooks prints the per-epoch metrics and the total training time for each setup.

Observations and Key Takeaways

  • TPU runs true distributed training across 8 cores, while Colab Free provides only 1 GPU, so the "multi-GPU" run is really a single replica (see the quick steps-per-epoch arithmetic below).
  • With 8 GPUs instead of 1, GPU performance would likely improve significantly.
  • This experiment is not a benchmark comparison, but rather an exploration of how TPU and GPU handle distributed training differently.
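
The batch-size scaling in Step 3 also explains why the two runs feel so different: with the same per-replica batch of 32, the TPU consumes the 60,000 training images in far fewer, larger steps. A back-of-the-envelope check:

per_replica_batch = 32
train_images = 60000  # Fashion MNIST training split
for replicas in (8, 1):  # TPU v2-8 vs. a single T4
    global_batch = per_replica_batch * replicas
    print(f"{replicas} replica(s): global batch {global_batch}, "
          f"~{train_images // global_batch} steps per epoch")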

Conclusion

This Geek Out Time explored distributed training on TPU vs. GPU in Google Colab. While TPU benefits from true multi-core parallelism, GPU is constrained by Colab Free’s single GPU limit. If we want a fair multi-GPU test, we would need to run this experiment on cloud services like AWS or Google Cloud, where we could access multiple GPUs.

The integration of Gemini 2.0 Flash into Colab has been impressive and incredibly helpful for troubleshooting. It quickly assists with debugging, optimizing code, and resolving errors. However, I do wish there were access to a more capable model for deeper insights and complex problem-solving. But given that this is available in the free tier, you really can't ask for much more…

Happy coding!
