Geek Out Time: Simulating Distributed Training on TPU & GPU in Google Colab

(Also on Constellar tech blog https://medium.com/the-constellar-digital-technology-blog/geek-out-time-simulating-distributed-training-on-tpu-gpu-in-google-colab-2b693a342724)

Introduction

Distributed training is essential for scaling deep learning models, allowing training to be spread across multiple hardware accelerators. Google Colab's free tier provides access to both a TPU v2-8 (8 cores) and a single NVIDIA T4 GPU.

In this Geek Out Time, I play with distributed training on both TPU and GPU in Google Colab to explore how they handle distributed workloads.

What is Distributed Training?

Distributed training refers to dividing computations across multiple devices (e.g., TPUs, GPUs, or CPUs).

  • On TPU, TPUStrategy() replicates the model across all 8 cores and trains them in sync.
  • On GPU, MirroredStrategy() does the same across multiple GPUs. Since Colab Free provides only one GPU, I am limited to simulating a multi-GPU setup (a quick device check is sketched below).
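
Before picking a strategy, it helps to confirm what the runtime actually exposes. Here is a minimal check, assuming a standard Colab runtime with TensorFlow preinstalled:

import tensorflow as tf

# GPUs are visible immediately; TPU cores typically only appear after the
# TPU system has been initialized (see Step 2 of Experiment 1 below).
print("GPUs visible:", tf.config.list_logical_devices("GPU"))
print("TPU cores visible:", tf.config.list_logical_devices("TPU"))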

Experiment Setup

We train a Fashion MNIST classifier using a simple feedforward neural network. The same architecture, per-replica batch size (32), and optimizer settings are used for both TPU and GPU; only the global batch size changes, because it is scaled by the number of replicas.

Experiment 1: Distributed Training on TPU

This experiment uses TPUStrategy() to enable training across 8 TPU cores.

# Step 1: Import Libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import time

# Step 2: Initialize TPU
def setup_tpu():
    try:
        # Detect the TPU, connect to it, and initialize all of its cores
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
        print(f"TPU detected with {strategy.num_replicas_in_sync} cores")
        return strategy
    except Exception:
        print("No TPU detected. Using CPU/GPU strategy")
        return tf.distribute.MirroredStrategy()

strategy = setup_tpu()
# Step 3: Load and Preprocess Dataset
BATCH_SIZE = 32 * strategy.num_replicas_in_sync  # 32 per replica (256 globally on an 8-core TPU)
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0  # scale pixels to [0, 1]
    if image.shape.rank == 2:  # TFDS already returns (28, 28, 1); only expand if needed
        image = tf.expand_dims(image, axis=-1)
    return image, label

def prepare_dataset(dataset, shuffle=False):
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

dataset, info = tfds.load("fashion_mnist", as_supervised=True, with_info=True)
train_data, test_data = dataset["train"], dataset["test"]
train_data = prepare_dataset(train_data, shuffle=True)
test_data = prepare_dataset(test_data)
# Step 4: Define Model Inside TPU Scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["sparse_categorical_accuracy"]
    )
print("Model compiled and ready for TPU training")
# Step 5: Train the Model
start_time = time.time()
history = model.fit(train_data, epochs=10, validation_data=test_data)
training_time = time.time() - start_time
print(f"TPU Training completed in {training_time:.2f} seconds")
# Step 6: Save the Model
model.save("fashion_mnist_tpu_distributed.keras")
print("Model saved successfully")        

Experiment 2: Simulated Distributed Training on GPU

Since Colab Free provides only one GPU, MirroredStrategy() runs with a single replica here, but it exercises the same code path that would distribute training across several GPUs. (If you actually want more than one replica on a single card, see the logical-GPU sketch below; I did not use it for the timed run.)
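
As a side note, TensorFlow can split one physical GPU into several logical GPUs, which lets MirroredStrategy run more than one replica on a single T4. This is a minimal sketch only, not part of the experiment below; it assumes the memory limits fit on the card and must run before any other GPU work in the session:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Split the first physical GPU into two logical GPUs of ~5 GB each
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=5120),
         tf.config.LogicalDeviceConfiguration(memory_limit=5120)],
    )
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas:", strategy.num_replicas_in_sync)  # expect 2 logical GPUs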

# Step 1: Import Libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import time

# Step 2: Initialize MirroredStrategy for GPU
def setup_gpu():
    # MirroredStrategy() does not raise when no GPU is present, so check explicitly
    if tf.config.list_physical_devices("GPU"):
        strategy = tf.distribute.MirroredStrategy()
        print(f"Running on {strategy.num_replicas_in_sync} GPU(s)")
        return strategy
    print("No GPU detected. Falling back to the default (CPU) strategy")
    return tf.distribute.get_strategy()

strategy = setup_gpu()
# Step 3: Load and Preprocess Dataset
BATCH_SIZE = 32 * strategy.num_replicas_in_sync  # 32 per replica; 32 globally with a single GPU
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0  # scale pixels to [0, 1]
    if image.shape.rank == 2:  # TFDS already returns (28, 28, 1); only expand if needed
        image = tf.expand_dims(image, axis=-1)
    return image, label

def prepare_dataset(dataset, shuffle=False):
    # Redefined here because switching the Colab runtime to GPU resets the session
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

dataset, info = tfds.load("fashion_mnist", as_supervised=True, with_info=True)
train_data, test_data = dataset["train"], dataset["test"]
train_data = prepare_dataset(train_data, shuffle=True)
test_data = prepare_dataset(test_data)
# Step 4: Define Model Inside GPU Scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["sparse_categorical_accuracy"]
    )
print("Model compiled and ready for simulated multi-GPU training")
# Step 5: Train the Model
start_time = time.time()
history = model.fit(train_data, epochs=10, validation_data=test_data)
training_time = time.time() - start_time
print(f"GPU Training completed in {training_time:.2f} seconds")
# Step 6: Save the Model
model.save("fashion_mnist_gpu_distributed.keras")
print("Model saved successfully")
        

Running both notebooks prints the per-epoch metrics and the total training time for each setup.

Observations and Key Takeaways

  • TPU runs true distributed training across 8 cores, while Colab Free provides only 1 GPU, so the "multi-GPU" run is really a single replica (see the quick steps-per-epoch arithmetic below).
  • With 8 GPUs instead of 1, GPU performance would likely improve significantly.
  • This experiment is not a benchmark comparison, but rather an exploration of how TPU and GPU handle distributed training differently.
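
The batch-size scaling in Step 3 also explains why the two runs feel so different: with the same per-replica batch of 32, the TPU consumes the 60,000 training images in far fewer, larger steps. A back-of-the-envelope check:

per_replica_batch = 32
train_images = 60000  # Fashion MNIST training split
for replicas in (8, 1):  # TPU v2-8 vs. a single T4
    global_batch = per_replica_batch * replicas
    print(f"{replicas} replica(s): global batch {global_batch}, "
          f"~{train_images // global_batch} steps per epoch")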

Conclusion

This Geek Out Time explored distributed training on TPU vs. GPU in Google Colab. While TPU benefits from true multi-core parallelism, GPU is constrained by Colab Free’s single GPU limit. If we want a fair multi-GPU test, we would need to run this experiment on cloud services like AWS or Google Cloud, where we could access multiple GPUs.

The integration of Gemini 2.0 Flash into Colab has been impressive and incredibly helpful for troubleshooting. It quickly assists with debugging, optimizing code, and resolving errors. However, I do wish there were access to a more capable model for deeper insights and complex problem-solving. But given that this is available in the free tier, you really can't ask for much more…

Happy coding!
