Geek Out Time: Simulating Distributed Training on TPU & GPU in Google Colab
(Also on Constellar tech blog https://medium.com/the-constellar-digital-technology-blog/geek-out-time-simulating-distributed-training-on-tpu-gpu-in-google-colab-2b693a342724)
Introduction
Distributed training is essential for scaling deep learning models, because it spreads the work across multiple hardware accelerators. Google Colab's free tier provides access to a TPU v2-8 (8 cores) and a single NVIDIA T4 GPU.
In this Geek Out Time, I play with distributed training on both TPU and GPU in Google Colab to explore how they handle distributed workloads.
What is Distributed Training?
Distributed training refers to dividing computations across multiple devices (e.g., TPUs, GPUs, or CPUs). In the data-parallel form used here, each device keeps a copy of the model, processes its own slice of every batch, and the resulting gradients are averaged (all-reduced) across devices before each weight update.
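To make this concrete, here is a tiny, self-contained sketch (separate from the experiments below) of how tf.distribute handles synchronous data parallelism: each replica processes its shard of a global batch, and the per-replica results are combined the same way gradients are all-reduced during training. On a CPU-only runtime it simply runs with one replica.
import tensorflow as tf

# Minimal sketch of synchronous data parallelism (illustrative only).
# MirroredStrategy uses whatever accelerators are visible; with no GPU it
# falls back to a single replica on CPU.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

GLOBAL_BATCH = 64  # each replica receives GLOBAL_BATCH / num_replicas examples

dataset = tf.data.Dataset.range(256).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def step(shard):
    # Each replica computes a partial result over its shard of the batch...
    return tf.reduce_sum(tf.cast(shard, tf.float32))

for batch in dist_dataset:
    per_replica = strategy.run(step, args=(batch,))
    # ...and the partial results are combined across replicas, the same way
    # gradients are all-reduced before each optimizer update.
    total = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
    print("Combined result for this global batch:", float(total))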
Experiment Setup
We train a Fashion MNIST classifier using a simple feedforward neural network. The same architecture, batch size, and optimizer settings are used for both TPU and GPU to ensure a consistent setup.
Experiment 1: Distributed Training on TPU
This experiment uses TPUStrategy() to enable training across 8 TPU cores.
# Step 1: Import Libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import time
# Step 2: Initialize TPU
def setup_tpu():
    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
        print(f"TPU detected with {strategy.num_replicas_in_sync} cores")
        return strategy
    except (ValueError, tf.errors.NotFoundError):
        # No TPU runtime attached to this session; fall back to CPU/GPU.
        print("No TPU detected. Using CPU/GPU strategy")
        return tf.distribute.MirroredStrategy()

strategy = setup_tpu()
# Step 3: Load and Preprocess Dataset
BATCH_SIZE = 32 * strategy.num_replicas_in_sync  # 32 per replica -> 256 global on a v2-8
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    # tfds fashion_mnist images are already shaped (28, 28, 1); just rescale.
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

def prepare_dataset(dataset, shuffle=False):
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

dataset, info = tfds.load("fashion_mnist", as_supervised=True, with_info=True)
train_data, test_data = dataset["train"], dataset["test"]
train_data = prepare_dataset(train_data, shuffle=True)
test_data = prepare_dataset(test_data)
# Step 4: Define Model Inside TPU Scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["sparse_categorical_accuracy"]
    )
print("Model compiled and ready for TPU training")
# Step 5: Train the Model
start_time = time.time()
history = model.fit(train_data, epochs=10, validation_data=test_data)
training_time = time.time() - start_time
print(f"TPU Training completed in {training_time:.2f} seconds")
# Step 6: Save the Model
model.save("fashion_mnist_tpu_distributed.keras")
print("Model saved successfully")
Experiment 2: Simulated Distributed Training on GPU
Since Colab's free tier exposes only one GPU, I use MirroredStrategy(), which runs the same distributed code path but with a single replica, so this simulates the multi-GPU setup rather than delivering real multi-GPU parallelism. (A sketch after the code shows how to split the T4 into logical GPUs if you want MirroredStrategy to synchronize across more than one replica.)
# Step 1: Import Libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import time
# Step 2: Initialize MirroredStrategy for GPU
def setup_gpu():
    if tf.config.list_physical_devices("GPU"):
        strategy = tf.distribute.MirroredStrategy()
        print(f"Running on {strategy.num_replicas_in_sync} GPU(s)")
        return strategy
    # No GPU attached to this runtime; use the default (CPU) strategy.
    print("No GPU detected. Falling back to CPU")
    return tf.distribute.get_strategy()

strategy = setup_gpu()
# Step 3: Load and Preprocess Dataset
BATCH_SIZE = 32 * strategy.num_replicas_in_sync
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(image, label):
    # tfds fashion_mnist images are already shaped (28, 28, 1); just rescale.
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# Re-defined here so this script runs standalone, matching the TPU version.
def prepare_dataset(dataset, shuffle=False):
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

dataset, info = tfds.load("fashion_mnist", as_supervised=True, with_info=True)
train_data, test_data = dataset["train"], dataset["test"]
train_data = prepare_dataset(train_data, shuffle=True)
test_data = prepare_dataset(test_data)
# Step 4: Define Model Inside GPU Scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32")
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["sparse_categorical_accuracy"]
    )
print("Model compiled and ready for simulated multi-GPU training")
# Step 5: Train the Model
start_time = time.time()
history = model.fit(train_data, epochs=10, validation_data=test_data)
training_time = time.time() - start_time
print(f"GPU Training completed in {training_time:.2f} seconds")
# Step 6: Save the Model
model.save("fashion_mnist_gpu_distributed.keras")
print("Model saved successfully")
Running the script prints the per-epoch metrics and the total training time, which you can compare directly with the TPU run.
Observations from the Experiment
Key Takeaways
Conclusion
This Geek Out Time explored distributed training on TPU vs. GPU in Google Colab. While TPU benefits from true multi-core parallelism, GPU is constrained by Colab Free’s single GPU limit. If we want a fair multi-GPU test, we would need to run this experiment on cloud services like AWS or Google Cloud, where we could access multiple GPUs.
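For a true multi-GPU comparison on a cloud VM, the training code above would not need to change; only the strategy setup has to see more than one physical GPU. A minimal sketch, assuming a machine with several GPUs attached:
import tensorflow as tf

# On a VM with multiple physical GPUs, MirroredStrategy picks up every
# visible GPU and all-reduces gradients across them automatically.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch with the replica count, as in the experiments above.
BATCH_SIZE = 32 * strategy.num_replicas_in_sync

# For training across multiple machines, tf.distribute.MultiWorkerMirroredStrategy
# is the analogous option.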
The integration of Gemini 2.0 Flash into Colab has been impressive and incredibly helpful for troubleshooting. It quickly assists with debugging, optimizing code, and resolving errors. I do wish there were access to a more capable model for deeper insights and complex problem-solving, but given that this is available in the free tier, you really can't ask for much more…
Happy coding!