Make Hardware Work For You: Part 1 – Optimizing Code For Deep Learning Model Training on CPU
Christopher Touchberry
BIG shout out to Kevin Scott for assisting with this article! More to come!!!
Make Hardware Work For You – Introduction
The increasing complexity of deep learning models demands not just powerful hardware but also optimized code to make software and hardware work for you. Top Flight Computers' custom builds are specifically optimized for certain workflows, such as deep learning and high-performance computing.
This specialization is crucial because it ensures that every component—from the CPU and GPU to the memory and storage—is selected and configured to maximize efficiency and performance for specific tasks. By aligning hardware capabilities with software requirements, users can achieve significant improvements in processing speed, resource utilization, and overall productivity. By customizing code to match your hardware capabilities, you can significantly enhance deep learning model training performance.
In this article, we’ll explore how to optimize deep learning code for the CPU to take full advantage of a high-end custom-built system, showcasing profiling and benchmarking improvements with perf and hyperfine. In the next article, we will discuss GPU-based optimizations.
Overview of the High-Performance Hardware
Relevance to Deep Learning
The Importance of Code Optimization in Deep Learning
In the rapidly evolving landscape of deep learning, where large datasets and complex algorithms converge with powerful hardware, code optimization plays a critical role. While advances in hardware—like GPUs and TPUs—have transformed what’s possible, poorly optimized code can severely limit performance.
Failing to fully utilize the capabilities of modern systems leads to slower training, increased costs, and less efficient workflows. Code optimization, therefore, is essential for maximizing resources and time. To truly unlock the potential of cutting-edge hardware, software must be carefully tailored to take advantage of its strengths.
Without this, training processes can become unnecessarily slow and resource-intensive, reducing the efficiency of deep learning workflows.
The Benefits of Customizing Code
Optimizing Deep Learning Model Training
The Baseline
PyTorch is one of the most popular frameworks to get started with. We’ll walk through setting up a simple convolutional neural network (CNN) using PyTorch’s default configuration. This setup will then be optimized and expanded. We will use the MNIST dataset for this exercise.
The MNIST (Modified National Institute of Standards and Technology) dataset is a widely recognized benchmark in the deep learning community, especially for image classification tasks. It serves as a starting point for deep learning due to its simplicity and well-defined structure. Here are some details regarding the data set.
Image Classes: 10 (handwritten digits 0 through 9)
Number of Samples: 60,000 training images and 10,000 test images
Image Specifications: 28 × 28 pixel grayscale images
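For reference, a common way to obtain MNIST in PyTorch is through torchvision (the article’s repository uses its own download script, linked in the Access to Code section). Here is a minimal sketch, with the data directory as an assumed path and the baseline batch size of 64 used later in this article:

import torch
from torchvision import datasets, transforms

# Download MNIST (if not already cached) and convert images to tensors in [0, 1]
transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# A DataLoader with the baseline settings: batch size 64, single worker process
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)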
What is a CNN?
A Convolutional Neural Network (CNN) is a specialized deep learning architecture designed to process data with a grid-like topology, such as images. CNNs are particularly effective for image classification tasks due to their ability to automatically and adaptively learn spatial hierarchies of features from input images.
Key Components of Our CNN Model
1. Convolutional Layers (nn.Conv2d):
2. Batch Normalization:
3. Activation Functions:
4. Pooling Layers:
5. Fully Connected Layers:
6. Dropout (nn.Dropout):
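Putting these pieces together, here is a minimal sketch of what such a model can look like in PyTorch. The layer sizes (two convolutional blocks, a 128-unit hidden layer, 50% dropout) are illustrative assumptions; the exact architecture used in this article is in the baseline_cnn.py script linked below.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Illustrative CNN for 28x28 grayscale MNIST images."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # convolutional layer
            nn.BatchNorm2d(32),                            # batch normalization
            nn.ReLU(),                                     # activation function
            nn.MaxPool2d(2),                               # pooling layer -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                               # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),                    # fully connected layer
            nn.ReLU(),
            nn.Dropout(p=0.5),                             # dropout for regularization
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))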
Access to Code
The code that pulls the MNIST data can be accessed in the GitHub repository associated with this blog at: https://github.com/topflight-blog/make-hardware-work-part-1/blob/main/py/download_mnist.py
The code that runs the baseline CNN on the MNIST data can be accessed here:
The code with the optimized batch size can be accessed here:
The code with the optimized batch size and number of workers for reading in the image data can be accessed here: https://github.com/topflight-blog/make-hardware-work-part-1/blob/main/py/optimized_batchsize_nw.py
Benchmarking and Profiling Tools
Optimization of deep learning workflows requires measurement and analysis of both hardware and software performance. Benchmarking and profiling tools are essential in this process, providing quantitative data that show whether an attempted optimization actually helps. This section discusses two tools, perf and hyperfine, detailing their functionalities, installation procedures, and applications in the context of deep learning model training.
Perf
perf is a performance analysis tool available on Linux systems, designed to monitor and measure various hardware and software events. It provides detailed insights into CPU performance, enabling developers to identify inefficiencies and optimize code accordingly.
perf can track metrics such as CPU cycles, instructions executed, cache references and misses, and branch predictions, making it a valuable asset for performance tuning in computationally intensive tasks like deep learning.
Installing perf is straightforward on most Linux distributions. The installation commands vary depending on the specific distribution:
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
Fedora:
sudo dnf install perf
Perf Example
To perform a straightforward performance analysis using perf, you can use the perf stat command:
perf stat python baseline_cnn.py
Hyperfine
hyperfine is a command-line benchmarking tool designed to measure and compare the execution time of commands with high precision. Unlike profiling tools that focus on detailed performance metrics, hyperfine provides a straightforward mechanism to assess the execution time, making it suitable for evaluating the impact of code optimizations on overall performance.
hyperfine can be installed using various package managers or by downloading the binary directly. The installation methods are as follows:
Using Cargo (Rust’s Package Manager):
cargo install hyperfine
Linux (Debian/Ubuntu via Snap):
sudo snap install hyperfine
Hyperfine Example
To compare the performance of an optimized training script against the baseline script, averaged over 20 separate executions with 3 warmup runs to account for the effect of a warm cache, you can use a command like the following (shown here with the baseline and batch-size-optimized scripts from this article’s repository):
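hyperfine --runs 20 --warmup 3 "python baseline_cnn.py" "python optimized_batchsize.py"

hyperfine then reports the mean and standard deviation of the wall-clock time for each command, the min/max range across runs, and a summary of the relative speedup.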
Code Optimization Techniques
Baseline Run
To evaluate the efficiency of our baseline Convolutional Neural Network (CNN) training process, we utilized the perf tool to gather essential performance metrics. This analysis focuses on four key indicators: execution time, clock cycles, instructions executed, and cache performance.
We will use the following perf command:
perf stat -e cycles,instructions,cache-misses,cache-references python baseline_cnn.py
Perf Stat Output
Performance counter stats for 'python baseline_cnn.py':
4,809,411,481,842 cycles
1,001,004,303,356 instructions # 0.21 insn per cycle
2,939,529,839 cache-misses # 25.401 % of all cache refs
11,572,494,609 cache-references
19.382106840 seconds time elapsed
1583.134838000 seconds user
55.326302000 seconds sys
Execution Time
The execution time recorded was approximately 19.38 seconds, representing the total duration required to complete the CNN training process. This metric provides a direct measure of the training efficiency, reflecting how quickly the model can be trained on the given hardware configuration.
Clock Cycles and Instructions Executed
The run consumed roughly 4.81 trillion CPU cycles while retiring about 1.00 trillion instructions, an average of only 0.21 instructions per cycle. Such a low IPC suggests the CPU spends a large share of its cycles stalled, for example waiting on memory, rather than doing useful arithmetic, leaving clear room for optimization.
Cache Performance
Of approximately 11.57 billion cache references, about 2.94 billion were misses, a miss rate of roughly 25.4%. A miss rate this high indicates that a substantial fraction of memory accesses fall outside the CPU caches, which is consistent with the low instructions-per-cycle figure above.
First Optimization, Increasing Batch Size
By increasing the batch size, we aim to reduce the total number of training iterations for a fixed dataset size, thereby decreasing overhead and improving overall CPU performance.
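To see why, consider MNIST’s 60,000 training images: at a batch size of 64, each epoch requires roughly 938 iterations, while at 512 it requires only about 118, so the fixed per-iteration overhead (Python loop, data handling, optimizer bookkeeping) is paid roughly one-eighth as often.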
To evaluate each configuration, we used the following perf command:
perf stat -r 20 -e cycles,instructions,cache-misses,cache-references python optimized_batchsize.py
Batch sizes of 128, 256, and 512 were tested, and perf was used to collect performance metrics for each execution:
Increasing the batch size substantially reduces execution time. At batch size 512, we achieve the fastest training at around 10.60 seconds, a considerable improvement over the baseline (19.38 seconds). However, the cache miss rate does increase with larger batches, highlighting a trade-off between higher throughput and memory access patterns.
Despite the elevated miss rate, the net effect is a marked reduction in training time, indicating that larger batch sizes effectively optimize CPU-based training.
Hyperfine was also used to benchmark the baseline CNN, which uses a batch size of 64, against the batch-size-512 version:
hyperfine --runs 10 --warmup 3 "python baseline_cnn.py" "python optimized_batchsize.py"
Hyperfine Output
Benchmark 1: python baseline_cnn.py
Time (mean ± σ): 19.232 s ± 0.222 s [User: 1582.967 s, System: 60.176 s]
Range (min … max): 18.956 s … 19.552 s 10 runs
Benchmark 2: python optimized_batchsize.py
Time (mean ± σ): 10.468 s ± 0.193 s [User: 440.104 s, System: 63.261 s]
Range (min … max): 10.187 s … 10.688 s 10 runs
Summary
'python optimized_batchsize.py' ran
1.84 ± 0.04 times faster than 'python baseline_cnn.py'
The hyperfine benchmark for increasing batch size confirms that a batch size of 512 is 1.84 times faster than the baseline of 64, on average across 10 runs. The variability in the elapsed time across runs is marginal.
Second Optimization, Increasing the Number of DataLoader Workers
While increasing the batch size reduced the total number of iterations and provided a significant performance boost, data loading can still become a bottleneck if it is done in a single thread. By increasing the num_workers parameter in the PyTorch DataLoader, we enable multi-process data loading, allowing the CPU to prepare the next batch of data in parallel while the current batch is being processed.
Here is an excerpt of the Python code which shows how to initialize num_workers in DataLoader:
train_loader = torch.utils.data.DataLoader(train_dataset,
batch_size=512,
shuffle=True,
num_workers=4)
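The repository script fixes num_workers in the call above. As a hypothetical convenience for sweeping several values, one could expose it as a command-line argument along these lines (a sketch, not the article’s code):

import argparse
import torch
from torchvision import datasets, transforms

# Hypothetical: expose num_workers as a flag so different values can be benchmarked
parser = argparse.ArgumentParser()
parser.add_argument("--num-workers", type=int, default=4)
args = parser.parse_args()

train_dataset = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=512,
                                           shuffle=True,
                                           num_workers=args.num_workers)

Each setting can then be timed with the same perf or hyperfine invocations used elsewhere in this article.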
To investigate the impact of different num_workers settings, we used the same perf command as before:
perf stat -r 20 -e cycles,instructions,cache-misses,cache-references python optimized_batchsize_nw.py
Below is a summary of how num_workers = 2, 4, and 8 affected training performance when paired with a batch size of 512:
We also validated these improvements using Hyperfine, specifically comparing the baseline CNN (batch size = 64, single worker) to the optimized code (batch_size = 512, num_workers=4). The command was:
hyperfine --runs 10 --warmup 3 "python baseline_cnn.py" "python optimized_batchsize_nw.py"
Hyperfine Output
Benchmark 1: python baseline_cnn.py
Time (mean ± σ): 19.226 s ± 0.110 s [User: 1582.359 s, System: 60.366 s]
Range (min … max): 19.047 s … 19.397 s 10 runs
Benchmark 2: python optimized_batchsize_nw.py
Time (mean ± σ): 7.161 s ± 0.112 s [User: 418.890 s, System: 76.137 s]
Range (min … max): 7.036 s … 7.382 s 10 runs
Summary
'python optimized_batchsize_nw.py' ran
2.68 ± 0.04 times faster than 'python baseline_cnn.py'
By combining a larger batch size (512) with four worker processes for data loading, our training script runs 2.68 times faster than the baseline. These results underscore the importance of both reducing the number of training iterations (larger batches) and parallelizing data loading (more workers) to fully utilize CPU resources.
Conclusion
Optimizing deep learning workflows for CPU performance requires a combination of hardware-aware adjustments and code-level refinements.
This article demonstrated the impact of two key optimizations on training performance: increasing batch size and increasing the number of workers for image data loads.
Key Takeaways
Increasing the batch size from 64 to 512 cut training time from roughly 19.4 seconds to about 10.5 seconds, a 1.84x speedup, at the cost of a higher cache miss rate.
Adding multi-process data loading (num_workers = 4) on top of the larger batch size brought training time down to about 7.2 seconds, 2.68 times faster than the baseline.
Tools such as perf and hyperfine make it possible to measure these effects directly rather than guessing, and should be part of any optimization workflow.
Next Steps
While this article focused on CPU-specific optimizations, modern deep learning workflows often leverage GPUs for computationally intensive tasks. In the next article, we will explore optimizations for GPU-based training, including strategies for utilizing tensor cores, optimizing memory transfers, and leveraging mixed precision training to accelerate deep learning on high-performance hardware.
By systematically applying and validating optimizations like those described in this article, you can maximize the performance of your deep learning pipelines on custom-built systems, ensuring efficient utilization of both hardware and software resources.
About Top Flight Computers
Top Flight Computers is based in Cary, North Carolina and designs custom-built computers, focusing on bespoke desktop workstations, rack workstations, and gaming PCs.
We offer free delivery within 20 miles of our shop, can deliver within 3 hours of our shop, and ship nationwide.