Make Hardware Work For You: Part 1 – Optimizing Code For Deep Learning Model Training on CPU
Our tested system is similar to the Accelerated Design build; see the Top Flight Computers website for full specs.


BIG shout out to Kevin Scott for assisting with this article! More to come!!!

Make Hardware Work For You – Introduction

The increasing complexity of deep learning models demands not just powerful hardware but also optimized code to make software and hardware work for you. Top Flight Computers' custom builds are specifically optimized for certain workflows, such as deep learning and high-performance computing.

This specialization is crucial because it ensures that every component—from the CPU and GPU to the memory and storage—is selected and configured to maximize efficiency and performance for specific tasks. By aligning hardware capabilities with software requirements, users can achieve significant improvements in processing speed, resource utilization, and overall productivity. By customizing code to match your hardware capabilities, you can significantly enhance deep learning model training performance.

In this article, we’ll explore how to optimize deep learning code for the CPU to take full advantage of a high-end custom-built system, showcasing profiling and benchmarking improvements using perf and hyperfine. In the next article, we will discuss GPU-based optimizations.


Overview of the High Performance Hardware

  • CPU: AMD Ryzen 9 9950X
  • CPU Cooling: Phanteks Glacier One 360D30
  • Motherboard: MSI X870
  • Memory: Kingston Fury Renegade 96GB DDR5-6000
  • Storage: 2x Kingston Fury Renegade 2TB PCIe 4.0 NVMe SSD
  • GPU: Nvidia RTX 5000 Ada 32 GB
  • Case: Be Quiet Dark Base Pro 901 Black
  • Power Supply: Be Quiet Straight Power 12 1500 W Platinum
  • Case Fans: 6x Phanteks F120T30

Relevance to Deep Learning

  • CPU Multithreading: Essential for data preprocessing and augmentation.
  • GPU Capabilities: The RTX 5000 Ada’s tensor cores and large VRAM accelerate model training.
  • High-Speed Storage: PCIe 4.0 NVMe SSDs reduce data loading times, minimizing I/O bottlenecks.
  • DDR5 Memory: Faster memory speeds enhance data throughput between CPU and RAM.


The Importance of Code Optimization in Deep Learning

In the rapidly evolving landscape of deep learning, where large datasets and complex algorithms converge with powerful hardware, code optimization plays a critical role. While advances in hardware—like GPUs and TPUs—have transformed what’s possible, poorly optimized code can severely limit performance.

Failing to fully utilize the capabilities of modern systems leads to slower training, increased costs, and less efficient workflows. Code optimization, therefore, is essential for maximizing resources and time. To truly unlock the potential of cutting-edge hardware, software must be carefully tailored to take advantage of its strengths.

Without this, training processes can become unnecessarily slow and resource-intensive, reducing the efficiency of deep learning workflows.

The Benefits of Customizing Code

  1. Improved Training Times: Faster code execution enables quicker iterations, allowing models to converge more rapidly. This acceleration facilitates greater experimentation and faster delivery of results, critical in competitive or time-sensitive contexts.
  2. Better Resource Utilization: Optimization ensures that available hardware is used to its fullest potential. By aligning software operations with hardware capabilities, organizations can achieve maximum efficiency, whether on-premises or in cloud environments.
  3. Cost Efficiency: Faster training and optimized resource use lead to significant reductions in computational costs. For organizations operating at scale, these savings can translate into measurable financial benefits over time.


Optimizing Deep Learning Model Training

The Baseline

PyTorch is one of the most popular frameworks to get started with. We’ll walk through setting up a simple convolutional neural network (CNN) using PyTorch’s default configuration. This setup will then be optimized and expanded. We will use the MNIST dataset for this exercise.

The MNIST (Modified National Institute of Standards and Technology) dataset is a widely recognized benchmark in the deep learning community, especially for image classification tasks. It serves as a common starting point for deep learning due to its simplicity and well-defined structure. Here are some details regarding the dataset; a short loading sketch follows the specifications.

Image Classes: 10 (handwritten digits 0 through 9)

Number of Samples:

  • Training Set: 60,000 images
  • Test Set: 10,000 images

Image Specifications:

  • Dimensions: 28×28 pixels
  • Color: Grayscale
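
For reference, below is a minimal sketch of how the MNIST data can be pulled with torchvision. The data path and normalization constants are illustrative assumptions and may differ from the repository’s download_mnist.py.

# Minimal sketch: download MNIST with torchvision (paths and transforms are
# illustrative and may differ from the repository's download_mnist.py).
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                      # 28x28 grayscale -> [1, 28, 28] float tensor
    transforms.Normalize((0.1307,), (0.3081,))  # commonly used MNIST mean/std
])

train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

print(len(train_dataset), len(test_dataset))    # 60000 10000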

What is a CNN?

A Convolutional Neural Network (CNN) is a specialized deep learning architecture designed to process data with a grid-like topology, such as images. CNNs are particularly effective for image classification tasks due to their ability to automatically and adaptively learn spatial hierarchies of features from input images.

Key Components of Our CNN Model

  1. Convolutional Layers:

  • Purpose: Extract local features from input images by applying learnable filters.
  • Operation: Detect patterns like edges, textures, and shapes.

  2. Batch Normalization:

  • Purpose: Normalize the output of convolutional layers to stabilize and accelerate training.
  • Benefit: Reduces internal covariate shift, allowing for higher learning rates.

  3. Activation Functions:

  • Purpose: Introduce non-linearity into the model, enabling it to learn complex patterns.

  4. Pooling Layers:

  • Purpose: Downsample to reduce spatial dimensions and computational load.
  • Operation: Extract the most prominent features within a region.

  5. Fully Connected Layers:

  • Purpose: Perform classification based on the extracted features.
  • Operation: Map learned features to output classes.

  6. Dropout (nn.Dropout):

  • Purpose: Prevent overfitting.
  • Benefit: Encourages the network to learn redundant representations.
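
To make these components concrete, here is a minimal sketch of a CNN that combines all six. The channel counts, kernel sizes, and dropout rate are assumptions for illustration and are not necessarily those used in baseline_cnn.py.

# Illustrative CNN combining the six components above. Channel counts,
# kernel sizes, and the dropout rate are assumptions, not necessarily
# those used in baseline_cnn.py.
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1. convolutional layer
            nn.BatchNorm2d(32),                           # 2. batch normalization
            nn.ReLU(),                                    # 3. activation function
            nn.MaxPool2d(2),                              # 4. pooling: 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                              # 6. dropout to reduce overfitting
            nn.Linear(64 * 7 * 7, num_classes),           # 5. fully connected classifier
        )

    def forward(self, x):
        return self.classifier(self.features(x))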

Access to Code

The code that pulls the MNIST data can be accessed in the GitHub repository associated with this blog at: https://github.com/topflight-blog/make-hardware-work-part-1/blob/main/py/download_mnist.py

The code that runs the baseline CNN on the MNIST data can be accessed here:

https://github.com/topflight-blog/make-hardware-work-part-1/blob/main/py/baseline_cnn.py

The code with the optimized batch size can be accessed here:

https://github.com/topflight-blog/make-hardware-work-part-1/blob/main/py/optimized_batchsize.py

The code with the optimized batch size and number of workers for reading in the image data can be accessed here: https://github.com/topflight-blog/make-hardware-work-part-1/blob/main/py/optimized_batchsize_nw.py


Benchmarking and Profiling Tools

Optimization of deep learning workflows requires measurement and analysis of both hardware and software performance. Benchmarking and profiling tools are essential in this process, providing quantitative data that show whether an attempted optimization actually helped. This section discusses two tools, perf and hyperfine, detailing their functionalities, installation procedures, and applications in the context of deep learning model training.

Perf

perf is a performance analysis tool available on Linux systems, designed to monitor and measure various hardware and software events. It provides detailed insights into CPU performance, enabling developers to identify inefficiencies and optimize code accordingly.

perf can track metrics such as CPU cycles, instructions executed, cache references and misses, and branch predictions, making it a valuable asset for performance tuning in computationally intensive tasks like deep learning.
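
If you want to see exactly which hardware and software events are available on your own CPU and kernel, perf can enumerate them (the list varies by system):

perf list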

Installing perf is straightforward on most Linux distributions. The installation commands vary depending on the specific distribution:

Ubuntu/Debian:

sudo apt-get update

sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)

Fedora:

sudo dnf install perf

Perf Example

To perform a straightforward performance analysis using perf, you can use the perf stat command:

perf stat python baseline_cnn.py

Hyperfine

hyperfine is a command-line benchmarking tool designed to measure and compare the execution time of commands with high precision. Unlike profiling tools that focus on detailed performance metrics, hyperfine provides a straightforward mechanism to assess the execution time, making it suitable for evaluating the impact of code optimizations on overall performance.

hyperfine can be installed using various package managers or by downloading the binary directly. The installation methods are as follows:

Using Cargo (Rust’s Package Manager):

cargo install hyperfine

Linux (Debian/Ubuntu via Snap):

sudo snap install hyperfine

Hyperfine Example

To compare the performance of an optimized training script against the baseline script, averaged over 20 separate executions with 3 warmup runs to account for the effect of a warm cache, you can use:

hyperfine --runs 20 --warmup 3 "python baseline_cnn.py" "python optimized_cnn.py"


Code Optimization Techniques

Baseline Run

To evaluate the efficiency of our baseline Convolutional Neural Network (CNN) training process, we utilized the perf tool to gather essential performance metrics. This analysis focuses on four key indicators: execution time, clock cycles, instructions executed, and cache performance.

  • Execution Time refers to the total duration required to complete the training process, providing a direct measure of how long the task takes from start to finish.

  • Clock Cycles indicate the number of cycles the CPU undergoes while executing the training workload, reflecting the processor’s operational workload and efficiency.

  • Instructions Executed represent the total number of individual operations the CPU performs during the training, offering insight into the complexity and optimization level of the code.

  • Cache Performance encompasses metrics related to the CPU cache’s effectiveness, including cache references (the number of times data is accessed in the cache) and cache misses (instances where the required data is not found in the cache, necessitating retrieval from slower memory).

We will use the following perf command:

perf stat -e cycles,instructions,cache-misses,cache-references python baseline_cnn.py        

Perf Stat Output

 Performance counter stats for 'python baseline_cnn.py':

     4,809,411,481,842      cycles
     1,001,004,303,356      instructions              #    0.21  insn per cycle
         2,939,529,839      cache-misses              #   25.401 % of all cache refs
        11,572,494,609      cache-references

          19.382106840 seconds time elapsed

        1583.134838000 seconds user
          55.326302000 seconds sys
        

Execution Time

The execution time recorded was approximately 19.38 seconds, representing the total duration required to complete the CNN training process. This metric provides a direct measure of the training efficiency, reflecting how quickly the model can be trained on the given hardware configuration.

Clock Cycles and Instructions Executed

  • Clock Cycles (cycles): The baseline run utilized 4.81 trillion clock cycles. Clock cycles are indicative of the CPU’s operational workload, representing the number of cycles the processor spent executing instructions during the training process.
  • Instructions Executed (instructions): A total of 1.00 trillion instructions were executed. The ratio of instructions to cycles (0.21 insn per cycle) suggests that, on average, fewer than one instruction was executed per cycle. This low ratio may imply that the CPU is underutilized or that there are inefficiencies in the code preventing optimal instruction throughput.

Cache Performance

  • Cache References (cache-references): The process made 11.57 billion cache references, which encompass both cache hits and misses. This metric reflects how frequently the CPU accessed the cache during the execution of the training script.
  • Cache Misses (cache-misses): There were 2.94 billion cache misses, accounting for 25.401% of all cache references. A cache miss occurs when the CPU cannot find the requested data in the cache, necessitating retrieval from slower memory tiers.

First Optimization, Increasing Batch Size

By increasing the batch size, we aim to reduce the total number of training iterations for a fixed dataset size, thereby decreasing overhead and improving overall CPU performance.
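
In the training script this change amounts to a single DataLoader argument. The snippet below is an illustrative sketch with assumed variable names, not an excerpt from optimized_batchsize.py:

# Sketch: raise the DataLoader batch size from the baseline's 64 to 512.
# Variable names are assumptions, not copied from optimized_batchsize.py.
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=512,   # baseline used 64
                                           shuffle=True)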

To evaluate each configuration, we used the following perf command:

perf stat -r 20 -e cycles,instructions,cache-misses,cache-references python optimized_batchsize.py        

  • -r 20: Runs the program 20 times to collect more robust averages and reduce random variance.

  • -e cycles,instructions,cache-misses,cache-references: Collects data on CPU cycles, instructions executed, cache misses, and cache references—key indicators of CPU utilization and efficiency.

Batch sizes of 128, 256, and 512 were tested, and perf was used to collect performance metrics for each execution; the results are summarized below.


Increasing the batch size substantially reduces execution time. At batch size 512, we achieve the fastest training at around 10.60 seconds, a considerable improvement over the baseline (19.38 seconds). However, the cache miss rate does increase with larger batches, highlighting a trade-off between higher throughput and memory access patterns.

Despite the elevated miss rate, the net effect is a marked reduction in training time, indicating that larger batch sizes effectively optimize CPU-based training.

Hyperfine was also used to benchmark the baseline CNN (batch size 64) against the batch size 512 version:

hyperfine --runs 10 --warmup 3 "python baseline_cnn.py" "python optimized_batchsize.py"
        

Hyperfine Output

Benchmark 1: python baseline_cnn.py
  Time (mean ± σ):     19.232 s ±  0.222 s    [User: 1582.967 s, System: 60.176 s]
  Range (min … max):   18.956 s … 19.552 s    10 runs

Benchmark 2: python optimized_batchsize.py
  Time (mean ± σ):     10.468 s ±  0.193 s    [User: 440.104 s, System: 63.261 s]
  Range (min … max):   10.187 s … 10.688 s    10 runs

Summary
  'python optimized_batchsize.py' ran
    1.84 ± 0.04 times faster than 'python baseline_cnn.py'        

The hyperfine benchmark for increasing batch size confirms that a batch size of 512 is 1.84 times faster than the baseline of 64, on average across 10 runs. The variability in the elapsed time across runs is marginal.

Second Optimization, Increasing the Number of DataLoader Workers

While increasing the batch size reduced the total number of iterations and provided a significant performance boost, data loading can still become a bottleneck if it is done in a single process. By increasing the num_workers parameter in the PyTorch DataLoader, we enable multi-process data loading, allowing the CPU to prepare the next batch of data in parallel while the current batch is being processed.

Here is an excerpt of the Python code which shows how to initialize num_workers in DataLoader:

train_loader = torch.utils.data.DataLoader(train_dataset, 
                                           batch_size=512,
                                           shuffle=True,
                                           num_workers=4)        

To investigate the impact of different num_workers settings, we used the same perf command as in the previous optimization:

perf stat -r 20 -e cycles,instructions,cache-misses,cache-references python optimized_batchsize_nw.py        

Below is a summary of how num_workers = 2, 4, and 8 affected training performance when paired with a batch size of 512:


  • The cache miss rate remains around the mid-30% range, similar to or slightly higher than when using a single worker. This suggests additional memory pressure from parallel access, but it does not negate the net benefit of parallelizing I/O and preprocessing.
  • Among the tested configurations, num_workers=4 yields the fastest execution (7.16 seconds on average), although num_workers=2 and num_workers=8 are also improvements over the baseline. The optimal num_workers often depends on your CPU’s core count and workload characteristics; a simple sweep, sketched below, can help you find it on your own machine.
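
The following is a hypothetical sketch, assuming a train_dataset object like the one shown earlier, that times one pass over the DataLoader for several worker counts:

# Hypothetical sketch: time one pass over the DataLoader for several
# num_workers values to find a good setting for your CPU.
import time
import torch

for nw in (0, 2, 4, 8):
    loader = torch.utils.data.DataLoader(train_dataset,
                                         batch_size=512,
                                         shuffle=True,
                                         num_workers=nw)
    start = time.perf_counter()
    for _ in loader:          # iterate once so the workers actually load data
        pass
    print(f"num_workers={nw}: {time.perf_counter() - start:.2f} s")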

We also validated these improvements using Hyperfine, specifically comparing the baseline CNN (batch size = 64, single worker) to the optimized code (batch_size = 512, num_workers=4). The command was:

hyperfine --runs 10 --warmup 3 "python baseline_cnn.py" "python optimized_batchsize_nw.py"        

Hyperfine Output

Benchmark 1: python baseline_cnn.py
  Time (mean ± σ):     19.226 s ± 0.110 s    [User: 1582.359 s, System: 60.366 s]
  Range (min … max):   19.047 s … 19.397 s   10 runs

Benchmark 2: python optimized_batchsize_nw.py
  Time (mean ± σ):      7.161 s ± 0.112 s    [User: 418.890 s, System: 76.137 s]
  Range (min … max):    7.036 s … 7.382 s    10 runs

Summary
  'python optimized_batchsize_nw.py' ran
    2.68 ± 0.04 times faster than 'python baseline_cnn.py'
        

By combining a larger batch size (512) with four worker processes for data loading, our training script runs 2.68 times faster than the baseline. These results underscore the importance of both reducing the number of training iterations (larger batches) and parallelizing data loading (more workers) to fully utilize CPU resources.


Conclusion

Optimizing deep learning workflows for CPU performance requires a combination of hardware-aware adjustments and code-level refinements.

This article demonstrated the impact of two key optimizations on training performance: increasing the batch size and increasing the number of DataLoader workers used to load the image data.

  • Increasing Batch Size: By increasing the batch size from 64 to 512, we significantly reduced the total number of iterations required to complete training. This change made training 1.84× faster as measured with Hyperfine, a reduction in execution time of nearly 46% relative to the baseline. The trade-off was a slight increase in the cache miss rate, highlighting the balance between computational throughput and memory access efficiency.
  • Parallelizing Data Loading: Optimizing the DataLoader with num_workers=4 enabled multi-process data loading, reducing the I/O bottleneck and improving CPU utilization. Combined with the larger batch size, this adjustment brought the total speedup over the baseline to 2.68×, as validated through both perf and Hyperfine. Notably, the improvement from parallel data loading varied with the number of workers, emphasizing the need to tune this parameter based on CPU core availability and workload characteristics.

Key Takeaways

  1. Batch Size Matters: Increasing the batch size reduces training iterations, improving throughput and training speed. However, larger batch sizes may increase memory access pressure, as evidenced by the higher cache miss rates in our benchmarks.
  2. Parallel Data Loading is Essential: Increasing the number of workers in the DataLoader minimizes the idle time caused by I/O operations, ensuring the CPU remains fully engaged during training. The optimal number of workers will depend on the hardware configuration, particularly the number of CPU cores.
  3. Benchmarking Tools Drive Informed Decisions: Using tools like perf and Hyperfine enabled precise measurement of the impact of our optimizations, providing actionable insights into how each change affected execution time, CPU utilization, and cache performance.

Next Steps

While this article focused on CPU-specific optimizations, modern deep learning workflows often leverage GPUs for computationally intensive tasks. In the next article, we will explore optimizations for GPU-based training, including strategies for utilizing tensor cores, optimizing memory transfers, and leveraging mixed precision training to accelerate deep learning on high-performance hardware.

By systematically applying and validating optimizations like those described in this article, you can maximize the performance of your deep learning pipelines on custom-built systems, ensuring efficient utilization of both hardware and software resources.


About Top Flight Computers

Top Flight Computers is based in Cary, North Carolina and designs custom built computers, focusing on bespoke desktop workstations, rack workstations, and gaming PCs.

We offer free delivery within 20 miles of our shop, can deliver within 3 hours of our shop, and ship nationwide.


