Introduction to High Performance Computing (HPC)
Definition
HPC refers to the use of supercomputers and parallel processing to perform complex calculations at lightning speed. Unlike traditional computers, HPC systems integrate thousands (or even millions) of processing cores, working collaboratively to handle massive datasets and execute intricate simulations.
Use cases
The most relevant and important use case for HPC is simulation, because it enables researchers and industries to replicate complex real-world processes in a virtual environment. This helps scientists reduce risk and manage costs efficiently. Typical examples of HPC-driven simulations include weather forecasting, molecular dynamics, and structural crash analysis.
Examples and Implementations
There are two primary implementation models: shared memory and message passing. Both have their unique strengths and applications, depending on the architecture and scale of the computing system.
1. Shared memory
Shared memory is a way for multiple programs (or processes) to communicate and work together by using the same memory space.
This is one of the most efficient methods on single-node machines and carries little overhead. However, shared memory is limited by the physical memory of the machine and does not scale easily to systems with multiple nodes. Developers must also implement proper synchronization to avoid race conditions and ensure data consistency, which adds complexity to code that works with shared variables.
The most well-known implementation of the shared memory model is OpenMP.
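To illustrate the synchronization issue mentioned above, here is a minimal sketch (not part of the original example): several threads increment a shared counter, and the #pragma omp atomic directive is what keeps the concurrent updates from racing.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    long counter = 0;  /* shared variable, visible to every thread */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        /* Without "atomic" (or "critical"/"reduction"), this read-modify-write
           is a data race and the final value is usually below 1000000. */
        #pragma omp atomic
        counter++;
    }

    printf("counter = %ld\n", counter);
    return 0;
}
In practice a reduction clause (reduction(+:counter)) would be the idiomatic choice here; the atomic directive simply makes the race visible and fixable in one line.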
Example: Implementation of parallel summation on an array
#include <stdio.h>
#include <omp.h>

int recursive_sum(int nums[], int begin, int end);

int main(int argc, char const *argv[])
{
    int A[] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int ans;

    /* Start a parallel region; a single thread creates the root task,
       and the tasks it spawns are executed by the whole thread team. */
    #pragma omp parallel
    {
        #pragma omp single
        {
            ans = recursive_sum(A, 1, 8);
        }
    }
    printf("Sum of the numbers in vector is: %d\n", ans);
    return 0;
}

/* Sums nums[begin-1 .. end-1] (1-based bounds) by splitting the range
   into two halves and processing each half as an OpenMP task. */
int recursive_sum(int nums[], int begin, int end)
{
    if (end - begin <= 2)
    {
        /* Base case: the range is small enough to sum sequentially. */
        int sum = 0;
        for (int i = begin; i <= end; i++)
        {
            int threadNum = omp_get_thread_num();
            int threads = omp_get_num_threads();
            sum += nums[i - 1];
            printf("Threads: %d, Thread num: %d, Calculated sum:%d\n",
                   threads, threadNum, sum);
        }
        return sum;
    }

    int middle = (end - begin) / 2 + begin;
    int left = 0, right = 0;

    /* Each half becomes a task that may run on a different thread. */
    #pragma omp task shared(left)
    left = recursive_sum(nums, begin, middle);

    #pragma omp task shared(right)
    right = recursive_sum(nums, middle + 1, end);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return left + right;
}
output:
Threads: 12, Thread num: 10, Calculated sum:7
Threads: 12, Thread num: 9, Calculated sum:3
Threads: 12, Thread num: 9, Calculated sum:7
Threads: 12, Thread num: 0, Calculated sum:5
Threads: 12, Thread num: 0, Calculated sum:11
Threads: 12, Thread num: 8, Calculated sum:1
Threads: 12, Thread num: 8, Calculated sum:3
Threads: 12, Thread num: 10, Calculated sum:15
Sum of the numbers in vector is: 36
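The exact interleaving of the per-task lines changes from run to run, since the tasks execute concurrently. To reproduce the example, the code must be compiled with OpenMP enabled, for instance gcc -fopenmp sum.c -o sum (the file name is only illustrative).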
2. Message passing
The message passing model is used in environments where processes are distributed across multiple machines or nodes. Unlike shared memory, message passing requires processes to explicitly send and receive data via messages, regardless of whether those processes run on the same or on different physical machines.
Message passing is ideal for distributed systems and clusters where processes are running on different machines. It is more scalable than shared memory and works well for heterogeneous systems, where nodes may have different memory and processing capabilities.
However, since data must be physically transferred between processes, message passing typically involves more overhead than shared memory, especially over high-latency networks.
One of the most widely used tools for message passing is MPI (the Message Passing Interface).
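Before the pi example, here is a minimal sketch (not part of the original article) of the explicit send/receive style described above: rank 1 sends a single integer to rank 0 using the point-to-point calls MPI_Send and MPI_Recv. It needs at least two processes to run.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        value = 42;  /* data to transfer; chosen arbitrarily for this sketch */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 0 received %d from rank 1\n", value);
    }

    MPI_Finalize();
    return 0;
}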
Example: Calculating pi using the area of a circle with a radius of 1
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 100, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Rank 0 broadcasts the number of intervals to every process. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Midpoint rule: each process integrates 4*sqrt(1 - x^2) over its
       own subset of the n intervals (every numprocs-th interval). */
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * (((double) i) - 0.5);
        sum += 4.0 * sqrt(1.0 - x * x);
    }
    mypi = h * sum;
    printf("Calculated piece of pi: %f, on process: %d\n", mypi, myid);

    /* Sum the partial results onto rank 0. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        printf("Calculated pi: %.16f, Error: %.16f\n", pi, fabs(pi - PI25DT));
    }

    MPI_Finalize();
    return 0;
}
output:
Calculated piece of pi: 0.411435, on process: 0
Calculated piece of pi: 0.401444, on process: 2
Calculated piece of pi: 0.387584, on process: 4
Calculated piece of pi: 0.379954, on process: 6
Calculated piece of pi: 0.406696, on process: 1
Calculated piece of pi: 0.395133, on process: 3
Calculated piece of pi: 0.383863, on process: 5
Calculated piece of pi: 0.375827, on process: 7
Calculated pi: 3.1419368579000082, Error: 0.0003442043102151
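These partial results come from a run with 8 processes. A typical way to build and launch such a program is mpicc pi.c -o pi followed by mpirun -np 8 ./pi (the file name is only illustrative); the exact compiler wrapper and launcher depend on the MPI installation.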