Introduction to High Performance Computing (HPC)
Definition
HPC refers to the use of supercomputers and parallel processing to perform complex calculations at lightning speed. Unlike traditional computers, HPC systems integrate thousands (or even millions) of processing cores, working collaboratively to handle massive datasets and execute intricate simulations.
Use cases
The most relevant and important use case for HPC is simulation, because it enables researchers and industries to replicate complex real-world processes in a virtual environment. This helps scientists reduce risk and manage costs efficiently. Typical examples of HPC-driven simulations include weather forecasting, molecular dynamics, and structural crash analysis.
Examples and Implementations
There are two primary implementation models: shared memory and message passing. Both have their unique strengths and applications, depending on the architecture and scale of the computing system.
1. Shared memory
Shared memory is a way for multiple programs (or processes) to communicate and work together by using the same memory space.
This is one of the most efficient methods on single-node machines and carries little overhead. However, shared memory is limited by the physical memory of the machine and does not scale easily to systems with multiple nodes. Developers must also implement proper synchronization to avoid race conditions and ensure data consistency, which adds complexity to code that works with shared variables.
The most well-known implementation of the shared memory model is OpenMP.
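To illustrate the synchronization issue mentioned above, here is a minimal sketch (not part of the original example): several threads increment a shared counter, and the #pragma omp atomic directive is what keeps the concurrent updates from racing.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    long counter = 0;  /* shared variable, visible to every thread */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        /* Without "atomic" (or "critical"/"reduction"), this read-modify-write
           is a data race and the final value is usually below 1000000. */
        #pragma omp atomic
        counter++;
    }

    printf("counter = %ld\n", counter);
    return 0;
}
In practice a reduction clause (reduction(+:counter)) would be the idiomatic choice here; the atomic directive simply makes the race visible and fixable in one line.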
Example: Implementation of parallel summation on an array
#include <stdio.h>
#include <omp.h>

int recursive_sum(int nums[], int begin, int end);

int main(int argc, char const *argv[])
{
    int A[] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int ans;

    /* Start a parallel region; a single thread creates the root task,
       and the tasks it spawns are executed by the whole thread team. */
    #pragma omp parallel
    {
        #pragma omp single
        {
            ans = recursive_sum(A, 1, 8);
        }
    }
    printf("Sum of the numbers in vector is: %d\n", ans);
    return 0;
}

/* Sums nums[begin-1 .. end-1] (1-based bounds) by splitting the range
   into two halves and processing each half as an OpenMP task. */
int recursive_sum(int nums[], int begin, int end)
{
    if (end - begin <= 2)
    {
        /* Base case: the range is small enough to sum sequentially. */
        int sum = 0;
        for (int i = begin; i <= end; i++)
        {
            int threadNum = omp_get_thread_num();
            int threads = omp_get_num_threads();
            sum += nums[i - 1];
            printf("Threads: %d, Thread num: %d, Calculated sum:%d\n",
                   threads, threadNum, sum);
        }
        return sum;
    }

    int middle = (end - begin) / 2 + begin;
    int left = 0, right = 0;

    /* Each half becomes a task that may run on a different thread. */
    #pragma omp task shared(left)
    left = recursive_sum(nums, begin, middle);

    #pragma omp task shared(right)
    right = recursive_sum(nums, middle + 1, end);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return left + right;
}
output:
Threads: 12, Thread num: 10, Calculated sum:7
Threads: 12, Thread num: 9, Calculated sum:3
Threads: 12, Thread num: 9, Calculated sum:7
Threads: 12, Thread num: 0, Calculated sum:5
Threads: 12, Thread num: 0, Calculated sum:11
Threads: 12, Thread num: 8, Calculated sum:1
Threads: 12, Thread num: 8, Calculated sum:3
Threads: 12, Thread num: 10, Calculated sum:15
Sum of the numbers in vector is: 36
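The exact interleaving of the per-task lines changes from run to run, since the tasks execute concurrently. To reproduce the example, the code must be compiled with OpenMP enabled, for instance gcc -fopenmp sum.c -o sum (the file name is only illustrative).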
2. Message passing
The message passing model is used in environments where processes are distributed across multiple machines or nodes. Unlike shared memory, message passing requires processes to explicitly send and receive data via messages, regardless of whether those processes run on the same or on different physical machines.
Message passing is ideal for distributed systems and clusters where processes are running on different machines. It is more scalable than shared memory and works well for heterogeneous systems, where nodes may have different memory and processing capabilities.
However, since data must be physically transferred between processes, message passing typically involves more overhead than shared memory, especially over high-latency networks.
One of the most widely used tools for message passing is MPI (the Message Passing Interface).
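Before the pi example, here is a minimal sketch (not part of the original article) of the explicit send/receive style described above: rank 1 sends a single integer to rank 0 using the point-to-point calls MPI_Send and MPI_Recv. It needs at least two processes to run.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        value = 42;  /* data to transfer; chosen arbitrarily for this sketch */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 0 received %d from rank 1\n", value);
    }

    MPI_Finalize();
    return 0;
}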
Example: Calculating pi using the area of a circle with a radius of 1
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 100, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Rank 0 broadcasts the number of intervals to every process. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Midpoint rule: each process integrates 4*sqrt(1 - x^2) over its
       own subset of the n intervals (every numprocs-th interval). */
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * (((double) i) - 0.5);
        sum += 4.0 * sqrt(1.0 - x * x);
    }
    mypi = h * sum;
    printf("Calculated piece of pi: %f, on process: %d\n", mypi, myid);

    /* Sum the partial results onto rank 0. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        printf("Calculated pi: %.16f, Error: %.16f\n", pi, fabs(pi - PI25DT));
    }

    MPI_Finalize();
    return 0;
}
output:
Calculated piece of pi: 0.411435, on process: 0
Calculated piece of pi: 0.401444, on process: 2
Calculated piece of pi: 0.387584, on process: 4
Calculated piece of pi: 0.379954, on process: 6
Calculated piece of pi: 0.406696, on process: 1
Calculated piece of pi: 0.395133, on process: 3
Calculated piece of pi: 0.383863, on process: 5
Calculated piece of pi: 0.375827, on process: 7
Calculated pi: 3.1419368579000082, Error: 0.0003442043102151
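These partial results come from a run with 8 processes. A typical way to build and launch such a program is mpicc pi.c -o pi followed by mpirun -np 8 ./pi (the file name is only illustrative); the exact compiler wrapper and launcher depend on the MPI installation.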