As systems evolve to incorporate multiple CPUs, scheduling becomes more complex due to the potential for load sharing and parallel execution of threads. Unlike single-core systems, there is no one-size-fits-all solution for scheduling in multiprocessor environments.
Definition of Multiprocessor
Traditionally, "multiprocessor" referred to systems with multiple physical processors, each housing a single-core CPU. This definition has evolved to include various architectures in modern computing systems, such as:
- Multicore CPUs: CPUs with multiple cores, each capable of executing threads independently.
- Multithreaded Cores: Cores designed to manage multiple threads concurrently, allowing for better resource utilization.
- NUMA (Non-Uniform Memory Access) Systems: Architectures where memory access time varies based on the memory location relative to the processor, requiring specialized scheduling strategies.
- Heterogeneous Multiprocessing: Systems with processors of varying capabilities, necessitating tailored scheduling approaches based on processor type.
Scheduling Concerns
Several considerations arise when discussing multiprocessor scheduling, particularly regarding:
- Homogeneous Systems: In scenarios where processors are identical in functionality, any available CPU can execute any process in the queue. This flexibility simplifies scheduling decisions.
- Heterogeneous Systems: When processors differ in capabilities, scheduling must account for which processes are best suited for each type of processor, complicating the strategy.
Approaches to Multiprocessor Scheduling
There are two primary approaches to CPU scheduling in multiprocessor systems: asymmetric multiprocessing (AMP) and symmetric multiprocessing (SMP).
- Asymmetric Multiprocessing (AMP):
  - Description: A single processor, known as the master server, handles all scheduling decisions, I/O processing, and other system activities; the remaining processors execute only user code.
  - Advantages: The design is simple because only one core accesses the system data structures, reducing the need for data sharing and synchronization.
  - Disadvantages: The master server can become a bottleneck and limit overall system performance, since scheduling work is concentrated on a single processor.
- Symmetric Multiprocessing (SMP):
  - Description: Each processor is self-scheduling; the scheduler for each processor examines the ready queue and independently selects a thread to run.
  - Strategies for organizing threads:
    - Common Ready Queue: All threads are placed in a single shared ready queue. This can lead to race conditions in which multiple processors attempt to schedule the same thread, or threads are lost from the queue. Locking can prevent these problems, but the lock itself becomes a point of contention and a potential performance bottleneck.
    - Private Run Queues: Each processor maintains its own private queue of threads, which avoids the complications of a shared queue and promotes efficient cache use through locality, but it can lead to imbalanced workloads. Balancing algorithms can be applied to equalize workloads across processors. The two organizations are contrasted in the sketch below.
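To make the contrast concrete, here is a minimal C sketch of the two queue organizations — a single shared ready queue protected by a lock versus per-processor private queues. The `thread_t` and queue types are hypothetical, not taken from any real kernel; link with -pthread.

```c
#include <pthread.h>
#include <stddef.h>

#define NCPUS 4

typedef struct thread { struct thread *next; } thread_t;

/* Option 1: one shared ready queue protected by a single lock.
 * Every CPU contends for this lock on each scheduling decision. */
typedef struct {
    pthread_mutex_t lock;
    thread_t *head;
} shared_queue_t;

static shared_queue_t shared_rq = { PTHREAD_MUTEX_INITIALIZER, NULL };

thread_t *pick_from_shared(void) {
    pthread_mutex_lock(&shared_rq.lock);      /* potential bottleneck */
    thread_t *t = shared_rq.head;
    if (t) shared_rq.head = t->next;
    pthread_mutex_unlock(&shared_rq.lock);
    return t;
}

/* Option 2: one private queue per CPU.  No cross-CPU locking on the fast
 * path (this sketch assumes only the owning CPU touches its queue), but
 * the queues can become unbalanced without migration. */
static thread_t *private_rq[NCPUS];

thread_t *pick_from_private(int cpu) {
    thread_t *t = private_rq[cpu];
    if (t) private_rq[cpu] = t->next;
    return t;
}
```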
Multicore Processors
Traditionally, symmetric multiprocessing (SMP) systems facilitate parallel processing through multiple physical processors. However, modern computing has evolved to utilize multicore processors, where multiple computing cores are integrated into a single physical chip.
Key Characteristics of Multicore Processors:
- Logical CPUs: Each core in a multicore processor maintains its own architectural state, making it appear to the operating system as a separate logical CPU.
- Performance and Efficiency: Multicore processors offer enhanced speed and reduced power consumption compared to traditional SMP systems where each CPU is on its own physical chip.
Scheduling Challenges
One significant challenge in multicore scheduling is the time spent waiting for data during memory access, known as a memory stall. Modern processors operate at significantly higher speeds than memory, leading to considerable wait times when fetching data. Memory stalls can also occur due to cache misses, which happen when the processor tries to access data not present in the cache memory.
Impact of Memory Stalls: Research indicates that processors can spend up to 50% of their time waiting for data availability from memory, complicating effective scheduling and utilization of processor resources.
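The effect of memory stalls can be observed even from user space. The rough sketch below uses Linux/POSIX timing via clock_gettime; the array size is arbitrary and the timings will vary by machine. It contrasts a sequential scan, whose latency the hardware prefetcher largely hides, with a pointer chase whose dependent loads miss the cache on almost every step, leaving the core stalled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* ~16M elements, well beyond typical cache sizes */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;

    /* Sequential pass: the prefetcher hides most of the memory latency. */
    double t0 = now();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i];
    double seq = now() - t0;

    /* Sattolo's algorithm: build one big cycle so every load in the chase
     * below depends on the previous one and usually misses the cache. */
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    /* Dependent, cache-missing loads: the core spends most of its time
     * stalled waiting for memory. */
    t0 = now();
    size_t idx = 0;
    for (size_t i = 0; i < N; i++) idx = next[idx];
    double chase = now() - t0;

    printf("sequential scan: %.3f s (checksum %zu)\n", seq, sum);
    printf("pointer chase:   %.3f s (final index %zu)\n", chase, idx);
    free(next);
    return 0;
}
```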
Multithreaded Processing Cores and Chip Multithreading (CMT)
To address the challenges posed by memory stalls, recent hardware designs have adopted multithreaded processing cores, where two or more hardware threads are assigned to each core. This innovation allows for more efficient utilization of processing power.
- Thread Switching: When one hardware thread stalls while waiting for memory, the core can seamlessly switch to another active thread, minimizing idle time and improving overall performance.
- Architectural State: From the operating system’s perspective, each hardware thread maintains its own architectural state, which includes critical components like the instruction pointer and register set. This allows each thread to function as a separate logical CPU.
Chip Multithreading (CMT)
In a typical configuration with CMT, a processor may contain multiple computing cores, each supporting multiple hardware threads. For example, a processor with four cores, where each core has two hardware threads, would present eight logical CPUs to the operating system.
Interleaved Execution: The execution of threads on a dual-threaded processing core can be interleaved, allowing concurrent operations that further enhance processing efficiency.
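A small sketch of how this looks from software: the arithmetic from the example above, plus a query of the logical-CPU count the operating system actually reports. The sysconf name _SC_NPROCESSORS_ONLN is a glibc extension rather than strict POSIX, so this assumes a Linux-like system.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Arithmetic from the example in the text: 4 cores x 2 hardware
     * threads per core are presented to the OS as 8 logical CPUs. */
    int cores = 4, hw_threads_per_core = 2;
    printf("logical CPUs (computed): %d\n", cores * hw_threads_per_core);

    /* On Linux, the number of online logical CPUs can be queried at
     * run time (glibc extension, not strictly POSIX). */
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs (reported by this machine): %ld\n", online);
    return 0;
}
```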
Hyper-Threading and Multithreading in Processors
Intel processors implement a feature known as hyper-threading (also called simultaneous multithreading, or SMT), which assigns multiple hardware threads to a single processing core. Keeping more than one hardware thread available lets the core continue doing useful work when one thread stalls, improving utilization.
- Thread Support: Modern Intel processors, like the i7, support two threads per core. In contrast, the Oracle Sparc M7 processor can handle eight threads per core across eight cores, offering the operating system a total of 64 logical CPUs.
Types of Multithreading
- Coarse-Grained Multithreading: In this approach, a thread executes on the core until a long-latency event (such as a memory stall) occurs. The core then switches to another thread, which incurs a high switching cost due to the need to flush the instruction pipeline before executing the new thread. This results in a delay as the new thread starts filling the pipeline with its instructions.
- Fine-Grained Multithreading: This method involves switching between threads at a finer level, typically at the instruction cycle boundary. Fine-grained systems incorporate specialized logic for thread switching, which minimizes the cost of switching between threads, allowing for smoother and more efficient execution.
Resource Sharing
Core Resources: The physical resources of the core, such as caches and pipelines, are shared among the hardware threads. Consequently, a processing core can only execute one hardware thread at a time.
Dual-Level Scheduling
A multithreaded, multicore processor requires two different levels of scheduling to manage the execution of its threads: one performed by the operating system and one performed by the processing core itself.
Two Levels of Scheduling in Operating Systems
In the context of managing processor resources, scheduling decisions occur at two distinct levels:
- Software Thread Scheduling: The operating system decides which software thread to run on each hardware thread (logical CPU). This is the level of scheduling that has been the focus of this chapter, and the operating system may use any of the scheduling algorithms described earlier.
- Hardware Thread Scheduling: At the second level, the core itself decides which hardware thread to execute. Common strategies include:
  - Round-Robin Scheduling: A straightforward approach, used by the UltraSPARC T3, in which hardware threads are scheduled in round-robin fashion.
  - Dynamic Urgency Value: The Intel Itanium assigns each hardware thread a dynamic urgency value ranging from 0 (lowest urgency) to 7 (highest urgency). The Itanium identifies five events that may trigger a thread switch; when one of these events occurs, the thread-switching logic compares the urgency values of the threads and selects the one with the highest urgency to run.
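As a toy illustration of the urgency comparison described above — not the actual Itanium switching logic — the following sketch selects the hardware thread with the highest urgency when a trigger event occurs.

```c
#include <stdio.h>

#define HW_THREADS 2

/* Toy model: each hardware thread carries a dynamic urgency from 0
 * (lowest) to 7 (highest).  When a trigger event occurs, the switching
 * logic picks the thread with the highest urgency. */
typedef struct {
    int id;
    int urgency;    /* 0..7 */
} hw_thread;

int select_on_event(const hw_thread t[], int n, int current) {
    int best = current;
    for (int i = 0; i < n; i++)
        if (t[i].urgency > t[best].urgency)
            best = i;
    return best;    /* stay on the current thread unless another is more urgent */
}

int main(void) {
    hw_thread t[HW_THREADS] = { { .id = 0, .urgency = 3 },
                                { .id = 1, .urgency = 6 } };
    int current = 0;
    /* e.g. a long-latency event on thread 0 triggers a comparison */
    current = select_on_event(t, HW_THREADS, current);
    printf("core now runs hardware thread %d\n", t[current].id);
    return 0;
}
```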
Integration of Scheduling Levels
The two scheduling levels are not necessarily exclusive; in fact, they can complement each other. If the operating system scheduler (first level) is aware of how processor resources are shared, it can make more informed scheduling decisions.
Resource Sharing Example: Consider a CPU with two processing cores, each capable of running two hardware threads. If two software threads are running on the same core, they must share the processor’s resources, potentially leading to slower performance. If the operating system understands the resource-sharing dynamics, it can schedule software threads on logical processors that do not share resources, optimizing performance.
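A minimal sketch of such an SMT-aware placement decision, assuming a hypothetical topology in which logical CPUs 2k and 2k+1 are the two hardware threads of physical core k (real CPUs may number sibling threads differently):

```c
#include <stdio.h>

#define CORES            2
#define THREADS_PER_CORE 2
#define LOGICAL_CPUS     (CORES * THREADS_PER_CORE)

/* Assumed topology: logical CPUs 2k and 2k+1 are the two hardware
 * threads of physical core k. */
static int core_of(int logical_cpu) { return logical_cpu / THREADS_PER_CORE; }

static int busy[LOGICAL_CPUS];   /* 1 if a software thread is already placed */

/* Prefer a logical CPU whose sibling thread is idle, so the two software
 * threads do not compete for the same core's pipelines and caches. */
int place_thread(void) {
    for (int cpu = 0; cpu < LOGICAL_CPUS; cpu++) {
        if (busy[cpu]) continue;
        int sibling = (cpu % 2 == 0) ? cpu + 1 : cpu - 1;
        if (!busy[sibling]) { busy[cpu] = 1; return cpu; }
    }
    for (int cpu = 0; cpu < LOGICAL_CPUS; cpu++)   /* fall back: share a core */
        if (!busy[cpu]) { busy[cpu] = 1; return cpu; }
    return -1;                                      /* no logical CPU free */
}

int main(void) {
    int a = place_thread();
    int b = place_thread();
    printf("thread A -> logical CPU %d (core %d)\n", a, core_of(a));
    printf("thread B -> logical CPU %d (core %d)\n", b, core_of(b));
    return 0;
}
```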
Load Balancing in Symmetric Multiprocessing (SMP) Systems
Importance of Load Balancing:
- Load balancing is crucial in SMP systems to ensure efficient CPU usage. Without it, some processors may be idle while others are overworked.
- Need for Load Balancing: Load balancing is needed only on SMP systems where each processor has its own private ready queue of threads waiting to execute. With a single shared queue, balancing is unnecessary because an idle processor simply takes the next runnable thread from the common queue.
- Load Balancing Approaches: 1) Push Migration: A periodic task reviews processor workloads and pushes threads from overloaded processors to idle or less-busy ones. 2) Pull Migration: An idle processor pulls a waiting thread from a busier processor's queue. Push and pull migration can be used in tandem, as in Linux's Completely Fair Scheduler (CFS) and FreeBSD's ULE scheduler; a toy push-migration sketch follows this list.
- Balanced Load Definition: A balanced load can mean an equal number of threads on every processor or an equal distribution of thread priorities across processors. Either definition may conflict with the goals of the scheduling algorithm, depending on system requirements.
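A toy sketch of push migration, using hypothetical per-CPU queue lengths (real schedulers such as CFS weigh priorities, cache locality, and many other factors): a periodic balancer moves one thread from the busiest queue to the least busy one.

```c
#include <stdio.h>

#define NCPUS 4

/* Toy per-CPU ready queues: only the queue lengths matter here. */
static int queue_len[NCPUS] = { 6, 1, 0, 3 };

/* Push migration: a periodic task finds the busiest and least busy
 * queues and moves one thread if the imbalance is large enough. */
void push_balance(void) {
    int busiest = 0, idlest = 0;
    for (int i = 1; i < NCPUS; i++) {
        if (queue_len[i] > queue_len[busiest]) busiest = i;
        if (queue_len[i] < queue_len[idlest])  idlest  = i;
    }
    if (queue_len[busiest] - queue_len[idlest] > 1) {
        queue_len[busiest]--;            /* dequeue one thread ... */
        queue_len[idlest]++;             /* ... and push it to the idle CPU */
        printf("migrated a thread: CPU %d -> CPU %d\n", busiest, idlest);
    }
}

int main(void) {
    /* In a real system this would run periodically (e.g. from a timer). */
    for (int round = 0; round < 3; round++) push_balance();
    for (int i = 0; i < NCPUS; i++)
        printf("CPU %d: %d ready threads\n", i, queue_len[i]);
    return 0;
}
```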
Processor Affinity
Concept of Processor Affinity:
- When a thread runs on a processor, its data is stored in that processor’s cache (creating a warm cache), allowing faster memory access for subsequent requests.
Impact of Migration on Cache:
- Cache Repopulation: Moving a thread to another processor requires repopulating the cache on the new processor, which is costly.
- Processor Affinity: Operating systems attempt to keep threads on the same processor to leverage the warm cache, enhancing performance.
Types of Processor Affinity:
- Soft Affinity: The OS tries to maintain a thread on the same processor but may migrate it if necessary.
- Hard Affinity: A strict approach where a thread can specify a set of processors it can run on.
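On Linux, hard affinity is exposed through the sched_setaffinity() system call. A minimal sketch that pins the calling thread to two logical CPUs (the CPU numbers are arbitrary; error handling is kept short):

```c
/* Pin the calling thread to logical CPUs 0 and 2 (Linux-specific;
 * a form of hard affinity). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* allow logical CPU 0 ... */
    CPU_SET(2, &set);                 /* ... and logical CPU 2 */

    /* pid 0 means "the calling thread"; the kernel will now schedule it
     * only on CPUs in the set. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("thread restricted to CPUs 0 and 2\n");
    return 0;
}
```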
Queue Types and Affinity:
- Common Ready Queue: Leads to more cache repopulation as threads are likely to be scheduled on different processors.
- Per-Processor Ready Queues: Help maintain a warm cache since threads remain on the same processor.
NUMA (Non-Uniform Memory Access) Systems:
- In NUMA architectures, each processor has faster access to its local memory than to memory that is local to another processor, so the scheduler and the memory allocator work best when a thread and the memory it uses are kept on the same node.
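On Linux, libnuma can be used to keep a thread's memory on its local node. A minimal sketch, assuming libnuma is installed (compile with -lnuma):

```c
/* Allocate memory on the local NUMA node of the calling thread.  Local
 * allocations avoid the slower remote-memory accesses described above. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    printf("NUMA nodes: %d\n", numa_max_node() + 1);

    size_t bytes = 64 * 1024 * 1024;
    /* Pages are placed on the node where the calling thread is running. */
    char *buf = numa_alloc_local(bytes);
    if (!buf) { perror("numa_alloc_local"); return 1; }

    buf[0] = 42;                      /* touch the memory */
    numa_free(buf, bytes);
    return 0;
}
```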
Balancing Load vs. Processor Affinity in SMP Systems
- Processor affinity improves performance by keeping a thread on the same processor, while load balancing tries to prevent some processors from being overloaded while others sit idle. The two goals conflict: migrating a thread forfeits the benefit of its warm cache and, on NUMA systems, may move it away from the node that holds its memory, increasing memory access times.
Heterogeneous Multiprocessing (HMP)
- HMP systems feature cores that run the same instruction set but differ in speed and power usage. High-speed, energy-intensive cores operate alongside slower, more energy-efficient ones.
ARM big.LITTLE Architecture:
- This architecture combines high-performance "big" cores for demanding tasks with low-power "little" cores for energy-efficient processing of less intensive tasks.
- Power is managed by directing demanding, interactive work to the big cores and longer-running background work to the little cores, without permanently restricting any particular task to one core type.
- Windows 10 supports thread scheduling policies that adapt to power management needs, balancing performance and energy efficiency.
Key Concepts in Multicore and Multiprocessor Scheduling
- Load Sharing: Enables threads to run on any available CPU core, maximizing parallel processing.
- Asymmetric Multiprocessing (AMP): A simple model where a master core manages scheduling, potentially causing bottlenecks.
- Symmetric Multiprocessing (SMP): All cores are self-scheduling, enhancing flexibility but requiring locking mechanisms for shared resources.
- Memory Stall: Occurs when a core must wait for data from memory, often because the data is not present in its cache, delaying execution.
- Hardware Threads: Threads managed directly by a processing core; if one thread stalls, the core can switch to another, improving utilization.
- Chip Multithreading (CMT): A CPU design in which each core supports multiple hardware threads.
- Load Balancing: Distributes tasks evenly across cores in SMP systems to prevent overloading.
- Push Migration & Pull Migration: Strategies for redistributing tasks between cores to maintain balance.
- Processor Affinity: A scheduling strategy favoring thread retention on the same core for cache efficiency.
- Soft and Hard Affinity: Different levels of thread retention on cores, affecting cache management.
- Heterogeneous Multiprocessing (HMP): Cores designed for different performance levels, improving energy efficiency.
- big.LITTLE Architecture: ARM’s model combining high-performance and energy-efficient cores for optimal task management.