Cache

Introduction

Memory is an important resource in every embedded system: any program executing on a core needs memory for its instructions and data. Memory transactions play an important role in a program's execution time; the faster the memory transaction, the shorter the execution time. A typical system architecture has CPU memory (registers), cache, DDR (RAM) and flash. These memories are arranged in the design so that programs execute faster at a reasonable cost. This article focuses on cache memory; we will see how cache memory works and how it plays an important role in speeding up execution.

Cache

Cache memory is one of the fastest memories in the system and plays an important role in faster execution. Caches are closely coupled with the core. Because cache memory is expensive, it is divided into three levels to keep the design cost effective:

  • L1 cache – fastest but smallest; usually split into separate instruction and data caches
  • L2 cache – slower but bigger; typically unified, holding both data and instructions
  • L3 cache – slowest but biggest; unified and shared between cores

L3 cache is faster than main memory (DDR/RAM).

As shown in the picture below, the caches are placed on an SoC (system on chip). L1 and L2 are private and tightly coupled with each core, whereas L3 is shared among multiple cores. The L3 cache is connected to DDR/RAM over the system bus, which is governed by the memory controller.

Fig 1: Multi-core cache layout

Now that we know what caches are and where they are placed in the system, let's see how a cache works to reduce execution latency. (The time needed to access data from memory is called "latency.")

How does a cache help reduce latency?

Programs are stored in secondary memory, i.e. an HDD or flash drive. When a program is invoked for execution, the first few code segments are pulled into main memory, i.e. RAM. Similarly, when the processor starts processing code/data, those code/data segments are pulled into the cache.

When the processor wants to re-access the same data, it fetches it from the cache, not from RAM. Since caches are faster, the access time is shorter and hence the latency is lower. When the processor finds the data in the cache, it is called a cache hit. There are also cases when the needed data segment is not present in the cache; that is a cache miss, and the needed segments are then fetched from RAM and placed into the cache.

For good performance, the design must keep the probability of cache misses low.

Hit Ratio = Cache Hits / Total Memory Accesses

Miss Ratio = 1 - Hit Ratio

Note: For a good design, the cache hit ratio should be above 90%.
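
To see why the hit ratio matters in practice, here is a minimal C sketch (not part of the original article) that traverses a matrix in a cache-friendly order and in a cache-unfriendly order; the matrix size and the use of clock() for timing are illustrative assumptions.

    /*
     * Minimal sketch: row-major traversal reuses each fetched cache line,
     * while column-major traversal jumps across lines and tends to miss.
     * On most systems the row-major loop is noticeably faster.
     */
    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static int matrix[N][N];

    static double traverse(int row_major)
    {
        clock_t start = clock();
        long sum = 0;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += row_major ? matrix[i][j]   /* sequential accesses: mostly hits  */
                                 : matrix[j][i];  /* strided accesses: mostly misses   */

        printf("checksum %ld, ", sum);            /* keep the loop from being optimized away */
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        printf("row-major    %.3f s\n", traverse(1));
        printf("column-major %.3f s\n", traverse(0));
        return 0;
    }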

Structural design of a cache (memory layout)

Fig 2: Cache structure design

Caches are designed as shown in the figure above.

A cache line is the largest unit of data that can be transferred between the cache and RAM in one transaction. Each cache line has a valid bit, which indicates whether the memory it holds is valid or not. Multiple cache lines, selected by an index, are stacked together to form a set, and multiple sets together form the cache. Along with each cache line there are tag bits, which identify the RAM address the line was fetched from.
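
As an illustration, the structure described above could be modelled with the following C sketch; the field widths and the 16-set, 4-way geometry are assumptions for the example, not taken from any particular CPU.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 256   /* bytes per cache line (assumed, matches the example that follows) */
    #define WAYS        4   /* cache lines per set (assumed)                                    */
    #define NUM_SETS   16   /* number of sets, selected by a 4-bit index (assumed)              */

    /* One cache line: a valid bit, the tag bits and the cached data block. */
    struct cache_line {
        bool     valid;
        uint32_t tag;                /* upper bits of the RAM address     */
        uint8_t  data[LINE_SIZE];    /* the block transferred from/to RAM */
    };

    /* A set groups several lines (ways); the index bits select the set. */
    struct cache_set {
        struct cache_line ways[WAYS];
    };

    /* The whole cache is simply an array of sets. */
    struct cache {
        struct cache_set sets[NUM_SETS];
    };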

Let's evaluate how one memory location is stored in the cache.

Example:

Consider a 32-bit RAM address system, which can address 4 GB of space. Our maximum address will be 32 bits wide, and the cache must be able to represent that same address.

Note: All addresses here are logical, not physical, because the processor works with logical addresses only; the physical address is computed by the MMU.

RAM address: 0x12341200

Tag = 0x12341 (the upper 20 bits of the RAM address)

Index = 0x2 (the next 4 bits)

Offset = 0x00 (the lowest 8 bits)

Based on the division above, one cache line can store 256 bytes (the 8-bit offset). The next cache line starts at address 0x12341300, where the index becomes 3, i.e. a new line. How the tag and index bits are split varies with the cache line size and the number of sets available; a small sketch of this address split is shown below for clarification.
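
A minimal C sketch of this address split, assuming the 20-bit tag / 4-bit index / 8-bit offset layout used in the example above (the geometry is an assumption for illustration, not a fixed rule):

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 8   /* 256-byte cache line, as in the example */
    #define INDEX_BITS  4   /* 16 sets                                */

    static void split_address(uint32_t addr)
    {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("addr 0x%08X -> tag 0x%05X, index 0x%X, offset 0x%02X\n",
               addr, tag, index, offset);
    }

    int main(void)
    {
        split_address(0x12341200);  /* tag 0x12341, index 0x2, offset 0x00      */
        split_address(0x12341300);  /* same tag, index 0x3: the next cache line */
        return 0;
    }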

For the example above we used set-associative mapping. This address mapping can be done in different ways too.

Mapping types of cache

  • Direct Mapping
  • Associative Mapping
  • Set-Associative Mapping

From the information above we can say that caches are small memories. When multiple applications run concurrently, their different memory requirements quickly fill up the cache, which ultimately increases the probability of cache misses. More cache misses mean higher latency. To reduce part of this latency without increasing the cache size, TLBs (translation look-aside buffers) are introduced.

(Just to know)

A translation lookaside buffer (TLB) is a memory cache that stores recent translations of virtual memory addresses to physical memory addresses, i.e. page table entries. It is used to reduce the time taken to access a memory location. Whenever we run into a cache miss, we can first look the address up in the TLB; if it is present there, we can fetch the data using the physical address stored in the TLB. This avoids walking the page table stored in RAM and so reduces the latency of the address lookup.
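
As an illustration only (a simple software model, not how the hardware implements it), a TLB entry and its lookup could be sketched like this; the entry count, page size and field layout are assumptions:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 64   /* assumed TLB size                                */
    #define PAGE_SHIFT  12   /* 4 KB pages: the low 12 bits are the page offset */

    /* One TLB entry: a cached virtual-page to physical-frame translation. */
    struct tlb_entry {
        bool     valid;
        uint32_t vpn;        /* virtual page number   */
        uint32_t pfn;        /* physical frame number */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit and fills *paddr; on a miss the page table
     * in RAM would have to be walked instead, which is much slower.        */
    static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;

        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
                return true;   /* TLB hit: translation found without a page-table walk */
            }
        }
        return false;          /* TLB miss */
    }

    int main(void)
    {
        uint32_t paddr;

        /* Pretend the OS/hardware has already cached one translation. */
        tlb[0] = (struct tlb_entry){ .valid = true, .vpn = 0x12341, .pfn = 0x00055 };

        if (tlb_lookup(0x12341200, &paddr))
            printf("hit:  0x12341200 -> 0x%08X\n", paddr);
        if (!tlb_lookup(0xDEAD0000, &paddr))
            printf("miss: 0xDEAD0000 needs a page-table walk\n");
        return 0;
    }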

Problems with caches in multi-core execution

We now know how caches are structured and where they are located on the SoC. Although a cache helps to reduce latency, it can introduce a few problems too if the software is not designed well. From fig 1 we know that every core has its own L1 & L2 cache, where L2 can hold both code and data. Please follow the scenario below to understand the cache coherency problem.

Scenario: consider an application with 2 threads, T1 & T2. Both access "global_var", protected by locks. So ideally there should be no corruption and the final value should be the expected one (please follow the code sketch below).

If the variable is not volatile, read/write operations on it won't necessarily go all the way to main memory: the operations happen in the cache and the values are committed to main memory (RAM) only later. This creates a problem in the multi-core case. Consider that the current value of the global variable is 23 and both cores are executing threads that modify the shared global_var:

  • Core 1 acquires the lock, updates the value to 24 (new value), and releases the lock.
  • Core 2 has prefetched the value 23 into its cache.
  • Core 2 acquires the lock, operates on the stale 23, updates the value to 22 (new value), and releases the lock.

The shared variable's value got corrupted here; the data is no longer correct.
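
The code referred to in the scenario was not included with the article; below is a minimal sketch of it, assuming POSIX threads (the iteration count and thread names are illustrative). T1 increments and T2 decrements the shared counter under a mutex, so the final value should equal the starting value. On real hardware you will normally see the correct result, because the hardware coherency protocol keeps the per-core caches in sync, as the note below points out.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 1000000

    static int global_var = 23;   /* shared variable, cached by each core */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *t1_increment(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            pthread_mutex_lock(&lock);
            global_var++;             /* e.g. 23 -> 24               */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *t2_decrement(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            pthread_mutex_lock(&lock);
            global_var--;             /* must see 24, not a stale 23 */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, t1_increment, NULL);
        pthread_create(&t2, NULL, t2_decrement, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("global_var = %d (expected 23)\n", global_var);
        return 0;
    }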

Cache Coherency

Note: with similar code you might not actually see the cache coherency problem, because hardware cache coherency protocols are enabled on your device.

Why did the data corruption happen?

Ans: The shared variable is allowed to be cached in each core's cache. Since the shared variable was not volatile, all frequent modifications were done in the cache (the value in RAM was not updated for every operation). And if we went to main memory for every data fetch, it would defeat the purpose of having a cache to reduce latency.

Solution to the cache coherency problem

The cache coherency problem can be solved in hardware as well as in software. The problem occurs because we have one copy of the data in main memory and one in each cache memory; when one copy of the shared data is changed, the other copies must be changed as well, otherwise we are left with inconsistent copies, and this creates the cache coherence problem. The solution is a discipline that ensures that changes to the values of shared data are propagated throughout the system in a timely fashion.

  • Cache coherency protocol – the hardware solution for cache coherency
  • Cache coherency scheme – the software solution

We saw that the cache structure has a valid bit to indicate whether the cached memory is valid or not. Cache coherency protocols use a similar approach to indicate the current status of a cached line and whether it needs to be fetched again.

Coherency mechanisms:

  • Directory-based – In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.

  • Snooping – Snooping is a process where the individual caches monitor the address lines for accesses to memory locations that they have cached. It is called a write-invalidate protocol: when a write operation to a location that a cache has a copy of is observed, the cache controller invalidates its own copy of the snooped memory location.

  • Snarfing – A mechanism where the cache controller watches both the address and data lines in order to update its own copy of a memory location when a second master modifies a location in main memory. When a write operation to a location that a cache has a copy of is observed, the cache controller updates its own copy of the snarfed memory location with the new data.

With the above coherency mechanisms, the following coherency protocols are implemented:

  • MSI protocol (Modified, Shared, Invalid)
  • MOSI protocol (Modified, Owned, Shared, Invalid)
  • MESI protocol (Modified, Exclusive, Shared, Invalid)
  • MOESI protocol (Modified, Owned, Exclusive, Shared, Invalid)

Modified – the value in the cache is dirty, i.e. the value in the current cache differs from the one in main memory.

Exclusive – the value in the cache is the same as in main memory, i.e. the value is clean, and no other cache holds a copy.

Shared – the cache holds the most recent copy of the data, and that copy may also be present in other caches and in main memory.

Owned – the current cache holds the block and is its owner, i.e. it is responsible for supplying that block to other caches and for eventually updating main memory.

Invalid – the current cache block is invalid and the data must be fetched again from another cache or from main memory.
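
To make these states concrete, here is a small, simplified C sketch (an illustrative software model, not a hardware implementation) of how a single cache line's MESI state could change in response to local and snooped (remote) reads and writes:

    #include <stdio.h>

    /* MESI states for one cache line. */
    enum mesi_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    /* Read by the local core; other_copies says whether another cache holds the line. */
    static enum mesi_state on_local_read(enum mesi_state s, int other_copies)
    {
        if (s == INVALID)                      /* miss: fetch the line          */
            return other_copies ? SHARED : EXCLUSIVE;
        return s;                              /* M, E, S: data already present */
    }

    /* Write by the local core: the line becomes Modified, other copies are invalidated. */
    static enum mesi_state on_local_write(enum mesi_state s)
    {
        (void)s;
        return MODIFIED;
    }

    /* Another core reads the line (snooped read). */
    static enum mesi_state on_snooped_read(enum mesi_state s)
    {
        if (s == MODIFIED || s == EXCLUSIVE)   /* supply the data, keep a shared copy */
            return SHARED;
        return s;
    }

    /* Another core writes the line (snooped write/invalidate): our copy is stale. */
    static enum mesi_state on_snooped_write(enum mesi_state s)
    {
        (void)s;
        return INVALID;
    }

    int main(void)
    {
        enum mesi_state s = INVALID;
        s = on_local_read(s, 0);     /* -> EXCLUSIVE: we are the only holder */
        s = on_local_write(s);       /* -> MODIFIED:  the line is now dirty  */
        s = on_snooped_read(s);      /* -> SHARED:    another core read it   */
        s = on_snooped_write(s);     /* -> INVALID:   another core wrote it  */
        printf("final state: %d (INVALID)\n", s);
        return 0;
    }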

