Power Optimization Techniques in Digital IC Design - 2

"There has certainly come to you a Messenger from among yourselves. Grievous to him is what you suffer; [he is] concerned over you [i.e., your guidance] and to the believers is kind and merciful." Quran[9:128]
#boycottFrance

In the previous article, we discussed the sources of power dissipation in CMOS circuits and arrived at the dynamic power equation:

P_dynamic = α · C_L · V_DD² · f

where α is the activity factor, C_L is the switched load capacitance, V_DD is the supply voltage, and f is the clock frequency.

In this article, we will focus on how to reduce the activity factor α through several RTL techniques.

Clock gating

Clock gating is by far the most famous and most efficient technique to reduce dynamic power.

Fig.1

Take a look at the circuit in Fig.1. There are several sources of dynamic power dissipation.

  • The power dissipated within the FF to capture the data.
  • The power dissipated in the clock network in the wires and the buffers when they drive their load capacitance.
  • The power dissipated in the combinational circuit due to the change in the input which might cause the output to switch.

Clock gating techniques try to reduce all these sources of power dissipation.

It's worth noting that the clock alone is responsible for a large share of power dissipation: the clock network consumes about 30%-50% of the chip's total dynamic power. That’s why clock gating is one of the most effective techniques to reduce dynamic power.

There are situations when we don’t need the FF to capture the data:

  • When the output of the FF is not used, as shown in Fig.2: if the MUX select is 0, the MUX won't pass FF B's output. So there is no need for FF B to sample the data and dissipate power in itself and in the combinational circuit B between it and the MUX.
Fig.2
  • When the input is not changing: some power is still wasted within the FF even if its input is not changing, so it’s worth stopping the clock to the FF to eliminate this dissipation.

And in both cases, there is no need to dissipate power in the clock network to deliver the clock signal to the FF.

The 1st implementation to achieve gating is to use a feedback MUX, as in Fig.3. This stops any combinational gate after the FF from switching and stops the FF from sampling its input. However, it doesn’t stop the power wasted in the clock network, and as we said above, the FF still dissipates power internally even when its input is constant.

Fig.3

Another issue with this implementation is the area it uses. For a 64-bit register, we need a 2-input MUX for each FF, 64 muxes in total. This increases the area and leads to more gates and thus more static power dissipation.
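The feedback-MUX behavior can be sketched with a small behavioral model (Python here for illustration, not synthesizable RTL; the function name and the input waveform are made up for the example):

```python
def mux_ff_step(q, d, en):
    """One clock edge of a FF whose D input comes from a 2-input MUX:
    when en is 0 the MUX feeds Q back to D, so the stored value is
    recycled and nothing downstream sees a transition."""
    return d if en else q

q = 0
trace = []
for d, en in [(1, 0), (1, 1), (0, 0), (0, 1)]:
    q = mux_ff_step(q, d, en)
    trace.append(q)

print(trace)  # the FF only captures d on the edges where en == 1
```

Note that the model captures exactly the limitation described above: the "clock edge" (one loop iteration) still happens on every cycle; only the sampled value is recycled.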

The 2nd implementation is to block the clock itself from reaching the FF. In order to do so, we can use an AND gate where one input is the clock and the other is the enable signal.

Using just an AND gate has a serious issue. If a glitch happens on the enable signal during the high period of the clock, it produces an unintended clock edge, as shown in Fig.4. If this gated clock is clocking a counter, that edge causes an unintended count and hence incorrect circuit behavior.

Fig.4

To overcome this issue, we use the circuit in Fig.5, which consists of an AND gate along with a negative-level-triggered latch.

  • During the low period of the clock, the enable signal gets latched. Any glitch that occurs in the low period won’t cause an edge, because the glitch is ANDed with the clock, which is low.
  • During the high period, any glitch on the enable won’t pass through the latch, because the latch is opaque, and so won’t affect the gated clock.
Fig.5

There is still an issue with this implementation. The clock might not reach the AND gate and the latch at the same time due to different routing delays. If the clock arrives at the latch later than at the AND gate, an unintended clock edge can occur.

In Fig.6, the enable was low before the arrival of the clock edge, so the clock should not have been enabled. However, the clock arrived late at the latch, leaving the latch transparent during the gray area. When the enable went high to enable the next clock cycle, it mistakenly enabled the current one, causing an unintended clock edge.

Fig.6

To overcome this, the AND gate and the latch are placed together in a single ASIC cell to minimize the differential delay due to routing. This cell is called an integrated clock gating cell, or ICG for short.
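The latch-plus-AND structure can be sketched as a behavioral model (a Python illustration, not RTL; sampling the waveforms once per half-period is an assumption of the sketch):

```python
def icg(clk_samples, en_samples):
    """Behavioral model of an integrated clock gating cell: a negative-level
    latch on the enable feeding an AND gate with the clock."""
    latched = 0
    gated = []
    for clk, en in zip(clk_samples, en_samples):
        if clk == 0:                 # latch is transparent while the clock is low
            latched = en
        gated.append(latched & clk)  # AND gate forms the gated clock
    return gated

# A glitch on the enable during the high phase (index 2) is blocked,
# because the latch is opaque while the clock is high.
clk = [0, 1, 1, 0, 1, 1]
en  = [0, 0, 1, 0, 0, 0]
print(icg(clk, en))  # no pulse escapes: [0, 0, 0, 0, 0, 0]
```

With the enable held high instead, the clock pulses pass through unchanged, which matches the intended behavior of the cell.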

The clock gating circuit should be placed as close to the clock source as possible to stop the clock from reaching more buffers. In Fig.7, if the ICG is placed at the beginning, none of the downstream buffers switch, giving a large power reduction. If the ICG is placed near the clock pin, some of the buffers still switch and dissipate power.

Fig.7

Disadvantages of clock gating

  • Clock gating adds extra area to the design due to the AND gate and the latch, but unlike the feedback MUX, we don’t need a gate for each FF: a register consisting of several FFs can share one clock gating cell.
  • The ICG itself consumes power.
  • The gating adds delay to the clock paths which makes clock tree synthesis (CTS) more difficult.

A good designer should use gating only when necessary. If the gated blocks are enabled “active” most of the time, the gating is pointless and the design ends up costing more area and power (due to the ICG) for no benefit!

Clock Gating for FPGAs

Clock gating SHOULD NOT be used in FPGA designs. Clock signals in an FPGA go through dedicated paths and buffers that drive the clock into the FPGA fabric with minimum skew. Forcing the clock through a LUT leads to bad clock skew, which causes its own problems in static timing analysis (STA). Instead, you should use the clock-enable (CE) pin of the FPGA flip-flops.

Example of clock gating

Fig.8

Consider the FSM in Fig.8. If the FSM is in state s1 and the input is x, the FSM remains unchanged. And if it is not changing, it doesn't need a clock. So we can use gating here, turning the clock off whenever (state == s1 AND input == x).
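In pseudocode terms, the gate condition looks like this (a Python sketch; the state encoding is hypothetical, and in real RTL this signal would drive the ICG's enable pin):

```python
S0, S1 = 0, 1  # hypothetical state encoding for the FSM of Fig.8

def clk_enable(state, x):
    """The state register needs a clock only when the state can change,
    so the clock enable is the complement of the 'hold' condition."""
    hold = (state == S1) and (x == 1)
    return not hold

print(clk_enable(S1, 1))  # False: clock gated off while the FSM holds in s1
print(clk_enable(S0, 1))  # True: clock runs whenever a transition is possible
```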

Active gates

As we showed in the previous article, not all gates have the same activity factor.

Table 1 shows the activity factor for several 2-input gates, where P_A and P_B are the probabilities that the inputs are 1. The 2nd column shows the probability P_1 that the output is 1; the probability P_0 that the output is 0 is simply its complement. The activity factor is the probability that the output transitions, which equals P_1 × P_0. The 3rd column gives the activity factor assuming P_A = P_B = 1/2.

Table 1

As you can see, some gates are more active (switching) than others, which means they dissipate more power. The most active of them is the XOR gate. You can arrive at the same result without going through the math above: looking at the XOR truth table, you'll find that whenever one of the inputs changes, the output changes. In a gate such as AND, if one of the inputs is 0, the output remains 0 even if the other input switches.
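You can reproduce the table's numbers in a few lines (a sketch assuming statistically independent inputs, as the table does):

```python
from itertools import product

def activity_factor(gate, p_a, p_b):
    """alpha = P_1 * P_0 for a 2-input gate with independent inputs,
    where P_1 is the probability the output is 1."""
    p1 = sum(
        (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        for a, b in product([0, 1], repeat=2)
        if gate(a, b)
    )
    return p1 * (1 - p1)

p = 0.5
print(activity_factor(lambda a, b: a & b, p, p))  # AND: 3/16 = 0.1875
print(activity_factor(lambda a, b: a | b, p, p))  # OR:  3/16 = 0.1875
print(activity_factor(lambda a, b: a ^ b, p, p))  # XOR: 1/4  = 0.25, the most active
```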

Unfortunately, the XOR gate is one of the most common gates in digital circuits. It is the main gate in all arithmetic circuits. This makes arithmetic circuits a target for several power optimization techniques as we will see.

Shifters instead of multipliers

Multipliers are made of a large number of XORs, so any change at the input of the multiplier causes many transitions. In applications where multiplication by powers of 2 is common, it’s better to use a left shifter instead of a multiplier, and a right shifter instead of division by powers of 2.

You can also multiply by some other constants, such as 7, like so: (input << 3) - input.
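As a quick sanity check of the shift-and-subtract trick (plain Python standing in for the RTL):

```python
def times8(x):
    return x << 3          # multiply by a power of 2: just a left shift

def times7(x):
    return (x << 3) - x    # 8x - x = 7x: one shift and one subtraction

def div4(x):
    return x >> 2          # unsigned division by a power of 2: a right shift

print(times7(13))  # 91
print(div4(100))   # 25
```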

Registering inputs

The operands’ bits of an arithmetic circuit might not arrive at the same time due to different routing delays. When the earliest bit arrives at the arithmetic circuit, an output gets calculated. When another bit arrives, another output gets calculated. This keeps happening until the last bit arrives and the final, true output is calculated. All these intermediate calculations were unnecessary and caused unnecessary switching activity.

Fig.9

To give you an example, look at Fig.9. Let's assume the input at T = 0 to the XOR was 0 0 and so the output was 0. Now the input changed to 1 1 but due to the delay in the wires, the input will take some time to reach the XOR. Because B goes through a shorter wire it will arrive first so the XOR will see an input of A = 0, B = 1, and the output will switch to 1. After some time the signal A will arrive so the XOR will see A = 1, B = 1 and will output 0.

As you can see this simple gate did an unnecessary switch and so dissipated unnecessary power. Imagine what may happen in a complex circuit with lots of XOR gates as the multiplier!

To reduce this, the designer should place a register before any arithmetic circuit, and during placement and routing this register should be placed as close as possible to the arithmetic circuit to minimize the difference in routing delays.
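The waste in the Fig.9 example can be counted with a toy model: the sequences below are the input pairs the XOR actually observes, with and without an aligning register (an assumption of the sketch):

```python
def transitions(outputs):
    """Count output toggles in a sampled sequence."""
    return sum(a != b for a, b in zip(outputs, outputs[1:]))

xor = lambda pair: pair[0] ^ pair[1]

# B travels the shorter wire, so the gate briefly sees (A=0, B=1).
skewed  = [(0, 0), (0, 1), (1, 1)]
# With an input register placed right before the gate, both bits flip together.
aligned = [(0, 0), (1, 1)]

print(transitions([xor(p) for p in skewed]))   # 2 toggles: 0 -> 1 -> 0
print(transitions([xor(p) for p in aligned]))  # 0 toggles: the output stays 0
```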

In FPGAs, it's recommended to use the input registers inside the DSP primitives. These registers sit inside the DSP block and are designed to minimize the differential delay. Xilinx Vivado will issue a warning if you use a DSP without input registers.

Note that these intermediate calculations might still happen, even if we balance the wire delays, due to unbalanced delays in the circuit elements themselves. This causes what we call a glitch. You can search online for more info about glitches.

Comparison circuits

The XOR gate is present in comparison circuits too. One way to minimize transitions is to compare two numbers using only part of their data width. If you are comparing two 64-bit numbers, you may use only the 4 most significant bits for the comparison instead of all 64 bits. This gives the correct result “most” of the time. If your design can’t tolerate any mistakes in calculations, then you have no choice but to compare all the bits.

You can use the same technique in fixed-point representation. If your inputs are guaranteed to vary mainly in the integer part, not the fraction, then you can compare two fixed-point numbers by the integer part only.

Again, this is subject to your design. In my graduation project, we were dealing with fixed-point representation and the algorithm we were implementing was not very sensitive to the fraction part. Meaning that if the comparator mistakenly decided that 2.5 is larger than 2.6 and sent 2.5 to the next stage, the design accuracy won’t change much. So, such a technique was valid in our case.
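A sketch of the partial comparison (the 4-bit slice width and the tie-breaking behavior are illustrative choices, not a fixed recipe):

```python
def approx_gt(a, b, width=64, msb_bits=4):
    """Compare two `width`-bit numbers using only their top `msb_bits`.
    Exact whenever the operands differ in those bits; otherwise it may
    report 'not greater' even when a > b, which is the accepted error."""
    shift = width - msb_bits
    return (a >> shift) > (b >> shift)

print(approx_gt(0xF << 60, 0x1 << 60))  # True: top nibbles differ, result is exact
print(approx_gt(200, 100))              # False: both top nibbles are 0 (approximate)
```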

Gray coding

Again, our target is to reduce the switching thus reducing the activity factor and so reduce the dynamic power.

Consider a 3-bit counter counting in binary from 0 to 5. The counter switches from 000 to 001, causing one transition, then from 001 to 010 with 2 transitions, and so on until it reaches 101. The total number of transitions is 8. Now consider counting in Gray code. In Gray code, each transition from one code to the next changes only one bit. So counting from 0 to 5 in Gray code gives only 5 transitions. This shows that Gray encoding is better for low-power applications.
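You can verify the 8-vs-5 count directly; n ^ (n >> 1) is the standard binary-to-Gray conversion:

```python
def to_gray(n):
    return n ^ (n >> 1)           # standard binary-to-Gray conversion

def flips(a, b):
    return bin(a ^ b).count("1")  # bits that toggle between two codes

binary = sum(flips(i, i + 1) for i in range(5))                   # 000 -> ... -> 101
gray   = sum(flips(to_gray(i), to_gray(i + 1)) for i in range(5))

print(binary)  # 8 bit toggles in binary
print(gray)    # 5 bit toggles in Gray: exactly one per step
```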

FSM state assignment

Gray encoding can be used in the FSM state assignment. The FSM register is connected to several pins in the FSM circuit which means it has a high capacitive load and hence high dissipation when it switches. So reducing the switching using Gray coding will reduce power consumption.

This is only true if the state transition sequence is somewhat predictable. You should use Gray encoding when you know that the FSM will usually transition from state X to Y to Z and hence assign X, Y, and Z consecutive Gray codes. If you have Gray states like so: 00 01 11 10 and your FSM transitions unpredictably you may get several transitions from state 00 to state 11, so two bits will change not one, which means we will lose the main advantage of Gray encoding.

Gray coding addressing

Another interesting application of Gray coding is to use it for high-capacitance address busses of memories with sequential access, such as the program counter in CPUs and the address pointers in FIFOs. Instead of storing data in a FIFO at locations 0 -> 1 -> 2 -> … -> N, use Gray-coded addresses instead: 0 -> 1 -> 3 -> 2 -> 6 -> … This guarantees that only one bit transitions each time you push data into the FIFO.

However, some engineers argued that this technique might have some issues such as the complexity of Gray coding counters over binary counters, and the high latency of Gray counters. You can get more details at [1].

Side note: Gray codes are called “Gray” after their inventor, Frank Gray, not after the color grey.

Disabling logic clouds

Fig.10 [2]

This technique is related to clock gating. In Fig.10, the 2nd FF is clock gated and turned off, but the logic cloud feeding it is still transitioning although its output is not needed. We can’t turn off the clock of the 1st FF because its output is used in another part of the circuit. The solution is to add an enable to the logic cloud, turning it off whenever the 2nd FF is turned off.
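This idea is often called operand isolation. A minimal sketch (the cloud function here is just a stand-in for whatever combinational logic feeds the gated FF):

```python
def cloud(a, b):
    """Stand-in for the combinational cloud feeding the gated FF."""
    return a ^ b

def isolated_cloud(a, b, en):
    """AND the cloud's inputs with the enable: while en == 0 the cloud sees
    constant zeros, so its internal nodes stop toggling."""
    return cloud(a & en, b & en)

# With the FF gated off (en = 0), the cloud's output is frozen no matter
# how its upstream inputs keep changing.
print({isolated_cloud(a, b, 0) for a in (0, 1) for b in (0, 1)})  # {0}
```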

Bus inversion

This is used for external data busses with high capacitance and switching activity. Let’s assume the data sent on a 10-bit bus at time 0 was “00_0000_0001” and the data to be sent at time T is “00_1111_1110”. If we send the 2nd word as-is, we get 8 transitions on the bus. Bus inversion states that if more than half of the bits would toggle, invert the data before sending it. So the 2nd word sent will instead be “11_0000_0001”, which causes only 2 transitions.

We need to inform the receiver that this data is inverted so that it can re-invert it to recover the actual data. For this, we add an extra bit to the bus, called the sign bit, which is 1 when the bus is inverted. So the total number of transitions in the above example is 3: 2 for the data and 1 for the sign bit.

Fig.11

The circuit that determines that more than half the bits switched is called the hamming circuit and is shown in Fig.11.

  1. The current input is compared to the previous one using a group of XOR gates, one per bit. If two bits are not equal, the XOR outputs a 1.
  2. The outputs of these comparators enter a majority-voting circuit, which outputs a 1 if there are more 1's (more unequal bits) than 0's.
  3. If that output is 1, the sign bit is set to 1 and the input is inverted before being sent on the bus, using another group of XORs.
  4. The current output (after the inversion decision) is stored in the register to be compared with the next input.

To understand why we use XORs in step 3, consider the circuit in Fig.12. If INV is 0, the XOR acts as a buffer for its input; otherwise, it acts as an inverter. The INV signal in our case is the output of the majority voter.

Fig.12
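The whole encoder of Fig.11 can be sketched in a few lines (a behavioral model; the 10-bit width and the function name follow the example above):

```python
def bus_invert(prev_sent, cur, width=10):
    """Bus-invert encoding: XOR comparators measure the Hamming distance to
    the previously driven word; a majority decision inverts the word when
    more than half the wires would toggle. Returns (word_to_send, sign_bit)."""
    mask = (1 << width) - 1
    toggles = bin((prev_sent ^ cur) & mask).count("1")  # XOR comparator array
    if toggles > width // 2:                            # majority voter
        return (~cur) & mask, 1                         # conditional XOR inverters
    return cur & mask, 0

prev = 0b0000000001
sent, sign = bus_invert(prev, 0b0011111110)
print(bin(sent), sign)               # 0b1100000001 with the sign bit set
print(bin(prev ^ sent).count("1"))   # 2 data-wire toggles instead of 8
```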

As you can see, this encoding circuit itself takes area and power, so the designer should make sure the power saved on the bus is greater than the power and area spent in the circuit.

You can read more about this technique in this paper [3].

References

[1] https://ieeexplore.ieee.org/document/497616

[2] https://www.springer.com/gp/book/9781461403968

[3] https://www.researchgate.net/publication/220526496_Bus-Invert_Coding_for_Low-Power_IO
