Application of Statistics in Semiconductor Product Development
The discipline of Statistics deals with matters of chance. People relate chance to luck. I used to wonder: is it really possible to predict luck, like what would be revealed in the next toss of a coin, head or tail? Very soon I could figure out that there was no need to wonder. In fact, even after an infinite number of tosses, it is not possible for an expert in Statistics to predict the face of the next toss of the coin. Disappointed, the next question that popped up in my mind was: is there some other claim of a statistician which could be miraculous? This question remained unanswered even after my best efforts in my early days of encounter with this subject. As usual, the application of these concepts in the real world was never illustrated, either in the textbook or by the teacher, which did not make it any easier.
After college, I joined the semiconductor industry. Here, there were three areas where concepts of Statistics were being applied: Measurement, Noise and Matching. I spent a reasonable amount of time in the industry, but there is no doubt that my ground was always shaky while dealing with these topics. Not that I have become a pundit now, but I thought of sharing the lens I developed over the years. Through this lens I could make heads or tails of the ideas that were thrown at me. It would be interesting to know whether readers would like to decorate their library with this lens or throw it in their dustbin. Apologies in advance if you spend time reading this article without anything to take away.
In the case of tossing a coin, a statistician may not be able to predict the face of the next toss, but as the number of tosses keeps increasing, he can predict the relative frequency of heads and tails with reasonably good accuracy. The percentage error in the prediction can be made smaller, even 0 in the limit, by increasing the number of tosses towards infinity. However, there is one very important point to note: the percentage error can be made smaller, but not the absolute size. For example, after 200 tosses, the total number of heads is likely to be in the range of 95 to 105 (100 +/- 5). And after 20,000 tosses it is likely to be in the range of 9,950 to 10,050 (10,000 +/- 50). One can see that the absolute size of the chance error has increased from 5 to 50, whereas the percentage chance error has become smaller, from 2.5% (5 in 200) to 0.25% (50 in 20,000). The key to this behaviour is that the size of the error does not track the number of tosses in a linear fashion. 5 in 200 tosses does not grow to 500 in 20,000. In fact, the size of the error grows in proportion to the square root of the number of tosses: 20,000/200 = 100, the square root of 100 is 10, and hence 5 in 200 grew to 50 in 20,000. Due to this square root relationship, the percentage error can be arbitrarily reduced by increasing the number of tosses. The size of the chance error quoted here (5 in 200 and 50 in 20,000) is not an absolute truth but just a representative value to explain the concept.
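To see this square-root behaviour concretely, here is a small simulation sketch (mine, not from the article; Python with numpy, arbitrary seed and repetition counts) that repeats the 200-toss and 20,000-toss experiments many times and reports the typical size of the chance error in each case.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
trials = 10_000                  # how many times each experiment is repeated

for n_tosses in (200, 20_000):
    # Each entry is one experiment: the number of heads in n_tosses fair flips.
    heads = rng.binomial(n=n_tosses, p=0.5, size=trials)
    chance_error = heads - n_tosses / 2      # deviation from the expected half
    typical_error = chance_error.std()       # typical absolute size of the error
    print(f"{n_tosses:>6} tosses: typical chance error ~ {typical_error:5.1f} heads"
          f" ({100 * typical_error / n_tosses:.2f}% of the tosses)")
```

The typical error grows by roughly a factor of 10 (the square root of 100) between the two cases, while its percentage shrinks by the same factor.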
There must be a general proof of this square root law, but I did not even attempt to search for it, as there is little doubt that it would be difficult for me to comprehend. I have accepted it at face value. I placate my curiosity with one plausible explanation: probably the number of unfavourable cases remains relatively constant, whereas the number of favourable cases increases with the number of repetitions of the experiment, and hence the percentage of chance error reduces.
But then, this is the expected behaviour in the case of coin tossing. The question is, does it apply to arbitrary experiments? The answer is yes. In the case of throwing a die, as the number of throws increases, the percentage of each spot gets closer to 17% (1/6). The beauty of statistics is that it can seemingly be applied to any unknown set of data. Let us assume that there is a box with an unknown distribution of numbers. If we repeatedly take out one number from the box with replacement, then as the number of repetitions increases, the percentage distribution of the outcomes of the experiment starts matching the unknown distribution very closely. So after some repetitions, one would start getting a sense of the average and the spread (standard deviation) of the data.
In the case of a die throw: Average = 3.5 and SD = 1.71.
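As a quick check of these two numbers, one can compute them directly from the six equally likely faces (a minimal sketch of mine, not part of the article):

```python
import numpy as np

faces = np.arange(1, 7)        # the six faces of a fair die: 1..6
average = faces.mean()         # (1 + 2 + ... + 6) / 6 = 3.5
sd = faces.std()               # population SD = sqrt(35/12) ~ 1.71
print(average, round(sd, 2))   # prints: 3.5 1.71
```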
If there is a box whose average is 100 and SD is 5, then at least 75% of the time any next draw from the box will be within 100 +/- 2x5, i.e., 90 to 110, and at least 96% of the time it will be within 100 +/- 5x5, i.e., 75 to 125. So the likelihood of getting a value beyond 75 to 125 is very low. This comes from Chebyshev's inequality, and it is true for any kind of distribution of data in the box.
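Chebyshev's inequality only promises "at least" these fractions, whatever the shape of the distribution. The sketch below (my own illustration; the skewed exponential "box" rescaled to average 100 and SD 5 is an arbitrary choice) checks that the observed fractions indeed stay above 75% and 96%.

```python
import numpy as np

rng = np.random.default_rng(1)

# A deliberately non-normal (skewed) box, rescaled to average 100 and SD 5.
raw = rng.exponential(scale=1.0, size=1_000_000)
box = 100 + 5 * (raw - raw.mean()) / raw.std()

for k in (2, 5):
    inside = np.mean(np.abs(box - 100) < k * 5)
    guaranteed = 1 - 1 / k**2                # Chebyshev's lower bound
    print(f"within +/-{k} SD: {inside:.4f} of draws"
          f" (Chebyshev guarantees at least {guaranteed:.2f})")
```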
In the case of a single draw, the chance error is a reflection of the SD of the box. If the SD of the box is large, then the prediction range will be very wide and hence may not be very useful.
However, if you make n draws from a box with average μ and SD σ, and sum them up, then the resulting distribution can always be approximated by a normal distribution. The average of this normal distribution would be nμ and its standard deviation (called the Standard Error) would be √n·σ. (As mentioned above, I have not cared to grasp the proof of this, for the obvious reason.)
The term SD (standard deviation) is generally used for a known set of data. SE (standard error) is used while predicting the chance error of an unknown set of data.
For example, the sum of 200 throws of a die would give a distribution with:
Average = 200 x 3.5 = 700 and SE = √200 x 1.71 = 24.18
And if this distribution is divided by n (i.e., the average of the draws is taken instead of the sum), then the resulting distribution would look like this:
Average = μ = 3.5 and SE = (√n·σ)/n = σ/√n = 0.12
You can see that 0.12 is very small compared to 1.71. In this way, the SE can even be reduced to zero if n is increased to infinity.
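Here is a short simulation sketch of the 200-throw example (mine, with arbitrary seed and repetition count): it repeats the experiment many times, records both the sum and the average of each experiment, and reports the mean and SE of each.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 200, 50_000

# Each row is one experiment of n die throws.
throws = rng.integers(1, 7, size=(trials, n))

sums = throws.sum(axis=1)    # distribution of the sum of n throws
avgs = throws.mean(axis=1)   # the same distribution divided by n

print(f"sum:     mean ~ {sums.mean():.1f}, SE ~ {sums.std():.2f}  (expected 700 and 24.18)")
print(f"average: mean ~ {avgs.mean():.2f}, SE ~ {avgs.std():.3f} (expected 3.5 and 0.12)")
```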
This is a beautiful result because:
- Whatever the distribution of the original box, the distribution of the sum can always be approximated by a normal distribution. The normal distribution puts a finer limit on the prediction compared to Chebyshev's inequality (which is valid for any distribution): +/-3σ of a normal distribution covers 99.7% of the data.
- By increasing the number of draws, the SE can be reduced to an arbitrarily small value.
These are the two results at the root of most applications of statistics in Measurement, Noise and Matching.
Measurement:
Measurement is done on semiconductor products in two phases: one during development and the other during production. One of the major challenges for the measurement team is to set test limits for room-temperature testing during production in such a way that yield loss does not go beyond acceptable limits, and at the same time to set the data-sheet limit in such a way that the part meets the spec across temperature in the field and the customer does not return parts. During development, measurement can be done only on a limited number of devices. Here Statistics comes to the rescue of the Test Engineer: based on the limited number of measurements done during development, it helps predict the yield during high-volume production and the conformance to the data-sheet limits across temperature in the customer's application.
The first task of the measurement team is to establish the accuracy of the measurement of a parameter on a single part. One could assume that measured parameter value = actual value + systematic bias + chance error. It seems nobody knows the cause of the chance error; however, its long-run average should cancel out. Statistics cannot help in isolating the systematic error. Here, the measurement team needs to collaborate with the design team to correlate measurement with simulation results: either the measurement team finds a fault in its own measurement, or the design team finds a bug in its simulation.
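A tiny sketch of this measurement model (the actual value, the bias and the chance-error SD below are invented for illustration): averaging repeated measurements washes out the chance error but leaves the systematic bias untouched, which is why correlation with simulation is still needed.

```python
import numpy as np

rng = np.random.default_rng(3)

actual_value    = 1.200    # hypothetical true parameter value (say, volts)
systematic_bias = 0.015    # hypothetical fixed offset of the test setup
chance_sd       = 0.010    # hypothetical SD of the random measurement error

# One measurement = actual value + systematic bias + chance error.
measurements = actual_value + systematic_bias + rng.normal(0, chance_sd, size=10_000)

print("average of many measurements:", round(measurements.mean(), 4))
print("actual value                :", actual_value)
# The ~0.015 gap that remains is the systematic bias: no amount of averaging
# removes it; only correlation with simulation (or a better setup) exposes it.
```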
The next challenge for the measurement team is to identify and cancel out the chance error. One could imagine that the parameter values are in a box and each measurement is a single draw from this box. There are two possible processes at this step; the assumption is that the source of the chance error is the measurement process itself.
- A histogram of single measurements on the same part is made, and the average and SD of this histogram are found. If the distribution is normal, then Average +/- 3σ would cover 99.7% of the measurements, and even future measurements during production are expected to have a similar spread. If the distribution is not normal, then Chebyshev's inequality has to be applied, and the expected spread during production would be even wider.
- n measurements are made and a histogram of their average is plotted. In this case, the histogram would always be close to normal, and the spread of the chance error can be reduced to an arbitrarily small value by increasing the number of repetitions n. During production too, n measurements are made and their average is taken as the final value of the measurement. Since the spread of the chance error would be very small, the data-sheet limit could be made tighter, and the yield loss during production is expected to be low.
The second process would always give better results. The question, then, is what is its downside. Since the number of measurements is high, the test time would increase and accordingly the test cost would increase. It is obvious that the Test Engineer has a tough time optimizing cost versus better data-sheet values.
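The sketch below (my own, with invented numbers) contrasts the two processes on a single part: the spread of single measurements versus the spread of n-point averages, which shrinks roughly as 1/√n, while the test time grows roughly as n.

```python
import numpy as np

rng = np.random.default_rng(4)

true_value = 0.500     # hypothetical parameter of one part
chance_sd  = 0.020     # hypothetical SD of a single measurement
trials     = 20_000

# Process 1: report a single measurement as the final value.
single = true_value + rng.normal(0, chance_sd, size=trials)
print(f"single measurement: spread ~ {single.std():.4f}, test time ~ 1x")

# Process 2: report the average of n measurements as the final value.
for n in (4, 16, 64):
    averaged = true_value + rng.normal(0, chance_sd, size=(trials, n)).mean(axis=1)
    print(f"average of n = {n:2d}:  spread ~ {averaged.std():.4f}, test time ~ {n}x")
```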
The next task for the measurement engineer is to look at the parameter variation across process and temperature to set the data-sheet limits. Voltage is not so important, as the worst-case voltage can always be identified and applied. To reduce test cost, production testing is generally done at only one temperature, and that too at room temperature. Typically, 50 units from all process-corner wafers (deliberately created by the process engineers) are taken during product development. Instead of looking at the histogram of one part, n parts are randomly chosen and a histogram is made; this histogram would be close to normal. From it, the average value μ_RT and the spread σ_RT are calculated.

The devices are marked and the same set of devices is measured across temperatures; the simulation results should correlate with the data across temperatures. The drift in the parameter value across temperature is found. Here, the concept of statistical analysis can again be employed. Instead of looking at the drift of one part, n parts can be randomly selected from the lot and the histogram of their average drift plotted. Most likely this distribution would be normal if n is relatively large. The average μ_drift and the spread SE_drift of this distribution are calculated. Depending on the product application, the drift across parts is calculated as μ_drift +/- m·√n·SE_drift. Since the drift across temperature is real and not a chance error, we need to multiply SE_drift by √n to get back the actual part-to-part drift. SE_drift would be tighter if n is large; a large n reduces the chance error of the measurement process across temperature, and μ_drift would be much closer to the actual drift and would correlate more closely with the simulation. Generally m = 3 is acceptable. The measurement engineer sets the data-sheet limit as μ_RT +/- μ_drift +/- 3·√n·SE_drift. This would correspond to around 2,000 ppm defects at the customer's site across temperature. For single-digit ppm, m = 6 needs to be used, but in that case the spec would be much wider. The test limit is set as μ_RT +/- 3·√n·σ_RT at room temperature.
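To make the limit-setting arithmetic concrete, here is a small worked sketch with invented numbers, following the formulas as I have written them above (μ_RT, σ_RT, μ_drift and SE_drift are just my labels for the quantities described in the text, and the values are purely illustrative):

```python
import math

# All numbers below are hypothetical, purely to illustrate the arithmetic.
n        = 50        # parts measured during development
mu_rt    = 1.2000    # average of the room-temperature histogram
sigma_rt = 0.0006    # spread of the room-temperature histogram
mu_drift = 0.0100    # average of the averaged-drift histogram across temperature
se_drift = 0.0005    # spread (SE) of the averaged-drift histogram

m = 3   # ~2,000 ppm target; m = 6 would target single-digit ppm at a wider spec

drift_across_parts = m * math.sqrt(n) * se_drift
datasheet_lo = mu_rt - mu_drift - drift_across_parts
datasheet_hi = mu_rt + mu_drift + drift_across_parts
test_lo = mu_rt - 3 * math.sqrt(n) * sigma_rt
test_hi = mu_rt + 3 * math.sqrt(n) * sigma_rt

print(f"data-sheet limits       : {datasheet_lo:.4f} .. {datasheet_hi:.4f}")
print(f"test limits (room temp) : {test_lo:.4f} .. {test_hi:.4f}")
```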
Noise:
As expected, noise is a random phenomenon, and hence the concepts of statistics are applied here. One would have seen that in circuit analysis, instead of looking at the noise of a device from an amplitude point of view, it is looked at in the frequency domain. This is because, just as with a signal, any random noise waveform generated by a device can be decomposed into its frequency content, and the circuit responds to each frequency component by modifying its amplitude and phase. The total output noise can then be assembled again by adding all the individual output frequency components. If the noise of a device were characterised only in the amplitude domain, it would not be possible to find the total output noise of a circuit.
In the frequency domain, again, the average of the noise voltage generated by a device is not characterised directly. Since noise is random, the average value of its amplitude would be zero. Hence the amplitude is squared, and its average value is characterised for each frequency by putting a 1 Hz filter at the given frequency. The average value is characterised by taking n samples, to reduce the spread in the average value. Based on this, the power spectrum of the noise is found, and the power spectrum of the noise of each device is used to find the output noise of the circuit in simulation. The power of all frequency components is added (or integrated) to find the total power of the noise at the output, and one can take the square root of this sum to find the expected amplitude of the noise. Textbooks describe it as rms noise, as it is found by taking the square root of the mean of the squared amplitude. However, this amplitude is not randomly sampled with the frequency aspect left out; it is the amplitude of each frequency component. So I am not sure whether it can be called rms noise. In my opinion, it is reasonable to call it the average amplitude of the noise, as it is found by taking the square root of the average power.
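As a rough numerical sketch of "add up the power spectrum and take the square root" (the spectrum below, a 1/f term plus a flat floor, and all its values are invented for illustration):

```python
import numpy as np

# Hypothetical output-noise power spectral density in V^2/Hz:
# a 1/f (flicker-like) term plus a flat (thermal-like) floor.
f = np.arange(1.0, 1e6 + 1.0)         # 1 Hz to 1 MHz in 1 Hz steps
psd = 1e-12 / f + 1e-17               # V^2/Hz, invented values

total_power = np.sum(psd) * 1.0       # "integrate" the power over the 1 Hz bins
noise_amplitude = np.sqrt(total_power)

print(f"total noise power : {total_power:.3e} V^2")
print(f"sqrt of the power : {noise_amplitude * 1e6:.2f} uV")
```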
As far as measuring the noise at the output of a circuit in the lab is concerned, the voltage can be sampled and a histogram plotted. The total capture time should be long enough if low-frequency noise is also to be taken into account. The average value would be zero, but the standard deviation can be found. One would think that this standard deviation would be the same as the rms amplitude described above; however, I think this may not be correct. One should take n samples, square each sample and then take the average, and plot the histogram of these averages. The average value of this plot would be the average noise power. By increasing n, the spread (SE) in the average noise power can be reduced a lot. At the end, one can take the square root of this average noise power to get the expected noise voltage. Twice this value would be the expected peak-to-peak noise voltage. I am not sure how to calculate the chance error (or SE) in this; maybe the SE of the noise voltage could also be found by taking the square root of the SE of the power histogram. It would be great if someone could comment on this.
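And a quick sketch of the time-domain procedure, using synthetic Gaussian noise whose rms I set myself so the "right answer" is known: square the samples, average, and take the square root. (For zero-mean samples, the SD and the square root of the mean squared sample are numerically the same quantity; whether they match the frequency-domain number depends on whether the capture bandwidth covers all of the noise.)

```python
import numpy as np

rng = np.random.default_rng(5)

true_rms = 50e-6                                    # invented: 50 uV rms noise
samples = rng.normal(0.0, true_rms, size=200_000)   # sampled output voltage

avg_power = np.mean(samples ** 2)     # average of the squared samples
noise_voltage = np.sqrt(avg_power)    # square root of the average noise power

print(f"average value       : {samples.mean() * 1e6:+.2f} uV (close to zero)")
print(f"estimated noise rms : {noise_voltage * 1e6:.2f} uV (was set to 50 uV)")
print(f"plain SD of samples : {samples.std() * 1e6:.2f} uV")
```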
Matching:
One would have seen that the matching between two elements improves as the area of the elements increases. A bigger area can be viewed as the summation of a larger number of identical unit elements, and more elements in the summation means a smaller Standard Error; that is why matching improves as area increases. Ideally, matching could have been improved indefinitely, but gradient error becomes more prominent as the area of the element increases.
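A small sketch of this view of matching (the unit-element mismatch and the element counts are invented): a "bigger" element is modelled as the average of N identical unit elements, and its effective mismatch shrinks roughly as 1/√N, ignoring gradient effects.

```python
import numpy as np

rng = np.random.default_rng(6)

unit_sd = 0.01     # invented: 1% random mismatch of a single unit element
trials = 20_000

for n_units in (1, 4, 16, 64):
    # A big element is modelled as the average of n_units unit elements.
    big = 1.0 + rng.normal(0, unit_sd, size=(trials, n_units)).mean(axis=1)
    print(f"{n_units:3d} unit(s), area x{n_units:3d}: mismatch SD ~ {big.std() * 100:.2f}%")
```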
Acknowledgement: I would like to express immense gratitude to Ravi Prakash for spending his valuable time in reviewing this article and giving very useful feedback.