[time-nuts] 5070B once more.... (actually 5370A fans)
Magnus Danielson
magnus at rubidium.dyndns.org
Fri May 22 20:40:02 UTC 2009
Hal Murray wrote:
>> The reason for the fans is to prevent premature failures of the
>> silicon devices due to thermal degradation. The life of a silicon
>> chip is halved for every 10C temperature increase, more or less.
>
> I was going to make a similar comment, but got sidetracked poking around
> google. I didn't find a good/clean article. Does anybody have a good URL?
>
> Doubling every 10C is the normal recipe for chemical reactions. I think that
> translated to IC failure rate back in the old days. Is that still correct?
> Has modern quality control tracked down and eliminated most of the
> temperature dependent failure mechanisms?
>
> I remember reading a paper 5 or 10 years ago. The context was disks. I think the main failure was electronics rather than mechanical. It really really really helped to keep them cool.
>
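The doubling-per-10C rule of thumb quoted above can be sketched as a simple acceleration factor. This is a minimal illustration, not a handbook model; the reference temperature and rated life below are assumed values.

```python
# Sketch of the "life halves per 10 C" rule of thumb: an
# Arrhenius-style acceleration with a factor of 2 per 10 C.

def acceleration_factor(t_ref_c, t_c, doubling_step_c=10.0):
    """Relative failure-rate acceleration at t_c versus t_ref_c."""
    return 2.0 ** ((t_c - t_ref_c) / doubling_step_c)

def expected_life(life_at_ref_hours, t_ref_c, t_c):
    """Expected life at t_c, scaled down by the acceleration factor."""
    return life_at_ref_hours / acceleration_factor(t_ref_c, t_c)

# An (assumed) part rated for 100 000 h at 40 C, run 20 C hotter:
print(expected_life(100_000, 40, 60))  # 25000.0 -- life quartered
```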
Books have been written about this. One fairly short but useful one
that I have found is the AT&T Reliability Manual. One criticism of
reliability calculations (from Bob Pease, for instance) is that if you
remove protection circuits from a design, the MTBF calculation says
the design improves (since fewer devices contribute their FITs), while
the actual design is less reliable, as the protection would have
avoided premature failures. This criticism is valid only if blind
belief in MTBF is allowed to rule the judgement of reliability, since
the methodology assumes otherwise sound engineering practice to avoid
over-voltage, over-current, over-heating and other forms of excess
stress beyond the limits within which the design is intended to
operate and be stored.
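Pease's point falls out of the arithmetic itself: in a series reliability model the per-part FIT rates (failures per 10^9 device-hours) simply add, so deleting parts always "improves" the computed MTBF. The part names and FIT values below are illustrative assumptions, not figures from any handbook.

```python
# Series-system MTBF from per-part FIT rates.
# FIT = failures per 1e9 device-hours; FITs of series parts add.

FIT_HOURS = 1e9

def mtbf_hours(part_fits):
    """MTBF of a series system: 1e9 / (sum of part FIT rates)."""
    return FIT_HOURS / sum(part_fits)

# Assumed parts list; the last two entries are the protection devices.
with_protection    = [50, 20, 5, 5]   # e.g. regulator, logic, TVS, fuse
without_protection = [50, 20]

print(mtbf_hours(with_protection))     # 12500000.0 h
print(mtbf_hours(without_protection))  # ~14285714 h -- "better" on paper
```

The unprotected design wins on paper by roughly 14 %, while in practice it is the one that dies at the first surge.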
Anyway, there has been much research into the reliability of
electrical devices, and in general keeping the temperature
sufficiently low is among the things that improve reliability. For
silicon, the junction temperature limit has to be ensured by limiting
the component's ambient temperature (usually to 70 degrees, as
measured between two DIP packages for instance), which leaves yet
another temperature drop into the (self-)convected air and down to the
ambient temperature of a rack of electronics (as measured 1 meter from
the floor, 3 dm from the rack), limited to a maximum of 45 degrees.
The 19" rack standard was originally designed for a total of 300 W per
rack, so that self-convection up through the installed boxes would
work. Having 1-10 kW per rack is not uncommon these days, so forced
convection is needed, which puts a requirement on different
manufacturers to follow a common air-flow discipline, which also needs
to consider that no heat may be put out at the front for safety
reasons (you should never hesitate to hit the on/off switch).
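The temperature budget above chains together with the standard junction-temperature relation T_j = T_a + P * theta_JA. A minimal sketch, in which the rise inside the box, the dissipation and the theta_JA figure are assumptions chosen only to illustrate the chain:

```python
# Junction temperature from local ambient, dissipation and theta_JA.

def junction_temp(t_ambient_c, power_w, theta_ja_c_per_w):
    """T_j = T_a + P * theta_JA."""
    return t_ambient_c + power_w * theta_ja_c_per_w

room_c      = 45.0   # max rack-room ambient per the text
box_rise_c  = 15.0   # assumed rise from room air to between the packages
local_amb_c = room_c + box_rise_c          # 60 C component ambient

tj = junction_temp(local_amb_c, 0.5, 60)   # 0.5 W at theta_JA = 60 C/W
print(tj)  # 90.0 C -- must stay below the die's rated T_j(max)
```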
It is unfortunately common to see racks where one box has a
left-to-right airflow while the one on top of it goes right-to-left,
and the rack leaves very narrow space between the sides of the boxes
and the rack sides, so effectively the boxes feed each other
pre-heated air until one of them dies. Another example: a particular
switch with a left-to-right airflow was sitting at the top of a line
of computing racks in a parallel computing setup. The racks fed it an
air flow of stepwise increasing temperature as the air passed through
all of them. The last switch died prematurely.
In parallel computing, heat management and power management can be
much more troublesome issues than load balancing between the CPUs,
which is the kind of luxury problem you can deal with at a later stage.
Cray was a refrigeration company which also delivered a lot of CPU
cycles along the way.
Cheers,
Magnus