[time-nuts] 5070B once more.... (actually 5370A fans)
Magnus Danielson
magnus at rubidium.dyndns.org
Fri May 22 20:40:02 UTC 2009
Hal Murray wrote:
>> The reason for the fans is to prevent premature failures of the
>> silicon devices due to thermal degradation. The life of a silicon
>> chip is halved for every 10C temperature increase, more or less.
>
> I was going to make a similar comment, but got sidetracked poking around
> google. I didn't find a good/clean article. Does anybody have a good URL?
>
> Doubling every 10C is the normal recipe for chemical reactions. I think that
> translated to IC failure rate back in the old days. Is that still correct?
> Has modern quality control tracked down and eliminated most of the
> temperature dependent failure mechanisms?
>
> I remember reading a paper 5 or 10 years ago. The context was disks. I think the main failure was electronics rather than mechanical. It really really really helped to keep them cool.
>
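The doubling-per-10C rule of thumb quoted above can be sketched as a simple acceleration factor. This is a minimal illustration, not a handbook model; the reference temperature and rated life below are assumed values.

```python
# Sketch of the "life halves per 10 C" rule of thumb: an
# Arrhenius-style acceleration with a factor of 2 per 10 C.

def acceleration_factor(t_ref_c, t_c, doubling_step_c=10.0):
    """Relative failure-rate acceleration at t_c versus t_ref_c."""
    return 2.0 ** ((t_c - t_ref_c) / doubling_step_c)

def expected_life(life_at_ref_hours, t_ref_c, t_c):
    """Expected life at t_c, scaled down by the acceleration factor."""
    return life_at_ref_hours / acceleration_factor(t_ref_c, t_c)

# An (assumed) part rated for 100 000 h at 40 C, run 20 C hotter:
print(expected_life(100_000, 40, 60))  # 25000.0 -- life quartered
```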
Books have been written about this. One fairly short but useful one
that I have found is the AT&T Reliability Manual. One criticism of
reliability calculations (from Bob Pease, for instance) is that if you
remove protection circuits from a design, the MTBF calculation says
the design improves (since fewer devices contribute their FITs), while
the actual design is less reliable, as the protection would have
avoided premature failures. This criticism is valid only if blind
belief in MTBF is allowed to rule the judgement of reliability, since
the methodology assumes otherwise sound engineering practice to avoid
over-voltage, over-current, over-heating and other forms of excess
stress beyond the limits within which the design is intended to
operate and be stored.
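Pease's point falls out of the arithmetic itself: in a series reliability model the per-part FIT rates (failures per 10^9 device-hours) simply add, so deleting parts always "improves" the computed MTBF. The part names and FIT values below are illustrative assumptions, not figures from any handbook.

```python
# Series-system MTBF from per-part FIT rates.
# FIT = failures per 1e9 device-hours; FITs of series parts add.

FIT_HOURS = 1e9

def mtbf_hours(part_fits):
    """MTBF of a series system: 1e9 / (sum of part FIT rates)."""
    return FIT_HOURS / sum(part_fits)

# Assumed parts list; the last two entries are the protection devices.
with_protection    = [50, 20, 5, 5]   # e.g. regulator, logic, TVS, fuse
without_protection = [50, 20]

print(mtbf_hours(with_protection))     # 12500000.0 h
print(mtbf_hours(without_protection))  # ~14285714 h -- "better" on paper
```

The unprotected design wins on paper by roughly 14 %, while in practice it is the one that dies at the first surge.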
Anyway, there has been much research into the reliability of
electrical devices, and in general keeping the temperature
sufficiently low is among the things that improve reliability. For
silicon, the junction temperature limit has to be ensured by limiting
the component's ambient temperature (usually to 70 degrees, as
measured between two DIP packages for instance), which leaves yet
another temperature drop into the (self-)convected air and down to the
ambient temperature of a rack of electronics (as measured 1 meter from
the floor, 3 dm from the rack), limited to a maximum of 45 degrees.
The 19" rack standard was originally designed for a total of 300 W per
rack, so that self-convection up through the installed boxes would
work. Having 1-10 kW per rack is not uncommon these days, so forced
convection is needed, which puts a requirement on different
manufacturers to follow a common air-flow discipline, which also needs
to consider that no heat may be put out at the front for safety
reasons (you should never hesitate to hit the on/off switch).
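The temperature budget above chains together with the standard junction-temperature relation T_j = T_a + P * theta_JA. A minimal sketch, in which the rise inside the box, the dissipation and the theta_JA figure are assumptions chosen only to illustrate the chain:

```python
# Junction temperature from local ambient, dissipation and theta_JA.

def junction_temp(t_ambient_c, power_w, theta_ja_c_per_w):
    """T_j = T_a + P * theta_JA."""
    return t_ambient_c + power_w * theta_ja_c_per_w

room_c      = 45.0   # max rack-room ambient per the text
box_rise_c  = 15.0   # assumed rise from room air to between the packages
local_amb_c = room_c + box_rise_c          # 60 C component ambient

tj = junction_temp(local_amb_c, 0.5, 60)   # 0.5 W at theta_JA = 60 C/W
print(tj)  # 90.0 C -- must stay below the die's rated T_j(max)
```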
It is unfortunately common to see racks where one box has a
left-to-right airflow while the one on top of it goes right-to-left,
and the rack leaves very narrow space between the sides of the boxes
and the rack sides, so effectively the boxes feed each other
pre-heated air until one of them dies. Another example: a particular
switch with a left-to-right airflow was sitting at the top of a line
of computing racks in a parallel computing setup. The racks fed it an
air flow of stepwise increasing temperature as the air passed through
all of them. The last switch died prematurely.
In parallel computing, heat management and power management can be
much more troublesome issues than load balancing between the CPUs,
which is the kind of luxury problem you can deal with at a later stage.
Cray was a refrigeration company which also delivered a lot of CPU
cycles along the way.
Cheers,
Magnus