[time-nuts] DC distribution
jimlux
jimlux at earthlink.net
Sun Oct 6 03:03:13 UTC 2019
On 10/5/19 3:35 PM, Hal Murray wrote:
>
> jimlux at earthlink.net said:
>> There is *great* resistance to changing any assembly and workmanship
>> standard - nobody wants to be the person who says "we don't need to do
>> *that* anymore" and then a disaster happens, and one of the potential causes
>> is "you didn't do *that*"
>
>> It is entirely possible that the original rationale and explanation is no
>> longer valid.
>
> There is also a risk of troubles because you are still doing *that*.
>
> Do the people who maintain the rules occasionally look around to see if a
> better way has been developed?
>
TL;DR: Yes, but...
Sore point there - since my job these days is managing what are called
Risk Class D missions, for which some of the (perceived) risk is that
you don't have to follow all the process that is typical for Class A, B,
and C missions. And I've had in-flight failures where the spacecraft was
lost (Ouch!). There's the question of "should we have followed some
process that we didn't follow?" The idea is that process is expensive,
and knowing acceptance of risk allows you to do things you could not
otherwise do.
NASA divides missions into risk classes (NPR 8705.4) in terms of the
"consequences of failure," "national significance," "difficulty of
reflight," or cost, ranging from A down to D. Class A is human
spaceflight or a multibillion-dollar flagship; Class B is things like
Mars rovers; Class C is the "less than two year mission that costs less
than $100M" kind of thing; Class D is "ok if it fails".
There is an enormous amount of "standard practice" for NASA missions -
often derived from long experience or, perhaps, from some "bad day",
after which a process/rule gets created that says "we're not going to do
that again".
It's important to know that NASA, in general, does not do "reliability
calculations" in a MIL-HDBK-217 way - there's no stacking up of
individual part reliabilities to get an estimated system MTBF. This is
historical - NASA typically builds "just one unit" (maybe 2 or 3) - so
there's no chance to do life testing and build up statistics. I think
(Jim's opinion) that when they started coming up with process, the
part/assembly reliability data had huge variances, so the resulting MTBF
predictions spanned a wide range, or worse yet, said "failure is
certain". There's also a problem that parts reliability probably isn't
the dominant factor for reliability - it's design (is that wire under
tension causing it to break with thermal cycles) or workmanship (not in
a good/bad sense, but a variability sense).
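For contrast, the MIL-HDBK-217-style "stacking" calculation described above - the one the post says NASA generally does not do - is just a sum of constant part failure rates for a series system, with the system MTBF as the reciprocal. A minimal sketch (the part names and FIT values here are invented for illustration, not from any real parts database):

```python
# MIL-HDBK-217-style parts-count prediction for a series system:
# every part must work, so the (assumed constant) part failure rates
# add, and the system MTBF is the reciprocal of the summed rate.
# Part names and FIT values are made-up illustrative numbers.

FIT = 1e-9  # 1 FIT = 1 failure per 10^9 part-hours

part_failure_rates_fit = {
    "op-amp": 100.0,
    "tantalum cap": 50.0,
    "power MOSFET": 200.0,
}

lambda_system = sum(part_failure_rates_fit.values()) * FIT  # failures/hour
mtbf_hours = 1.0 / lambda_system

print(f"system failure rate: {lambda_system:.3e} /hr")
print(f"predicted MTBF: {mtbf_hours:.0f} hours")
```

The post's point is that when the individual rates carry huge variances, the summed prediction inherits them, so the resulting MTBF range can be too wide to be useful.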
So there are tons of process to try and drive the variability of
workmanship down - You don't just tighten a fastener, you torque it to a
specified level, determined (in theory) by the design loads, etc.; and
someone witnesses the torquing to make sure someone didn't forget to
install the bolts. Mistakes happen - the system tends to get paperwork
heavy - and disasters have occurred because someone ignored the evidence
of their hands/eyes and trusted the paper - NOAA-N Prime is the case in
point. Interestingly, these are called "process escapes" - and there's
a huge amount of work (multiple work years even for things where nothing
bad happened) that goes into determining why someone did something
that's outside the process - was it just a bad day? is the process
itself inconvenient or incomplete, etc. There is an intense amount of
contemplation on changing the process - typically it was created because
of a single bad event (NASA just doesn't build that many things), it
addressed the causes of that event and appears to be a "good idea" for
the future. It then becomes part of the "received wisdom of the ages"
and everyone does it - until some event triggers a reevaluation.
In general, the system is set up so that it's easier to just "do the
standard thing" than to get a waiver to not do it. Getting the waiver
typically requires that you *prove* in some sense that it won't increase
risk, or that you've somehow backed yourself into a corner and there's
no way to get the job done without it. The latter is the "willing
acceptance of risk" and there's a lot of people who have to sign off on
it - The NASA administrator does NOT want to sit in front of Congress
explaining why a $500M mission was lost because a waiver was issued to
not do something. "You mean, sir, that we saved a few hours labor and it
cost us $500M?" You don't get to say "There were 10,000 things, each
that are individually a good idea, but if we did them all, the mission
would have cost $1B, and you only gave us $500M"
For a Class D mission, there is a formal process (at JPL, anyway) where
you go through the roughly 700 "Design Principles" and "Flight Project
Practices" and identify which ones you will comply with, which you
won't, and which are "comply with intent, but adjusted for this
mission". The DP and FPP are high-level documents that describe "stuff
you should do" - things like "you should have no more than 30% CPU
loading at PDR" or "you shouldn't discharge the batteries more than X%".
The result of this process (which takes a few months) is a list of
blanket waivers - for instance, maybe you don't need to have independent
people do a worst case analysis or parts stress analysis of all your
circuits - you trust in the experience of the engineer doing the design,
and they do some informal analysis (a spreadsheet of voltage rating vs
voltage it sees in the circuit). A big one is getting waivers to not
have inspection and test at ALL levels of integration - you can assemble
the whole thing, test it as a whole, at the risk of discovering a
problem late in the project. For instance, you assume all the
transistors are good from the mfr, and that the board is correctly
assembled by the automated fab, so you don't need electrical test. You
plug the board in, and if the system doesn't work, you have a spare you
can swap in. On the other hand, if it takes 6 months to dismantle the
spacecraft and extract the failed board, you probably won't get the
waiver. My spacecraft were easy - you could assemble or disassemble
them into their component assemblies in less than a day - so it wasn't
schedule risk, it was "will we break something by handling it" - little
teeny connectors are fragile.
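The informal parts-stress spreadsheet mentioned above (voltage rating vs. the voltage the part actually sees) amounts to a simple derating check; the reference designators, ratings, and the 50% derating factor below are illustrative assumptions, not values from any actual standard:

```python
# Informal parts-stress check: flag any part whose applied voltage
# exceeds a derated fraction of its rating. The parts list and the
# 0.5 derating factor are made-up illustrative values.

DERATING = 0.5  # use parts at no more than 50% of rated voltage

parts = [
    # (reference designator, rated volts, applied volts)
    ("C12", 16.0, 3.3),
    ("C15", 6.3, 5.0),   # applied exceeds 50% of rating -> flagged
    ("Q3", 40.0, 12.0),
]

violations = [
    (ref, rated, applied)
    for ref, rated, applied in parts
    if applied > DERATING * rated
]

for ref, rated, applied in violations:
    print(f"{ref}: {applied} V applied vs {rated} V rated "
          f"(limit {DERATING * rated:.2f} V)")
```

This is the "trust the engineer plus a spreadsheet" level of rigor, as opposed to an independent worst-case or parts-stress analysis of every circuit.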
Ultimately, whether your mission succeeds or fails - but especially if
it fails - we go back and look at all those waivers, and over a period
of years we decide, hmm, maybe we should change that because technology
has changed. Each time someone goes through the Class D process, the FPP
and DP get looked at, and if everyone is getting exempted from some
requirement, and has good reasons, then there's a rule change. But it's
slow.
And where there is a problem, maybe a new rule will be created - with
the large number of SmallSats (cubesats and slightly larger) being done
these days, you wind up with physical properties that are outside the
"traditional" experience. A 10 foot long flexible antenna sticking out
of a 1000kg spacecraft is mechanically very different from that same
antenna sticking out of a 5kg spacecraft.
And there will need to be new processes to deal with swarms and massive
constellations - NASA is used to flying one spacecraft, maybe 2 (MER) -
if there's a failure, it's a big deal. You convene a Failure Review
Board (FRB), you identify Corrective Actions, etc. If you fly 100
spacecraft to perform a function, and one fails, and the function is
still performed, meeting all requirements, is it a big deal? Maybe it's
just that the spacecraft have 90% reliability, and you planned for that
by launching 100 when you need 50 to make your measurement. Are you
going to convene an FRB for each failure? Or are you going to say - oh
yeah, that is an expected failure mode, we know it's random and not a
common design flaw among all 100, move on.
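The "launch 100 when you need 50" arithmetic above is a binomial tail probability. A quick check, assuming independent failures at the stated 90% per-spacecraft reliability:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the probability that at least
    k of n independent spacecraft survive, each with reliability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 100 launched, 90% reliability each, need 50 working for the measurement
p_mission = prob_at_least(50, 100, 0.9)
print(f"P(>=50 of 100 survive) = {p_mission}")
```

With a mean of 90 survivors and a standard deviation of 3, needing only 50 puts mission success essentially at certainty - which is exactly why a single expected random failure in such a constellation need not trigger an FRB.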
With a move to "statistics" instead of "build it perfect" - there will
be process changes - but there will need to be test data to back up the
statistics.