[time-nuts] DC distribution
jimlux
jimlux at earthlink.net
Sun Oct 6 03:03:13 UTC 2019
On 10/5/19 3:35 PM, Hal Murray wrote:
>
> jimlux at earthlink.net said:
>> There is *great* resistance to changing any assembly and workmanship
>> standard - nobody wants to be the person who says "we don't need to do
>> *that* anymore" and then a disaster happens, and one of the potential causes
>> is "you didn't do *that*"
>
>> It is entirely possible that the original rationale and explanation is no
>> longer valid.
>
> There is also a risk of troubles because you are still doing *that*.
>
> Do the people who maintain the rules occasionally look around to see if a
> better way has been developed?
>
TL;DR: Yes, but...
Sore point there - since my job these days is managing what are called
Risk Class D missions, for which some of the (perceived) risk is that
you don't have to follow all the process that is typical for Class A, B,
and C missions. And I've had in-flight failures where the spacecraft was
lost (Ouch!). There's the question of "should we have followed some
process that we didn't follow?" The idea is that process is expensive,
and knowing acceptance of risk allows you to do things you could not
otherwise do.
NASA divides missions into risk classes (NPR 8705.4) in terms of the
"consequences of failure," "national significance," "difficulty of
reflight," or cost, ranging from A down to D. Class A is human
spaceflight or a multibillion-dollar flagship; Class B is things like
Mars rovers; Class C is the "less than two year mission that costs less
than $100M" kind of thing; Class D is "ok if it fails".
There is an enormous amount of "standard practice" for NASA missions -
often derived from long experience or, perhaps, from some "bad day",
after which a process/rule gets created that says "we're not going to do
that again".
It's important to know that NASA, in general, does not do "reliability
calculations" in a MIL-HDBK-217 way - there's no stacking up of
individual part reliabilities to get an estimated system MTBF. This is
historical - NASA typically builds "just one unit" (maybe 2 or 3) - so
there's no chance to do life testing and build up statistics. I think
(Jim's opinion) that when they started coming up with process, the
part/assembly reliability data had huge variances, so the resulting MTBF
predictions spanned a wide range, or worse yet, said "failure is
certain". There's also a problem that parts reliability probably isn't
the dominant factor for reliability - it's design (is that wire under
tension causing it to break with thermal cycles) or workmanship (not in
a good/bad sense, but a variability sense).
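For contrast, the MIL-HDBK-217-style "stacking" calculation described above - the one the post says NASA generally does not do - is just a sum of constant part failure rates for a series system, with the system MTBF as the reciprocal. A minimal sketch (the part names and FIT values here are invented for illustration, not from any real parts database):

```python
# MIL-HDBK-217-style parts-count prediction for a series system:
# every part must work, so the (assumed constant) part failure rates
# add, and the system MTBF is the reciprocal of the summed rate.
# Part names and FIT values are made-up illustrative numbers.

FIT = 1e-9  # 1 FIT = 1 failure per 10^9 part-hours

part_failure_rates_fit = {
    "op-amp": 100.0,
    "tantalum cap": 50.0,
    "power MOSFET": 200.0,
}

lambda_system = sum(part_failure_rates_fit.values()) * FIT  # failures/hour
mtbf_hours = 1.0 / lambda_system

print(f"system failure rate: {lambda_system:.3e} /hr")
print(f"predicted MTBF: {mtbf_hours:.0f} hours")
```

The post's point is that when the individual rates carry huge variances, the summed prediction inherits them, so the resulting MTBF range can be too wide to be useful.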
So there are tons of process to try and drive the variability of
workmanship down - You don't just tighten a fastener, you torque it to a
specified level, determined (in theory) by the design loads, etc.; and
someone witnesses the torquing to make sure someone didn't forget to
install the bolts. Mistakes happen - the system tends to get paperwork
heavy - and disasters have occurred because someone ignored the evidence
of their hands/eyes and trusted the paper - NOAA-N Prime is the case in
point. Interestingly, these are called "process escapes" - and there's
a huge amount of work (multiple work years even for things where nothing
bad happened) that goes into determining why someone did something
that's outside the process - was it just a bad day? is the process
itself inconvenient or incomplete, etc. There is an intense amount of
contemplation on changing the process - typically it was created because
of a single bad event (NASA just doesn't build that many things), it
addressed the causes of that event and appears to be a "good idea" for
the future. It then becomes part of the "received wisdom of the ages"
and everyone does it - until some event triggers a reevaluation.
In general, the system is set up so that it's easier to just "do the
standard thing" than to get a waiver to not do it. Getting the waiver
typically requires that you *prove* in some sense that it won't increase
risk, or that you've somehow backed yourself into a corner and there's
no way to get the job done without it. The latter is the "willing
acceptance of risk" and there's a lot of people who have to sign off on
it - The NASA administrator does NOT want to sit in front of Congress
explaining why a $500M mission was lost because a waiver was issued to
not do something. "You mean, sir, that we saved a few hours labor and it
cost us $500M?" You don't get to say "There were 10,000 things, each
that are individually a good idea, but if we did them all, the mission
would have cost $1B, and you only gave us $500M"
For a Class D mission, there is a formal process (at JPL, anyway) where
you go through the roughly 700 "Design Principles" and "Flight Project
Practices" and identify which ones you will comply with, which you
won't, and which are "comply with intent, but adjusted for this
mission". The DP and FPP are high-level documents that describe "stuff
you should do" - things like "you should have no more than 30% CPU
loading at PDR" or "you shouldn't discharge the batteries more than X%".
The result of this process (which takes a few months) is a list of
blanket waivers - for instance, maybe you don't need to have independent
people do a worst case analysis or parts stress analysis of all your
circuits - you trust in the experience of the engineer doing the design,
and they do some informal analysis (a spreadsheet of voltage rating vs
voltage it sees in the circuit). A big one is getting waivers to not
have inspection and test at ALL levels of integration - you can assemble
the whole thing, test it as a whole, at the risk of discovering a
problem late in the project. For instance, you assume all the
transistors are good from the mfr, and that the board is correctly
assembled by the automated fab, so you don't need electrical test. You
plug the board in, and if the system doesn't work, you have a spare you
can swap in. On the other hand, if it takes 6 months to dismantle the
spacecraft and extract the failed board, you probably won't get the
waiver. My spacecraft were easy - you could assemble or disassemble
them into their component assemblies in less than a day - so it wasn't
schedule risk, it was "will we break something by handling it" - little
teeny connectors are fragile.
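The informal parts-stress spreadsheet mentioned above (voltage rating vs. the voltage the part actually sees) amounts to a simple derating check; the reference designators, ratings, and the 50% derating factor below are illustrative assumptions, not values from any actual standard:

```python
# Informal parts-stress check: flag any part whose applied voltage
# exceeds a derated fraction of its rating. The parts list and the
# 0.5 derating factor are made-up illustrative values.

DERATING = 0.5  # use parts at no more than 50% of rated voltage

parts = [
    # (reference designator, rated volts, applied volts)
    ("C12", 16.0, 3.3),
    ("C15", 6.3, 5.0),   # applied exceeds 50% of rating -> flagged
    ("Q3", 40.0, 12.0),
]

violations = [
    (ref, rated, applied)
    for ref, rated, applied in parts
    if applied > DERATING * rated
]

for ref, rated, applied in violations:
    print(f"{ref}: {applied} V applied vs {rated} V rated "
          f"(limit {DERATING * rated:.2f} V)")
```

This is the "trust the engineer plus a spreadsheet" level of rigor, as opposed to an independent worst-case or parts-stress analysis of every circuit.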
Ultimately, whether your mission succeeds or fails - but especially if
it fails - we go back and look at all those waivers, and over a period
of years we decide, hmm, maybe we should change that because technology
has changed. Each time someone goes through the Class D process, the FPP
and DP get looked at, and if everyone is getting exempted from some
requirement, and has good reasons, then there's a rule change. But it's
slow.
And where there is a problem, maybe a new rule will be created - with
the large number of SmallSats (cubesats and slightly larger) being done
these days, you wind up with physical properties that are outside the
"traditional" experience. A 10 foot long flexible antenna sticking out
of a 1000kg spacecraft is mechanically very different from that same
antenna sticking out of a 5kg spacecraft.
And there will need to be new processes to deal with swarms and massive
constellations - NASA is used to flying one spacecraft, maybe 2 (MER) -
if there's a failure, it's a big deal. You convene a Failure Review
Board (FRB), you identify Corrective Actions, etc. If you fly 100
spacecraft to perform a function, and one fails, and the function is
still performed, meeting all requirements, is it a big deal? Maybe it's
just that the spacecraft have 90% reliability, and you planned for that
by launching 100 when you need 50 to make your measurement. Are you
going to convene an FRB for each failure? Or are you going to say - oh
yeah, that is an expected failure mode, we know it's random and not a
common design flaw among all 100, move on.
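The "launch 100 when you need 50" arithmetic above is a binomial tail probability. A quick check, assuming independent failures at the stated 90% per-spacecraft reliability:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the probability that at least
    k of n independent spacecraft survive, each with reliability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 100 launched, 90% reliability each, need 50 working for the measurement
p_mission = prob_at_least(50, 100, 0.9)
print(f"P(>=50 of 100 survive) = {p_mission}")
```

With a mean of 90 survivors and a standard deviation of 3, needing only 50 puts mission success essentially at certainty - which is exactly why a single expected random failure in such a constellation need not trigger an FRB.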
With a move to "statistics" instead of "build it perfect" - there will
be process changes - but there will need to be test data to back up the
statistics.