Air University Review, July-August 1976

Seeking Failure-Free Systems

Major General Richard E. Merkling

Failure-free systems are somewhat like a perfect accident rate--easy to talk about but very difficult to attain. And without failure-free systems we will never have a perfect accident rate. The prevention of accidents is especially significant when expensive and sometimes virtually irreplaceable equipment is involved.

The development and maintenance of failure-free systems require a lot of hard work from everyone associated with a weapon system from design, through life cycle, to termination. We who are in the safety business work especially hard because of our direct responsibility in prevention. At the same time, we recognize that success depends on everyone connected with the system--the operators, maintainers, builders, and designers. Air Force safety history is replete with experiences in which we have sustained substantial weapon system losses because of built-in deficiencies. Often, the causes were so deeply embedded in the basic design that it was impossible to eliminate them even after they had been identified.

Over the years, we have made significant progress toward achieving that generally elusive goal of zero weapon system losses due to accidents. In 1943, there were more than 20,000 major aircraft accidents within the continental United States, only part of the total--because of the war we were not counting those overseas. Some 5600 persons lost their lives in those stateside accidents.

By 1955, our rate was down to 17 aircraft accidents per 100,000 hours flown, but even at that point we had 1600 aircraft accidents and more than 800 people lost their lives. As one Air Force leader after another worked the problem, we continued to lower the number of aircraft accidents, until in 1975 we experienced 116 aircraft flight accidents for a major aircraft accident rate of 2.8. Commendable--yes! But in 1975 alone, USAF aircraft mishaps cost the American people more than $250 million.

In the late 1950s a popular economist published a best-seller entitled The Affluent Society. Both the phrase and the idea seemed to reflect the attitudes of the people. We Americans have always cherished the notion that we could do anything if we would just spend enough money. And there always seemed to be a group that felt we had the money to do whatever we wanted to do. However, I believe that recent events and economic and resource conditions may refute those premises.

In fact, we see our top leadership continually wrestling with the problems of less real buying power in today's budget. While the defense budget in 1975 was $88.9 billion as compared to only $53.6 billion in 1964, in terms of today's dollar this is in fact a reduction of more than $2 billion in real buying power. What is perhaps even more concerning is that during the same period, the portion allotted to procurement decreased almost $9 billion in terms of real buying power; and the portion allotted to Research, Development, Test, and Evaluation (RDT&E) decreased more than $3 billion.

What does this tell the military manager, the planner, or the operator? Obviously, in very broad terms, it sets forth serious challenges and restrictions. What does this tell the weapon system developer and those of us who are charged with protecting that system from accidental loss? Quite simply, it tells us that we must do a better job.

Traditionally, accident-prevention programs have been founded on a mode of operation that essentially waited for accidents to occur, parts to fail, and people to make errors. Then we corrected procedures, redesigned parts, or restricted operations. We can no longer operate in such a manner. We cannot risk the loss of a weapon system costing $50-$70 million to identify the flaw in the design, the part that will fail under stress, or, perhaps in today's sophisticated systems, the circuit that has an alternate route built into it or a flaw in its logic.

We must take a disciplined approach to these problems. One very promising approach is through system safety engineering. The Air Force concept of system safety is that safety must be considered in the original concept, predesign, design, and test phases of any development to achieve the greatest effect.

I do not know of a system program manager who has not been faced with the task of meeting performance standards. As we push the state of the art, this becomes at times an extremely difficult if not impossible task.

While the manager and engineers are striving to develop a system to do that which has not been attainable before, or in ways not previously possible, there are those who demand that a schedule be met. Frequently, these schedules are based on real-world needs. Often in today's decision-making processes, decisions to proceed are delayed time after time as we pursue alternatives, tradeoff studies, independent reviews, and the like. Recently, during these delays, inflation has been spiraling steadily upward, and now the manager's program has increased in cost and the go-ahead approval process is further delayed as he and his staff are required to revalidate and rejustify cost estimates.

Frequently, a program is stalled while we debate the risks of what initially may be perceived as concurrent development, testing, and production. By the time the argument is finally resolved, if there is a real need for the system in the operational forces--and there usually is--we have lost valuable development time. And we have now ensured that to meet a firm initial operational-capability date, we have to accept a greater degree of development/test concurrency.

While the program manager is fighting all of these problems, here comes a safety person--and it does not much matter whether he is a member of the design group, from the management or corporate level of the contractor's company, or from the USAF Directorate of Aerospace Safety--with a request, a plan, sometimes even a demand for expenditure of system safety engineering effort. But does he also say, "I, as the safety man, have 'X' number of dollars to add to your program to cover the costs of the analyses I am requesting you to undertake?" No--safety does not have a line item in the budget; we are like the poor country cousin--a great many wants and very few, if any, dollars. Now, we have further complicated the program manager's task of satisfying the cost, performance, and schedule aspects of his program by also asking him to invest a sizable amount of manpower and dollars in some vague element called safety. To make the problem even more troublesome, we have a difficult, if not impossible, time quantifying the value of efforts invested in safety during the design and development cycle.

Earlier, I commented on the dollar losses due to aircraft accidents. Let's look at that just a bit further. In the years 1971 through 1974, aircraft accidents in nine of our major systems cost $774 million. The four most costly systems were the F-111, F-4, B-52, and C-5 at $213, $209, $68, and $57 million respectively. Admittedly, this does not tell the entire story because of different exposures and missions.

Figure 1. USAF major accidents versus cost


What is significant, however, is that generally speaking about 30 percent of these accidents were credited to material cause factors, which closely approximates the overall Air Force experience for all aircraft weapon systems. (See Figure 1.) An additional fact of some importance is that, while in the last 25 years we have made a notable reduction in total aircraft accidents and rates, we have not significantly reduced the proportion of this overall experience credited to material problems.

Although we have eliminated many of our past deficiencies, systems today are perhaps an order of magnitude more complex than they were 25 years ago. And they are, in a number of instances, almost that same order of magnitude more expensive as well. Until just recently, we have done very little to attack these problems systematically. For example: How many aircraft flying today have the nose gear steering on the same hydraulic system as the wheel brakes? Even the simplest system analysis would reveal that, in terms of safety, a single failure which deprives us of the wheel brakes should not also eliminate our ability to steer the aircraft during the landing phase. Also we have long recognized the severe threat that fire poses to airplanes. Yet how long has it taken us to change our designs so that fuel, electrical, and hydraulic lines do not run unprotected and grouped together immediately adjacent to the hot section of the engine?
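The steering and brakes question above amounts to a single-failure check that can be sketched in a few lines. The sketch below is illustrative only: the function names, the hydraulic system labels, and the assignment of functions to systems are hypothetical assumptions and do not describe any particular aircraft.

```python
from collections import defaultdict

# Hypothetical mapping of landing-phase functions to the hydraulic
# system that powers each one (all labels are illustrative assumptions).
POWER_SOURCE = {
    "wheel_brakes": "hydraulic_A",
    "nose_gear_steering": "hydraulic_A",
    "flaps": "hydraulic_B",
}

# Groups of functions that should never be lost together to one failure.
MUST_NOT_FAIL_TOGETHER = [{"wheel_brakes", "nose_gear_steering"}]


def single_failure_hazards(power_source, groups):
    """Return (system, functions) pairs where the loss of one system
    removes every function in a protected group."""
    lost_with = defaultdict(set)
    for function, system in power_source.items():
        lost_with[system].add(function)

    hazards = []
    for system in sorted(lost_with):
        for group in groups:
            # The group is a hazard if it is a subset of what this system powers.
            if group <= lost_with[system]:
                hazards.append((system, sorted(group)))
    return hazards


hazards = single_failure_hazards(POWER_SOURCE, MUST_NOT_FAIL_TOGETHER)
for system, functions in hazards:
    print(f"Loss of {system} removes: {', '.join(functions)}")
```

With the assumed mapping, the check flags the one hydraulic system that powers both brakes and steering. Moving either function to a separate system in the table makes the hazard list empty.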

These potential hazards seen in retrospect appear obvious, and one wonders why they were not recognized at the time of design. But there is another factor in this equation--man--and in this case, more explicitly, the engineer, the designer, and the manager. For many reasons, a specific technical design problem may be approached and argued differently even by experts in the same discipline as well as by managers or program directors. I think we must recognize clearly that even if we agree that system safety must be pressed--and pressed hard--in the early design and pre-production stages of a system's development, and even if somehow we find a way to fund the costly analyses that are frequently required to uncover failure modes and sneak circuits, our engineering knowledge may not be sufficient to point the way positively and to identify the real hazards.

In one of our current first-line aircraft, we made an engineering decision in the design phase to use a certain type of structural splice. This splice saved weight and appeared to have all of the necessary requirements of strength, producibility, integration with other members, and the like. Now, a number of years later and with some innovations in the analysis of structural failures called "fracture mechanics," we have found some disturbing data about the susceptibility of such a splice under the loads we ask it to carry. We have learned how very sensitive this joint is to manufacturing-induced minute cracks or abrasions within the holes used for the fasteners that hold the splice together.

Perhaps the real challenge in all of this is not one of attention, programming, or funding. Rather, it is our ability--having once designed a system--to be smart enough, then, to track through to the potential failure of the system, to find the key areas, and to determine the failure potential once the system is operationally mature. Frequently, a system may be relatively trouble free for the first few years of its operational life and then fail--not always as a result of wear or age but because of a latent design problem.

There is yet another area where our experience does not track back to before World War II, where we do not have the data from hundreds of smoking wrecks or thousands of pieces of paper documenting component failures. I am referring to the problem of analyzing the reliability of airborne computers and software, those marvels of today's science that permit us to print a complete memory or computer circuit on a chip the size of a pinhead; these advances allow us through multiplexing to use a single wire for a number of electrical signals. Such a system is used in an aircraft under development to achieve a significant weight savings; but what if--as a result of a sneak electron path or faulty logic resulting from an electronic crossover or interference--the gear should be lowered at supersonic speeds as the weapons operator prepares for weapons delivery? What if, in a fly-by-wire fighter employing negative static stability, a lightning discharge or the energy from a high-powered airborne enemy radar causes the circuit to falter or fail or just switch to an unplanned path within the computer circuitry?

We can postulate a large number of undesired events that may have a higher probability of occurring when we use the multitude of technological advances in computers, miniaturization, and electronics available to us today. With the growing use of computers on airborne systems--radars, remotely piloted vehicle (drone) control, weapons control, and fly-by-wire avionics--our rapid progress has created a new safety concern. How can we adequately conduct a safety analysis of weapon systems that have highly complex logic circuits and computers?

Certainly we cannot hope to accomplish the task using some of the methods of the past. Equally certain is the fact that we cannot rely totally on the design engineer to be completely aware of and catch all of the possible combinations and potentials for failure in his system as he initially formulates the design. Increased emphasis on system safety analyses of all types will help us meet this new challenge. We need to continue to look at the man-machine interface through analyses such as the operating hazard analysis and the fault tree analysis.

A special kind of operating hazard analysis was performed prior to the first flight of the B-1. The analysis simulated the failure of various "black boxes" on the B-1 and verified that the crew had a way of detecting the failure, taking corrective action, and keeping the aircraft under control. Several other system safety techniques were used on the B-1 to identify hazards caused by malfunctions in the computer and other hardware. For example, by use of failure modes and effects analysis (FMEA) and fault hazard analysis (FHA), the read/write memory chips on the B-1 were analyzed and hazards were identified.

However, we need a breakthrough to give us a faster, more economical way to conduct fault tree analysis. Failure modes and effects analysis and fault hazard analysis are "what happens if" type analyses and are limited in that they treat the failure of one component at a time. Multiple component failures and/or their subsequent cumulative effect on the systems are not considered--thus the need for the time-consuming fault tree type of analysis, which will handle multiple combinations of failures.
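The limitation of one-at-a-time analysis can be shown with a small sketch. The three components and the loss rule below are hypothetical assumptions chosen only to show how a redundant pair hides a hazard from a single-failure review; they are not drawn from the B-1 or any other program.

```python
from itertools import combinations

# Hypothetical components. The system is ASSUMED to survive as long as
# at least one pump and the flight computer are working.
COMPONENTS = ["pump_1", "pump_2", "flight_computer"]


def system_lost(failed):
    """Illustrative loss rule (an assumption, not a real design)."""
    both_pumps_out = {"pump_1", "pump_2"} <= failed
    return both_pumps_out or "flight_computer" in failed


def failure_sets_causing_loss(order):
    """All combinations of `order` simultaneous failures that lose the system."""
    return [set(c) for c in combinations(COMPONENTS, order) if system_lost(set(c))]


# What a one-component-at-a-time analysis can see.
single = failure_sets_causing_loss(1)

# Double failures that do not merely contain an already-known single cause.
newly_found = [
    pair for pair in failure_sets_causing_loss(2)
    if not any(cause <= pair for cause in single)
]

print("Single failures causing loss:", single)
print("Double failures missed by single-failure analysis:", newly_found)
```

Under these assumptions the single-failure pass finds only the flight computer, while the loss of both pumps appears only when combinations are examined. The number of combinations grows quickly with the number of components, which is one reason the combined analysis is time-consuming.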

The fault tree analysis, incidentally, is a deductive method used to investigate a specific undesired event (such as "loss of radar facility by fire"). Starting with the undesired event, a logic diagram (tree) is constructed which considers all known circumstances that can lead to the top event, either alone or in combination. (See Figure 2.)
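The deductive construction described above can be sketched as a small tree of AND and OR gates from which the combinations of basic events leading to the top event are derived. The top event name is borrowed from the example in the text; the intermediate structure and the basic events are hypothetical assumptions for illustration, not a reproduction of Figure 2.

```python
from itertools import product


# A node is either a basic-event name (str) or a (gate, children) tuple.
def AND(*children):
    return ("AND", list(children))


def OR(*children):
    return ("OR", list(children))


def cut_sets(node):
    """Return the sets of basic events that together cause `node`."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child's cut set is enough.
        return [s for sets in child_sets for s in sets]
    # AND: one cut set from every child must occur together.
    return [frozenset().union(*combo) for combo in product(*child_sets)]


def minimal(sets):
    """Drop any cut set that strictly contains another cut set."""
    unique = set(sets)
    kept = [s for s in unique if not any(other < s for other in unique)]
    return sorted(kept, key=lambda s: (len(s), sorted(s)))


# Hypothetical tree for the top event "loss of radar facility by fire".
fire_starts = OR("electrical_short", AND("fuel_leak", "ignition_source"))
fire_not_suppressed = OR("detector_fails", "extinguisher_fails")
top_event = AND(fire_starts, fire_not_suppressed)

minimal_cut_sets = minimal(cut_sets(top_event))
for events in minimal_cut_sets:
    print(" + ".join(sorted(events)))
```

In this assumed tree no single basic event reaches the top event. Every minimal cut set contains at least two events, which is the kind of multiple-failure combination that a one-at-a-time analysis does not examine.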

But can we defend the cost of these analyses in a program budget? At every level of program review and project approval, the question of whether system safety is a worthwhile endeavor must be pursued. Regrettably, we have yet to find a good way to articulate the benefits of such efforts in the life cycle cost considerations. This is particularly true if the analyses are successful and we do not have the accident-producing failures. I believe some managers have for too long been primarily interested in cost, schedule, and technical performance. We need to express the need for system safety within the constraints of these classical areas. At times, it would seem to be done more easily if the military were as profit-and-loss oriented as commercial companies.

Figure 2. Fault tree analysis


Ideally, we should have system safety engineering deeply involved from the very outset of a system's development life. Often our definition of a Required Operational Capability (ROC) tries to incorporate too much into a single package, and we wind up with a system that, rather than doing a few jobs extremely well, does many things only fairly well. Frequently, a complexity also results that fosters the potential for accidents.

We need a well-defined plan for the incorporation of system safety work. While it is important, for efficiency's sake, that efforts by system safety not duplicate similar work being done by the reliability, maintainability, and human factors personnel, it is equally important that, as we do these other tasks, they incorporate to the maximum extent possible items related to system safety. To do this, a plan is needed. However, perhaps even more basic is that the system program manager needs to realize that these efforts are complementary and that they support and include the safety portion. For example, if the Required Operational Capability developed by a using command included safety design criteria or requested a safety review of the system design before final go-ahead, we would have made a giant step toward catching the attention of our development community.

Another word of caution--it is very easy to lose the real meaning of what some of our simplified mathematical expressions are trying to tell us. For example, the level of reliability we are attempting to achieve in one new aircraft program is expressed in these terms, where the "x 10^-5" means "per 100,000 flight hours":

Major accidents     5 x 10^-5
Aircraft destroyed  3.72 x 10^-5

These are harmless sounding numbers and ones that I feel may give a false sense of security. Let's take these one step further, assuming a 15-year system life, some 1600 aircraft flying approximately 300 hours per year for a total program of 7.2 million flight hours. What these figures are telling us is that, if the weapon system costs approximately $4.6 million per copy, we will invest in excess of $1 billion in aircraft losses over the life span of the system. I wonder how many of the top program review panels and individuals considered the safety level of effort in these terms? And as though this were not enough, how do we handle the problems associated with a production decision that evolved from a prototype design demonstration effort, such as the F-16? In a design-to-cost prototype effort with high value given to performance, how can we expect a program manager to devote critical funds for long-term safety considerations? Once we have bought the system, how do we convince a manager to go back and redesign or study systems that have been incorporated and seem to be doing the job satisfactorily? How can we restructure the impression of system safety engineering from something we "buy" or "add on" to a "way of life"?
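The arithmetic behind the billion-dollar figure can be worked through directly. The sketch below uses only the numbers quoted in the text (fleet size, annual hours, system life, the two rates, and unit cost); the variable names are illustrative, and treating the rates as constant over the whole system life is a simplifying assumption.

```python
# Figures quoted in the text.
AIRCRAFT = 1600
HOURS_PER_AIRCRAFT_PER_YEAR = 300
SYSTEM_LIFE_YEARS = 15
MAJOR_ACCIDENT_RATE = 5e-5      # per flight hour, i.e. 5 per 100,000 hours
DESTROYED_RATE = 3.72e-5        # per flight hour, i.e. 3.72 per 100,000 hours
UNIT_COST_MILLIONS = 4.6        # dollars, in millions, per aircraft

# Total program exposure in flight hours.
total_hours = AIRCRAFT * HOURS_PER_AIRCRAFT_PER_YEAR * SYSTEM_LIFE_YEARS

# Expected counts, assuming the rates hold constant over the system life.
expected_major_accidents = MAJOR_ACCIDENT_RATE * total_hours
expected_destroyed = DESTROYED_RATE * total_hours

# Replacement value of the aircraft expected to be destroyed.
expected_loss_millions = expected_destroyed * UNIT_COST_MILLIONS

print(f"Total flight hours:       {total_hours:,}")
print(f"Expected major accidents: {expected_major_accidents:.0f}")
print(f"Expected aircraft lost:   {expected_destroyed:.0f}")
print(f"Expected loss:            ${expected_loss_millions:,.0f} million")
```

On these figures the program accumulates 7.2 million flight hours, with about 360 major accidents and about 268 aircraft destroyed, for roughly $1.23 billion in aircraft losses.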

In yet another aspect of system engineering under the American competitive system, we seem to repeat mistakes rather consistently and have to relearn costly development design lessons. Sometimes we seem not to learn them at all. I would like to think that, through the use of up-to-date design handbooks, we can improve our "corporate memory" and pass on the lessons we have learned. But even here we encounter severe problems in updating the design handbooks, in obtaining timely feedback from ongoing programs, and in accurately detailing pitfalls to be avoided.

I have outlined a number of problems and obstacles and have presented no specific answers or solutions. This should in no way be construed as defeatist or negative. I am firmly convinced that the cost, complexity, and defense values of our new systems are such that we must pursue and achieve ways of handling these. This must be done in the same spirit with which our pioneer forefathers opened the West and, more recently, we put a man on the moon. I have that same optimistic spirit that leads me to believe that, if we sincerely put our minds to it, ways can be developed to achieve the necessary analysis and review techniques, but we must recognize and define the problem before us.

We must sincerely support the goal of developing failure-free systems, and we must place this goal in proper perspective with other requirements.

Norton AFB, California

America lives in the heart of every man everywhere who wishes to find a region where he will be free to work out his destiny as he chooses.

Woodrow Wilson, April 1912.


Contributor

Major General Richard E. Merkling (M.S., George Washington University) is Director of Aerospace Safety, Air Force Inspection and Safety Center, Norton AFB, California. After flying training and a combat tour in Korea, he served as fighter gunnery instructor, Laughlin AFB, Texas, and test pilot at Eglin and Edwards. He has served as an RF-101 reconnaissance pilot at Misawa AB, Japan, and as a member of the Fifth AF Tactical Evaluation Team, flying F-100s and RF-101s. Other assignments include Chief, Tactical Fighter Branch, DCS/R&D; Commander, 388th Tactical Fighter Wing, Korat Royal Thai AFB; Chief, Air Section, Operations Division, SHAPE, Belgium; and DCS/O, Fourth Allied Tactical Air Force, Germany. General Merkling is a graduate of Squadron Officer School and Air War College.

Disclaimer

The conclusions and opinions expressed in this document are those of the author cultivated in the freedom of expression, academic environment of Air University. They do not reflect the official position of the U.S. Government, Department of Defense, the United States Air Force or the Air University.

