Fault tolerance
Fault tolerance is the ability of a system to maintain proper operation in the event of failures or faults in one or more of its components. Any decrease in operating quality is proportional to the severity of the failure, whereas in a naively designed system even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.[1]
A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely when some part of the system fails.[2] The term is most commonly used to describe computer systems designed to continue operation in the event of a partial failure, with perhaps a reduction in throughput or an increase in response time. That is, the system as a whole is not stopped due to problems either in the hardware or the software. Non-computing examples include a motor vehicle designed to remain drivable if one of the tires is punctured, or a structure that retains its integrity despite damage caused by fatigue, corrosion, manufacturing flaws, or impact.
For an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and by aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequences of a system failure are catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.
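The combination of duplication and reversion to a safe mode can be illustrated with a minimal sketch. It assumes redundant replicas that signal a fault by raising an exception; the names `run_with_fallback`, `primary_sensor`, `backup_sensor`, and `safe_default` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_fallback(replicas: Sequence[Callable[[], T]],
                      safe_mode: Callable[[], T]) -> T:
    """Try each redundant replica in turn; revert to safe mode if all fail."""
    for replica in replicas:
        try:
            return replica()
        except Exception:
            # Anticipated fault: this replica failed, so try the next one.
            continue
    # Every replica failed: fall back to a reduced but safe behaviour.
    return safe_mode()


# Hypothetical components, used only to exercise the sketch.
def primary_sensor() -> float:
    raise RuntimeError("primary sensor offline")  # simulated fault


def backup_sensor() -> float:
    return 21.5


def safe_default() -> float:
    return 0.0


# The backup replica masks the primary's failure.
reading = run_with_fallback([primary_sensor, backup_sensor], safe_default)
# With no working replica, the system reverts to its safe mode.
degraded = run_with_fallback([primary_sensor], safe_default)
```

In this sketch the failure of the primary replica is invisible to the caller, while the loss of every replica results in a degraded but defined result rather than an unhandled failure.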
Providing fault-tolerant design for every component is normally not an option. The associated redundancy brings a number of penalties: increases in weight, size, power consumption, and cost, as well as in the time needed to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant:[19]
How critical is the component?
How likely is the component to fail?
How expensive is it to make the component fault tolerant?
An example of a component that passes all the tests is a car's occupant restraint system. Although it is not normally thought of as one, the primary occupant restraint system is gravity. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so the first test is passed. Accidents causing occupant ejection were quite common before seat belts, so the second test is passed. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so the third test is passed. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as airbags, are more expensive and so pass that test by a smaller margin.
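The three tests applied in the seat-belt example (criticality, likelihood of failure, and cost of redundancy) can be written as a simple screening function. This is only a sketch: the `Component` type, its yes/no fields, and the "cabin radio" entry are assumptions made for illustration, and a real assessment would weigh these factors by degree rather than as booleans.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    critical: bool               # test 1: is the component critical?
    likely_to_fail: bool         # test 2: is its primary mechanism likely to fail?
    redundancy_affordable: bool  # test 3: is redundancy cheap enough?


def should_be_fault_tolerant(component: Component) -> bool:
    """A component qualifies only if it passes all three tests."""
    return (component.critical
            and component.likely_to_fail
            and component.redundancy_affordable)


# Occupant restraint: critical, gravity alone can fail, seat belts are cheap.
occupant_restraint = Component("occupant restraint", True, True, True)
# Hypothetical non-critical component that fails the first test.
cabin_radio = Component("cabin radio", False, True, True)

for c in (occupant_restraint, cabin_radio):
    print(c.name, should_be_fault_tolerant(c))
```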
Another excellent and long-term example of this principle being put into practice is the braking system. Whilst the actual brake mechanisms are critical, they are not particularly prone to sudden (rather than progressive) failure, and are in any case necessarily duplicated to allow even and balanced application of brake force to all wheels. Doubling up the main components further would also be prohibitively costly and would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (which can rust, stretch, jam, or snap) or hydraulic fluid (which can leak, boil and develop bubbles, or absorb water and thus lose effectiveness). Thus in most modern cars the footbrake hydraulic circuit is diagonally divided into two smaller circuits, each a smaller point of failure. The loss of either reduces brake power by only 50% and does not cause as dangerous a brake-force imbalance as a straight front-back or left-right split would. Should the hydraulic circuit fail completely (a relatively rare occurrence), there is a failsafe in the form of the cable-actuated parking brake. It operates the otherwise relatively weak rear brakes, but can still bring the vehicle to a safe halt in conjunction with transmission or engine braking, so long as the demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total footbrake failure and the need for harsh braking in an emergency will likely result in a collision, but still one at a lower speed than would otherwise have been the case.
In comparison with the foot-pedal-activated service brake, the parking brake itself is a less critical item. Unless it is being used as a one-time backup for the footbrake, it will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is built into it per se (and it typically uses a cheaper, lighter, but less hardwearing cable actuation system). If it fails on a hill, it can suffice to use the footbrake to hold the vehicle still momentarily before driving off to find a flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse, or First gear, and the transmission lock or engine compression used to hold the vehicle stationary, as these only need to keep the vehicle still and do not need the sophistication to bring it to a halt first.
On motorcycles, a similar level of fail-safety is provided by simpler methods. First, the front and rear brake systems are entirely separate, regardless of their method of activation (which can be cable, rod, or hydraulic), allowing one to fail entirely while leaving the other unaffected. Second, the rear brake is relatively strong compared to its automotive cousin, being a powerful disc on some sports models, even though the usual intent is for the front system to provide the vast majority of braking force. Because the overall vehicle weight is more central, the rear tire is generally larger and has better traction, and the rider can lean back to put more weight on it, more brake force can be applied before the wheel locks. On cheaper, slower utility-class machines, even if the front wheel uses a hydraulic disc for extra brake force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum. This is thanks to the ease of connecting the foot pedal to the wheel in this way and, more importantly, the near impossibility of catastrophic failure even if the rest of the machine, like a lot of low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance.
Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:
Related terms
There is a difference between fault tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems had failure rates of two hours per forty years, and therefore were highly fault resistant. But when a fault did occur, they still stopped operating completely, and therefore were not fault tolerant.
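To put the crossbar figure in perspective, the short calculation below converts it to an availability fraction. It assumes that "two hours per forty years" means two hours of total downtime over a forty-year service period, and it uses an average year of 365.25 days; both are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length, in hours

downtime_hours = 2.0
period_hours = 40 * HOURS_PER_YEAR

# Fraction of the period during which the system was operating.
availability = 1 - downtime_hours / period_hours

print(f"availability = {availability:.7f}")
```

Even at this level of availability the system counts as fault resistant rather than fault tolerant, because a fault, when it did occur, stopped operation completely.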