Ways to Implement Redundancy in High-Reliability Embedded Systems
Some embedded systems are simple enough that they need no backup if the device fails, and some are even considered disposable. But when high reliability is demanded, the design should include some form of failsafe or redundancy. When an important peripheral or subsection fails, what can the device do to ensure it can still perform its main functions?
Redundancies in high-reliability systems take many forms, from duplicate boards and circuits to overdesigned electrical and mechanical protections on the PCB. For our purposes, we'll focus on the electrical side of the design, which involves certain types of monitoring and switchover to redundant boards or circuits. Redundancy has to be designed both at the hardware level and in the embedded application, but when done correctly it can keep a system operating for extended periods in harsh conditions.
Most Common Types of Redundancy
Hardware redundancy in embedded systems most often involves duplicate circuits and PCBs. In large systems that must have extended operating times, a design might implement a modular approach with redundant modules placed in a product. In smaller devices, redundancy might simply be duplicated circuits that can be switched on when a primary circuit fails. The main processor or memories can also be redundant, such as on a separate management PCB.
An embedded application needs to continuously monitor certain signals on the hardware to ensure the system is meeting its uptime requirement. It may have to implement a process such as that shown below to carry out switchover between redundant circuits.
Obviously, the embedded application has a lot of work to do between processing data from peripherals and monitoring whether the peripherals are working. How exactly this is implemented in code or in hardware depends on several factors:
- Is the application running a main loop, or does it receive external commands?
- Do peripherals need a startup or shutdown procedure, and can this be easily applied?
- What is the processor monitoring in order to determine whether a circuit or subsystem has failed?
- Are only momentary failures being addressed, or is there a possibility of permanent failure?
These points relate to the operating environment in which the system will be deployed, as well as what the system needs to do or monitor. Implementation happens both in the code and in the hardware as outlined in the next section.
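The last question in the list, momentary versus permanent failure, often comes down to how many bad readings in a row the application tolerates before it gives up on a circuit. The following is a minimal sketch of that idea, not a prescribed method: the `FaultQualifier` class, the `Health` enum, and the failure limit are all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of a monitored circuit's health.
enum class Health { Ok, Momentary, Permanent };

// Counts consecutive failed checks. A fault is treated as permanent only
// after `limit` failures in a row; any passing check clears the count.
// Once a permanent fault is declared, it stays latched.
class FaultQualifier {
public:
    explicit FaultQualifier(std::uint8_t limit) : limit_(limit) {
        assert(limit > 0);  // a limit of zero would make no sense here
    }

    // Feed one check result per monitoring cycle (true = check passed).
    Health update(bool checkPassed) {
        if (latched_) return Health::Permanent;
        if (checkPassed) {
            failures_ = 0;
            return Health::Ok;
        }
        if (++failures_ >= limit_) {
            latched_ = true;
            return Health::Permanent;
        }
        return Health::Momentary;
    }

private:
    std::uint8_t limit_;
    std::uint8_t failures_ = 0;
    bool latched_ = false;
};
```

With a limit of 3, a single bad reading followed by a good one is reported as a momentary fault and then cleared, while three bad readings in a row latch a permanent fault. Switchover logic could then act only on `Health::Permanent`, which avoids switching circuits on a transient glitch.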
Hardware Monitoring in Code
The code needed to monitor hardware is usually simple. A signal (either digital or analog) is measured directly and checked against a logical condition, and the result is typically held in a boolean variable. A basic code snippet (Arduino) with a loop is shown below:
This code snippet is relatively simple, but it illustrates how a system decides on its own to switch to the redundant circuit. Once the logical condition for normal operation returns FALSE, the system can autonomously switch to the working copy of the circuit. The working copy may require its own startup procedure, power-up of supplies or peripherals, and possibly a notification to the user that switchover has occurred. All of this increases the complexity of switchover, as illustrated in the next example.
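A minimal sketch of this kind of loop follows. It is written as host-side C++ so it can be run without a board: the `Board` struct stands in for real pins, and the threshold values and the `monitorStep` name are assumptions for illustration. In an actual Arduino sketch, the fields would be replaced by `analogRead()` and `digitalWrite()` calls and `monitorStep` would be called from `loop()`.

```cpp
#include <cassert>

// Host-side stand-in for the board (an assumption for illustration).
struct Board {
    int  senseValue    = 512;    // simulated ADC reading from the primary circuit
    bool primaryEnable = true;   // simulated GPIO driving the primary circuit
    bool backupEnable  = false;  // simulated GPIO driving the redundant circuit
};

// Hypothetical window of readings that count as "primary circuit healthy".
constexpr int kMinValid = 200;
constexpr int kMaxValid = 800;

// One pass of the main loop: measure, evaluate the condition, switch if needed.
// `usingBackup` is the boolean that records which copy is active.
void monitorStep(Board& board, bool& usingBackup) {
    // Invariant: the two copies are never enabled at the same time.
    assert(!(board.primaryEnable && board.backupEnable));
    if (usingBackup) return;  // already switched over

    const int  reading   = board.senseValue;
    const bool primaryOk = reading >= kMinValid && reading <= kMaxValid;
    if (!primaryOk) {
        board.primaryEnable = false;
        board.backupEnable  = true;
        usingBackup = true;
    }
}
```

The switch here is one-way: once the primary circuit fails the check, the application stays on the redundant copy even if the reading later returns to the valid window.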
To switch to a redundant circuit, the redundant circuit could be designed simply as a repeated element, or by repeating large sections of the entire system. The diagram below shows two possible examples involving an ASIC with its supporting passives, as well as the power and startup section.
In the first topology, some indicator on the ASIC or its incoming data stream is monitored directly. This verifies the component is active and interacting with the rest of your system. The ASIC shares the power bus with the rest of the system and is brought online only through its ENABLE pin, so the redundant copy can be brought online the same way by toggling ENABLE from a GPIO.
The second option is a bit more complex, as it requires monitoring multiple sections of the system. Power, signal, and any indicators (e.g., PGOOD) are all monitored together. When switchover occurs, power also needs to be switched, sequencing started, and the ASIC core started with an ENABLE pin if available. This uses more GPIOs on the main processor and increases the cost and size of the overall system.
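One way to monitor a data stream directly is to watch a counter that should keep advancing (for example, frames received from the ASIC) and swap the ENABLE pins if it stalls. The sketch below assumes such a counter exists; `ActivityWatch`, `EnablePins`, and the timeout are hypothetical, and the pins are simulated as plain booleans rather than real GPIO.

```cpp
#include <cassert>
#include <cstdint>

// Simulated GPIO outputs driving the ENABLE pins of the two ASIC copies.
struct EnablePins {
    bool primary = true;
    bool backup  = false;
};

// Watches an activity counter and swaps the ENABLE pins if the counter
// stops advancing for `timeoutTicks` consecutive polls.
class ActivityWatch {
public:
    explicit ActivityWatch(std::uint32_t timeoutTicks) : timeout_(timeoutTicks) {
        assert(timeoutTicks > 0);
    }

    // Call once per tick with the current activity count.
    // Returns true only on the call that performs the switchover.
    bool poll(std::uint32_t activityCount, EnablePins& pins) {
        if (pins.backup) return false;  // already on the copy; nothing to switch to
        if (activityCount != lastCount_) {
            lastCount_ = activityCount;  // stream is alive, reset the idle timer
            idleTicks_ = 0;
            return false;
        }
        if (++idleTicks_ < timeout_) return false;
        pins.primary = false;  // take the stalled ASIC offline
        pins.backup  = true;   // bring the redundant copy online
        return true;
    }

private:
    std::uint32_t timeout_;
    std::uint32_t lastCount_ = 0;
    std::uint32_t idleTicks_ = 0;
};
```

Because both copies sit on the same power bus in this topology, the switchover is just the two pin writes. No rail switching or sequencing is involved.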
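The extra steps in the second topology can be sketched as a short sequence: take the primary offline, power the backup rail, wait a bounded time for PGOOD, and only then assert ENABLE. Everything below is a simulation under stated assumptions. The struct fields, the settling time, and the `switchToBackup` function are hypothetical and not taken from any specific regulator or ASIC.

```cpp
#include <cassert>

// Simulated primary section: a power rail and an ASIC ENABLE input.
struct PrimarySection {
    bool railOn = true;
    bool enable = true;
};

// Simulated backup section with a PGOOD output (illustrative fields only).
struct BackupSection {
    bool railOn       = false;
    bool enable       = false;
    int  ticksToPgood = 2;     // simulated rail settling time, in polls
    bool railHealthy  = true;  // set to false to simulate a dead regulator

    // PGOOD goes high only after the rail is on and has had time to settle.
    bool pgood() {
        if (!railOn || !railHealthy) return false;
        if (ticksToPgood > 0) { --ticksToPgood; return false; }
        return true;
    }
};

enum class SwitchResult { Ok, PgoodTimeout };

// Sequence: disable the primary ASIC, cut its power, power the backup rail,
// wait for PGOOD (at most maxPolls checks), then assert the backup ENABLE.
SwitchResult switchToBackup(PrimarySection& pri, BackupSection& bak, int maxPolls) {
    assert(maxPolls > 0);
    pri.enable = false;
    pri.railOn = false;
    bak.railOn = true;
    for (int i = 0; i < maxPolls; ++i) {
        if (bak.pgood()) {
            bak.enable = true;
            return SwitchResult::Ok;
        }
    }
    bak.railOn = false;  // do not leave an unverified rail powered
    return SwitchResult::PgoodTimeout;
}
```

On real hardware, each PGOOD poll would be separated by a delay matched to the regulator's startup time, and a `PgoodTimeout` result is the point at which the application might notify the user that no working copy is available.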
Don't Forget Your Safety Circuits
Having redundancy is great, but circuits should be protected anyway so that redundancy measures are needed only in emergencies. Power systems, sensor interfaces, or highly thermally or mechanically stressed systems can implement basic safety measures in the design to help protect a product and extend operation. Focusing on the electrical protection side, we might see any of the following:
- Surge suppressor and ESD suppressor circuits
- Components with thermal shutoff protection
- Resettable breakers, fuses, relays, or crowbar circuits
Companies that build high-reliability electronics need a suite of design tools to implement best practices for redundancy. Multi-disciplined design teams rely on the best set of PCB design features in the Allegro X Design Platform from Cadence. Only Cadence offers a comprehensive set of circuit, IC, and PCB design tools for any application and any level of complexity.