Failure Mode and Effects Analysis (FMEA) is a key safety assessment technique that identifies failure modes at the system, hardware and software levels. Overlooked failure modes can cause system or functional failures that directly affect a system's safety performance, reliability and quality.
FMEA is a bottom-up approach with four key phases: identification of faults, assessment of impact, determination of potential causes and their resolutions, and finally testing and documentation of the analysis.
FMEA addresses the effect of failures at the system, software and hardware levels. The outcome of the analysis helps us identify gaps in the safety requirements specification and provides input for component testing, integration testing and system-level testing. This paper describes the application of Failure Mode and Effects Analysis (FMEA) to software modules.
FAILURE MODE AND EFFECTS ANALYSIS (FMEA)
Software Failure Mode and Effects Analysis (FMEA) is a bottom-up analysis technique for identifying the consequences of possible software failure modes on the software system. The example below outlines the application of software FMEA to a Brake ECU (Electronic Control Unit).
As depicted in figure 1 below, the Brake ECU receives the brake pedal sensor input from the driver as an analog signal and vehicle speed information from another ECU via CAN. In turn, it outputs a brake torque request and the brake module status to other ECUs over CAN.
FMEA starts with identifying the different software failure modes that can influence the subsystem or system. The four phases mentioned above are one possible approach to performing FMEA. In brief, these phases are:
1. Look at the system functionality holistically and identify a comprehensive list of potential failure modes.
2. For each failure mode identified in step 1, assess the implications of the failure for connected software or hardware systems and for the overall performance of the system.
3. Once we know the overall impact, isolate the potential causes of the failure. Once the causes are identified, the system design needs to be enhanced to adequately prevent future failures.
4. Once the design change is made, retest the failure mode to ensure that the system handles the failure appropriately before release. Then complete the necessary documentation.
Now that we have a brief understanding of the approach, let's follow these steps to perform software FMEA on the Brake ECU depicted in figure 1 above.
STEP 1:
For the example above, let's start by listing the individual components, including interfaces, the functions they provide and their failure modes.
The only component of interest here is the Brake ECU, together with its inputs and outputs. Its functions can be defined as:
transmitting a brake torque request, based on the brake pedal sensor and vehicle speed inputs, to other vehicle modules.
sending the brake module fault status to other vehicle modules.
The failure modes for interfaces and the component (Brake ECU) can be defined as follows:
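Component / Interface | Function | Potential Failure Mode |
Brake pedal sensor analog voltage input | | No signal |
Brake pedal sensor analog voltage input | | Signal voltage out of range |
Vehicle speed | | Message corruption |
Vehicle speed | | Message loss |
Vehicle speed | | Message timeout |
Brake ECU | Transmits brake torque request | NO brake torque request |
Brake ECU | Transmits brake torque request | DELAYED brake torque request |
Brake ECU | Transmits brake torque request | INVALID brake torque request |
Brake ECU | Sends brake module fault status to other vehicle modules | NO brake module status |
Brake ECU | Sends brake module fault status to other vehicle modules | DELAYED brake module status |
Brake ECU | Sends brake module fault status to other vehicle modules | INVALID brake module status |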
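The Brake ECU rows in the table above follow a pattern: each output is combined with the guide words NO, DELAYED and INVALID. A minimal sketch of that enumeration is shown here. The function name `enumerate_failure_modes` and the constants are hypothetical names chosen for illustration, and the guide-word list is only what this example uses, not a complete set.

```python
from itertools import product

# Guide words taken from the Brake ECU rows of the table above.
GUIDE_WORDS = ("NO", "DELAYED", "INVALID")

# Outputs of the Brake ECU as named in this example.
BRAKE_ECU_OUTPUTS = ("brake torque request", "brake module status")


def enumerate_failure_modes(outputs, guide_words=GUIDE_WORDS):
    """Return one candidate failure mode per (output, guide word) pair."""
    return [f"{word} {output}" for output, word in product(outputs, guide_words)]


failure_modes = enumerate_failure_modes(BRAKE_ECU_OUTPUTS)
for mode in failure_modes:
    print(mode)
```

Generating the candidates this way only guards against skipping a combination. Whether each candidate is a credible failure mode still has to be judged by the analyst.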
STEP 2:
Once we have listed the failure modes, let's determine, for each failure mode, the effect(s) of the failure on other system components and on the overall system.
For the example above, we determine the effect(s) of receiving an invalid or delayed vehicle speed, a brake pedal analog voltage that is out of range, or no input at all, by asking questions such as:
What if the brake pedal input requested by the driver is not received for a certain period of time?
What if we receive a corrupted vehicle speed over CAN? Can we tolerate a single corrupted message or not?
Does the failure affect vehicle behavior in a way that results in high severity?
The table below lists the potential effect(s) of failure which might or might not impact vehicle behavior.
Component / Interface | Function | Potential Failure Mode | Potential Effect(s) of Failure |
Brake pedal sensor analog voltage input | | No signal | |
Brake pedal sensor analog voltage input | | Signal voltage out of range | |
Vehicle speed | | Message corruption | |
Vehicle speed | | Message loss | |
Vehicle speed | | Message timeout | |
Brake ECU | Transmits brake torque request | NO brake torque request | No brake command issued to the vehicle actuator when requested by the driver |
Brake ECU | Transmits brake torque request | DELAYED brake torque request | Brake command issued too late to the vehicle actuator when requested by the driver |
Brake ECU | Transmits brake torque request | INVALID brake torque request | Invalid brake command issued to the vehicle actuator when requested by the driver which might cause overbraking |
Brake ECU | Sends brake module fault status to other vehicle modules | NO brake module status | No brake module status issued to other vehicle modules in order to notify brake ECU failure |
Brake ECU | Sends brake module fault status to other vehicle modules | DELAYED brake module status | Brake module status issued too late to other vehicle modules in order to notify brake ECU failure |
Brake ECU | Sends brake module fault status to other vehicle modules | INVALID brake module status | Invalid brake module status issued to other vehicle modules in order to notify brake ECU failure |
STEP 3:
After defining the failure modes and potential effect(s) of failure, the next step is to determine the potential cause(s) of failure. For each failure mode, we determine all possible causes, in both hardware and software. Listing the potential cause(s) of failure helps us decide which preventive design controls to implement in order to mitigate these failures. The mitigation strategy can be defined in hardware only, in software only, or in both.
Component / Interface | Function | Potential Failure Mode | Potential Effect(s) of Failure | Potential Cause(s) of Failure |
Brake pedal sensor analog voltage input | | No signal | | |
Brake pedal sensor analog voltage input | | Signal voltage out of range | | |
Vehicle speed | | Message corruption | | |
Vehicle speed | | Message loss | | |
Vehicle speed | | Message timeout | | |
Brake ECU | Transmits brake torque request | NO brake torque request | No brake command issued to the vehicle actuator when requested by the driver | [brake pedal sensor analog voltage input] No signal; [vehicle speed] Message loss; No power supply |
Brake ECU | Transmits brake torque request | DELAYED brake torque request | Brake command issued too late to the vehicle actuator when requested by the driver | [vehicle speed] Message timeout; [Brake ECU] Internal fault |
Brake ECU | Transmits brake torque request | INVALID brake torque request | Invalid brake command issued to the vehicle actuator when requested by the driver which might cause overbraking | [brake pedal sensor analog voltage input] Signal voltage out of range; [vehicle speed] Message corruption; [Brake ECU] Internal fault |
Brake ECU | Sends brake module fault status to other vehicle modules | NO brake module status | No brake module status issued to other vehicle modules in order to notify brake ECU failure | [Brake ECU] Internal fault; No power supply |
Brake ECU | Sends brake module fault status to other vehicle modules | DELAYED brake module status | Brake module status issued too late to other vehicle modules in order to notify brake ECU failure | [Brake ECU] Internal fault |
Brake ECU | Sends brake module fault status to other vehicle modules | INVALID brake module status | Invalid brake module status issued to other vehicle modules in order to notify brake ECU failure | [Brake ECU] Internal fault |
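Keeping the worksheet as structured data makes it easy to ask which failure modes share a cause. The sketch below is one possible representation, not a prescribed format. `FmeaRow` and `failure_modes_caused_by` are hypothetical names, and only two rows of the table above are transcribed.

```python
from dataclasses import dataclass, field


@dataclass
class FmeaRow:
    """One worksheet row: a failure mode with its effect and potential causes."""
    component: str
    function: str
    failure_mode: str
    effect: str
    causes: list = field(default_factory=list)


# Two rows transcribed from the table above; the others are omitted for brevity.
worksheet = [
    FmeaRow(
        component="Brake ECU",
        function="Transmits brake torque request",
        failure_mode="NO brake torque request",
        effect="No brake command issued to the vehicle actuator "
               "when requested by the driver",
        causes=[
            "[brake pedal sensor analog voltage input] No signal",
            "[vehicle speed] Message loss",
            "No power supply",
        ],
    ),
    FmeaRow(
        component="Brake ECU",
        function="Sends brake module fault status to other vehicle modules",
        failure_mode="NO brake module status",
        effect="No brake module status issued to other vehicle modules "
               "in order to notify brake ECU failure",
        causes=["[Brake ECU] Internal fault", "No power supply"],
    ),
]


def failure_modes_caused_by(rows, cause):
    """List the failure modes that name `cause` as a potential cause."""
    return [row.failure_mode for row in rows if cause in row.causes]


shared = failure_modes_caused_by(worksheet, "No power supply")
print(shared)
```

A cause that appears under several failure modes, such as the loss of power supply here, is a natural candidate for a shared mitigation.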
STEPS 4 and 5:
After identifying the potential failure modes and causes of failure, with the severity of each failure captured under the potential effects column, we list the current preventive design controls and recommend action(s) to mitigate these failures where controls are not already in place. For example:
To mitigate brake pedal sensor failure, we can add a redundant sensor to fall back on in case the primary sensor fails. We can also add a plausibility check that reads both sensor voltages, compares them against each other and sets a fault if the difference between the two exceeds some value for some period of time.
To check for CAN message corruption, we can verify on the receiver side a CRC (Cyclic Redundancy Check), parity bit, etc. added to a field of the CAN message, and set a fault flag if the number of invalid CRCs exceeds a threshold.
To check for CAN message drop or loss, we can verify on the receiver side an MC (Message Counter), sequence number, etc. added to a field of the CAN message, and/or check for a timeout.
To mitigate risk, we can add a sensor fault detection strategy in hardware that defines what happens, and what actions to take, if the power supply to the sensor goes off, the sensor malfunctions, there is a register failure, etc.
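The software checks described above (sensor plausibility, CRC verification, message counter and timeout) can be sketched as small receiver-side helpers. This is an illustration under stated assumptions, not a production implementation: every name and threshold here is hypothetical, `zlib.crc32` merely stands in for whatever checksum the real message layout defines, the counter is assumed to be 4 bits wide, and the debounce counts consecutive bad samples, which is only one way to realize "exceeds a threshold".

```python
import zlib


class DebouncedFault:
    """Sets a fault once a bad condition persists for `limit` consecutive checks."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.active = False

    def update(self, condition_bad):
        # Count consecutive bad samples; any good sample resets the counter.
        self.count = self.count + 1 if condition_bad else 0
        if self.count >= self.limit:
            self.active = True  # stays set; fault recovery is not shown here
        return self.active


def pedal_implausible(primary_v, redundant_v, max_delta_v=0.2):
    """True when the two pedal sensor voltages differ by more than max_delta_v."""
    return abs(primary_v - redundant_v) > max_delta_v


def crc_invalid(payload, received_crc):
    """True when the CRC recomputed over the payload differs from the received one."""
    return zlib.crc32(payload) != received_crc


def counter_skipped(previous, current, modulo=16):
    """True when the message counter did not advance by exactly one (with wrap)."""
    return current != (previous + 1) % modulo


def timed_out(last_rx_ms, now_ms, timeout_ms=100):
    """True when no message has arrived within timeout_ms."""
    return now_ms - last_rx_ms > timeout_ms


# Plausibility: one good sample followed by three implausible ones sets the fault.
pedal_fault = DebouncedFault(limit=3)
for primary, redundant in [(1.0, 1.05), (1.0, 1.6), (1.1, 1.7), (1.2, 1.8)]:
    pedal_fault.update(pedal_implausible(primary, redundant))

# CRC: a single corrupted message is tolerated when the limit is 2.
crc_fault = DebouncedFault(limit=2)
payload = bytes([0x12, 0x34])
good_crc = zlib.crc32(payload)
crc_fault.update(crc_invalid(payload, good_crc))      # valid message
crc_fault.update(crc_invalid(payload, good_crc ^ 1))  # corrupted CRC field

print(pedal_fault.active, crc_fault.active)
print(counter_skipped(15, 0), counter_skipped(3, 5))
print(timed_out(last_rx_ms=0, now_ms=150))
```

The debounce limit is where the earlier question about tolerating a single corrupted message gets answered: a limit of 1 reacts to the first bad message, while a larger limit trades detection time for robustness against transient disturbances.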
CONCLUSION
The bottom-up FMEA approach helps identify failure modes at the functionality level and assess their severity and impact on the overall system. If there is no impact, then we can be fairly confident that the system design is robust. If there is some impact, then preventive measures need to be initiated as highlighted in this paper.