The automotive industry is grappling with a tradeoff between cost and safety. Safety is well understood in industries that are cost-insensitive, such as aerospace and medical, and the consumer industry has a long track record of driving down costs while increasing functionality. But can these two industries be brought together in a safe and effective manner to enable automobiles to achieve the goal of autonomous driving?
Functional verification of functional safety means that a design has been shown to be able to detect and recover from random failures of the executing system. This is above and beyond the requirement for the elimination of systemic defects that are meant to be caught with traditional functional verification and beyond the uncertainties that come about from the deployment of technologies, such as machine learning, which are statistical in nature and outside of the scope of current verification technologies.
There are two fundamental approaches that can be used to tackle this problem. One is to incorporate design-for-safety techniques that can guarantee all random errors are detected, such as duplication, and the other is to insert detection logic and use safety verification tools to show that all errors will be detected or will not lead to a safety failure.
Eliminating systemic failures
Both techniques rely on a good starting foundation, meaning that no systemic issues remain in the design. “Systematic verification, as defined in ISO 26262, requires a rigorous approach to cataloging requirements and ensuring that the test plan matches them on an individual basis,” explains Adnan Hamid, CEO of Breker Verification Systems. “This is consistent with the Portable Stimulus (PS) approach of defining scenarios to be tested. PS scenarios can be defined to match the original requirements, and the coverage of the scenarios may be evaluated against the original requirement intent. In order to ensure good random fault coverage, and unlike manufacturing test which only requires one fault detection, functional safety in the context of automotive requires ensuring that a fault will always be detected under all conditions. That means you have to start with a set of scenarios, or testcases, that adequately cover all of the relied upon functionality of the device. The right way to do that is to start with a model of functional intent.”
“Formal property checking also can be used to avoid systematic failures in the design,” says Roger Sabbagh, vice president of applications engineering for Oski Technology. “It can play a huge role in ensuring that no latent bugs are going to cause harm or damage, thus increasing device safety.”
Design for safety
In cost-insensitive industries, duplication has been used for a long time. “The easy solution used to be duplication,” explains Alexis Boutillier, functional safety manager at ArterisIP. “You try and minimize the areas in which this has to happen, and today you limit it to the part of the logic that cannot be protected by techniques such as ECC. This is usually where you are doing transformation of information. Much of it is protected in similar ways to communication where you can use the same techniques used for telecommunication – CRC, ECC and parity. It is not always perfect, but it does provide very good coverage for minimal cost.”
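To make the ECC idea concrete, here is a minimal Python sketch of a Hamming(7,4) code, a textbook single-error-correcting scheme. It is an illustration only, not the protection any vendor quoted here uses; the function names are hypothetical, and production designs typically use wider codes with different overhead.

```python
from itertools import product

def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome is the 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position

def all_single_flips_corrected():
    """Flip every bit of every codeword once and check that decoding recovers the data."""
    for data in product((0, 1), repeat=4):
        for i in range(7):
            code = hamming74_encode(data)
            code[i] ^= 1
            decoded, position = hamming74_decode(code)
            if decoded != list(data) or position != i + 1:
                return False
    return True

if __name__ == "__main__":
    print(all_single_flips_corrected())
```

The code corrects any single flipped bit, but two flips in one word can be miscorrected, which is one reading of the “not always perfect” caveat in the quote.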
Another popular technique involves checkers. “You have to design in safety in the form of watchers and checkers that sit around the design,” says Apurva Kalia, VP of R&D in the System & Verification Group at Cadence. “These detect if something bad is happening. That has to be done in the early stages of the design process, and that is where design for safety comes in.”
The question is where to put them. “You need monitors that check everything possible,” adds Boutillier. “This starts with clock, to reset, to timeouts, especially for an IP where you do not control everything.”
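A timeout monitor is one of the simpler checkers of this kind. The Python sketch below is a behavioral model under assumed semantics: each request must receive a response within a fixed number of clock ticks. The class name, the cycle limit and the transaction IDs are all hypothetical.

```python
class TimeoutMonitor:
    """Raise an alarm for any request that gets no response within limit_cycles ticks."""

    def __init__(self, limit_cycles):
        self.limit = limit_cycles
        self.pending = {}   # transaction id -> cycles waited so far
        self.alarms = []    # transaction ids that timed out

    def request(self, txn_id):
        self.pending[txn_id] = 0

    def response(self, txn_id):
        self.pending.pop(txn_id, None)

    def tick(self):
        """Advance one clock cycle and flag transactions that waited too long."""
        for txn_id in list(self.pending):
            self.pending[txn_id] += 1
            if self.pending[txn_id] > self.limit:
                self.alarms.append(txn_id)
                del self.pending[txn_id]

if __name__ == "__main__":
    mon = TimeoutMonitor(limit_cycles=3)
    mon.request("a")
    mon.request("b")
    mon.tick()
    mon.response("a")        # "a" completes in time
    for _ in range(3):
        mon.tick()           # "b" never completes
    print(mon.alarms)
```

In hardware, the same idea is a counter per outstanding transaction. It is attractive for third-party IP because it needs no knowledge of the block's internals.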
Another approach is to do the checking in software. “Some users rely on software test libraries for part of their coverage,” explains Srikanth Rengarajan, vice president of products and business development at Austemper Design Systems. “One company deploys thresholding techniques to define a safety fault specifically for the datapath.”
But what is a fault? “Not all errors are faults and applications decide when to declare a fault,” adds Rengarajan. “At the IC level, the standard bag of tricks (ECC, parity, lockstep, TMR) continues to see wide deployment, but as chip sizes increase into the 100-million gate+ range the scope of these techniques has narrowed. Targeted tools for replication and ECC protection are in demand by IC and IP designers to reduce design-bloat and maintain the schedule.”
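Lockstep and TMR differ in what the extra copies buy. The Python sketch below models both at a behavioral level with a hypothetical 8-bit function standing in for the protected logic. Lockstep (two copies) can only flag a mismatch, while TMR (three copies) can outvote a single faulty copy.

```python
def lockstep(primary, shadow, x):
    """Run two copies of the logic and return (result, mismatch_flag)."""
    a, b = primary(x), shadow(x)
    return a, a != b

def tmr_vote(a, b, c):
    """Bitwise majority vote over three copies of a result."""
    return (a & b) | (a & c) | (b & c)

def healthy(x):
    """Hypothetical protected logic: an arbitrary 8-bit function."""
    return (x * 3 + 1) & 0xFF

def faulty(x):
    """The same logic with one output bit flipped by a random hardware fault."""
    return healthy(x) ^ 0x04

if __name__ == "__main__":
    print(lockstep(healthy, healthy, 5))                 # copies agree, no flag
    print(lockstep(healthy, faulty, 5))                  # mismatch flagged, not corrected
    print(tmr_vote(healthy(5), faulty(5), healthy(5)))   # the faulty copy is outvoted
```

The cost is roughly two or three copies of the protected logic plus the comparator or voter, which is why the scope of these techniques narrows as gate counts grow.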
What about new hardware structures such as neural networks (NNs)? “Part of the requirement that you have to fulfill for the automotive market is to do an analysis of the network and show how it can fail,” explains Boutillier. “There are many ways that it could fail, but you are doing a qualitative analysis after it is correctly trained (which is an assumption), looking for faults that occur randomly inside the network that have a consequence on the output.”
Basically, the network has to be free of errors, but there is no checking for the training. “The definition of functional safety, or at least the one used within ISO 26262, is the freedom from phenomena which will cause a change in the behavior of the design,” says Kalia. “This means that if such a phenomenon occurs, the rest of the system should be able to react to it in a manner that ensures the entire automobile operates in a safe state. By applying this to ADAS, which takes input from many sensors, and using AI to determine what to do next, the definition of safety would be, ‘If any input, for example LiDAR, sends an image that is wrong or has a fault in it, the rest of the system should be able to correct or detect, or at a minimum warn the user that something has happened in the system and is no longer able to make decisions.’”
Boutillier also points to the structure of the NN, which can help. “NNs often run in iterations. The more iterations you run, the closer you are to the final answer. Errors that happen in early iterations are more prone to errors than the later stages because you are converging. ‘When’ the fault happens can be important. This qualitative analysis is very important and needs to be performed by experts in the domain that can show that if something changes on one of the layers then it has a minimal impact on the output.”
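The point that “when” a fault happens matters can be illustrated with any converging iteration. The Python sketch below is a numerical stand-in, not a neural network: it uses Newton's iteration for the square root of 2 and injects a hypothetical disturbance at a chosen step. It reflects one reading of the quote, namely that a fault followed by enough iterations gets absorbed by convergence while a late fault does not.

```python
def run(fault_at=None, iters=10, delta=0.5):
    """Newton's iteration for sqrt(2); optionally add delta after iteration fault_at."""
    x = 1.0
    for k in range(1, iters + 1):
        x = 0.5 * (x + 2.0 / x)
        if k == fault_at:
            x += delta   # injected transient fault (hypothetical magnitude)
    return x

if __name__ == "__main__":
    target = 2 ** 0.5
    print("fault-free error:", abs(run() - target))
    print("early fault error:", abs(run(fault_at=2) - target))
    print("late fault error:", abs(run(fault_at=10) - target))
```

With the fault at step 2, the remaining iterations pull the value back to the answer. With the fault at the last step, the full disturbance reaches the output. Whether a real network behaves this way depends on its structure, which is why the quote calls for expert analysis.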
Safety verification vs. functional verification
Consider one tool whose function is often confused. “Synopsys’ Certitude lies in the domain of functional verification because it provides an objective metric about how good your functional verification environment is,” explains David Hsu, director of product marketing for Synopsys. “It mutates the RTL and tries to see if the testbench can sensitize, detect and capture the failure. However, functional safety verification fault injection is for random defects. While you could use a similar technology, it will not scale because the quantity of faults, even for ‘stuck at,’ which is the basic minimum, is a huge number. If you are going to simulate the design against the fault effect for every single fault, then that will not scale. Scalability is essential.”
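To see why per-fault simulation struggles to scale, consider a toy stuck-at campaign. The Python sketch below uses a hypothetical three-input circuit with six nets. It builds the stuck-at-0 and stuck-at-1 fault list and re-simulates every test vector against every fault. None of this reflects how Certitude or any commercial fault simulator is implemented.

```python
from itertools import product

NETS = ["a", "b", "c", "n1", "n2", "out"]

def simulate(a, b, c, fault=None):
    """Evaluate the circuit; fault is (net, stuck_value) or None for the golden run."""
    v = {}

    def drive(net, value):
        # A stuck-at fault overrides whatever the logic would have driven.
        v[net] = fault[1] if fault and fault[0] == net else value

    drive("a", a)
    drive("b", b)
    drive("c", c)
    drive("n1", v["a"] & v["b"])
    drive("n2", v["b"] | v["c"])
    drive("out", v["n1"] ^ v["n2"])
    return v["out"]

def campaign(vectors):
    """Return (fault_list, detected_faults) for the given test vectors."""
    faults = [(net, stuck) for net in NETS for stuck in (0, 1)]
    detected = [
        f for f in faults
        if any(simulate(*vec) != simulate(*vec, fault=f) for vec in vectors)
    ]
    return faults, detected

if __name__ == "__main__":
    weak = [(0, 0, 0)]
    full = list(product((0, 1), repeat=3))
    for name, vectors in (("weak", weak), ("full", full)):
        faults, detected = campaign(vectors)
        print(name, len(detected), "of", len(faults), "faults detected")
```

The work grows as faults times vectors. Even the stuck-at model has two faults per net, so a design with millions of nets and long test sequences quickly becomes impractical for a naive loop like this one.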
Mark Serugetti, director of business development for automotive solutions in Synopsys, provides a framework. “You have to start thinking about different technologies that are used along the way in the functional safety flow. First you have to do failure mode analysis. The work here is to look at the safety mechanisms and then try and define what safety level or FIT (failure in time) rate I need to reach. Then you start using multiple technologies. We talk about Certitude and systematic errors. Then fault simulation that is geared towards looking at fault injection. You may also start to apply other techniques such as formal methods to reduce the fault list because some of them will never be reached or may be safe. Then on top of that you bring in another layer which is to run real software. This may require running faster so, using emulation. You take all of this information and annotate your FMEDA analysis so that you can generate the work product.”
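The arithmetic behind an FMEDA roll-up can be sketched in a few lines of Python. The block names, FIT rates and diagnostic coverage figures below are invented for illustration. The formula is a simplified single-point fault metric that treats every fault as safety-related and ignores latent faults. The ASIL targets are the commonly cited ISO 26262-5 values and should be checked against the standard.

```python
# Hypothetical FMEDA rows: (block, failure rate in FIT, diagnostic coverage).
# 1 FIT = 1 failure per 10**9 device-hours.
ROWS = [
    ("cpu_core", 100.0, 0.99),      # assumed lockstep coverage
    ("sram", 200.0, 0.99),          # assumed ECC coverage
    ("interconnect", 50.0, 0.90),   # assumed parity coverage
    ("clock_reset", 10.0, 0.60),    # assumed monitor coverage
]

# Commonly cited single-point fault metric targets (assumption: verify in ISO 26262-5).
ASIL_SPFM_TARGET = {"B": 0.90, "C": 0.97, "D": 0.99}

def residual_fit(rows):
    """Failure rate left uncovered by the safety mechanisms."""
    return sum(fit * (1.0 - dc) for _, fit, dc in rows)

def spfm(rows):
    """Simplified single-point fault metric: 1 - residual rate / total rate."""
    total = sum(fit for _, fit, _ in rows)
    return 1.0 - residual_fit(rows) / total

def asils_met(rows):
    """ASIL levels whose SPFM target the design reaches."""
    value = spfm(rows)
    return [level for level, target in ASIL_SPFM_TARGET.items() if value >= target]

if __name__ == "__main__":
    print(f"residual FIT: {residual_fit(ROWS):.1f}")
    print(f"SPFM: {spfm(ROWS):.2%}")
    print("targets met:", asils_met(ROWS))
```

In a real flow, the coverage column is what fault simulation, formal analysis and emulation are used to justify. The spreadsheet only turns those numbers into the work product.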
Safety validation has long been the Achilles’ heel of the industry. “ISO 26262 recommends the adoption of fault injection as the preferred approach to making the safety case,” adds Rengarajan. “Historically, the feasibility of running an exhaustive fault campaign on all but a microcontroller-based design renders this recommendation moot, with vendors resorting to proof-by-analysis/induction to make their case. This could take the form of the ISO-coverage recommendations supported by arguments of prior-art and in-production or simple replication at added cost.”
“With the double whammy of larger chips performing more safety-critical applications, assessors are reluctant to allow the arguments adduced while the economics makes the duplication intolerable,” continues Rengarajan. “This creates a need for a new verification methodology devoted to fault testing that can scale to SoCs for advanced driver assistance systems of tomorrow. Using a combination of parallelization techniques and innovative fault propagation algorithms, products are starting to appear that provide exponential speedup for fault campaign. The adoption of fault-injection as the default method for safety validation will be the inflection point that the industry has been awaiting for mass deployment of self-driving technologies.”
Fault simulation is hardly new. “Fault simulation has been around in the Design for Test (DFT) and ATPG world for many years to help analyze fault coverage of manufacturing tests,” says Bryan Ramirez, strategic marketing manager at Mentor, a Siemens Business. “But analyzing faults for manufacturing is different than analyzing faults for safety.”
Ramirez points to three aspects in dealing with random hardware failures:
Safety analysis to understand how the design could fail, how those failures relate to the safety goals, and identifying areas of the design that need to be safer. This is largely an expert-driven process managed by complex spreadsheets. “There may be opportunities for tools to better manage and automate some of this, but removing the ‘expert’ completely is not within the spirit of ISO 26262,” he says.
Enhancing the design through the addition of safety mechanisms to make the design safer. There are opportunities to improve the efficiency of this process. This starts with helping customers understand at an architectural level where safety mechanisms are needed and providing guidance on the types of safety mechanisms to implement. These could include built-in-self-tests, additional logic for protection and redundancy. Automatic insertion of these safety mechanisms will further improve the efficiency of increasing design safety and keep functional safety costs manageable.
Fault injection, which is used to test how the design behaves in response to a fault in the hardware. “This is the area where there has been a lot of focus from tool providers already, but it has a long way to go if the industry wants to be able to provide technology that can efficiently analyze (for both cost and time) the scale of faults and failures of these modern automotive SoCs.”
The final step can be enhanced using formal analysis. “Formal verification can be used to pre-qualify the faults before simulation to prune out those that are proven to be redundant or non-propagatable,” says Oski’s Sabbagh. “This saves time and resources.”
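A simple structural version of this pruning can be sketched in Python. The netlist below is hypothetical, and the check is plain graph reachability, which is far weaker than a formal proof. Any fault on a net outside the cone of influence of the observed outputs cannot propagate, so it is dropped before simulation.

```python
# Hypothetical netlist: each driven net maps to the nets it is computed from.
FANIN = {
    "n1": ["a", "b"],
    "n2": ["b", "c"],
    "out": ["n1", "n2"],
    "dbg": ["c", "d"],   # debug-only logic that drives no observed output
}
OBSERVED = ["out"]

def cone_of_influence(fanin, observed):
    """All nets that can structurally reach an observed output."""
    seen, stack = set(), list(observed)
    while stack:
        net = stack.pop()
        if net in seen:
            continue
        seen.add(net)
        stack.extend(fanin.get(net, []))
    return seen

def prune(fanin, observed):
    """Split the stuck-at fault list into (faults to simulate, faults pruned)."""
    nets = set(fanin) | {n for inputs in fanin.values() for n in inputs}
    cone = cone_of_influence(fanin, observed)
    faults = [(net, stuck) for net in sorted(nets) for stuck in (0, 1)]
    keep = [f for f in faults if f[0] in cone]
    pruned = [f for f in faults if f[0] not in cone]
    return keep, pruned

if __name__ == "__main__":
    keep, pruned = prune(FANIN, OBSERVED)
    print(len(keep), "faults to simulate,", len(pruned), "pruned:", pruned)
```

Formal tools go further: they can also prove that a fault inside the cone is logically blocked and never reaches an output, which structural analysis alone cannot show.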
The necessary fault list is evolving. “You need a way to differentiate systematic errors, such as an error in the process, which will always have a problem, to a random fault that will generate, for example, a random read failure, but there is a third case which is something in between,” says ArterisIP’s Boutillier. “If you have a fault that changes Quality of Service, where you now have something that is high latency instead of low latency, it is difficult with the existing ISO to say that you have a fault. The product is not yet dead, but if we continue in this manner it may lead to a failure but you cannot say it is not working.”
There is a problem with fault simulation in general. “Although fault simulation is an accepted approach, it suffers from the fact that the results are only as good as the stimulus provided by the testbench,” points out Jörg Grosse, product manager for functional safety at OneSpin Solutions. “Given the plurality of the fault scenarios and the possible input combinations, it can be difficult to achieve a high level of confidence in the results. Results might even be inconclusive for certain fault scenarios.”
Grosse outlines the alternative. “The great advantage of formal verification is that it can provide a conclusive answer to whether a fault is detected because it considers all possible input scenarios and thus eliminates the dependency on input stimulus. Similar to the signal/value force in simulation, modern formal verification tools have the capability to cut and tie internal signals. This capability can be explored to create fault scenarios and run proofs that such fault scenarios are detected by the safety mechanisms. Fault scenarios can include timing, enable conditions, and number and type of faults, which makes automation beyond cut and tie desirable.”
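The difference between stimulus-dependent and conclusive results can be shown on a toy example. In the Python sketch below, a hypothetical 4-bit link is protected by a parity checker, a fault is modeled by forcing signals to fixed values, and every input is enumerated. Brute-force enumeration stands in here for a formal engine, which would reach the same verdict symbolically rather than by listing inputs.

```python
from itertools import product

def protected_path(data, fault=None):
    """Send 4 data bits plus even parity over a link; fault maps bit index -> stuck value."""
    word = list(data) + [sum(data) % 2]      # index 4 carries the parity bit
    for index, value in (fault or {}).items():
        word[index] = value                  # 'cut and tie': force the signal
    alarm = sum(word) % 2 != 0               # checker: total parity must stay even
    return word[:4], alarm

def prove_detected(fault):
    """Check all inputs: the output must be correct or the alarm must fire.

    Returns (True, None) if that holds everywhere, else (False, counterexample).
    """
    for data in product((0, 1), repeat=4):
        out, alarm = protected_path(data, fault)
        if out != list(data) and not alarm:
            return False, data
    return True, None

if __name__ == "__main__":
    print(prove_detected({2: 1}))         # single stuck-at fault on one data bit
    print(prove_detected({0: 1, 1: 1}))   # two bits stuck at once
```

The single fault is proven detected or harmless for every input. The double fault yields a counterexample in which corrupted data passes with no alarm, a case that a testbench with unlucky stimulus could easily miss.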
ISO 26262 recommends that formal verification be used to verify safety-related requirements because it is the most exhaustive approach for detecting failures and bugs in designs. “These practices can and should be used in any design that requires functional safety,” adds Sabbagh. “To ensure effectiveness, a requirements management system must be used to provide traceability of the safety requirements through the verification and validation phases to demonstrate the design meets the requirements. Assertions that capture these safety requirements and are subsequently formally proven will provide the ultimate confidence in meeting those requirements.”
Whatever tools are used, they have to be qualified, as well. “ISO defines tool confidence levels,” explains Boutillier. “You look at what can go wrong. For example, if I am synthesizing something and the tool has a fault, are you able to detect that? One way is to say that when you generate the RTL you also provide the testbench to ensure that the generated RTL corresponds to the requirements. For synthesis, you may use a formal proof to ensure that the netlist is equivalent to the RTL. This provides good confidence that if something happened in the tool, then I will detect it and take corrective action.”
Synopsys’ Hsu concurs. “Customers have to certify their tool chain, so the tools and technologies they choose are important. If they are not already certified, the customer has to prove that the tools will not introduce an unsafe issue.”
The verification of functional safety is a nascent technology, but one that is essential to achieving autonomy in automobiles. While the tools are beyond the first generation, as developed for manufacturing test, there is a lot more that can be done. Today, the expert in the loop is necessary because the problem as defined is close to being intractable.