The cost of test is rising as a percentage of manufacturing costs, fueled by concerns about reliability of advanced-node designs in cars and data centers, as well as extended lifetimes for chips in those and other markets.
For decades, test was limited to roughly a flat 2% of total manufacturing cost, a formula developed prior to the turn of the millennium after chipmakers and foundries saw the trajectory of rising test costs. They responded by sharply limiting the time that devices could spend under test in manufacturing. That, in turn, paved the way for highly advanced automated test equipment capable of performing multiple tests at the same time on multiple die or wafers, and ultimately for testing different functions under system-level test. Alongside that, inspection and metrology capabilities improved proportionately.
Today, those test percentages are rising again. Depending upon end markets, as well as how new the technology is, in some cases those numbers can rise by an order of magnitude or more, while in others they have stayed largely the same.
There are a number of reasons behind those spikes. One of the key ones is the rapid deployment of AI systems in cars and in mission-critical applications, which requires more compute elements to process more data. In fact, some AI/ML/DL chips developed at the most advanced nodes are being stitched together today to boost processing beyond the number of compute elements that will fit on a chip the size of a reticle. In AI, more is better, particularly when it comes to training and some inferencing, and this is particularly important in autonomous vehicles of all sorts, as well as edge and cloud computing. But that also means an increase in the amount and complexity of testing required, and the time it takes to do that testing to achieve sufficient coverage.
“At 7/5nm, there is much heavier reliance on functional test, scan-chain test and system-level test,” said Doug Elder, vice president and general manager of the semiconductor business unit at OptimalPlus. “The key is to get as much data as you can, so you can load as much as possible for a baseline comparison. The less data you have to work with, the harder it is to question a data-driven approach, and it doesn’t give you the ability to leverage AI. You also need to look at the structure of that data because the collection process can be difficult due to the volume of data and the structure of the devices.”
This helps explain why many chipmakers still rely heavily on burn-in, a time-consuming and increasingly expensive process that provides a single snapshot (or series of snapshots) of device behavior across a chip or system.
“They’re reluctant to let go of that because if there’s a latent defect they want to find it,” said Elder. “But when you collect burn-in data you find that, statistically, most devices are not failing at burn-in. The other thing they look at is a lot of RMA (return merchandise authorization) data. If an ADAS camera is returned, you want to know why it failed, particularly in a closed-loop system. But you need to collect that data all the way through to the end user level, and then correlate that back to the manufacturing process. This is becoming very important in automotive, where you need to understand why a sensor died at 60 MPH. So you collect information, feed it back, and correlate it.”
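Elder's statistical point about burn-in can be made concrete with a trivial tally. The sketch below uses entirely hypothetical data (device IDs and the two failing units are invented for illustration), but it shows the kind of arithmetic behind the question of whether burn-in still pays for itself:

```python
# Hypothetical burn-in log: device ID -> pass/fail result after stress.
# Illustrates the point that burn-in failure rates are typically tiny,
# which drives the cost/benefit question around keeping the step.
burn_in_results = {f"dev{i:04d}": "pass" for i in range(10_000)}
burn_in_results["dev0042"] = "fail"   # a rare latent defect caught
burn_in_results["dev7311"] = "fail"

failures = sum(1 for r in burn_in_results.values() if r == "fail")
failure_rate = failures / len(burn_in_results)
print(f"{failures} failures out of {len(burn_in_results)} "
      f"({failure_rate:.4%})")   # 2 out of 10,000 -> 0.0200%
```

At failure rates this low, the value of burn-in rests almost entirely on the cost of the rare latent defects it catches, which is exactly the RMA correlation question Elder describes.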
All of this adds to the complexity of a system, and it stretches the test process well beyond manufacturing.
“It’s not simply that the cost of test is going up, because you aren’t always increasing the amount of time a chip spends on an ATE machine,” said Keith Arnold, who runs global technical sales at PDF Solutions. “There’s a lot of stuff you couldn’t do before. Now you can look across multiple process steps, from ingots to finished products. Very few people have systems in place to take advantage of this, but it has evolved to the point where you can predict early life failures and the likelihood of each die failing in the field.”
Arnold noted that over the past few years, by applying machine learning, data could be used to understand the root cause of a defect. “But now if you do physical failure analysis, electrical failure analysis and data failure analysis, you can use those to predict when something will fail. And that’s just the tip of the iceberg of what you can do with this stuff.”
Ultimately, this could reduce liability concerns and limit the number of expensive returns and recalls, but that needs to be calculated on a system-level basis. If this works as planned, test and analytics costs will still rise, but other costs such as burn-in will fall.
Still, this has to be applied to thousands of components that make up a complex system, such as an autonomous vehicle. A single package may contain multiple die, hundreds of IP blocks, I/Os, memories, various types of interconnects, and a mixture of different materials ranging from conformal films to rigid substrates and various types of bumping and solder balls.
Packaging adds another dimension to this problem, because even known good die may be damaged in the packaging process. At the same time, most companies willing to spend hundreds of millions of dollars to develop a chip also want to maximize its performance by shortening the distance signals need to travel. So while shrinking features provides more room for additional processing elements, at the most advanced process nodes shrinking alone no longer delivers significant improvements in power and performance.
This is where complex architectural changes come into focus, and they add yet another level of complexity for testing.
“Test represents more of the cost than it used to,” said Carl Moore, yield management specialist at yieldHUB. “If you think of the old pie chart, the costs were the wafer, the packaging and a small amount for test. The test portion is getting bigger. You’re now testing hundreds of nodes internally. These devices also are getting bigger, and with multi-die, they’re getting higher. At the same time, there’s a trend to minimize the number of pins because pins cost money.”
That makes it even harder to test using standard approaches, and harder to design chips so that enough regions of the die can be reached by those tests.
Shift right
Rather than just doing more testing concurrently in the fab, chipmakers are taking a different approach. While there has been a big emphasis across the semiconductor industry to shift left on everything from verification and software development to some initial test strategies, in some markets such as automotive and servers, there is a concurrent trend that is pushing testing both left and further right.
“On the hardware test side, there is a massive crossover between AI and automotive with the auto industry developing its own AI,” said Lee Harrison, marketing manager for automotive test at Mentor, a Siemens Business. “We can run a lot of this in system test, where you have arrays of processing cores and you keep the whole thing running to make sure the hardware is correct. But there is also increasing demand on usage with self-driving features, and the number of hours the underlying technology is expected to run will increase massively. At the same time, automotive cycle time is being reduced. So you need to run tests in the system, which in an AI chip may contain up to 1,000 identical processing cores. And for a larger AI chip, which is going to be the brain of a car, you also need to implement repair for logic. We’ve seen repair in memory, and that represents a big amount of content on most chips. Now, with AI chips, you’re going to need repair at the core level over the lifecycle of a chip.”
This is a recognition that in safety-critical applications, where there are thousands of cores in an AI system, something will go wrong at some point — even if it’s damage to a transistor caused by a stray alpha particle. The key is to both understand when that happens, through ongoing testing and monitoring, and to be able to immediately fail over to a spare processor or transistor. All of this has to be monitored and tested repeatedly throughout the operation of a system.
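The monitor-and-fail-over logic described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the class, core counts, and method names are all hypothetical:

```python
from dataclasses import dataclass, field

# Minimal sketch (all names hypothetical) of in-system core repair:
# a periodic self-test flags a failing core, and the scheduler
# immediately swaps in a spare, mirroring the logic-repair idea above.

@dataclass
class CoreArray:
    active: set = field(default_factory=lambda: set(range(1000)))
    spares: list = field(default_factory=lambda: list(range(1000, 1010)))
    failed: set = field(default_factory=set)

    def report_self_test(self, core_id: int, passed: bool) -> None:
        """Called by in-field self-test; fail over to a spare if needed."""
        if passed or core_id not in self.active:
            return
        self.active.discard(core_id)
        self.failed.add(core_id)
        if self.spares:               # immediate failover to a spare core
            self.active.add(self.spares.pop(0))

array = CoreArray()
array.report_self_test(17, passed=False)
print(len(array.active))   # still 1000: a spare replaced core 17
```

The essential property is that the active core count never drops while spares remain, so the system keeps full capability through a failure.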
But testing is also being applied for more proactive reasons. In an array of 1,000 identical processing elements, it’s also good to distribute the compute load at any point in order to even out the aging process on circuits. Those circuits can be tested on an ongoing basis and corrections applied as needed.
“There are times when you want all 1,000 cores for peak usage,” said Harrison. “But if you’re down to 10%, you do not want that all in the top left corner of the die. So you do a checkerboard pattern and try to extend the life of a component as long as possible. You don’t want everything redundant, though, because if you have 100 extra cores you’re going to decrease the efficiency of a system. Maybe you go from a system with 100% performance where you have failures down to 90% performance where you don’t see those failures.”
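Harrison's checkerboard idea amounts to a placement policy: at partial load, pick cores spread across the die instead of a contiguous corner block. The sketch below is a hypothetical illustration (the grid dimensions and stride-based selection are assumptions, not a real scheduler):

```python
# Hypothetical sketch of the "checkerboard" idea: when only 10% of a
# 1,000-core array is needed, pick cores spread evenly across the die
# (here via a stride) rather than one corner, so activity-driven wear
# is distributed and aging evens out.

GRID_W, GRID_H = 40, 25          # 1,000 cores laid out as a 40x25 grid

def checkerboard_select(needed: int) -> list[tuple[int, int]]:
    """Return (x, y) positions for `needed` cores, strided across the grid."""
    total = GRID_W * GRID_H
    stride = total // needed                 # even spacing in scan order
    picks = [(i % GRID_W, i // GRID_W) for i in range(0, total, stride)]
    return picks[:needed]

cores = checkerboard_select(100)             # 10% duty
xs = [x for x, _ in cores]
ys = [y for _, y in cores]
# Selected cores span the whole die, not one corner:
print(min(xs), max(xs), min(ys), max(ys))
```

A real scheduler would also rotate the pattern over time and weight selection by each core's accumulated stress, which is where the ongoing in-field test data feeds back in.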
Notes from the real world
To understand just how complex this has become, consider the case of EAG Laboratories, which tests both materials and chip failures.
“We have $150 million for the equipment, at least,” said Daniel Sullivan, account executive at EAG. “And we have something like 100 or 120 different techniques. If you start talking about the subtle variations inside a technique, you probably have 200 techniques.”
Identifying the cause of a defect can run thousands of dollars per defect. This pales in comparison to the cost of an idle fab, however, which can be millions of dollars a day. So figuring out what went wrong, such as where the purity of materials is being disrupted, can save huge amounts of money. That accounts for the willingness to pay for more testing in this case, and the sense of urgency when something goes wrong on the manufacturing side.
“These machines range in value, but they can be $1.5 million to $2 million each for some of the machines,” said Sullivan. “They also have a Ph.D. who’s been doing this for 30 years, looking at the technology and figuring out what’s going on. And then they get an official report from a third party that’s neutral. So the Ph.D. has to go to his vendor and say, ‘You put x, y, z in that. You’re not supposed to do that.’ They say, ‘Oh, no, that’s your test. You just made it up.’ That’s when things come to us. We’re considered a disinterested third party. So there are companies where you made something. ‘I thought it was supposed to be something else. The contract says what it is. If it’s wrong, you owe me $100 million. If you are right, I am out of business.’ So you need that third party, because these two people are not going to trust each other. Plus, if you have an expert who’s been in this industry for a long time, they can talk to both sides and say, ‘Hey, I’ve seen this before. The problem looks like this, and the test shows me this, so this is probably the cause of the problem.’ Then they both sort of take that, and instead of fighting they can go and try to solve the problem. But there are an awful lot of times where somebody machined something, or is developing some product, and there’s some contamination in the system that they don’t think is a big deal.”
The bottom line
Test costs are rising again for the first time in a couple of decades, but it’s difficult to assess at this point just how much of that spending is essential, what can be cut, and what else can be done with the test results and constant monitoring data that has not been done in the past. If test is 25% of the cost of manufacturing, but it reduces liability costs by 30%, that’s a substantial savings. If, on the other hand, test costs rise and liability costs stay flat or increase, then the process needs to be fixed.
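The tradeoff above can be put into numbers. The figures below are entirely hypothetical (baseline units and the liability exposure are invented for illustration), but they show how the 25%/30% comparison works out:

```python
# Worked illustration of the tradeoff above, using hypothetical figures:
# test at 25% of manufacturing cost pays off if it cuts liability by 30%,
# but not if liability stays flat.

manufacturing = 100.0            # arbitrary baseline cost units
liability = 100.0                # assumed liability exposure (hypothetical)

# Scenario A: heavier test (25% of mfg cost) cuts liability by 30%.
cost_a = manufacturing * 1.25 + liability * 0.70   # 125 + 70  = 195
# Scenario B: same test spend, but liability is unchanged.
cost_b = manufacturing * 1.25 + liability          # 125 + 100 = 225
# Old baseline: flat 2% test, full liability exposure.
cost_old = manufacturing * 1.02 + liability        # 102 + 100 = 202

print(cost_a < cost_old, cost_b < cost_old)
```

The break-even point clearly depends on how large the liability exposure is relative to manufacturing cost, which is why the article notes this has to be calculated on a system-level basis.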
It’s too early to tell how this will play out across a range of new markets. What is clear, though, is that the insights provided by ongoing monitoring will provide more information to more parts of the supply chain than at any time in the history of semiconductors. The question now is whether the industry will find sufficient value in the underlying data to continue ratcheting up that investment, and that’s something that may take years to decide.