The design flaw in NVIDIA’s Blackwell GPUs that delayed deliveries of its AI chips has been fixed. The revised B100 and B200 are about to enter mass production.
CEO Jensen Huang admitted that the flaw was entirely NVIDIA’s doing, rejected rumors that TSMC was at fault, and emphasized that the Taiwanese manufacturer helped fix it in time.
«We had a design flaw in the Blackwell, it was functional, but the design flaw caused the low yield. It was 100% NVIDIA’s fault».
When the first reports of the design flaw surfaced, some media outlets reported that TSMC was to blame and suggested that it could cause tension between NVIDIA and its partner. According to Huang, this is not the case, and the problem was caused by NVIDIA’s own miscalculations. He dismissed reports of tension between the two companies as «fake news».
«In order for the Blackwell computer to work, seven different types of chips were developed from scratch and had to be put into production simultaneously.
….
What TSMC did was help us fix that [working chip] yield issue and get Blackwell production back up and running at an incredible pace».
NVIDIA’s Blackwell B100 and B200 GPUs join their two dies using TSMC’s CoWoS-L packaging technology, which relies on an RDL interposer with local silicon interconnect (LSI) bridges. The placement of these bridges is critical. However, a reported mismatch in thermal expansion properties between the GPU dies, LSI bridges, RDL interposer, and substrate caused warping and system failure. NVIDIA was forced to modify the top metal layers and bump structures of the GPU silicon to improve production yield.
Problems of this kind usually take about 10 steppings (silicon revisions) to solve, each taking about three months. The speed at which NVIDIA and TSMC fixed the Blackwell GPUs is therefore impressive. The fixed Blackwell GPUs for AI and supercomputers will go into mass production at the end of October, with deliveries expected to begin early next year.
However, earlier this year NVIDIA warned that, to meet demand for its Blackwell GPUs from major cloud service providers such as AWS, Google, and Microsoft, it would still need to ship some initial low-yield processors in 2024.
Sources: Reuters, Tom’s Hardware