News · Devices · 02-20-2024 at 12:17

Legendary processor architect Jim Keller criticizes Nvidia’s CUDA parallel computing architecture


Yurii Oros

News writer


Jim Keller, the legendary processor architect who has worked on x86, Arm, MIPS, and RISC-V processors, this weekend criticized the architecture of CUDA and the Nvidia software stack, comparing it to x86 and calling it a swamp. He noted that even Nvidia itself maintains several specialized software packages that rely on open-source frameworks for performance reasons.

“CUDA is a swamp, not a moat. x86 was a swamp too. […] CUDA is not beautiful. It was built by piling up one thing at a time.”

— Keller wrote.

https://twitter.com/jimkxa/status/1758943525662769498

Indeed, like x86, CUDA has gradually added functionality while maintaining backward compatibility in both software and hardware. That compatibility comes at a cost: it affects performance and makes program development more difficult. Meanwhile, many open-source development frameworks can be used more efficiently than CUDA, reports Tom’s Hardware.

“Basically, nobody writes CUDA. If you write CUDA, it’s probably not fast. […] There is a good reason why Triton, Tensor RT, Neon, and Mojo exist.”
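The Triton in that list most likely refers to OpenAI’s Triton language, a Python-embedded DSL that compiles to GPU code without hand-written CUDA C++. As a rough illustration of that style (a minimal vector-add sketch under that assumption, not taken from Keller’s post):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Usage: launch over a 1D grid sized to cover the whole vector.
x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```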

Even Nvidia itself has tools that don’t rely solely on CUDA. For example, Triton Inference Server is an open-source tool from Nvidia that simplifies the deployment of AI models at scale, supporting frameworks such as TensorFlow, PyTorch, and ONNX. Triton also provides features such as model versioning, multi-model serving, and concurrent model execution to optimize GPU and CPU utilization.
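To give a sense of what deployment through Triton looks like in practice, here is a minimal sketch of querying a running server with Nvidia’s tritonclient Python package; the model name "resnet50_onnx" and the tensor names "input"/"output" are hypothetical and would have to match the deployed model’s configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes a Triton server is already running on localhost:8000; the model
# name and tensor names below are placeholders that must match config.pbtxt.
client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(
    model_name="resnet50_onnx",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output").shape)  # e.g. (1, 1000) class scores
```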

Nvidia’s TensorRT is a high-performance deep-learning optimizer and runtime library that accelerates inference on Nvidia GPUs. TensorRT takes trained models from a variety of frameworks, such as TensorFlow and PyTorch, and optimizes them for deployment, reducing latency and increasing throughput for real-time applications such as image classification, object detection, and natural language processing.
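As an illustration of that workflow, here is a minimal sketch of building an optimized engine from an ONNX export with the TensorRT Python API (TensorRT 8.x style; "model.onnx" is a placeholder for a model exported from TensorFlow or PyTorch):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network definition, as required for ONNX models.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels to cut latency

# Build and serialize the optimized engine for later deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```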

And while platforms such as Arm, CUDA, and x86 can be considered swamps because of their relatively slow evolution, mandatory backward compatibility, and accumulated complexity, they are also far less fragmented than, say, the broader GPGPU ecosystem, and that is not a bad thing.

It is not known what Jim Keller thinks of AMD’s ROCm or Intel’s oneAPI, but it is clear that although he has devoted many years of his life to the x86 architecture, he is not enthusiastic about its future prospects. His statements also suggest that although he has worked for some of the world’s largest chipmakers, including Apple, Intel, AMD, and Broadcom (and now Tenstorrent), we won’t see him among Nvidia’s employees anytime soon.

