News AI Week — 07-08-2025 at 12:38

New Korean NPU speeds up AI by 60% and saves 44% of energy


Oleksandr Fedotkin

Author of news and articles


Korean researchers from the KAIST School of Computing, in collaboration with HyperAccel Inc., have developed a new neural processor that improves the inference performance of generative AI models by an average of 60%.

Modern generative AI models, such as ChatGPT-4 and Gemini 2.5, require not only high memory bandwidth but also large memory capacity. For this reason, companies that operate cloud-based generative AI services, including Microsoft and Google, purchase hundreds of thousands of NVIDIA GPUs.

The new neural processor created by the Korean researchers increases the inference throughput of generative AI models by 60% and reduces energy consumption by 44%. The technology was proposed by a team of scientists led by Professor Jongseok Park and is designed specifically for cloud-based AI services such as ChatGPT.

Currently, GPU-based AI infrastructure requires multiple GPUs to provide the necessary bandwidth and memory capacity. The Korean researchers' technology supports the same infrastructure with fewer neural processors by quantizing the KV cache, which occupies most of the memory used during inference. Quantization therefore significantly reduces the cost of building generative AI cloud services.
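The article does not specify the researchers' exact quantization scheme, so the sketch below only illustrates the general idea of KV-cache quantization: storing keys and values at low bit-width with a per-row scale and offset, then dequantizing on use. The 4-bit format, tensor shapes, and function names are illustrative assumptions, not the Oaken design.

```python
import numpy as np

def quantize_kv(tensor, bits=4):
    """Asymmetric per-row quantization of a KV-cache tensor (illustrative)."""
    qmax = 2 ** bits - 1
    t_min = tensor.min(axis=-1, keepdims=True)
    t_max = tensor.max(axis=-1, keepdims=True)
    scale = (t_max - t_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.round((tensor - t_min) / scale).astype(np.uint8)
    return q, scale, t_min

def dequantize_kv(q, scale, t_min):
    """Reconstruct an approximate fp32 tensor from the quantized form."""
    return q.astype(np.float32) * scale + t_min

# Hypothetical KV cache: (tokens, heads, head_dim) in fp32.
kv = np.random.randn(128, 8, 64).astype(np.float32)
q, scale, offset = quantize_kv(kv, bits=4)
recovered = dequantize_kv(q, scale, offset)
err = np.abs(recovered - kv).max()
# The 4-bit codes are held in uint8 here for simplicity; real kernels pack
# two codes per byte, cutting KV-cache memory roughly 8x versus fp32.
```

The small reconstruction error is the price paid for the memory saving; the quote below notes that the actual technique maintains output accuracy.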

General architecture of the Oaken/ACM accelerator

The new NPU integrates with existing memory interfaces without changing the algorithms of the software stack built on current architectures. It not only implements the KV-cache quantization mechanism, but also manages memory at the page level, making efficient use of limited bandwidth and capacity, and introduces new encoding methods optimized for the quantized KV cache. The NPU is expected to enable high-performance, low-power cloud infrastructure for generative AI and thereby reduce operating costs.
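The article does not detail KAIST's page-level memory management, but the general idea, allocating fixed-size physical pages to a sequence's KV cache on demand so memory grows in page-sized steps rather than being reserved up front, can be sketched as follows. The class, page size, and method names are illustrative assumptions.

```python
class PagedKVCache:
    """Minimal sketch of page-level KV-cache bookkeeping (illustrative)."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # pool of free physical pages
        self.tables = {}   # sequence id -> list of physical page ids
        self.lengths = {}  # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token; a new physical page is
        allocated only when the sequence's current page is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:  # first token, or current page is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# Usage: 3 tokens with 2-token pages occupy 2 pages, not a fixed reservation.
cache = PagedKVCache(num_pages=4, page_size=2)
for _ in range(3):
    cache.append_token("seq0")
pages_used = len(cache.tables["seq0"])
cache.free_sequence("seq0")
```

Allocating per page rather than per maximum sequence length is what lets limited on-accelerator memory serve more concurrent requests.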

«This research, conducted in collaboration with HyperAccel Inc., found a solution in algorithms for lightweight inference of generative AI and successfully developed a core NPU technology that can solve the memory problem. With this technology, we have realized an NPU with more than 60% performance improvement over the latest GPUs by combining quantization techniques that reduce memory requirements while maintaining inference accuracy with hardware designs optimized for this purpose», said Professor Jongseok Park.

The results of the study were published by the ACM.

Source: TechXplore


