Bypassing sanctions: DeepSeek FlashMLA improves the performance of NVIDIA H800 AI chips by 8 times


China may have found a way to circumvent the restrictions on the supply of powerful AI chips. DeepSeek FlashMLA technology multiplies the TFLOPS of the NVIDIA Hopper H800.

During its «Open Source Week», which DeepSeek launched on February 24, the company introduced FlashMLA, a «decoding kernel». It is a software technology for optimizing the performance of NVIDIA Hopper processors.

According to DeepSeek, the H800 reaches 580 TFLOPS in BF16 matrix multiplication with FlashMLA, about eight times the roughly 73.5 TFLOPS cited as the industry average. Thanks to efficient memory use, FlashMLA also delivers up to 3000 GB/s of memory throughput, almost twice the H800's quoted peak of 1681 GB/s. Remarkably, this is achieved through code alone, without any hardware changes.

This is crazy.
-> Blazing fast: 580 TFLOPS on H800, ~8x industry avg (73.5 TFLOPS).
-> Memory wizardry: Hits 3000 GB/s, surpassing H800’s 1681 GB/s peak.

— Visionary x AI (@VisionaryxAI) February 24, 2025
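The «about eight times» and «almost twice» figures follow directly from the numbers quoted above. A minimal sketch of that arithmetic, assuming the quoted values are taken at face value (they are the article's and the post's claims, not independent measurements; the variable names are illustrative):

```python
# Figures as quoted in the article and the embedded post (not measured here).
flashmla_tflops = 580.0          # claimed BF16 throughput on the H800, TFLOPS
baseline_tflops = 73.5           # "industry avg" from the quoted post, TFLOPS
flashmla_bandwidth = 3000.0      # claimed memory throughput, GB/s
quoted_peak_bandwidth = 1681.0   # H800 peak as quoted in the post, GB/s

# Ratios behind the "8x" and "almost twice" statements.
compute_ratio = flashmla_tflops / baseline_tflops
bandwidth_ratio = flashmla_bandwidth / quoted_peak_bandwidth

print(f"compute:   {compute_ratio:.2f}x")    # compute:   7.89x
print(f"bandwidth: {bandwidth_ratio:.2f}x")  # bandwidth: 1.78x
```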

DeepSeek’s FlashMLA implements «low-rank key-value compression»; in other words, it breaks data into smaller parts for faster processing, which also reduces memory consumption by 40-60%. The technology uses a block-based «paging» system that allocates memory dynamically according to the demands of the task rather than in fixed amounts. This helps models process variable-length sequences much more efficiently and run faster.
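The block-based paging idea described above can be illustrated with a toy allocator. This is only a sketch of the general technique, not DeepSeek's implementation: the class `PagedKVCache`, the helper `fixed_allocation_blocks`, the block size of 64 tokens, and the sequence lengths are all hypothetical choices made for illustration.

```python
import math


class PagedKVCache:
    """Toy block-table allocator: sequences receive fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size=64):
        self.block_size = block_size                 # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of block ids
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return (block id, offset in block)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:            # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free blocks left in the pool")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def blocks_in_use(self):
        return sum(len(table) for table in self.block_tables.values())


def fixed_allocation_blocks(num_seqs, max_len, block_size=64):
    """Blocks needed if every sequence reserved space for max_len tokens up front."""
    return num_seqs * math.ceil(max_len / block_size)


# Three sequences of very different lengths share one pool of 32 blocks.
cache = PagedKVCache(num_blocks=32)
for seq_id, n_tokens in {"a": 10, "b": 200, "c": 65}.items():
    for _ in range(n_tokens):
        cache.append_token(seq_id)

paged_blocks = cache.blocks_in_use()               # 1 + 4 + 2 blocks
fixed_blocks = fixed_allocation_blocks(3, 1024)    # 3 sequences x 16 blocks each
print(paged_blocks, fixed_blocks)                  # 7 48
```

In this toy setup, short sequences occupy only the blocks they actually fill, whereas reserving a fixed 1024-token slot per sequence would tie up most of the pool regardless of actual length.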

DeepSeek's new technology demonstrates the potential of software in AI computing and the room for improving the performance of expensive, power-hungry accelerators. At the moment, FlashMLA is designed only for the H800, but it would be interesting to see how it works on H100 processors.

China has been actively working on computing optimization. Recently, scientists from the University of Shenzhen and the Beijing Institute of Technology improved the performance of an ordinary NVIDIA RTX 4070 by 800 times in peridynamics computations. Unfortunately, the result was achieved jointly with Russian researchers, and its consequences will accelerate and improve military-industrial computations.

Source: Wccftech