
How NVIDIA RTX 5000 is built and what makes it cool: Blackwell architecture under the microscope

Published by Vladyslav Vasylenko

Last time we covered RTX 3000 Ampere and RTX 4000 Ada. Today it’s time to take a look at Blackwell’s new RTX 5000. We’ll go over the features of the new generation of graphics cards (GDDR7, DLSS 4), their SM blocks, and the new technologies in the Tensor Cores. We’ll also say a few words about the RTX 5070.

Features of the NVIDIA RTX 5000 Blackwell

The main features of the NVIDIA RTX Blackwell graphics cards are:

  • New SM capabilities. The new RT Core and Tensor Core features improve and accelerate neural rendering. The SMs also provide twice the integer math throughput per clock cycle compared to RTX 4000 Ada GPUs.
  • New 4th-generation RT cores. Blackwell brings significant improvements to the RT core architecture, enabling new ray-tracing and neural-rendering technologies.
  • New 5th-generation Tensor Cores – include new FP4 capabilities that can double AI throughput while halving memory requirements. Also included is support for the new second-generation FP8 Transformer Engine, which is used in data centers.
  • NVIDIA DLSS 4. The Blackwell architecture supports AI multi-frame generation, which increases the frame rate by up to 2 times compared to the previous DLSS 3/3.5, while maintaining or even exceeding the original image quality and keeping system latency low.
  • AI Management Processor (AMP) – enables AI models for conversation, translation, vision, animation, behavior, and more to use the GPU in parallel with graphics workloads.
  • GDDR7 memory – a new ultra-low-voltage GDDR memory standard that uses PAM3 (Pulse Amplitude Modulation) signaling, enabling faster memory subsystems and improved power efficiency.
  • Mega Geometry – a new RTX technology aimed at dramatically increasing the geometric detail possible in ray-tracing applications.

NVIDIA RTX 5000 Blackwell architecture

GB202 chip and SM blocks

The GB202 chip is the new flagship processor in the current generation of consumer graphics cards. So far it is only available in one new graphics card, the GeForce RTX 5090. The GB203 GPU is used in the GeForce RTX 5080 and GeForce RTX 5070 Ti, and GB205 in the GeForce RTX 5070. These GPUs are based on the same basic architecture and are tailored to different market segments.

I will say a few words about GB205 separately below.

The full GB202 GPU includes 12 graphics processing clusters (GPCs), 96 texture processing clusters (TPCs), 192 streaming multiprocessors (SMs), and a 512-bit memory interface with sixteen 32-bit memory controllers. The GPC is the dominant high-level hardware unit in all Blackwell GB20x GPUs, with all key graphics units residing inside the GPCs. Each GPC includes a dedicated Raster Engine, two Raster Operations Partitions (ROPs), with each partition containing eight individual ROP units, and eight TPCs. Each TPC includes one PolyMorph Engine and two SMs.

However, although the RTX 5090 uses GB202, the chip is slightly cut down: one GPC is disabled.

                 GB202 (full)    GB202 (RTX 5090)
GPC              12              11
TPC              96              85
SM blocks        192             170
CUDA cores       24576           21760
RT cores         192             170
Tensor cores     768             680
L2 cache         128 MB          96 MB
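The table's totals follow directly from the per-unit ratios described above. A minimal sketch deriving the full GB202 counts from its hierarchy:

```python
# Deriving unit counts for the full GB202 from the hierarchy in the text.
GPC = 12            # graphics processing clusters
TPC_PER_GPC = 8     # texture processing clusters per GPC
SM_PER_TPC = 2      # streaming multiprocessors per TPC
CUDA_PER_SM = 128   # CUDA cores per SM
RT_PER_SM = 1       # 4th-generation RT cores per SM
TENSOR_PER_SM = 4   # 5th-generation Tensor Cores per SM

tpcs = GPC * TPC_PER_GPC            # 96
sms = tpcs * SM_PER_TPC             # 192
cuda_cores = sms * CUDA_PER_SM      # 24576
rt_cores = sms * RT_PER_SM          # 192
tensor_cores = sms * TENSOR_PER_SM  # 768

print(tpcs, sms, cuda_cores, rt_cores, tensor_cores)
```

Running the same arithmetic with 11 GPCs does not land exactly on the RTX 5090's 85 TPCs, which shows the cut-down chip also disables a few TPCs within the remaining clusters.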


Image №1. GPU GB202. Author: Nvidia.

Each SM unit consists of: 128 CUDA cores, one fourth-generation RT core, four fifth-generation Tensor Cores, 4 texture units, a 256 KB register file, and 128 KB of L1/shared memory. Note that the number of possible INT32 integer operations in Blackwell is doubled compared to Ada due to the complete unification of INT32 with the FP32 cores, as shown in Figure №2 below. However, each unified core can operate as either an FP32 or an INT32 core in any given clock cycle, not both.

Image №2. SM block in the Blackwell architecture. Author: Nvidia.
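A simplified per-SM, per-clock throughput model makes the "2x integer" claim concrete. The Ada split below (half the cores FP32-only, half FP32-or-INT32) is an assumption based on prior NVIDIA SM designs, used purely for illustration:

```python
# Simplified per-SM, per-clock issue limits (illustrative model, not a spec).
CORES_PER_SM = 128

# Ada (assumed layout): 64 FP32-only cores + 64 cores that do FP32 or INT32.
ada_fp32_max = 128
ada_int32_max = 64

# Blackwell: all 128 cores are unified; each does FP32 or INT32 per clock.
blackwell_fp32_max = 128
blackwell_int32_max = 128

# The unification doubles peak INT32 throughput without adding cores.
print(blackwell_int32_max / ada_int32_max)  # 2.0
```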

Let me remind you once again that I wrote a separate post about RTX 4000 Ada. More details can be found in that article. However, even now it is clear that the new generation will be better suited to new neural shaders and AI workloads.

Image №3. Comparison of SM blocks of two generations. Author: Nvidia.

What’s wrong with the RTX 5070?

Let’s address the first thing you might have thought: no, the RTX 5070 is not equal to the RTX 4090. The fact that this claim was made and shown at the presentation is complete nonsense. The card is simply too weak for that. Of course, if you enable DLSS 4 with frame generation, the fps may match. However, the input latency and image artifacts cannot be undone.

Images №4 and №5. A slide from the presentation and a fitting joke. Author: NVIDIA, IndianGaming.

However, something else is interesting. Here is a table of the basic characteristics of the RTX 3070/4070/5070.

                     RTX 3070        RTX 4070        RTX 5070
GPU chip             GA104           AD104           GB205
GPC                  6               5               5
TPC                  23              23              24
SM blocks            46              46              48
CUDA cores           5888            5888            6144
Tensor cores         184             184             192
RT cores             46              46              48
Boost clock (MHz)    1725            2475            2512
Memory bus           256-bit         192-bit         192-bit
Video memory         8 GB GDDR6      12 GB GDDR6X    12 GB GDDR7

So we can see something strange: the name of the graphics chip stands out. This is GB205. That trailing 5 breaks the pattern, because there was never such a designation before; mid-range chips always ended in 4 or 6. So I think the RTX 5070 was originally meant to be the RTX 5060 or RTX 5060 Ti.

For those who are interested, the 5070 Ti uses the same chip as the RTX 5080: GB203. It turns out there is no intermediate GB204, which presumably should have powered the RTX 5070.


GDDR7 memory subsystem

Blackwell graphics cards come with the new GDDR7 video memory, an ultra-low voltage standard that utilizes PAM3 signaling technology to provide a significant advancement in high-speed memory design. NVIDIA’s collaboration with the JEDEC technology association helped create PAM3 (Pulse Amplitude Modulation with Three Levels). It is the fundamental high-frequency signaling technology for GDDR7 DRAM.

Image №6. Comparison of GDDR6X and GDDR7. Author: Nvidia.

The transition from PAM4 (4 levels transmit 2 bits per cycle) in GDDR6X to PAM3 (3 levels transmit 1.5 bits per cycle) in GDDR7, combined with an innovative pin-coding scheme, enables GDDR7 to achieve a significantly improved signal-to-noise ratio (SNR). This evolution also doubles the number of independent channels with minimal cost in terms of I/O density.
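The bits-per-symbol figures above follow from the signal levels. A three-level signal can theoretically carry log2(3) ≈ 1.585 bits per symbol; GDDR7's encoding uses 3 bits over 2 symbols, i.e. 1.5, trading a little density for the better SNR:

```python
import math

# Bits per symbol for each signaling scheme.
pam4_bits = math.log2(4)   # PAM4: 4 levels -> 2.0 bits per symbol (GDDR6X)
pam3_max  = math.log2(3)   # PAM3 theoretical maximum: ~1.585 bits per symbol
pam3_used = 3 / 2          # GDDR7 encodes 3 bits over 2 symbols -> 1.5

print(pam4_bits, pam3_max, pam3_used)
```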

And no, it’s not a quantum computer or quantum memory.

With increased channel densities, improved PAM3 SNR, advanced equalization schemes, an updated clocking architecture, and improved I/O training, GDDR7 delivers significantly higher throughput. The GeForce RTX 5090 comes with 28 Gbps GDDR7 memory and delivers a peak memory bandwidth of 1.792 TB/s, while the GeForce RTX 5080 comes with 30 Gbps GDDR7 memory, delivering a peak memory bandwidth of 960 GB/s.
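Both bandwidth figures follow from the bus width and per-pin data rate. A minimal sketch of the arithmetic:

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: pins x per-pin rate, divided by 8 bits/byte."""
    return bus_width_bits * data_rate_gbps / 8

rtx5090 = peak_bandwidth_gbs(512, 28)  # 512-bit bus, 28 Gbps -> 1792 GB/s (1.792 TB/s)
rtx5080 = peak_bandwidth_gbs(256, 30)  # 256-bit bus, 30 Gbps -> 960 GB/s
print(rtx5090, rtx5080)
```

Note the 5090's advantage comes mostly from its 512-bit bus; the 5080 actually uses faster memory chips per pin.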

Blackwell 5th generation tensor cores

Blackwell Tensor Cores support operations with FP4, FP6, FP8, INT8, FP16, BF16, and TF32. FP4 support in particular is driven by the need to run new generative AI models. These models raise the requirements for compute and memory, and running them can be difficult even on the latest hardware.

FP4 is a lower-precision quantization format, similar in spirit to file compression, that reduces model size. Compared to FP16 (the default for most models), FP4 requires less than half the memory, and Blackwell GPUs deliver up to 2x the performance of the previous generation. Thanks to the advanced quantization techniques in NVIDIA TensorRT Model Optimizer, FP4 incurs virtually no loss of quality.
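The memory math is simple to sketch. Raw weight storage in FP4 is a quarter of FP16; in practice some layers stay at higher precision, which is why the realistic saving is "less than half" rather than 4x. The 12-billion-parameter model below is a hypothetical example, not a specific product:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint in GB (ignores activations and overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Hypothetical 12B-parameter model, weights only:
fp16_gb = model_memory_gb(12, 16)  # 24.0 GB
fp4_gb  = model_memory_gb(12, 4)   # 6.0 GB
print(fp16_gb, fp4_gb)
```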

Blackwell 4th generation RT cores

The RTX 2000 Turing, RTX 3000 Ampere, and RTX 4000 Ada GPUs include dedicated hardware units that accelerate traversal of the Bounding Volume Hierarchy (BVH) data structure and perform both ray-triangle and ray-bounding-box intersection tests. Ray intersection testing is an expensive operation performed at very high rates when rendering a ray-traced scene. The fourth-generation RT core provides double the ray-triangle intersection throughput of Ada.

Image №7. New fourth generation RT cores. Author: Nvidia.

The RT cores found in both Ada and Blackwell GPUs include a dedicated unit known as the Opacity Micromap Engine. It evaluates opacity micromaps: simple triangular meshes, defined in a barycentric coordinate system, that are used when reporting ray-triangle intersections.

The other two blocks (Triangle Cluster Intersection Engine and Triangle Cluster Compression Engine) are required for a new technology called Mega Geometry. This is a new RTX technology aimed at dramatically increasing the geometric detail possible in ray-tracing applications. Specifically, Mega Geometry enables game engines such as Unreal Engine 5, which use modern Level-of-Detail (LOD) systems such as Nanite, to ray trace their geometry with full accuracy. Engines no longer have to fall back on low-resolution proxies for ray-tracing effects, which brings new levels of quality to shadows, reflections, and indirect lighting.

Various kinds of curve primitives are commonly used to represent hair, fur, grass, and other strand-like objects. For ray tracing, these primitives are usually implemented in software using special intersection shaders. However, ray-curve intersection is computationally intensive, which limits the use of curves in real-time ray tracing and increases render times for offline renderers.

Blackwell’s RT core introduces hardware support for ray intersection with a new primitive called Linear Swept Spheres (LSS). An LSS is similar to a tessellated curve but is constructed by sweeping spheres along linear segments in space. The sphere radius can vary between the start and end points of each segment, allowing flexible approximation of different strand types. For common use cases, such as rendering human hair, LSS is about 2 times faster and requires about 5 times less VRAM to store the geometry.

Image №8. Linear Swept Spheres (LSS). Author: Nvidia.
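The geometric building block of LSS is the ray-sphere test, which the hardware effectively sweeps along each linear segment. A minimal sketch of that building block (a simplified illustration, not NVIDIA's actual intersector):

```python
import math

def ray_sphere_t(origin, direction, center, radius):
    """Nearest positive ray parameter t of a ray-sphere hit, or None.
    `direction` is assumed to be normalized. Solves |o + t*d - c|^2 = r^2."""
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    b = 2 * (direction[0] * ox + direction[1] * oy + direction[2] * oz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4 * c            # discriminant of the quadratic in t
    if disc < 0:
        return None                 # ray misses the sphere entirely
    t = (-b - math.sqrt(disc)) / 2  # nearer of the two roots
    return t if t > 0 else None

# Ray from the origin along +z hits a unit sphere centered at (0, 0, 5) at t = 4.
hit = ray_sphere_t((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 5.0), 1.0)
miss = ray_sphere_t((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 5.0), 1.0)
```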

AI Management Processor (AMP)

The AI Management Processor (AMP) is a fully programmable GPU context scheduler designed to offload context scheduling from the CPU to the GPU. AMP improves GPU context scheduling on Windows to better manage the various workloads running on the GPU. A GPU context encapsulates all the state information the GPU needs to perform one or more tasks. Multiple contexts ensure that multiple applications can use the GPU simultaneously without conflicts. An example is the coordination and scheduling of asynchronous AI model workloads, such as NVIDIA Avatar Cloud Engine (ACE) with its speech, translation, vision, animation, and behavior models, as well as G-Assist, all running concurrently with other GPU graphics workloads.

AMP is implemented as a dedicated RISC-V processor located at the front end of the GPU and provides faster scheduling of GPU contexts with lower latency than previous CPU-driven methods. Blackwell AMP’s scheduling architecture follows Microsoft’s architectural model, which describes a customizable scheduling kernel on the GPU using Windows Hardware-Accelerated GPU Scheduling (HAGS), introduced in Windows 10 (the May 2020 update!). HAGS allows the GPU to manage its own memory more efficiently, reducing latency and potentially improving performance in games and other graphics-intensive applications.

Image №9. AMP schedules different tasks. Author: Nvidia.

AMP’s role is to take charge of scheduling GPU tasks, reducing reliance on the CPU, which is often the bottleneck for game performance. By letting the GPU manage its own task queue, latency can drop because fewer round trips between the GPU and CPU are needed. The result is smoother frame rates in games and better multitasking in Windows, since less load falls on the CPU.
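To make the idea of context scheduling concrete, here is a toy round-robin scheduler that interleaves a graphics context with an AI workload. This is purely illustrative; AMP's actual scheduling policy is NVIDIA's own and not public:

```python
from collections import deque

def schedule(contexts, timeslice, total_time):
    """Toy round-robin scheduler: each (name, remaining_work) context gets
    one timeslice per turn until its work is done or time runs out."""
    queue = deque(contexts)
    order = []
    t = 0
    while queue and t < total_time:
        name, remaining = queue.popleft()
        order.append(name)          # this context runs for one timeslice
        t += timeslice
        remaining -= timeslice
        if remaining > 0:
            queue.append((name, remaining))  # not finished; requeue
    return order

# Graphics and an AI model share the GPU without either starving the other.
run = schedule([("graphics", 3), ("ai_model", 2)], timeslice=1, total_time=10)
print(run)
```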

Deep Learning Super Sampling 4

And here we need to make a correction. NVIDIA now uses several technologies under the DLSS name. Namely:

  • DLSS (Deep Learning Super Sampling) – scaling an image from a lower resolution to the one required by the user;
  • MFG (Multi Frame Generation) – a generator of additional frames between real ones;
  • RR (Ray Reconstruction) – improves image quality in ray-traced scenes;
  • SR (Super Resolution) – also scales the image;
  • DLAA (Deep Learning Anti-Aliasing) – an image smoothing technology.

Therefore, it would be more accurate to replace the word Sampling in DLSS with Services. I think that would be much clearer.

Image №10. Support for new technologies in different RTX series. Author: Nvidia.

And for those who are too lazy to read further, NVIDIA has prepared a special video:

DLSS 4 Multi Frame Generation

Frame generation technology was first introduced in the Ada architecture in 2022. A single frame was generated between each pair of traditionally rendered frames, using an optical flow field together with the game’s motion vectors and an AI network. The Blackwell architecture enables DLSS Multi Frame Generation, which increases FPS by generating up to three additional frames for each traditionally rendered frame.

The new frame generation model is 40% faster, uses 30% less video memory, and requires only one run per rendered frame to generate multiple frames. Optical flow field generation has been accelerated by replacing the hardware optical flow accelerator with a highly efficient AI model.

Image №11. An example of MFG’s work. Author: Nvidia.
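The displayed frame rate with frame generation is easy to model as an upper bound: each rendered frame is followed by N generated ones. The 60 fps base rate below is just an illustrative number:

```python
def displayed_fps(rendered_fps: float, generated_per_rendered: int) -> float:
    """Upper-bound displayed FPS with frame generation (ignores pacing overhead)."""
    return rendered_fps * (1 + generated_per_rendered)

dlss3 = displayed_fps(60, 1)  # Ada: one generated frame  -> 120
dlss4 = displayed_fps(60, 3)  # Blackwell MFG: up to three -> 240
print(dlss3, dlss4)
```

The 240/120 ratio matches the article's "up to 2x over DLSS 3" claim; note the 60 rendered frames per second, and hence the input latency, stay the same.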

Transformer models in DLSS 4

DLSS is moving to an entirely new neural network architecture, and it brings many benefits. The ability of artificial intelligence to classify images was revolutionized by a technology called the Convolutional Neural Network (CNN). CNNs work by aggregating pixels locally and analyzing the data hierarchically, from the lowest level to the highest.

DLSS 4 improves image quality and rendering smoothness by introducing more powerful Transformer-based AI models for DLSS Super Resolution, DLSS Ray Reconstruction, and Deep Learning Anti-Aliasing (DLAA), trained on NVIDIA supercomputers to better understand and render complex scenes. Neural networks using the Transformer architecture excel at tasks involving sequential and structured data. The idea behind Transformer models is that where computation is spent should be driven by the data itself: the network learns to direct its attention to the parts of the data that are most useful for the decision at hand.

Compared to CNNs, Transformers use self-attention and can more easily identify long-range patterns across a much larger pixel window. Transformers also scale more efficiently, allowing the models used for DLSS 4 to take on twice as many parameters and to use more Tensor Core processing power to reconstruct images with even better quality for all RTX owners. The result is improved stability from one frame to the next, enhanced lighting detail, and more detail in motion. Changing the neural network architecture from CNN to Transformer has brought a significant improvement in image quality in many scenarios.
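Self-attention can be sketched in a few lines. The toy below uses identity query/key/value projections for brevity (real models learn these): every position attends to every other, which is exactly the long-range behavior a CNN's local windows lack:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal scaled dot-product self-attention with identity projections.
    Each output row is a data-driven weighted mix of ALL input rows."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x                               # blend the value rows

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(tokens)  # each row blends all three inputs
```

Since each output is a convex combination of the inputs, every position "sees" the whole sequence in one step, no matter how far apart the elements are.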

DLSS Super Resolution (SR)

SR improves performance by using artificial intelligence to derive higher-resolution frames from lower-resolution ones. DLSS samples multiple lower-resolution frames and uses motion data and feedback from previous frames to construct high-quality images. The output of the Transformer model is more stable over time, with fewer ghosting artifacts, more detail in motion, and improved anti-aliasing compared to previous versions of DLSS.

Image №12. An example of Super Resolution work. Author: Nvidia.

DLSS Ray Reconstruction (RR)

RR improves image quality by using artificial intelligence to create additional pixels for intense ray-traced scenes. DLSS replaces hand-tuned denoisers with an AI network, trained on an NVIDIA supercomputer, that generates higher-quality pixels between the sampled rays. In heavy ray-traced content, the Transformer-based RR model gets an even bigger quality boost, especially in scenes with complex lighting. In practice, all the usual artifacts of typical denoisers are significantly reduced.

Image №13. An example of Ray Reconstruction work. Author: Nvidia.

Deep Learning Anti-Aliasing (DLAA)

DLAA delivers superior image quality with AI-based anti-aliasing technology. DLAA uses the same Super Resolution technology developed for DLSS, producing a more realistic, higher-quality image at native resolution.


And don’t forget to come and read our overview of the RTX 5080!

Author: NVIDIA.