06-04-2024 at 11:13

Intel introduces Lunar Lake: architecture with +18% performance, +50% faster Xe2 graphics and no Hyperthreading in P-cores

Andrii Rusanov

News writer

Intel has revealed details of the Lunar Lake processor architecture during the Intel Tech Tour 2024 presentation ahead of the company’s Computex 2024 keynote. The new processors will feature significant improvements in every aspect of the design. Lunar Lake processors are primarily designed for laptops, although many of the fundamental changes may be carried over to Arrow Lake chips for desktops.

Intel introduces the Lunar Lake architecture

Every component of the Lunar Lake architecture has been optimized to balance power and performance. The energy-efficient cores (E-cores) have improved the most: the new Skymont cores deliver a 38% increase in IPC (instructions per clock cycle) on integer workloads and a 68% increase on floating-point workloads. IPC also increased by 14% for the Lion Cove P-cores. With the new Xe2 integrated graphics, the performance of the integrated GPU will increase by 50%.

Lunar Lake has Intel’s new neural processor for artificial intelligence with 48 TOPS performance. In fact, the Lunar Lake platform has even more AI performance: it offers 120 TOPS in total once the CPU cores and the iGPU are counted as well.

Lunar Lake mobile processors are designed with a new methodology that focuses on energy efficiency as a top priority. This basic architecture will be used in future Intel products such as Arrow Lake and Panther Lake.

Intel has turned to competitor TSMC for its advanced 3 nm N3B process to create computing cores, embedded graphics and NPUs. TSMC’s N6 process is used for the controller, which contains external I/O interfaces. The only element manufactured by Intel is the 22FFL Foveros passive base tile. Intel claims that it chose TSMC for the best available process technology. However, the company designed the architecture to be easily portable to other processes.

The structure of the Lunar Lake SoC

Intel’s Lunar Lake processors will have four P-cores and four E-cores. The chip consists of two logic tiles, a compute tile (TSMC N3B) and a platform controller tile (N6), plus a non-functional stiffener, all placed on the Foveros 22FFL base tile. Intel has placed two LPDDR5X-8500 memory stacks directly on the processor package, in 16GB or 32GB configurations. The memory exchanges data over four 16-bit channels per chip at a transfer rate of up to 8.5 GT/s.
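As a rough check of what those memory figures imply, peak theoretical bandwidth is the transfer rate multiplied by the bus width in bytes. The sketch below assumes that the four 16-bit channels and the 8.5 GT/s rate apply to a single memory stack, as the wording above suggests. The function name and the two-stack total are illustrative assumptions, not figures from Intel.

```python
def peak_bandwidth_gb_s(transfers_per_s: float, channels: int, channel_bits: int) -> float:
    """Peak theoretical bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bus_bytes = channels * channel_bits / 8  # bytes moved per transfer
    return transfers_per_s * bus_bytes / 1e9


# Figures quoted in the article: four 16-bit channels at 8.5 GT/s.
per_stack = peak_bandwidth_gb_s(8.5e9, channels=4, channel_bits=16)

# Assumption: both stacks run in parallel at the same rate.
both_stacks = 2 * per_stack

print(f"per stack: {per_stack:.0f} GB/s, two stacks: {both_stacks:.0f} GB/s")
```

Under these assumptions a single stack peaks at about 68 GB/s. Real-world throughput will be lower, since this ignores refresh, command overhead and controller efficiency.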

The compute tile contains the CPU cores, the Xe2 GPU, and NPU 4.0. It also features a new 8 MB side cache that is shared between all compute units to improve hit rates and reduce data movement. Technically, it does not meet the definition of an L4 cache, as it is shared by all elements.

The removal of the power subsystem from the chip also added to the power savings. Overall, Intel claims a 40% reduction in consumption compared to Meteor Lake.

Performance cores

Lunar Lake’s P-cores provide an average IPC increase of 14%. However, Intel has taken an unexpected step in optimizing the cores: it has eliminated Hyperthreading and all the logic blocks that provided this function. Intel’s architects concluded that Hyperthreading, which increases IPC by ~30% in multi-threaded workloads, is less appropriate in a hybrid design that hands multi-threaded workloads to the more energy-efficient E-cores. Intel claims an overall performance improvement of 10% to 18% over Meteor Lake, depending on the chip’s operating power.

Intel has expanded the prediction unit by a factor of 8 compared to the previous architecture while maintaining accuracy. It also tripled the request bandwidth from the instruction cache to L2 and doubled the instruction fetch bandwidth, from 64 to 128 bytes per cycle. The decode throughput was increased from 6 to 8 instructions per cycle, and the micro-op cache was enlarged along with its read throughput. The micro-op queue was also increased from 144 to 192 entries.
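To put the front-end changes on a common scale, the short sketch below computes the ratio between the old and new values quoted above. The dictionary keys and the `scale` helper are illustrative names, and the ratios describe resource sizes only, not measured performance.

```python
# (before, after) values as quoted in the article
front_end = {
    "instruction fetch (bytes)": (64, 128),
    "decode width (instructions per cycle)": (6, 8),
    "micro-op queue (entries)": (144, 192),
}


def scale(before: float, after: float) -> float:
    """Return how many times larger the new value is than the old one."""
    return after / before


for name, (before, after) in front_end.items():
    print(f"{name}: {before} -> {after} ({scale(before, after):.2f}x)")
```

Fetch bandwidth doubles, while the decode width and the micro-op queue each grow by one third.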

The memory subsystem has a new L0 cache level. The architects completely redesigned the data cache, adding a 192 KB level between the existing L1 and L2; as a result, the old L1 was renamed L0. The extra level raises IPC and lets Intel enlarge the L2 cache without the added capacity increasing latency. The L2 cache therefore grows to 2.5 MB on Lunar Lake and 3 MB on Arrow Lake.

Energy-efficient cores

The Lion Cove P-cores have many improvements, but the Skymont E-cores promise even greater progress: a 38% increase in IPC on integer workloads and a 68% increase on floating-point workloads. This translates to up to 2x the single-threaded performance and up to 4x the performance on multi-threaded tasks. Intel has also doubled the throughput in AVX and VNNI vectorized workloads.

Intel has optimized the branch prediction engine by enabling parallel fetching of 96 bytes of instructions to feed the decode engine. Skymont cores can decode 9 instructions per clock cycle. The micro-op queue capacity has also been increased from 64 to 96 entries.

Intel set out to double vector performance by moving from two 128-bit FP and SIMD vector pipes to four in Skymont. Other improvements to the vector system are aimed at reducing latency and adding support for floating-point rounding.

The previous E-core clusters had a shared L2 cache of 2MB, which has now been increased to 4MB with double the L2 bandwidth. L1 to L1 transfer bandwidth has also been improved.

Interestingly, Intel provided a comparison of Skymont and the Raptor Lake P-core, which uses the Raptor Cove architecture. The company claims a 2% advantage for Skymont in integer and floating-point workloads.

Built-in Intel Xe2 graphics

The new Xe2 GPU delivers up to 1.5 times the performance of Meteor Lake’s Arc Graphics and up to 67 TOPS of AI performance. Intel has simplified the naming of the GPU and will simply refer to it as Xe2 in all configurations, as opposed to the previous generation’s Xe-LP, Xe-HP, and Xe-HPG suffixes.

Intel’s Xe2 architecture will appear not only in Lunar Lake processors, but also in future Battlemage gaming graphics cards. However, Lunar Lake uses transistors with lower power, while Battlemage will use faster transistors to maximize performance. This means that the performance of Lunar Lake cannot be directly extrapolated to Battlemage graphics cards.

The Xe2 architecture includes the second-generation Xe core, support for more data types, improved vector engines, larger ray tracing units, and a larger cache. The GPU is divided into second-generation Xe cores and rendering units, as well as fixed-function units for tasks such as geometry processing, texture sampling, and rasterization. These units are connected to a large cache with an I/O unit that varies from implementation to implementation. The design is modular, so it can be easily scaled to more or fewer elements.

The second-generation Xe core can perform eight 512-bit multiplications per clock cycle in XVE vector engines and eight 2048-bit vectors per clock cycle in XMX engines. Intel has also increased the width of the SIMD engine from 8 to 16 lanes, which will improve compatibility. The core has a shared L1 of 192 KB.

The second-generation vector engine supports INT2, INT4, INT8, FP16, and BF16 instructions for AI operations. You can also see a table with calculations for peak TOPS (Ops/clock) in the album above. The Meteor Lake GPU did not have the XMX engine, so laptops with Xe2 will gain a lot in AI performance. The visualization unit also received many accelerations and improvements.

The Lunar Lake video chip features 8 second-generation Xe cores, 64 vector engines, two geometry pipelines, eight ray tracing units, and 8 MB of L2 cache, among other components. Intel says that the iGPU delivers 1.5 times the performance of the Meteor Lake-U at the same power. However, the Lunar Lake GPU has lower power transistors for better efficiency.
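The unit counts quoted for the iGPU can be cross-checked with simple arithmetic. The sketch below uses the article's figures (8 Xe cores, 64 vector engines, 512-bit vector engines) and assumes 32-bit lanes; that lane width and all constant names are assumptions made for illustration.

```python
XE_CORES = 8             # second-generation Xe cores in the Lunar Lake iGPU
VECTOR_ENGINES = 64      # total vector engines quoted for the iGPU
VECTOR_WIDTH_BITS = 512  # width of each XVE vector engine
ELEMENT_BITS = 32        # assumption: 32-bit (FP32-sized) lanes

engines_per_core = VECTOR_ENGINES // XE_CORES
lanes_per_engine = VECTOR_WIDTH_BITS // ELEMENT_BITS
total_lanes = VECTOR_ENGINES * lanes_per_engine

print(f"vector engines per Xe core: {engines_per_core}")
print(f"32-bit lanes per vector engine: {lanes_per_engine}")
print(f"32-bit lanes across the iGPU: {total_lanes}")
```

The result of eight engines per core agrees with the per-core description above, and 16 lanes per engine is consistent with the move to a 16-lane SIMD width.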

The display engine supports resolutions of up to 8K60 HDR, three 4K60 HDR displays, as well as 1080p360 and 1440p360. Outputs include HDMI 2.1, DisplayPort 2.1 and eDP 1.5. The media processor supports decoding and encoding at up to 8K60 10-bit HDR and handles all major media standards, including the new H.266/VVC codec, although VVC is supported for decoding only.
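For a sense of scale, the raw pixel data rate of each display mode can be estimated as width × height × refresh rate × bits per pixel. The sketch below assumes common pixel dimensions for each mode name (for example 7680×4320 for 8K), 10-bit RGB at 30 bits per pixel, and no blanking intervals or compression. None of these assumptions come from Intel, and the function and dictionary names are illustrative.

```python
def raw_gbit_s(width: int, height: int, refresh_hz: int, bits_per_pixel: int = 30) -> float:
    """Uncompressed pixel data rate in Gbit/s, ignoring blanking and link overhead."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


# Assumed pixel dimensions for the modes named in the article.
modes = {
    "8K60": (7680, 4320, 60),
    "4K60": (3840, 2160, 60),
    "1440p360": (2560, 1440, 360),
    "1080p360": (1920, 1080, 360),
}

for name, (width, height, hz) in modes.items():
    print(f"{name}: {raw_gbit_s(width, height, hz):.1f} Gbit/s")
```

Under these assumptions 8K60 needs roughly 60 Gbit/s of raw pixel data, about four times the rate of a single 4K60 stream.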

NPU 4.0 and controller

The new NPU outperforms some recently introduced competitors with 48 TOPS performance. This dedicated unit is primarily designed to offload artificial intelligence tasks and save battery power. The GPU is responsible for the more demanding AI workloads with 67 TOPS performance, and the CPU provides another 5 TOPS. All together, Lunar Lake delivers 120 TOPS.
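The platform total is simply the sum of the three engines. The sketch below adds up the quoted figures and shows each engine's share. The variable names are illustrative, and the sum is a peak figure that assumes all three engines are counted together.

```python
# Peak AI throughput per engine, in TOPS, as quoted in the article
tops = {"NPU": 48, "GPU": 67, "CPU": 5}

total = sum(tops.values())
shares = {engine: value / total for engine, value in tops.items()}

print(f"platform total: {total} TOPS")
for engine, share in shares.items():
    print(f"{engine}: {share:.1%} of the total")
```

The NPU accounts for 40% of the headline 120 TOPS, with the GPU contributing the largest share.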

Key architectural components include 12 enhanced SHAVE DSPs, six neural computing engines, and a MAC and DMA engine. It provides twice the memory bandwidth of previous generation NPUs. It also has access to a shared 8MB side cache on the compute tile. Overall, Intel claims a 4x improvement in maximum performance at the same power compared to the previous generation.

The controller tile contains all the external I/O for the chip, including Wi-Fi 7 and Bluetooth 5.4, USB 3.0 and 2.0, Thunderbolt 4, and PCIe 4.0 and 5.0 interfaces. It also houses the memory controllers.

Intel guarantees that all Lunar Lake laptops will have at least two Thunderbolt 4 connection ports, with some models offering up to three. The interface also supports the new Thunderbolt Share feature. Wi-Fi 7 and Bluetooth 5.4 still require a CNVi module connected externally via the CNVi 3.0 interface. The new BE201 CRF module is 28% smaller.

Source: Tom's Hardware

