
On Tuesday, September 2, Tencent introduced a new AI model, HunyuanWorld-Voyager, capable of creating sequential 3D video from a single image.
Users can control the camera to explore the generated worlds. The model generates RGB video and depth information simultaneously, letting users edit details directly without traditional modeling tools.
However, HunyuanWorld-Voyager should not be expected to become a full-fledged alternative to traditional video games any time soon. The generated results are not real 3D models, though they achieve a similar effect.
Specifically, the AI generates 2D video frames that maintain spatial consistency, as if a camera were actually moving through three dimensions. Each generation produces 49 frames, about two seconds of video.
Tencent representatives note that several videos can be chained together for a total duration of several minutes. Objects remain in the same relative positions as the camera moves around them, and the perspective changes correctly, as expected in a real three-dimensional environment.
The end result is a video with depth maps rather than a true 3D model, but the output can be transformed into three-dimensional point clouds for reconstruction. As input, the system receives a single image and a user-specified camera trajectory.
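The depth-map-to-point-cloud step mentioned above is standard multi-view geometry: each pixel is back-projected along its camera ray by its depth value. A minimal sketch using a pinhole camera model follows; the intrinsic parameters here are illustrative placeholders, not values published for Voyager.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metric depth) into an (N, 3)
    point cloud using a pinhole camera model. fx, fy, cx, cy are
    assumed intrinsics, not Voyager's actual calibration."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # horizontal offset from the optical axis
    y = (v - cy) * z / fy          # vertical offset
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a 2x2 depth map of constant depth 1, one point per pixel.
pts = depth_to_point_cloud(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Tools such as Open3D or MeshLab can then render or mesh the resulting point cloud.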
Through the interface, users choose how the camera moves: forward and backward, left and right, or rotating in place. HunyuanWorld-Voyager combines the image and depth data with a “global cache” to create sequential videos that follow the chosen camera movement.
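A camera trajectory like the one users supply can be represented as a sequence of poses accumulated from simple movement commands. The sketch below is a generic illustration of that idea; the command names, step sizes, and pose convention are assumptions, not Voyager's actual interface.

```python
import numpy as np

def trajectory_from_commands(commands, step=0.1, turn=np.radians(5)):
    """Turn a list of commands ('forward', 'back', 'left', 'right',
    'turn_left', 'turn_right') into a sequence of 4x4 camera-to-world
    poses, composing one relative transform per command."""
    pose = np.eye(4)
    poses = [pose.copy()]
    for cmd in commands:
        T = np.eye(4)                      # relative transform for this step
        if cmd == "forward":
            T[2, 3] = step                 # +z is "ahead" in this convention
        elif cmd == "back":
            T[2, 3] = -step
        elif cmd == "left":
            T[0, 3] = -step
        elif cmd == "right":
            T[0, 3] = step
        elif cmd in ("turn_left", "turn_right"):
            a = turn if cmd == "turn_left" else -turn
            c, s = np.cos(a), np.sin(a)
            T[0, 0], T[0, 2] = c, s        # rotation about the vertical axis
            T[2, 0], T[2, 2] = -s, c
        pose = pose @ T                    # accumulate in the camera frame
        poses.append(pose.copy())
    return poses

path = trajectory_from_commands(["forward", "forward", "turn_left"])
```

Each pose in the list would then condition one segment of the generated video.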
The key limitation of all Transformer-based AI models is that they imitate patterns identified in their training data, which limits their ability to apply those patterns in new conditions that were not present in the training set.
To train HunyuanWorld-Voyager, the developers used more than 100,000 video clips, including scenes rendered in Unreal Engine. In this way, the model learned to simulate camera movement through a three-dimensional game environment.
Most AI-based video generators, such as Sora, create frames sequentially without tracking or maintaining spatial consistency. HunyuanWorld-Voyager, by contrast, is trained to recognize and reproduce patterns of spatial consistency, with the addition of geometric feedback.
As each frame is generated, the system converts the frame data into three-dimensional points and then projects those points back into two dimensions for use in future frames. This method forces the model to match its learned patterns against geometrically consistent projections of its own previous output. While this provides significantly better spatial consistency than traditional video generators, it is still pattern matching constrained by geometry rather than actual modeling of a three-dimensional environment.
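The "project back into two dimensions" step described above is the inverse of back-projection: world-space points are transformed into the next camera's frame and divided by depth. A minimal sketch, using generic multi-view geometry rather than Tencent's actual implementation:

```python
import numpy as np

def project_points(points, R, t, fx, fy, cx, cy):
    """Project (N, 3) world-space points into a camera with rotation R
    and translation t: pixel = K @ (R @ X + t). All parameter names and
    values are illustrative assumptions."""
    cam = points @ R.T + t          # world -> camera coordinates
    z = cam[:, 2:3]
    uv = cam[:, :2] / z             # perspective divide
    uv[:, 0] = uv[:, 0] * fx + cx   # apply focal length and principal point
    uv[:, 1] = uv[:, 1] * fy + cy
    return uv, z.ravel()

# One point one metre ahead; the camera shifts 0.1 m to the right,
# so the point moves left in the image, off the principal point (320, 240).
pts = np.array([[0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])
uv, depth = project_points(pts, R, t, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Conditioning the next frame on such reprojected pixels is what keeps previously seen geometry in a consistent screen position as the camera moves.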
HunyuanWorld-Voyager joins a growing collection of AI-based world generators, which also includes Genie 3, announced by Google in August this year. That model reportedly generates interactive worlds at 720p resolution and 24 frames per second from text prompts.
Dynamics Lab’s Mirage 2, meanwhile, offers in-browser world generation, allowing users to upload images and transform them into game environments steered by real-time text prompts. HunyuanWorld-Voyager, by comparison, is focused more on video production and 3D reconstruction workflows, with combined RGB and depth output.
HunyuanWorld-Voyager is an advanced version of the earlier HunyuanWorld 1.0, introduced in July this year. It is also part of Tencent’s broader Hunyuan family, which includes Hunyuan3D-2 for generating 3D models from text and HunyuanVideo for generating video.
To train HunyuanWorld-Voyager, the developers created software that automatically analyzes existing videos, tracking camera movement and calculating depth for each frame. The system processed over 100,000 video clips from real recordings and Unreal Engine renders.
Running the model requires significant computing power: at least 60 GB of VRAM for 540p output, with 80 GB recommended by Tencent for optimal results. Tencent has published the model weights on Hugging Face along with code that works on both single and multiple graphics cards.
The model does, however, carry significant license restrictions. Like other Hunyuan models, HunyuanWorld-Voyager is not available to users in the EU, the UK, and South Korea. In addition, commercial use by services with more than 100 million monthly active users requires a separate license from Tencent.
On the WorldScore benchmark, developed by researchers at Stanford, Voyager achieved the highest overall score of 77.62, followed by WonderWorld at 72.69 and CogVideoX-I2V at 62.15. The model excelled in object control (66.92), style consistency (84.89), and subjective quality (71.09), though it came second in camera control (85.95), behind WonderWorld’s 92.98.
While these self-reported benchmark results look promising, wider use still faces challenges due to the high computing requirements. For developers who need faster processing, the system supports parallel inference across multiple GPUs using the xDiT framework.
Source: ArsTechnica