
Chain-of-Zoom AI zooms in on photos up to 256 times without losing detail

Published by Oleksandr Fedotkin

Researchers from South Korea have developed an AI-based tool called Chain-of-Zoom that can magnify low-resolution photos up to 256 times while generating realistic detail.

Scientists at the Kim Jaechul Graduate School of AI at KAIST set out to improve the quality of low-resolution photos while keeping the detail clear and realistic. Traditional single-image super-resolution systems work by guessing the details that are missing from an image when it is enlarged.

Generative models are trained to create realistic versions of low-resolution photos by predicting the details that are missing from the image. However, how well such models perform depends on the scale factors they were trained on, and they often break down when pushed beyond those limits.

"Current models are excellent relative to the scale factors they were trained on, but fail when asked to zoom in on an image that goes beyond that range," the developers at KAIST AI explain.

Chain-of-Zoom overcomes these limitations with a step-by-step scaling process. It does not stretch the image 256 times in one go, which would leave the result blurry and full of invented detail. Instead, Chain-of-Zoom zooms in gradually, building on each previous step and using a super-resolution model, such as a well-trained diffusion model, to refine the image at every stage.

In addition, a vision-language model (VLM) takes part by generating text prompts that help Chain-of-Zoom anticipate what should appear in the image at the next step. The VLM selects several precise phrases, such as "leaf veins", "fur texture" or "brick wall", which guide the AI toward further detail.
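
The loop below is a minimal sketch of this idea, not the authors' code: extract_prompt and sr_step are hypothetical placeholders standing in for the vision-language model and the prompt-conditioned diffusion super-resolution model, and the zoom step simply resizes the image so the script runs end to end.

```python
# Minimal sketch of the Chain-of-Zoom idea (not the authors' code).
# extract_prompt and sr_step are hypothetical placeholders standing in for
# the vision-language model and the prompt-conditioned diffusion SR model.

from PIL import Image


def extract_prompt(image: Image.Image) -> str:
    """Placeholder VLM: describe what should appear at the next zoom level."""
    return "fine texture, sharp edges"  # e.g. "leaf veins", "fur texture"


def sr_step(image: Image.Image, prompt: str, factor: int = 2) -> Image.Image:
    """Placeholder SR model: a plain resize, so the script runs end to end."""
    w, h = image.size
    return image.resize((w * factor, h * factor))


def chain_of_zoom(image: Image.Image, total_factor: int = 256, step: int = 2) -> Image.Image:
    """Reach a large total magnification through a chain of small zoom steps."""
    factor = 1
    while factor < total_factor:
        prompt = extract_prompt(image)        # text cue for the next refinement
        image = sr_step(image, prompt, step)  # modest, well-trained zoom step
        factor *= step
    return image


if __name__ == "__main__":
    low_res = Image.new("RGB", (16, 16), "gray")  # stand-in for a low-resolution photo
    print(chain_of_zoom(low_res, total_factor=256).size)  # (4096, 4096)
```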

As the scale increases, the original image loses clarity and the visual context becomes difficult to recognize. At this point, the text prompts play a crucial role. Generating the right prompts, however, is not easy: standard language models can be repetitive, produce strange phrases, and misinterpret the input.

Image: KAIST AI

To optimize this process, the researchers used reinforcement learning with human feedback. They trained their prompt-generation model to match human preferences using a technique called Generalized Reward Policy Optimization (GRPO).

The training was based on three types of feedback, combined into a single reward signal (a sketch follows the list below):

  • A human critic rated the prompts generated by the language model for relevance to the image;
  • The language model was penalized for confusing or malformed phrases;
  • A separate filter removed repetitive text.
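
As a rough illustration only, the snippet below shows how three such signals could be folded into one scalar reward for reinforcement learning. The helper functions and weights are assumptions made for this example, not taken from the paper.

```python
# Rough illustration (not the authors' implementation) of folding the three
# feedback signals above into one scalar reward for RL fine-tuning.
# The helper functions and weights are assumptions made for this example.

def critic_score(prompt: str) -> float:
    """Placeholder for a human / critic-model relevance rating in [0, 1]."""
    return 0.8


def fluency_penalty(prompt: str) -> float:
    """Penalize confusing or malformed phrases (toy heuristic)."""
    return 0.5 if any(ch.isdigit() for ch in prompt) else 0.0


def repetition_penalty(prompt: str) -> float:
    """Penalize repeated words, standing in for the repetition filter."""
    words = prompt.lower().split()
    return 0.0 if len(words) == len(set(words)) else 0.3


def total_reward(prompt: str) -> float:
    """Combined reward used to update the prompt-generation model."""
    return critic_score(prompt) - fluency_penalty(prompt) - repetition_penalty(prompt)


print(total_reward("leaf veins, fur texture"))  # 0.8
```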

As the training progressed, the prompts became clearer, more specific, and more useful. The results of Chain-of-Zoom were evaluated using several reference-free quality metrics such as NIQE and CLIPIQA. At four levels of magnification (4×, 16×, 64×, 256×), CoZ consistently outperformed the alternatives, especially at higher scales.
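
For readers who want to try this kind of reference-free scoring themselves, the short sketch below assumes the third-party pyiqa package, which provides NIQE and CLIPIQA implementations; it is not the authors' evaluation script.

```python
# Reference-free scoring with NIQE and CLIPIQA, assuming the third-party
# pyiqa package (pip install pyiqa); not the authors' evaluation script.

import torch
import pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
niqe = pyiqa.create_metric("niqe", device=device)         # lower is better
clipiqa = pyiqa.create_metric("clipiqa", device=device)   # higher is better

# A random tensor stands in for a zoomed image: (batch, channels, H, W) in [0, 1].
img = torch.rand(1, 3, 512, 512, device=device)

print("NIQE:", niqe(img).item())
print("CLIPIQA:", clipiqa(img).item())
```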

Another advantage of the tool is that the underlying super-resolution model does not need to be retrained. Chain-of-Zoom should therefore be convenient for applications that require fast, high-magnification zooming without heavy computing resources.

Chain-of-Zoom could be used in medicine, where improved detail can enhance diagnostic capabilities; in video surveillance, where high detail is required; in restoring old photos; and in scientific visualization, including microscopy and astronomy.

An important drawback of the technology is that after significant magnification little of the original photo effectively remains: what is left is largely an artificial copy generated by the AI. The technology could therefore be used to manipulate visual data and create fake images.

"High-quality generation from low-resolution input data can lead to concerns about misinformation or unauthorized reconstruction of sensitive visual data," the developers admit.

The results were published on the preprint server arXiv.

Source: ZMEScience