Researchers from the Weizmann Institute of Science and Intel Labs have presented new algorithms that allow different AI models to join forces and work together, increasing efficiency and reducing costs.
Every large language model (LLM) has its own unique internal "language" of tokens, and until recently different models could not interact with each other directly. A series of new algorithms proposed by the researchers removes this limitation, allowing users to harness the combined computing power of several models and speeding up their work by a factor of 1.5.
Powerful LLMs such as ChatGPT or Gemini can perform a wide variety of complex tasks, but on their own they remain slow and consume large amounts of computing power. In 2022, tech companies realized that AI models could be more productive and powerful if they worked together.
This led to a method called "speculative decoding": a small, fast language model with relatively limited knowledge produces the first draft of a response to a user's query, and a larger, more powerful LLM then verifies the draft and corrects it where necessary.
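To make the idea concrete, here is a minimal sketch of the speculative-decoding loop (an illustrative reconstruction, not the researchers' code). The `draft.next_token` and `target.next_token` helpers are hypothetical stand-ins for a single greedy decoding step of each model:

```python
def speculative_decode(target, draft, tokens, k=4, max_new_tokens=64):
    """Greedy speculative decoding, sketched for clarity.

    `draft.next_token(seq)` and `target.next_token(seq)` are assumed
    helpers returning each model's greedy next token for a sequence.
    """
    produced = 0
    while produced < max_new_tokens:
        # 1. The small, fast draft model proposes k tokens in a row.
        proposal = []
        for _ in range(k):
            proposal.append(draft.next_token(tokens + proposal))

        # 2. The large target model checks the proposal and keeps the
        #    longest prefix it agrees with; the first disagreement is
        #    replaced by the target's own token. (In real systems this
        #    check is one batched forward pass, which is the source of
        #    the speedup; it is unrolled here for readability.)
        accepted = []
        for tok in proposal:
            expected = target.next_token(tokens + accepted)
            accepted.append(expected)
            if tok != expected:
                break

        tokens = tokens + accepted
        produced += len(accepted)
    return tokens
```

Because every kept token ultimately comes from the target model's own predictions, the final text is exactly what the large model would have generated by itself.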
This approach preserves 100% accuracy: unlike alternative acceleration methods that degrade output quality, the final answer is identical to what the large model would have produced on its own. The method had one limitation, however: both language models had to use the same digital language, so models from different companies could not be combined.
"The tech giants have embraced speculative decoding, benefiting from higher performance and saving billions of dollars a year in computing power costs, but only they have had access to smaller, faster models that speak the same language as the larger models. A startup looking to benefit from speculative decoding would have to train its own small model to match the language of the big model, which requires a lot of expertise and expensive computing resources," explains study leader Nadav Timor, a PhD student in Professor David Harel's research group at the Department of Computer Science and Applied Mathematics at the Weizmann Institute of Science.
The new algorithms make it possible to combine any small AI model with any large one. The first algorithm lets an LLM translate its output from its internal language of tokens into a shared format that all models can understand. The second encourages collaborating models to rely on tokens that carry the same meaning across all of them.
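One plausible picture of the translation step (an assumption based on the description above, not necessarily the paper's exact procedure) is a round trip through plain text, which every model understands: draft tokens are decoded with the drafter's tokenizer and re-encoded with the target's. With Hugging Face tokenizers the round trip looks like this; the checkpoint names are merely examples of two models with different vocabularies:

```python
from transformers import AutoTokenizer

def translate_draft_tokens(draft_ids, draft_tokenizer, target_tokenizer):
    """Map token IDs from the drafter's vocabulary into the target's,
    using plain text as the shared intermediate format."""
    text = draft_tokenizer.decode(draft_ids, skip_special_tokens=True)
    return target_tokenizer.encode(text, add_special_tokens=False)

# Two tokenizers with different vocabularies (example checkpoints).
draft_tok = AutoTokenizer.from_pretrained("gpt2")
target_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

ids = draft_tok.encode("Speculative decoding speeds up generation.")
print(translate_draft_tokens(ids, draft_tok, target_tok))
```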
"Initially, we were worried that too much information would be 'lost in translation' and that the different models would not be able to interact effectively. But we were wrong. Our algorithms speed up LLMs by up to 2.8 times, resulting in significant computing power savings," notes Nadav Timor.
Over the past few months, the team has published its algorithms on Hugging Face Transformers, the open-source AI platform, making them available to developers around the world. Since then, the algorithms have become part of the standard toolkit for running AI efficiently.
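In recent Transformers releases this capability appears as "universal assisted generation": `generate()` accepts a small assistant (draft) model even when it uses a different tokenizer, provided both tokenizers are passed in. A minimal sketch, assuming a recent Transformers version; the checkpoints are examples only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
target_ckpt = "google/gemma-2-9b"      # large target model
assistant_ckpt = "double7/vicuna-68m"  # small draft model with a different tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_ckpt)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_ckpt)
model = AutoModelForCausalLM.from_pretrained(target_ckpt)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_ckpt)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,          # turns on speculative decoding
    tokenizer=tokenizer,                      # both tokenizers are required
    assistant_tokenizer=assistant_tokenizer,  # when the vocabularies differ
    max_new_tokens=32,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```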
The results of the study have been published on the preprint server arXiv.
Source: TechXplore