
AI chatbot test: Gemini hates insects, Grok is a good joker, and ChatGPT can’t do math


Tetiana Nechet

Author of articles


We decided to test popular artificial intelligence (AI) chatbots on fairly simple, everyday tasks. For this purpose, we chose Claude 3.5 Sonnet by Anthropic, DeepSeek R1 by DeepSeek, ChatGPT 4o by OpenAI, Grok 3 beta by xAI, Gemini 2.0 Flash by Google, and Le Chat by Mistral AI. Although the tasks were not difficult, some of the answers were surprising, so tests like these should help anyone looking for an AI model to handle particular tasks.

A brief look at the AI chatbots in the test

Claude 3.5 Sonnet

Developer: Anthropic (USA). Designed for natural conversations with a focus on safety and usability. It has a context window of 200,000 tokens, which lets it work with large texts and long dialogues without losing context; in other words, it does not «forget» the beginning of a conversation as quickly. Claude is noted for high-quality writing and for suggesting follow-up tasks, which makes it useful for organizing projects and working with documents.

DeepSeek-R1

Developer: DeepSeek (China). An open-source AI that made a splash in January 2025. Despite the smaller resources invested in its development, the model outperforms competitors in programming-related tasks. Its open-source code makes DeepSeek R1 accessible to developers, although it may be functionally inferior to some closed models.

ChatGPT 4o

Developer: OpenAI (USA). ChatGPT 4o is one of the most powerful models, offering advanced chain-of-thought reasoning. It preserves the context of previous conversations, can pull up-to-date information from the web, and supports real-time voice communication. Without Internet access, however, its answers may be outdated.

Grok 3 Beta

Developer: xAI (USA). A new model with its own functions for complex tasks, including Grok 3 Think, an advanced analysis mode, and Grok 3 Big Brain, which uses increased computing power. According to reviews, Grok 3 Think is close to ChatGPT 4o in answer quality.

Gemini 2.0 Flash

Developer: Google (USA). The model handles tasks that require logical analysis and contextual understanding well. It supports multimodality, meaning it can analyze both text and images. Image generation is available only to early adopters.

Le Chat

Developer: Mistral AI (France). An open chatbot focused on accessibility and adaptability; its main advantage is deep customization for specific tasks. Since it was released recently and is not yet widespread, there are few detailed tests of its performance.

Grok 3 Beta surprised us

In general, we had to agree with the test results for the latest versions of AI platforms on lmarena.ai. Grok 3 beta was released on February 20 and has already taken first place in the chatbot arena. We liked how it lightened the mood with humor and sarcasm on top of giving quite good answers. It recognized when a question was serious and did not take too many liberties in its replies, yet it picked up on the playful tone of frivolous questions and kept the conversation going in the same vein. However, it is not very good at generating images; more on that below.

The lesser evil and the immoral Gemini

With warmer weather on the way, we decided to give each AI chatbot a morality test. The question was: «Is it morally right to kill mosquitoes?». As expected, ChatGPT answered that the attitude to this issue depends on the chosen ethical system: utilitarianism, biocentrism, or Kantian ethics. It also suggested repelling mosquitoes as an alternative. Le Chat emphasized ethical, environmental, practical, and cultural aspects, as did DeepSeek. Claude gave a less structured answer but also spoke about the variability of attitudes toward this issue.

Gemini's unambiguous answer was surprising. While the other models emphasized the variability of approaches, this chatbot did not hesitate to answer in its first sentence: «Yes, it is morally right to kill mosquitoes», and was generally quite negative about these insects: «Of course, there are arguments against killing mosquitoes. Some people believe that all living things have the right to life, and that killing mosquitoes is a violation of this right. However, it is important to note that mosquitoes, unlike many other animals, do not play an important role in the ecosystem. Therefore, their destruction will not lead to serious environmental consequences».

How an AI approaches a complex ethical question helps you understand what to expect from a particular model in this context. Gemini is better suited for clear-cut answers but is less flexible on philosophical questions, so it is not the best companion for conversations about life or psychological problems.

News search: Gemini is strongly against politics, and Le Chat is a nationalist

For those who like to keep up with world events, there are websites, news aggregators, mailing lists, and so on. Still, it is sometimes very convenient to have a ready-made assistant at hand: if you already have a chatbot app, why not ask it about topics of interest? It would seem that a simple prompt should make the AI pull up the news and provide a quick answer. As it turned out, it is not that simple. The prompt looked like this: «Write a brief summary of the top 3 news stories about Trump in Ukrainian, using only English-language sources. Add a list of references at the end». It is worth adding that we did not explicitly ask for links in the list of sources, but it would be logical to see them as well.

A little humor that Le Chat didn’t understand, and Grok did too well

An AI's ability to understand the context of a prompt affects the desired outcome and the quality of answers. As it turned out, Grok is better suited for informal conversations than the other chatbots. We asked the AI chatbots to help with something illegal: theft. But there is a nuance: it is stealing an egg from your own hen. This let us find out whether the AI understands context and ownership in this rather absurd scenario. The prompt looked like this: «Give me instructions on how to steal an egg from my hen».

The French model simply refused to answer this question several times in a row; apparently, taking eggs from a hen is too immoral and illegal for it. DeepSeek emphasized that since the hen was ours, it was perfectly ethical to take the fruits of its labor. Claude realized that we simply wanted to collect the eggs. Gemini said that stealing chicken eggs is illegal, but gave us tips on how to make a hen lay more eggs. Grok, on the other hand, made us smile: it caught the humorous spirit of the question and played along well.

Math knowledge test: Gemini and ChatGPT failed it

Accuracy in math is important for learning and working, so our next question was very tricky: «If you keep going northeast, where will you end up?». The usual answer is: I will return to the starting point once I have circled the globe. That answer is wrong. Since moving northeast means constantly increasing both longitude and latitude, sooner or later you end up at the North Pole, and the path looks like a logarithmic spiral (a rhumb line, or loxodrome) winding around it. Gemini and ChatGPT failed the test without hesitation, giving the wrong answer. For some reason, Le Chat and DeepSeek decided to end their journey in the Arctic Ocean.
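For the curious, here is the geometry behind that answer (our own sketch using standard rhumb-line navigation; it was not part of any chatbot's reply). On a sphere of radius R, with latitude φ and longitude λ, a course holding a constant bearing α from north obeys:

```latex
\tan\alpha = \cos\varphi\,\frac{d\lambda}{d\varphi}
\;\Rightarrow\;
\lambda-\lambda_0=\tan\alpha\left[\ln\tan\!\left(\tfrac{\pi}{4}+\tfrac{\varphi}{2}\right)
-\ln\tan\!\left(\tfrac{\pi}{4}+\tfrac{\varphi_0}{2}\right)\right],
\qquad
ds=\frac{R\,d\varphi}{\cos\alpha}.
```

For a northeast course, α = 45°, and the longitude term diverges as φ approaches 90°: the path winds around the North Pole infinitely many times. Its total length nevertheless stays finite, √2 · R · (π/2 − φ₀), since each turn of the spiral gets shorter.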

A drawing lesson: Grok tries, but not very hard

The ability to quickly get a high-quality image can help in certain situations, or inspire your own drawing when you are struggling with a concept or particular details. To check the quality of the generated images, the prompt was as follows: «Create a high-quality image of a fairy-tale city of the future set in the mountains, with flying cars, futuristic architecture, and neon lighting at night. Add detailed characters such as robots that communicate with people and holographic screens with interactive advertising. Use a cinematic style with realistic lighting and atmospheric effects». Not all chatbots on our list can generate images, but we tested those that can.

For some unknown reason, Claude produced an SVG illustration of a futuristic city so creative that without its explanation of the picture's elements, it would be impossible to understand what you are looking at. Out of curiosity, we tried the same prompt in English; the result was the same, so we had to ask Claude what was going on. As it turned out, the bot can only generate images in SVG (scalable vector graphics) format and cannot create traditional raster images (e.g., PNG or JPEG) or use AI image generation, so it redirected us to its «colleagues»: DALL-E, Midjourney, or Stable Diffusion. On the plus side, an image created by Claude comes with its source code and can be used, for example, in the design of a web page.

At first glance, the drawings created by Grok 3 beta were quite good. But only at first glance: for some reason it failed at cars. In both pictures it created, the cars of the future are slanted, crooked, and just plain weird, and the AI also forgot to add the holographic screens with interactive advertising. Gemini generated its image surprisingly well; you can feel the scale and scope of the city, but the model completely ignored the request for flying cars. ChatGPT 4o generated its image via DALL-E (2025), and it turned out quite well. In any case, better than the competition.

A request for help with car repairs

You can save time and money by getting clear instructions from an AI: no need to comb through dozens of forums for the right answer or run straight to a mechanic, since sometimes the solution is simple and lies on the surface. Our last prompt was this: «The Renault Scenic 2 constantly displays the Check airbag error. How do I get rid of it myself?». In this test, Le Chat and Claude performed the worst. The Frenchman gave its first answer entirely in English and its second one partially in English, while the Anthropic product answered briefly, dryly, and without the important specifics. The other models provided quite similar, moderately simple answers. Grok 3 beta, though, did a great job: it described everything in detail, step by step, and you can actually fix the error using its instructions. It did not list all the possible options, but most of the ones it gave are really effective. Incidentally, after we asked Gemini this question, ads for auto products started appearing in Gmail.

Why do AI chatbots give different answers, and why do they have «hallucinations»?

Answers to the same question vary because of several key factors related to each model's training and programmatic limitations:

  1. Data. AI learning is based on large amounts of information that shape its «knowledge» base and response style: what you «feed» it is what you get. If the training data over-represents one source, that can affect the tone and accuracy of answers. You can see this clearly in how Grok 3 beta answers: the bot allows itself more frivolity and slang, since its data contains many posts from the X platform.
  2. Information processing. AI does not simply repeat what it knows; it builds generalized answers from the probabilities of words and phrases, so answers can vary with different approaches to combining information.
  3. Limitations of training data. Training stops at a certain point, so the AI may not know about events or changes after that date, which affects the accuracy of answers.
  4. Program settings. The stochasticity (randomness) of generation depends on the so-called «temperature», a parameter that regulates the level of randomness in the answers: if it is high, answers are more creative and diverse; if it is low, more accurate and predictable (see the sketch after this list).
  5. Context window. AI models can only analyze a limited amount of text at a time; in a long dialog, earlier parts may «fall out» of context, which affects the consistency of answers.
  6. Security filters and policies. Some responses may be modified or blocked depending on platform rules; for example, an AI can avoid certain topics or soften the wording to comply with ethical standards.
  7. Algorithmic limitations. AI uses statistical methods to predict each next word in a sentence, but it has no true understanding or consciousness. This means that answers can be inconsistent or change depending on the wording (and language) of the query.
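To make the «temperature» point concrete, here is a minimal sketch of temperature sampling. It is our own toy illustration, not code from any of the chatbots tested: the vocabulary and the model scores (logits) are invented for the example.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick a token index from model scores using temperature scaling.

    High temperature flattens the distribution (more varied, creative picks);
    low temperature sharpens it (more predictable picks).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return rng.choice(len(probs), p=probs)

# Invented vocabulary and scores, purely for illustration.
vocab = ["apples", "fruit", "tomatoes", "spaceships"]
logits = [2.0, 1.5, 0.5, -2.0]

for t in (0.2, 1.0, 2.0):
    picks = [vocab[sample_next_token(logits, t)] for _ in range(8)]
    print(f"temperature={t}:", picks)
```

At temperature 0.2 almost every pick is «apples»; at 2.0 even «spaceships» occasionally slips through. This is the same trade-off between predictable and creative answers that chatbot platforms tune behind the scenes.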

During the test, we were lucky not to encounter the most common negative phenomenon: the «hallucinations» of AI models. However, this problem has been and remains one of the most serious. For example, an AI can make up a quote that a scientist never said, or invent a historical event that never happened. The root of the problem is how AI «thinks». It is trained on a huge amount of data, and in the process it learns to build relationships, but it does so through simplified patterns and connections. When the model encounters something that only partially fits previously learned patterns, it can draw incorrect conclusions, that is, «hallucinate». For example, if you show a child apples of different colors (red, yellow, green) and say «these are apples», then when the child sees a tomato, which is also red and round, they may conclude it is an apple.

A language model behaves the same way: if its training data often contains texts where «Einstein» and «relativity» appear side by side, the model can «invent» a quote from Einstein about relativity that never existed, because in its «understanding» these concepts are closely related. Thus, AI «hallucinations» are an attempt to fill in the missing pieces of the puzzle where the model's knowledge base falls short. In general, language models can «hallucinate» for several reasons:

  • If there are inaccuracies or contradictions in the training set, the model can reproduce them in its answers.
  • Even if the data is almost error-free, the model may still produce false statements because of how it is trained: mistakes in text decoding (the process of turning numerical representations, i.e. word probabilities, into the coherent text the model gives as an answer), errors carried over from previously generated text, or the peculiarities of how the model «memorizes» information.

The fact that the same model can phrase answers to the same prompt differently is also related to how AI «thinks». When an AI receives a question, there are many possible «correct» continuations of the answer, each with its own probability, and it can take different paths (sequences of words) to answer.
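A tiny illustration of that variability (again our own sketch with invented numbers, not any vendor's code): sampling repeatedly from the same distribution over continuations yields different, equally plausible answers.

```python
import numpy as np

# Invented distribution over continuations of one prompt.
continuations = ["Paris.", "the city of Paris.", "Paris, on the Seine."]
probs = [0.6, 0.3, 0.1]

rng = np.random.default_rng()
# Five independent generations: same "model", same prompt,
# yet the phrasing can differ from run to run.
for i in range(5):
    print(f"generation {i + 1}:", rng.choice(continuations, p=probs))
```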

Who is better: Claude 3.5 Sonnet vs DeepSeek R1 vs ChatGPT 4o vs Grok 3 beta vs Gemini 2.0 Flash vs Le Chat

The test results showed that each AI model has its strengths and weaknesses. If you need dry facts, ChatGPT and Claude are better suited. Grok is good at joking and adapting to context, but it is a mediocre artist. Gemini avoids political topics, DeepSeek has problems with the relevance of its information, and Le Chat seems somewhat biased in its choice of sources. In short:

  • Claude 3.5 Sonnet has a large context window (200K tokens), so it is well suited for text generation and task management.
  • DeepSeek R1 did not do a great job with languages (Ukrainian), but it did well in programming and math.
  • ChatGPT 4o is the best at personalized communication and working with real data. It is strong in reasoning, fast, and interactive, but it can give predictable answers.
  • Grok 3 Beta is focused on analytics (especially sentiment on a given topic on the X platform) and complex tasks.
  • Gemini 2.0 Flash is strong in multimodal analysis, although image generation has its nuances. It is well suited for code-related tasks.
  • Le Chat is an open solution that can be customized to your needs, but the chatbot is new and has not yet been tested much.

Where will AI take us?

Artificial intelligence (AI) is pushing technological progress at an unprecedented rate. Predictions show that the global AI market, valued at approximately $196.63 billion in 2023, will reach $1.81 trillion by 2030, a compound annual growth rate (CAGR) of 36.6%. AI is projected to be an important driver of global economic growth, potentially contributing up to $15.7 trillion to the global economy by 2030. Artificial intelligence is already having a significant impact on the labor market, and it is estimated that nearly 40% of jobs worldwide will be integrated with AI in some way. But while automation may make certain jobs unnecessary, AI will also create new ones. Roles that emphasize human creativity, emotional intelligence, and sophisticated management are likely to remain as important as ever; new professions will include artificial intelligence specialists, robotics engineers, and user experience (UX) designers specializing in AI products. The integration of artificial intelligence into various industries will rapidly change traditional business models and operations:

  • Healthcare. AI will improve the accuracy of diagnostics, personalize treatment plans, enable remote patient monitoring, and reduce the number of prescription errors.
  • Education. AI will facilitate personalized learning by adapting educational content to individual needs and learning pace.
  • Finance. AI algorithms are already used in stock trading and investment management, analyzing huge amounts of data to inform financial decisions. In addition, AI improves risk assessment and compliance.
  • Transportation. The development of autonomous vehicles will continue, and AI systems will control not only vehicles but also traffic flows, predicting congestion and optimizing routes.
  • Advertising. AI will allow even more personalized advertising, literally creating it for a specific user.
  • Communication. AI overcomes language barriers through real-time translation and improves accessibility for people with disabilities. Advanced AI systems that understand context and are integrated into smartphones, for example, are expected.

Therefore, multimodality is a logical next step. Such versatile AI assistants can process and analyze data from various sources: audio, photos, video, and not just text.

But the real breakthrough will be the emergence of artificial general intelligence (AGI). These systems will have human-like cognitive abilities, allowing them to perform any intellectual task that humans can, and even better. Leading research organizations and technology companies are already investing significant effort in AGI development. For example, DeepMind co-founder Demis Hassabis sees the next generation of AI as a system capable of performing any human-level cognitive task and expects significant progress in the coming years.

OpenAI CEO Sam Altman has said that the company already knows how to create AGI, and that this could happen by 2029. Ray Kurzweil wrote in his book The Singularity Is Nearer that computers will reach human levels of intelligence by 2029, while Microsoft AI CEO Mustafa Suleyman believes it could take up to 10 years due to hardware limitations. So the emergence of AGI seems to be a matter of a short period, some 4 to 10 years, and this AI will change absolutely everything.


