- Google's Gemini models struggle with very large datasets.
- Research shows Gemini 1.5 Pro answered only 46.7% of test questions correctly, and 1.5 Flash just 20%.
- The models fail on claims that require complex reasoning or implicit information.
- The studies challenge Google's sweeping marketing claims.
Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, are built to process very large amounts of data. The tech giant has said these capabilities let the models carry out tasks previously considered impossible, such as summarizing hundreds of pages of documents or searching through film footage. But according to recent studies, those claims may have been exaggerated.
Gemini’s Context Window Comes Up Short
The context window of a model is the amount of input data it considers before generating an output, which could range from simple queries to full-length movie scripts or audio clips.
The latest Gemini versions can take in up to 2 million tokens, equivalent to roughly 1.4 million words, two hours of video, or 22 hours of audio. That is the largest context of any commercially available model.
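To make the token budget concrete, one rough way to gauge how much of a document fits in the window is to count tokens before sending a request. The sketch below is a minimal illustration, assuming the `google-generativeai` Python SDK and the `gemini-1.5-pro` model name; the file name and the 2-million-token figure are taken from Google's stated maximum, not verified here.

```python
import google.generativeai as genai

# Assumes a valid API key; model name and limit are illustrative.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

CONTEXT_LIMIT = 2_000_000  # tokens, per Google's stated maximum

with open("war_and_peace.txt", encoding="utf-8") as f:
    book = f.read()

# count_tokens reports how many tokens the prompt would consume,
# letting us check that the book fits before a full generation call.
count = model.count_tokens(book).total_tokens
print(f"Book uses {count:,} of {CONTEXT_LIMIT:,} tokens "
      f"({count / CONTEXT_LIMIT:.1%} of the window)")
```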
In a briefing, Google showed demos meant to illustrate Gemini’s long-context capabilities. In one, Gemini 1.5 Pro searched the transcript of the Apollo 11 moon landing (around 402 pages) for jokes and matched scenes to a pencil sketch.
Oriol Vinyals, VP of research at Google DeepMind, called the capability “magical,” pointing to the model’s ability to perform complex reasoning across many documents.
However, this might not be true.
Two separate studies examined how well Gemini models process very large inputs, works on the order of “War and Peace” in length. Both found that the models often fail to understand and accurately process that volume of information.
In one study, researchers from UMass Amherst, Princeton and the Allen Institute for AI asked the models to evaluate true/false statements about recently published fiction books, ensuring the models couldn’t rely on prior knowledge. The statements referenced specific details and plot points that required comprehending the full text.
One such statement read, “By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest.” Both Gemini 1.5 Pro and 1.5 Flash struggled: 1.5 Pro answered only 46.7% of the statements correctly, while 1.5 Flash managed just 20%. On average, neither model did better than random chance.
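An evaluation of this kind can be pictured as a loop that pairs the full book text with each claim and scores the model’s TRUE/FALSE answer against a human label. The harness below is a hypothetical simplification under that setup, not the authors’ actual code; the claims, labels and file name are placeholders, and it assumes the same `google-generativeai` SDK as above.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Placeholder claims; the real study used recent fiction the models
# could not have memorized, with human-written true/false labels.
claims = [
    ("By using her skills as an Apoth, Nusis is able to ...", True),
    ("The reagents key was found in the garden shed.", False),
]

def judge(book_text: str, claim: str) -> bool:
    # Feed the entire book plus one claim, ask for a one-word verdict.
    prompt = (f"{book_text}\n\nBased only on the book above, is the "
              f"following claim TRUE or FALSE?\nClaim: {claim}\n"
              f"Answer with a single word: TRUE or FALSE.")
    reply = model.generate_content(prompt).text.strip().upper()
    return reply.startswith("TRUE")

book = open("recent_novel.txt", encoding="utf-8").read()
correct = sum(judge(book, c) == label for c, label in claims)
print(f"Accuracy: {correct / len(claims):.1%}  (random chance: 50%)")
```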
According to a postdoc at UMass Amherst who co-authored with Marzena Karpinska “While technically capable processing long contexts like Gemini pro1.5, many cases have shown us that they do not actually ‘understand’ the content.”
They particularly struggle to verify claims that require considering large portions of the book, or the whole book, and claims that hinge on implicit information that is apparent to a human reader but never explicitly stated in the text.
A second study, by researchers at UC Santa Barbara, tested Gemini 1.5 Flash’s ability to “reason over” videos. They paired questions with a dataset of images, adding distractor images to simulate slideshow-like footage.
Flash performed poorly: it correctly transcribed a sequence of six handwritten digits from a “slideshow” of 25 images only about half the time, and accuracy dropped to roughly 30% with eight digits.
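The video test can be thought of as a needle-in-a-haystack search over frames: a handful of handwritten-digit images are interleaved with unrelated distractor frames, and the model is asked to read the digits back. The sketch below is a rough illustration of that setup, not the researchers’ benchmark; the file names and prompt wording are assumptions, and it relies on the SDK accepting PIL images in a content list.

```python
import random
from PIL import Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Illustrative file names: a few handwritten-digit frames hidden
# among distractor frames, mimicking slideshow-style footage.
digit_frames = [Image.open(f"digit_{i}.png") for i in range(6)]
distractors = [Image.open(f"distractor_{i}.png") for i in range(19)]

frames = digit_frames + distractors
random.shuffle(frames)  # 25 frames total, order unknown to the model

prompt = ("Some of these frames contain a single handwritten digit. "
          "Transcribe the digits in the order the frames appear, "
          "ignoring frames without digits.")
response = model.generate_content([prompt] + frames)
print(response.text)
```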
“Question-answering tasks over real images appeared to be particularly hard for every model we tried. It could be that this small amount of reasoning, knowing that a digit is in a box and reading it, is what’s causing the system to break,” said Michael Saxon, a PhD student at UC Santa Barbara and co-author of the study.
Google Overpromises With Gemini
These studies, which have not been peer-reviewed and did not test the 2-million-token releases of Gemini 1.5 Pro and 1.5 Flash, fuel the skepticism around Google’s claims. Neither OpenAI’s GPT-4o nor Anthropic’s Claude 3.5 Sonnet performed well either, but Google is the one advertising its models’ context window capabilities most heavily.
The researchers found that while Gemini can ingest large amounts of data, it struggles to understand and reason over that data, which raises questions about the validity of Google’s promotional claims.