The Unreasonable Effectiveness of Google Gemini 1.5 — Alea iacta est

Krishna Sankar
5 min read · Feb 19, 2024


This is going to be a short blog on the disruptively impressive capabilities of the Google Gemini 1.5 preview.

Google has unveiled a private preview edition of the Gemini 1.5 Pro — “… starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens …”

What does it mean?

A huge jump in context size: it can process much larger amounts of data in a single prompt than other LLMs/chatbots.

Tokens are the currency of NLP — a token can be a whole word, a piece of a word, or a chunk of an image, video, audio, or code. The bigger a model’s context window, the more information it can take in and process in a given prompt — making its output more consistent, relevant, and useful.

With a context window of 32,768 tokens, gpt-4-32k can take in around 25,000 words. Gemini 1.5 can ingest roughly 32 times as much — 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code, or over 700,000 words (1,400 pages!!). In Google’s research, they’ve successfully tested up to 10 million tokens!
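
To get an intuition for these numbers, here is a minimal sketch of the words-per-token arithmetic. It uses OpenAI’s tiktoken tokenizer purely as a stand-in — Gemini has its own tokenizer, so treat the ratios as rough approximations, not official figures.

```python
# A minimal sketch: estimating how much text fits in a context window.
# tiktoken is used only to illustrate the words-per-token ratio; Gemini
# uses its own tokenizer, so these are rough approximations.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "The quick brown fox jumps over the lazy dog. " * 1000
n_tokens = len(enc.encode(text))
n_words = len(text.split())

print(f"{n_words} words -> {n_tokens} tokens "
      f"(~{n_words / n_tokens:.2f} words per token)")

# At roughly 0.75 words per token, a 32,768-token window holds ~25,000
# words, while a 1,000,000-token window holds ~700,000+ words.
for window in (32_768, 128_000, 1_000_000):
    print(f"{window:>9,} tokens ~ {int(window * n_words / n_tokens):,} words")
```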

1M+ is definitely impressive — granted, it may lead to substantial challenges such as extensive hallucinations and unintended consequences; nevertheless, as the saying goes, “Alea iacta est” (the die is cast) — a barrier broken, the bar has been raised!

Gemini 1.5 not only handles vast volumes of text but processes them with remarkable accuracy!

“For text, Gemini 1.5 Pro achieves 100 percent recall up to 530k tokens, 99.7 percent up to 1M tokens, and 99.2 percent accuracy up to 10M tokens” — results shared by Rowan Cheung from early access.
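
Recall numbers like these typically come from “needle in a haystack” style tests: bury one fact in a huge context and ask the model to retrieve it. Here is a minimal sketch of the idea using the google-generativeai Python SDK — the model name "gemini-1.5-pro-latest" is my assumption; the actual identifier in the private preview may differ.

```python
# A minimal "needle in a haystack" sketch of how long-context recall is
# typically measured. The model name below is an assumption -- the
# private-preview identifier may differ.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Build a long "haystack" of filler text and bury one "needle" fact in it.
filler = "The sky was clear and the market was quiet that day. " * 5000
needle = "The secret passcode for the vault is 7-alpha-9."
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

response = model.generate_content(
    haystack + "\n\nWhat is the secret passcode for the vault?"
)
print(response.text)  # Recall succeeds if the answer contains "7-alpha-9"
```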

While I don’t have access to the preview, Rowan Cheung does — he spent this Saturday working with it and wrote an excellent thread on Twitter [Here]! His firsthand experience provides valuable insights into Gemini 1.5’s capabilities.

Of course, I am on the waiting list!!

The Larger Picture

  • As Rowan writes (and others echo in replies), “… so many LLM capabilities just unlocked …. so many more apps will be built …” — that is the key. This suddenly opens up a host of previously impractical capabilities.
  • “This is huge progress. Guess we will have to wait to see what GPT-5 brings to fight this” — hence my note Alea iacta est. This has raised the bar!
  • “… Felt surreal to watch Gemini ingest multiple transcripts and spit out accurate details in less than a minute”

The Architecture

  • I am going to slide in a quick peek at some of the interesting technical details.
  • Gemini 1.5 is not a single model but an MoE, i.e., a Mixture of Experts, with conditional computation. “While a traditional Transformer functions as one large neural network, MoE models are divided into smaller ‘expert’ neural networks. Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways (using the learned routing function) in its neural network. This specialization massively enhances the model’s efficiency.” (A minimal sketch of this routing idea follows this list.)
  • BTW, the MoE architecture is also at the core of the (very successful) Mixtral models from the French company Mistral AI.
  • The model is natively multimodal and supports interleaving of data from different modalities; it can handle a mix of audio, visual, text, and code inputs in the same input sequence.
  • Gemini 1.5 Pro is trained on multiple 4096-chip pods of Google’s TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data.
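
To make the routing idea above concrete, here is a minimal sketch of top-2 MoE gating. The expert count, top-k value, and dimensions are illustrative assumptions only — Google has not published Gemini 1.5’s actual MoE configuration.

```python
# A minimal sketch of Mixture-of-Experts routing with top-2 gating.
# Expert count, top-k, and dimensions are illustrative assumptions;
# Gemini 1.5's actual MoE configuration has not been disclosed.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" is a tiny feed-forward layer; the router is a linear map.
experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through its top-k experts."""
    logits = x @ router                 # router scores, one per expert
    top = np.argsort(logits)[-top_k:]   # pick the k most relevant experts
    weights = np.exp(logits[top])
    weights /= weights.sum()            # softmax over the selected experts
    # Conditional computation: only k of n_experts run for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (64,) -- same shape, ~k/n of the compute
```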

Will it replace RAG (Retrieval-Augmented Generation)?

  • An obvious question — Alexander Ratner has an interesting answer
  • Long-context models — like Gemini 1.5 — will definitely eat up a lot of the simpler use cases plus pre-production development (which is very common nowadays)
  • But RAG still wins from a cost, latency, and scale perspective. Even more durably, a RAG approach is modular. So for more complex, scaled, and/or production settings, RAG is likely here to stay (see the sketch after this list).
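
To illustrate the modularity argument, here is a minimal sketch contrasting the two approaches. The embed function is an illustrative placeholder, not a specific library’s API — in practice you would swap in a real embedding model and vector store.

```python
# A minimal sketch contrasting long-context stuffing with RAG retrieval.
# `embed` is an illustrative placeholder, not a specific library's API.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: swap in a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

documents = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = np.stack([embed(d) for d in documents])  # built once, reused cheaply

def long_context_prompt(question: str) -> str:
    # Long-context approach: ship the entire corpus with every call.
    # Simple, but cost and latency grow with the corpus size.
    return "\n".join(documents) + "\n\nQuestion: " + question

def rag_prompt(question: str, k: int = 2) -> str:
    # RAG approach: retrieve only the top-k relevant chunks.
    # Modular: retriever, index, and LLM can each be swapped or scaled.
    scores = index @ embed(question)
    top = np.argsort(scores)[-k:]
    return "\n".join(documents[i] for i in top) + "\n\nQuestion: " + question
```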

Rowan tried out six interesting capabilities; here are four highlights:

  1. Breaking down + understanding a long video — he uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score. Incredibly, Gemini 1.5 was able to find the specific perfect-50 dunk and its details just from long-context video understanding!
  2. Understanding and comparing full movie transcripts from Interstellar and Ad Astra — Gemini 1.5 was able to understand, compare, and contrast entire transcripts from both movies to help him decide which movie to watch.
  3. Watching, understanding, and distinguishing whether an OpenAI Sora video is AI-generated or not — Gemini 1.5 highlighted the famous Sora cat video and called out key factors of why it could be AI-generated. He was surprised by how in-depth the response went — encouraging. <ks> I always knew that we need LLMs to understand/guardrail LLMs. This is very good. I had mentioned this in my blogs [Here] & [Here]</ks>
  4. Finding, understanding, and explaining a small figure in a long paper — Gemini 1.5 was able to extract “table 8” from the Gemini 1.5 Pro paper by DeepMind and explain what the table meant. <ks> Excellent capability for financial applications, operations optimization, et al. </ks>

I will leave you with an urge to try out Gemini 1.5 and all its wonders! The more interesting exploration is to comprehend and internalize the speed at which LLM capabilities are evolving … as Sam Altman mentions in his interview with Bill Gates [Here], we will be on this steep improvement curve for the next 5+ years …
