Gemini, Google’s most advanced AI model, has been in the news for beating the classic 1996 Game Boy game Pokémon Blue.
This achievement marks a significant step for artificial intelligence in gaming, showcasing how far technology has come in tackling complex, open-ended challenges that require reasoning and adaptability.
Gemini’s Journey through Pokémon Blue
Gemini’s run through Pokémon Blue was not a solo effort. The project was led by Joel Z., a software engineer unaffiliated with Google, who streamed the entire playthrough on Twitch.
Joel Z occasionally intervened during the run, but he made it clear that his help was limited and did not amount to cheating. His interventions were meant to improve Gemini’s overall decision-making and reasoning abilities, with occasional technical clarifications, such as informing Gemini about a bug that requires talking to a Rocket Grunt twice to obtain a key, an issue later fixed in Pokémon Yellow.
Gemini played through an “agent harness,” a system that fed it fresh screenshots and other game data as the playthrough progressed.
This gave Gemini the chance to analyze each situation, call on specialized sub-agents when needed, and choose which inputs to send to the emulator.
Such a setup highlights the importance of external tools in helping AI models navigate complex environments like Pokémon Blue.
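To make that setup concrete, the sketch below shows what a minimal harness loop of this kind could look like. It is an illustration only, assuming a hypothetical Emulator interface, a query_model placeholder, and a standard Game Boy button vocabulary; none of these names come from the actual project.

```python
# Hypothetical sketch of an agent-harness loop for a Game Boy emulator.
# The Emulator protocol, query_model placeholder, and HarnessStep type are
# illustrative assumptions, not the interfaces used in the real project.
from dataclasses import dataclass
from typing import Protocol

BUTTONS = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START", "SELECT"}

class Emulator(Protocol):
    def screenshot(self) -> bytes: ...          # current frame, e.g. PNG bytes
    def read_state(self) -> dict: ...           # extracted data: map ID, party, items
    def press(self, button: str) -> None: ...   # send a single button press

@dataclass
class HarnessStep:
    action: str      # one entry from BUTTONS
    rationale: str   # the model's stated reasoning for the action

def query_model(image: bytes, state: dict, history: list[HarnessStep]) -> HarnessStep:
    """Send the screenshot, game state, and recent history to a multimodal
    model and parse its reply into one button action. A real harness would
    call an LLM API here and could route hard sub-problems to sub-agents."""
    raise NotImplementedError("wire a model client into this placeholder")

def run_harness(emu: Emulator, max_steps: int = 10_000) -> None:
    history: list[HarnessStep] = []
    for _ in range(max_steps):
        # Observe and decide: pass the frame, structured game data, and
        # a short window of recent steps to the model.
        step = query_model(emu.screenshot(), emu.read_state(), history[-20:])
        # Act: skip malformed actions rather than crash the run.
        if step.action not in BUTTONS:
            continue
        emu.press(step.action)
        history.append(step)
```

The loop captures the observe-decide-act cycle described above; by the project’s own description, the real harness layers more on top, such as routing harder sub-problems to specialized sub-agents before committing to an input.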
Google executives, including CEO Sundar Pichai, celebrated the achievement, with Pichai humorously referring to the project as “Artificial Pokémon Intelligence.”
Logan Kilpatrick, product lead for Google AI Studio, also highlighted Gemini’s rapid progress, noting that it earned its fifth badge much faster than competing models had.
Gemini vs. Other AI Models: The Benchmark Debate
Gemini’s success has naturally led to comparisons with other AI models, particularly Anthropic’s Claude, which is still working its way through Pokémon Red.
While Gemini has completed Pokémon Blue, Claude has yet to finish its game, despite making steady progress and serving as the inspiration for the Gemini project.
Direct comparisons between Gemini and Claude are difficult, however: each model plays inside a different harness, with its own tools, sub-agents, and view of the game state.

As Joel Z emphasized, these differences mean that performance in Pokémon Blue shouldn’t be considered a definitive benchmark for AI capabilities.
The unique setups and developer interventions make it impossible to declare one model categorically superior to another in this context.
Despite the limitations in benchmarking, using classic games like Pokémon Blue has become a popular way to test and showcase AI’s problem-solving skills.
These games present open-ended goals and require the AI to adapt to unexpected scenarios, making them ideal for evaluating reasoning and adaptability.
Final Words
Gemini’s completion of Pokémon Blue is more than just a technical achievement: it represents a new era in AI development, where models are not only solving structured problems but also navigating complex, nostalgic worlds that many people grew up with.
The project continues to evolve, with ongoing improvements to the framework and growing interest from both the AI and gaming communities.
By demonstrating advanced reasoning, adaptability, and productive collaboration with human developers, Gemini is setting a new standard for what artificial intelligence can achieve in interactive environments.