3 Things ChatGPT Does Better Than Gemini

There are tens of thousands of different AI products out there, although most of us have only heard of a handful of them. Comparing two of the biggest AI systems, ChatGPT and Gemini, isn't a straightforward undertaking. For one thing, things can change overnight. Back in December 2025, people were speculating about whether OpenAI was losing the AI arms race; then, a couple of days later, it released ChatGPT-5.2 and started topping the leaderboards again.

So how can you tell which AI does stuff better? A few years ago, we could have run some side-by-side comparisons. Earlier generations of large language models (LLMs) could be quite noticeably different from one another. But the gaps are closing fast, especially between big-name brands like OpenAI and Google. Although you'll still find recent articles where someone has put a single prompt into both systems and ranked which response they prefer, this method is hopelessly flawed. For one thing, LLM outputs are "stochastic", meaning that responses include an element of randomness, so the same prompt can produce different responses. Also, there's very little that ChatGPT and Gemini can't do these days. Any preference between responses would really come down to preferred chatbot style, and even that only reflects the out-of-the-box personality: a chatbot's tone and conversational style can be customized to suit your preferences.
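To make that "stochastic" point concrete, here's a minimal sketch of temperature-based sampling, which is roughly how chatbots pick each next word. No real model is involved, and the token scores are invented for illustration; the point is that the same prompt produces the same probabilities, yet the sampling step adds randomness, so repeated runs can diverge.

```python
import math
import random

def sample_next_token(scores, temperature=0.8):
    """Sample one token from a softmax distribution with temperature.

    Higher temperature flattens the distribution (more randomness);
    temperature near zero approaches greedy, deterministic decoding.
    """
    scaled = [s / temperature for s in scores.values()]
    max_s = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - max_s) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.choices(list(scores.keys()), weights=probs, k=1)[0]

# Invented next-token scores for the prompt "The capital of France is"
scores = {"Paris": 9.1, "a": 5.3, "famous": 4.8, "located": 4.2}

# Same prompt, same scores -- yet repeated runs can pick different tokens.
for _ in range(5):
    print(sample_next_token(scores))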

So, given that we're not going to undertake multiple trials using blind evaluations and aggregated results, we shall leave the rankings to the experts. There are a variety of benchmarks that test AI systems on things like reasoning, logic, and problem-solving. We'll cover three of the significant ones where ChatGPT performs well. There's an explanation of how we chose which benchmarks to include at the end of this article.

Answer difficult Google-proof science questions

The first benchmark we'll look at is GPQA Diamond. This is designed to test PhD-level reasoning in physics, chemistry, and biology. GPQA stands for Google-Proof Questions and Answers. There's a standard test and a 'Diamond' subset, which contains particularly difficult questions. Being Google-proof means these aren't just questions with one simple answer you can look up. They require complex reasoning skills.

To answer correctly, an AI would need to apply multiple scientific concepts, resist making assumptions or taking shortcuts, and ignore red herrings. These are multiple-choice questions, so an AI model doesn't get any points for conversational fluency or confidence. It either arrives at the correct answer or it doesn't.
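Because it's multiple choice, scoring reduces to exact-match accuracy: each question is either right or wrong, with no partial credit for a well-argued wrong answer. A minimal sketch of that kind of grading loop might look like this (the question IDs and answer letters are placeholders, not real GPQA items):

```python
def score_multiple_choice(predictions, answer_key):
    """Return accuracy: each item is either right or wrong, no partial credit."""
    correct = sum(1 for qid, choice in predictions.items()
                  if answer_key.get(qid) == choice)
    return correct / len(answer_key)

# Placeholder items -- the real GPQA questions are kept offline to stay "Google-proof".
answer_key = {"q1": "C", "q2": "A", "q3": "D"}
predictions = {"q1": "C", "q2": "B", "q3": "D"}

print(f"Accuracy: {score_multiple_choice(predictions, answer_key):.1%}")  # 66.7%
```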

Both ChatGPT and Gemini score highly on this, with ChatGPT currently leading by less than a percentage point. GPT-5.2 scores 92.4% to Gemini 3 Pro's 91.9%. For comparison, a PhD graduate would be expected to score 65%, and regular non-expert humans score 34%. For obvious reasons, the actual Google-proof questions are not available online, but you can see an example of the sorts of questions the test includes here.

Fix real-world coding problems

Whatever you think about AI coding and the security risks it poses, the ability to fix bugs and solve other software issues is a required skill for today's AI systems. SWE-Bench comes in several variants, each designed to test a different aspect of software engineering. The variant where ChatGPT outperforms its rivals is SWE-Bench Pro (Private Dataset).

SWE-Bench Pro evaluates whether an AI system can solve real software engineering tasks taken from actual issues on the GitHub developer platform. Each task requires understanding an unfamiliar codebase, interpreting the intent behind a bug report, making appropriate changes, and producing a workable solution. The private dataset is held out from public view, so models can't have seen its contents during training, which makes it a tougher test than the public version.
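To give a rough idea of how this kind of evaluation works, here's a simplified sketch of a grading loop: apply the model's proposed patch to a checked-out repository, then run the tests that a fix for the original bug report is expected to make pass. This is not the actual SWE-Bench Pro harness, which runs tasks in isolated, per-task environments; the repo path and test IDs below are placeholders.

```python
import subprocess

def evaluate_patch(repo_dir, patch_file, fail_to_pass_tests):
    """Apply a model-generated patch and check whether the failing tests now pass.

    A task counts as 'resolved' only if the patch applies cleanly and every
    targeted test passes afterwards.
    """
    apply = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch didn't even apply cleanly

    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests],
                            cwd=repo_dir)
    return result.returncode == 0

# Placeholder paths and test IDs -- swap in a real checkout to try it:
# resolved = evaluate_patch("./some-checked-out-repo", "model_patch.diff",
#                           ["tests/test_bug_1234.py::test_regression"])
```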

The results show that ChatGPT-5.2 resolved about 24% of the issues, while Gemini only resolved about 18%. If those numbers seem unimpressive, that's because this is the trickiest SWE-Bench test to complete. On more straightforward coding benchmarks, AIs fix around 75% of issues. For comparison, though, every one of these private dataset engineering challenges has been solved by humans: having a known, workable fix is one of the criteria for including a task in the test. So AI has a way to go before it matches the skills of human software engineering experts.

Solve abstract visual puzzles

You know those puzzles you have to solve to prove you're not a robot? There's a benchmark that tests that kind of intuitive visual reasoning. The original ARC-AGI test was devised in 2019, before LLMs were even a thing, and was designed to "measure a human-like form of general fluid intelligence". ARC-AGI-2 is an updated version launched in March 2025. It assesses an AI's ability to apply abstract reasoning to unfamiliar challenges: the model has to work out an underlying pattern from a small number of examples and then apply it correctly to a new example. These tasks often require identifying which aspects of a problem are relevant and ignoring any distractions. Crucially, it's something that humans, on the whole, are pretty good at, and where artificial intelligence still struggles to give the right answer.
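ARC-style tasks are small colored grids: a handful of input/output example pairs demonstrate a hidden rule, and the solver must apply that rule to a new input. The sketch below is a toy version, far simpler than real ARC-AGI-2 puzzles, where the hidden rule is just a color substitution that the program infers from the examples.

```python
def infer_color_map(train_pairs):
    """Infer a cell-by-cell color substitution from example (input, output) grids."""
    mapping = {}
    for grid_in, grid_out in train_pairs:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                mapping[a] = b
    return mapping

def apply_rule(grid, mapping):
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

# Toy task: the hidden rule recolors 1 -> 2 and leaves 0 (the background) alone.
train_pairs = [
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
]
test_input = [[1, 0], [0, 1]]

rule = infer_color_map(train_pairs)
print(apply_rule(test_input, rule))  # [[2, 0], [0, 2]]
```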

On the ARC-AGI-2 benchmark, ChatGPT-5.2 Pro scored 54.2%. Gemini appears several times on the list: a souped-up refinement version scored 54%, and Gemini 3 Deep Think scored 45.1%. However, Gemini 3 Pro only scored 31.1%, considerably lower than ChatGPT. This is the model that's analogous to ChatGPT-5.2 Pro, as they're both paid subscription models in the same sort of price bracket, whereas Gemini Deep Think is a lot more expensive. Like SWE-Bench Pro (Private Dataset), ARC-AGI-2 is a benchmark where the AI scores are relatively low because the tasks are genuinely hard for AI. However, it seems to be an area where ChatGPT is outperforming not only Gemini but all of its other rivals as well.

Methodology

AI benchmark results change rapidly, and any numbers we've included here will change with the next OpenAI or Google AI release. For this article, we considered the most up-to-date versions at the time of writing, which are GPT-5.2 and Gemini 3. As the paid-for Pro versions were the ones that ranked higher on the benchmarks, those are the versions we focused on.

We looked for examples where ChatGPT performs better than Gemini. There are many instances where Gemini ranks higher than ChatGPT, such as SWE-Bench Bash Only and Humanity's Last Exam. We focused on just three benchmarks here because they represent a good spread of different AI skills: knowledge and reasoning, problem-solving, and abstract thinking. There are many other benchmarks available, including others that ChatGPT does well on, such as GDPval-AA and FrontierMath. We couldn't include everything.

By focusing on benchmarks, we ensured more reliable results than we'd get by conducting our own limited side-by-side comparisons. To keep that focus, we also excluded results from large-scale subjective studies like LMArena, although we recognize that these are incredibly useful ways to compare AI systems, as they aggregate huge numbers of people's preferences in blind tests. So, for completeness, we should mention that Gemini currently far outranks ChatGPT for user preference on LMArena.
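For what it's worth, arena-style leaderboards turn those blind head-to-head votes into a ranking using a rating system in the Elo/Bradley-Terry family. The sketch below shows a basic Elo update over a stream of pairwise votes; it's a simplification of the statistical model the real leaderboards compute, and the vote data is invented.

```python
def elo_update(ratings, winner, loser, k=32):
    """Update two models' ratings after one blind head-to-head vote."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser] -= k * (1 - expected_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Invented votes: each entry is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_b", "model_a"), ("model_a", "model_b")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)

print(ratings)
```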
