AI & Governance · 31 May 2026 · 11 min read

Measuring AI energy use is impossible, and that is by design

On energy labels that do not exist, leaderboards that measure the wrong models, and what you can still do tomorrow.

Last week someone asked me which AI model is the most energy-efficient. Not the smartest. Not the cheapest. Which one uses the least electricity.

I had no answer. That surprised me too. Because if you want to compare the energy consumption of AI models and look up the existing leaderboards, you run into a fundamental problem: the models you probably use every day were deliberately kept outside the measurement system. Not because it was technically impossible, but because of a choice.

AI energy use is growing fast, but the debate misses the point

The IEA published a report in April 2026 showing that electricity use by AI-focused data centres rose by 50 percent in 2025. Total consumption across all data centres was around 485 terawatt-hours (485 billion kilowatt-hours, roughly what a country like France uses in a year), and the forecast is that this will roughly double to about 950 terawatt-hours by 2030. That 2030 figure is comparable to Japan's total annual electricity consumption.
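The arithmetic behind these figures is easy to check yourself. Below is a minimal Python sketch that uses only the two consumption numbers quoted above. The yearly growth rate is my own derivation, assuming smooth compound growth over the five years from 2025 to 2030; it is not a figure taken from the report.

```python
# Total data-centre electricity use as quoted above, in terawatt-hours.
TWH_2025 = 485.0
TWH_2030 = 950.0
YEARS = 2030 - 2025

KWH_PER_TWH = 1_000_000_000  # 1 TWh equals 1 billion kWh


def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


kwh_2025 = TWH_2025 * KWH_PER_TWH
growth = implied_annual_growth(TWH_2025, TWH_2030, YEARS)

print(f"2025: {kwh_2025:,.0f} kWh")
print(f"Growth factor to 2030: {TWH_2030 / TWH_2025:.2f}x")
print(f"Implied growth per year: {growth:.1%}")
```

Under that assumption, near-doubling in five years works out to roughly 14 percent growth per year.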

Yet an important nuance is missing from the public debate. Most people think of training large models when they hear "AI and energy". Training GPT-4 is estimated to have consumed around 50 gigawatt-hours (enough to power a few thousand households for a year). That sounds enormous, but it was a one-off investment. What happens every day afterwards is a different story.

Inference, the daily answering of questions by a model, accounts for 60 to 90 percent of total energy use over a model's lifetime, depending on the source. The IEA shows the same pattern. If you use AI in your business, the relevant number is daily use, not the model's training cost. That turns last week's question into a strategic one.

The problem: the models you use cannot be measured

There is now a serious initiative to make AI models' energy use visible: the AI Energy Score by Hugging Face. It works like an energy label on a fridge, a one-to-five-star rating based on energy use per task. It currently lists 166 models.

But there is a fundamental gap. Sasha Luccioni, the researcher leading the initiative, wrote openly in April 2025 that frontier models are missing. The heaviest, most capable models at the cutting edge, such as ChatGPT, Claude and Gemini, the very models most companies use daily, are not on the list. Worse, according to OpenRouter data cited by Hugging Face, 15 of the 20 most-used AI models are closed-source, meaning their internals are not public. Their energy use cannot be measured because they run inside the closed data centres of OpenAI, Anthropic and Google.

The system exists and is accessible, but providers collectively choose not to participate. There is one exception: in August 2025 Google, alone among the major players, published its own measurement (more on that later). But a measurement that a provider runs and publishes itself is not the same as an independent leaderboard where you can compare models objectively.

Luccioni built a secure test environment where providers can have their models measured without giving away source code. No major provider has done so. Not because of technical limits, but by choice. That is not a side note. It is the central gap in the whole story.

The workaround that does work (but is not perfect)

One platform still makes comparison partly possible: Artificial Analysis. They compare all major models on quality, speed and price per million tokens. A token is a piece of text, roughly part of a word, and it is the unit in which AI models compute and bill. Their overview covers all the big names, including GPT, Claude and Gemini.

Artificial Analysis does not measure energy use directly, but a workable logic applies: the price providers charge per token broadly reflects the compute required. Some models use a technique, usually called mixture-of-experts, in which only the relevant part of the model is activated for each question rather than the whole network. Those models are much cheaper per token, and cheaper per token is currently the best proxy we have for lower consumption among closed models.

The paradox you need to know

The intuitive idea is: smaller model = more efficient. For most tasks that is simply true. A small model that summarises an email or answers a simple question uses a fraction of what a heavy reasoning model needs.

But for complex tasks it is more nuanced. Reasoning models such as OpenAI's o3 series use significantly more energy per question, and not by a little. Researchers from the universities of Rhode Island and Tunis estimate in their paper How Hungry is AI? that a long question to o3 costs nearly 39 watt-hours, compared with 0.24 watt-hours for an average question to Gemini, a figure Google measured itself. That is roughly 160 times as much for one question. For context: that one heavy question uses about as much electricity as twenty minutes of watching television.
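The comparison can be reproduced in a few lines of Python. The two watt-hour figures are the ones quoted above. The television's power draw of about 100 watts is my own assumption and not from either source, so treat the minutes as an order of magnitude.

```python
# Per-question energy figures quoted above, in watt-hours.
O3_LONG_QUESTION_WH = 39.0    # estimate from "How Hungry is AI?" (the text says "nearly 39")
GEMINI_AVERAGE_WH = 0.24      # Google's own figure for an average Gemini question

# Assumption for illustration only: a television drawing about 100 W.
TV_POWER_W = 100.0

# How many average Gemini questions fit into one long o3 question.
ratio = O3_LONG_QUESTION_WH / GEMINI_AVERAGE_WH

# Minutes of television that use the same energy as one long o3 question.
tv_minutes = O3_LONG_QUESTION_WH / TV_POWER_W * 60

print(f"Ratio: about {ratio:.1f}x")
print(f"Television equivalent: about {tv_minutes:.1f} minutes")
```

Note that this sets a long question against an average one, so the ratio describes the extremes rather than a like-for-like benchmark.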

Yet the story is not as simple as "reasoning is bad for the climate". If such a heavy model solves a complex programming problem in one go while a lighter model needs three attempts with corrections in between, total energy use can end up lower for the heavy model. Efficiency is measured per successful outcome, not per individual question.
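To make "per successful outcome" concrete, here is a minimal Python sketch. All numbers are hypothetical and chosen only to illustrate the effect. It also assumes that each attempt succeeds independently with a fixed probability, so the expected number of attempts is one divided by that probability. Real retries are messier than that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    wh_per_attempt: float  # energy for one attempt, in watt-hours
    success_rate: float    # probability that one attempt solves the task, 0 < p <= 1


def wh_per_success(model: ModelProfile) -> float:
    """Expected energy per solved task, assuming independent retries until success."""
    if not 0 < model.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return model.wh_per_attempt / model.success_rate


# Hypothetical profiles, not measurements of any real model.
heavy = ModelProfile("heavy-reasoner", wh_per_attempt=10.0, success_rate=0.9)
light = ModelProfile("light-model", wh_per_attempt=4.0, success_rate=0.3)

for model in (heavy, light):
    print(f"{model.name}: {wh_per_success(model):.1f} Wh per solved task")
```

With these made-up values the heavy model comes out ahead despite costing more per attempt. Raise the light model's success rate to 0.8 and the result flips, which is the situation for most simple tasks.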

Here is the honest caveat: most companies use AI for relatively simple tasks. Writing emails, summarising text, looking up information. For those use cases the paradox does not apply. There a smaller, more efficient model is simply better, both for CO2 and for cost.

What the EU is doing (and what it means for you)

The EU AI Act requires providers of large AI models to document energy use. That obligation applies to OpenAI, Anthropic and Google, and enforcement starts in August 2026. Perhaps the major models will finally appear on Hugging Face's leaderboard. That would be a big step.

Until now, players such as Google and Anthropic have largely kept exact consumption figures secret for competitive reasons. But pressure is building, and not only on electricity use. Water use for cooling data centres is a second environmental dimension that is increasingly discussed, and it is completely absent from current measurements. The sector is also calling for standardised green AI labels, similar to what Hugging Face is building but for all models. That is not yet mandatory, but the direction is clear.

If you are a company using AI, that documentation duty does not apply to you directly. But if you fall under the CSRD, the European directive that requires larger companies to report on sustainability, you must also include emissions in your supply chain. Think of everything you buy in, and digital services are increasingly part of that. AI use is not yet a specific mandatory item in that reporting, but it is heading that way. Those who think about it now will not be caught off guard later, because the question of which model you use will eventually become part of that sustainability reporting.

I wrote earlier about the broader governance challenges around the EU AI Act in my post on the AI Act delay that did not happen. Energy reporting is one of the few concrete obligations that did make it through.

What you can do tomorrow

I could not answer the question at first. I can do better now, and I see three steps that make sense for every organisation.

Step 1: Inventory which models your organisation uses. Not "we use AI", but which models specifically. ChatGPT in the browser, Claude via the API, Gemini via Google Workspace, an internal model. Per department, per use case. You need this overview for the EU AI Act anyway, so the investment pays off regardless.
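Such an inventory does not need special tooling. As a sketch, assuming you are comfortable with a little Python, it can be a list of small records grouped by model. The field names, departments and use cases below are hypothetical placeholders; only the model and channel examples come from the step above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelUse:
    department: str
    use_case: str
    model: str
    channel: str  # how the model is reached: browser, API, office suite, self-hosted


# Hypothetical entries; replace them with what your own survey turns up.
inventory = [
    ModelUse("Marketing", "drafting emails", "ChatGPT", "browser"),
    ModelUse("Engineering", "code review", "Claude", "API"),
    ModelUse("Finance", "summarising reports", "Gemini", "Google Workspace"),
    ModelUse("Engineering", "log triage", "internal model", "self-hosted"),
    ModelUse("Support", "drafting emails", "ChatGPT", "browser"),
]


def by_model(entries: list[ModelUse]) -> dict[str, list[ModelUse]]:
    """Group inventory entries by model name, preserving input order."""
    grouped: dict[str, list[ModelUse]] = defaultdict(list)
    for entry in entries:
        grouped[entry.model].append(entry)
    return dict(grouped)


for model, uses in by_model(inventory).items():
    departments = sorted({u.department for u in uses})
    print(f"{model}: {len(uses)} use case(s) in {', '.join(departments)}")
```

A spreadsheet with the same four columns does the job just as well. What matters is that every row names a specific model.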

Step 2: Check Artificial Analysis. Look up the models you use on artificialanalysis.ai and compare quality, speed and price per token. As explained above, price per token is currently the best proxy you have for consumption among closed models.
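Providers usually quote separate prices for input and output tokens, so a single comparable number needs a weighting. The Python sketch below ranks models by a blended price. The model names and prices are placeholders, not real quotes, and the 75 percent input share is my own assumption; adjust it to your actual traffic.

```python
# Hypothetical prices in USD per million tokens, as (input, output).
# These are placeholders for illustration; look up the real values yourself.
PRICES = {
    "model-a": (0.25, 1.00),
    "model-b": (3.00, 15.00),
    "model-c": (1.00, 4.00),
}


def blended_price(input_price: float, output_price: float, input_share: float = 0.75) -> float:
    """Weighted price per million tokens; input_share is the assumed fraction of input tokens."""
    if not 0 <= input_share <= 1:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1 - input_share) * output_price


# Cheapest first: under the proxy logic, likely the lowest consumption first.
ranking = sorted(PRICES, key=lambda name: blended_price(*PRICES[name]))

for name in ranking:
    print(f"{name}: {blended_price(*PRICES[name]):.2f} USD per million tokens")
```

Price also reflects margins and market positioning, so read the ranking as a rough ordering rather than a measurement.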

Step 3: Make the trade-off explicit. Quality, price and carbon footprint are three axes. For simple tasks it pays to use a smaller, cheaper, more efficient model instead of defaulting to the heaviest one. For complex tasks it is more nuanced. But making the trade-off consciously is already a win.
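One way to make that trade-off explicit is to set a minimum quality per task and then pick the cheapest model that clears it. The Python sketch below does exactly that. The candidates, quality scores and thresholds are hypothetical, and price per million tokens stands in for consumption, which is an approximation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float         # benchmark-style score, higher is better (hypothetical 0-100 scale)
    price_per_mtok: float  # USD per million tokens, used here as the consumption proxy


def cheapest_good_enough(candidates: list[Candidate], min_quality: float) -> Optional[Candidate]:
    """Return the lowest-priced candidate that meets the quality bar, or None if none does."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    return min(eligible, key=lambda c: c.price_per_mtok, default=None)


# Hypothetical candidates and thresholds, for illustration only.
candidates = [
    Candidate("small", quality=55, price_per_mtok=0.4),
    Candidate("medium", quality=70, price_per_mtok=1.8),
    Candidate("large-reasoning", quality=88, price_per_mtok=6.0),
]

for task, bar in {"email drafting": 50, "complex coding": 85}.items():
    choice = cheapest_good_enough(candidates, bar)
    print(f"{task}: {choice.name if choice else 'no model meets the bar'}")
```

The useful part is not the code but the thresholds: someone has to decide, per use case, how good is good enough.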

The answer is more complicated than the question

After a week of digging, my conclusion is clear: the answer simply is not there, and there is a reason. The models most people use were deliberately kept unmeasurable. Not because of technical impossibility, but by provider choice.

That will not change on its own. But when EU AI Act enforcement begins in August 2026, and as sustainability reporting tightens, pressure will build. The question of which model is most efficient will no longer be answered by a researcher on a leaderboard. You will need to justify it yourself. And the path there starts simply by asking the question.


Sources