The first 3D graphics cards appeared 25 years ago and since then their power and complexity have grown at a scale greater than any other microchip found in a PC. Back then, these processors packed around 1 million transistors, were smaller than 100 mm² in size, and consumed just a handful of watts of electrical power.

Fast forward to today, and a typical graphics card might have 14 billion transistors, in a die 500 mm² in size, and swallow over 200 W of power. The capabilities of these behemoths are immeasurably greater than their primitive predecessors, but have they got any better at being efficient with all those tiny switches and all that energy?

A Tale of Two Numbers

In this article, we'll take a look at how well GPU designers have utilized the increase in die size and power consumption to give us ever more processing power. Before we dive in, you might first want to brush up on the components of a graphics card or take a walk through the history of the modern graphics processor. With this background, you'll have a solid foundation with which to follow this feature.

To understand how the efficiency of GPU design has changed, if at all, over the years, we've used TechPowerUp's excellent database, taking a sample of processors from the last 14 years. We've picked this timeframe because it marks the start of when GPUs had a unified shader structure.

Rather than having separate circuits inside the chip for handling triangles and pixels, unified shaders are arithmetic logic units designed to process all the math required for any calculation involved in 3D graphics. This allows us to apply a relative performance measurement consistently across the different GPUs: floating point operations per second (FLOPS, for short).

Hardware vendors are often keen to quote FLOPS figures as a measure of the peak processing capability of the GPU, and while it's definitely not the only aspect behind how fast a GPU is, FLOPS gives us a number that we can work with.

The same is true of die size, which is a measure of the area of the processing chip. However, you could have two chips that are the same size, but have vastly differing transistor counts.

For instance, Nvidia's G71 (think GeForce 7900 GT) processor from 2006 is 196 mm² in size and contains 278 million transistors; their TU117, released early last year (GeForce GTX 1650), is just 4 mm² larger but has 4.7 billion of the little switches.

A chart of Nvidia's primary GPUs showing changes in transistor density over the years

Naturally, this must mean newer GPU transistors are much smaller than the older chip's, and this is very much the case. The so-called process node -- the overall design scale of the manufacturing process used to fabricate the processor -- used by hardware vendors has changed over the years, progressively getting smaller and smaller. So we'll analyze efficiency from the perspective of die density, which is a measure of how many millions of transistors there are per mm² of chip area.
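As a quick illustration of how die density is worked out, here's a minimal Python sketch using the two chips quoted above (the TU117's area taken as roughly 200 mm², i.e. "4 mm² larger" than the G71):

```python
# Die density = transistor count / die area, using the two example chips above.
# Figures are the approximate published specs quoted in this article.
chips = {
    "G71 (GeForce 7900 GT)":    {"transistors_m": 278,   "area_mm2": 196},
    "TU117 (GeForce GTX 1650)": {"transistors_m": 4_700, "area_mm2": 200},
}

for name, spec in chips.items():
    density = spec["transistors_m"] / spec["area_mm2"]  # millions of transistors per mm²
    print(f"{name}: {density:.2f} M transistors/mm²")
```

That works out to roughly 1.4 million transistors per mm² for the G71 versus around 23.5 million for the TU117.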

Perhaps the most contentious metric we'll be using is the figure for the GPU's power consumption. We have no doubt that many readers will not like this, as we're using the vendor's stated thermal design power (TDP) value. This is actually a measure (or at least, it's supposed to be) of the amount of heat emitted by the whole graphics card in an average, but high load, situation.

With silicon chips, the power they consume gets mostly turned into heat, but this isn't the reason why using TDP is a problem. It's that different vendors state this number under different conditions, and it's also not necessarily the power consumption whilst producing peak FLOPS. It's also the power value for the whole graphics card, including the onboard memory, although most of it will be down to the GPU itself.

It is possible to directly measure the power consumption of a graphics card. For example, TechPowerUp does it for their GPU reviews, and when they tested a GeForce RTX 2080 Super, with a vendor-declared TDP of 250 W, they found it averaged 243 W but peaked at 275 W during their testing.

But we've stuck with using TDP for the sake of simplicity, and we've been somewhat cautious in making any judgements based solely on processing performance against thermal design power.

We're going to directly compare two metrics: GFLOPS and unit die density. 1 GFLOPS equates to 1,000 million (one billion) floating point operations per second, and we're dealing with the value for FP32 calculations, done exclusively by the unified shaders. The comparison will take the form of a graph like this:

The x-axis plots GFLOPS per unit TDP, so you want this to be as high as possible: the lower the position along this axis, the less power efficient the chip is. The same is true for the y-axis, as this plots GFLOPS per unit die density. The more transistors you have packed into a square mm, the more performance you would expect. So the overall GPU processing efficiency (accounting for the number of transistors, die size, and TDP) increases as you go towards the top-right hand corner of the graph.

Any data points near the top-left are basically saying "this GPU is getting good performance out of its die design, but at the price of using a relatively large amount of power." Going towards the bottom right, and it's "good at using power effectively, but the die design isn't generating much performance."

In short, we're defining processing efficiency as how much the GPU does for the package and power it's got.
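To make the two axes concrete, here's a minimal Python sketch of how the metrics would be computed for a single entry. The spec values are approximate, publicly quoted figures for the TU102 in the GeForce RTX 2080 Ti, and are our own assumption rather than numbers taken from the chart data itself:

```python
# Compute the two plotted metrics for one GPU:
#   x-axis: GFLOPS per watt of TDP
#   y-axis: GFLOPS per unit die density (millions of transistors per mm²)
# Approximate published figures for the TU102 / GeForce RTX 2080 Ti (assumption).
fp32_gflops   = 13_450   # peak FP32 throughput
tdp_watts     = 250      # vendor-stated TDP
transistors_m = 18_600   # millions of transistors
die_area_mm2  = 754

die_density        = transistors_m / die_area_mm2   # ~24.7 M transistors per mm²
gflops_per_watt    = fp32_gflops / tdp_watts         # x-axis value
gflops_per_density = fp32_gflops / die_density       # y-axis value

print(f"{gflops_per_watt:.1f} GFLOPS/W, {gflops_per_density:.0f} GFLOPS per (M transistors/mm²)")
```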

GPU Efficiency: TDP vs. Unit Die Density

Without further ado, let's move on to the results:

At face value, the results seem rather scattered about, but we can see a basic pattern: old GPUs, such as the G80 or RV670, are far less efficient compared to newer designs, such as the Vega 20 or the GP102. Which is what you would expect! After all, it would be a pretty poor team of electronic engineers who went out of their way to constantly design and release new products that are less efficient with each release.

But there are some interesting data points. The first of these are the TU102 and GV100. Both are made by Nvidia and can be found in graphics cards such as the GeForce RTX 2080 Ti and Titan V, respectively.

You could argue that neither GPU was designed for the general consumer market, especially the GV100, as they're really for workstations or compute servers. So although they seem to be the most efficient of the lot, that's what you'd expect for processors designed for specialized markets, that cost vastly more than the standard ones.

Another GPU that sticks out, somewhat like a sore thumb, is the GP108 -- this is another one of Nvidia's chips and is most commonly found in the GeForce GT 1030. This low-end product, released in 2017, has a very small processor just 74 mm² in size, with a TDP of only 30 W. However, its relative floating point performance is actually no better than Nvidia's first unified shader GPU, the G80, from 2006.

Across from the GP108 is AMD's Fiji chip that powered its Radeon R9 Fury series. This design doesn't seem to be overly power efficient, especially given that the use of High Bandwidth Memory (HBM) was supposed to help in this respect. The Fiji design ran rather hot, which makes semiconductor processors less power efficient due to increased leakage. This is where electrical energy gets lost to the packaging and surroundings, rather than being constrained within the circuitry. All chips leak, but the rate of loss increases with temperature.

Perhaps the most interesting data point is Navi 10: this is AMD's most recent GPU design and is manufactured by TSMC using their N7 process node, currently the smallest scale in use. However, the Vega 20 chip is made on the same node, yet it seems to be more efficient, despite being an older design. So, what's going on here?

The Vega 20 (AMD used it in only one consumer graphics card, the Radeon VII) was the last processor made by AMD to use their GCN (Graphics Core Next) architecture. It packs a huge number of unified shader cores into a layout that focuses heavily on FP32 throughput. However, programming the device to reach this performance was not easily done, and it lacked flexibility.

Navi 10 uses their latest architecture, RDNA, which resolves this issue, but at a cost to FP32 throughput. However, it is a new layout manufactured on a relatively fresh process node, so we can expect to see efficiency improvements as TSMC develops the node further and AMD updates the architecture.

If we ignore the outliers, the most efficient GPUs in our chart are the GP102 and GP104. These use Nvidia's Pascal architecture, and can be found in graphics cards such as the GeForce GTX 1080 Ti, GTX 1070, and GTX 1060. The one next to the GP102, but not labelled for the sake of clarity, is the TU104, which uses Nvidia's latest Turing design and can be found in a raft of GeForce RTX models: 2060, 2070 Super, 2080, 2080 Super, to name a few.

These are also made by TSMC, but using a process node specifically designed for Nvidia's products, called 12FFN, which is itself a refined version of the 16FF node.

The improvements focus on increasing die density while reducing leakage, which would go some way to explaining why Nvidia's GPUs are seemingly the most efficient.

GPU Efficiency: TDP vs. Unit Die Area

We can reduce the impact of process node on the analysis by replacing the metric of die density with just die area. This gives us a very different picture...

Efficiency increases in the same direction in this graph, but now we can see that some key positions have swapped. The TU102 and GV100 have dropped right down, whereas the Navi 10 and Vega 20 have jumped up the graph. This is because the former two processors are enormous chips (754 mm² and 815 mm²), whereas the latter two from AMD are much smaller (251 mm² and 331 mm²).

If we focus the graph so it only displays the more recent GPUs, the differences become even more pronounced:

This view strongly suggests that AMD has focused less on power efficiency and more on die size efficiency.

In other words, they've wanted to get more GPU chips per manufactured wafer. Nvidia, on the other hand, appears to have taken the approach of designing their chips to be larger and larger (and thus each wafer provides fewer dies), but utilizing electrical power better.

So will AMD and Nvidia continue this way with their next GPUs? Well, the former has already stated they're focusing on improving the performance-per-watt ratio in RDNA 2.0 by 50%, so we should see their future GPUs sit further to the right on our chart above. But what about Nvidia?

Unfortunately, they are notorious for keeping very tight lipped about future developments, but we do know that their next processors will be made by TSMC and Samsung on a similar process node to that used for Navi. There have been some claims that we will see a large power reduction, but also a large hike in unified shader count, so we may well see a similar position on the chart for Nvidia.

So How Have GPUs Become More Efficient?

The above is pretty conclusive: over the years, AMD and Nvidia have raised the processing performance per unit die density and unit TDP. In some cases, the increase has been astonishing...

Take Nvidia's G92 and TU102 processors. The first one powered the likes of the GeForce 8800 GT and 9800 GTX, and packs 754 million transistors into a chip 324 mm² in area. When it appeared in October 2007, it was well received for its performance and power requirements.

Eleven years later, Nvidia offered us the TU102 in the form of the GeForce RTX 2080 Ti, with nearly 19 billion transistors in an area of 754 mm² -- that's 25 times more microscopic components in a surface that's only 2.3 times larger.

None of this would be possible if it wasn't for the work done by TSMC to constantly develop their fabrication technology: the G92 in the 8800 GT was built on a 65 nm process node, whereas the latest TU102 is made on their special 12FFN node. The names of the production methods don't really convey the scale of the difference between the two, but the GPU numbers do. The current one has a die density of 24.67 million transistors per mm², compared to the old one's value of 2.33 million.

A ten-fold increase in the packing of components is the main reason behind the huge difference in the two GPUs' efficiency. Smaller logic units require less energy to operate, and the shorter pathways connecting them mean it takes less time for data to travel. Along with improvements in silicon chip manufacturing (fewer defects and better insulation), this results in being able to run at higher clock speeds for the same power requirement, or to use less power for the same clock rate.
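To put a rough number on that trade-off, here's a small sketch based on the standard first-order approximation for dynamic power in CMOS logic (power is proportional to capacitance, voltage squared, and clock frequency). The capacitance and voltage values below are purely illustrative assumptions, not measured figures for any real GPU:

```python
# First-order dynamic power model for CMOS logic: P ≈ C * V^2 * f
# The numbers are illustrative assumptions only, not real GPU figures.
def dynamic_power(c_farads, volts, freq_hz):
    return c_farads * volts**2 * freq_hz

baseline = dynamic_power(1.0e-9, 1.00, 1.5e9)  # arbitrary reference design
shrunk   = dynamic_power(0.5e-9, 0.90, 1.5e9)  # smaller transistors: less capacitance, lower voltage

print(f"power at the same clock: {shrunk / baseline:.0%} of the baseline")
# Roughly 40% of the original power, leaving headroom to raise the clock instead.
```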

Speaking of clocks, this is another factor to consider. Let's compare the RV670, from November 2007 in the Radeon HD 3870, to the Vega 10 powering the Radeon RX Vega 64, released in August 2017.

The former has a fixed clock speed of around 775 MHz, whereas the latter has at least three available rates:

  • 850 MHz - when just doing desktop, 2D processing
  • 1250 MHz - for very heavy 3D work (known as the base clock)
  • 1550 MHz - for light-to-medium 3D loads (known as the boost clock)

We say 'at least' because the graphics card can dynamically vary its clock speed, and the power consumed, between the above values, based on its workload and operating temperature. This is something that we take for granted now with the latest GPUs, but this level of control simply didn't exist 13 years ago. The capability doesn't impact our efficiency results though, as we've only looked at peak processing output (i.e. at the maximum clock speeds), but it does affect how the card performs for the general consumer.
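For reference, peak FP32 throughput is conventionally estimated as 2 × shader count × clock speed (one fused multiply-add per shader per cycle). Here's a quick sketch using the Vega 64's clocks from the list above and its published shader count of 4,096 (the shader count being an assumption, as it isn't quoted in this article):

```python
# Conventional peak FP32 estimate: 2 ops (one fused multiply-add) per shader per clock.
shaders   = 4096    # published Vega 64 shader count (assumption, not from the article)
boost_ghz = 1.55    # boost clock from the list above, in GHz
base_ghz  = 1.25    # base clock from the list above, in GHz

peak_gflops  = 2 * shaders * boost_ghz   # ≈ 12,700 GFLOPS (~12.7 TFLOPS)
heavy_gflops = 2 * shaders * base_ghz    # ≈ 10,240 GFLOPS at the base clock

print(f"peak: {peak_gflops:,.0f} GFLOPS, sustained heavy load: {heavy_gflops:,.0f} GFLOPS")
```

It's the first of those two figures, the peak value at maximum clock, that feeds into our efficiency charts.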

But the most important reason for the constant increase in GPU processing efficiency over the years has been down to changes in the use of the processor itself. In June 2008, the best supercomputers around the world were all powered by CPUs from AMD, IBM, and Intel; 11 years later, there is one more chip vendor in the mix: Nvidia.

Their GV100 and GP100 processors were designed almost exclusively for the compute market; they feature a raft of key architectural features to support this, and many of them are very CPU-like. For example, the internal memory of the chips (the cache) looks similar to that of a typical server CPU:

  • Register file per SM = 256 kB
  • L0 cache per SM = 12 kB instruction
  • L1 cache per SM = 128 kB instruction / 128 kB data
  • L2 cache per GPU = 6 MB

Compare this to Intel's Xeon E5-2692 v2, which has been used in plenty of compute servers:

  • L1 cache per core = 32 kB instruction / 32 kB data
  • L2 cache per core = 256 kB
  • L3 cache per CPU = 30 MB

The logic units inside a modern GPU support a range of data formats; some have specialized units for integer, float, and matrix calculations, whereas others have complex structures that do them all. The units are connected to the cache and local memory with high speed, wide interconnects. These changes certainly help in processing 3D graphics, but would be considered overkill for most games. These GPUs were designed for a broader set of workloads than just images, and there is a name for this: the general purpose GPU (GPGPU).

Machine learning and data mining are two fields that have benefited hugely from the development of GPGPUs and the supporting software packages and APIs (e.g. Nvidia's CUDA, AMD's ROCm, OpenCL), as they involve lots of complex, massively-parallel calculations.

Large GPUs, packed with thousands of unified shader units, are perfect for such tasks, and both AMD and Nvidia (and now Intel is joining the fun) have invested billions of dollars into the R&D of chips that offer increasingly better compute performance.
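As a rough idea of what such a workload looks like from the software side, here's a minimal sketch using the CuPy library, one of several packages that route NumPy-style array math through CUDA. It assumes a CUDA-capable Nvidia GPU and an installed CuPy package, and is only an illustration of the kind of massively-parallel job these chips are built for:

```python
# Minimal GPGPU sketch: a large matrix multiply, the sort of massively-parallel
# workload machine learning leans on. Assumes a CUDA-capable GPU and CuPy installed.
import cupy as cp

a = cp.random.rand(4096, 4096, dtype=cp.float32)  # arrays created directly in GPU memory
b = cp.random.rand(4096, 4096, dtype=cp.float32)

c = a @ b                   # ~137 GFLOP of work, spread across the shader units
checksum = float(c.sum())   # pull a single result back to the CPU
print(f"checksum: {checksum:.3e}")
```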

At the moment, both companies design GPU architectures that can be employed across a variety of market sectors, and typically avoid making completely separate layouts for graphics and compute. This is because the majority of the profit from making GPUs still comes from selling 3D graphics cards, but whether it stays that way isn't certain. It is possible that as the demand for compute continues to rise, AMD or Nvidia could dedicate more resources to improving the efficiency of chips for those markets, and fewer for rendering.

But whatever happens next, we know one thing for certain: the next round of multi-billion transistor, high power GPUs will continue to be just that little bit more efficient than their predecessors. And that's good news, no matter who's making them or what they're being used for.
