by Matt Harvey
In Part 1 of this series, I recounted how we came to be developing molecular dynamics simulation code for the IBM Cell processor. Back in 2006, the Cell seemed poised to make a profound impact on the high performance computing field – even becoming the processor around which the first petascale system, LANL’s “Roadrunner”, was built. It is now barely remembered. Only those well-versed in IBM Kremlinology can say why it was effectively killed off, but quite likely its unpopularity amongst programmers forced to deal with its sheer complexity played no small part in the decision.
Lessons learned while working with IBM’s Cell Processor
The Cell was indeed complex. But its crime was not the complexity per se, but rather that so much of it was exposed directly to the programmer. The absence of any tool-kit providing a clean high-level abstraction of the device meant programmers had to deal with the hardware head-on. Nevertheless, our experience of developing for the Cell this way led us to several important conclusions:
1) Future highly parallel processors would have several levels of parallelism (instruction, vector, thread and program), all of which matter (and interact) when aiming for optimum performance.
2) Partitioned address spaces and different classes of memory would make explicit data movement essential, and with it the need to hide latency by overlapping data movement with compute.
3) Heterogeneous designs that combine different processor types each with complementary capabilities in a single device would be common.
These reinforced our view that developing high performance programs for these future processors would require much more than recompilation of existing applications with a special compiler. Their architectures would be so far away from the dominant single-core CPU + serial/threaded program model that major algorithmic changes would be required, necessitating substantial redesign/refactoring of code. Put in the context of scientific computing, where codes may live for decades, it’s clear that if you force change with every new hardware revision, programmers won’t target your hardware.
The implication of this was that for the programmer to have any hope of developing new code that is both optimised and portable to other/future hardware, programming with the right high-level abstraction of the hardware would be critical. In other words, having a high quality set of software development tools would be more important than ever.
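To make the second conclusion concrete, here is a minimal sketch of overlapping data movement with compute by double buffering: while one chunk is being processed, the next is already being fetched. It is an illustration only, not the code we wrote for the Cell. The names `fetch_chunk`, `compute` and `sum_squares_double_buffered` are hypothetical, the "transfer" is just a list copy standing in for an explicit copy between memory spaces, and Python threads are used purely to show the structure rather than to gain real speed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(data, start, size):
    """Hypothetical stand-in for an explicit transfer into fast local memory."""
    return list(data[start:start + size])


def compute(chunk):
    """Hypothetical stand-in for the compute step: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def sum_squares_double_buffered(data, chunk_size):
    """Process data chunk by chunk, fetching chunk i+1 while computing on chunk i."""
    total = 0
    # A single worker thread plays the role of the data-movement engine.
    with ThreadPoolExecutor(max_workers=1) as mover:
        pending = None  # transfer issued on the previous iteration
        for start in range(0, len(data), chunk_size):
            # Issue the next transfer before touching the current chunk...
            next_transfer = mover.submit(fetch_chunk, data, start, chunk_size)
            # ...so that it proceeds while we compute on the previous one.
            if pending is not None:
                total += compute(pending.result())
            pending = next_transfer
        # Drain the last outstanding transfer.
        if pending is not None:
            total += compute(pending.result())
    return total


if __name__ == "__main__":
    print(sum_squares_double_buffered(list(range(10)), chunk_size=3))
```

The point of the pattern is that the transfer for the next chunk is always issued before the current chunk is computed on, so that, on hardware where the two can genuinely run concurrently, the latency of the copy is hidden behind useful work.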
So, what happened next?
In 2007, NVIDIA, a company known for its PC graphics cards, released its new model, called the G80. Few people in the HPC space knew much about NVIDIA then – its products were mostly sold to gamers and architects – and so the implications of developments in the 3D graphics field had gone unremarked by most of the scientific computing world. The arrival of the G80 was, in retrospect, a rare moment of revolutionary change when an outsider enters a new field and really shakes it up. So what was so revolutionary about the G80? To understand, we need to take a trip down memory lane.
A Potted History of PC Graphics
In the early 90’s a PC graphics adapter was dumb hardware – all it really did was squirt out the contents of its memory (“frame buffer”) down a cable to a monitor. The host CPU was completely responsible for writing the data representing the color of each display pixel (or character) into the frame buffer.
Having the CPU do all the work was very inefficient, not least because of the slow bus connection, so it became increasingly common for graphics adapters to have some degree of acceleration for simple, common operations: for example, to move a block of pixels from one location to another (“blitting”, useful for moving or scrolling windows), or to fill a whole area with color.
When games with 3D graphics first started to become popular (if you worked in a computer lab in the 00s you surely played Quake death matches!) the graphics scene, typically expressed as a triangulated surface, was constructed and rendered by the CPU. Because the demand for better game graphics outpaced what improvements in CPUs and buses could provide (and because the dollar size of the games market was growing dramatically), many of these geometric primitive operations came to be implemented in hardware. The GPU was born.
The first GPUs bore all the hallmarks of a device engineered for a specific task – the hardware was a set of relatively simple fixed-function blocks all pipelined together, with each unit offloading one fixed aspect of the rendering and display process. Two of the more important functions of this pipeline, texturing and lighting (the process of making a surface look more realistic by covering it with an appropriate image and illuminating it), are quite demanding of hardware, so GPUs rapidly began to acquire their own large, very high bandwidth memories as well as limited floating point capabilities.
By the early 2000’s, GPUs were starting to boast performance figures high enough to cause some people in scientific computing, myself included, to start to wonder if they might be useful for other things beyond making pretty pictures. Superficially the GPUs of the day looked highly appealing – they had lots of memory bandwidth, reasonable floating point capability and low cost. Unfortunately, that’s where the good news stopped!
GPUs were still very much fixed-function devices and, furthermore, were only programmable through graphics API languages (NVIDIA’s Cg and OpenGL’s GLSL). To re-purpose them you had to have an algorithm that could map directly onto the structure imposed by the programming abstraction, in effect making the computation look like a special case of shading, with the input as a texture image and the output as the screen frame buffer. Not much fun!
To compound matters, the bus that connected the GPU to the host computer (called AGP) was horrendously slow – any time you saved by doing the compute on the GPU would be lost in the ponderous copy of the results back to the host.
When we were looking for novel processors to develop for, we considered, but ultimately dismissed, the contemporary GPUs because of these short-comings, although they are – in essence – more extreme versions of the problems of parallelism, data locality and architectural abstraction that we encountered with Cell.
NVIDIA brings GPUs to HPC
Returning to the G80, NVIDIA had clearly seen the short-comings of their fixed-function hardware, and realised that a more general-purpose architecture would let them steal a march on their competition by giving the programmer greater flexibility. The G80 was designed around a fully programmable core and, although fixed-function hardware remained, much of the functionality was now performed in software.
Interesting though this was, what made it really significant in the world beyond 3D was that NVIDIA coupled the G80 release with a new programming language, called CUDA, which provided an expressive programming model that embodied an elegant hardware abstraction.
NVIDIA had clearly set its sights on expanding into the HPC space, and it has done so very successfully – from having no presence at all in 2007, by November 2010 it had acquired the number one spot on the Top 500 list of supercomputers, with the 2.5 petaflop Tianhe 1A system.
In the next post, we’ll dive into the architecture of the modern NVIDIA GPU.