Acellera's Blog

The Road towards GPU Molecular Dynamics: Part 1

by Matt Harvey

In 2006, when we started to develop codes for molecular dynamics (MD) simulations, we did so in anticipation of riding a new wave of technological innovation in processors. Up until then, MD was firmly in the realm of high performance computing. Running a simulation on a desktop or workstation was an exercise in futility — it simply wasn’t possible to run simulations for long enough to reach the timescales of relevance to bio-molecular processes — you had to have access to a supercomputer.

Supercomputers are great things (I’ve spent most of my career building, running and using them) but they are usually far too expensive for any one researcher to own exclusively, meaning that they are often owned by consortia, with time on them rationed parsimoniously. Suddenly, a researcher thinking of using molecular simulation had not only to learn about MD, but also to bid for computer time and then find their way around a peculiar new operating environment, all for the privilege of having their simulations sit inexplicably stuck in a batch system queue.

Building a better environment for MD based computer assisted drug discovery

All of these issues combined to make MD quite a niche activity, despite having – in our eyes – direct applicability to a wide range of bio-molecular problems. What, we wondered, if we could build a better workstation? A personal machine able to do useful MD simulations wouldn’t just bring a quantitative change to the field but a qualitative one: by becoming something that one could run almost on a whim, MD could become a standard tool in the toolbox of computational drug design.

The question for us to answer was, would it be possible?

A case for consumer hardware repurposing

If you are familiar with the history of high performance computing, you’ll know that it is littered with the corpses of companies that tried – and failed – to build specialised processors optimised for niche applications. In the best cases, the fate of these businesses is to be rendered irrelevant by the relentless Moore’s Law-paced improvements in mass-market technology. As if we needed any more discouragement from following that path, it had just been trodden by D. E. Shaw Research, with their special-purpose MD machine Anton[1]. From the published costs, we knew we had no hope of raising financing for a similar effort.

What we needed, then, was to find some already existing processor that:
1) had characteristics making it markedly more suitable for MD than normal CPUs, and
2) was also available on the mass market, making it both cheap and likely to benefit from Moore’s Law performance improvements.

At around this time the processor industry, in response to ever-greater manufacturing challenges, was beginning its transition from single-core to multicore designs. It was pretty clear even then that this would be a major technology trend and, although it was still an open question what future multicore processors would look like in detail, it was evident that this was the type of processor we should be designing for.

Looking around for examples that fitted these criteria, it didn’t take us long to whittle the list down to one: the Cell processor.

The Cell processor for molecular dynamics simulations

The Cell processor was jointly developed by IBM, Toshiba and Sony as a high performance processor for media applications. It had a heterogeneous architecture – a single Cell processor contained a conventional Power CPU (called the “PPE”) connected to a set of “Synergistic Processing Elements” (SPEs) that were, in effect, high performance vector processors. The aggregate performance of these SPEs made a Cell processor over 10x faster at arithmetic than its CPU contemporaries. Unlike normal CPUs, the SPEs could not run general-purpose programs directly; they operated under the direct control of the CPU. This made their programming substantially more complex than usual.

Nevertheless, the Cell had one decisive factor in its favor – it could be bought for next to nothing, as it was the processor at the heart of the Sony PlayStation 3 games console. (Granted, it could also be bought for a lot if you asked IBM for their version!) Serendipitously, the PS3 could also run Linux, giving us a sensible, familiar development environment.

CellMD – the first MD code for running MD simulations on graphics processors

The programming of the Cell that ultimately led to CellMD proved quite challenging – to get good performance, we had to consider:

*) Explicit data management. The SPEs could not access main system memory directly; all input and output data for their programs had to be staged in explicitly using code written for the PPE. Furthermore, each SPE had only very limited memory (256 KB), making it necessary to pack data structures carefully.
*) Multilevel parallelism. In mapping the MD work onto the SPEs we had to consider division of work not only across the array of SPEs but also across the 128-bit SIMD vector word on which each SPE operated. The PPE also had its own incompatible set of vector SIMD operations.
*) Flow control complexity. The SPEs, being very simple vector processors optimised for floating point arithmetic, were very poor at handling flow control operations, so code had to be carefully modified to unroll loops and use predication in place of conditionals.
*) Algorithm mapping. Since both the SPEs and PPE could be used simultaneously, getting the best performance meant mapping different aspects of the computation onto the different processing elements and carefully overlapping work.
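
To make the "predication in place of conditionals" point concrete, here is a minimal sketch in plain, portable C. It is not CellMD's kernel and does not use the SPE intrinsics. The function names, the Lennard-Jones energy in reduced units and the four-lane layout (standing in for four floats in a 128-bit vector word) are all assumptions made for illustration. Squared distances are assumed to be strictly positive, and the array length a multiple of four.

```c
#include <stddef.h>

/* Lennard-Jones pair energy in reduced units (epsilon = sigma = 1).
 * Illustrative only: an assumed stand-in for a real force-field term.
 * Assumes r2 > 0. */
static float lj_energy(float r2)
{
    float inv_r2 = 1.0f / r2;
    float inv_r6 = inv_r2 * inv_r2 * inv_r2;
    return 4.0f * (inv_r6 * inv_r6 - inv_r6);
}

/* Reference version: one data-dependent branch per pair. */
float energy_branching(const float *r2, size_t n, float cutoff2)
{
    float e = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (r2[i] < cutoff2)
            e += lj_energy(r2[i]);
    }
    return e;
}

/* Predicated version: the energy is always computed, then multiplied
 * by a 0/1 mask, so the loop body contains no conditional jump.
 * The inner loop has a fixed trip count of 4 (one iteration per
 * hypothetical SIMD lane) and keeps one accumulator per lane.
 * Assumes n is a multiple of 4 (pad the input otherwise). */
float energy_predicated(const float *r2, size_t n, float cutoff2)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4) {
        for (int k = 0; k < 4; ++k) {
            float mask = (float)(r2[i + k] < cutoff2); /* 1.0f or 0.0f */
            lane[k] += mask * lj_energy(r2[i + k]);
        }
    }
    /* Horizontal reduction across the four lanes. */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

Both functions should agree to within floating point rounding; they sum in a different order, so bit-identical results are not guaranteed. The predicated form does more arithmetic, since it evaluates pairs beyond the cutoff, and that is a good trade only on hardware where branches are expensive relative to floating point work, as described above for the SPEs.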

When we finally had our first working MD code, which we called CellMD, it ran over an order of magnitude faster than could be achieved on a single CPU workstation[2], equivalent to about 5 years of CPU development, and was a vindication of our approach.
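
As a rough sanity check on the "5 years" figure, assume (this is not stated above) that CPU performance doubles about every 18 months. The time needed to gain a speedup S = 10, with the symbols t, T_double and S introduced here for illustration, is then approximately:

```latex
\[
t \;\approx\; T_{\text{double}} \cdot \log_2 S
  \;=\; 1.5\,\text{yr} \times \log_2 10
  \;\approx\; 1.5\,\text{yr} \times 3.32
  \;\approx\; 5.0\,\text{yr}
\]
```

A different assumed doubling period would shift the estimate proportionally; a 24-month period, for example, would give roughly 6.6 years.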

Inconvenient revisions of the Cell and migration into GPUs

With CellMD working, we anticipated being able to turn our attention back to more scientific pursuits, expecting the Cell processor — which was garnering substantial interest in the HPC world by this point — to be further developed, so giving us “free” performance improvements. Naturally, things seldom turn out so well!

Although IBM would talk about the general roadmap, details and timescales were very vague and never included anything nearly as cost-effective as the PS3. (In fact, there was only ever one major revision, the PowerXCell 8i in 2008, which found its way into LANL’s Roadrunner, the first petaflop supercomputer[3].) To compound matters, the revisions of the Cell made for later updates of the PS3 were aimed at cost-optimisation: newer production techniques were used to make the same design smaller and cheaper, rather than faster.

Fortunately, the trend towards accelerator processors was occurring in other parts of the industry, and we didn’t have to look far for our next platform: NVIDIA’s G80 GPU.

For more, check out part 2 of this series, where we dive into GPUs.

References

[1] D. E. Shaw et al., “Anton, A Special-Purpose Machine for Molecular Dynamics Simulation,” Communications of the ACM, vol. 51, no. 7, 2008, pp. 91-97.
[2] G. De Fabritiis, “Performance of the Cell processor for biomolecular simulations,” Comp. Phys. Commun., vol. 176, 2007, p. 670.
[3] http://www.lanl.gov/roadrunner/
