The Road towards GPU Molecular Dynamics: Part 2

by Matt Harvey

In Part 1 of this series, I recounted how we came to be developing molecular dynamics simulation code for the IBM Cell processor. Back in 2006, the Cell seemed poised to make a profound impact on the high performance computing field – even becoming the processor around which the first petascale system, LANL’s “Roadrunner”, was built. It is now barely remembered. Only those well-versed in IBM Kremlinology can say why it was effectively killed off, but quite likely its unpopularity amongst programmers forced to deal with its sheer complexity played no small part in the decision.

Lessons learned while working with IBM’s Cell Processor

The Cell was indeed complex. But its crime was not the complexity per se, but rather that so much of it was exposed directly to the programmer. The absence of any tool-kit providing a clean high-level abstraction of the device meant programmers had to deal with the hardware head-on. Nevertheless, our experience of developing for the Cell this way led us to several important conclusions:

1) Future highly parallel processors would have several levels of parallelism (instruction, vector, thread and program), all of which matter (and interact) when aiming for optimum performance.
2) Partitioned address spaces and different classes of memory would make explicit data movement essential, and with it the need to hide latency by overlapping data movement with compute.
3) Heterogeneous designs that combine different processor types each with complementary capabilities in a single device would be common.

These reinforced our view that developing high performance programs for these future processors would require much more than recompilation of existing applications with a special compiler. Their architectures would be so far away from the dominant single-core CPU + serial/threaded program model that major algorithmic changes would be required, necessitating substantial redesign/refactoring of code. Put in the context of scientific computing, where codes may live for decades, it’s clear that if you force change with every new hardware revision programmers won’t target your hardware.

The implication of this was that for the programmer to have any hope of developing new code that is both optimised and portable to other/future hardware, programming with the right high-level abstraction of the hardware would be critical. In other words, having a high quality set of software development tools would be more important than ever.

So, what happened next?

In late 2006, NVIDIA, a company known for its PC graphics cards, released its new model, called the G80. Few people in the HPC space knew much about NVIDIA then – its products were mostly sold to gamers and architects – and so the implications of developments in the 3D graphics field had gone unremarked by most of the scientific computing world. The arrival of the G80 was, in retrospect, a rare moment of revolutionary change when an outsider enters a new field and really shakes it up. So what was so revolutionary about the G80? To understand, we need to take a trip down memory lane.

GeForce 8600GT First GPU used for MD

A Potted History of PC Graphics

In the early 90’s a PC graphics adapter was dumb hardware – all it really did was squirt out the contents of its memory (“frame buffer”) down a cable to a monitor. The host CPU was completely responsible for writing the data representing the color of each display pixel (or character) into the frame buffer.

Having the CPU do all the work was very inefficient, not least because of the slow bus connection, so it became increasingly common for graphics adapters to have some degree of acceleration for simple, common operations: for example, moving a block of pixels from one location to another (“blitting”, useful for moving or scrolling windows), or filling a whole area with color.

When games with 3D graphics first started to become popular (if you worked in a computer lab in the 00s you surely played Quake death matches!) the graphics scene, typically expressed as a triangulated surface, was constructed and rendered by the CPU. Because the demand for better games graphics outmatched what improvements in CPUs and busses could provide (and because the dollar size of the games market was growing dramatically), many of these geometric primitive operations became implemented in hardware. The GPU was born.

The first GPUs bore all the hallmarks of a device engineered for a specific task – the hardware was a set of relatively simple fixed-function blocks all pipelined together, with each unit offloading one fixed aspect of the rendering and display process. Two of the more important functions of this pipeline, texturing and lighting (the process of making a surface look more realistic by covering it with an appropriate image and illuminating it), are quite demanding of hardware, so GPUs rapidly began to acquire their own large, very high bandwidth memories as well as limited floating point capabilities.

By the early 2000’s, GPUs were starting to boast performance figures high enough to cause some people in scientific computing, myself included, to start to wonder if they might be useful for other things beyond making pretty pictures. Superficially the GPUs of the day looked highly appealing – they had lots of memory bandwidth, reasonable floating point capability and low cost. Unfortunately, that’s where the good news stopped!

GPUs were still very much fixed-function devices and, furthermore, were only programmable through graphics shading languages (such as NVIDIA’s Cg and OpenGL’s GLSL). To re-purpose them you had to have an algorithm that could map directly onto the structure imposed by the programming abstraction, in effect making the computation look like a special case of shading, with the input as a texture image and the output as the screen frame buffer. Not much fun!

To compound matters, the bus that connected the GPU to the host computer (called AGP) was horrendously slow – any time you saved by doing the compute on the GPU would be lost in the ponderous copy of the results back to the host.

When we were looking for novel processors to develop for we considered, but ultimately dismissed, the contemporary GPUs because of these short-comings, although they are – in essence – more extreme versions of the problems of parallelism, data locality and architectural abstraction that we encountered with Cell.

NVIDIA brings GPUs to HPC

Returning to the G80, NVIDIA had clearly seen the shortcomings of their fixed-function hardware, and realised that a more general-purpose architecture would let them steal a march on their competition by giving the programmer greater flexibility. The G80 was designed around a fully programmable core and, although fixed-function hardware remained, much of the functionality was now performed in software.

Interesting though this was, what made it really significant in the world beyond 3D was that NVIDIA coupled the G80 release with a new programming language, called CUDA, which provided an expressive programming model that embodied an elegant hardware abstraction.

NVIDIA had clearly set its sights on expanding into the HPC space, and it has done so very successfully – from having no presence at all in 2007, it had by November 2010 acquired the number one spot on the Top 500 list of supercomputers, with the 2.5 petaflop Tianhe-1A system.

In the next post, we’ll dive into the architecture of the modern NVIDIA GPU.

Maxwell GPU review: MD simulations with GeForce GTX980

by Matt Harvey

After a long wait, the greatly-anticipated Maxwell GPU from NVIDIA has finally arrived, in the form of the GeForce GTX980, to rave reviews from the gaming world where it has been acclaimed as the new king of the performance hill.

At Acellera, we’re always tracking the cutting edge of technology to deliver the best systems for molecular dynamics simulation, so we’ve been hard at work putting these new cards through their paces. Before we see how they perform, let’s have a quick look at what’s changed.

What’s new? Maxwell GPUs: the new state of the art for molecular dynamics simulations

NVIDIA’s previous generation GPU, named Kepler, has been our workhorse for almost three years now, first in the form of the GK104 silicon, and then its big brother the GK110. These two devices both have a similar design, based on a 192-processing element block, called a “Streaming Multiprocessor”, or SM for short.
Manufactured on a 28nm process, the GK110 had 15 SMs, although it’s only with the very latest Tesla K40 and GeForce GTX780Ti that we have seen products with all of those SMs enabled – the GTX780 systems we have been shipping this past year have had only 12 SMs activated.

Maxwell GPU vs Kepler: Main differences

Maxwell’s SM is a refinement of that of Kepler, reducing the number of processing cores by a third to 128 but incorporating additional design improvements. NVIDIA claims that, despite having fewer cores, the real-world performance of each SM is reduced by only 10% relative to Kepler. The new Maxwell processor, GM204, is still manufactured on the same 28nm process as GK110 rather than the anticipated 20nm process. Nevertheless, the smaller SM and the intervening refinements to the manufacturing process mean that GM204 can run at higher clock frequencies and contain more SMs than Kepler (16 versus 15) on a die about 40% smaller.
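The core counts behind that comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are ours, and it assumes that the quoted 10% reduction applies to each SM, which is how we read NVIDIA's claim.

```python
# Cores per Streaming Multiprocessor (SM), as quoted in the text.
KEPLER_CORES_PER_SM = 192
MAXWELL_CORES_PER_SM = 128

# SM counts for the full GK110 and GM204 parts, as quoted in the text.
KEPLER_SMS = 15
MAXWELL_SMS = 16

# Assumption: the quoted "only 10% reduced" figure is per SM.
MAXWELL_SM_RELATIVE_PERF = 0.90

# Fractional reduction in cores per SM going from Kepler to Maxwell.
core_reduction = (KEPLER_CORES_PER_SM - MAXWELL_CORES_PER_SM) / KEPLER_CORES_PER_SM

# Total processing cores on each fully enabled chip.
kepler_cores = KEPLER_CORES_PER_SM * KEPLER_SMS
maxwell_cores = MAXWELL_CORES_PER_SM * MAXWELL_SMS

# Implied work done per core, relative to Kepler, under the assumption above.
per_core_gain = MAXWELL_SM_RELATIVE_PERF * KEPLER_CORES_PER_SM / MAXWELL_CORES_PER_SM

print(f"cores per SM reduced by {core_reduction:.0%}")
print(f"total cores: Kepler {kepler_cores}, Maxwell {maxwell_cores}")
print(f"implied per-core throughput vs Kepler: {per_core_gain:.2f}x")
```

In other words, GM204 has fewer cores in total (2048 against 2880), and the claim amounts to each Maxwell core doing roughly 1.35 times the work of a Kepler core, before any difference in clock frequency is taken into account.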

Maxwell GPU Performance in MD Simulation: Faster, stable and more efficient

So how does the Maxwell-based GTX980 fare when running ACEMD, our flagship molecular dynamics code?

Over the last year, we have been selling GTX780-based systems. On the popular dihydrofolate reductase benchmark system of 23,500 atoms, we saw single-GPU performance of around 210ns/day.


Running the same test on a GTX980, with no other performance optimisations, yields an impressive rate of 280ns/day, around 30% faster. On a Metrocubo equipped with 4 GTX980s, that’s over 1 microsecond per day of MD sampling. If you prefer maximum performance over throughput, a two-GPU run can achieve 380ns/day, a new performance record!

Benchmarking conditions: ECC off. X79 chipset. CUDA4.2 and ACEMD ver 2500 or greater. Periodic boundary conditions, 9 Å cutoff, PME long range electrostatic grid size 1, hydrogen mass repartitioning, rigid bonds, Langevin thermostat, time step 4 fs. Note: other codes may be benchmarked with a smaller cutoff or fewer atoms. Performance as of October 8th 2014. See ACEMD page for latest results, and to run a benchmark simulation with your system.
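For readers who want to check the arithmetic behind these figures, here is a minimal Python sketch. The names are ours, and it assumes the 4-GPU figure means four independent single-GPU simulations running side by side (throughput), as opposed to one simulation spread over several GPUs.

```python
# Benchmark rates quoted in the text, in nanoseconds of simulation per day.
NS_PER_DAY_GTX780 = 210.0
NS_PER_DAY_GTX980 = 280.0
NS_PER_DAY_TWO_GTX980 = 380.0

TIMESTEP_FS = 4.0          # time step from the benchmarking conditions
FS_PER_NS = 1.0e6          # femtoseconds in a nanosecond
SECONDS_PER_DAY = 86400.0


def steps_per_second(ns_per_day, timestep_fs):
    """Convert a simulation rate in ns/day into MD time steps per wall-clock second."""
    return ns_per_day * FS_PER_NS / timestep_fs / SECONDS_PER_DAY


# Single-GPU speedup of the GTX980 over the GTX780.
single_gpu_speedup = NS_PER_DAY_GTX980 / NS_PER_DAY_GTX780

# Aggregate sampling of four independent single-GPU runs (assumption, see above).
aggregate_ns_per_day = 4 * NS_PER_DAY_GTX980

# Parallel efficiency of one run on two GPUs relative to two independent runs.
two_gpu_efficiency = NS_PER_DAY_TWO_GTX980 / (2 * NS_PER_DAY_GTX980)

gtx980_steps_per_second = steps_per_second(NS_PER_DAY_GTX980, TIMESTEP_FS)

print(f"speedup: {single_gpu_speedup:.2f}x")
print(f"4-GPU aggregate: {aggregate_ns_per_day:.0f} ns/day")
print(f"2-GPU efficiency: {two_gpu_efficiency:.0%}")
print(f"time steps per second: {gtx980_steps_per_second:.0f}")
```

With the quoted numbers this gives a speedup of about 1.33, an aggregate of 1120 ns/day (just over a microsecond), a two-GPU efficiency of roughly 68%, and around 810 time steps per second on a single GTX980.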


And that’s not all: compared to Kepler, the new GPU is much more power-efficient – measured at the wall, a 4-GPU E3-based Metrocubo system running at full tilt draws almost 200W less, making it even quieter and cooler than before.

It’s quite remarkable that such an improvement has been made without moving to a newer manufacturing node, and makes the future 20nm parts even more tantalising.

GPU hardware for MD simulations available now with Maxwell GPUs

All in all, the new Maxwell has passed its tests with flying colours and we’re very pleased to announce that we are shipping them to customers now.

As usual, feel free to request a test drive. We will be more than happy to make some time available on one of our machines. Maxwell is already available for testing.

Also, feel free to request a quote.

The Road towards GPU Molecular Dynamics: Part 1

by Matt Harvey

In 2006, when we started to develop codes for molecular dynamics (MD) simulations, we did so in anticipation of riding a new wave of technological innovation in processors. Up until then, MD was firmly in the realm of high performance computing. Running a simulation on a desktop or workstation was an exercise in futility — it simply wasn’t possible to run simulations for long enough to reach the timescales of relevance to bio-molecular processes — you had to have access to a supercomputer.

Supercomputers are great things (I’ve spent most of my career building, running and using them) but they are usually far too expensive for any one researcher to own exclusively, meaning that they are often owned by consortia and time on them rationed parsimoniously. Suddenly, a researcher thinking of using molecular simulation had not only to learn about MD, but also to bid for computer time and then find their way around a peculiar new operating environment, all for the privilege of having their simulations sit inexplicably stuck in a batch system queue.

Building a better environment for MD based computer assisted drug discovery

All of these issues combined to make MD quite a niche activity, despite having – in our eyes – direct applicability to a wide range of bio-molecular problems. What, we wondered, if we could build a better workstation? A personal machine able to do useful MD simulations wouldn’t just bring a quantitative change to the field but a qualitative one: by becoming something that one could run almost on a whim, MD could become a standard tool in the toolbox of computational drug design.

The question for us to answer was, would it be possible?

A case for consumer hardware repurposing

If you are familiar with the history of high performance computing, you’ll know that it is littered with the corpses of companies that tried – and failed – to build specialised processors optimised for niche applications. In the best cases, the fate of these businesses is to be rendered irrelevant by the relentless Moore’s Law-paced improvements in mass-market technology. As if we needed any more discouragement from following that path, it had just been trodden by DE Shaw Research, with their special-purpose MD machine Anton[1]. From the published costs, we knew we had no hope of raising financing for a similar effort.

What we needed, then, was to find some already existing processor that:
1) had characteristics making it markedly more suitable for MD than normal CPUs,
and 2) that was also available on the mass market, making it both cheap and likely to benefit from Moore’s Law performance improvements.

At around this time the processor industry, in response to ever-greater manufacturing challenges, was beginning its transition from single-core to multicore designs. It was pretty clear even then that this would be a major technology trend and, although it was still an open question what future multicore processors would look like in detail, it was evident that it was the type of processor we should be designing for.

Looking around for examples that fitted these criteria, it didn’t take us long to whittle the list down to one: the Cell processor.

The Cell processor for molecular dynamics simulations

The Cell processor was jointly developed by IBM, Toshiba and Sony as a high performance processor for media applications. It had a heterogeneous architecture – a single Cell processor contained a conventional Power CPU (called the “PPE”) connected to a set of “Synergistic Processing Elements” (SPEs) that were, in effect, high performance vector processors. The aggregate performance of these SPEs made a Cell processor over 10x faster at arithmetic than its CPU contemporaries. Unlike normal CPUs, the SPEs could not run general purpose programs directly: they operated under the direct control of the PPE. This made their programming substantially more complex than usual.

Nevertheless, the Cell had one decisive factor in its favor – it could be bought for next to nothing, as it was the processor at the heart of the Sony PlayStation 3 games console. (Granted, it could also be bought for a lot if you asked IBM for their version!). Serendipitously, the PS3 could also run Linux, giving us a sensible, familiar, development environment.

CellMD – the first MD code for running MD simulations on graphic processors

The programming of the Cell that ultimately led to CellMD proved quite challenging – to get good performance, we had to consider:

*) Explicit data management. The SPEs could not access main system memory directly: all input and output data for their programs had to be staged in explicitly using code written for the PPE. Furthermore, the SPEs had only very limited memory (256KB), making it necessary to carefully pack data structures.
*) Multilevel parallelism. In mapping the MD work onto the SPEs we had to consider division of work not only across the array of SPEs but also across the 128bit SIMD vector word on which each SPE operated. The PPE also had its own incompatible set of vector SIMD operations.
*) Flow control complexity. The SPEs, being very simple vector processors optimised for floating point arithmetic, were very poor at handling flow control operations, so code had to be carefully modified to unroll loops and use predication in place of conditionals.
*) Algorithm mapping. Since both the SPEs and PPE could be used simultaneously, getting best performance meant mapping different aspects of the computation onto the different processing elements and carefully overlapping work.
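To make the “predication in place of conditionals” point concrete, here is a small illustrative sketch in Python. It is not taken from CellMD (which targeted the SPEs directly); the function names, the Lennard-Jones form in reduced units and the sample distances are all our own assumptions, chosen only to show the shape of the transformation.

```python
def pair_energy(r2):
    """Lennard-Jones energy in reduced units for a squared distance r2 (assumed form)."""
    inv6 = 1.0 / (r2 * r2 * r2)
    return 4.0 * (inv6 * inv6 - inv6)


def energy_branching(r2_values, cutoff2):
    """Conventional version: a conditional decides whether each pair contributes."""
    total = 0.0
    for r2 in r2_values:
        if r2 < cutoff2:
            total += pair_energy(r2)
    return total


def energy_predicated(r2_values, cutoff2):
    """Predicated version: every pair is evaluated, and a 0/1 mask selects the result."""
    total = 0.0
    for r2 in r2_values:
        mask = float(r2 < cutoff2)   # 1.0 inside the cutoff, 0.0 outside
        total += mask * pair_energy(r2)
    return total


# Hypothetical squared pair distances and squared cutoff, for illustration only.
sample_r2 = [1.0, 2.25, 4.0, 100.0]
sample_cutoff2 = 9.0

branching_result = energy_branching(sample_r2, sample_cutoff2)
predicated_result = energy_predicated(sample_r2, sample_cutoff2)
print(branching_result, predicated_result)
```

Both functions return the same energy. The predicated form does some wasted arithmetic on pairs outside the cutoff, but its inner loop contains no data-dependent branch, which is the property that matters on a simple SIMD processor where every vector lane has to execute the same instruction.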

When we finally had our first working MD code, which we called CellMD, it ran over an order of magnitude faster than could be achieved on a single CPU workstation[2], equivalent to about 5 years of CPU development, and was a vindication of our approach.

Inconvenient revisions of the Cell and migration into GPUs

With CellMD working, we anticipated being able to turn our attention back to more scientific pursuits, expecting the Cell processor — which was garnering substantial interest in the HPC world by this point — to be further developed, so giving us “free” performance improvements. Naturally, things seldom turn out so well!

Although IBM would talk about the general roadmap, details and timescales were very vague and never included anything nearly as cost-effective as the PS3. (In fact, there was only ever one major revision, the PowerXCell 8i in 2008, which found its way into LANL’s Roadrunner, the first petaflop supercomputer[3].) To compound matters, the revisions of the Cell made for later updates to the PS3 were aimed at cost-optimisation: newer production techniques were used to make the same design smaller and cheaper, rather than faster.

Fortunately, the trend towards accelerator processors was occurring in other parts of the industry, and we didn’t have to look far for our next platform: NVIDIA’s G80 GPU.

For more, check out part 2 of this series, where we dive into GPUs.


[1] D E Shaw et al “Anton, A Special-Purpose Machine for Molecular Dynamics Simulation,” Communications of the ACM, vol. 51, no. 7, 2008, pp. 91-97
[2] G. De Fabritiis, Performance of the Cell processor for biomolecular simulations, Comp. Phys. Commun. 176, 670 (2007).
[3] http://www.lanl.gov/roadrunner/

Why we began optimizing GPU hardware for MD simulations

by D. Soriano and Gianni De Fabritiis, PhD.

We started using GPU computing for molecular dynamics simulations in late 2007, shortly after NVIDIA introduced their first fully-programmable graphics card. GPU adoption into our molecular dynamics program was a natural transition as we had spent the previous two years working with Sony’s Cell processor, the first widely-available heterogeneous processor, and had obtained very promising results (see 2007 papers cited). While working towards ACEMD, our molecular dynamics software, we were unable to find a stable GPU cluster from any of the major vendors that matched high GPU density with a cost effective specification. We were thus forced to build our own.

Development of the first GPU servers for MD simulations by Acellera

We started the hardware development work that ultimately led to Acellera’s Metrocubo by cherry-picking the appropriate components from a scarce pool. We knew we wanted to build cost effective GPU computing systems of high density, and so we focused on researching and developing 4xGPU nodes built around consumer hardware. After evaluating the possibilities we found a single 1500W power supply, and picked one motherboard of the very few which at the time supported 4 double-width GPUs. We then complemented this with CUDA GPUs from NVIDIA, and after deciding on the remaining components we obtained a system that allowed us to build a small GPU cluster, to which we tailored what would ultimately become ACEMD.

Development of a computer chassis specifically designed for GPU accelerated MD

Over the next two years, the range of consumer hardware choices available for GPU computing improved significantly. Nevertheless, we could not consistently get our hands on a commercial GPU server that had a suitably designed chassis, and consequently some of our systems required frequent support. For the most part, available chassis did not have the right combination of GPU cooling, low noise, and GPU density, and brand name options came with custom designs of power supplies and mainboards, which complicated updates. To address these issues, and be able to standardize hardware configurations so that we could confidently have robust and efficient machines optimized for GPU accelerated molecular dynamics simulations, we decided to design a new GPU chassis. To date this chassis remains the only one to be optimized for use in MD simulation work and computer aided drug design.

First generation GPU chassis: design and prototyping

The first prototype of the GPU computer chassis, which was made of stainless steel, already included some of the key signature features of the final product. First, we placed the power supply at the front, and fixed the dimensions of the box for exclusive single socket, 4-GPU motherboard support, as multiple sockets increased the cost and decreased performance by about 10% due to PCIe switch latency. The result was a versatile and compact design that permitted facile repurposing of workstations into rackmount solutions, and offered the possibility of high GPU density. While troubleshooting or testing new configurations we frequently moved machines from the server room to the office and vice versa, and therefore we preferred to design a machine that fitted in both environments. Second, we placed two large-radius fans with high quality bearings at the front, instead of the center or the back. Given the small volume of the box, we felt this would be enough to ensure proper cooling and curb noise to manageable levels. As one might expect, the first prototype required revision, but minor modifications, primarily to the front panel, gave us a second generation design that fitted all of our requirements.

Second generation GPU chassis: results and final touch ups

The results obtained with GPU nodes built with the second generation chassis were most satisfactory. The cooling was much better than anything we had ever tested before and the temperature of the GPUs never exceeded 78 °C, even after running the machine for extended periods of time with both actively and passively cooled GPUs. Furthermore, the amount of noise produced by the machines was very acceptable, and did not exceed the background noise produced by the lab’s AC. Some of Acellera’s customers now have six of these systems running satisfactorily at full blast in their offices. Finally, the compact size of the chassis – each 8U tall and less than ⅓ of 19in wide – allowed fitting 3 units per computer rack shelf, on a tray, thus facilitating GPU cluster assembly.

Next, we shifted our attention to the unit’s cosmetics. GPU clusters and workstations are ubiquitously characterized by dull colors and unimaginative designs, and we wanted something more attractive to see in the lab every day. We ordered ten prototypes of various colors, and we decided we liked the green, the orange and the blue versions, but we settled for the last as it coincided with Acellera’s logo color. The feet for the workstations we made ourselves, printed in orange on our own 3D printer.

GPU clusters and workstations built with Acellera’s chassis are marketed in the USA, Canada, and Europe

Since then, Acellera has shipped over a hundred Metrocubo units built on this chassis design. In order to have the flexibility to adapt to hardware changes, only small batches are made and modifications are introduced as needed. Developing standardized hardware configurations allows Acellera to minimize hardware problems. Acellera sells the same standard configurations in Europe and the US thanks to partnerships with companies such as Silicon Mechanics and Azken. None of our Metrocubo machines ship until they have passed a battery of taxing stress tests, including running extensive ACEMD simulations. Our Metrocubo GPU workstations are plug and play devices that come with the OS (CentOS or Ubuntu) and molecular dynamics simulation software all fully installed and ready for use. For GPU cluster configurations, installation is left to the customer’s discretion, but Acellera offers assistance should they prefer to do it on-site. Should you be interested in more photos of our GPU systems or details on a current sample configuration, visit our Google+ profile, see more information on Metrocubo, or contact us directly.

Finally, if you prefer to build your own, here is a sample spec that we currently use:

  • Acellera Metrocubo Chassis
  • Intel Xeon™ E3-1245v3 Quad Core 3.4GHz, 22nm, 8MB, 84W + P4600 GPU
  • MB ASUS Z87 WS
  • 32 GB ECC RAM
  • Silverstone 1500W power supply
  • Hard disk WD 4TB RED
  • 4 GPUs of choice (Tesla K40, GTX 780 Ti)

Note that E3 processors have a built-in GPU that can be used for display while all 4 NVIDIA GPUs are computing. Enjoy.
