Volume Number: 22 (2006)
Issue Number: 2
Column Tag: Programming
An inside look at the inner workings of the original OS X-based supercomputer, plans for the final upgrade, the science that drives System-X, and future directions in the world of High Performance Computing.
by Emmanuel Stein
High Performance Computing (HPC): An introduction
Many of today's HPC clusters, such as Virginia Tech's System-X, take advantage of open source software and build their infrastructure around commodity hardware in what are typically referred to as "one-off" systems. This approach differs dramatically from the more turn-key, software-driven cluster solutions, epitomized by Apple's Workgroup Cluster, in which software packages such as Oracle RAC, gridMathematica, and RenderMan, to name a few, dictate the hardware appropriate for a given application. In such systems, enhanced user experience and facilitated deployment and management are the driving forces. However, these solutions trade efficiency for ease of use and therefore tend to take the performance out of HPC. Among today's real HPC solutions, a premium is placed on getting the most out of a given hardware architecture. In the case of System-X, as well as similar clusters that have followed (e.g., the Turing and COLSA HPC clusters), the PowerPC 970 processor, with its two double-precision floating-point units, and the availability of low-latency fabrics such as Myrinet and InfiniBand served as the prime movers behind the choice of platform. The majority of today's top supercomputers rely on these low-latency interconnects to realize their blindingly fast parallel computations, which in turn enables a cluster to take full advantage of the processing power of each individual node. A number of vendors provide low-latency technology, including Small Tree Solutions and Mellanox.
To take advantage of an HPC environment, the Message Passing Interface (MPI) is employed to realize parallelization across a targeted set of cluster nodes. MPI is a low-level but standardized method of programmatically enabling communication among disparate nodes. Implemented as libraries in C and subroutines in FORTRAN, MPI was developed as a portable, low-level interface for enabling parallelism within a multi-processor or multi-node environment. Although computational problems can also be decomposed across multiple processors or nodes using approaches such as OpenMP, MPI remains the interface of choice for HPC applications due to its flexibility, raw performance, and scalability. For example, Verari Systems Software makes the commercial MPI/Pro 2.1 software, considered to be a high-performance, scalable implementation of the MPI-2 standard, and Dauger Research makes Pooch, a graphical MPI solution.
System-X: The Race to Build Academia's Fastest Supercomputer
At 1 p.m. on June 23, 2003, Jason Lockhart looked on as Apple announced the new PowerMac G5. In that instant, the frustration of months of dead-end talks with potential vendors for the acquisition of an HPC cluster capable of 10 teraflops vanished, as the future platform of the then-unnamed System-X materialized as if out of thin air. A week later, Dr. Srinidhi Varadarajan, who had submitted a proposal to the National Science Foundation (NSF) for a Major Research Instrumentation grant, was flown to Apple headquarters to discuss the feasibility of constructing a 1,100-node HPC supercomputer using Apple's yet-to-be-released line of PowerPC 970-based systems. Varadarajan not only convinced Apple that Virginia Tech should be given the opportunity to build the high-profile HPC cluster, but also managed to secure the first 1,100 production units. Although not a Mac user at the time, Varadarajan took only three days to conclude that OS X and the PowerMac G5 represented the ideal platform for his ambitious supercomputing project.
Grace Under Pressure
Following the choice of the G5 platform came the monumental task of adapting and extending core software components, such as MPI, the BLAS, InfiniBand drivers, and a custom high-performance memory manager, to run optimally on the 2,200-CPU cluster. In a whirlwind of prodigious programming, Varadarajan, two Mellanox engineers, and a handful of graduate students accomplished what would have taken an army of dedicated developers many months to complete.
Among the most crucial tasks was adapting existing variants of MPI to run on such a large-scale cluster and, moreover, doing so in a manner that was robust and took full advantage of the low-latency InfiniBand fabric. Although MPI variants had been ported to OS X previously, none could scale much past a 128-node cluster, much less a 2,200-CPU system. There was also the challenge of adapting the existing MPI stack to make efficient use of the InfiniBand fabric. In fact, the only MPI implementation that did support InfiniBand existed only in beta form and had been neither designed nor optimized for OS X or for a cluster on the scale of System-X.
In collaboration with Dr. Panda of Ohio State University, the author of MVAPICH (MPI over the InfiniBand Verbs API), Varadarajan worked tirelessly to sort through the MVAPICH code base, optimizing or rewriting substantial portions of the code. In this endeavor, virtually no element of the code was left untouched, since the ultimate performance and ranking of System-X would depend on the successful implementation of this critical piece of the puzzle.
At the same time, two Mellanox engineers, aided by then-graduate student Michael Heffner, worked around the clock for several weeks to develop the raw InfiniBand drivers for use with System-X. These raw drivers were integrated with the newly ported MPI stack using a kernel bypass to communicate directly with the InfiniBand Host Channel Adapter (HCA).
During this period of frantic development, Varadarajan, with extensive help from Dr. Goto of the Texas Advanced Computing Center, worked to further tune and optimize Goto's BLAS (Basic Linear Algebra Subprograms) for use with his memory management system. GotoBLAS was widely regarded as the fastest available BLAS implementation and proved instrumental in accelerating matrix operations--crucial for high scores on the LINPACK benchmark (http://www.netlib.org/benchmark/hpl/) used by the Top500 organization to rank supercomputers.
What resulted from these intense efforts was what principal designer Varadarajan termed a "pleasant surprise": an amazing top-three showing on the Top500 list. This accomplishment is underscored by the fact that System-X cost only $5.2 million and was put together in roughly three weeks! To put this in perspective, the top supercomputer that year, the custom-designed Japanese Earth Simulator (5,104 processors), which outperformed System-X by a factor of three, cost $350 million and was several years in the making.
Figure 1. System-X in 2003 after earning top 3 status on the Supercomputing Top500 with 10.28 TF
System-X Version 2.0
In 2004, System-X was upgraded from 1,100 PowerMac G5 towers, each running dual 2 GHz G5s, to Xserve G5s running dual 2.3 GHz processors. The improved throughput of the Xserve I/O subsystem, coupled with the increased clock speed of the processors, afforded the system a significant speedup, from 10.28 to 12.25 teraflops (with a theoretical peak of 17.5 teraflops). With the addition of ECC memory, the system moved from what was, in essence, a benchmarking and proof-of-concept system to a formidable production HPC supercomputer capable of precise and massively parallel scientific computations. The ECC RAM proved essential to the success of subsequent research since, across 1,100 nodes, a single radiation-induced bit flip could silently corrupt the results of a long-running calculation.
Figure 2. System-X: In Production and cruising along at 12.25 TF
System-X: Terminal Velocity
Although System-X runs a relatively unmodified version of OS X, the GUI is absent from compute nodes and job submission is done via the simple yet powerful terminal interface. What follows is a cockpit view of System-X.
The pbstop command serves as a monitor of node activity across the cluster.
Figure 3. pbstop command
With the qstat command, users can see the activity and queue status of jobs submitted to or currently running on System-X. Although some users have parallelized their code to run on all available nodes, the more common scenario involves multiple users running jobs concurrently on smaller subsets of compute nodes. This occurs for several reasons: it takes considerable expertise and time to parallelize code to take full advantage of all available nodes, time spent in the queue is proportional to the number of nodes requested for a given job, and some computations lend themselves better to parallelization across fewer nodes.
Figure 4. qstat command
Below is a generic job submission script for System-X. Interestingly enough, it is also employed as a simple benchmarking routine to test system performance and troubleshoot possible issues across system nodes.
Figure 5. qsub.sh
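The general shape of such a Torque/PBS script is sketched below; the job name, node counts, walltime, and executable path are illustrative placeholders rather than System-X's actual settings:

```shell
#!/bin/sh
# Illustrative Torque/PBS submission script -- all values are placeholders.

#PBS -N mpi_job                 # job name shown by qstat
#PBS -l nodes=16:ppn=2          # request 16 nodes, 2 processors per node
#PBS -l walltime=01:00:00       # wall-clock limit for the job
#PBS -j oe                      # merge stdout and stderr into one log

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Launch one MPI process per allocated processor, as listed in the node file.
NP=$(wc -l < "$PBS_NODEFILE" | tr -d ' \t')
mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./my_mpi_app
```

Submitted with qsub, such a job then appears in the qstat listing until the scheduler dispatches it to the requested nodes.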
She Blinded Me With Science
System-X currently plays host to numerous researchers around the world, thereby setting the stage for increasingly groundbreaking research. In this section we will cover some of the most popular applications available on System-X and feature some of the stunning scientific developments being made with the advantage of 12.25+ teraflops of raw number crunching power.
A Community of Codes
System-X was designed to offer a generalized, rather than application-specific (e.g., IBM's Blue Gene machines), platform for scientific research. To capitalize on the performance capabilities of the machine, prospective researchers must become master parallel programmers in addition to competent scientists within their respective disciplines. This rather daunting requirement is somewhat offset by the widespread availability of discipline-specific community code that serves as a starting point for the computational decomposition of scientific questions. Although standards like MPI have made code portability more of a reality, each HPC cluster represents a unique computing environment, and no two are identical. As a result, considerable effort is required to optimize code for parallel runs in a given HPC environment.
What follows is a sampling of the community-derived software tools available on System-X:
- AMBER (Assisted Model Building with Energy Refinement): A molecular dynamics simulation package based on a force field developed by Peter Kollman et al. at the University of California. The simulation of protein folding is among the variety of applications for which AMBER is used.
- ARPREC (C++/Fortran-90 Arbitrary Precision Package): Essentially a C++ library with both C++ and Fortran-90 translation modules, ARPREC is a mathematical toolkit that enables researchers to conduct interactive, high-precision arithmetic computations up to ten million digits. This computational environment supports high-precision real, integer, and complex datatypes and may be applied to all basic arithmetic operations, most transcendental and combinatorial functions, as well as high-precision quadrature and summation of series.
- ARPS (Advanced Regional Prediction System): This advanced weather modeling tool, developed at the Center for Analysis and Prediction of Storms (CAPS), is an atmospheric modeling and prediction system, which affords researchers real-time data analysis and assimilation.
- CHARMM (Chemistry at HARvard Macromolecular Mechanics): In the same vein as AMBER, CHARMM is both a force field and a simulation package that allows scientists to apply the force field algorithms to problems in molecular dynamics.
- FASTEST (Flow Analysis Solving Transport Equations with Simulated Turbulence): Applied in the study of fluid dynamics, FASTEST is used to solve three-dimensional flow problems. It includes several turbulence models and features free topology and implicit time stepping, and it is parallelized and multi-grid enabled.
- GAMESS (General Atomic and Molecular Electronic Structure System): Used by researchers in quantum chemistry, GAMESS is a software package specifically designed for computational chemistry calculations of the Hartree-Fock, density functional theory, and configuration interaction types, among other advanced electronic structure methods.
- Global Arrays: A shared-memory programming interface for use in distributed memory environments such as computational clusters. This toolkit departs from other shared-memory interfaces in that it is fully compatible with the Message Passing Interface (MPI) and allows programmers to combine message-passing and shared-memory schemes within a single application, exploiting existing message-passing libraries to ease the development of highly parallel code.
- LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator): This molecular dynamics simulator is designed for use in both parallel and single-processor environments and may be used to build molecular systems, automatically assign force field coefficients, and perform complex molecular dynamics simulations, analyses, and visualizations. LAMMPS may be applied to atomic, polymeric, biological, metallic, granular, and hybrid systems. Written in portable C++, LAMMPS is highly extensible and is capable of distributed-memory message passing for optimal performance in HPC environments.
- METIS (a family of multi-level partitioning algorithms) and parMETIS (Parallel METIS): METIS and its parallelized, MPI-aware variant parMETIS are a series of programs for partitioning unstructured graphs and hypergraphs and for computing fill-reducing orderings of sparse matrices. This family of programs is applied in numerical computation, VLSI design, geographic information systems, operations research, bioinformatics, and knowledge discovery.
- NWChem: An HPC-enabled computational chemistry package used in the study of molecular mechanics and molecular dynamics using Hartree-Fock, Post-Hartree-Fock, and density functional theory methods. The NWChem software was developed at the Pacific Northwest National Laboratory by the Molecular Sciences Software group within the context of the Theory, Modeling & Simulation program and under the purview of the Environmental Molecular Sciences Laboratory.
- PETSc (Portable, Extensible Toolkit for Scientific Computation): A suite of data structures and routines used in the parallelized computation of both linear and nonlinear equations and available in C, C++, Fortran and Python.
- ScaLAPACK (Scalable Linear Algebra PACKage): Used to solve dense linear systems and to compute eigenvalues of dense matrices, its block-partitioned algorithms serve to curtail the frequency of data movement between levels of the memory hierarchy--from the registers, cache, and local memory of individual processors out to the off-processor memory of other processors.
- Unified Parallel C: A set of C programming extensions that provides a framework for programming in shared as well as distributed memory environments.
- VASP (Vienna Ab-initio Simulation Package): A Fortran-90 application package used in the field of ab-initio quantum-mechanical molecular dynamics to calculate the forces and stresses involved in relaxing atoms into their instantaneous ground states.
- vecLib (BLAS, LAPACK, FFT, DSP): An Apple-designed framework that abstracts intensive mathematical processing in a manner optimized for the AltiVec engine and tuned for the G5 processor. For more details see http://developer.apple.com/technotes/tn/tn2086.html.
- WRF (Weather Research and Forecasting Model): This mesoscale numerical weather prediction system is used for operational forecasting and atmospheric research and is useful for modeling phenomena at scales from meters to thousands of kilometers.
A Gallery of Computations
To highlight some of the real-world science coming out of System-X, the following is a sampling of research visualizations, with accompanying quotes from the principal investigators. QuickTime renders of the screenshots below, along with additional material, are available at http://people.cs.vt.edu/~ribbens/tcf/SC05/, courtesy of Cal Ribbens, Ph.D.
"Dr. Onufriev's group is using System-X to perform a series of molecular dynamics simulations aimed at gaining insight into (and ultimately proposing a mechanism for) the unusual flexibility of short DNA fragments. This new phenomenon was recently discovered in experiments that challenge the conventional picture of the DNA molecule, traditionally thought to behave more or less like a rigid rod at the biologically important length scales of up to tens of nanometers. The emerging picture of a much more flexible DNA may change our understanding of how DNA interacts with other biomolecules, such as proteins. An ability to correctly describe these interactions is of fundamental importance to molecular biology and medicine. In agreement with experiments, computer simulations reveal large-scale fluctuations of the DNA fragment, and most importantly provide full atomic details of the structural changes upon bending. Early results indicate that no unraveling of the famous double-helical structure is required for substantial bending, helping to zoom in on the possible mechanism explaining this newly discovered phenomenon." - Alexey Onufriev
Figure 6. Insights into the Mechanisms of DNA Flexibility (Alexey Onufriev)
"Phospholipid bilayers are cell membranes which play an essential role in protecting the cell and regulating biological activity between the extracellular and intracellular domain. The biophysics of cell membranes is of great importance in understanding biological phenomena, ranging from drug interaction to cancer treatment to bio-preservation. Conventional laboratory experiments are not always capable of probing molecular interactions and distinguishing among the many processes occurring at the molecular level. Professor Sum and his research group (www.che.vt.edu/Sum) are applying and developing advanced molecular modeling methods to study phospholipids bilayers to understand their structure, dynamics, and interactions with different solute molecules. One project models the interactions and diffusion of cryoprotectants with the phospholipids bilayer to obtain insight into preservation mechanisms of biological systems. In another project, Sum and colleagues are probing the activation mechanism of the sensory system in response to specific compounds. Even though these systems only encompass a few nanometers of a model cell membrane, they contain a large number of molecules. System X has enabled Professor Sum to study these systems in much greater detail by allowing extensive simulations for larger systems (order of 100,000 atoms) and for longer periods of time (order of 10-100 ns). The simulations on System X allow for the development of long-time dynamics, which is critical to understanding the phospholipids bilayers (long-relaxation times) and the diffusion process." - Amadeu Sum
Figure 7. Biophysics of Phospholipid Bilayers (Amadeu Sum)
System-X Version 3.0
Virginia Tech is in the process of upgrading System-X to take advantage of technological advances in both software and hardware, and thereby increase the performance of this computational titan. Specific upgrades include a general system update from OS X 10.3.9 to 10.4, which promises better memory management and, most significantly, a 64-bit address space that will allow dramatically larger in-memory result sets. Further enhancements will include updated InfiniBand drivers, as well as newer versions of the Mellanox InfiniBand HCAs used for inter-node communication. Beyond these enhancements, each node will be upgraded from 4 to 8 gigabytes of memory, doubling the pool available for distributed computation. In addition, there are plans to update the MPI stack and deploy 64-bit compilers to take advantage of Tiger's address space. Once these updates are applied, System-X will be considered stable, and no further enhancements beyond minor software updates are envisaged. The goal of this final revision is to offer a robust production HPC platform that will serve the scientific community for years to come.
Figure 8. System-X today running at full bore and ready for Tiger!
System-X is only the beginning for Virginia Tech, which has opened the Center for High-End Computing Systems (CHECS). Under the direction of Dr. Varadarajan, the center's mission is to tackle the challenges of developing the next generation of HPC clusters. The staff are currently working to train future computational scientists in the art of building HPC-enabling and -enhancing technologies. One of the primary goals is a convergence of hitherto disparate CS disciplines: processor and memory architectures, operating systems, runtime environments, communications subsystems, fault tolerance, scheduling and load balancing, power-aware systems and algorithms, and the associated programming models. According to Varadarajan, the aim is to architect and implement "computing systems and environments that can efficiently and usably span the scales from department-sized machines to national-scale resources." One example is the National LambdaRail (NLR), an Ethernet-based optical network that connects research communities and their computing resources across the country. Although not yet in widespread use, the NLR promises to advance research in next-generation networking technologies, and its derivatives may well overtake today's traditional Internet backbone. CHECS will also serve as a test bed for both theoretical and practical work in HPC, developing successors to System-X and collaborating with scientific communities across the world to create much-needed software environments for future HPC systems.
This article would not have been possible without the generous participation of the following individuals.
Srinidhi Varadarajan, Associate Professor, Computer Science Department
Director, Center for High-End Computing Systems
Associate Professor, Associate Department Head, Computer Science Department
Director, Laboratory for Advanced Scientific Computing and Applications (LASCA)
Jason Lockhart, Director of High-Performance Computing and Technology Innovation,
College of Engineering
Kevin Shinpaugh, Director of Research/Cluster Computing
Hassan Aref, Former Dean and Reynolds Metals Professor, Engineering Science and Mechanics Department
Erv Blythe, Vice President of Information Technology
Jane Talbot, Photo Librarian, Visual Communications, The Visual and Broadcast Communications Department
"I'd also like to thank the many other administrators, contractors and volunteers that made this all possible. We may have been the glue that held it all together, but everyone who participated gave 110% to this project. Without their dedication and attention to detail the project would not have been possible" -Jason Lockhart
System-X: The Makings of an HPC Speed Demon
1,100 dual 2.3 GHz Xserve cluster nodes, each configured with 4 GB of ECC DDR400 RAM, an 80 GB hard drive, Gigabit Ethernet, and a Mellanox Cougar 4x InfiniBand Host Channel Adapter (HCA).
3 dual 2.3 GHz Xserve nodes with 4 GB of ECC DDR400 RAM, three 250 GB hard drives, and Gigabit Ethernet.
For the primary low-latency fabric, 4 SilverStorm Technologies 9120 InfiniBand core switches supporting 4x InfiniBand (10 Gb/s bidirectional port speed) with 11 leaf modules and 3 spine modules, as well as 64 SilverStorm Technologies 9024 InfiniBand leaf switches using 4x InfiniBand (10 Gb/s bidirectional port speed) with 24 InfiniBand ports per leaf switch. Secondary communications are provided by 6 Cisco Systems 240-port 4506 Ethernet switches.
An Xserve RAID, configured as a RAID 50 array, storing 2.7 TB of data and available as an NFS server with an aggregate write bandwidth of 90-100 MB/sec. System-X users employ this as temporary storage, with result sets offloaded to more permanent storage as needed.
A custom Liebert Extreme Density cooling system in a chilled-water loop configuration, fed by two 125-ton Carrier water chillers that supply about 3 million BTUs of cooling capacity. Liebert XDP units supply an R-134a refrigerant loop to rack-mounted liquid-to-air heat exchangers.
Apple OS X 10.3.9 (currently migrating to 10.4.x), MVAPICH for message passing, Torque (OpenPBS) for queue management, Moab (Maui) for job scheduling, Ganglia for system monitoring, as well as the IBM XL Fortran, IBM XL C, and GCC 3.3 compilers.
Emmanuel Stein has been an avid Mac user since 1984 and has honed his cross-platform skills while working at several Fortune 100 companies. He has recently started his own Mac-centric consulting company, MacVerse, which offers implementation, system administration and development services geared towards the enterprise market. You may reach him at email@example.com.