



Volume Number: 22 (2006)
Issue Number: 2
Column Tag: Programming
by Emmanuel Stein
Many of today's HPC clusters, such as Virginia Tech's System-X, take advantage of open source software and build their infrastructure around commodity hardware in what are typically referred to as "one-off" systems. This approach differs dramatically from the more turn-key, software-driven cluster solutions, epitomized by Apple's Workgroup Cluster, in which software such as Oracle RAC, gridMathematica and RenderMan, to name a few, drives the choice of hardware appropriate for a given application. In those systems, an enhanced user experience and easier deployment and management are the driving forces. However, such solutions trade efficiency for ease of use and therefore tend to take the performance out of HPC. Among today's true HPC solutions, a premium is placed on getting the most out of a given hardware architecture. In the case of System-X, as well as similar clusters that have followed (e.g. the Turing and COLSA HPC clusters), the PowerPC 970 processor, with its two double-precision floating-point units, and the availability of low-latency fabrics such as Myrinet and InfiniBand served as the prime movers behind the choice of platform. The majority of today's top supercomputers rely on these low-latency interconnects to realize their blindingly fast parallel computations, since a fast fabric lets a cluster take full advantage of the processing power of each individual node. A number of vendors provide low-latency technology, including Small Tree Solutions and Mellanox.
To take advantage of an HPC environment, the Message Passing Interface (MPI) is employed to parallelize work across a targeted set of cluster nodes. MPI is a low-level but standardized method of programmatically enabling communication among disparate nodes. Exposed as library functions in C and as subroutines in FORTRAN, MPI was developed as a portable interface for enabling parallelism within a multi-processor or multi-node environment. Although computational problems can also be broken down across multiple processors using approaches such as OpenMP, MPI remains the interface of choice for HPC applications due to its flexibility, raw performance and scalability. Several MPI products are available; for example, Verari Systems Software makes the commercial MPI/Pro 2.1 software, considered to be a high-performance, scalable implementation of the MPI-2 standard, and Dauger Research makes Pooch, a graphical MPI solution.
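The division of labor that MPI makes possible can be sketched without any MPI library at all. The following is a minimal single-process illustration, not MPI code: the "ranks" are simply loop indices, and the function names (`block_range`, `partial_sum`, `simulated_reduce`) are hypothetical. It shows the common pattern in which each rank works only on the slice of data it owns and the partial results are then combined, which in a real MPI program would be done with a reduction across nodes.

```python
# Single-process illustration of the message-passing pattern; no MPI library
# is used, and the "ranks" here are just loop indices (an assumption made so
# the sketch runs anywhere).

def block_range(rank, size, n):
    """Return the half-open index range [start, stop) owned by `rank`."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional element.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def partial_sum(rank, size, data):
    """Work done locally on one rank: sum only the slice it owns."""
    start, stop = block_range(rank, size, len(data))
    return sum(data[start:stop])

def simulated_reduce(size, data):
    """Stand-in for a sum reduction: combine every rank's partial result."""
    return sum(partial_sum(rank, size, data) for rank in range(size))

data = list(range(1, 101))          # the integers 1..100
total = simulated_reduce(4, data)   # four hypothetical ranks
print(total)
```

The block decomposition hands any remainder to the lowest-numbered ranks, so the slices differ in length by at most one element regardless of how many ranks are used.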
At 1 pm on June 23, 2003, Jason Lockhart looked on as Apple announced the new PowerMac G5. In that instant, the frustration of months of dead-end talks with potential vendors over the acquisition of an HPC cluster capable of 10 teraflops vanished, as the future platform of the then-unnamed System-X materialized as if out of thin air. A week later, Dr. Srinidhi Varadarajan, who had submitted a proposal to the National Science Foundation (NSF) for a Major Research Instrumentation grant, was flown to Apple headquarters to discuss the feasibility of constructing a 1,100-node HPC supercomputer using Apple's yet-to-be-released line of PowerPC 970-based systems. Varadarajan not only convinced Apple that Virginia Tech should be given the opportunity to build the high-profile HPC cluster, but also managed to secure the first 1,100 production units. Although not a Mac user at the time, Varadarajan took only three days to realize that OS X and the PowerMac G5 represented the ideal platform for his ambitious supercomputing project.
Grace Under Pressure:
Following the choice of the G5 platform came the monumental task of adapting and extending core software components such as MPI, BLAS, the InfiniBand drivers, and the custom high-performance memory manager to run optimally on the 2,200-CPU cluster. In a whirlwind of prodigious programming, Varadarajan, two Mellanox engineers and a handful of graduate students accomplished what would have taken an army of dedicated developers many months to complete.
Among the most crucial tasks was adapting existing variants of MPI to run on such a large-scale cluster and, moreover, doing so in a manner that was both robust and took full advantage of the low-latency InfiniBand fabric. Although MPI variants had been ported to OS X previously, none could scale much past a 128-node cluster, much less a 2,200-CPU system. There was also the challenge of adapting the existing MPI stack to make efficient use of the InfiniBand fabric. In fact, the only MPI implementation that did support InfiniBand was available only in beta form and had not been designed, much less optimized, for either OS X or a cluster on the scale of System-X.
In collaboration with Dr. Panda of Ohio State University, the author of MVAPICH (MPI for Verbs API with InfiniBand support), Varadarajan worked tirelessly to sort through the MVAPICH code base, optimizing or rewriting substantial portions of the code. Virtually no element of the code was left untouched, since the ultimate performance and ranking of System-X would depend on the successful implementation of this critical piece of the puzzle.
At the same time, two Mellanox engineers, aided by then graduate student Michael Heffner, worked around the clock for several weeks to develop the raw InfiniBand drivers for use with System-X. These raw drivers were integrated with the newly ported MPI stack using a kernel bypass to communicate directly with the InfiniBand Host Channel Adapter (HCA).
During this period of frantic development, Varadarajan, with extensive help from Dr. Goto of the Texas Advanced Computing Center, worked to further tune and optimize Goto's BLAS (Basic Linear Algebra Subprograms) for use with his memory management system. GotoBLAS is the fastest available BLAS implementation and proved instrumental in speeding up matrix operations, which are crucial for high scores on the LINPACK benchmark (http://www.netlib.org/benchmark/hpl/) used by the Top500 organization to rank supercomputers.
What resulted from these intense efforts was what principal designer Varadarajan termed a "pleasant surprise," with an amazing top 3 showing in the supercomputing challenge. This accomplishment is underscored by the fact that System-X cost only $5.2 Million and was put together in roughly 3 weeks! To put this in perspective, the top supercomputer that year, the custom designed Japanese Earth Simulator (5,104 processors), which outperformed System-X by a factor of three, cost $350 Million and was several years in the making.

Figure 1. System-X in 2003 after earning top 3 status on the Supercomputing Top500 with 10.28 TF
System-X Version 2.0
In 2004, System-X was upgraded from 1,100 PowerMac G5 towers running dual 2 GHz G5 processors to Xserve G5s running dual 2.3 GHz processors. The improved throughput of the Xserve I/O subsystem, coupled with the increased clock speed of the processors, gave the system a significant speedup, from 10.28 to 12.25 teraflops (peak performance 17.5 teraflops). With the addition of ECC memory, the system moved from what was, in essence, a benchmarking and proof-of-concept system to a formidable production HPC supercomputer capable of precise and massively parallel scientific computations. ECC RAM proved essential to the success of subsequent research because, across 1,100 nodes, a single bit flip, caused for instance by a solar flare, could silently corrupt the calculated data.
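To make the bit-flip concern concrete, here is a small sketch of what one inverted bit can do to a double-precision number. It uses only Python's standard `struct` module; the helper name `flip_bit` is hypothetical and the example says nothing about how ECC hardware itself works. Depending on which bit is hit, the damage ranges from a change in the last decimal place to a value that is no longer a finite number.

```python
import struct

def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE 754 double encoding inverted.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

original = 1.0
low = flip_bit(original, 0)     # lowest mantissa bit
high = flip_bit(original, 62)   # top exponent bit

print(low - original)   # a tiny error in the last place
print(high)             # no longer a finite number
```

For the value 1.0, flipping the lowest mantissa bit changes the result by 2^-52 (about 2.2e-16), while flipping the top exponent bit turns it into infinity. Without ECC, neither error would be detected by the hardware.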

Figure 2. System-X: In Production and cruising along at 12.25 TF
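The LINPACK figures can be set against a back-of-the-envelope theoretical peak. The sketch below assumes that each PowerPC 970 completes four floating-point operations per clock cycle (two floating-point units, each performing one fused multiply-add); that per-cycle figure is an assumption for illustration rather than something stated in the article, and the function name `peak_teraflops` is hypothetical.

```python
def peak_teraflops(nodes, cpus_per_node, clock_ghz, flops_per_cycle=4):
    """Theoretical peak in teraflops.

    flops_per_cycle=4 is an assumption: two floating-point units per CPU,
    each completing one fused multiply-add (2 operations) per cycle.
    """
    gigaflops = nodes * cpus_per_node * clock_ghz * flops_per_cycle
    return gigaflops / 1000.0

towers = peak_teraflops(1100, 2, 2.0)    # original PowerMac G5 towers
xserves = peak_teraflops(1100, 2, 2.3)   # upgraded Xserve G5 nodes

# Compare the measured LINPACK results (10.28 and 12.25 TF) with the peaks.
print(f"2.0 GHz peak: {towers:.2f} TF, LINPACK/peak: {10.28 / towers:.0%}")
print(f"2.3 GHz peak: {xserves:.2f} TF, LINPACK/peak: {12.25 / xserves:.0%}")
```

Under that assumption the 2 GHz towers work out to about 17.6 teraflops, close to the 17.5 teraflops peak quoted in the text, and the 2.3 GHz Xserves to about 20.2 teraflops. In both cases the measured LINPACK result is roughly 60 percent of the theoretical peak.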
Although System-X runs a relatively unmodified version of OS X, the GUI is absent from compute nodes and job submission is done via the simple yet powerful terminal interface. What follows is a cockpit view of System-X.
The pbstop command serves as a monitor of node activity across the cluster.

Figure 3. pbstop command
With the qstat command, users can see the activity and queue status of jobs submitted to or currently running on System-X. Although some users have parallelized their code to run on all available nodes, the more common usage scenario involves multiple users running jobs concurrently on smaller subsets of compute nodes. This occurs for several reasons: it takes a lot of expertise and time to parallelize code to take full advantage of all available nodes, time spent in the queue is proportional to the number of nodes requested for a given job, and some computations lend themselves better to parallelization across fewer nodes.

Figure 4. qstat command
Below is a generic job submission script for System-X. Interestingly enough, it is also employed as a simple benchmarking routine to test system performance and troubleshoot possible issues across system nodes.

Figure 5. qsub.sh
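For readers unfamiliar with Torque/PBS, the general shape of such a job script can be sketched as follows. This is not the System-X script shown in the figure: the helper `make_job_script`, the job name, the program name `./hello_mpi` and the `mpirun` launcher flags are all assumptions for illustration, and the launcher syntax in particular varies from one MPI stack to another. The `#PBS` directives and the `PBS_O_WORKDIR` and `PBS_NODEFILE` variables are standard Torque features.

```python
def make_job_script(job_name, nodes, procs_per_node, walltime, command):
    """Build the text of a Torque/PBS job script as a string."""
    total_procs = nodes * procs_per_node
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",
        f"#PBS -l nodes={nodes}:ppn={procs_per_node}",
        f"#PBS -l walltime={walltime}",
        "",
        "# Torque starts the job in the home directory; return to the",
        "# directory from which qsub was invoked.",
        "cd $PBS_O_WORKDIR",
        "",
        "# Launcher name and flags are assumptions; they vary by MPI stack.",
        f"mpirun -np {total_procs} -machinefile $PBS_NODEFILE {command}",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical request: 4 nodes, both CPUs on each, for ten minutes.
script = make_job_script("hello_mpi", nodes=4, procs_per_node=2,
                         walltime="00:10:00", command="./hello_mpi")
print(script)
```

Saved to a file, text like this would be handed to the scheduler with `qsub`, after which the job appears in the `qstat` listing until it completes.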
System-X currently plays host to numerous researchers around the world, thereby setting the stage for increasingly groundbreaking research. In this section we will cover some of the most popular applications available on System-X and feature some of the stunning scientific developments being made with the advantage of 12.25+ teraflops of raw number crunching power.
A Community of Codes
System-X was designed to offer a generalized, rather than application-specific (e.g. IBM's Blue Gene machines), platform for scientific research. To capitalize on the performance capabilities of the machine, prospective researchers must become master parallel programmers in addition to competent scientists within their respective disciplines. This rather daunting requirement is somewhat offset by the widespread availability of discipline-specific community code that is used as a starting point for the computational decomposition of scientific questions. Although standards like MPI have made code portability among these systems more of a reality, each HPC cluster represents a unique computing environment, and no two are identical. As a result, considerable effort is required to optimize code for parallel runs on a given HPC environment.
What follows are some of the community-derived software tools that are available as part of System-X:
A Gallery of Computations
To highlight some of the real-world science coming out of System-X, the following is a sampling of research visualizations, with accompanying quotes from the principal investigators. QuickTime renders of the screenshots below, including additional material, are available at http://people.cs.vt.edu/~ribbens/tcf/SC05/, courtesy of Cal Ribbens, Ph.D.
"Dr. Onufriev's group is using System-X to perform a series of molecular dynamics simulations aimed at gaining insight into (and ultimately proposing a mechanism for) the unusual flexibility of short DNA fragments. This new phenomenon was recently discovered in experiments that challenge the conventional picture of the DNA molecule, traditionally thought to behave more or less like a rigid rod at the biologically important length scales of up to tens of nanometers. The emerging picture of a much more flexible DNA may change our understanding of how DNA interacts with other biomolecules, such as proteins. An ability to correctly describe these interactions is of fundamental importance to molecular biology and medicine. In agreement with experiments, computer simulations reveal large-scale fluctuations of the DNA fragment, and most importantly provide full atomic details of the structural changes upon bending. Early results indicate that no unraveling of the famous double-helical structure is required for substantial bending, helping to zoom in on the possible mechanism explaining this newly discovered phenomenon." - Alexey Onufriev

Figure 6. Insights into the Mechanisms of DNA Flexibility (Alexey Onufriev)
"Phospholipid bilayers are cell membranes which play an essential role in protecting the cell and regulating biological activity between the extracellular and intracellular domain. The biophysics of cell membranes is of great importance in understanding biological phenomena, ranging from drug interaction to cancer treatment to bio-preservation. Conventional laboratory experiments are not always capable of probing molecular interactions and distinguishing among the many processes occurring at the molecular level. Professor Sum and his research group (www.che.vt.edu/Sum) are applying and developing advanced molecular modeling methods to study phospholipids bilayers to understand their structure, dynamics, and interactions with different solute molecules. One project models the interactions and diffusion of cryoprotectants with the phospholipids bilayer to obtain insight into preservation mechanisms of biological systems. In another project, Sum and colleagues are probing the activation mechanism of the sensory system in response to specific compounds. Even though these systems only encompass a few nanometers of a model cell membrane, they contain a large number of molecules. System X has enabled Professor Sum to study these systems in much greater detail by allowing extensive simulations for larger systems (order of 100,000 atoms) and for longer periods of time (order of 10-100 ns). The simulations on System X allow for the development of long-time dynamics, which is critical to understanding the phospholipids bilayers (long-relaxation times) and the diffusion process." - Amadeu Sum

Figure 7. Biophysics of Phospholipid Bilayers (Amadeu Sum)
Virginia Tech is in the process of upgrading System-X to take advantage of advances in both software and hardware and thereby increase the performance of this computational titan. Specific upgrades include a general system update from OS X 10.3.9 to OS X 10.4, which promises better memory management and, most significantly, the ability to use a 64-bit address space to dramatically increase the resolution of experimental result sets. Further enhancements will include updated InfiniBand drivers as well as the deployment of newer versions of the Mellanox InfiniBand HCAs used for inter-node communication. Beyond these enhancements, each node will be upgraded from 4 to 8 gigabytes of memory, doubling the memory pool available for distributed computation. In addition, there are plans to update the MPI stack and deploy 64-bit compilers to take advantage of the larger address space in Tiger. Once these updates are applied, System-X will be considered stable, and no further enhancements beyond minor software updates are envisaged. The goal of this final revision of System-X is to offer a robust production HPC platform that will serve the scientific community for years to come, with longevity being a primary aspiration.
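The arithmetic behind the memory upgrade and the move to 64-bit addressing is simple enough to sketch. The snippet below treats the article's "gigabytes" as binary gigabytes (GiB), which is an assumption, and the constant and function names are hypothetical. It shows why 64-bit addressing matters once a node holds 8 GB: a process limited to 32-bit pointers can address at most 4 GiB, so on its own it could not reach all of the memory on an upgraded node.

```python
NODES = 1100                      # compute nodes, from the system specification
GIB = 2 ** 30                     # bytes in one gibibyte

def aggregate_memory_tib(nodes, gib_per_node):
    """Total memory across all nodes, in tebibytes."""
    return nodes * gib_per_node / 1024

limit_32 = 2 ** 32 // GIB         # addressable GiB with 32-bit pointers
limit_64 = 2 ** 64 // GIB         # addressable GiB with 64-bit pointers

print(f"32-bit limit: {limit_32} GiB per process")
print(f"64-bit limit: {limit_64} GiB per process")
print(f"aggregate before: {aggregate_memory_tib(NODES, 4):.2f} TiB")
print(f"aggregate after:  {aggregate_memory_tib(NODES, 8):.2f} TiB")
```

Across the 1,100 compute nodes, the upgrade takes the aggregate memory pool from roughly 4.3 TiB to roughly 8.6 TiB.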

Figure 8. System-X today running at full bore and ready for Tiger!
System-X is only the beginning for Virginia Tech, which has opened the Center for High-End Computing Systems (CHECS). Under the direction of Dr. Varadarajan, the center's mission is to tackle the challenges associated with developing the next generation of HPC clusters. The staff are currently working to train future computational scientists in the art of building technologies that enable and enhance HPC. One of the primary goals is a convergence of hitherto disparate CS disciplines such as processor and memory architectures, operating systems, runtime environments, communications subsystems, fault tolerance, scheduling and load balancing, power-aware systems and algorithms, and their associated programming models. According to Varadarajan, the aim is to architect and implement "computing systems and environments that can efficiently and usably span the scales from department-sized machines to national-scale resources." One example of this is the National LambdaRail (NLR), an Ethernet-based optical network that connects research communities and their computing resources across the country. Although not yet in widespread use, the NLR promises to advance research in next-generation networking technologies, and its derivatives may well overtake today's traditional Internet backbone. CHECS will also serve as a test bed for both theoretical and practical applications within HPC, will work to develop successors to System-X, and will collaborate with scientific communities across the world to create much-needed software environments for future HPC systems.
This article would not have been possible without the generous participation of the following individuals.
Srinidhi Varadarajan, Associate Professor, Computer Science Department
Director, Center for High-End Computing Systems
Cal Ribbens
Associate Professor, Associate Department Head, Computer Science Department
Director, Laboratory for Advanced Scientific Computing and Applications (LASCA)
http://research.cs.vt.edu/lasca/
Jason Lockhart, Director of High-Performance Computing and Technology Innovation,
College of Engineering
Kevin Shinpaugh, Director Research/Cluster Computing,
http://www.computing.vt.edu/research_computing
Hassan Aref, Former Dean and Reynolds Metals Professor, Engineering Science and Mechanics Department
http://www.esm.vt.edu/php/person.php?id=10095
Erv Blythe, Vice President of Information Technology
http://www.it.vt.edu/administration/blythe.html
Jane Talbot, Photo Librarian, Visual Communications, The Visual and Broadcast Communications Department
"I'd also like to thank the many other administrators, contractors and volunteers that made this all possible. We may have been the glue that held it all together, but everyone who participated gave 110% to this project. Without their dedication and attention to detail the project would not have been possible" -Jason Lockhart
Compute Nodes:
1,100 dual 2.3 GHz Xserve cluster nodes configured with 4 GB of ECC DDR400 RAM, 80 GB hard drives, Gigabit Ethernet and Mellanox Cougar 4x InfiniBand Host Channel Adapters (HCAs).
Compile Nodes:
3 dual 2.3 GHz Xserve nodes with 4 GB of ECC DDR400 RAM, 3 250 GB hard drives and Gigabit Ethernet.
Networking:
For the primary low-latency fabric, 4 SilverStorm Technologies 9120 InfiniBand core switches supporting 4x InfiniBand (10 Gb/s bidirectional port speed) with 11 leaf modules and 3 spine modules, as well as 64 SilverStorm Technologies 9024 InfiniBand leaf switches using 4x InfiniBand (10 Gb/s bidirectional port speed) with 24 InfiniBand ports per leaf switch. Secondary communications are provided by 6 Cisco Systems 240-port 4506 Ethernet switches.
Storage:
Xserve RAID, configured as a RAID 50 array, storing 2.7 TB of data and available as an NFS server with an aggregate write bandwidth of 90-100 MB/sec. System-X users employ this as a temporary storage area, with result sets offloaded to more permanent storage as needed.
Cooling:
Custom Liebert Extreme Density cooling system in a chilled-water loop configuration, fed by two 125-ton Carrier water chillers that supply about 3 million BTU per hour of cooling capacity. Liebert XDP systems supply an R-134a refrigerant loop to rack-mounted liquid-to-air heat exchangers.
Software:
Apple OS X 10.3.9 (currently migrating to 10.4.x), MVAPICH for message passing, Torque (OpenPBS) for queue management, Moab (Maui) for job scheduling, Ganglia for system monitoring, as well as the IBM XL Fortran, IBM XLC and GCC 3.3 compilers.
Emmanuel Stein has been an avid Mac user since 1984 and has honed his cross-platform skills while working at several Fortune 100 companies. He has recently started his own Mac-centric consulting company, MacVerse, which offers implementation, system administration and development services geared towards the enterprise market. You may reach him at estein@macverse.com.




