Code porting and optimization on a Beowulf
Ivan Rossi, Piero Fariselli, Pier Luigi Martelli, Gianluca Tasco and Rita
Casadio
C.I.R.B. Biocomputing Unit, University of Bologna (Italy)
1. The Basic Local Alignment Search Tool (BLAST) code
1.1 What is BLAST
1.2 A "Divide and Conquer" approach to distributed BLAST
1.3 dncBLAST structure and implementation
1.4 SMP BLAST vs dncBLAST
1.5 Scaling Data on Protein sequences databases (NR)
1.6 Scaling Data on a Nucleotide Database (GENBANK)
2. Gromacs Porting
2.1 The Gromacs Package
2.2 Porting
2.3 The Gromacs code performance optimization
3. Conclusion
4. Bibliography
1. The Basic Local Alignment Search Tool (BLAST) code
1.1 What is BLAST
BLAST (Basic Local Alignment Search Tool) is a set of programs designed to
efficiently search protein and DNA databases for sequence similarities [1].
At present the NCBI-BLAST implementation is the de facto standard for
biological sequence comparison and search in computational biology.
BLAST uses a heuristic algorithm which seeks local as opposed to global
alignments and is therefore able to detect relationships among sequences
which share only isolated regions of similarity. The central idea of the
BLAST algorithm is that a statistically significant alignment is likely to
contain multiple high-scoring pairs of aligned words [2]. BLAST first scans
the database for words (typically of length three for proteins) that score
better than a given threshold T when aligned with some word within the
query sequence. Any aligned word pair satisfying this condition is called a
hit. The second step of the algorithm checks whether each hit lies within
an alignment with a score sufficient to be reported. This is done by
extending a hit in both directions, until the running alignment's score has
dropped more than X below the maximum score yet attained.
This extension step is the most computationally expensive one: it executes
an O(N^2) dynamic programming algorithm [3] that typically accounts for
more than 90% of the execution time of a BLAST run.
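As an illustration only (not the NCBI implementation), the following Python
sketch mimics the extension step for an ungapped hit: the extension stops
once the running score drops more than X below the best score attained so
far. The scoring scheme and the parameter values are hypothetical examples.

# Minimal sketch of ungapped hit extension with an X-dropoff criterion.
# Illustration of the idea only, not the NCBI-BLAST code; the match/mismatch
# scores and the value of X used below are hypothetical examples.
def extend_right(query, subject, q_start, s_start, score, X,
                 match=2, mismatch=-1):
    """Extend an aligned word pair to the right until the running score
    drops more than X below the best score attained so far."""
    best = running = score
    best_len = 0
    i = 1
    while q_start + i < len(query) and s_start + i < len(subject):
        running += match if query[q_start + i] == subject[s_start + i] else mismatch
        if running > best:
            best, best_len = running, i
        if best - running > X:      # X-dropoff: stop extending the alignment
            break
        i += 1
    return best, best_len

# Example usage with toy sequences and an initial word hit of score 6:
print(extend_right("ACGTACGT", "ACGTTCGT", 0, 0, score=6, X=5))

In the real program this extension is performed in both directions and, for
gapped alignments, by the dynamic programming routine mentioned above.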
The scores assigned in a BLAST search have a well-defined statistical
interpretation [4], that is directly related to the probability for the
reported alignment to happen by chance alone.
The NCBI-BLAST programs have been designed to maximize high-throughput
searches. They are multithread-enabled and thus capable of taking advantage
of multiple CPUs, when available on shared-memory SMP machines.
The NCBI-BLAST source code is available on the NCBI FTP site and it is
distributed under the GPL free-software license.
Unfortunately the "NCBI tools", the package to which BLAST belongs, is
also a large assembly (almost a million lines) of poorly-documented C code
under constant evolution. The last publicly available release of the
"NCBI tools internals" official documentation is dated 1994. Furthermore,
it is explicitly stated (in the 1994 documentation and elsewhere) that the
library API may change without warning or notice, according to NCBI
internal needs. The NCBI staff releases new versions of the code several
times a year, which is commendable and fruitful for the user base, but
makes it very difficult for non-NCBI developers to contribute effectively
to the code development.
1.2 A "Divide and Conquer" approach to distributed BLAST
A simple and effective way to introduce parallelism in the BLAST code is to
use a "divide and conquer" (DNC) approach. It is possible to split the
database into multiple non-overlapping subdomains (slices from now on);
next, using the same query, run an independent search for each slice and
then combine the results from each subsearch into a single query report.
Clearly, if the database slices are properly distributed on multiple
"slave" nodes, it will be possible to execute the subsearches in parallel,
taking advantage of the resources available on a Beowulf cluster. This
approach shows several convenient features:
- Avoids modification of BLAST internals: the distribution and gathering
  code can be written as "wrappers" that use the standard BLAST executable
  as a module.
- Simplicity: the approach is easily portable.
- Data distribution: BLAST uses memory-mapped I/O, and performance is
  significantly enhanced when the entire database can be loaded in RAM. By
  slicing the database appropriately this condition can be met on each
  machine of the cluster, even those equipped with relatively small amounts
  of RAM (a minimal slicing sketch follows this list).
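Purely as an illustration of the slicing step (the actual databases are
prepared with the standard NCBI formatting tools), the following Python
sketch splits a FASTA file into a given number of non-overlapping slices of
roughly equal size; the file names and the greedy balancing rule are
arbitrary choices made for the example.

# Hypothetical sketch: split a FASTA database into non-overlapping slices.
# This only illustrates the idea of balancing sequences across slices;
# the real slices are formatted afterwards with the standard NCBI tools.
def split_fasta(path, n_slices, prefix="slice"):
    # Read all entries (header plus sequence lines) from the FASTA file.
    entries, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    entries.append("".join(current))
                current = [line]
            else:
                current.append(line)
        if current:
            entries.append("".join(current))

    # Greedy balancing: always add the next (largest) entry to the smallest
    # slice, so the slices end up with roughly the same number of residues.
    slices = [[] for _ in range(n_slices)]
    sizes = [0] * n_slices
    for entry in sorted(entries, key=len, reverse=True):
        i = sizes.index(min(sizes))
        slices[i].append(entry)
        sizes[i] += len(entry)

    for i, chunk in enumerate(slices):
        with open("%s_%d.fasta" % (prefix, i), "w") as out:
            out.writelines(chunk)

# Example (hypothetical file name): split the NR database into eight slices.
split_fasta("nr.fasta", n_slices=8)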
The only difficulty for the implementation of the DNC scheme is related to
result merging and normalization. BLAST hits are ranked according to two
different scores: the bit score (S) and the expectation value (E). The
former depends only on the degree of similarity (the alignment) between the
query and the hit sequences, while the latter depends both on the alignment
and on the database composition. In detail, E is a measure of the
probability that the alignment is significant, since it represents the
expected number of alignments scoring at least S that can be obtained by
chance alone. There is a simple relation connecting E and S:
E = (m*n) * 2^(-S)
where m and n are the lengths in characters of the query and of the
database, and the product (m*n) is called the dimension of the search
space. Thus by collecting the hits from the subsearches of the DNC strategy
and calculating E with respect to the total database length, it should be
possible to exactly reproduce the BLAST results coming from a search on a
unpartitioned database. In practice this is not the case, since the BLAST
programs implement a correction for the "edge effect" [4a] to take into
account that, for long sequences, an optimal local alignment cannot begin
with any aligned pair of residues. In fact, a high-scoring alignment must
have some length, and therefore cannot begin near the end of either of the
two sequences being compared. This "edge effect" correction produces an
"effective search space" (ESS) that depends both on the query and on the
database composition, and it is this length that the BLAST code uses to
calculate the E values. However, by introducing the approximation that the
equality
ESS = Sum(ESS_i)
holds, where ESS_i is the effective search space of the i-th slice, our
tests show that the deviation from the reference ESS is always less than
+5%, with a corresponding deviation on the calculated E values that never
changes their order of magnitude and, more importantly, never changes the
relative scoring order of the hits. Moreover, no false positives or false
negatives are introduced in the final result. In the light of these
results, we consider this approximation acceptable.
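As a hedged sketch of the merging and normalization step (the data layout
below is a made-up illustration, not the BLAST report format actually parsed
by the wrapper), each hit's E value can be recomputed against the summed
effective search space before the per-slice hit lists are combined and
re-ranked:

# Hedged sketch of E-value renormalization across database slices.
# Each subsearch is assumed to report its slice ESS and, for every hit,
# a name and a bit score S; the dictionaries below are illustrative only.
def merge_hits(per_slice_results):
    """per_slice_results: list of (slice_ess, hits) pairs, where hits is a
    list of dicts with keys 'name' and 'bit_score'."""
    # Approximate the full-database ESS as the sum of the per-slice ESSs.
    total_ess = sum(ess for ess, _ in per_slice_results)

    merged = []
    for _, hits in per_slice_results:
        for h in hits:
            # E = ESS * 2^(-S), evaluated on the summed search space.
            merged.append({"name": h["name"],
                           "bit_score": h["bit_score"],
                           "e_value": total_ess * 2.0 ** (-h["bit_score"])})

    # Rank the combined hit list by E value, as a single search would.
    merged.sort(key=lambda h: h["e_value"])
    return merged

# Toy example: two slices, one hit each.
print(merge_hits([(1.0e8, [{"name": "seqA", "bit_score": 40.0}]),
                  (1.2e8, [{"name": "seqB", "bit_score": 55.0}])]))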
1.3 dncBLAST structure and implementation
Because of the design decision taken, the most computationally demanding
task is the standard BLAST execution itself, while the wrapper tasks of
initialization and data post-processing represent just a marginal overhead
to the entire computation. This means that it is possible to code the
wrapper in an interpreted scripting language without paying a large
penalty in terms of performance. The wrapper has been coded in the Python
language [5], which is an interpreted, interactive, object-oriented
programming language, often compared to Java, Scheme and Visual Basic. When
required, it is easily extensible by means of external modules written in
C, C++ or even Fortran. The Python implementation is portable and open-
source: it runs on many brands of UNIX, on Windows, MacOS and other minor
operating systems. Furthermore it shows a very clear and compact syntax,
and it is (usually) quickly learned. A recent analysis [6] ranked Python
as one of the most productive programming languages, with both average
development time and source-code length almost halved with respect to C and
C++. Since the programming-language background of the coding team was not
uniform (Perl, C and Fortran), Python seemed a good choice.
The dncBLAST code has been designed to handle the same options and to
produce the same output that BLAST does, so that it can be used as its drop-
in replacement.
The structure of the dncBLAST code is summarized in the pseudocode listing
below:
Program dncBLAST
    Read DB distribution and configuration info
    Parse options and trap input errors
    if (DB not distributed) then
        exec BLAST on front-end machine
    else
        initialize query_queue
        initialize subsearches on DB slices
        repeat
            spawn parallel BLAST subsearches on slave nodes
            parse BLAST subsearch results
            merge hits
            update query_queue
        until query_queue is empty
        report hits
    endif
End program
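A hedged Python sketch of the central spawning loop is shown below; the
remote command line (blastall invoked through rsh on each slave node) and
the helper names are illustrative assumptions, not the actual dncBLAST
source.

# Hedged sketch of the dncBLAST spawning step: one standard BLAST process
# per database slice, run concurrently on the slave nodes and collected
# for parsing and merging. Host names, paths and options are examples.
import subprocess

def run_subsearches(query_file, slices, program="blastp"):
    """slices: list of (node_hostname, slice_db_path) pairs."""
    procs = []
    for node, db in slices:
        # Launch a standard BLAST search on the remote node; the remote
        # shell and the option set shown here are illustrative only.
        cmd = ["rsh", node,
               "blastall", "-p", program, "-d", db, "-i", query_file]
        procs.append(subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                      stderr=subprocess.PIPE))

    # Wait for all subsearches and return their raw reports for parsing.
    reports = []
    for p in procs:
        out, _ = p.communicate()
        reports.append(out.decode())
    return reports

# Example: two slices distributed over two slave nodes (hypothetical names,
# assuming the query file is visible from the nodes via a shared filesystem).
# reports = run_subsearches("query.fasta",
#                           [("node1", "/data/nr_slice0"),
#                            ("node2", "/data/nr_slice1")])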
1.4 SMP BLAST vs dncBLAST
The first test performed has been a comparison of the performance of
dncBLAST with that of the standard multithreaded BLAST on a single
two-processor SMP slave node of the CIRB cluster, and the result has been a
pleasant surprise. As shown in the following Figure, the dncBLAST code
outperforms the standard NCBI BLAST running in multithreaded mode for all
query lengths in the test set, except the shortest one, and the performance
difference becomes more and more favorable to dncBLAST with the increase of
the query length.
While a similar behavior can be caused by memory contention on SMP machines
with a larger number of processors, we do not believe this is the case on
a two-processor machine. We think instead that the cause is rooted in the
nature of the algorithms used: it is a known phenomenon that long query
sequences are more likely to have wide regions that match favorably with
the sequences stored in the database, producing long alignments. When this
is the case the computing time is largely dominated by the alignment
extension phase, performed by dynamic programming. If the dynamic
programming code is either not parallel or not well parallelized, the
multiple processors of an SMP machine will not be exploited to their
fullest. In the DNC approach, on the contrary, an independent subsearch
runs in parallel on each processor used, maximizing CPU usage. If this is
the case, the behavior found for a two-processor Intel SMP should be
amplified on an SMP machine with a larger number of processors, such as the
Chiron Vaccines' SUN Enterprise 4500 server at IRIS, and preliminary tests
confirmed this hypothesis.
1.5 Scaling Data on Protein sequences databases (NR)
Scaling tests for protein sequence databases have been performed using the
April 23rd, 2001 release of the widely used Non-Redundant (NR) protein
database. This database contains all the non-redundant entries coming from
the GenBank coding sequence translations and from the Protein Data Bank
(PDB), SwissProt and PIR databases. It can be freely retrieved from the
NCBI ftp server at
ftp://ftp.ncbi.nlm.nih.gov/blast/db
The release used contains 705,002 sequences, corresponding to 222,117,092
total residues, for an average entry length of 315 residues. The test runs
have been performed
using a test set of five chimeric sequences of increasing length. The
elapsed time has been measured by taking the average of the wall-clock time
of ten independent runs for each query, measured by means of the Unix time
utility.
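As an illustration of this procedure (the command line and file names are
hypothetical placeholders), the measurement amounts to averaging the
wall-clock time of ten identical runs:

# Hedged sketch of the timing methodology: average wall-clock time of ten
# identical runs of the same search. The command below is a placeholder.
import subprocess, time

def average_wallclock(cmd, repeats=10):
    elapsed = []
    for _ in range(repeats):
        start = time.time()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)

# Example (hypothetical command and file names):
# avg = average_wallclock(["dncblast", "-i", "query1000.fasta", "-d", "nr"])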
A marked dependence of the scaling performance of the dncBLAST approach on
query length has been found. As shown in the two following graphs, the code
scales very well, with an eight-processor speedup close to 7 (~85%
efficiency) for queries longer than 1000 residues, while the speedup drops
to around 4.5 (60% efficiency) for short (150-residue) queries, where
speedup saturation appears to be imminent. For a 300-residue query, very
close to the average protein length included in NR, a speedup of 5.5 is
achieved, which is still a reasonable 70% efficiency.
Up to six processors, the efficiency is higher than 70% for all queries in
our test set in all the runs.
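For reference, the speedup and efficiency figures quoted above are computed
in the usual way from the measured wall-clock times; the timings in the
example below are illustrative placeholders, not values from the test runs.

# Speedup and parallel efficiency from measured wall-clock times.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    return speedup(t_serial, t_parallel) / n_procs

# Illustrative example: 800 s on one processor, 115 s on eight processors.
print(speedup(800.0, 115.0))         # ~6.96
print(efficiency(800.0, 115.0, 8))   # ~0.87, i.e. about 87% efficiency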
It has to be pointed out that the overhead introduced in the dncBLAST
approach, mainly by the subsearch output parsing done in the data merging
and normalization stage, increases with the number of subsearches performed
and thus with the number of processors used. The shorter the query, the
faster the search runs, and thus the overhead becomes progressively more
important.
On the other hand, for long runs the overhead introduced remains a small
fraction of the total computing time.
1.6 Scaling Data on a Nucleotide Database (GENBANK)
The scaling performance of dncBLAST on genome databank has been tested on a
subset of the GENBANK database comprising 480,708 sequences
summing up to 2,439,606,762 bases for an average sequence length of 5,075.
Again the test
runs have been performed using a test set of five sequences of increasing
length, and the elapsed time has been measured by taking the average of the
wall-clock time of ten independent runs for each query. Results are
reported in the following graph. This database is too large to fit into
the memory of a single node: this explains the superlinear scaling that
appears in the results and shows the benefit, for a large database, of the
data partitioning implicit in the dncBLAST scheme.
2. Gromacs Porting
2.1 The Gromacs Package
Gromacs is a public domain package for simulating biomolecules by means of
Molecular Dynamics (MD) simulation [7]. It consists of an MD program
together with an extended suite of programs to perform analysis and to
prepare inputs for the simulation. The package is almost entirely written
in ANSI C: a few parts are written in Fortran 77 and some recently added
tools are written in C++. The currently distributed version is 2.0, and
this is the version we installed and tested on the Linux Beowulf cluster.
The MD program (mdrun) comes in two flavors: a serial and a parallel
version. All the other tools are serial programs, since the time spent in
them is negligible with respect to the time needed to perform the MD
simulation. The real core of the package is therefore the MD engine. We now
briefly discuss the parallel version of the MD program.
The parallel algorithm adopted by Gromacs is the so-called "domain
decomposition" algorithm. In particular, it uses a "one-dimensional space
decomposition" scheme for the computation of non-bonded interactions and a
"particle decomposition" scheme for all the other interactions, meaning
that each processor is assigned a subset of the Cartesian space or of the
particles, respectively, composing the simulated system. The processors are
combined in a (virtual) ring topology, meaning that each processor
communicates only with its "left" and "right" neighbors. At each step all
neighboring processors exchange information among themselves in order to
have up-to-date data to work on in the next step.
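As a hedged illustration of this ring scheme (written with the mpi4py
Python bindings for brevity; Gromacs itself implements the exchange in C on
top of MPI), each process exchanges its boundary data with its left and
right neighbors at every step:

# Hedged sketch of a ring exchange, as in the one-dimensional domain
# decomposition described above. mpi4py is used here only for illustration.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Neighbors on the (virtual) ring topology.
left = (rank - 1) % size
right = (rank + 1) % size

# Placeholder for the boundary data owned by this processor's slab.
my_boundary = {"rank": rank, "coords": []}

# Send to one neighbor while receiving from the other, so that every
# processor has up-to-date neighbor data before the next MD step.
from_left = comm.sendrecv(my_boundary, dest=right, source=left)
from_right = comm.sendrecv(my_boundary, dest=left, source=right)

print("rank %d received data from ranks %d and %d"
      % (rank, from_left["rank"], from_right["rank"]))

# Run, for instance, with: mpiexec -n 4 python ring_exchange.py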
This one-dimensional partitioning scheme has clear advantages: it is
relatively easy to program, and communications are usually done in large
chunks, which is more efficient on some communication hardware, in
particular low-end platforms such as switched Fast Ethernet. In this
framework the computation of the long-range interactions among atoms (i.e.
the electrostatic interactions) takes up most of the CPU time, being O(N^2)
with respect to the particle number.
2.2 Porting
The porting and the installation of the package on the Scyld Beowulf
cluster at CIRB were relatively simple: the package had already been tested
on similar platforms (e.g. the IBM SP) and no particular problems arose
during the procedure. The mixing of different languages did not represent
an obstacle. A relatively good level of optimization was easily reached by
turning on some specific optimization flags; more aggressive optimization
(again via compiler flags) gives better performance but introduces
numerical problems (incorrect results). This, however, is a known behavior
shared by many compilers.
A disappointing finding has been that compiling the code with the Fortran
version of its inner loops did not improve the performance at all on the
CIRB cluster, while it has been reported to markedly improve it on other
systems. Two different Fortran compilers have been tested: GNU G77, which
is free but also a notorious underachiever, and the Portland Group
(http://www.pgroup.com) PGF77 Workstation compiler for Intel processors,
which on the contrary has a solid reputation for producing efficient code,
but with no substantial result. The G77-compiled code performs worse than
the GCC-compiled version, while PGF77 improves the performance by less than
5%. We also tested the Portland Group C compiler, but the performance does
not differ from that obtained with the free GNU GCC compiler.
2.3 The Gromacs code performance optimization
A preliminary scaling test on a small simulation system showed a less than
satisfactory scaling behavior. This test has been performed using the
special MPICH MPI implementation shipped with the Scyld Beowulf
distribution. It appears that network latency dominates the scaling, since
much better scaling has been reported on systems with better network
connectivity, such as the Cray T3E or the IBM SP.
A discontinuity can be observed at 4 CPUs, when the number of processors
used equals the number of nodes in the cluster. This is probably due to the
fact that Scyld always tries to distribute the spawned MPI processes evenly
among the available compute nodes. When it begins to assign two MPI
processes per node, the scaling behavior changes to a logarithmic one.
While this automatic scheduling policy is very convenient for uniprocessor
compute nodes, since there is no need to worry about load balancing, it is
not for two-processor machines such as ours: by placing two communicating
MPI processes on the same node, the communication between them could pass
through shared memory, thus avoiding the penalty caused by network latency.
The inclusion in the Linux kernel (version 2.2.17) of a special patch for
short network packets did not improve the situation. The newer 2.4.x kernel
versions are not supported by Scyld and could not be tested without
discarding the entire cluster installation and reverting to a more
traditional cluster design, such as the one first tested with the RedHat
distribution and discussed in a previous report (D2.2). Furthermore,
several users reported on the Beowulf mailing list (beowulf@beowulf.org)
stability problems under heavy load for the 2.4 series, which convinced us
to postpone its testing until a more mature version becomes available.
A marked improvement in scaling has been obtained by replacing the
Scyld-modified MPICH library with the LAM implementation of MPI (see
http://www.lam-mpi.org), as can be seen in the following graph.
The two systems used for the scaling tests consist of a short polypeptide
(PEP) and the HIV1 protease (HIV1) respectively, modelled using the
GROMOS-89 force field and solvated in a box of SPC/E water molecules with
periodic boundary conditions. The larger HIV1 system is composed of 19,290
particles and has been simulated for 20 ps, while the smaller PEP system
comprises 8,652 particles and has been simulated for 10 ps. Timings have
been taken as the average of three identical runs. Single-processor average
timings amount to 6050 s for HIV1 and to 1686 s for PEP.
The improvement obtained by using LAM is probably due to its optimized
communication schemes: MPI client-to-client messages can bypass the MPI
daemon, and shared-memory communication within a multiprocessor node is
optimized. However, the LAM environment is not integrated at all into the
Scyld framework, and its installation and execution required a great deal
of scripting and adaptation in order to run on a Scyld cluster, negating
almost all the advantages that Scyld provides in terms of ease of
installation and management. If GROMACS is to be the main application used
on a Linux cluster, it is more convenient not to use Scyld Beowulf, but a
different cluster configuration scheme.
In order to further improve the scaling obtained, it is necessary to
replace the commodity network interconnect with network hardware designed
for low latency, such as Myricom Myrinet (http://www.myricom.com) or the
SCALI interconnect (http://www.scali.com), which provide around 1 Gbps
point-to-point network bandwidth and very low latency. Obviously there is a
premium price to pay for such devices, which are not commodity items yet.
3. Conclusion
The performance of two important codes for computational molecular
biology, one dealing with biological sequence analysis and the other with
macromolecular structure modelling, has been evaluated on a prototype
Beowulf cluster.
The strategy used to adapt the BLAST code to a distributed-memory
architecture such as that of a Beowulf cluster has proven to be simple yet
effective in our tests. In particular, because of its data distribution, it
appears to be especially suited for searches on the genome databanks, which
are the largest, and thus the most unwieldy, databases used in the
bioinformatics field, and therefore exactly those for which the
exploitation of parallel-computing technology is most welcome.
The results obtained for GROMACS show instead that the code requires a
non-commodity low-latency network interconnect in order to scale beyond 6-8
processors. Code performance is very sensitive to network performance, as
shown by the marked improvement obtained by replacing the MPI library
implementation. As a side note, we cannot recommend the Scyld Beowulf
distribution for cluster building if GROMACS is the application of choice
on the cluster: its native MPI implementation is tightly integrated in the
environment and causes GROMACS to underperform, and replacing it negates
the installation and administration ease that is its main advantage over a
more traditional cluster design.
4. Bibliography
1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. "Basic
local alignment search tool." J. Mol. Biol. 215 (1990) 403-410
2. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D.J. " Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. " Nucleic Acids Res. 25
(1997) 3389-3402
3. Smith, T.F. and Waterman, M.S. "Identification of common molecular
subsequences." J. Mol. Biol. 147 (1981) 195-197
4. (a) Altschul, S.F. and Gish, W. "Local alignment statistics." Meth.
Enzymol. 266 (1996) 460-480. (b) Altschul, S.F., Bundschuh, R., Olsen, R.
and Hwa, T. "The estimation of statistical parameters for local alignment
score distributions." Nucleic Acids Res. 29 (2001) 351-361
5. see http://www.python.org
6. Prechelt L. "An empirical comparison of seven programming languages"
Computer 33 (2000) 23
7. Berendsen, H.J.C., van der Spoel, D. and van Drunen, R. "GROMACS: A
Message-Passing Parallel MD Implementation." Comp. Phys. Comm. 91 (1995)
43-56