Code porting and optimization on a Beowulf
Ivan Rossi, Piero Fariselli, Pier Luigi Martelli, Gianluca Tasco and Rita
Casadio
C.I.R.B. Biocomputing Unit, University of Bologna (Italy)
1. The Basic Local Alignment Search Tool (BLAST) code
1.1 What is BLAST
1.2 A "Divide and Conquer" approach to distributed BLAST
1.3 dncBLAST structure and implementation
1.4 SMP BLAST vs dncBLAST
1.5 Scaling Data on Protein sequences databases (NR)
1.6 Scaling Data on a Nucleotide Database (GENBANK)
2. Gromacs Porting
2.1 The Gromacs Package
2.2 Porting
2.3 The Gromacs code performance optimization
3. Conclusion
4. Bibliography
1. The Basic Local Alignment Search Tool (BLAST) code
1.1 What is BLAST
BLAST (Basic Local Alignment Search Tool) is a set of programs designed to
efficiently search protein and DNA databases for sequence similarities [1].
At present the NCBI-BLAST implementation is the de facto standard for
biological sequence comparison and search in computational biology.
BLAST uses a heuristic algorithm which seeks local as opposed to global
alignments and is therefore able to detect relationships among sequences
which share only isolated regions of similarity. The central idea of the
BLAST algorithm is that a statistically significant alignment is likely to
contain multiple high-scoring pairs of aligned words [2]. BLAST first scans
the database for words (typically of length three for proteins) that score
better than a given threshold T when aligned with some word within the
query sequence. Any aligned word pair satisfying this condition is called a
hit. The second step of the algorithm checks whether each hit lies within
an alignment with a score sufficient to be reported. This is done by
extending a hit in both directions, until the running alignment's score has
dropped more than X below the maximum score yet attained.
This extension step is the most computationally expensive one: it executes
an O(N^2) dynamic programming algorithm [3] that typically accounts for
more than 90% of the execution time of a BLAST run.
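As an illustration only (not the NCBI implementation), the following Python
sketch mimics the extension step for an ungapped hit: the extension stops
once the running score drops more than X below the best score attained so
far. The scoring scheme and the parameter values are hypothetical examples.

# Minimal sketch of ungapped hit extension with an X-dropoff criterion.
# Illustration of the idea only, not the NCBI-BLAST code; the match/mismatch
# scores and the value of X used below are hypothetical examples.
def extend_right(query, subject, q_start, s_start, score, X,
                 match=2, mismatch=-1):
    """Extend an aligned word pair to the right until the running score
    drops more than X below the best score attained so far."""
    best = running = score
    best_len = 0
    i = 1
    while q_start + i < len(query) and s_start + i < len(subject):
        running += match if query[q_start + i] == subject[s_start + i] else mismatch
        if running > best:
            best, best_len = running, i
        if best - running > X:      # X-dropoff: stop extending the alignment
            break
        i += 1
    return best, best_len

# Example usage with toy sequences and an initial word hit of score 6:
print(extend_right("ACGTACGT", "ACGTTCGT", 0, 0, score=6, X=5))

In the real program this extension is performed in both directions and, for
gapped alignments, by the dynamic programming routine mentioned above.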
The scores assigned in a BLAST search have a well-defined statistical
interpretation [4], that is directly related to the probability for the
reported alignment to happen by chance alone.
The NCBI-BLAST programs have been designed to maximize high-throughput
searches. They are multithread-enabled and thus capable of taking advantage
of multiple CPUs, when available on shared-memory SMP machines.
The NCBI-BLAST source code is available on the NCBI FTP site and it is
distributed under the GPL free-software license.
Unfortunately the "NCBI tools", the package to which BLAST belongs, is
also a large assembly (almost a million lines) of poorly-documented C code
under constant evolution. The last publicly available release of the
"NCBI tools internals" official documentation is dated 1994. Furthermore,
it is explicitly stated (in the 1994 documentation and elsewhere) that the
library API may change without warning or notice, according to NCBI
internal needs. The NCBI staff releases new versions of the code several
times a year, which is commendable and fruitful for the user base, but
makes it very difficult for non-NCBI developers to contribute effectively
to the code development.
1.2 A "Divide and Conquer" approach to distributed BLAST
A simple and effective way to introduce parallelism in the BLAST code is to
use a "divide and conquer" (DNC) approach. It is possible to split the
database into multiple non-overlapping subdomains (slices from now on);
next, using the same query, run an independent search for each slice and
then combine the results from each subsearch into a single query report.
Clearly, if the database slices are properly distributed on multiple
"slave" nodes, it will be possible to execute the subsearches in parallel,
taking advantage of the resources available on a Beowulf cluster. This
approach shows several convenient features:
- Avoids modification of BLAST internals: the distribution and gathering
  code can be written as "wrappers" that use the standard BLAST executable
  as a module.
- Simplicity: the approach is easily portable.
- Data distribution: BLAST uses memory-mapped I/O, and performance is
  significantly enhanced when the entire database can be loaded in RAM. By
  slicing the database appropriately this condition can be met on each
  machine of the cluster, even those equipped with relatively small amounts
  of RAM (a minimal slicing sketch follows this list).
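Purely as an illustration of the slicing step (the actual databases are
prepared with the standard NCBI formatting tools), the following Python
sketch splits a FASTA file into a given number of non-overlapping slices of
roughly equal size; the file names and the greedy balancing rule are
arbitrary choices made for the example.

# Hypothetical sketch: split a FASTA database into non-overlapping slices.
# This only illustrates the idea of balancing sequences across slices;
# the real slices are formatted afterwards with the standard NCBI tools.
def split_fasta(path, n_slices, prefix="slice"):
    # Read all entries (header plus sequence lines) from the FASTA file.
    entries, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    entries.append("".join(current))
                current = [line]
            else:
                current.append(line)
        if current:
            entries.append("".join(current))

    # Greedy balancing: always add the next (largest) entry to the smallest
    # slice, so the slices end up with roughly the same number of residues.
    slices = [[] for _ in range(n_slices)]
    sizes = [0] * n_slices
    for entry in sorted(entries, key=len, reverse=True):
        i = sizes.index(min(sizes))
        slices[i].append(entry)
        sizes[i] += len(entry)

    for i, chunk in enumerate(slices):
        with open("%s_%d.fasta" % (prefix, i), "w") as out:
            out.writelines(chunk)

# Example (hypothetical file name): split the NR database into eight slices.
split_fasta("nr.fasta", n_slices=8)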
The only difficulty for the implementation of the DNC scheme is related to
result merging and normalization. BLAST hits are ranked according to two
different scores: the bit score (S) and the expectation value (E). The
former depends only on the degree of similarity (the alignment) between the
query and the hit sequences, while the latter depends both on the alignment
and on the database composition. In detail, E is a measure of the
probability that the alignment is significant, since it represents the
expected number of alignments scoring at least S that can be obtained by
chance alone. There is a simple relation connecting E and S:
E = (m*n) * 2^(-S)
where m and n are the lengths in characters of the query and of the
database, and the product (m*n) is called the dimension of the search
space. Thus by collecting the hits from the subsearches of the DNC strategy
and calculating E with respect to the total database length, it should be
possible to exactly reproduce the BLAST results coming from a search on a
unpartitioned database. In practice this is not the case, since the BLAST
programs implement a correction for the "edge effect" [4a] to take into
account that, for long sequences, an optimal local alignment cannot begin
with any aligned pair of residues. In fact, a high-scoring alignment must
have some length, and therefore cannot begin near the end of either of the
two sequences being compared. This "edge effect" correction produces an
"effective search space" (ESS) that depends both on the query and on the
database composition, and it is this length that the BLAST code uses to
calculate the E values. However, by introducing the approximation that the
equality
ESS = Sum(ESS_i)
holds, where ESS_i is the effective search space of the i-th slice, our
tests show that the deviation from the reference ESS is always less than
+5%, with a corresponding deviation on the calculated E values that never
changes their order of magnitude and, more importantly, never changes the
relative scoring order of the hits. Moreover, no false positives or false
negatives are introduced in the final result. In the light of these
results, we consider this approximation acceptable.
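As a hedged sketch of the merging and normalization step (the data layout
below is a made-up illustration, not the BLAST report format actually parsed
by the wrapper), each hit's E value can be recomputed against the summed
effective search space before the per-slice hit lists are combined and
re-ranked:

# Hedged sketch of E-value renormalization across database slices.
# Each subsearch is assumed to report its slice ESS and, for every hit,
# a name and a bit score S; the dictionaries below are illustrative only.
def merge_hits(per_slice_results):
    """per_slice_results: list of (slice_ess, hits) pairs, where hits is a
    list of dicts with keys 'name' and 'bit_score'."""
    # Approximate the full-database ESS as the sum of the per-slice ESSs.
    total_ess = sum(ess for ess, _ in per_slice_results)

    merged = []
    for _, hits in per_slice_results:
        for h in hits:
            # E = ESS * 2^(-S), evaluated on the summed search space.
            merged.append({"name": h["name"],
                           "bit_score": h["bit_score"],
                           "e_value": total_ess * 2.0 ** (-h["bit_score"])})

    # Rank the combined hit list by E value, as a single search would.
    merged.sort(key=lambda h: h["e_value"])
    return merged

# Toy example: two slices, one hit each.
print(merge_hits([(1.0e8, [{"name": "seqA", "bit_score": 40.0}]),
                  (1.2e8, [{"name": "seqB", "bit_score": 55.0}])]))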
1.3 dncBLAST structure and implementation
Because of the design decision taken, the most computationally demanding
task is the standard BLAST execution itself, while the wrapper tasks of
initialization and data post-processing represent just a marginal overhead
to the entire computation. This means that it is possible to code the
wrapper in an interpreted scripting language without paying a large
penalty in terms of performance. The wrapper has been coded in the Python
language [5], which is an interpreted, interactive, object-oriented
programming language, often compared to Java, Scheme and Visual Basic. When
required, it is easily extensible by means of external modules written in
C, C++ or even Fortran. The Python implementation is portable and open-
source: it runs on many brands of UNIX, on Windows, MacOS and other minor
operating systems. Furthermore it shows a very clear and compact syntax,
and it is (usually) quickly learned. A recent analysis [6] ranked Python
as one of the most productive programming languages, with both average
development time and source-code length almost halved with respect to C and
C++. Since the programming-language background of the coding team was not
uniform (Perl, C and Fortran), Python seemed a good choice.
The dncBLAST code has been designed to handle the same options and to
produce the same output that BLAST does, so that it can be used as its drop-
in replacement.
The structure of the dncBLAST code is summarized in the pseudocode listing
below:
Program dncBLAST
    Read DB distribution and configuration info
    Parse options and trap input errors
    if (DB not distributed) then
        exec BLAST on front-end machine
    else
        initialize query_queue
        initialize subsearches on DB slices
        repeat
            spawn parallel BLAST subsearches on slave nodes
            parse BLAST subsearch results
            merge hits
            update query_queue
        until query_queue is empty
        report hits
    endif
End program
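A hedged Python sketch of the central spawning loop is shown below; the
remote command line (blastall invoked through rsh on each slave node) and
the helper names are illustrative assumptions, not the actual dncBLAST
source.

# Hedged sketch of the dncBLAST spawning step: one standard BLAST process
# per database slice, run concurrently on the slave nodes and collected
# for parsing and merging. Host names, paths and options are examples.
import subprocess

def run_subsearches(query_file, slices, program="blastp"):
    """slices: list of (node_hostname, slice_db_path) pairs."""
    procs = []
    for node, db in slices:
        # Launch a standard BLAST search on the remote node; the remote
        # shell and the option set shown here are illustrative only.
        cmd = ["rsh", node,
               "blastall", "-p", program, "-d", db, "-i", query_file]
        procs.append(subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                      stderr=subprocess.PIPE))

    # Wait for all subsearches and return their raw reports for parsing.
    reports = []
    for p in procs:
        out, _ = p.communicate()
        reports.append(out.decode())
    return reports

# Example: two slices distributed over two slave nodes (hypothetical names,
# assuming the query file is visible from the nodes via a shared filesystem).
# reports = run_subsearches("query.fasta",
#                           [("node1", "/data/nr_slice0"),
#                            ("node2", "/data/nr_slice1")])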
1.4 SMP BLAST vs dncBLAST
The first test performed has been a comparison of the performance of
dncBLAST with that of the standard multithreaded BLAST on a single
two-processor SMP slave node of the CIRB cluster, and the result has been a
pleasant surprise. As shown in the following Figure, the dncBLAST code
outperforms the standard NCBI BLAST running in multithreaded mode for all
query lengths in the test set, except the shortest one, and the performance
difference becomes more and more favorable to dncBLAST with the increase of
the query length.
While a similar behavior can be caused by memory contention on SMP machines
with a larger number of processors, we do not believe this is the case on
a two-processor machine. We think instead that the cause is rooted in the
nature of the algorithms used: it is a known phenomenon that long query
sequences are more likely to have wide regions that match favorably with
the sequences stored in the database, producing long alignments. When this
is the case the computing time is largely dominated by the alignment
extension phase, performed by dynamic programming. If the dynamic
programming code is either not parallel or not well parallelized, the
multiple processors of an SMP machine will not be exploited to their
fullest. In the DNC approach, on the contrary, an independent subsearch
runs in parallel on each processor used, maximizing CPU usage. If this is
the case, the behavior found for a two-processor Intel SMP should be
amplified on an SMP machine with a larger number of processors, such as the
Chiron Vaccines' SUN Enterprise 4500 server at IRIS, and preliminary tests
confirmed this hypothesis.
1.5 Scaling Data on Protein sequences databases (NR)
Scaling tests for protein sequence databases have been performed using the
April 23rd, 2001 release of the widely used Non-Redundant (NR) protein
database. This database contains all the non-redundant entries coming from
the GenBank coding sequence translations and from the Protein Data Bank
(PDB), SwissProt and PIR databases. It can be freely retrieved from the
NCBI ftp server at
ftp://ftp.ncbi.nlm.nih.gov/blast/db
The release used contains 705,002 sequences, corresponding to 222,117,092
total residues, for an average entry length of 315 residues. The test runs
have been performed
using a test set of five chimeric sequences of increasing length. The
elapsed time has been measured by taking the average of the wall-clock time
of ten independent runs for each query, measured by means of the Unix time
utility.
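As an illustration of this procedure (the command line and file names are
hypothetical placeholders), the measurement amounts to averaging the
wall-clock time of ten identical runs:

# Hedged sketch of the timing methodology: average wall-clock time of ten
# identical runs of the same search. The command below is a placeholder.
import subprocess, time

def average_wallclock(cmd, repeats=10):
    elapsed = []
    for _ in range(repeats):
        start = time.time()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)

# Example (hypothetical command and file names):
# avg = average_wallclock(["dncblast", "-i", "query1000.fasta", "-d", "nr"])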
A marked dependence of the scaling performance of the dncBLAST approach on
query length has been found. As shown in the two following graphs, the code
scales very well, with an eight-processor speedup close to 7 (~85%
efficiency) for queries longer than 1000 residues, while the speedup drops
to around 4.5 (60% efficiency) for short (150-residue) queries, where
speedup saturation appears to be imminent. For a 300-residue query, very
close to the average protein length included in NR, a speedup of 5.5 is
achieved, which is still a reasonable 70% efficiency.
Up to six processors, the efficiency is higher than 70% for all queries in
our test set in all the runs.
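For reference, the speedup and efficiency figures quoted above are computed
in the usual way from the measured wall-clock times; the timings in the
example below are illustrative placeholders, not values from the test runs.

# Speedup and parallel efficiency from measured wall-clock times.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    return speedup(t_serial, t_parallel) / n_procs

# Illustrative example: 800 s on one processor, 115 s on eight processors.
print(speedup(800.0, 115.0))         # ~6.96
print(efficiency(800.0, 115.0, 8))   # ~0.87, i.e. about 87% efficiency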
It has to be pointed out that the overhead introduced in the dncBLAST
approach, mainly by the subsearch output parsing done in the data merging
and normalization stage, increases with the number of subsearches performed
and thus with the number of processors used. The shorter the query, the
faster the search runs, and thus the overhead becomes progressively more
important.
On the other hand, for long runs the overhead introduced remains a small
fraction of the total computing time.
1.6 Scaling Data on a Nucleotide Database (GENBANK)
The scaling performance of dncBLAST on genome databank has been tested on a
subset of the GENBANK database comprising 480,708 sequences
summing up to 2,439,606,762 bases for an average sequence length of 5,075.
Again the test
runs have been performed using a test set of five sequences of increasing
length, and the elapsed time has been measured by taking the average of the
wall-clock time of ten independent runs for each query. Results are
reported in the following graph. This database is too large to fit into
the memory of a single node: this explains the superlinear scaling that
appears in the results and shows the benefit, for a large database, of the
data partitioning implicit in the dncBLAST scheme.
2. Gromacs Porting
2.1 The Gromacs Package
Gromacs is a public domain package for simulating biomolecules by means of
Molecular Dynamics (MD) simulation [7]. It consists of an MD program
together with an extended suite of programs to perform analysis and to
prepare inputs for the simulation. The package is almost entirely written
in ANSI C: a few parts are written in Fortran 77 and some recently added
tools are written in C++. The currently distributed version is 2.0, and
this is the version we installed and tested on the Linux Beowulf cluster.
The MD program (mdrun) comes in two flavors: a serial and a parallel
version. All the other tools are serial programs, since the time spent in
them is negligible with respect to the time needed to perform the MD
simulation. The real core of the package is therefore the MD engine. We now
briefly discuss the parallel version of the MD program.
The parallel algorithm adopted by Gromacs is the so-called "domain
decomposition" algorithm. In particular, it uses a "one-dimensional space
decomposition" scheme for the computation of non-bonded interactions and a
"particle decomposition" scheme for all the other interactions, meaning
that each processor is assigned a subset of the Cartesian space or of the
particles, respectively, composing the simulated system. The processors are
combined in a (virtual) ring topology, meaning that each processor
communicates only with its "left" and "right" neighbors. At each step all
neighboring processors exchange information among themselves in order to
have up-to-date data to work on in the next step.
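As a hedged illustration of this ring scheme (written with the mpi4py
Python bindings for brevity; Gromacs itself implements the exchange in C on
top of MPI), each process exchanges its boundary data with its left and
right neighbors at every step:

# Hedged sketch of a ring exchange, as in the one-dimensional domain
# decomposition described above. mpi4py is used here only for illustration.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Neighbors on the (virtual) ring topology.
left = (rank - 1) % size
right = (rank + 1) % size

# Placeholder for the boundary data owned by this processor's slab.
my_boundary = {"rank": rank, "coords": []}

# Send to one neighbor while receiving from the other, so that every
# processor has up-to-date neighbor data before the next MD step.
from_left = comm.sendrecv(my_boundary, dest=right, source=left)
from_right = comm.sendrecv(my_boundary, dest=left, source=right)

print("rank %d received data from ranks %d and %d"
      % (rank, from_left["rank"], from_right["rank"]))

# Run, for instance, with: mpiexec -n 4 python ring_exchange.py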
This one-dimensional partitioning scheme has clear advantages: it is
relatively easy to program, and communications are usually done in large
chunks, which is more efficient on some communication hardware, in
particular low-end platforms such as switched Fast Ethernet. In this
framework the computation of the long-range interactions among atoms (i.e.
the electrostatic interactions) takes up most of the CPU time, being O(N^2)
with respect to the particle number.
2.2 Porting
The porting and the installation of the package on the Scyld Beowulf
cluster at CIRB were relatively simple: the package had already been tested
on similar platforms (e.g. the IBM SP) and no particular problems arose
during the procedure. The mixing of different languages did not represent
an obstacle. A relatively good level of optimization was easily reached by
turning on some specific optimization flags; more aggressive optimization
(again via compiler flags) gives better performance but introduces
numerical problems (incorrect results). This, however, is a known behavior
shared by many compilers.
A disappointing finding has been that compiling the code with the Fortran
version of its inner loops did not improve the performance at all on the
CIRB cluster, while it has been reported to markedly improve it on other
systems. Two different Fortran compilers have been tested: GNU G77, which
is free but also a notorious underachiever, and the Portland Group
(http://www.pgroup.com) PGF77 Workstation compiler for Intel processors,
which on the contrary has a solid reputation for producing efficient code,
but with no substantial result. The G77-compiled code performs worse than
the GCC-compiled version, while PGF77 improves the performance by less than
5%. We also tested the Portland Group C compiler, but the performance does
not differ from that obtained with the free GNU GCC compiler.
2.3 The Gromacs code performance optimization
A preliminary scaling test on a small simulation system showed a less than
satisfactory scaling behavior. This test has been performed using the
special MPICH MPI implementation shipped with the Scyld Beowulf
distribution. It appears that network latency dominates the scaling, since
much better scaling has been reported on systems with better network
connectivity, such as the Cray T3E or the IBM SP.
A discontinuity can be observed at 4 CPUs, when the number of processors
used equals the number of nodes in the cluster. This is probably due to the
fact that Scyld always tries to distribute the spawned MPI processes evenly
among the available compute nodes. When it begins to assign two MPI
processes per node, the scaling behavior changes to a logarithmic one.
While this automatic scheduling policy is very convenient for uniprocessor
compute nodes, since there is no need to worry about load balancing, it is
not for two-processor machines such as ours: by placing two communicating
MPI processes on the same node, the communication between them could pass
through shared memory, thus avoiding the penalty caused by network latency.
The inclusion in the Linux kernel (version 2.2.17) of a special patch for
short network packets did not improve the situation. The newer 2.4.x kernel
versions are not supported by Scyld and could not be tested without
discarding the entire cluster installation and reverting to a more
traditional cluster design, such as the one first tested with the RedHat
distribution and discussed in a previous report (D2.2). Furthermore,
several users reported on the Beowulf mailing list (beowulf@beowulf.org)
stability problems under heavy load for the 2.4 series, which convinced us
to postpone its testing until a more mature version becomes available.
A marked improvement in scaling has been obtained by replacing the
Scyld-modified MPICH library with the LAM implementation of MPI (see
http://www.lam-mpi.org), as can be seen in the following graph.
The two systems used for the scaling tests consist of a short polypeptide
(PEP) and the HIV1 protease (HIV1) respectively, modelled using the
GROMOS-89 force field and solvated in a box of SPC/E water molecules with
periodic boundary conditions. The larger HIV1 system is composed of 19,290
particles and has been simulated for 20 ps, while the smaller PEP system
comprises 8,652 particles and has been simulated for 10 ps. Timings have
been taken as the average of three identical runs. Single-processor average
timings amount to 6050 s for HIV1 and to 1686 s for PEP.
The improvement obtained by using LAM is probably due to its optimized
communication schemes: MPI client-to-client messages can bypass the MPI
daemon, and shared-memory communication within a multiprocessor node is
optimized. However, the LAM environment is not integrated at all into the
Scyld framework, and its installation and execution required a great deal
of scripting and adaptation in order to run on a Scyld cluster, negating
almost all the advantages that Scyld provides in terms of ease of
installation and management. If GROMACS is to be the main application used
on a Linux cluster, it is more convenient not to use Scyld Beowulf, but a
different cluster configuration scheme.
In order to further improve the scaling obtained, it is necessary to
replace the commodity network interconnect with network hardware designed
for low latency, such as Myricom Myrinet (http://www.myricom.com) or the
SCALI interconnect (http://www.scali.com), which provide around 1 Gbps
point-to-point network bandwidth and very low latency. Obviously there is a
premium price to pay for such devices, which are not commodity items yet.
3. Conclusion
The performance of two important codes for computational molecular
biology, one dealing with biological sequence analysis and the other with
macromolecular structure modelling, has been evaluated on a prototype
Beowulf cluster.
The strategy used to adapt the BLAST code to a distributed-memory
architecture such as that of a Beowulf cluster has proven to be simple yet
effective in our tests. In particular, because of its data distribution, it
appears to be especially suited for searches on the genome databanks, which
are the largest, and thus the most unwieldy, databases used in the
bioinformatics field, and therefore exactly those for which the
exploitation of parallel-computing technology is most welcome.
The results obtained for GROMACS show instead that the code requires a
non-commodity low-latency network interconnect in order to scale beyond 6-8
processors. Code performance is very sensitive to network performance, as
shown by the marked improvement obtained by replacing the MPI library
implementation. As a side note, we cannot recommend the Scyld Beowulf
distribution for cluster building if GROMACS is the application of choice
on the cluster: its native MPI implementation is tightly integrated in the
environment and causes GROMACS to underperform, and replacing it negates
the installation and administration ease that is its main advantage over a
more traditional cluster design.
4. Bibliography
1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. "Basic
local alignment search tool." J. Mol. Biol. 215 (1990) 403-410
2. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D.J. " Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. " Nucleic Acids Res. 25
(1997) 3389-3402
3. Smith, T.F. and Waterman, M.S. "Identification of common molecular
subsequences." J. Mol. Biol. 147 (1981) 195-197
4. (a) Altschul, S.F. and Gish, W. "Local alignment statistics." Meth.
Enzymol. 266 (1996) 460-480. (b) Altschul, S.F., Bundschuh, R., Olsen, R.
and Hwa, T. "The estimation of statistical parameters for local alignment
score distributions." Nucleic Acids Res. 29 (2001) 351-361
5. see http://www.python.org
6. Prechelt L. "An empirical comparison of seven programming languages"
Computer 33 (2000) 23
7. Berendsen, H.J.C., van der Spoel, D. and van Drunen, R. "GROMACS: A
Message-Passing Parallel MD Implementation." Comp. Phys. Comm. 91 (1995)
43-56