D 5.4 Deliverable project month: Report on the results obtained


R. Beltrami, D. Medini, C. Donati, N. Pacchiani, A. Muzzi, A. Covacci
IRIS Bioinformatic Unit, Chiron S.p.A.
biocomp_admin_siena@chiron.it



1. Introduction
2. DNC-Blast
2.1 Avoiding databank reload from disks
2.2 Database randomization and splitting on non-homogeneous cluster nodes
2.3 Results (I)
2.4 Results (II)
2.5 DNC-Blast typical usage scenarios
2.6 Results for a real-case test
3 GROMACS
3.1 Configuration of a dedicated "run-level" for MD tasks
3.2 Network communications: Fast-Ethernet vs. Gigabit-Ethernet
4 Conclusions
5 Acknowledgements
6 Appendix A


1. Introduction
Several successful experiments [i] have already been reported on the parallelization of commonly used bioinformatic tools. The scope of this project was to tailor two standard algorithms widely used in genetic data analysis, the NCBI Blast and GROMACS, for parallel architectures and to evaluate their performance on our "in house" facility. The project was divided into two phases: i) development of a prototype and ii) its implementation. In the first phase the work was done on a Beowulf cluster, while the second phase was completed using a Production Facility (PF) of industrial strength. Maximum portability and usability were achieved by using i) Unix multiprocessor servers and ii) clusters of workstations, with software code and customization based on standard and freely available software tools. In this workpackage we have tuned and optimized the DNC-Blast engine - the BioWulf parallelized version of the NCBI-Blast - on the PF (cluster and SMP). We have tried to i) minimize the performance degradation caused by reloads of data from the storage subsystem and ii) achieve better load balancing by randomization and splitting of the databases on the PF cluster. We have then tested system performance under heavy load and compared, in a "real-case" test, the newly developed tool (DNC-Blast) with the original one (NCBI-Blast). We have also completed the optimization of Gromacs on the PF cluster and compared the performance of two different network communication systems. The collected data suggest that the Production Facility (cluster and SMP) is particularly well suited for massive data analysis and that, with the developed tools, it is possible to greatly improve system performance, as required by high-end genomic research projects such as the grouping of protein families.


2 DNC-Blast

2.1 Avoiding databank reload from disks
We have already shown [i] that the main part of the overall execution time of a Blast query is the I/O time needed to reload the databank(s) from the disk subsystem into main memory. A strategy to minimize this I/O time was therefore required. One possible solution is the implementation and use of a RAM-disk (i.e. a virtual disk driver that mimics a disk subsystem using main memory as storage space). Sun Solaris (release 2.5.1 and above) provides a similar device, namely the Temporary File-System tmpfs, which uses RAM and the swap disk space as storage area and is mounted under the /tmp mount point by default. However, tmpfs does not implement a pure RAM-disk, since data stored in /tmp can be swapped out to let other data use the RAM, clearly diminishing performance. Another possibility to keep the databank in main memory is the operating system call mlock(3C) and related calls [ii]. This function can permanently force (until a subsequent call to munlock(3C)) a defined area of the process virtual memory to reside in system main memory, avoiding I/O delays. However, two major drawbacks come with this solution: firstly, one needs to modify the NCBI-Blast source files directly in order to use the mlock function; secondly, the mlock function can be used only by processes running with administrator privileges, adding potential system stability and security concerns. Since the development of a custom, platform-specific, high-performance RAM-disk driver and the modification of the NCBI-Blast internal structure both fall outside the scope of the present project, we have fine-tuned the tmpfs configuration to minimize databank reload times. With this configuration, we have not been able to measure any delay due to databank reloads in our internal usage of DNC-Blast.
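As an illustration (not part of the delivered tool), staging a databank slice into the tmpfs-backed /tmp before the first search could be done along the following lines; the directory paths and file-name convention are hypothetical, and, as noted above, tmpfs data may still be swapped out under memory pressure.

```python
import os
import shutil

# Hypothetical locations: permanent copy on the disk subsystem and the
# tmpfs-backed staging area (mounted on /tmp by default under Solaris).
DISK_DIR = "/export/databanks"
TMPFS_DIR = "/tmp/databanks"

def stage_databank(name):
    """Copy the formatdb output files of a databank slice into tmpfs so that
    subsequent Blast runs read them from RAM instead of the disk subsystem."""
    os.makedirs(TMPFS_DIR, exist_ok=True)
    for fname in os.listdir(DISK_DIR):
        if fname.startswith(name):          # e.g. genbank.00.nhr, .nin, .nsq
            src = os.path.join(DISK_DIR, fname)
            dst = os.path.join(TMPFS_DIR, fname)
            if not os.path.exists(dst):     # already staged on a previous run
                shutil.copy2(src, dst)
    return os.path.join(TMPFS_DIR, name)

# Blast is then pointed at the staged copy, e.g. stage_databank("genbank.00")
```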


2.2 Database randomization and splitting on non-homogeneous cluster nodes
As reported in a previous deliverable [i], in order to set up the DNC-Blast environment, both on the PF SMP and on the PF cluster, every databank to be searched needs to be partitioned. The original NCBI Blast suite includes a tool, formatdb, for formatting (i.e. creating indexes for) the source textual data files. This program also allows the source file to be split into one or more indexed files. Originally this feature was developed to circumvent the maximum file size limit imposed by older 32-bit operating systems (2 GB). Splitting the database to be searched has no impact whatsoever on performance when the original multithreaded NCBI-Blast is used on a single SMP machine. This feature could, however, be used to generate the databank slices mentioned previously [4]. The procedure has two drawbacks. A: the splitting method provided by formatdb simply puts the first part of the input source file into the first resulting slice, the second into the second slice, and so on; B: the slices are all of the same size. Two major consequences follow. When portions of the databanks are searched by the DNC-Blast sub-processes, it is most likely that an uneven distribution of "sequence types" will be accessed and the load will be unbalanced: since the databank source files often contain sequences from the same organism clustered together, with sequences of similar organisms close to each other, groups of similar sequences will probably end up in the same slice, and the slices produced by formatdb will be biased towards the type of organism they belong to. As an example, if we search a databank using a query sequence from a bacterial organism, the slices containing the bacterial sequences will show far more hits (i.e. longer execution time for HSP searching and alignment extension) than those containing data from other organisms, and the overall execution time will depend on them. In addition, given the heterogeneous nature of the PF cluster, to improve the load balancing among the nodes it is mandatory to fragment the databanks into slices proportional to the computational power of each node. We solved the first issue by randomly changing the order of the input sequences so as to obtain a uniform distribution of all the "sequence types" along the whole databank. Then, applying the original formatdb, we were able to obtain non-biased slices of equal size. These slices were used to improve the performance of DNC-Blast on the PF SMP, and the results are reported in the Results (I) section. We have also implemented a custom version of the NCBI formatting program, DNC-formatdb, able to generate randomized database slices of variable sizes. The DNC-formatdb developed in this workpackage takes as input the databank to be formatted and a set of relative weights (expressed as percentages) of every slice to be generated; it then produces randomized slices, with sizes varying according to the input weights, ready to be searched using DNC-Blast.
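A minimal Python sketch of this weighted random splitting, not the actual DNC-formatdb code, is shown below; the file names and weights are illustrative only. Each FASTA record is assigned to an output slice with probability proportional to the slice's weight, and the resulting slices would then be indexed with the standard formatdb.

```python
import random

def split_fasta(in_path, out_paths, weights):
    """Randomly distribute FASTA records among output slices, with expected
    slice sizes proportional to the given relative weights (e.g. percentages)."""
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:                       # cumulative upper bounds in [0, 1]
        acc += w / total
        bounds.append(acc)

    def pick_slice():
        r = random.random()                 # random number between 0 and 1
        for i, b in enumerate(bounds):
            if r < b:
                return i
        return len(bounds) - 1              # guard against rounding at 1.0

    outs = [open(p, "w") for p in out_paths]
    current = None
    with open(in_path) as fasta:
        for line in fasta:
            if line.startswith(">"):        # a header line starts a new record
                current = outs[pick_slice()]
            current.write(line)
    for f in outs:
        f.close()

# Example: four slices with relative weights 40/30/20/10 percent; each slice is
# then indexed with the standard NCBI formatdb.
# split_fasta("genbank.fasta",
#             ["slice0.fa", "slice1.fa", "slice2.fa", "slice3.fa"],
#             [40, 30, 20, 10])
```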



Table 1
EqCPU numbers computed for each databank


To perform this task a modified randomizing algorithm was applied: for each input sequence a random number between 0 and 1 is generated and the sequence is assigned to one of the output files depending on the numeric range the number falls in. The slices obtained this way are then simply fed into the NCBI formatdb to generate five searchable files to be distributed on the computing nodes. In practice, to distribute each databank over the nodes we implemented the following procedure: (1) compute the execution time of a reference Blast query (one for each databank) on a reference machine, namely a single CPU of the PF SMP; this number, from now on called one Equivalent CPU (1 EqCPU), is the unit of measure of the execution times of the reference query on the cluster nodes; (2) measure the execution time of the reference Blast query (one for each databank) on each node and compute the EqCPU value of that node as the ratio with the execution time on the reference machine; (3) compute the relative weight of the databank slice for each node (one for each databank) based on the EqCPU value of that node; (4) run DNC-formatdb to generate the randomized and weighted databank slices to be distributed on each node. Table 1 shows the execution times and the computed EqCPU values, for each databank, on all the cluster nodes. The database slices obtained (whose weights are reported in Appendix A) were used to improve the performance of DNC-Blast on the PF cluster, and the outcomes are reported in the Results (II) section.
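The weight computation itself is straightforward. The sketch below uses hypothetical timing figures and assumes the EqCPU of a node is the ratio between the reference-machine time and the node's time, so that faster nodes count for more than one EqCPU and receive a correspondingly larger slice.

```python
def eqcpu_and_weights(reference_time, node_times):
    """Derive EqCPU values and relative slice weights (percent) for each node.
    reference_time: elapsed time of the reference query on one CPU of the PF SMP.
    node_times: elapsed time of the same query on each cluster node."""
    eqcpu = {node: reference_time / t for node, t in node_times.items()}
    total = sum(eqcpu.values())
    weights = {node: 100.0 * v / total for node, v in eqcpu.items()}
    return eqcpu, weights

# Hypothetical timings (seconds) of the reference query for one databank:
node_times = {"node1": 310.0, "node2": 250.0, "node3": 410.0,
              "node4": 180.0, "node5": 290.0}
eqcpu, weights = eqcpu_and_weights(reference_time=300.0, node_times=node_times)
# A node twice as fast as the reference CPU counts as 2 EqCPU; the resulting
# weights are fed directly into DNC-formatdb.
```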





Figure 1
An efficiency increase of close to 10% is obtained for queries against the randomized GenBank; for the other databanks the change is less pronounced, and it is negative in the BacteriaN case



2.3 Results (I)
We conducted tests to evaluate the effectiveness of the database randomization on the PF SMP. We repeated the tests performed in a previous deliverable [i], for BlastN and BlastP against four different databases, both on non-randomized databank slices and on randomized ones. In all cases the same DNC-Blast implementation was used, the database randomization being the only difference between the two sets of tests. In Figure 1 the efficiency of the algorithm is shown as a function of the number of CPUs n for the randomized (red points) and non-randomized (black points) databases: "A" for BlastN against Genbank, "B" for BlastP against NR, "C" for BlastP against Bacteria_P and "D" for BlastN against Bacteria_N. The efficiency is defined as the ratio between the theoretical minimum execution time with n processors and the actual elapsed time [ii], where the theoretical minimum execution time is the single-processor execution time divided by the number of processors n. The results show a performance increase due to the randomization procedure for three of the four databases under consideration. In particular, for BlastN against Genbank the performance exceeds the theoretical value at 2 and 4 CPUs, showing a behavior similar to the well known cache-aggregate effect, made possible in the randomized case by the high degree of balance among the concurrent sub-processes. The efficiency gain is less evident with BlastP ("B" and "C"), where the more complex structure of the alignment algorithm for protein sequences reduces the impact of the data parallelization. When small databases are searched with a fast algorithm (as for BlastN against Bacteria_N in "D") we register a reduced efficiency in the randomized case.
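For reference, the efficiency plotted in Figure 1 can be restated compactly as follows, with t1 the single-processor elapsed time and tn the elapsed time measured with n processors (a value of 1.0 corresponds to ideal scaling):

```python
def efficiency(t1, tn, n):
    # Theoretical minimum time with n processors (t1 / n) divided by the
    # actually measured elapsed time tn; equivalently t1 / (n * tn).
    return (t1 / n) / tn
```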


2.4 Results (II)
Tests on heterogeneous database splitting were performed on the PF cluster. Following the procedure outlined in the previous section, we populated the cluster with heterogeneous randomized databank slices. In Figure 2 the scalability of the algorithm (defined in our previous test [i]) is shown as a function of the number of EqCPUs, in order to provide a correct comparison between the two platforms. It should be recalled that the EqCPUs of the cluster nodes are defined using one single CPU of the PF SMP as the unitary reference; hence, for the PF SMP one EqCPU is exactly one processor while, for the PF cluster, the EqCPU value is neither an integer nor does it correspond to the number of nodes. In each graph, four different data sets are shown: the multithreaded NCBI-Blast using databanks formatted with the standard formatdb as distributed in the NCBI toolkit (blue squares); the DNC-Blast on the PF SMP using databanks from the standard formatdb (green squares); the DNC-Blast on the PF SMP using databanks from the customized DNC-formatdb, where all the databank slices are equal in size (red squares); and the DNC-Blast on the PF cluster using databanks from the customized DNC-formatdb with the slice sizes previously reported (black squares). In the Genbank graph ("A") the DNC-Blast algorithm outperforms the NCBI-Blast on the PF SMP by a factor of 1.6 (1.9 in the randomized case), and the heterogeneous splitting of the database allows excellent performance to be obtained on the PF cluster as well. The PF cluster performs 2.6 times better than the PF SMP with the same number (8) of actual processors, since some of the cluster CPUs used in this experiment are more powerful than a single SMP processor. For the protein databanks (NR in "B" and BacteriaP in "C") there is no relevant performance gain due to the DNC-Blast algorithm itself. We note that the BlastP algorithm has to deal with a much more complex alphabet than BlastN, which results in a much longer time spent in the HSP expansion phase. Since this is the already parallelized portion of the NCBI-Blast code, the DNC-Blast data-parallelization strategy yields a less marked performance gain on the PF SMP platform. Nevertheless, as in the Genbank case, it makes it possible to use the cluster platform with the same good results, particularly with the NR database, where a performance ratio of 11/7 is visible when all the CPUs are in use. Conversely, for BacteriaN ("D"), we observe a severe degradation of performance on the cluster, showing





Figure 2
Scalability of DNC-Blast (PF SMP and cluster) vs. NCBI-Blast (PF SMP).



that with fast queries on small databases the data collection times become extremely relevant in the overall execution time of the process.


2.5 DNC-Blast typical usage scenarios
In order to improve the overall performance of the Production Facility, we considered how it is actually used for computing-intensive genomic research. The typical usage of a Blast engine can be divided into two broad categories: interactive and batch execution. The first applies whenever a user performs a Blast search for a single query sequence (or a small set of them) against one databank. The second typically consists of a huge set of query sequences searched sequentially against one or more databanks, and is usually run through a shell script that iterates the task. The two categories imply different resource usage schemes: the first is mainly interactive and the loading order of the databanks is not predictable; in the second, all the search jobs are known in advance and can be arranged to better exploit the platform performance. In the first case the PF SMP is better suited to serve the user requests, given its much higher flexibility in dynamically assigning resources to continuously changing tasks; here the DNC-Blast can be useful, as seen in the previous section, when big nucleotide databases are searched. Conversely, the DNC-Blast algorithm can make a real difference in the second case, when the user requests are much more predictable and the static allocation of resources typical of the BioWulf cluster is not an issue. Hence it can be of great importance for computationally intensive research projects such as protein family classification, gene function assignment, etc. In the next paragraph we report the results of a test that mimics these tasks: a "real-case" use of the DNC-Blast [i] on the PF cluster in which different query sequences and databanks are used concurrently in order to measure actual execution times.
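As an illustration of the batch scenario, the sketch below (in Python; in production this role is typically played by a shell script) iterates over a directory of query files and runs one search per file against a given databank. The paths are hypothetical and the command line assumes the legacy NCBI blastall front-end; DNC-Blast would be driven analogously.

```python
import glob
import os
import subprocess

def batch_blast(program, databank, query_dir, out_dir):
    """Sequentially run one Blast search per query file against one databank,
    producing one report per query (the typical batch usage pattern)."""
    for query in sorted(glob.glob(os.path.join(query_dir, "*.fasta"))):
        report = os.path.join(out_dir, os.path.basename(query) + ".blast")
        # Assumes the legacy NCBI blastall front-end and its -p (program),
        # -d (databank), -i (query) and -o (output) options.
        subprocess.check_call(
            ["blastall", "-p", program, "-d", databank, "-i", query, "-o", report])

# Example (hypothetical paths):
# batch_blast("blastn", "/tmp/databanks/genbank", "queries", "reports")
```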


2.6 Results for a real-case test
It is a well-established protocol that when a new complete bacterial genome sequence is published, or draft versions become available, a complete Blast search against public and in-house databanks is immediately carried out. Blast results are in fact the basis for further analyses, for example protein family classification, genome annotation or comparative genomics. Here we report the results of a "real-case" test where we compared the behavior of the DNC-Blast on the PF cluster (with fine-tuned tmpfs and the randomized heterogeneous database slices) with that of the original NCBI-Blast on the PF SMP. We used the complete genome sequence of Streptococcus pneumoniae strain R6 as released on GenBank [i]. We took all the predicted genes and the encoded protein sequences and searched them with DNC-Blast against three of the four databanks used in this project. A whole-genome Blast search (whether the predicted genes or the encoded proteins are used as input queries) consists of a series of Blast commands (blastN or blastP accordingly) run sequentially, each producing a single output file, with the Blast reports of all the homology searches finally collected together. A search against a single databank is far from the real use case, since more than one databank is scanned at the same time, requiring a balanced distribution of the load on the nodes. Hence, to give a reasonable estimate of the real activity performed by our Blast engine, we concurrently ran three different whole-genome Blast searches against the Genbank, the Non Redundant protein database (NR) and the in-house bacterial genome database Bacteria_N. The three sequential multi-input Blast jobs were run concurrently, using the DNC-Blast on the PF cluster and the NCBI-Blast on the PF SMP for comparison. First of all, we estimated the execution time of the three jobs on the two platforms in order to achieve a balanced load that minimizes the overall execution time. For the PF SMP, on a trial and error basis, 8 CPUs were assigned to the Genbank search, 2 CPUs to the NR and one CPU to the Bacteria_N; this proved to be the best balance achievable in order to minimize the overall execution time. The remaining CPU (out of a total of 12) was left free for system needs. On the PF cluster, on a trial and error basis, 5 CPUs were assigned to the Genbank job, 4 to the NR and one to the Bacteria_N.
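The concurrent execution of the three whole-genome searches can be orchestrated along the following lines. This is a sketch only: the run_whole_genome_search helper is a hypothetical stand-in for the sequential multi-input driver described in Section 2.5, and the job definitions merely echo the CPU assignments found by trial and error above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_whole_genome_search(job):
    """Placeholder for the sequential multi-input Blast driver of Section 2.5,
    run with the number of CPUs reserved for the given databank."""
    # ... launch DNC-Blast (or NCBI-Blast) for every predicted gene/protein ...
    time.sleep(1)                            # stand-in for the real work
    return job["databank"]

# CPU assignments found by trial and error on the PF cluster (see text).
jobs = [
    {"databank": "Genbank",    "program": "blastn", "cpus": 5},
    {"databank": "NR",         "program": "blastp", "cpus": 4},
    {"databank": "Bacteria_N", "program": "blastn", "cpus": 1},
]

start = time.time()
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    finished = list(pool.map(run_whole_genome_search, jobs))
print("completed:", finished, "- wall-clock time:", round(time.time() - start, 1), "s")
# The total execution time is the maximum of the three real execution times.
```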




Table 2
Times required for a complete genome search against three databases


Table 2 shows the timing results for the two platforms under comparison. The databanks are the same as those used in the previous tests. The Real time column is the relevant one for the evaluation of the overall performance; the Total execution time is the maximum of the real execution times of the three concurrent processes.


3 GROMACS

3.1 Configuration of a dedicated "run-level" for MD tasks
In previous workpackages [i,ii] we reported how the molecular dynamics (MD) program Gromacs was implemented on the prototype Beowulf cluster and then ported to the PF cluster. We also showed [11] that with an accurate fine tuning of the MPI implementation and of some system parameters on the PF cluster it is possible to somewhat improve the rather poor parallel scaling of the MD algorithm. In this section we report the results of a further attempt to increase the parallel performance of an MD simulation on a cluster of workstations (such as the PF cluster). Using workstations not specifically dedicated to the cluster has the downside that many tasks and services running on each workstation consume system resources: most of the nodes run common daemons (sendmail, the X window server, an HTTP server, etc.) that diminish the performance of parallel jobs. Since molecular dynamics simulations are typical batch jobs and can be run off-time (i.e. when the workstations are not used for other purposes), one possible way to improve performance is to shut down all the system processes and daemons not strictly required to run the cluster. More precisely, the services required for the cluster are the NFS client, to access data and executable files on the master node, the daemons required for cluster communications (the MPI layer), and the very basic system processes needed to keep the operating system running. This running state can be defined by means of a custom run-level, operated through the init command. We defined such a custom run-level and used it for the MD simulations.





Figure 3 shows the results obtained by running the three usual MD tests (already defined in a previous workpackage [11]) in the specific run-level, compared with those obtained in the default run-level. As expected, we can appreciate only a small increase (about 5%) in performance. The small gain remains almost constant as the number of CPUs increases, thus becoming relevant above 8 CPUs: beyond that number, the introduction of the custom run-level gives a performance increase close to that of adding another CPU to the cluster.


3.2 Network communications: Fast-Ethernet vs. Gigabit-Ethernet
We have previously described [10] how the high rate of inter-node communication generated by the parallel algorithm, together with the type of network used, limits the simulation time and the scalability of this software package. To further explore how to improve the performance of the PF cluster, we conducted a series of tests using two different network devices: Fast-Ethernet and Gigabit-Ethernet. The cluster using the Fast-Ethernet network was composed of five dual-processor SunBlade 1000 workstations equipped with UltraSPARC III CPUs at 750 MHz, while the cluster with Gigabit-Ethernet was composed of four dual-processor E280R servers equipped with UltraSPARC III CPUs at 750 MHz. Both clusters were connected using Cisco network switches. The results obtained from the simulation sets (see Figure 4) confirm that the network layer is the limiting factor for performance and scalability. The Gigabit network showed the known limitations of Ethernet networks for this kind of traffic, and scalability is still far from optimal. However, the Gigabit cluster produced better results: using 8 processors, the scalability is generally more than one unit above that of the Fast-Ethernet network. Gigabit connections can therefore be regarded as preferable to the older technology: even with a relatively small number of nodes, the Gigabit cluster can give the same or better simulation times using one or two processors fewer than the Fast-Ethernet cluster, so the higher cost of the Gigabit network interfaces is balanced by a lower cost of the workstations. In addition, the availability of Gigabit-Ethernet over copper wire (UTP cables) instead of optical fiber further simplifies the adoption of this technology as a substitute for the older one.


4 Conclusions
DNC-Blast. The Basic Local Alignment Search Tool (BLAST) is a typical example of a bioinformatic tool that has to deal with the exponentially growing amount of genomic data available to the scientific community. Given the intrinsic nature of a homology search algorithm, the wider the data set, the poorer the parallel efficiency of the algorithm itself. Starting from this evidence, we applied a different parallelization strategy, namely the parallelization of the databank. The challenge was to show that such a strategy can be successfully applied on traditional architectures (such as our Production Facility Symmetrical MultiProcessing server) as well as on lighter structures such as our Production Facility cluster of heterogeneous workstations, here called BioWulf, and to do so without interfering with the internal structure of the NCBI-Blast itself. We started by fractioning the databank using the indexing tool of the NCBI Toolkit, formatdb. We then developed a software tool, called Divide aNd Conquer (DNC)-Blast, that performs a set of single Blast queries on portions of the databank and merges the results into a single output report. With an appropriate treatment of the statistical inferences, this output is completely equivalent to the original one. The next step was to assess the performance of such an algorithm, beginning with the PF SMP architecture. As could be expected, on small data sets we found poor performance improvements, while a sharp gain in parallel efficiency was detected as the database size increased, especially for nucleotide sequence alignments. Further improvements were achieved by randomizing the databanks to be searched, reaching, for BlastN queries against the Genbank, a performance gain close to a factor of 2. While porting the tool to the PF cluster, we found that the standard NCBI formatdb indexing tool was not able to properly handle the heterogeneity of the BioWulf, so we developed a custom DNC-formatdb that splits the databanks into randomized, weighted portions, greatly improving the balance of the computational load on the cluster nodes. In this way we showed that it is possible to use a heterogeneous cluster architecture, without any performance degradation with respect to the SMP, while retaining all its intrinsic advantages. As a final result, we executed a typical task of a genomic research project with both the NCBI-Blast and the DNC-Blast: the latter completed in less than half the time of the former.

Gromacs. The GROningen MAchine for Chemical Simulations (GROMACS) is a widely used molecular simulation program, with an algorithmic structure completely different from that of sequence alignment tools. In contrast to Blast, a parallel MPI version of the code, suited for a cluster architecture, was already available before the start of the present project. The main challenge here was to show if and how a PF cluster, not specifically designed as an MD engine, could perform these tasks. An initial study of various MPI implementations on a prototype cluster confirmed that network communications are the real bottleneck of the process. Hence, when porting the package to the PF cluster, we fine-tuned the configuration of the network and of the system itself, obtaining a performance increase that allowed us to stay close, for some simulated systems, to the Amdahl's law limit. Finally, we showed that by switching to Gigabit-Ethernet, a faster yet still affordable network connection, it was possible to achieve a performance gain of 1.5 times.








In conclusion, we showed that the algorithm used for DNC-Blast is particularly well suited for parallel systems and can be used without modification on SMPs as well as on BioWulf clusters; hybrid solutions that employ both architectures can also be applied. Present-day biology requires the analysis of an increasing amount of data in as short a time as possible. Since DNC-Blast gave excellent results when used on large volumes of data (while on small data sets it was almost impossible to improve on the original Blast because of the low execution times already achieved), it proved to be a powerful, industrial-strength bioinformatic tool for genomics projects. The Production Facility proved to be a high-performance platform suitable for executing iterative tasks on data sets several gigabytes in size. As an example, the grouping of protein sequences into meaningful families requires the execution of "all-to-all" Blast alignments on a set of hundreds of thousands of sequences [i]. This took up to thirty days of continuous computation: properly increasing the number of nodes used by a DNC-Blast engine would cut the computing time by an order of magnitude, allowing a more precise definition of the protein families in the same amount of time. The results obtained also suggest that the parallelization strategy used for DNC-Blast can be profitably employed for other sequence alignment tools such as the Smith & Waterman search [ii]. A similar approach could be applied to this non-heuristic sequence alignment tool - which provides more precise and sensitive results than Blast - overcoming its higher computing resource requirements by means of a parallel implementation.


5 Acknowledgements
Part of the present workpackage -more precisely the part on Gromacs testing- has been completed with the helpful collaboration of Carlo Nardone, Paolo Carini and Alfredo Polizzi.

6 Appendix A


Table 3


Here we report the database volume sizes obtained according to the Equivalent CPU values listed in Table 1. Since the computing power depends slightly on the application employed (blastp or blastn) and on the overall databank size, the splitting is different for the four databanks. Table 3 lists, from top to bottom, the relative volume sizes for the Genbank, the BacteriaN, the BacteriaP and the NR. After the heterogeneous randomization of the databases and indexing by means of formatdb, the individual volumes were moved to local storage on the assigned nodes. It should be noted that all the nodes are used with a single CPU per node, with the exception of the two Blade 1000 @ 900 MHz, which are used with both available CPUs per node in the 6-, 8- and 10-volume decompositions.

