Hardware assembly, OS and communication set-up
Ivan Rossi, Piero Fariselli, Pier Luigi Martelli, Gianluca Tasco
CIRB Biocomputing Unit, Università di Bologna, Via Irnerio 48, 40122 Bologna
1. Hardware
According to the Beowulf philosophy, the CIRB cluster prototype has been
assembled using Components Off-The-Shelf (COTS). The cluster is formed by
the following main parts: the master front-end machine, four computing
nodes (slaves), the interconnection subsystem and the console.
Front-end:
CPUs: two Pentium III/700 MHz (Coppermine)
RAM: 512 MB SDRAM
Storage:
Adaptec AIC-7896/7 Ultra2 SCSI controller,
Quantum ATLAS IV 9 GB SCSI Hard Disk,
LG CED-8080B EIDE CD-RW (as backup unit),
CD-ROM drive.
NICs:
Realtek RTL-8029 PCI (outbound to the Internet)
Intel EtherExpress Pro/100 PCI (100 Mbps inbound)
Slaves:
CPUs: two Pentium III/500 MHz (Katmai)
RAM: 256 MB SDRAM
Storage:
Symbios 53c856 SCSI III controller,
Quantum ATLAS IV 9 GB SCSI Hard Disk,
NICs:
Intel EtherExpress Pro/100 PCI (100 Mbps inbound)
Interconnection:
Intel Express 520T FastEthernet switch (1 Gbps
backplane).
Console:
Maxxtro 6-port keyboard-video-mouse (KVM) switch
IBM PowerDisplay 17" monitor
PS/2 keyboard and 3-button mouse
The slave machines were already available at the CIRB Biocomputing Unit.
The front-end and the FastEthernet and KVM switches have been acquired in
order to assemble the cluster.
Comments
The front-end machine has been equipped with faster CPUs and a larger
amount of RAM than the slave nodes, because it bears a higher load: it has
to support the interactive sessions of the users, the X Window System
services and the Network File System (NFS) exports to the nodes.
Furthermore, it handles IP traffic from both the cluster private network
and the Internet.
For the private-network interconnection system, FastEthernet has been
selected instead of GigabitEthernet or Myrinet.
GigabitEthernet cards and switches are the newest entry on the market and
still carry high prices, at least a factor of five to seven higher than
their FastEthernet counterparts. However, the most important factor
against its adoption is, in our opinion, its limited support in the Linux
kernel in terms of available network drivers. The situation is expected to
improve shortly, since many major vendors (including Intel) are now
supporting this technology. We expect GigabitEthernet to be the way to go
as soon as the technology matures further.
As opposed to GigabitEthernet, Myrinet interconnects have been supported in
the Linux kernel since the 2.0 series; however, Myrinet suffers from being
a single-vendor technology: only Myricom Inc. provides Myrinet hardware.
The prices for both NICs and switches are 10-15 times higher than their
FastEthernet equivalents, meaning that the NIC alone can amount to almost
half of the total price of a compute node. Myrinet has therefore been
discarded because of budget limitations.
Since the network traffic is expected to be much lower on the Internet side
of the network than on the private network, special care has been taken to
select, for the cluster interconnect, the best-performing network interface
card affordable within our budget. A survey of user groups on the Internet,
and in particular of the archives of the Beowulf mailing list
(http://www.beowulf.org/pipermail/beowulf/), which is devoted to the
discussion of all aspects of Beowulf-class system assembly and usage,
indicated that one of the most appreciated NICs in terms of price,
reliability and low latency is the Intel "EtherExpress Pro/100". This card
has been acquired and installed on all the machines of the cluster, also
because it is readily available from almost any vendor.
A KVM switch is not strictly necessary in order to run a Beowulf cluster;
however, it simplifies installation and maintenance operations by allowing
centralized, quick and practical direct access to the console of every
machine in the cluster, which can be invaluable when a problem arises.
An EIDE CD-RW has been chosen as the main backup unit for user files and
data, due to its low cost and to the fact that CD-ROMs are much sturdier
and more dependable than magnetic tapes for long-term data storage. Their
main drawback is their relatively limited capacity (700 MB); however, SCSI
tape backup units are already available at the CIRB Biocomputing Unit and
can be connected to the front-end SCSI bus, should the necessity arise.
The resulting assembly is shown in the photo included hereafter.
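As an illustration, a backup of the user home directories could be written
to the CD-RW with the standard mkisofs and cdrecord tools along the lines
of the sketch below; the SCSI device address is a local assumption, since
under the 2.2 kernels an EIDE writer is driven through the ide-scsi
emulation layer and its address depends on the local configuration:

# build an ISO9660 image of /home with Rock Ridge extensions and burn it
# on the fly; dev=0,0,0 (bus,target,lun) is site-specific
mkisofs -R /home | cdrecord -v speed=4 dev=0,0,0 -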
2. OS and communications
The Linux operating system has been widely used and tested to build
high-performance computing clusters on the Intel x86 platform.
Furthermore, the price is right: either free or extremely inexpensive,
depending on the distribution used. We tested two different
distributions: a general-purpose one, the widely used RedHat Linux 6.2,
and Scyld Beowulf, a recently released distribution designed for the
implementation of parallel computers.
RedHat 6.2 + Kickstart
RedHat has some advantages over other Linux distributions, thanks to the
RPM software packaging system and to the Kickstart unattended installation
method.
The first step of the Beowulf cluster assembly is the installation of the
front-end machine. The hard disk has been partitioned and a standard full
installation of all the packages available on the distribution CD-ROM has
been performed. After the installation, a DHCP server has been configured,
so that the cluster slave nodes dynamically get their network configuration
parameters at boot. Network File System (NFS) sharing of the CD-ROM and of
the /usr/local and /home filesystems, where local applications and user
data reside, has been activated.
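A minimal sketch of the corresponding server configuration is shown below;
the address range and netmask are assumptions consistent with the
192.168.1.1 front-end address used later, and the actual files contain
additional site-specific options:

# /etc/dhcpd.conf -- hand out addresses on the private cluster network
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.10 192.168.1.20;
    option routers 192.168.1.1;
}

# /etc/exports -- NFS shares restricted to the private network
/mnt/cdrom   192.168.1.0/255.255.255.0(ro)
/usr/local   192.168.1.0/255.255.255.0(ro)
/home        192.168.1.0/255.255.255.0(rw)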
It is of paramount importance to secure the network installation by
configuring the front-end as a network firewall, since a security breach
there would compromise the entire cluster. Only secure-shell
(http://www.openssh.org) logins and file transfers have been allowed from
the Internet to the cluster. The standard Linux IPCHAINS and TCP wrappers
tools have been used for this purpose. This strategy showed its importance
very soon: several intrusion attempts and TCP port scans were logged in the
first couple of days after the installation, some of which might have been
successful on a plain RedHat installation.
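A minimal sketch of this kind of policy is shown below; the interface names
are assumptions (eth0 towards the Internet, eth1 on the private network)
and the real firewall script is more elaborate, for instance it also has to
handle the return traffic of outgoing connections:

# /etc/hosts.deny -- TCP wrappers: deny everything by default
ALL: ALL
# /etc/hosts.allow -- allow ssh from anywhere, everything else only
# from the private cluster network
sshd: ALL
ALL: 192.168.1.

# IPCHAINS sketch: default-deny input policy, accept the loopback and
# private interfaces, accept only ssh (TCP port 22) from the Internet
ipchains -P input DENY
ipchains -A input -i lo -j ACCEPT
ipchains -A input -i eth1 -j ACCEPT
ipchains -A input -i eth0 -p tcp -d 0/0 22 -j ACCEPT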
On the slave nodes the installation is somewhat different. First, these
machines do not have a CD drive, so the NFS installation from the shared
CD-ROM has to be used. One of the slave nodes has been designated as the
"golden standard", and a supervised NFS installation and configuration has
been carried out on it: a stripped-down installation that includes only
basic networking functionality plus the DHCP, NIS and NFS client services,
rsync, sshd, and MPI support. The end result of this process is a cluster
of two machines: the front-end and the "golden standard" slave.
The final step is the creation of the Kickstart file by means of the
"mkkickstart" utility. The generated file contains a description of the
installation process expressed in the Kickstart syntax. An example of the
file is included below:
# "Golden node" kickstart file
lang en
network --bootproto dhcp
nfs --server 192.168.1.1 --dir /mnt/cdrom
keyboard us
zerombr yes
clearpart --all
part / --size 500 --grow
part /tmp --size 2048 --grow
part swap --size 256
install
mouse ps/2
timezone --utc Europe/Rome
rootpw --iscrypted X7hcxtGPw/A.
lilo --location mbr
%packages
@ Networked Workstation
rsync
rdate
%post
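# fetch extra packages from the front-end FTP area and add the shared
# NFS filesystems to /etc/fstab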
rpm -i ftp://192.168.1.1/usr/local/rpms/openssh.rpm
rpm -i ftp://192.168.1.1/usr/local/rpms/mpich.rpm
echo "192.168.1.1:/home /home nfs defaults" >>/etc/fstab
echo "192.168.1.1:/usr/local /usr/local nfs defaults" >>/etc/fstab
This file can then be used as input for the RedHat Kickstart installation
process, allowing both a very quick reinstallation of the golden node,
should the necessity arise, and a fast cloning of it onto all the other
slave machines, thus generating the entire cluster.
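For reference, cloning a node then amounts to booting it from a RedHat boot
floppy and pointing the installer at the Kickstart file at the boot prompt;
the NFS path below is a hypothetical example:

boot: linux ks=floppy
      (reads ks.cfg from the boot floppy itself)
boot: linux ks=nfs:192.168.1.1:/kickstart/ks.cfg
      (fetches the Kickstart file from the front-end over NFS)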
The main drawback of this kind of installation appears during maintenance:
in general, every configuration change must be replicated consistently on
every machine. For example, in order to upgrade a software package, the
RPM file has to be made available to all the machines; this can be done by
setting up an RPM archive on the front-end and running rpm on each node,
which is easily scripted, but should something go wrong, version skew
problems will arise. Furthermore, user ids, passwords and groups need to
be synchronized across the entire cluster; since NIS services showed
instabilities, synchronization has been achieved by means of rsync
transfers of the relevant files, driven by scripts.
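As an illustration, the account synchronization can be scripted along the
following lines; the node names are hypothetical and the real script also
performs error checking:

#!/bin/sh
# push the account files from the front-end to every slave node,
# transferring them with rsync over ssh
FILES="/etc/passwd /etc/shadow /etc/group"
for node in node1 node2 node3 node4; do
    rsync -a -e ssh $FILES root@${node}:/etc/
done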
Scyld Beowulf
Scyld Beowulf includes an enhanced Linux kernel, libraries, and utilities
specifically designed for clustering. Furthermore, this distribution
provides a "single system image" through a kernel extension called Bproc,
which provides cluster process management by making the processes running
on the cluster "Computation Node" computers visible and manageable on the
front-end "Master Node". Processes start on the front-end node and then
migrate to a cluster node, while parent-child relationships and UNIX job
control are maintained for the migrated tasks.
This unified "cluster process space" provides the capability to start,
observe, and control processes on the cluster nodes from the cluster
front-end computer.
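For instance, using the Bproc utilities shipped with Scyld (bpstat and
bpsh), compute-node processes can be started and inspected directly from
the front-end; the node number below is illustrative:

bpstat              # list the compute nodes and their up/down status
bpsh 0 uname -r     # run "uname -r" on node 0, output returns to the front-end
ps ax               # migrated processes remain visible in the front-end process table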
Another interesting feature of the Bproc system is its low latency, which
decreases the time needed to start processes remotely. Its process
migration time has been measured to be about ten milliseconds, which
represents an order-of-magnitude improvement over other job-spawning
methods. Scyld Beowulf ships with a customized version of the popular
MPICH implementation of the MPI message-passing library, modified to take
advantage of the process creation and management facilities provided by
the Bproc system. Moreover, the supplied Linux kernel and GNU C library
have been patched to support the Large File Summit (LFS) specification and
64-bit file access on the ext2 filesystem. Several of the included
utilities (basic text utilities, scp, ftp client and server) have also
been modified to take advantage of it.
Another feature of the single system image provided by Bproc is that the
cluster slave nodes are not required to contain any resident applications.
The dynamically-linked libraries are cached locally on the slave nodes,
while the executables are started on the front-end and then moved to the
slaves through the Bproc system. This means that the cluster can also be
formed by a master and several diskless slaves. If present, the slave hard
disks are used for application data and cache. Furthermore, part of the
available storage space on the slaves may also be shared with the whole
cluster through a cluster file system such as AFS (http://www.openafs.org/),
PVFS (http://parlweb.parl.clemson.edu/pvfs/) or GFS
(http://www.sistina.com/gfs/).
One huge advantage of the single-system-image approach, from the system
manager's point of view, is the great simplification of the installation,
administration and maintenance of the cluster. Scyld Beowulf installation
is easy: it is mainly a matter of installing the distribution on the
front-end, which is done in the same way as a RedHat Linux 6.2
installation. After that, it is necessary to create a boot image for the
slaves, to assign unique network numbers to the slaves and to define the
partitioning scheme for the slave hard drives. At their next boot, the
slave nodes join the cluster, download the appropriate kernel and cache
the dynamically-linked libraries locally.
Maintenance is simplified too, because only the code on the front-end needs
to be upgraded, since the slaves dynamically get what they need from it.
This also eliminates application version skew.
The same network security issues discussed for RedHat 6.2 apply to the
Scyld distribution too.
System testing: networking
The performance of the cluster communication system has been evaluated by
means of the MPI variant of the NetPIPE benchmark (NPmpi). NetPIPE
(http://www.scl.ameslab.gov/netpipe/) is a protocol-independent performance
tool that can measure network performance under a variety of conditions.
By taking the end-to-end application view of a network, NetPIPE can show
the overhead associated with the different protocol layers. NetPIPE has
been used here to determine the network's effective maximum throughput and
saturation level when using the MPI communication layer. The graph clearly
shows that the optimal throughput is attained when large packets are
transmitted. Under these conditions a maximum throughput of 84.5 Mbps has
been reached, corresponding to a 64 KB packet size. Larger packets cause a
slight network performance degradation, with a saturation value of 83 Mbps,
out of a theoretical value of 100 Mbps.
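For reference, the test has been launched through the MPICH mpirun command
between two processes placed on different nodes; the NPmpi output-file
option shown here is an assumption about the particular NetPIPE build:

# run the MPI flavour of NetPIPE across two nodes listed in "machines"
mpirun -np 2 -machinefile machines NPmpi -o npmpi.out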
3. System testing: Overall performance
The overall cluster performance has been measured using the standard High
Performance LINPACK (HPL) benchmark (http://www.netlib.org/benchmark/hpl/).
This software package solves a (random) dense linear system in double
precision (64-bit) arithmetic on distributed-memory computers. The HPL
package provides a testing and timing program to quantify the accuracy of
the obtained solution as well as the time it took to compute it. This test
is a widely recognized standard for monitoring supercomputer speed: the
Top500 list (http://www.top500.org/), which every six months reports the
500 fastest computer systems in the world, uses the HPL results as the
performance measure for ranking the computers. A matrix size of 9000 has
been used in the HPL scaling test. Using a larger matrix size (11000), a
peak performance of 1.97 GFlops has been attained. During the tests, the
network switch has been continuously monitored for backplane saturation,
which never occurred: network usage never exceeded 50% of the backplane
saturation value.
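For reference, the lines of the HPL.dat input file that control such a run,
together with the launch command, are sketched below; the block size and
the 2x4 process grid are illustrative choices for the eight available CPUs,
not necessarily the exact values used in the reported runs:

9000         Ns      (problem size N)
64           NBs     (block size)
2            Ps      (process grid rows)
4            Qs      (process grid columns)

# launch xhpl on the 8 CPUs of the cluster through MPICH
mpirun -np 8 -machinefile machines ./xhpl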
Conclusion
The implementation obtained using the Scyld Beowulf distribution is much
easier to maintain than the one obtained using the general-purpose RedHat
Linux distribution, while the installation is simple in both cases.
A working small Beowulf cluster has been deployed at the CIRB Biocomputing
Unit of the University of Bologna, reaching the planned milestone for the
workpackage.