Installation of Torque/Maui for a Beowulf Cluster
Installing Torque on the OSU Physics Beowulf Cluster
Background:
In the summer of 2005, the OSU Physics Department bought 39 Dell Optiplex GX620s with Intel Pentium D 830 (3.0 GHz) processors and 1 GB of RAM to replace the aging Sun Ultra 5s being used as the department's computational lab. 35 machines were to be placed in the labs: 26 in Weniger 412 and 9 in Weniger 497. 1 machine was to have its RAM upgraded and its 80 GB hard drive replaced with two 320 GB drives, and was set up to act as a server to the other machines; it is located in 405. The remaining 3 machines were to be used by faculty and staff involved with the labs. Currently these 3 machines are located in various rooms: 1 is used by Henri in room 403, 1 is used by Justin in 401, and 1 is used by Rubin in 311(?).
We had also purchased a set of compilers from Intel. As part of this, we received licenses for the Intel MPI libraries, which would allow us to use the lab machines as a cluster once set up. It is possible to run MPI programs without installing Torque; however, this is not a good idea here because these computers are used by different people at different times, i.e. they are not dedicated. Only on a dedicated cluster might a scheduling system like Torque be unnecessary.
Other advantages of using these machines to replace our (now obsolete) cluster of Suns are speed and having only one physics cluster to maintain. The older cluster machines were roughly 13-15 years old and had frequently failing components.
I had originally decided to use OpenPBS, as it was available as a precompiled package with the Suse 10.1 operating system used on all the lab machines. (The server has a paid version of Suse Linux Enterprise Desktop 10.) However, it became apparent that this package was set to use only the rsh and rcp protocols to transfer files, and since the machines are all publicly accessible (they are not on private IPs behind a public front-end machine), it was decided that rsh and rcp were too insecure for our purposes. OpenPBS is also no longer actively maintained or developed as an open source project. There is a newer version called PBSPro that is actively developed but is not open source. Torque was forked from OpenPBS before support was dropped; it is actively maintained and freely available.
Since our cluster is to be used only as an educational tool, it was decided that it was not worthwhile to pay for commercial support. If a cluster were to be built for research purposes, PBSPro may be a better choice.
Sun Grid Engine (SGE) was also considered, but its configuration is more complicated, and it was decided to keep things simple.
********* EDIT (4/2/07) ************
After finding out that the default FIFO scheduler included with Torque did not properly allocate nodes, I installed Maui, which is a much more capable scheduler. It is also open source and is put out by the same group that does Torque. See the edits below for how it was installed.
********************************
Installation:
The files were downloaded from the Torque website onto the server. They were unpacked and compiled as follows:
# tar zxvf torque-2.1.6.tar.gz
# cd torque-2.1.6
# ./configure --prefix=/usr/local --enable-docs --enable-mom --enable-server --enable-clients --disable-gui --with-default-server=physics-server.physics.oregonstate.edu --with-scp
# make
# make packages
This will make installable packages for the server, docs, clients, and mom, with no GUI, with the default server set, and using scp and ssh instead of rcp and rsh.
On the server, install the following packages (note that these "packages" are not RPMs, but simply compressed files that place the needed files in the proper locations):
# ./torque-package-server-linux-x86_64.sh --install
# ./torque-package-clients-linux-x86_64.sh --install
# ./torque-package-mom-linux-x86_64.sh --install
# ./torque-package-devel-linux-x86_64.sh --install
******* Edit ************
I also had to install the devel package on the server so that Maui would configure.
***********************
This will install files into /usr/local/bin/ and /usr/local/sbin/ and /var/spool/torque . The configuration files are located in the /var/spool/torque directory while the binaries are in the other two.
Install onto nodes:
I copied the above packages to one of the nodes and ran:
# ./torque-package-mom-linux-x86_64.sh --install
# ./torque-package-clients-linux-x86_64.sh --install
# ./torque-package-doc-linux-x86_64.sh --install
# ./torque-package-devel-linux-x86_64.sh --install
The devel package was probably not necessary, but I installed it anyway.
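Repeating this by hand on every node is tedious. A minimal sketch of looping over the nodes follows. It assumes root can ssh to each node and that the package files sit in the current directory; the helper names node_list and install_on_nodes are made up for illustration, and the host names follow the naming used in the labs.

```shell
#!/bin/bash
# Print the fully qualified names of all 35 lab nodes, one per line.
node_list() {
    local i
    for i in $(seq 1 26); do
        printf 'wngr412-pc%02d.physics.oregonstate.edu\n' "$i"
    done
    for i in $(seq 1 9); do
        printf 'wngr497-pc%02d.physics.oregonstate.edu\n' "$i"
    done
}

# Copy the mom and clients packages to each node and install them there.
# (Hypothetical helper; a node that cannot be reached is skipped.)
install_on_nodes() {
    local node pkg
    for node in $(node_list); do
        scp torque-package-{mom,clients}-linux-x86_64.sh "root@${node}:/tmp/" || continue
        for pkg in mom clients; do
            ssh "root@${node}" "cd /tmp && ./torque-package-${pkg}-linux-x86_64.sh --install"
        done
    done
}

# Example: list the nodes first, then run install_on_nodes as root.
# node_list
# install_on_nodes
```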
******* Edit *************
To install Maui, you must first register at this site, then you can download the source files. Registration is free.
Download, configure, make and make install (I don't remember having to do anything special here)
# ./configure
# make
# make install
This will install files into /usr/local/maui
I had to change the following in maui.cfg
# Resource Manager Definition
#RMCFG[PHYSICS-SERVER.PHYSICS.OREGONSTATE.EDU] TYPE=PBS@RMNMHOST@
RMCFG[0] TYPE=PBS
# The default setting for RMCFG did not work. Setting it to the more generic setting fixed the communication between Torque and Maui
....
# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
#NODEALLOCATIONPOLICY MINRESOURCE
#NODEALLOCATIONPOLICY CPULOAD
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='- 10 * LOAD - 20 * JOBCOUNT'
# This setting makes it so that Maui will assign jobs to nodes that have the lowest load and least amount of jobs. This was the main reason for changing to Maui.
**************************
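To see what the PRIORITYF setting above does, here is a small worked example. It is not part of Maui; node_priority is a hypothetical helper that simply evaluates -10 * LOAD - 20 * JOBCOUNT for made-up values. With NODEALLOCATIONPOLICY PRIORITY, the node with the highest resulting value is preferred.

```shell
#!/bin/bash
# Evaluate the priority expression from maui.cfg for one node.
# $1 = load average, $2 = number of jobs on the node
node_priority() {
    awk -v load="$1" -v jobs="$2" 'BEGIN { printf "%.1f\n", -10 * load - 20 * jobs }'
}

node_priority 0.1 0   # nearly idle node, no jobs    -> -1.0
node_priority 1.5 0   # someone busy at the console  -> -15.0
node_priority 0.2 1   # one job already assigned     -> -22.0
```

In this example the nearly idle node wins, and a node that already holds a job ranks below one that is merely busy.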
Configuration:
This was the difficult part. I mostly followed the instructions in the admin guide for Torque, setting the values in the basic configuration section as follows:
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 72:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long resources_default.nodes = 1
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server acl_hosts = *.physics.oregonstate.edu
set server operators = justin@physics-server.physics.oregonstate.edu
set server operators += root@physics-server.physics.oregonstate.edu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.6
set server allow_node_submit = True
I have 2 queues made. The queue "batch" is the default queue and has a default walltime of 3 days (resources_default.walltime). The queue "long" is to be used if computation times longer than 3 days are needed.
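As an illustration of how the two queues would be used, here is a minimal sketch. The file name myjob.pbs, the job name, and the program ./my_program are placeholders and not part of the original setup; the #PBS directives and qsub are standard Torque.

```shell
#!/bin/bash
# Write a hypothetical job script that asks for the "long" queue,
# two nodes, and four days of walltime.
cat > myjob.pbs <<'EOF'
#!/bin/bash
#PBS -N example_job
#PBS -q long
#PBS -l nodes=2,walltime=96:00:00
cd "$PBS_O_WORKDIR"
./my_program
EOF

# Submit it (commented out here; this needs a running pbs_server):
# qsub myjob.pbs
# Without the "-q long" line the job goes to the default queue "batch".
```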
I also changed the following files
On server:
/etc/init.d/pbs_server
Use chkconfig to set this to run at boot time:
# cd /etc/init.d
# chkconfig pbs_server on
******** Edit ****************
This does not work on reboot. I have made a file start_pbs in /etc/init.d/ with the following commands:
##### start_pbs #######
#!/bin/bash
sleep 45
/etc/init.d/pbs_server start
/usr/local/maui/sbin/maui
##################
I also changed the following using chkconfig
# chkconfig pbs_server off
# chmod +x /etc/init.d/start_pbs
# chkconfig start_pbs on
***************************
On nodes:
------------------/var/spool/torque/mom_priv/config-------------------------------
$pbsserver physics-server.physics.oregonstate.edu # note: hostname running pbs_server
$logevent 255 # bitmap of which events to log
--------------------/etc/hosts--------------------------------------------
#
# hosts This file describes a number of hostname-to-address
# mappings for the TCP/IP subsystem. It is mostly
# used at boot time, when no name servers are running.
# On small systems, this file can be used instead of a
# "named" name server.
# Syntax:
#
# IP-Address Full-Qualified-Hostname Short-Hostname
#
127.0.0.1 localhost
# special IPv6 addresses
::1 localhost ipv6-localhost ipv6-loopback
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
128.193.97.4 wngr412-pc01.physics.oregonstate.edu wngr412-pc01
128.193.97.5 wngr412-pc02.physics.oregonstate.edu wngr412-pc02
128.193.97.6 wngr412-pc03.physics.oregonstate.edu wngr412-pc03
128.193.97.7 wngr412-pc04.physics.oregonstate.edu wngr412-pc04
128.193.97.8 wngr412-pc05.physics.oregonstate.edu wngr412-pc05
128.193.97.9 wngr412-pc06.physics.oregonstate.edu wngr412-pc06
128.193.97.10 wngr412-pc07.physics.oregonstate.edu wngr412-pc07
128.193.97.11 wngr412-pc08.physics.oregonstate.edu wngr412-pc08
128.193.97.12 wngr412-pc09.physics.oregonstate.edu wngr412-pc09
128.193.97.13 wngr412-pc10.physics.oregonstate.edu wngr412-pc10
128.193.97.14 wngr412-pc11.physics.oregonstate.edu wngr412-pc11
128.193.97.15 wngr412-pc12.physics.oregonstate.edu wngr412-pc12
128.193.97.16 wngr412-pc13.physics.oregonstate.edu wngr412-pc13
128.193.97.17 wngr412-pc14.physics.oregonstate.edu wngr412-pc14
128.193.97.18 wngr412-pc15.physics.oregonstate.edu wngr412-pc15
128.193.97.19 wngr412-pc16.physics.oregonstate.edu wngr412-pc16
128.193.97.20 wngr412-pc17.physics.oregonstate.edu wngr412-pc17
128.193.97.21 wngr412-pc18.physics.oregonstate.edu wngr412-pc18
128.193.97.22 wngr412-pc19.physics.oregonstate.edu wngr412-pc19
128.193.97.23 wngr412-pc20.physics.oregonstate.edu wngr412-pc20
128.193.97.24 wngr412-pc21.physics.oregonstate.edu wngr412-pc21
128.193.97.25 wngr412-pc22.physics.oregonstate.edu wngr412-pc22
128.193.97.26 wngr412-pc23.physics.oregonstate.edu wngr412-pc23
128.193.97.27 wngr412-pc24.physics.oregonstate.edu wngr412-pc24
128.193.97.28 wngr412-pc25.physics.oregonstate.edu wngr412-pc25
128.193.97.29 wngr412-pc26.physics.oregonstate.edu wngr412-pc26
128.193.97.31 wngr497-pc01.physics.oregonstate.edu wngr497-pc01
128.193.97.32 wngr497-pc02.physics.oregonstate.edu wngr497-pc02
128.193.97.33 wngr497-pc03.physics.oregonstate.edu wngr497-pc03
128.193.97.34 wngr497-pc04.physics.oregonstate.edu wngr497-pc04
128.193.97.35 wngr497-pc05.physics.oregonstate.edu wngr497-pc05
128.193.97.36 wngr497-pc06.physics.oregonstate.edu wngr497-pc06
128.193.97.37 wngr497-pc07.physics.oregonstate.edu wngr497-pc07
128.193.97.38 wngr497-pc08.physics.oregonstate.edu wngr497-pc08
128.193.97.39 wngr497-pc09.physics.oregonstate.edu wngr497-pc09
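The node entries follow a regular pattern, so they can be generated rather than typed. This is a sketch that assumes the address scheme visible in the listing (wngr412-pcNN at 128.193.97.(NN+3), wngr497-pcNN at 128.193.97.(NN+30)); hosts_entries is a made-up helper name.

```shell
#!/bin/bash
# Print one /etc/hosts line per lab node:
#   IP-Address  Fully-Qualified-Hostname  Short-Hostname
hosts_entries() {
    local i
    for i in $(seq 1 26); do
        printf '128.193.97.%d wngr412-pc%02d.physics.oregonstate.edu wngr412-pc%02d\n' \
            $((i + 3)) "$i" "$i"
    done
    for i in $(seq 1 9); do
        printf '128.193.97.%d wngr497-pc%02d.physics.oregonstate.edu wngr497-pc%02d\n' \
            $((i + 30)) "$i" "$i"
    done
}

# Example: review the output, then append it to the hosts file as root.
# hosts_entries
# hosts_entries >> /etc/hosts
```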
---------------------/etc/ssh/shosts.equiv-------------------------------
wngr412-pc01.physics.oregonstate.edu
wngr412-pc02.physics.oregonstate.edu
wngr412-pc03.physics.oregonstate.edu
wngr412-pc04.physics.oregonstate.edu
wngr412-pc05.physics.oregonstate.edu
wngr412-pc06.physics.oregonstate.edu
wngr412-pc07.physics.oregonstate.edu
wngr412-pc08.physics.oregonstate.edu
wngr412-pc09.physics.oregonstate.edu
wngr412-pc10.physics.oregonstate.edu
wngr412-pc11.physics.oregonstate.edu
wngr412-pc12.physics.oregonstate.edu
wngr412-pc13.physics.oregonstate.edu
wngr412-pc14.physics.oregonstate.edu
wngr412-pc15.physics.oregonstate.edu
wngr412-pc16.physics.oregonstate.edu
wngr412-pc17.physics.oregonstate.edu
wngr412-pc18.physics.oregonstate.edu
wngr412-pc19.physics.oregonstate.edu
wngr412-pc20.physics.oregonstate.edu
wngr412-pc21.physics.oregonstate.edu
wngr412-pc22.physics.oregonstate.edu
wngr412-pc23.physics.oregonstate.edu
wngr412-pc24.physics.oregonstate.edu
wngr412-pc25.physics.oregonstate.edu
wngr412-pc26.physics.oregonstate.edu
wngr497-pc01.physics.oregonstate.edu
wngr497-pc02.physics.oregonstate.edu
wngr497-pc03.physics.oregonstate.edu
wngr497-pc04.physics.oregonstate.edu
wngr497-pc05.physics.oregonstate.edu
wngr497-pc06.physics.oregonstate.edu
wngr497-pc07.physics.oregonstate.edu
wngr497-pc08.physics.oregonstate.edu
wngr497-pc09.physics.oregonstate.edu
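A mistyped name in either file will break host-based authentication for that node, so a quick consistency check can help. This is a sketch; missing_from_hosts is a hypothetical helper that prints every name from the first file that does not appear as a whole word in the second.

```shell
#!/bin/bash
# Print each host listed in $1 (one name per line) that has no
# matching entry in the hosts-format file $2.
missing_from_hosts() {
    local equiv_file=$1 hosts_file=$2 name
    while read -r name; do
        [ -z "$name" ] && continue
        # -F: fixed string, -w: whole word, -q: no output
        grep -Fwq -- "$name" "$hosts_file" || echo "$name"
    done < "$equiv_file"
}

# Example: no output means every name was found.
# missing_from_hosts /etc/ssh/shosts.equiv /etc/hosts
```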
-----------------------------/etc/ssh/sshd_config ---------------------------
change the following lines:
# For this to work you will also need host keys in /etc/ssh/ssh_known_hosts
#RhostsRSAAuthentication no
# similar for protocol version 2
HostbasedAuthentication yes
# Change to yes if you don't trust ~/.ssh/known_hosts for
# RhostsRSAAuthentication and HostbasedAuthentication
IgnoreUserKnownHosts no
# Don't read the user's ~/.rhosts and ~/.shosts files
IgnoreRhosts no
-------------------------/etc/ssh/ssh_known_hosts----------------------------
To create this file, I cleaned out root's .ssh/known_hosts file, then initiated an ssh session to each of the other machines to reset all the host keys in the file. I then copied the file with:
# cp ~/.ssh/known_hosts /etc/ssh/ssh_known_hosts
This step should have been easier by using ssh-keyscan, but I had problems using this correctly for some reason.
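For reference, here is a sketch of how ssh-keyscan could build the same file. This is not what was done here and it has not been tested on these machines. It assumes a file nodes.txt (a made-up name) that lists one host name per line; build_known_hosts is a hypothetical helper, and with DRY_RUN=1 it only prints the command.

```shell
#!/bin/bash
# Collect RSA host keys for every host listed in $1 and write them to $2.
# ssh-keyscan -f reads host names from a file; -t selects the key type.
build_known_hosts() {
    local host_list=$1 out_file=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "ssh-keyscan -t rsa -f $host_list > $out_file"
    else
        ssh-keyscan -t rsa -f "$host_list" > "$out_file"
    fi
}

# Example (as root):
# build_known_hosts nodes.txt /etc/ssh/ssh_known_hosts
```

Keys collected this way are not verified by ssh-keyscan, so compare fingerprints against the machines themselves if that matters.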
*************** Edit ****************
--------------------------/usr/lib64/ssh/ssh-keysign-----------------------------
This file needs to be suid root. To do this:
# chmod u+s /usr/lib64/ssh/ssh-keysign
------------------------------------------------------------------------------------
**********************************
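To confirm the bit was set, a small check like the following could be used. It is only a sketch; has_setuid is a made-up helper built on the shell's standard -u file test.

```shell
#!/bin/bash
# Succeed if $1 exists and has its set-user-ID bit set.
has_setuid() {
    [ -u "$1" ]
}

# Example:
# has_setuid /usr/lib64/ssh/ssh-keysign && echo "ssh-keysign is suid"
```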
-----------------------/etc/init.d/pbs_mom-------------------------------------
This file is long, so I have made it a download. Make sure you use chkconfig to make it start on boot. For some reason, this fails on every boot, but adding the following file makes it work. (denyhosts was also not working; this may have been because the binaries for both were located on an NFS directory that wasn't mounted in time.) Again, make sure to use chkconfig to make it start on boot in at least runlevel 5.
-------------/etc/init.d/start_pbs---------------------
#!/bin/bash
sleep 45
/etc/init.d/pbs_mom start
/etc/init.d/denyhosts start
------------------------------------------End of files on the nodes-----------------------------------------------------------
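A fixed sleep only works if the delay happens to be long enough. If the cause really is a late NFS mount, as suspected above, a variant that waits until the binary is visible may be more robust. This is an untested sketch; wait_for_executable is a hypothetical helper, and the path /usr/local/sbin/pbs_mom is assumed from the --prefix=/usr/local used at configure time.

```shell
#!/bin/bash
# Wait until $1 exists and is executable, checking once per second,
# for at most $2 seconds. Returns 0 on success, 1 on timeout.
wait_for_executable() {
    local path=$1 timeout=$2 waited=0
    while [ ! -x "$path" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use in /etc/init.d/start_pbs on a node:
# wait_for_executable /usr/local/sbin/pbs_mom 120 && /etc/init.d/pbs_mom start
```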
I may have missed something, as I was creating this document from memory of what I did, and I ran into several problems that all seemed to be related to the hosts not being able to communicate. These seemed to be fixed by adding the /etc/hosts file to the nodes. The errors I was receiving were:
You can't run mpdboot on ['wngr497-pc01.physics.oregonstate.edu']
version of python must be >= 2.4, current ['']
| Attachment | Size |
|---|---|
| pbs_mom.txt | 2.56 KB |
| pbs_server.txt | 1.49 KB |
