Cluster Use
Using the OSU Physics Labs as a Beowulf Cluster for long running serial or parallel programs
Background
The computers in Weniger 497 and 412 have now been set up to act as a Beowulf cluster. These machines are currently 34 (eventually 35) Dell Optiplex GX620's with Intel Pentium D 830 (3.0 GHz) processors and 1 GB of RAM, running the SUSE Linux 10.1 64-bit operating system. These computers are also loaded with the Intel Compiler Suite, which comprises C, C++, and Fortran compilers, as well as the Math Kernel Library (cluster edition) and MPI libraries. The MPI libraries are what is required for the cluster to be able to run parallel jobs. The cluster also utilizes a program called Torque, which acts as a resource manager for the cluster. As of 4/2/07, we now use Maui as the scheduler for the system. This program will take the programs that you submit and decide which computer(s) to run them on based on current load. Torque is based on PBS and uses all the same commands, but is open source. Please direct any questions or comments to elserj@physics.oregonstate.edu
Running serial jobs with Torque
A serial job is one that does not need to run on multiple
machines. This includes all Java, C, C++, and Fortran programs not
compiled with an MPI compiler, as well as shell scripts.
The basic command to "submit" jobs to Torque is qsub, which is
short for "queue submit". In essence, what you are doing is
submitting a job to the queue, where Torque will decide what to do with
the program.
A very important thing to note: qsub will not accept binary (compiled)
programs directly. This means that this process will fail:
> icc program.c -o program    (icc is the Intel C compiler; -o names the executable "program")
> qsub program                (try to submit the program)
qsub: file must be an ascii script    (result)
For security reasons, qsub will only accept shell scripts (note that
the shell script does NOT need to be executable, although being so
will not affect anything).
Here is a sample script for running serial jobs (copy and paste between the lines):
------------------------------
#!/bin/bash
#PBS -l walltime=00:15:00
#PBS -l nice=19
cd $PBS_O_WORKDIR
./program
------------------------------
#!/bin/bash is required at the beginning of any shell script, or at
least any shell script that runs in bash (you could also specify tcsh,
ksh, etc.).
#PBS -l walltime=00:15:00 tells the scheduler to only allow 15 minutes
for this job. Note that if you require longer, you can change this.
Also note that the default is 2 days if it is not set here. This
means that your program will be killed if it takes longer than the
walltime. This is to keep bad programs from running forever. See the
parallel section for more info. If you would like to run longer, please
contact Justin by email at elserj@physics.oregonstate.edu.
#PBS -l nice=19 ensures that your job will run in the background so that it won't interfere with any other programs being used. This is needed since the computers we are using are public, i.e. anybody can sit down and use one of them. We don't want to interfere with normal class use. More information about "nice" can be found by doing man nice.
cd $PBS_O_WORKDIR tells the script to switch to the directory you
submitted the job from. If this is not used, an absolute path to the
program must be given.
./program tells the script to execute the file program in the current
directory; ../program would tell it to look one directory up from the
current working directory.
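For example, if you save the script above as run_serial.sh (the name run_serial.sh here is just an illustration), you submit it with qsub and Torque replies with the JobID:
> qsub run_serial.sh
171.physics-server
(The number will differ for your job; 171 is just an example.)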
#########################
You can run Java jobs by using the following command instead of ./program (the program must be compiled beforehand; a full sample script is shown just after this note)
java java_program
#########################
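For example, a complete submission script for a Java job might look like the following (a minimal sketch, assuming your compiled class file java_program.class sits in the directory you submit from):
------------------------------
#!/bin/bash
#PBS -l walltime=00:15:00
#PBS -l nice=19
cd $PBS_O_WORKDIR
# run the class file java_program.class (compiled beforehand with javac)
java java_program
------------------------------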
#########################
It is also possible to run Mathematica jobs on the cluster. This is very useful for long running jobs. The steps are as follows:
If you already have a notebook that you would like to run:
Open your notebook in Mathematica,
Select Kernel -> Delete All Output
Select Edit -> Select All
Select Cell -> Cell Properties -> Initialization Cell
Select File -> Save As Special -> Package Format
Save the file somewhere, in this example, I will call it math.m.
Add the following as the first line in your newly saved file:
AppendTo[$Echo, "stdout"]
(This will allow you to see your input commands in the output file)
Change the ./program line to be
math < math.m > results.out
This will place the output from your notebook in the file results.out. If your notebook exports to a file, this should work as normal (not tested yet). A full sample script is shown after this section.
###########################
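Putting the pieces together, a submission script for the Mathematica job might look like the following (a minimal sketch, assuming math.m is in the directory you submit from):
------------------------------
#!/bin/bash
#PBS -l walltime=00:15:00
#PBS -l nice=19
cd $PBS_O_WORKDIR
# feed the package file to the Mathematica kernel and capture the output
math < math.m > results.out
------------------------------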
The following site has more details on PBS environment variables
available, although PBS_O_WORKDIR should be the only one you need for single processor jobs:
http://www.princeton.edu/~ktchu/computing/scientific_computing/PBS.html
As you can see from the above, serial scripts are very simple, although they
can be made more complex if desired.
The nicest feature of running programs
with this method is that while your job is running, you can log out and
your job will still run. This means that you can ssh to
any one of the machines, compile your program, submit it to the queue,
and then log out. Your job will run until finished or killed, and
you do not have to lock a workstation or have anyone even know your
program is running unless they check the queue.
This leads to how to check on your jobs, and where the output
goes. You check on jobs by running the command qstat:
> qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
170.physics-server  run_mpi.sh       justin          0        R batch
This tells me my program has ID 170, I am running run_mpi.sh as user
justin, the program is in state R for running (Q means queued, waiting for
computers; E means exiting, either with errors or without), and it is running
in queue batch. Note that batch is the default queue that all
jobs will run in.
If your program is written to output to a file, it will still output to
that file. However, if your program is set up to output to stdout
(the terminal, console, whatever you want to call it), the output will
be redirected to a file named script.oJobID and any errors will be
redirected to script.eJobID, where script is the script you used to
submit the job and JobID is the ID given to it by Torque. The JobID
can be determined from qstat, or when you submit a job, it will tell
you the JobID. Note that the JobID will be a number followed by
physics-server; this is simply because the queue is on the server,
and all that is important is the number. The files will look like this:
mpitest@wngr412-pc01:~/mpi> ll
total 32
-rwxr-xr-x 1 mpitest users 17702 2006-11-15 16:38 mpipi
-rw-r--r-- 1 mpitest users  1400 2006-11-15 16:38 mpipi.c
-rw-r--r-- 1 mpitest users   585 2006-11-15 16:39 run_mpi.sh
-rw------- 1 mpitest users     0 2006-11-15 16:39 run_mpi.sh.e164
-rw------- 1 mpitest users   327 2006-11-15 16:39 run_mpi.sh.o164
mpipi is the program being run, mpipi.c is the source code, run_mpi.sh
is the script used to submit the job, and run_mpi.sh.e164 and
run_mpi.sh.o164 are the error and output files from job 164.
Note that the above is for an MPI (parallel) program, but the basic file
structure is the same.
You can kill current jobs by using the qdel command. Use this
command if your program is taking way longer than it should or you need
to run it with a different version of your program.
> qdel 170
would have killed the above job shown in qstat if it was still
running. Note that you can only kill your own jobs, not someone
else's, although you will be able to see other people's jobs in the
queue.
Running MPI (parallel) jobs with Torque
This is more complicated in that the script used must contain certain
commands. The first thing that must be done is to create an
mpd secretword. This secretword is like a password; its main
function is to distinguish jobs started by you from those
started by someone else. To do this, follow the commands below:
> cd $HOME
> echo "MPD_SECRETWORD=secretword" >> .mpd.conf
> chmod 600 .mpd.conf
(Replace secretword with your own secretword, NOT your password. You
don't really need to remember this; it is used "behind the scenes".)
These commands do the following:
make sure you are in your home directory;
place the text in quotes following the echo into a file named .mpd.conf;
change the permissions on the file .mpd.conf so that no one else can
read the file but you. See man chmod for more info on the chmod
command.
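You can check that the permissions are correct with ls -l; the output should look something like this (the size and date will differ):
> ls -l $HOME/.mpd.conf
-rw------- 1 justin users 25 2006-11-15 16:39 /home/justin/.mpd.conf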
It is beyond the scope of this document to describe programming
practices for MPI, merely implementation. However, there is a
fairly user-friendly "User's Guide to MPI" (PostScript) available that
is recommended reading on the subject.
You can compile your MPI programs with one of the following compilers
available on the OSU Physics cluster:
mpicc     MPI wrapper for the gcc 4.1.0 C compiler
mpiicc    MPI wrapper for the Intel 9.1 C compiler
mpif77    MPI wrapper for the gcc Fortran 77 3.3.5 compiler
mpif90    MPI wrapper for the gcc Fortran 90 4.1.0 compiler
mpiifort  MPI wrapper for the Intel Fortran 9.1 compiler (Fortran 90)
I have not tested performance differences between the various compilers.
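For example, to compile the mpipi program used below with the Intel C compiler wrapper:
> mpiicc mpipi.c -o mpipi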
Here is a sample script used for mpi jobs, with a description following
of each line:
--------------------------
#!/bin/bash
#
# All lines starting with "#PBS" are PBS commands
#
# Request 2 nodes with 2 processors per node (equals 4 processors)
# ppn can either be 1 or 2
#
#PBS -l nodes=2:ppn=2
#
# Set wall clock time to 0 hours, 15 minutes and 0 seconds
#PBS -l walltime=00:15:00
# Set the nice value to 19 so that it doesn't interfere with locally running programs
#PBS -l nice=19
# cd to working directory
cd $PBS_O_WORKDIR
# name of executable
myprog=mpipi
# The following counts how many processors were assigned,
# setting the NP variable to (nodes * ppn) from above
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
# Number of processors is $NP
# Run MYPROG with appropriate mpirun script
mpirun -r ssh -n $NP $myprog
# make sure to exit the script, else job won't finish properly
exit 0
-------------------------
Here is the mpipi program code I used, courtesy of Rubin Landau
(attached below as mpipi-c.txt).
Again, the script must start with the line #!/bin/bash. Note that
lines beginning with #PBS are commands to the Torque scheduler, not
comments.
Also note that the script must still be submitted using the program qsub:
mpitest@wngr412-pc01:~/mpi> qsub run_mpi.sh
Again, the following site has more details on #PBS commands available:
http://www.princeton.edu/~ktchu/computing/scientific_computing/PBS.html
The line
#PBS -l nodes=2:ppn=2 tells Torque to use 2 computers with 2 processors
per node, for a total of 4 processors. All of our machines are
dual-core machines, which means that they each have two
processors. If you want to use only one processor per machine,
change to ppn=1. ppn stands for processors per node.
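For example, if you would rather have each of your four processes on its own machine, you could request four nodes with one processor each:
#PBS -l nodes=4:ppn=1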
The line
#PBS -l walltime=00:15:00 tells Torque to kill the job if it takes
longer than 15 minutes to run. Note that this is actual run time,
not total time in the queue, meaning that if for some reason your job
doesn't start right away, this delay does not count against you.
In general it is a good idea to use a walltime kill command in case
your program is poorly implemented or stuck in a loop. You should
set the walltime to be about twice the time you expect the job to run.
For short jobs, a limit of 15 minutes is fine. However, if you
expect your job to run for several days, you may remove this
line. Note that if you have such jobs, let me know so that I
don't think they are runaway jobs to be killed.
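For example, if you expect your program to finish in about an hour, a reasonable request following the twice-the-expected-time rule would be:
#PBS -l walltime=02:00:00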
#PBS -l nice=19 tells the computer to give your program a nice value of 19. This makes sure that it runs in the background. See serial description above for more info.
$PBS_O_WORKDIR is the directory that you are executing the qsub command
from. If this line is left out, the following line with the name
of your program will have to be an absolute path, such as:
/home/justin/program
The total number of processors is named with the variable NP.
Note that this value cannot exceed that given by nodes * ppn, although
it can be smaller. In the above script, this is set automatically to
the max value by parsing the file $PBS_NODEFILE, which lists each
processor set aside by Torque on its own line. The command wc -l counts
the number of lines, and this result is passed to the program awk, which
strips off the filename, leaving just the number in a form usable by the
mpirun command.
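For example, with nodes=2:ppn=2 the file pointed to by $PBS_NODEFILE would contain four lines, something like the following (the hostnames here are only an illustration), so wc -l yields 4:
wngr412-pc01
wngr412-pc01
wngr412-pc02
wngr412-pc02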
The next line is the one that actually runs your program in the MPI
environment. mpirun is the command used to start the MPI
environment. -r ssh is required for the machines to be able to
communicate with each other using ssh and scp rather than rsh and
rcp; rsh and rcp are quite a bit less secure and so are not enabled
on the cluster. All communication must be done via ssh or
scp. -n $NP passes the NP variable giving the number of processors to
use, and $myprog is the program you compiled.
You must also give
the script a command to exit with status 0, or the job might have
problems writing the output file.
| Attachment | Size |
|---|---|
| mpipi-c.txt | 1.38 KB |
