Some general information for SAS users.
SAS, the statistical analysis software (http://sas.com) frequently used in the social sciences, is available on the High Performance Academic Computing Cluster. It is not a parallel version of SAS, but we do offer an unlimited Linux license for teaching and research.
SAS is typically invoked in batch mode by submitting a script (a *.sas text file). SAS will generate a log file (*.log) and a listing file (*.lst). The former shows you what happened during the run; the latter contains the output of the invoked procedures.
SAS can be invoked in interactive mode on the head node for debugging and code development, if needed. However, this is not supported on the compute nodes. Hence, if you need to generate graphical output, you will have to use SAS/GRAPH or the Output Delivery System (SAS/ODS). Examples of code can be found at a variety of locations:
A tutor application is available at http://sas.wesleyan.edu/SASOnlineTutor/sot91/index.htm
So let's generate a small SAS program using a Unix editor like vi/vim, emacs, or pico.
test.dat
1234567890
0987654321
2468097531
test.sas
which does the obvious:

options nocenter;
filename test './test.dat';

data one;
  infile test;
  input @2 x 3.1 @6 y 3.1;
  total = x * y;
run;

proc print;
run;
[root@greentail sas]# ll
total 8
-rw-r--r-- 1 root root  33 Dec 21 10:16 test.dat
-rw-r--r-- 1 root root 140 Dec 21 10:22 test.sas

[root@greentail sas]# sas test

[root@greentail sas]# cat test.lst
The SAS System      10:24 Wednesday, December 21, 2011   1

Obs      x       y      total
 1     23.4    67.8    1586.52
 2     98.7    54.3    5359.41
 3     46.8    97.5    4563.00
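For readers unfamiliar with SAS column-pointer input: `@2 x 3.1` reads three characters starting at column 2 with one implied decimal place, and `@6 y 3.1` does the same starting at column 6. As a sanity check (not part of the cluster workflow), the same computation can be sketched in awk against the test.dat from the example above:

```shell
# Recreate the three-line test.dat from the example above.
printf '1234567890\n0987654321\n2468097531\n' > test.dat

# Mimic the SAS input statement: @2 x 3.1 -> columns 2-4 with one
# implied decimal; @6 y 3.1 -> columns 6-8 the same way.
awk '{
  x = substr($0, 2, 3) / 10    # e.g. "234" -> 23.4
  y = substr($0, 6, 3) / 10    # e.g. "678" -> 67.8
  printf "%.1f %.1f %.2f\n", x, y, x * y
}' test.dat
```

This prints the same x, y, and total values that appear in test.lst (1586.52, 5359.41, 4563.00).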
Ok, so now we have a program that works. Next we want to submit it, maybe a dozen times (perhaps with different input data or different calculations). To do that we will write a shell script that invokes this SAS program and hands it off to the scheduler (Lava). The scheduler will figure out which compute nodes are idle and run your programs on your behalf.
run
which we make executable for submission:

chmod u+x run
#!/bin/bash
# submit via 'bsub < run'

#BSUB -q hp12
#BSUB -J test
#BSUB -o stdout
#BSUB -e stderr

time sas test
A leading '#' marks a comment in shell scripting, but the scheduler specifically looks for lines with a leading '#BSUB' tag and interprets them: -q (select a queue), -J (set the job name), -o (save STDOUT to a file), -e (the same for STDERR). See 'man bsub' for more information. The last line then defines what to run; here we prefix the command with the Unix utility 'time', which reports the run time to STDERR.
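Because the #BSUB lines are ordinary comments to bash, you can see exactly which directives the scheduler will pick out of a script with a simple grep. This sketch recreates the example run script in the current directory and lists its directives (purely illustrative; the scheduler does this parsing for you):

```shell
# Write the example submission script from above to a file named 'run'.
cat > run <<'EOF'
#!/bin/bash
# submit via 'bsub < run'
#BSUB -q hp12
#BSUB -J test
#BSUB -o stdout
#BSUB -e stderr
time sas test
EOF

# The shell skips '#BSUB' lines as comments; the scheduler scans for them.
# Show only the scheduler directives, stripped of the '#BSUB ' prefix.
grep '^#BSUB' run | sed 's/^#BSUB //'
```

This prints the four option/value pairs (-q hp12, -J test, -o stdout, -e stderr) while an ordinary `bash run` would ignore them entirely.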
[hmeij@greentail sas]$ bsub < run
Job <492637> is submitted to queue <hp12>.

[hmeij@greentail sas]$ bjobs
JOBID   USER   STAT  QUEUE   FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
492637  hmeij  RUN   hp12    greentail   n10         test       Dec 21 10:49

[hmeij@greentail sas]$ bqueues
QUEUE_NAME   PRIO  STATUS        MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
hp12          50   Open:Active   256     -     -     -    219     0  219     0
matlab        50   Open:Active     8     8     -     8      0     0    0     0
stata         50   Open:Active     6     6     -     6      0     0    0     0
elw           50   Open:Active    60     -     -     -      0     0    0     0
emw           50   Open:Active    32     -     -     -      8     0    8     0
ehw           50   Open:Active    32     -     -     -      8     0    8     0
ehwfd         50   Open:Active    32     -     -     -      8     0    8     0
imw           50   Open:Active   128     -     -     -     32     0   32     0
bss24         50   Open:Active    90     -     -     -      0     0    0     0

[hmeij@greentail sas]$ bjobs
JOBID   USER   STAT  QUEUE   FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
492637  hmeij  RUN   hp12    greentail   n10         test       Dec 21 10:49

[hmeij@greentail sas]$ bjobs
No unfinished job found

[hmeij@greentail sas]$ ll
total 28
-rwxr--r-- 1 hmeij its  115 Dec 21 10:48 run
-rw-r--r-- 1 hmeij its   42 Dec 21 10:49 stderr
-rw-r--r-- 1 hmeij its  838 Dec 21 10:49 stdout
-rw-r--r-- 1 hmeij its   33 Dec 21 10:16 test.dat
-rw-r--r-- 1 hmeij its 2565 Dec 21 10:49 test.log
-rw-r--r-- 1 hmeij its  258 Dec 21 10:49 test.lst
-rw-r--r-- 1 hmeij its  140 Dec 21 10:22 test.sas
And so the job was dispatched to host n10 for execution. The results are posted in my home directory; in fact, the entire job ran in my home directory while executing on the remote compute node. I may not want that if I read or generate a lot of data, so we're going to add some statements to the script next. I may also want to reserve some memory, so the scheduler neither dispatches my job to hosts with insufficient memory available nor dispatches a later job that causes memory conflicts with mine.
The hp12 queue is the default queue on the cluster greentail; each of its compute nodes has a 12 GB memory footprint. Memory footprints of the hosts serving the other queues differ; please consult http://petaltail.wesleyan.edu/cgi-bin/bqueues_web.cgi (it contains some old data…) for information about the other queues.
On the back-end compute nodes, unless specified otherwise, the job runs inside your home directory. That job then competes with all other activity inside /home. Compute nodes offer two other areas where jobs can run: /localscratch and /sanscratch. The former is a local filesystem on each node and should be used if file locking is essential. The latter is a 5 TB filesystem served from greentail's disk array via IPoIB (that is, NFS traffic over the fast interconnect switches; performance should be much better than over gigabit ethernet switches). It is comprised of disks and spindles that are not impacted by what happens on /home, so we're going to use that.
In the SAS program we add the following lines:
%let jobpid = %sysget(LSB_JOBID);
libname here "/sanscratch/&jobpid";
And change this line so the dataset is stored in the scratch library instead of /home:
data here.one;
In the submission script we change the following:
#!/bin/bash
# submit via 'bsub < run'

#BSUB -q hp12
#BSUB -J test
#BSUB -o stdout
#BSUB -e stderr
#BSUB -n 1
#BSUB -R "rusage[mem=200]"

# unique job dir in scratch
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

cp ~/sas/test.dat ~/sas/test.sas .
time sas test
cp test.log test.lst ~/sas
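The stage-in/work/stage-out pattern in that script can be tried outside the scheduler. In this sketch, /tmp/sanscratch-demo stands in for /sanscratch and the shell's PID stands in for $LSB_JOBID (both substitutions are assumptions for local testing only; on the cluster the scheduler provides the job id and scratch directory):

```shell
# Local dry run of the scratch pattern: unique dir, stage in, work, stage out.
JOBID=$$                                   # stand-in for $LSB_JOBID
MYSCRATCH=/tmp/sanscratch-demo/$JOBID      # stand-in for /sanscratch/$LSB_JOBID
mkdir -p "$MYSCRATCH"
cd "$MYSCRATCH"

printf '1234567890\n' > test.dat           # stage in (stands in for cp ~/sas/... .)
echo "working in $PWD"                     # the real job would run 'time sas test' here
cp test.dat /tmp/sanscratch-demo/          # stage out (stands in for cp ... ~/sas)
```

The point of the pattern is that all I/O during the run happens in the scratch directory, and only the small result files travel back to /home at the end.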
[hmeij@greentail sas]$ ll /sanscratch/492667/
total 16
-rw-r--r-- 1 hmeij its   33 Dec 21 14:31 test.dat
-rw-r--r-- 1 hmeij its 2568 Dec 21 14:31 test.log
-rw-r--r-- 1 hmeij its  258 Dec 21 14:31 test.lst
-rw-r--r-- 1 hmeij its  140 Dec 21 14:31 test.sas