Customize master node

Install Grid Engine

Grid Engine is typically used on a computer farm or high-performance computing (HPC) cluster and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space.

We use the open source Open Grid Scheduler as the batch-queuing system. It is one of the most commonly used pieces of software on HPC clusters around the world, and we aim to teach its use.
We will first install some dependency packages that I found missing in Raspbian.

sudo apt-get install csh  # csh shell
sudo apt-get install libpam0g-dev
sudo apt-get install lesstif2-dev

Then compile

wget http://sourceforge.net/projects/gridscheduler/files/GE2011.11p1/GE2011.11p1.tar.gz
tar zvxf GE2011.11p1.tar.gz
cd GE2011.11p1/source
export LDFLAGS=-L/usr/lib/arm-linux-gnueabihf
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -only-depend
./scripts/zerodepend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -no-qmon

Have a tea and a siesta: the build takes about 30 minutes.

The compilation succeeded.

Communication within the cluster

Edit the /etc/hosts file making sure the internal ip-address for the master node “grappolo” is correct. We also add the ip-addresses for node 1 and 2 which we will install and connect later.

127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
fe00::0         ip6-localnet
ff00::0         ip6-mcastprefix
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters
 
10.141.255.254  grappolo
10.141.0.1      node1
10.141.0.2      node2
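
Once the file is edited, the cluster entries can be checked mechanically. The following is a minimal sketch, assuming the three name/address pairs listed above; `check_hosts_file` is a hypothetical helper written for this page, not part of Grid Engine.

```shell
#!/bin/sh
# Sketch: verify that a hosts file maps each cluster name to its address.
# check_hosts_file is a hypothetical helper, not a Grid Engine tool.
check_hosts_file() {
    hosts_file=${1:-/etc/hosts}
    status=0
    # Expected "address name" pairs, copied from the listing above.
    while read -r addr name; do
        # Look for a line whose first field is the address and which
        # lists the name in one of the following fields.
        if awk -v a="$addr" -v n="$name" '
                $1 == a { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit found ? 0 : 1 }' "$hosts_file"; then
            echo "ok: $name -> $addr"
        else
            echo "MISSING: $name -> $addr"
            status=1
        fi
    done <<EOF
10.141.255.254 grappolo
10.141.0.1 node1
10.141.0.2 node2
EOF
    return $status
}
```

Calling `check_hosts_file` with no argument reads /etc/hosts; a different file can be passed as the first argument.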

Grid engine - system configuration

Create a new /etc/profile.d/grappolo.sh script that sets our system configuration and Grid Engine environment.

sudo leafpad /etc/profile.d/grappolo.sh
# add the following 3 lines to this script.
export SGE_ROOT=/usr/local/apps/sge/2011.11p1
export SGE_CELL="default"
export PATH=$PATH:/usr/local/apps/sge/2011.11p1/bin/linux-arm/
# save and close grappolo.sh
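
If you prefer not to use a graphical editor, the same three lines can be written with a here-document. This is only a sketch: `write_sge_profile` is a hypothetical helper, and its default target is a scratch file in the current directory so that nothing under /etc is touched unless you copy the result there yourself.

```shell
#!/bin/sh
# Sketch: write the three export lines to a file without opening an editor.
# write_sge_profile is a hypothetical helper; the default target is a
# scratch file, not /etc/profile.d/grappolo.sh.
write_sge_profile() {
    target=${1:-./grappolo.sh}
    # The quoted 'EOF' keeps $PATH literal so it is expanded at login time.
    cat > "$target" <<'EOF'
export SGE_ROOT=/usr/local/apps/sge/2011.11p1
export SGE_CELL="default"
export PATH=$PATH:/usr/local/apps/sge/2011.11p1/bin/linux-arm/
EOF
}
```

For example, `write_sge_profile /tmp/grappolo.sh` followed by `sudo cp /tmp/grappolo.sh /etc/profile.d/grappolo.sh` would install it.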

Then reboot.

sudo reboot
ssh -X pi@192.168.1.81 # again from your computer

SGE on master node

Set up the SGE on the master node. Insert the original SD card on the raspi master node grappolo, power on and proceed to set up the Grid Engine Scheduler.

Before installation be sure your system configurations are ok!

For this you can type:

admin@grappolo:~ $ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/usr/local/apps/sge/2011.11p1/bin/linux-arm/
admin@grappolo:~ $ echo $SGE_ROOT
/usr/local/apps/sge/2011.11p1
admin@grappolo:~ $ 
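
The same two checks can be wrapped in a small function, which is handy when repeating them on every node. A minimal sketch, assuming the linux-arm binary directory used throughout this page; `check_sge_env` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: confirm that SGE_ROOT is set and that its linux-arm bin
# directory is on PATH. check_sge_env is a hypothetical helper.
check_sge_env() {
    if [ -z "${SGE_ROOT:-}" ]; then
        echo "SGE_ROOT is not set"
        return 1
    fi
    # Wrap PATH in colons so that every entry is delimited on both sides;
    # accept the directory with or without a trailing slash.
    case ":$PATH:" in
        *":$SGE_ROOT/bin/linux-arm:"* | *":$SGE_ROOT/bin/linux-arm/:"*)
            echo "environment looks ok"
            ;;
        *)
            echo "PATH does not contain $SGE_ROOT/bin/linux-arm"
            return 1
            ;;
    esac
}
```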

Now set up the Grid Engine on the master node grappolo as follows:

sudo mkdir -p /usr/local/apps/sge/2011.11p1
sudo su
chmod 777 /usr/local/apps/sge/2011.11p1/
exit
# export PATH=$PATH:/usr/local/apps/sge/2011.11p1/bin/linux-arm/
# export SGE_ROOT=/usr/local/apps/sge/2011.11p1 
# we have already double checked the previous two lines are correct
cd  /usr/local/apps/sge/2011.11p1
scripts/distinst -all -local -noexit
cd $SGE_ROOT
./install_qmaster

The interactive install console will start; we follow settings similar to those suggested at FslSGE:

  • press enter at the intro screen
  • press “y” and then specify admin as the user id
  • leave the install dir as /opt/sge
  • You will now be asked about port configuration for the master, normally you would choose the default (2) which uses the /etc/services file
  • accept the sgeqmaster info
  • you will then be asked about port configuration again (this time for the execution daemon); again choose the default (2), which uses the /etc/services file
  • accept the sgeexecd info
  • leave the cell name as “default”
  • Enter grappolo as the appropriate cluster name when requested
  • leave the spool dir as is
  • press “n” for no windows hosts!
  • press “y” (permissions are set correctly)
  • press “y” for all hosts in one domain
  • when asked about the Java available on your qmaster and whether you wish to use SGE Inspect or SDM (which would enable the JMX MBean server), answer “n”
  • press enter to accept the directory creation notification
  • enter “classic” for classic spooling (berkeleydb may be more appropriate for large clusters)
  • press enter to accept the next notice
  • enter the default “20000-20100” as the GID range (increase this range if you have execution nodes capable of running more than 100 concurrent jobs)
  • accept the default spool dir or specify a different folder (for example if you wish to use a shared or local folder outside of SGE_ROOT)
  • email address enter default (no email)
  • press “n” to refuse to change the parameters you have just configured
  • press enter to accept the next notice
  • press “y” to install the startup scripts
  • press enter twice to confirm the following messages
  • press “n” for a file with a list of hosts
  • enter node1 as the name of the host that will be able to administer and submit jobs (press enter alone to finish adding hosts)
  • shadow hosts: accept the default
  • choose “1” for normal configuration and agree with “y”
  • press enter to accept the next message and “n” to refuse to see the previous screen again and then finally enter to exit the installer

Add grappolo as submit node

admin@grappolo:~ $  qconf -as grappolo
grappolo added to submit host list
 
The Grid Engine should start automatically at reboot. If it does not, start it on grappolo using /etc/init.d/sgemaster.grappolo

Bash as default terminal

By default the queue all.q uses the csh shell, so we change it to bash with the SGE command qconf.

  qconf -mq all.q
  # replace   /bin/csh with /bin/bash

Customize slave nodes

The following operations must be performed once on each compute node of the cluster. First insert the cloned SD card into the same Raspberry Pi you used before, then connect via ssh from a remote computer.

Hostname and password

Log into the pi :

  • change hostname “grappolo” to “node1” and later “node2”;
sudo nano /etc/hostname 
  • change password to node1pw and later node2pw
passwd 

Node communications

  • edit the /etc/network/interfaces file as below:
auto lo
iface lo inet loopback
 
auto eth0
allow-hotplug eth0
iface eth0 inet static
address 10.141.0.1 
netmask 255.255.0.0
network 10.141.0.0
broadcast 10.141.255.255
gateway 10.141.255.254
In the /etc/network/interfaces file of node2 use address 10.141.0.2
  • edit /etc/hosts file as below:
127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
fe00::0         ip6-localnet
ff00::0         ip6-mcastprefix
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters
 
10.141.255.254  grappolo
10.141.0.1      node1
10.141.0.2      node2
The /etc/hosts files for node1 and node2 are identical.
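
Since the interfaces file differs between nodes only in the address line, it can be generated from the node number. This is a sketch under the assumption that node N always gets 10.141.0.N, as is the case for node1 and node2 here; `print_node_interfaces` is a hypothetical helper that only prints the text and writes nothing.

```shell
#!/bin/sh
# Sketch: print the static eth0 configuration for node N, assuming the
# address scheme 10.141.0.N used for node1 and node2 on this page.
# print_node_interfaces is a hypothetical helper.
print_node_interfaces() {
    n=$1
    # Reject an empty or non-numeric node number.
    case "$n" in
        '' | *[!0-9]*)
            echo "usage: print_node_interfaces NODE_NUMBER" >&2
            return 2
            ;;
    esac
    cat <<EOF
auto lo
iface lo inet loopback

auto eth0
allow-hotplug eth0
iface eth0 inet static
address 10.141.0.$n
netmask 255.255.0.0
network 10.141.0.0
broadcast 10.141.255.255
gateway 10.141.255.254
EOF
}
```

Review the output of `print_node_interfaces 2` before copying it to /etc/network/interfaces on node2.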

Grid engine - system configuration

As for the master node, create a new /etc/profile.d/grappolo.sh script on both nodes that sets our system configuration and Grid Engine environment.

sudo leafpad /etc/profile.d/grappolo.sh
# add the following 3 lines to this script.
export SGE_ROOT=/usr/local/apps/sge/2011.11p1
export SGE_CELL="default"
export PATH=$PATH:/usr/local/apps/sge/2011.11p1/bin/linux-arm/
# save and close grappolo.sh

General settings

Share the /usr/local/apps folder of grappolo over NFS:

  • create the folder on node1
  • mount the corresponding folder from grappolo on it.
sudo mkdir /usr/local/apps
sudo mount grappolo:/usr/local/apps /usr/local/apps

To mount /usr/local/apps automatically on node1 and node2 at reboot, edit the /etc/fstab file and append the following line:

grappolo:/usr/local/apps	/usr/local/apps	nfs	rsize=8192,wsize=8192,timeo=14,intr,rw	0	0
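
When this step is repeated on several nodes it is easy to append the line twice. The sketch below adds the entry only if nothing is already mounted on /usr/local/apps in the given file; `add_fstab_entry` is a hypothetical helper, and its default target is a scratch file rather than the real /etc/fstab.

```shell
#!/bin/sh
# Sketch: append the NFS entry to an fstab-style file only once.
# add_fstab_entry is a hypothetical helper; by default it works on a
# scratch file, so pass /etc/fstab explicitly (as root) to use it for real.
add_fstab_entry() {
    fstab=${1:-./fstab.test}
    # Field 2 of an fstab line is the mount point; ignore comment lines.
    if awk '$1 !~ /^#/ && $2 == "/usr/local/apps" { found = 1 }
            END { exit found ? 0 : 1 }' "$fstab" 2>/dev/null; then
        echo "entry already present in $fstab"
    else
        printf 'grappolo:/usr/local/apps\t/usr/local/apps\tnfs\trsize=8192,wsize=8192,timeo=14,intr,rw\t0\t0\n' >> "$fstab"
        echo "entry added to $fstab"
    fi
}
```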

Allow grappolo to ssh into node1 and node 2 without the need to type a password: Exit from node1, generate a key pair and copy the key from grappolo to node1.

pi@node1 ~ $ exit
pi@grappolo ~ $ ssh-keygen -t rsa -P ""
pi@grappolo ~ $ ssh-copy-id admin@node1
pi@grappolo ~ $ ssh-copy-id admin@node2
pi@grappolo ~ $ ssh admin@node1
pi@node1 ~ $ exit
pi@grappolo ~ $ ssh admin@node2
pi@node2 ~ $ exit
pi@grappolo ~ $

Allow node1 and node2 to ssh into grappolo without a password (the other way around from before):

  • Log in to node1 from grappolo.
  • Generate a key pair.
  • Copy the key from node1 to grappolo.
  • Repeat the three steps above for node2.
pi@grappolo ~ $ ssh admin@node1
pi@node1 ~ $ ssh-keygen -t rsa -P ""
pi@node1 ~ $ ssh-copy-id 'admin@grappolo'
pi@node1 ~ $ ssh admin@grappolo
pi@grappolo ~ $ exit
pi@node1 ~ $ exit
pi@grappolo ~ $ ssh admin@node2
pi@node2 ~ $ ssh-keygen -t rsa -P ""
pi@node2 ~ $ ssh-copy-id 'admin@grappolo'
pi@node2 ~ $ ssh admin@grappolo
pi@grappolo ~ $ exit
pi@node2 ~ $ exit
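
To confirm that every direction works without a password prompt, ssh can be run in batch mode, where it fails instead of asking for a password. A minimal sketch; `check_passwordless_ssh` is a hypothetical helper, while `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/bin/sh
# Sketch: test key-based login to each target given on the command line.
# check_passwordless_ssh is a hypothetical helper.
check_passwordless_ssh() {
    status=0
    for target in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password;
        # "true" is a remote command that does nothing and exits 0.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true 2>/dev/null; then
            echo "ok: $target"
        else
            echo "FAILED: $target"
            status=1
        fi
    done
    return $status
}
```

On grappolo one would run `check_passwordless_ssh admin@node1 admin@node2`, and on each node `check_passwordless_ssh admin@grappolo`.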

Installation of execution nodes

  • Reboot all nodes (grappolo, node1 and node2).
  • Log in to grappolo
    • Check that the NFS server is started and active

    admin@grappolo:~ $ rpcinfo -p | grep portmapper | head -1
    100000 4 tcp 111 portmapper
    admin@grappolo:~ $ rpcinfo -p | grep nfs | head -1
    100003 2 tcp 2049 nfs

If rpcinfo reports no nfs or portmapper entry, restart the NFS server:

   sudo service rpcbind restart
   sudo /etc/init.d/nfs-kernel-server restart
   sudo mount grappolo:/usr/local/apps /usr/local/apps/
  • Log in to node1 and node2 and check that the NFS share is available

ls /usr/local/apps/sge

 /2011.11p1/ 

If you do not see the 2011.11p1 folder, restart the NFS server as for grappolo.

  • Log back in to grappolo and check that SGE is active.

    ps -aux | grep sge
    admin 720 0.0 0.4 112508 4504 ? Sl 15:37 0:06 /usr/local/apps/sge/2011.11p1/bin/linux-arm/sge_qmaster
    admin 1648 0.0 0.1 4264 1840 pts/0 S+ 17:43 0:00 grep --color=auto sge

Restart the SGE if needed:

   /etc/init.d/sgemaster.grappolo

In grappolo we should see the following settings

   admin@grappolo:~ $ qconf -ss  # Displays the Grid Engine submit host list.
   grappolo
   node1
   node2
   
   admin@grappolo:~ $ qconf -sh  # Show current administrative hosts 
   grappolo
   node1
   node2

Now install the execution nodes. Perform the following in both nodes:

 cd $SGE_ROOT
 sudo ./install_execd
 

In the installation prompts we accept all default settings, as below:

  • Press ENTER at use default [/usr/local/apps/sge/2011.11p1]
  • Press ENTER at Please enter cell name which you used for the qmaster
  • Press ENTER at Using cell: >default<
  • Press ENTER at The port for sgeexecd is currently set as service. sgeexecd service set to port 6445
  • Press ENTER at This hostname is known at qmaster as an administrative host.
  • Press ENTER at The spool directory is currently set to: «/usr/local/apps/sge/2011.11p1/default/spool/node1» Do you want to configure a different spool directory for this host (y/n) [n] »
  • Press ENTER at Creating local configuration admin@node1 added “node1” to configuration list
  • Press ENTER at We can install the startup script that will start execd at machine boot (y/n) [y] »
  • Press ENTER at cp /usr/local/apps/sge/2011.11p1/default/common/sgeexecd /etc/init.d/sgeexecd.grappolo followed by /sbin/insserv /etc/init.d/sgeexecd.grappolo
  • Press ENTER at Starting execution daemon. Please wait … starting sge_execd
  • Press ENTER at Do you want to add a default queue instance for this host (y/n) [y]
  • Press ENTER at root@node1 modified “@allhosts” in host group list; root@node1 modified “all.q” in cluster queue list

Bash as default terminal

On the master node we changed the default shell of the queue all.q from csh to bash. Now double-check that the change was applied using qconf -sq all.q . If it was not, modify the shell again using qconf -mq all.q
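
The output of qconf -sq all.q is a list of attribute names followed by their values, so the shell setting can be picked out with awk. A minimal sketch that reads that output on standard input; `queue_shell` and `check_queue_shell` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: extract the "shell" attribute from "qconf -sq" output on stdin.
# queue_shell and check_queue_shell are hypothetical helpers.
queue_shell() {
    # Match the attribute name exactly so that "shell_start_mode" is skipped.
    awk '$1 == "shell" { print $2 }'
}

check_queue_shell() {
    expected=${1:-/bin/bash}
    actual=$(queue_shell)
    if [ "$actual" = "$expected" ]; then
        echo "queue shell is $actual"
    else
        echo "queue shell is '$actual', expected $expected"
        return 1
    fi
}
```

Typical use would be `qconf -sq all.q | check_queue_shell`.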

for nbr in $(seq 1 32) ; do qsub -v nbr=$nbr testsge.sh ; done
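
The contents of testsge.sh are not given on this page. The following is a hypothetical stand-in, assuming all the job needs to do is report which host it ran on; the `#$` lines are embedded qsub options (`-S` for the job shell, `-cwd` to run in the submission directory, `-j y` to merge stderr into stdout).

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
# Hypothetical testsge.sh: the real script used on this page is not shown.

# nbr is passed in by "qsub -v nbr=..."; default to 0 when run by hand.
report_job() {
    echo "job ${nbr:-0} ran on $(uname -n)"
}

report_job
```

With the loop above, the 32 output files should then show how the jobs were spread over node1 and node2.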

Troubleshooting

  cat  /usr/local/apps/sge/2011.11p1/default/spool/qmaster/messages 
  qstat -explain c -j _Job-ID_

http://gridscheduler.sourceforge.net/htmlman/
common tasks with qconf
http://gridscheduler.sourceforge.net/howto/commontasks.html
solve qw faults
https://arc.liv.ac.uk/pipermail/gridengine-users/2006-May/010095.html

Adding new node
http://verahill.blogspot.co.uk/2013/08/501-briefly-adding-new-node-to-sge.html

to check hosts:

  qhost
  admin@grappolo:~ $ qhost
  HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
  -------------------------------------------------------------------------------
  global                  -               -     -       -       -       -       -
  node1                   linux-arm       4     -  973.5M       -  100.0M       -
  node2                   linux-arm       4     -  973.5M       -  100.0M       -
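
In the listing above the LOAD column of both nodes shows "-". As far as I understand, this usually means that the execution daemon on that host is not (or not yet) reporting load values to the qmaster, so it is worth checking. A minimal sketch that lists such hosts from qhost output on standard input; `hosts_without_load` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print the hosts whose LOAD column in "qhost" output is "-".
# hosts_without_load is a hypothetical helper.
hosts_without_load() {
    # Skip the two header lines and the "global" pseudo-host;
    # field 4 is the LOAD column.
    awk 'NR > 2 && $1 != "global" && $4 == "-" { print $1 }'
}
```

Run it as `qhost | hosts_without_load`; an empty result means every host reports a load value.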
wiki/grappolo_ii.txt · Last modified: 2017/12/05 22:53 (external edit)