NAME

MasterSlave - A distributed batch processing system using a master/slave paradigm, written in Java using RMI.


DESCRIPTION

MasterSlave is a distributed batch processing system based on a master/slave paradigm and implemented with the Remote Method Invocation (RMI) capabilities of Java. RMI was chosen because the server and the clients are all Java applications. Moreover, this suite of programs does not require a complex naming service, so the naming service provided by RMI is sufficient. RMI security is used for the server and all the clients. The architecture and runtime dynamics of the system can be summarized as follows:

Upon start up, the server reads an initialization file which contains an integer, called a batch-id, which will be assigned to the next batch of jobs submitted by a user. After assigning the batch-id to a submitted batch, the server increments the integer by one and writes it back to the initialization file. The initialization file is a poor man's approach to persistence. The batch-id is necessary because the user assigns a job-id to every job to be farmed out, and these job-ids are unique only within the batch. Since there is no guarantee that different users will choose different job-ids, the server must generate a unique batch-id to differentiate the jobs submitted by different users. The jobs submitted by a user are stored in the Vector jobQueue. A Vector was chosen for the job queue because it is synchronized, and all methods which access the Vector jobQueue are synchronized as well.
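The batch-id bookkeeping described above can be sketched as follows. This is a minimal illustration, not the BatchId class shipped with MasterSlave: the class name BatchIdFileSketch, the method name assignBatchId, and the assumption that the initialization file holds nothing but the integer are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; assumes the initialization file contains only one integer. */
class BatchIdFileSketch {
    private final Path initFile;
    private int next;

    /** Reads the next batch-id from the initialization file at start up. */
    BatchIdFileSketch(Path initFile) throws IOException {
        this.initFile = initFile;
        String text = new String(Files.readAllBytes(initFile), StandardCharsets.UTF_8);
        this.next = Integer.parseInt(text.trim());
    }

    /**
     * Hands out the current batch-id, then increments it and writes the new
     * value back to the file so that it survives a server restart.
     * Synchronized so that two concurrent submissions never share an id.
     */
    synchronized int assignBatchId() throws IOException {
        int assigned = next;
        next = assigned + 1;
        Files.write(initFile, Integer.toString(next).getBytes(StandardCharsets.UTF_8));
        return assigned;
    }
}
```

Writing the file on every assignment keeps the scheme simple, at the cost of one small disk write per submitted batch.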

The client must register itself with the server before it can request any work. This is necessary because some jobs are more resource intensive than others, and it is the task of the server to match the remaining available jobs to the workstations where the clients are running. The user can let a job run on the first available workstation by setting the job's machineClass to `any'. The client's machine name and machine type are kept in the Hashtable slaveCrew.

When an idle client requests work, the server extracts an appropriate job from the jobQueue and transfers it to the Hashtable jobRunning. The server is responsible for matching the remaining jobs with the resources of the client. This is possible because the server has a record of the type of machine the client is running on, and every job submitted by the user must specify which type of machine it is to be run on. The client is responsible for removing the job from the Hashtable jobRunning when done.
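The hand-off between jobQueue and jobRunning might look roughly like the sketch below. Only the names jobQueue and jobRunning come from this document; the class JobDispatchSketch, the nested Job class, the method names, and the first-match policy are assumptions made for illustration.

```java
import java.util.Hashtable;
import java.util.List;
import java.util.Vector;

/** Hypothetical sketch of matching queued jobs to a client's machine classes. */
class JobDispatchSketch {
    /** One row of the command file plus the batch-id assigned by the server. */
    static final class Job {
        final int batchId;
        final String id;
        final String machineClass;
        final String cmd;

        Job(int batchId, String id, String machineClass, String cmd) {
            this.batchId = batchId;
            this.id = id;
            this.machineClass = machineClass;
            this.cmd = cmd;
        }

        /** batch-id plus job-id is unique across users. */
        String key() {
            return batchId + "_" + id;
        }
    }

    private final Vector<Job> jobQueue = new Vector<>();
    private final Hashtable<String, Job> jobRunning = new Hashtable<>();

    synchronized void submit(Job job) {
        jobQueue.add(job);
    }

    /**
     * Returns the first queued job whose machineClass is `any' or is one of
     * the client's classes, and moves it to jobRunning. Returns null when
     * no queued job fits this client.
     */
    synchronized Job requestWork(List<String> clientClasses) {
        for (int i = 0; i < jobQueue.size(); i++) {
            Job job = jobQueue.get(i);
            if (job.machineClass.equals("any") || clientClasses.contains(job.machineClass)) {
                jobQueue.remove(i);
                jobRunning.put(job.key(), job);
                return job;
            }
        }
        return null;
    }

    /** Called by the client when it has finished the job. */
    synchronized void jobDone(Job job) {
        jobRunning.remove(job.key());
    }

    synchronized int queued() {
        return jobQueue.size();
    }

    synchronized int running() {
        return jobRunning.size();
    }
}
```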

The Hashtable slaveCrew can be accessed by an external user to kill or suspend a slave, see To dynamically remove slave. The user must specify which slave is to be removed or suspended. Every slave has a numerical suffix assigned to it, since there may be more than one instance of a slave on workstations with Symmetric Multi-Processors (SMP). The user can kill or suspend selected slaves by entering the correct suffix, or all slaves on a workstation by entering the wild card `*'. A kill signal of -1 will kill the slaves, and a kill signal of an integer X greater than zero will suspend the slaves for X milliseconds.
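A minimal sketch of the kill/suspend bookkeeping is given below. It assumes that slaveCrew maps a slave name of the form host.suffix to a pending signal and that 0 means "nothing pending"; the class SlaveSignalSketch and its method names are hypothetical, and the real slaveCrew also stores the machine type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Hashtable;
import java.util.List;

/** Hypothetical sketch of recording kill/suspend signals per slave. */
class SlaveSignalSketch {
    static final int KILL = -1;
    static final int NONE = 0; // assumption: 0 means no signal is pending

    // slave name ("host.suffix") -> pending signal
    private final Hashtable<String, Integer> slaveCrew = new Hashtable<>();

    synchronized void register(String host, int suffix) {
        slaveCrew.put(host + "." + suffix, NONE);
    }

    /**
     * Records a signal for one slave ("ennui.0") or for every instance on a
     * host ("ennui.*"). A signal of -1 kills; X > 0 suspends for X ms.
     * Returns the sorted names of the slaves that were signalled.
     */
    synchronized List<String> signal(String target, int signal) {
        if (signal != KILL && signal <= 0) {
            throw new IllegalArgumentException("signal must be -1 or greater than zero");
        }
        boolean wildcard = target.endsWith(".*");
        // "ennui.*" becomes the prefix "ennui." so that "ennui2.0" does not match
        String prefix = wildcard ? target.substring(0, target.length() - 1) : target;
        List<String> hit = new ArrayList<>();
        for (String name : slaveCrew.keySet()) {
            if (wildcard ? name.startsWith(prefix) : name.equals(target)) {
                hit.add(name);
            }
        }
        for (String name : hit) {
            slaveCrew.put(name, signal);
        }
        Collections.sort(hit);
        return hit;
    }

    /** What a slave would read back before requesting work. */
    synchronized int pending(String name) {
        Integer s = slaveCrew.get(name);
        return s == null ? NONE : s;
    }
}
```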

The client must unregister itself with the server when there are no more jobs left in the Vector jobQueue for the client to run.

The file Master.java contains the interface Master, which extends the java.rmi.Remote interface to define the exported methods that the server implements and the clients can invoke remotely. The class MasterImpl, defined in the file MasterImpl.java, implements this remote interface. The class MasterServer creates an instance of the remote object. The kornshell script startMaster starts the rmiregistry and remotely logs in to a specified machine on the network to start the server (this is done by using rsh to the host machine to start the class MasterServer).
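The division of labour between the remote interface, its implementation and the server class can be illustrated with the sketch below. The names MasterSketch, MasterSketchImpl and MasterSketchServer and the two remote methods are hypothetical; the methods actually exported by Master are documented in the javadoc. The sketch assumes that an rmiregistry is already running on the host named in the RMI URL.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Vector;

/** Hypothetical remote interface; every remote method declares RemoteException. */
interface MasterSketch extends Remote {
    void submitJob(String cmd) throws RemoteException;

    /** Returns the next command to run, or null when the queue is empty. */
    String requestWork() throws RemoteException;
}

/** Hypothetical implementation backed by a synchronized job queue. */
class MasterSketchImpl implements MasterSketch {
    private final Vector<String> jobQueue = new Vector<>();

    @Override
    public synchronized void submitJob(String cmd) {
        jobQueue.add(cmd);
    }

    @Override
    public synchronized String requestWork() {
        return jobQueue.isEmpty() ? null : jobQueue.remove(0);
    }
}

/** Creates the remote object and binds it under an RMI URL such as rmi://pandora/Zeus. */
class MasterSketchServer {
    public static void main(String[] args) throws Exception {
        String url = args[0];
        // Export on an anonymous port, then publish the stub in the registry.
        MasterSketch stub = (MasterSketch) UnicastRemoteObject.exportObject(new MasterSketchImpl(), 0);
        Naming.rebind(url, stub);
        System.out.println("bound " + url);
    }
}
```

A client would obtain the same interface with Naming.lookup on that URL and call requestWork as if it were a local method.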

Before running any of the clients, the user must prepare three kinds of files: two RDB files, and a property file for each of the clients. An RDB file is a table for a relational database system which uses a flat file ASCII format to store data; the columns of the table are tab delimited. The property file is of the form key=value. The required contents of these files are described in the USAGE section; see Sample RDB command file, Sample RDB client file and Sample client property file.

The classes SubmitJob, KillClient, and SlaveClient, which inherit from the abstract base class MasterSlaveClient, are the three main classes for the clients.

The class SubmitJob is used to submit a list of jobs (see Sample RDB command file) to the server. The class verifies that the job-ids created by the user are unique. The kornshell script startSlave, with the options -j, -m and -s, should be used to run the SubmitJob class. See To submit a job only.

The class KillClient is used to kill or suspend a slave. This class enables the user to dynamically remove clients from the pool of clients available to run jobs. The script killSlave can be used to kill or suspend a slave. See To dynamically remove slave or To dynamically suspend a slave for X milliseconds.

On the client side, the bulk of the work is done by the class SlaveClient. Upon start up, the class loads the property file, which contains the value for the keyword `machineClass' identifying the type of machine the client is running on. The value of the keyword machineClass can be a comma-delimited list of machine classes. The property file also contains the black out period for the client, that is, the time during which the client is forbidden from doing any work, usually the normal working hours. See Sample client property file for an example of a property file. The client then registers itself with the server; at this point the server knows the name and the type of machine the client is running on. This enables the server to properly assign a job to the client whenever the client is idle and needs work.

Before requesting any work from the server, the client gets the current time and compares it with the black out period. If the current time falls within the black out period, the client calculates how long it must sleep until the black out period is over. If the current time is outside the black out period, the client first checks the Hashtable slaveCrew in the server to make sure that no user has requested it to be terminated. This is how a client can be dynamically terminated. If no one has requested the client to be terminated, the client requests work from the server.

When the client is done with the current job, it must remove the job from the Hashtable jobRunning within the server. The client continues to request work until there is no more work to be done. When invoking a program from Java, the client traps stdout and stderr and prints the captured output if it is non-empty. The client also keeps a log of the jobs it has run.

The kornshell script startSlave, with the options -c, -m and -s, should be used to run the SlaveClient. See To dynamically add slaves.
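The black out check that the client performs before requesting work can be sketched as follows. This is not the BlackOutPeriod class from the source: the class name BlackOutSketch and the method millisToSleep are hypothetical, the period is assumed to begin and end on the same day, and a period whose begin equals its end (as in the 0:00 to 0:00 entries of the sample property file) is assumed to mean that there is no black out.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical sketch of the black out calculation. */
class BlackOutSketch {
    /**
     * Returns how many milliseconds the client must sleep before it may
     * request work, or 0 when the current time is outside the period.
     * The period covers begin (inclusive) up to end (exclusive).
     */
    static long millisToSleep(LocalTime now, LocalTime begin, LocalTime end) {
        if (begin.equals(end)) {
            return 0L; // assumption: an empty period means no black out
        }
        boolean inside = !now.isBefore(begin) && now.isBefore(end);
        return inside ? Duration.between(now, end).toMillis() : 0L;
    }

    public static void main(String[] args) {
        // Uses the Monday entries of the sample property file.
        long ms = millisToSleep(LocalTime.now(), LocalTime.of(8, 30), LocalTime.of(18, 0));
        System.out.println("sleep for " + ms + " ms");
    }
}
```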

The classes BatchId, BlackOutPeriod, RdbTableMasterSlaveJob, SlaveWorkStatus and WorkSchedule were written to support the various parts of the program. The class SampleProg is a simple simulated program which can be executed by the user. The class TimeVector was written to test how best to access a Vector.

Javadoc of the source code

The documentation as generated from javadoc can be found at:

http://hea-www.harvard.edu/MST/simul/software/docs/html/MasterSlave/index.html

To do list

1) A GUI to a) submit jobs b) start/add slaves c) suspend/kill slaves d) suspend/kill jobs based on batch-id. See the directory ./gui

2) The program should recognize national holidays, override the current property file, and let jobs run. See the directory ./misc

3) Server Activation.


USAGE

The scripts startMaster and startSlave are two kornshell scripts which start a server and the clients for distributed batch processing. The script startSlave requires the user to supply two RDB files.

The user can run startMaster from any workstation. However, the machine used as the Master should not be on the list of machines scheduled for periodic operating system upgrades.

RDB is a relational database system which uses a flat file ASCII format to store data. See http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html for a more detailed description of an RDB table. There are only a few simple rules for creating an RDB table:

  1. Every comment line at the beginning of the file begins with a '#' as its first character.

  2. The first non-comment line of the table contains the names of the columns.

  3. A tab is inserted between each column.

  4. The second non-comment line of the table contains either an N or an S for each column to denote whether the column is numeric or string, respectively.
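The four rules above are enough to read an RDB table with a few lines of code. The sketch below is hypothetical (it is not the RdbTableMasterSlaveJob class from the source) and only handles the subset of RDB described here: leading comment lines, a header line, a type line, and tab-delimited rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a reader for the RDB subset described in this document. */
class RdbTableSketch {
    List<String> columns;                                   // rule 2: column names
    List<String> types;                                     // rule 4: N or S per column
    final List<Map<String, String>> rows = new ArrayList<>();

    static RdbTableSketch parse(List<String> lines) {
        RdbTableSketch table = new RdbTableSketch();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no data
            }
            if (table.columns == null && line.startsWith("#")) {
                continue; // rule 1: comments at the beginning of the file
            }
            // rule 3: columns are separated by tabs; -1 keeps trailing empty fields
            List<String> fields = Arrays.asList(line.split("\t", -1));
            if (table.columns == null) {
                table.columns = fields;
            } else if (table.types == null) {
                table.types = fields;
            } else {
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < table.columns.size() && i < fields.size(); i++) {
                    row.put(table.columns.get(i), fields.get(i));
                }
                table.rows.add(row);
            }
        }
        return table;
    }
}
```

Because only tabs separate the columns, a command such as "java MasterSlave.SampleProg" can contain spaces without any quoting.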

The user must first prepare two files: i) jobfname.rdb, see Sample RDB command file, and ii) clientfname.rdb, see Sample RDB client file. The file jobfname.rdb contains the list of jobs submitted to the server, which farms out the work to the slaves. The file clientfname.rdb contains the names of the machines which will serve as the clients and their associated properties.

Executing the script startSlave with the -j cmdfile.rdb option will generate a unique batch id for all the jobs listed within the file cmdfile.rdb. After the script startSlave with the -c clientfname.rdb option has finished, there should be a series of files (if any) of the form b_batchid_jobid.stdout and b_batchid_jobid.stderr, where batchid is the said batch id and jobid is the job id as supplied by the user. The suffixes stdout and stderr indicate the material that was captured from stdout and stderr, respectively. A summary of the work done by the clients is written to the files clienthostname.numSMP.rdb, where clienthostname is the name of the workstation and numSMP is the extension for the number of processors of the workstation.
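The naming of the captured output files can be sketched as below. The class JobOutputSketch and its methods are hypothetical, and two points are assumptions rather than documented behaviour: the command is taken as an already split argument list, and a stream that captured nothing leaves no file behind.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical sketch of running one job and capturing its output. */
class JobOutputSketch {
    /** Builds names such as b_12_id3.stdout; stream is "stdout" or "stderr". */
    static String outputName(int batchId, String jobId, String stream) {
        return "b_" + batchId + "_" + jobId + "." + stream;
    }

    /** Runs the command, redirecting both streams to files in dir; returns the exit code. */
    static int run(File dir, int batchId, String jobId, String... cmd)
            throws IOException, InterruptedException {
        File out = new File(dir, outputName(batchId, jobId, "stdout"));
        File err = new File(dir, outputName(batchId, jobId, "stderr"));
        Process process = new ProcessBuilder(cmd)
                .redirectOutput(out)
                .redirectError(err)
                .start();
        int exitCode = process.waitFor();
        // Assumption: empty captures are removed so that only real output remains.
        if (out.length() == 0) {
            out.delete();
        }
        if (err.length() == 0) {
            err.delete();
        }
        return exitCode;
    }
}
```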

Sample RDB command file

  dumbo-252: cat cmdfile.rdb
  #
  # This file is an RDB type. RDB is a relational database system which
  # uses a flat file ASCII format (delimited by tabs) to store data.
  # For more information about an rdb file, see the following url:
  #
  #    http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html
  #
  # The command file contains three columns: id, machineClass and cmd:
  #
  # 1) The column `id' contains a unique identification for jobs to be
  # submitted
  #
  # 2) The column `machineClass' contains the type of machine where
  # the job in the column cmd can run on.
  #
  # 3) The column `cmd' contains the command to be submitted to a slave.
  #
  id    machineClass    cmd
  S     S       S
  id0   any     java MasterSlave.SampleProg
  id1   any     java MasterSlave.SampleProg
  id2   any     whoami
  id3   any     java MasterSlave.SampleProg
  id4   any     which java
  id5   any     java MasterSlave.SampleProg
  id6   any     whoami
  id7   any     java MasterSlave.SampleProg
  id8   any     which startMaster
  id9   any     java MasterSlave.SampleProg
  id10  any     java MasterSlave.SampleProg
  id11  any     who

Note the tabs in the file cmdfile.rdb:

  dumbo-253: cat -tev cmdfile.rdb
  #$
  # This file is an RDB type. RDB is a relational database system which$
  # uses a flat file ASCII format (delimited by tabs) to store data.$
  # For more information about an rdb file, see the following url:$
  #$
  #    http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html$
  #$
  # The command file contains three columns: id, machineClass and cmd:$
  #$
  # 1) The column `id' contains a unique identification for jobs to be$
  # submitted$
  #$
  # 2) The column `machineClass' contains the type of machine where$
  # the job in the column cmd can run on.$
  #$
  # 3) The column `cmd' contains the command to be submitted to a slave.$
  #$
  id^ImachineClass^Icmd$
  S^IS^IS$
  id0^Iany^Ijava MasterSlave.SampleProg$
  id1^Iany^Ijava MasterSlave.SampleProg$
  id2^Iany^Iwhoami$
  id3^Iany^Ijava MasterSlave.SampleProg$
  id4^Iany^Iwhich java$
  id5^Iany^Ijava MasterSlave.SampleProg$
  id6^Iany^Iwhoami$
  id7^Iany^Ijava MasterSlave.SampleProg$
  id8^Iany^Iwhich startMaster$
  id9^Iany^Ijava MasterSlave.SampleProg$
  id10^Iany^Ijava MasterSlave.SampleProg$
  id11^Iany^Iwho$

Sample RDB client file

  dumbo-242: cat clients.rdb
  #
  # This file is an RDB type. RDB is a relational database system which
  # uses a flat file ASCII format (delimited by tabs) to store data.
  # For more information about an rdb file, see the following url:
  #
  #    http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html
  #
  # The command file contains three columns: host, numSMP and properties:
  #
  # 1) The column `host' contains the name of the machine to run
  #     as a client/slave
  #
  # 2) The column numSMP contains the number of cpu's on the client's machine.
  #
  # 3) The column properties contains the initial properties of the client.
  #
  host  numSMP  properties
  S     N       S
  ennui 1       clientprofile/allthetime.properties
  futile        1       clientprofile/allthetime.properties

Note the tabs in the file clients.rdb:

  dumbo-254: cat -tev clients.rdb
  #$
  # This file is an RDB type. RDB is a relational database system which$
  # uses a flat file ASCII format (delimited by tabs) to store data.$
  # For more information about an rdb file, see the following url:$
  #$
  #    http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html$
  #$
  # The command file contains three columns: host, numSMP and properties:$
  #$
  # 1) The column `host' contains the name of the machine to run$
  #     as a client/slave$
  #$
  # 2) The column numSMP contains the number of cpu's on the client machine.$
  #$
  # 3) The column properties contains the initial properties of the client.$
  #$
  host^InumSMP^Iproperties$
  S^IN^IS$
  ennui^I1^Iclientprofile/allthetime.properties$
  futile^I1^Iclientprofile/allthetime.properties$

Sample client property file

  dumbo-255: cat clientprofile/dumbo.properties
  machineClass=dumbo
  Sunday.begin=0:00
  Sunday.end=0:00
  Monday.begin=8:30
  Monday.end=18:00
  Tuesday.begin=8:30
  Tuesday.end=18:00
  Wednesday.begin=8:30
  Wednesday.end=18:00
  Thursday.begin=8:30
  Thursday.end=18:00
  Friday.begin=8:30
  Friday.end=20:30
  Saturday.begin=0:00
  Saturday.end=0:00

The file can take multiple entries for any given day by adding an integer subscript (starting from 1) to the suffixes .begin and .end, for example:

  Friday.begin=0:30
  Friday.end=2:30
  Friday.begin1=5:30
  Friday.end1=10:30
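Reading the black out entries, including the subscripted ones, could be done along the lines of the sketch below. The class BlackOutPropertiesSketch and its methods are hypothetical; the sketch assumes that the unsubscripted pair comes first, that subscripts are consecutive starting from 1, and that times are written as H:MM or HH:MM.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch of collecting the black out periods of one day. */
class BlackOutPropertiesSketch {
    /** Parses times such as 8:30 or 18:00. */
    static LocalTime parseTime(String text) {
        String[] parts = text.trim().split(":");
        return LocalTime.of(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
    }

    /**
     * Returns {begin, end} pairs for the given day: first Day.begin/Day.end,
     * then Day.begin1/Day.end1, and so on until a pair is missing.
     */
    static List<LocalTime[]> periodsFor(Properties props, String day) {
        List<LocalTime[]> periods = new ArrayList<>();
        for (int i = 0; ; i++) {
            String subscript = (i == 0) ? "" : Integer.toString(i);
            String begin = props.getProperty(day + ".begin" + subscript);
            String end = props.getProperty(day + ".end" + subscript);
            if (begin == null || end == null) {
                break;
            }
            periods.add(new LocalTime[] { parseTime(begin), parseTime(end) });
        }
        return periods;
    }
}
```

The Properties object itself can be filled with the standard Properties.load method, since the file is of the form key=value.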

startMaster

startMaster [-h] -m masterName -s serverMachineName [-V]

startMaster is a kornshell script to start a master (server) on a workstation accessible on the network. The options are:

-h
Print this help information.

-m masterName
The name of the master (server).

-s serverMachineName
The name of the workstation on which the master (server) named masterName will run. serverMachineName should not be on the list of machines scheduled for periodic operating system upgrades.

-V
Verbose option. Programs will print diagnostic messages.

startMaster Example

To start the Master named Zeus on the workstation pandora, type:

dumbo-301: startMaster -m Zeus -s pandora -V

[pandora]:MasterImpl::main( ) : The server named `rmi://pandora/Zeus' is now operational

To start the server manually

Note that one can start the server manually by typing the following commands:

dumbo-302: rmiregistry

dumbo-303: java -Djava.security.policy=javapolicy.txt MasterSlave.MasterServer -masterName rmi://dumbo/Zeus -V

startSlave

startSlave [ -c clientfname.rdb ] [ -C clients ] [ -d stdout/stderrdir ] [ -h ] [ -j jobfname.rdb ] -m masterName [ -n niceLevel ] -s serverMachineName [ -V ]

The script startSlave can be used to submit jobs to the master/server, which farms out the work to the slaves/clients that are idle and require work.

The user should run the script startSlave from a workstation which is not on the list of machines scheduled for periodic operating system upgrades. The user has the following options:

  1. Submit a job to the server/master to be farmed out only, see To submit a job only.

  2. Start a list of slaves/clients to run the jobs that were submitted to the master/server only, see To dynamically add slaves.

  3. Submit a job and start a list of slaves/clients to run the job which was submitted, see To submit a job and start a few slaves to do the work.

The script startMaster must be run before the startSlave script; note that the following line

  [pandora]:MasterImpl::main( ) : The server named `rmi://pandora/Zeus' is now operational

must be printed before one can execute the startSlave script.

startSlave takes the following options:

-c clientfilename
clientfilename is an RDB filename containing three columns:
host
The name of the workstation where jobs are to be run.

numSMP
The number of symmetric processors on the workstation where the jobs are to be run.

properties
The location of the property file for the workstation.

-C clientlist
clientlist is an optional, comma delimited list of clients to run jobs on. The -c option must be specified. This is typically used to specify a subset of the clients in the client RDB file.

-d stdout/stderr output dir
The directory where the stdout and stderr output shall be written. The user must create the directory prior to running the script.

-h
Print this help information.

-j jobfilename
An RDB filename containing three columns: 1) the id, 2) the class of machine the jobs can be run on and 3) the commands to be submitted to the clients (slaves). An example of such a file is given in Sample RDB command file.

-m masterName
The name of the master (server).

-n niceLevel
The priority to run the jobs on the client machines. niceLevel must be between 4 and 20.

-s serverMachine
The name of the machine where the master (server) named masterName is running on.

startSlave Example

To submit a job and start a few slaves to do the work

To submit a job to the server and start a list of clients to run it, two files are needed: the file cmdfile.rdb contains the commands to be farmed out to the idle clients, and the client file (clients.rdb in this example) contains the list of workstations which will run the jobs. Type the following command:

  dumbo-336: startSlave -c clients.rdb -j cmdfile.rdb -m Zeus -s dumbo

To submit a job only

If a user wants to add jobs to the work queue, either initially or in addition to jobs already queued, then the user should run the startSlave script without the -c clientfname.rdb option. To submit the jobs listed in a file called cmdfile.rdb, type the following command:

  dumbo-337: startSlave -j cmdfile.rdb -m Zeus -s dumbo

Alternatively, the user can type the following command:

  dumbo-338: java -Djava.security.policy=javapolicy.txt MasterSlave.SubmitJobClient -masterName rmi://dumbo/Zeus -cmdfile cmdfile.rdb

If a job is currently running when the command above is executed, the jobs listed in cmdfile.rdb will be appended to the job queue.

To dynamically add slaves.

If the user wants to add clients, either initially or in addition to clients already running, then the user should run the startSlave script without the -j cmdfile.rdb option. Type the following command:

  dumbo-339: startSlave -c clients.rdb -m Zeus -s dumbo

If there is no job running, executing the command above will start a pool of clients to run the jobs. If a job is currently running, executing the command above will dynamically add more clients to the pool of machines requesting work from the server.

To dynamically remove slave

The user can remove a slave from the pool of available slaves by running the killSlave kornshell script. In the following example, the slave ennui.0 ( -k ennui.0 ) is to be terminated ( -w -1 ).

  dumbo-340: killSlave -m Zeus -s dumbo -k ennui.0 -w -1

Alternatively, the user can enter the following command:

  dumbo-341: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.0 -status -1

To dynamically suspend a slave for X milliseconds

The user can suspend a slave for a number of milliseconds, in this example 1000 milliseconds, by entering one of the following two commands:

Using the killSlave kornshell script to suspend ennui.0 ( -k ennui.0 ) for 1000 milliseconds ( -w 1000 ):

  dumbo-342: killSlave -m Zeus -s dumbo -k ennui.0 -w 1000

Alternatively, the user can enter the following command:

  dumbo-343: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.0 -status 1000

Kill or suspend all SMP on a client

Since a client may be running on a workstation with Symmetric MultiProcessors (SMP), the suffix after the client's name may range from 0 to N-1, where N is the number of processors. The user can either run the killSlave script for each suffix of the client's name as shown above, or enter a wildcard for the suffix. The previous four commands can be entered as:

  dumbo-340: killSlave -m Zeus -s dumbo -k ennui.* -w -1
  dumbo-341: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.* -status -1
  dumbo-342: killSlave -m Zeus -s dumbo -k ennui.* -w 1000
  dumbo-343: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.* -status 1000

Override slave for days, date

A user can override the current schedule and forbid any jobs from running on his/her machine by using the overrideSlave script. The format for the script is:

  overrideSlave [ today ] [ tomorrow ] [ mon ] [ tues ] [ wed ] [ thu ]
  [ fri ] [ sat ] [ sun ] [ mm/dd - mm/dd ] [ -h ] [ -H hostname ]

A user can prevent his/her machine from being used today and tomorrow by typing:

  dumbo-344: overrideSlave today tomorrow

A user can prevent his/her machine from being used from, say, January 3 through January 6 by typing:

  dumbo-345: overrideSlave 1/3 - 1/6

printJobStatus

printJobStatus [ -h ] -m masterName -s serverMachineName [ -V ]

The script printJobStatus can be used to see the current job queue. Note that the list may contain jobs that were aborted due to intentional or unintentional interruptions.

The output of the script is written to the directory from which the script startMaster was started. The file is named 'currentJobStatus.rdb'.

The script is useful to see which jobs may have exited 'abruptly', in other words jobs that crashed.