MasterSlave - A distributed batch processing system using a master/slave paradigm, written in Java using RMI.
MasterSlave is a distributed batch processing system that uses a master/slave paradigm, implemented with the Remote Method Invocation (RMI) capabilities of Java. RMI was chosen because the server and the clients are all Java applications; moreover, this suite of programs does not require a complex naming service, so the naming service provided by RMI is sufficient. RMI security is used for the server and all the clients. The architecture and runtime dynamics of the system can be summarized as follows:
Upon start up, the server reads an initialization file which contains an integer, called the batch-id, which will be assigned to the next batch of jobs submitted by a user. After assigning the batch-id to a batch, the server increments the integer by one and writes it back to the initialization file. The initialization file is a poor man's approach to persistence. The batch-id is necessary because the job-ids, which the user assigns to every job to be farmed out, are unique only within a batch; since there is no guarantee that different users will choose different job-ids, the server must generate a unique batch-id to differentiate the jobs submitted by different users. The batch of jobs submitted by a user is stored in the Vector jobQueue. A Vector was chosen for the job queue because it is synchronized, and all methods which access the Vector jobQueue are also synchronized.
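The batch-id bookkeeping described above can be sketched as follows. This is only an illustration, not the code of the actual BatchId class: the class name BatchIdStore, its method name, and the assumption that the initialization file holds nothing but a single integer are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: keeps the next batch-id in a one-line text file. */
final class BatchIdStore {
    private final Path initFile;   // assumed to contain a single integer
    private int nextBatchId;

    BatchIdStore(Path initFile) throws IOException {
        this.initFile = initFile;
        // Read the integer left behind by the previous server run.
        String text = new String(Files.readAllBytes(initFile), StandardCharsets.UTF_8);
        this.nextBatchId = Integer.parseInt(text.trim());
    }

    /** Hands out the current batch-id, then writes the incremented value back. */
    synchronized int assignBatchId() throws IOException {
        int assigned = nextBatchId;
        nextBatchId = assigned + 1;
        Files.write(initFile,
                Integer.toString(nextBatchId).getBytes(StandardCharsets.UTF_8));
        return assigned;
    }
}
```

Because the incremented value is written back on every assignment, a server that is restarted continues with a batch-id that has not been handed out before.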
The client must register itself with the server before it can request any work. This is necessary since some jobs are more resource-intensive than others, and it is the task of the server to match the remaining available jobs to the workstations where the clients are running. The user can let a job run on the first available workstation by setting the job's machine class to `any'. The information about the client machine name and machine type is kept in the Hashtable slaveCrew.
When an idle client requests work, the server extracts an appropriate job from the jobQueue and transfers it to the Hashtable jobRunning. The server is responsible for matching the remaining available jobs to the resources of the client. This is possible because the server has a record of the type of machine each client is running on, and each job submitted by the user must specify which type of machine it is to be run on. The client is responsible for removing the job from the Hashtable jobRunning when it is done.
The Hashtable slaveCrew can be accessed by an external user to kill or suspend a slave; see To dynamically remove slave. The user must specify which slave is to be removed or suspended. Every slave has a numerical suffix assigned to it, since there may be more than one instance of a slave on workstations with Symmetric Multi-Processors (SMP). The user has the option of killing or suspending selected slaves by entering the correct suffix, or all slaves on a workstation by entering the wild card `*'. A kill signal of -1 will kill the slave, and a kill signal of an integer X greater than zero will suspend the slave for X milliseconds.
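A minimal sketch of this hand-off is shown below. The names jobQueue and jobRunning come from the description; the Job record, the method names, and the choice of keying jobRunning by job-id alone are assumptions made for illustration (the real server presumably also involves the batch-id).

```java
import java.util.Hashtable;
import java.util.List;
import java.util.Vector;

final class JobDispatchSketch {
    /** Hypothetical job record: id, required machine class, command line. */
    static final class Job {
        final String id;
        final String machineClass;
        final String cmd;

        Job(String id, String machineClass, String cmd) {
            this.id = id;
            this.machineClass = machineClass;
            this.cmd = cmd;
        }
    }

    final Vector<Job> jobQueue = new Vector<>();
    final Hashtable<String, Job> jobRunning = new Hashtable<>();

    /** Returns the first queued job the client may run, or null if none fits. */
    synchronized Job requestWork(List<String> clientClasses) {
        for (int i = 0; i < jobQueue.size(); i++) {
            Job job = jobQueue.get(i);
            // A job marked "any" runs on the first client that asks.
            if (job.machineClass.equals("any")
                    || clientClasses.contains(job.machineClass)) {
                jobQueue.remove(i);
                jobRunning.put(job.id, job);   // now tracked as running
                return job;
            }
        }
        return null;
    }

    /** Called by the client when its job has finished. */
    synchronized void jobDone(String jobId) {
        jobRunning.remove(jobId);
    }
}
```

The methods are synchronized in addition to using the synchronized Vector and Hashtable, so that moving a job from one container to the other happens as a single step.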
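The signal convention and the wild card can be illustrated with the sketch below. It is a simplification under stated assumptions: here slaveCrew maps a slave name such as ennui.0 directly to its pending signal, and the value 0 is used to mean "no request", neither of which is confirmed by the description; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Hashtable;
import java.util.List;

final class SlaveControlSketch {
    static final int KILL = -1;

    // Assumed layout: slave name ("host.suffix") -> pending signal, 0 = none.
    final Hashtable<String, Integer> slaveCrew = new Hashtable<>();

    /** Records a kill (-1) or suspend (X > 0 ms) request; returns the slaves hit. */
    List<String> signal(String target, int status) {
        if (status != KILL && status <= 0) {
            throw new IllegalArgumentException("status must be -1 or > 0");
        }
        List<String> hit = new ArrayList<>();
        synchronized (slaveCrew) {
            for (String name : slaveCrew.keySet()) {
                if (matches(target, name)) {
                    hit.add(name);
                }
            }
            for (String name : hit) {
                slaveCrew.put(name, status);
            }
        }
        Collections.sort(hit);
        return hit;
    }

    /** "ennui.*" matches every suffix of ennui; anything else must match exactly. */
    static boolean matches(String target, String name) {
        if (target.endsWith(".*")) {
            String hostAndDot = target.substring(0, target.length() - 1);
            return name.startsWith(hostAndDot);
        }
        return name.equals(target);
    }
}
```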
The client must unregister itself with the server when there aren't any more jobs in the Vector jobQueue left for the client to run.
The file Master.java contains the interface Master, which extends the java.rmi.Remote interface to define the exported methods that the server implements and the clients can invoke remotely. The class MasterImpl, defined in the file MasterImpl.java, implements this remote interface. The class MasterServer creates an instance of the remote object. The kornshell script startMaster starts the rmiregistry and remotely logs in to a specified machine on the network to start the server (this is done by using rsh to the host machine to start the class MasterServer).
Before running any of the clients, the user must prepare three files: two RDB files and a property file for each of the clients. An RDB file is a table for a relational database system which uses a flat file ASCII format to store data; the columns of the table are tab delimited. The property file is of the form key=value. The three files must contain the following information:
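For readers unfamiliar with RMI, the three-part layout described above (remote interface, implementation, server that binds the object) looks roughly like the sketch below. The type names carry a Sketch suffix to make clear they are not the real Master, MasterImpl and MasterServer, and the two remote methods are invented for illustration; the actual exported methods are listed in the javadoc.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.Hashtable;

/** Remote interface: each remotely callable method declares RemoteException. */
interface MasterSketch extends Remote {
    void register(String clientName, String machineClass) throws RemoteException;
    int registeredCount() throws RemoteException;
}

/** Server-side implementation of the remote interface. */
final class MasterSketchImpl implements MasterSketch {
    private final Hashtable<String, String> slaveCrew = new Hashtable<>();

    public void register(String clientName, String machineClass) {
        slaveCrew.put(clientName, machineClass);
    }

    public int registeredCount() {
        return slaveCrew.size();
    }
}

/** Creates the remote object and binds it under a name such as "Zeus". */
final class MasterSketchServer {
    private static MasterSketchImpl exported;   // strong reference to the object

    static MasterSketch bind(String name) throws RemoteException {
        exported = new MasterSketchImpl();
        MasterSketch stub =
                (MasterSketch) UnicastRemoteObject.exportObject(exported, 0);
        // Assumes an rmiregistry is already running on the default port 1099.
        Registry registry = LocateRegistry.getRegistry();
        registry.rebind(name, stub);
        return stub;
    }
}
```

A client would then look the name up in the same registry and call the interface methods as if they were local, handling RemoteException for network failures.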
See Sample RDB command file for an example.
The classes SubmitJob, KillClient, and SlaveClient, which inherit from the abstract base class MasterSlaveClient, are the three main classes for the clients.
The class SubmitJob is used to submit a list of jobs (see Sample RDB command file) to the server. The class verifies that the job-ids created by the user are unique. The kornshell script startSlave, with the options -j, -m and -s, should be used to run the SubmitJob class. See To submit a job only.
The class KillClient is used to kill or suspend a slave. This class enables the user to dynamically remove clients from the pool of clients available to run jobs. The script killSlave can be used to kill or suspend a slave. See To dynamically remove slave or To dynamically suspend a slave for X milliseconds.
On the client side, the bulk of the work is done by the class SlaveClient. Upon start up, the class loads the property file, which contains the value of the keyword `machineClass' identifying what type of machine the client is running on. The value of machineClass can be a comma-delimited list of machine classes. The property file also contains the black out period for the client, which is the time during which the client is forbidden from doing any work, typically the usual work hours. See Sample client property file for an example of a property file. The client then registers itself with the server; at this point the server knows the name and the type of machine the client is running on. This enables the server to properly assign a job to the client whenever the client is idle and needs work.
Before requesting any work from the server, the client gets the current time and compares it with the black out period. If the current time falls within the black out period, the client calculates how long it must sleep until the black out period is over. If the current time is outside the black out period, the client must still check the Hashtable slaveCrew in the server to make sure that a user has not requested it to be terminated. This is how a client can be dynamically terminated. Only when the current time is outside the black out period and no one has requested the client to be terminated does the client request work from the server.
When the client is done with the current job, it must remove the job from the Hashtable jobRunning within the server. The client continues to request work until there is no more work to be done. When invoking a program from Java, the client traps stdout and stderr and prints the output if it is non-empty. The client also keeps a log of the jobs it has run.
The kornshell script startSlave, with the options -c, -m and -s, should be used to run the SlaveClient. See To dynamically add slaves.
The classes BatchId, BlackOutPeriod, RdbTableMasterSlaveJob, SlaveWorkStatus and WorkSchedule were written to support the various parts of the program. The class SampleProg is a simple simulated program which can be executed by the user. The class TimeVector was written to test how best to access a Vector.
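The decision the client makes before each request for work can be sketched as below. This is not the actual SlaveClient or BlackOutPeriod code; the class and method names are hypothetical, a single black out period per day is assumed, and the signal value 0 is assumed to mean that no kill or suspend request is pending.

```java
import java.time.Duration;
import java.time.LocalTime;

final class BlackOutSketch {
    enum Action { SLEEP, TERMINATE, SUSPEND, REQUEST_WORK }

    /** Time left in the black out period [begin, end), or zero when outside it. */
    static Duration timeToSleep(LocalTime now, LocalTime begin, LocalTime end) {
        boolean inside = !now.isBefore(begin) && now.isBefore(end);
        return inside ? Duration.between(now, end) : Duration.ZERO;
    }

    /**
     * Order of checks as described in the text: black out period first,
     * then a pending kill (-1) or suspend (X > 0) request, then work.
     */
    static Action nextAction(Duration sleep, int signal) {
        if (!sleep.isZero()) {
            return Action.SLEEP;
        }
        if (signal == -1) {
            return Action.TERMINATE;
        }
        if (signal > 0) {
            return Action.SUSPEND;
        }
        return Action.REQUEST_WORK;
    }
}
```

With a period of 8:30 to 18:00, a client waking at 9:00 would sleep for nine hours, while a client waking at 19:00 would go on to check for a kill request and then ask for work.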
The documentation as generated from javadoc can be found at:
http://hea-www.harvard.edu/MST/simul/software/docs/html/MasterSlave/index.html
1) A GUI to a) submit jobs b) start/add slaves c) suspend/kill slaves d) suspend/kill jobs based on batch-id. See the directory ./gui
2) The program should recognize the national holidays, override the current property file and let jobs run. See the directory ./misc
3) Server Activation.
The scripts startMaster and startSlave are two kornshell scripts to start a server and the clients to run a distributed batch processing. The script startSlave requires the user to submit two RDB files.
The user can run startMaster from any workstation; however, the machine that will serve as the Master should not be on the list of machines that receive periodic operating system upgrades.
RDB is a relational database system which uses a flat file ASCII format to store data. See http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html for a more detailed description of an RDB table. There are only a few simple rules for creating an RDB table:
The users must first prepare two files: i) jobfname.rdb, see Sample RDB command file. ii) clientfname.rdb, see Sample RDB client file. The file jobfname.rdb contains the list of the jobs submitted to the server, which farms out the work to the slaves. The file clientfname.rdb contains the names of the machines which will serve as the clients and their associated properties.
Executing the script startSlave with the -j cmdfile.rdb option will
generate a unique batch id for all the jobs listed within the file
cmdfile.rdb. After the script startSlave with the -c clientfname.rdb
option is finished, there should be a series of files, if any, of the
form b_batchid_jobid.stdout and b_batchid_jobid.stderr, where batchid
is the said batch id and jobid is the job id as supplied by the user.
The suffixes stdout and stderr indicate the material that was captured
from stdout and stderr respectively. A summary of the work done by the
clients is written out in the files clienthostname.numSMP.rdb, where
clienthostname is the name of the workstation and numSMP is the
extension for the number of processors of the workstation.
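The naming scheme for the captured output can be sketched as follows. The helper class and its methods are hypothetical; the only parts taken from the description are the b_batchid_jobid.stdout and b_batchid_jobid.stderr patterns and the fact that a file appears only when something was captured.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class JobOutputSketch {
    /** Builds a name such as b_7_id3.stdout; stream is "stdout" or "stderr". */
    static String outputName(int batchId, String jobId, String stream) {
        return "b_" + batchId + "_" + jobId + "." + stream;
    }

    /** Writes the captured bytes only when there is something to write. */
    static boolean writeIfNonEmpty(Path dir, int batchId, String jobId,
                                   String stream, byte[] captured)
            throws IOException {
        if (captured.length == 0) {
            return false;   // nothing captured, so no file is created
        }
        Files.write(dir.resolve(outputName(batchId, jobId, stream)), captured);
        return true;
    }
}
```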
dumbo-252: cat cmdfile.rdb
#
# This file is an RDB type. RDB is a relational database system which
# uses a flat file ASCII format (delimited by tabs) to store data.
# For more information about an rdb file, see the following url:
#
# http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html
#
# The command file contains three columns: id, machineClass and cmd:
#
# 1) The column `id' contains a unique identification for jobs to be
# submitted
#
# 2) The column `machineClass' contains the type of machine where
# the job in the column cmd can run on.
#
# 3) The column `cmd' contains the command to be submitted to a slave.
#
id	machineClass	cmd
S	S	S
id0	any	java MasterSlave.SampleProg
id1	any	java MasterSlave.SampleProg
id2	any	whoami
id3	any	java MasterSlave.SampleProg
id4	any	which java
id5	any	java MasterSlave.SampleProg
id6	any	whoami
id7	any	java MasterSlave.SampleProg
id8	any	which startMaster
id9	any	java MasterSlave.SampleProg
id10	any	java MasterSlave.SampleProg
id11	any	who
Note the tabs in the file cmdfile.rdb:
dumbo-253: cat -tev cmdfile.rdb
#$
# This file is an RDB type. RDB is a relational database system which$
# uses a flat file ASCII format (delimited by tabs) to store data.$
# For more information about an rdb file, see the following url:$
#$
# http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html$
#$
# The command file contains three columns: id, machineClass and cmd:$
#$
# 1) The column `id' contains a unique identification for jobs to be$
# submitted$
#$
# 2) The column `machineClass' contains the type of machine where$
# the job in the column cmd can run on.$
#$
# 3) The column `cmd' contains the command to be submitted to a slave.$
#$
id^ImachineClass^Icmd$
S^IS^IS$
id0^Iany^Ijava MasterSlave.SampleProg$
id1^Iany^Ijava MasterSlave.SampleProg$
id2^Iany^Iwhoami$
id3^Iany^Ijava MasterSlave.SampleProg$
id4^Iany^Iwhich java$
id5^Iany^Ijava MasterSlave.SampleProg$
id6^Iany^Iwhoami$
id7^Iany^Ijava MasterSlave.SampleProg$
id8^Iany^Iwhich startMaster$
id9^Iany^Ijava MasterSlave.SampleProg$
id10^Iany^Ijava MasterSlave.SampleProg$
id11^Iany^Iwho$
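Reading such a file is straightforward, as the sketch below suggests. It is not the RdbTableMasterSlaveJob class; the class name and the row representation are hypothetical, and it assumes, based only on the samples shown here, that lines starting with # are comments, the first remaining line holds the column names, and the second holds the column definitions (S, N).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RdbSketch {
    /** Turns the lines of an RDB file into one column-name -> value map per row. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] header = null;
        boolean definitionsSeen = false;
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                       // comment or blank line
            }
            String[] fields = line.split("\t", -1);
            if (header == null) {
                header = fields;                // e.g. id, machineClass, cmd
                continue;
            }
            if (!definitionsSeen) {
                definitionsSeen = true;         // e.g. S, S, S
                continue;
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length && i < fields.length; i++) {
                row.put(header[i], fields[i]);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Splitting on the tab character only is what allows a command such as "java MasterSlave.SampleProg" to contain spaces, which is why the tabs in the file matter.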
dumbo-242: cat clients.rdb
#
# This file is an RDB type. RDB is a relational database system which
# uses a flat file ASCII format (delimited by tabs) to store data.
# For more information about an rdb file, see the following url:
#
# http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html
#
# The command file contains three columns: host, numSMP and properties:
#
# 1) The column `host' contains the name of the machine to run
# as a client/slave
#
# 2) The column numSMP contains the number of cpu's on the client's machine.
#
# 3) The column properties contains the initial properties of the client.
#
host	numSMP	properties
S	N	S
ennui	1	clientprofile/allthetime.properties
futile	1	clientprofile/allthetime.properties
Note the tabs in the file clients.rdb:
dumbo-254: cat -tev clients.rdb
#$
# This file is an RDB type. RDB is a relational database system which$
# uses a flat file ASCII format (delimited by tabs) to store data.$
# For more information about an rdb file, see the following url:$
#$
# http://hea-www.harvard.edu/MST/simul/software/docs/rdb.html$
#$
# The command file contains three columns: host, numSMP and properties:$
#$
# 1) The column `host' contains the name of the machine to run$
# as a client/slave$
#$
# 2) The column numSMP contains the number of cpu's on the client machine.$
#$
# 3) The column properties contains the initial properties of the client.$
#$
host^InumSMP^Iproperties$
S^IN^IS$
ennui^I1^Iclientprofile/allthetime.properties$
futile^I1^Iclientprofile/allthetime.properties$
dumbo-255: cat clientprofile/dumbo.properties
machineClass=dumbo
Sunday.begin=0:00
Sunday.end=0:00
Monday.begin=8:30
Monday.end=18:00
Tuesday.begin=8:30
Tuesday.end=18:00
Wednesday.begin=8:30
Wednesday.end=18:00
Thursday.begin=8:30
Thursday.end=18:00
Friday.begin=8:30
Friday.end=20:30
Saturday.begin=0:00
Saturday.end=0:00
The file can take multiple entries for any given day by adding an integer subscript (starting from 1) to the suffixes .begin and .end, for example:
Friday.begin=0:30
Friday.end=2:30
Friday.begin1=5:30
Friday.end1=10:30
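One way to read the subscripted entries is sketched below with java.util.Properties. This is not the actual WorkSchedule or BlackOutPeriod code; the class and method names are hypothetical, and it assumes the subscripts are consecutive (no subscript, then 1, 2, ...) and that the times are written as H:mm.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class WorkScheduleSketch {
    private static final DateTimeFormatter H_MM = DateTimeFormatter.ofPattern("H:mm");

    /**
     * Collects the {begin, end} pairs for one day: Day.begin/Day.end first,
     * then Day.begin1/Day.end1, Day.begin2/Day.end2, and so on.
     */
    static List<LocalTime[]> periodsFor(Properties props, String day) {
        List<LocalTime[]> periods = new ArrayList<>();
        for (int i = 0; ; i++) {
            String subscript = (i == 0) ? "" : Integer.toString(i);
            String begin = props.getProperty(day + ".begin" + subscript);
            String end = props.getProperty(day + ".end" + subscript);
            if (begin == null || end == null) {
                break;              // stop at the first missing pair
            }
            periods.add(new LocalTime[] {
                LocalTime.parse(begin, H_MM), LocalTime.parse(end, H_MM)
            });
        }
        return periods;
    }
}
```

For the Friday example above this yields two periods, 0:30 to 2:30 and 5:30 to 10:30.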
startMaster [-h] -m masterName -s serverMachineName [-V]
startMaster is a kornshell script to start a master (server) on a workstation accessible on the network. The options are :
To start the Master named Zeus on the workstation pandora, type:
dumbo-301: startMaster -m Zeus -s pandora -V
[pandora]:MasterImpl::main( ) : The server named `rmi://pandora/Zeus' is now operational
Note, one can start the server manually by typing the following commands:
dumbo-302: rmiregistry
dumbo-303: java -Djava.security.policy=javapolicy.txt MasterSlave.MasterServer -masterName rmi://dumbo/Zeus -V
startSlave [ -c clientfname.rdb ] [ -C clients ] [ -d stdout/stderrdir ] [ -h ] [ -j jobfname.rdb ] -m masterName [ -n niceLevel ] -s serverMachineName [ -V ]
The script startSlave can be used to submit jobs to the master/server, which farms out the work to the slaves/clients that are idle and require work.
The user should run the script startSlave from a workstation that is not on the list of machines that receive periodic operating system upgrades. The user has the options of:
The script startMaster must be run before the startSlave script; the following line
[pandora]:MasterImpl::main( ) : The server named `rmi://pandora/Zeus' is now operational
must be printed before one can execute the startSlave script.
startSlave takes the following options:
To submit jobs to the server and start a list of clients to run them, the user needs two files: the file cmdfile.rdb contains the commands to be farmed out to the idle clients, and the file clientfname.rdb contains the list of workstations which will run the jobs. Type the following command:
dumbo-336: startSlave -c clients.rdb -j cmdfile.rdb -m Zeus -s dumbo
If a user wants to add jobs to the work queue, either initially or in addition to jobs already queued, then the user should run the startSlave script without the -c clientfname.rdb option. To submit the jobs listed in a file called cmdfile.rdb, type the following command:
dumbo-337: startSlave -j cmdfile.rdb -m Zeus -s dumbo
Alternatively, the user can type the following command:
dumbo-338: java -Djava.security.policy=javapolicy.txt MasterSlave.SubmitJobClient -masterName rmi://dumbo/Zeus -cmdfile cmdfile.rdb
If a job is currently running, then executing the command above appends the jobs listed in cmdfile.rdb to the job queue.
If the user wants to add clients, either initially or in addition to clients already running, then the user should run the startSlave script without the -j cmdfile.rdb option. Type the following command:
dumbo-339: startSlave -c clients.rdb -m Zeus -s dumbo
If no job is running, then executing the command above starts a pool of clients to run the jobs. If a job is currently running, then executing the command above dynamically adds more clients to the pool of machines requesting work from the server.
The user can remove a slave from the pool of available slaves by running the killSlave kornshell script. In the following example, the slave ennui.0 ( -k ennui.0 ) is to be terminated ( -w -1 ).
dumbo-340: killSlave -m Zeus -s dumbo -k ennui.0 -w -1
Alternatively, the user can enter the following command :
dumbo-341: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.0 -status -1
The user can suspend a slave for a number of milliseconds, in this example 1000 milliseconds, by entering one of the following two commands:
Use the killSlave kornshell script to suspend ennui.0 ( -k ennui.0 ) for 1000 milliseconds ( -w 1000 ):
dumbo-342: killSlave -m Zeus -s dumbo -k ennui.0 -w 1000
Alternatively, the user can enter the following command :
dumbo-343: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.0 -status 1000
Since a client may be running on a workstation with Symmetric Multi-Processors (SMP), the suffix after the client's name may range from 0 to N-1, where N is the number of processors. The user can either run the killSlave script for each suffix of the client's name as shown above, or enter a wildcard for the suffix. The previous four commands can then be entered as:
dumbo-340: killSlave -m Zeus -s dumbo -k ennui.* -w -1
dumbo-341: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.* -status -1
dumbo-342: killSlave -m Zeus -s dumbo -k ennui.* -w 1000
dumbo-343: java -Djava.security.policy=javapolicy.txt MasterSlave.KillClient -masterName rmi://dumbo/Zeus -kill ennui.* -status 1000
A user can override the current schedule and forbid any jobs from running on his/her machine by using the overrideSlave script. The format for the script is:
overrideSlave [ today ] [ tomorrow ] [ mon ] [ tues ] [ wed ] [ thu ] [ fri ] [ sat ] [ sun ] [ mm/dd - mm/dd ] [ -h ] [ -H hostname ]
A user can prevent his/her machine from being used today and tomorrow by typing:
dumbo-344: overrideSlave today tomorrow
A user can prevent his/her machine from being used from, say, January 3 through January 6 by typing:
dumbo-345: overrideSlave 1/3 - 1/6
printJobStatus [ -h ] -m masterName -s serverMachineName [ -V ]
The script printJobStatus can be used to see the current job queue. Note that the list may contain jobs that were aborted due to intentional or unintentional interruptions.
The output of the script is written to the directory from which the script startMaster was started. The file is named 'currentJobStatus.rdb'.
The script is useful to see which jobs may have exited 'abruptly', in other words jobs that crashed.