rdbtranspose - transpose an rdb file
rdbtranspose --icol=column name --ocol=column name [options] [<input filename> [<output filename]]
Options which take values are separated from their values by either
white space or the = character.
rdbtranspose takes an RDB table and performs the equivalent of a matrix transpose, turning columns into rows. The values in one of the columns in the input table are used to create column names in the output table; if these are not unique, grouping is done.
By default it reads an RDB table from the standard input stream and writes the transposed table to the standard output stream. An alternative input file may be specified either as an argument on the command line, or via the --input parameter. An alternative output file may be specified with the --output parameter, or as the last argument on the command line (but only if the input file was also specified as an argument on the command line).
rdbtranspose --icol=a --ocol=b < input.rdb > output.rdb rdbtranspose --icol=a --ocol=b input.rdb > output.rdb rdbtranspose --icol=a --ocol=b input.rdb output.rdb rdbtranspose --icol=a --ocol=b --input=input.rdb --output=output.rdb rdbtranspose --icol=a --ocol=b --output=output.rdb input.rdb
rdbtranspose must scan the input table twice, first to determine the output column names, then to transpose the data. If the input data comes in from a pipe (i.e. the standard input stream), it must store all of the data in memory before writing the output file.
If the input comes from a file, it will read the file from disk twice. It will store only the data associated with a block (see below) in memory. To force it to store all of the data use the --mem option. Normally this isn't needed.
Simple RDB tables which do not have duplicate output column names are transposed as expected:
name height weight S N N frank 33 22 bell 44 33
rdbtranspose --ocol=what --icol=name
what frank bell S S S height 33 44 weight 22 33
However, if there are duplicates in the output column
names rdbtranspose breaks the input rows into blocks, defined
as consecutive blocks of rows which do not have repeat entries in
the output column name column. For instance, if the position
column in the following table contains the output column names,
position salary bennies dept S N S S stocker 33 none grocery sweeper 22 vacation grocery stocker 30 health auto sweeper 19 shoe shines auto
There are two blocks, each containing a stocker and sweeper.
rdbtranspose will transpose each block independently:
rdbtranspose --icol=position --ocol=what
what stocker sweeper S S S salary 33 22 bennies none vacation dept grocery grocery salary 30 19 bennies health shoe shines dept auto auto
Blocks may be identified either by an automatically assigned
integer id, or via the contents of a column in the RDB table
(the dept column in this example):
rdbtranspose --icol=position --ocol=what --blockid
block what stocker sweeper S S S S 1 salary 33 22 1 bennies none vacation 1 dept grocery grocery 2 salary 30 19 2 bennies health shoe shines 2 dept auto auto
rdbtranspose --icol=position --ocol=what --blockcol=dept
block what stocker sweeper S S S S grocery salary 33 22 grocery bennies none vacation grocery dept grocery grocery auto salary 30 19 auto bennies health shoe shines auto dept auto auto
The name of the block identification output column may be changed via the --blockname parameter.
In the last example, the rows where column what is equal to dept
are not useful; the -zap option removes them:
rdbtranspose --icol=position --ocol=what --blockcol=dept --zap
block what stocker sweeper S S S S grocery salary 33 22 grocery bennies none vacation auto salary 30 19 auto bennies health shoe shines
Diab Jerius ( djerius@cfa.harvard.edu )