NAME

rdbtranspose - transpose an rdb file


SYNOPSIS

rdbtranspose --icol=column name --ocol=column name [options] [<input filename> [<output filename]]


OPTIONS

Options which take values are separated from their values by either white space or the = character.

--input file
the RDB file to process. If not specified, it is read from the standard input stream.

--output file
the output file to which the RDB file will be written. If not specified, it is written to the standard output stream.

--icol string
the name of the column in the input RDB file containing the new column names. It must be specified.

--mem
Force the entire RDB table into memory (rather than rereading the RDB file).

--ocol string
the name of the column in the output RDB file which will contain the old column names. It must be specified.

--blockid
output a block identification column. each block is given a number, or may be specified by a value in a column (see --blockcol). The name of the output column is given by the --blockname parameter.

--blockname string
the name of the output column containing the block ids.

--blockcol string
The column in the input file which contains a block identification. The value for the row which is last in the block is used as the identification.

--zap
don't output rows containing block identifications if --blockcol> was specified.

--help
Print this message and exit.

--version
Print the version and exit.


DESCRIPTION

rdbtranspose takes an RDB table and performs the equivalent of a matrix transpose, turning columns into rows. The values in one of the columns in the input table are used to create column names in the output table; if these are not unique, grouping is done.

By default it reads an RDB table from the standard input stream and writes the transposed table to the standard output stream. An alternative input file may be specified either as an argument on the command line, or via the --input parameter. An alternative output file may be specified with the --output parameter, or as the last argument on the command line (but only if the input file was also specified as an argument on the command line).

  rdbtranspose --icol=a --ocol=b < input.rdb > output.rdb
  rdbtranspose --icol=a --ocol=b input.rdb > output.rdb
  rdbtranspose --icol=a --ocol=b input.rdb output.rdb
  rdbtranspose --icol=a --ocol=b --input=input.rdb --output=output.rdb
  rdbtranspose --icol=a --ocol=b --output=output.rdb input.rdb

rdbtranspose must scan the input table twice, first to determine the output column names, then to transpose the data. If the input data comes in from a pipe (i.e. the standard input stream), it must store all of the data in memory before writing the output file.

If the input comes from a file, it will read the file from disk twice. It will store only the data associated with a block (see below) in memory. To force it to store all of the data use the --mem option. Normally this isn't needed.

Simple RDB tables which do not have duplicate output column names are transposed as expected:

  name  height  weight
  S     N       N
  frank 33      22
  bell  44      33
  rdbtranspose --ocol=what --icol=name
  what    frank   bell
  S       S       S
  height  33      44
  weight  22      33

However, if there are duplicates in the output column names rdbtranspose breaks the input rows into blocks, defined as consecutive blocks of rows which do not have repeat entries in the output column name column. For instance, if the position column in the following table contains the output column names,

  position  salary  bennies     dept
  S         N       S           S
  stocker   33      none        grocery
  sweeper   22      vacation    grocery
  stocker   30      health      auto
  sweeper   19      shoe shines auto

There are two blocks, each containing a stocker and sweeper. rdbtranspose will transpose each block independently:

  rdbtranspose --icol=position --ocol=what
  what    stocker sweeper
  S       S       S
  salary  33      22
  bennies none    vacation
  dept    grocery grocery
  salary  30      19
  bennies health  shoe shines
  dept    auto    auto

Blocks may be identified either by an automatically assigned integer id, or via the contents of a column in the RDB table (the dept column in this example):

  rdbtranspose --icol=position --ocol=what --blockid
  block   what    stocker sweeper
  S       S       S       S
  1       salary  33      22
  1       bennies none    vacation
  1       dept    grocery grocery
  2       salary  30      19
  2       bennies health  shoe shines
  2       dept    auto    auto
  rdbtranspose --icol=position --ocol=what --blockcol=dept
  block   what    stocker sweeper
  S       S       S       S
  grocery salary  33      22
  grocery bennies none    vacation
  grocery dept    grocery grocery
  auto    salary  30      19
  auto    bennies health  shoe shines
  auto    dept    auto    auto

The name of the block identification output column may be changed via the --blockname parameter.

In the last example, the rows where column what is equal to dept are not useful; the -zap option removes them:

  rdbtranspose --icol=position --ocol=what --blockcol=dept --zap
  block   what    stocker sweeper
  S       S       S       S
  grocery salary  33      22
  grocery bennies none    vacation
  auto    salary  30      19
  auto    bennies health  shoe shines


AUTHOR

Diab Jerius ( djerius@cfa.harvard.edu )