NAME

rdbhist - create a histogram table from an RDB table.


SYNOPSIS

rdbhist [--param=value ...]


OPTIONS

rdbhist uses the Perl Getopt::Long package which adheres to the POSIX syntax for command line options, with GNU extensions. Options may take optional arguments, which are symbolically shown below enclosed in square brackets. Optional arguments may be appended to the option with an equals sign, or may be separated from it by white space.

Input and Output Parameters

--input=filename
The input RDB file. If the given name is the string stdin, input is read from the UNIX standard input stream. Defaults to stdin.

--table[=filename] / --notable
The output histogram, written as an RDB file. The filename may be omitted, or may be one of the strings stdout or stderr, in which case the histogram is written to the UNIX standard output or error stream.

The default behavior is to write the histogram to stdout. To turn off table generation, use --notable.

--vtable[=filename]
This option creates an RDB table containing a single line per input item, indicating the bin into which the item was sorted. All of the columns from the input table are written out as well. The table is not normally produced. If filename is omitted, or is one of the strings stdout or stderr, the table is written to the UNIX standard output or error stream. A vtable is written only if rdbhist bins data.

--update=N
This option indicates that the plot (--plot) or the table (--table) should be refreshed after every N items have been binned.

Binning Parameters

--bin
This is a flag indicating that rdbhist should bin the data specified by the input stream. As rdbhist has the ability to plot previously determined histograms, sometimes binning is not desired. To stop binning, use the --nobin option.

--xcol=column name
The name of the column in the input RDB table containing the items to be binned.

--wtcol=column name
If specified, this indicates that the items to be binned have non-unity weight. The argument is the name of the column which contains the values used to weight the count of values in each bin.

--width=float
The bin width, specified in units of the xcol data. Currently rdbhist will only auto-generate equal width bins. The default bin width is 1.

--edge=float
The value of the left edge of one of the bins. It defaults to 0.

For complete control over bin placement and width, see --binfile and --fixed.

Pre-Binned Data Parameters

--binfile=filename
The name of a file containing either pre-binned data, or bin edges (see the --fixed option). See Pre-Binned Data.

The binfile must always contain two columns specifying the bins' minimum and maximum edges. The column names default to min and max, but may be changed with the --binmin and --binmax options.

Depending upon how it will be used, it may also contain columns containing the number of items counted into each bin, the sum of the items in each bin, and a bin index. The column names default to n, sum, and idx, respectively. The existence of the columns (and, optionally different names) is flagged by the --binn, --binsum and --binidx options.

If filename is stdin, the data are read from the standard input stream.

--binmax=column name
The name of the binfile column containing the maximum bin edges. The default is max.

--binmin=column name
The name of the binfile column containing the minimum bin edges. The default is min.

--binidx[=column name]
This indicates the name of the index column which is written out via --table and --vtable. It defaults to idx.

Additionally, if --binfile has been specified, this option indicates that the binfile contains an index column. If the optional column name is not provided, it defaults to idx.

--binsum[=column name]
This indicates the name of the column containing sums of bins which is written out via --table and --vtable. It defaults to sum.

Additionally, if --binfile is specified, this flag indicates that the binfile contains a column containing pre-summed bin values, which should be used to preload the histogram. If the optional column name is not specified, it defaults to sum.

--binn[=column name]
This indicates the name of the column containing the number of items counted into each bin which is written out via --table and --vtable. It defaults to n. This column is written out if either --wtcol or the combination of --binfile and --binsum is specified.

Additionally, if --binfile is specified, this flag indicates that the binfile contains a column containing pre-summed bin counts, which should be used to preload the histogram. If the optional column name is not specified, it defaults to n.

--fixed
This flag indicates that rdbhist should only use the bins specified in binfile.

Plotting

--plot[=device]
If a plot of the histogram is desired, this option should be specified. The optional argument is the name of the PGPLOT device. If it is not specified, it defaults to /xserve.

--bar / --nobar
If --bar is specified (the default behavior), vertical bars are drawn between bins in the histogram. Use --nobar to prevent this.

--norm
If specified, the plot of the histogram will be normalized by the sum of the bins.

--log
If specified, the logarithm of the counts will be plotted.

--xlabel=string
The string with which the horizontal axis will be annotated. It defaults to the name of the column containing the data items to be binned.

--ylabel=string
The string with which the vertical axis will be annotated. It defaults to the value of the first of the following which is specified: --wtcol, --binsum, or the string 'N'.

--title=string
The title of the plot. If --wtcol or --binsum is specified it defaults to Weight/Bin vs. Bin, otherwise it defaults to Counts/Bin vs. Bin.

--verbose
Be a little more verbose about operations. Messages are written to the UNIX standard error stream.

--help
A boolean switch. If present, print this help information and exit.

--version
A boolean switch. If present, write version information to the UNIX standard output stream and exit.


DESCRIPTION

rdbhist generates a histogram using data read from the input RDB table. It can operate on unbinned data, previously binned data, or weighted data. It can automatically generate bins of equal width, or read in edge values for bins. The resultant histogram may be produced as an RDB summary table, as a plot (via PGPLOT), and as a verbose RDB table indicating for each input item the bin in which it is accumulated.

Input

The input RDB table is specified via the --input parameter. rdbhist defaults to reading from the UNIX standard input stream. The input file must have at least one column containing the values to be binned. The name of the column, which defaults to x, may be specified with the --xcol option. If the values are not to have unity weights, the input table must have a second column containing those weights. The --wtcol option should be specified, and may optionally be used to specify a name for that column; otherwise it defaults to wt.

To merge the newly created histogram with a previous one, use the --binfile option to specify the name of the file containing the previously binned data.
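
The effect of a weight column can be sketched in Python. This is an illustration of the idea only, not rdbhist's implementation: the function name and sample rows are hypothetical, rows are modelled as dictionaries keyed by column name, and it is an assumption that the per-bin sum is the sum of the weights while n remains the plain item count.

```python
import math
from collections import defaultdict

def weighted_histogram(rows, width=1.0, edge=0.0, xcol="x", wtcol=None):
    """Accumulate an item count (n) and a weighted sum per bin index."""
    n = defaultdict(int)
    total = defaultdict(float)
    for row in rows:
        # Equal-width bins aligned so that one bin starts at `edge`.
        idx = math.floor((row[xcol] - edge) / width)
        # Without a weight column every item has unity weight.
        weight = row[wtcol] if wtcol is not None else 1.0
        n[idx] += 1
        total[idx] += weight
    return dict(n), dict(total)

# Hypothetical input rows using the default column names x and wt.
rows = [
    {"x": 0.2, "wt": 2.0},
    {"x": 0.7, "wt": 0.5},
    {"x": 1.5, "wt": 3.0},
]
counts, sums = weighted_histogram(rows, wtcol="wt")
```

With unity weights the two accumulators carry the same information, which may be why a separate count column is only of interest for weighted or pre-summed data.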

If a previously constructed histogram is to be read in and plotted, and no new data are to be added to the histogram, use the --nobin option to indicate this.

Output

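rdbhist can output the resulting histogram in any of three formats: an RDB summary table (--table), a plot (--plot), and a verbose RDB table (--vtable) which records, for each input item, the bin into which it was sorted.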

Binning Parameters

rdbhist can either generate bins of a given width or it can read in predefined bins.

The default behavior is to generate bins of the width specified by the --width option (which defaults to 1), aligned such that the minimum limit of one of the bins falls at the value specified by the --edge option (which defaults to 0).

Use the --fixed and --binfile options to have rdbhist use pre-defined bins. The binfile must have two columns which specify the bin minimum and maximum limits. These are by default named min and max, but may be changed with the --binmin and --binmax options.
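
As a rough sketch of what a binfile with renamed edge columns looks like to a reader, the following Python parses a small table and extracts the bin limits. It assumes the common RDB layout (optional comment lines starting with #, a tab-separated line of column names, a line of column definitions, then tab-separated data rows); the functions read_rdb and load_bins and the sample table are hypothetical and are not part of rdbhist.

```python
import io

def read_rdb(stream):
    """Parse a tab-separated RDB table into a list of dicts of strings."""
    # Assumption: comment lines start with '#' and precede the header.
    lines = [ln.rstrip("\n") for ln in stream if not ln.startswith("#")]
    names = lines[0].split("\t")
    # lines[1] holds the column definitions; it is skipped here.
    return [dict(zip(names, ln.split("\t"))) for ln in lines[2:] if ln]

def load_bins(stream, binmin="min", binmax="max"):
    """Return a list of (minimum, maximum) limits, one pair per bin."""
    return [(float(r[binmin]), float(r[binmax])) for r in read_rdb(stream)]

# Hypothetical binfile whose edge columns are named lo and hi,
# as would be selected with --binmin=lo --binmax=hi.
sample = "# hypothetical binfile\nlo\thi\nN\tN\n0\t1\n1\t2.5\n"
bins = load_bins(io.StringIO(sample), binmin="lo", binmax="hi")
```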

Bins' maximum limits are not included in the bin.
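
The half-open convention (minimum included, maximum excluded) can be illustrated with a short Python lookup over predefined bins. This is a sketch under stated assumptions rather than rdbhist's code: find_bin is a hypothetical helper, the bins are assumed to be sorted and non-overlapping, and returning None for a value that falls in no bin is a choice made for the example.

```python
from bisect import bisect_right

def find_bin(value, bins):
    """Return the position of the bin containing value, or None.

    bins is a list of (minimum, maximum) pairs sorted by minimum.
    A value equal to a bin's maximum does not belong to that bin.
    """
    mins = [lo for lo, _ in bins]
    pos = bisect_right(mins, value) - 1  # last bin whose minimum <= value
    if pos < 0:
        return None
    lo, hi = bins[pos]
    return pos if lo <= value < hi else None

bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]
on_shared_edge = find_bin(1.0, bins)   # lands in the second bin
on_last_maximum = find_bin(5.0, bins)  # excluded from the last bin
```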

Pre-binned Data

rdbhist is capable of loading a previously binned distribution and either adding to it from an input file, or plotting it (for the latter, see Plotting Pre-Binned Data).

When augmenting an existing histogram, the histogram is read from the file specified by the --binfile option. The presence of that data is indicated by specifying the --binsum and/or --binn options. The specified data will be preloaded into the bins which are defined in the binfile. If automatically generated bins are to be used when adding to the histogram, it is recommended that the same values of width and edge be used for the new data as were used to generate the old data. Additionally, the binfile should have a column containing the indices of the bins. Bin limits are derived from indices using the following formulae:

        bin_min = index * width + edge
        bin_max = ( index + 1 ) * width + edge
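
The formulae translate directly into Python. The inverse mapping from a value to its index shown here is an assumption: it is the floor expression that is consistent with the formulae and with maximum limits being excluded from a bin, but the source does not state how rdbhist computes it. Both function names are hypothetical.

```python
import math

def bin_limits(index, width=1.0, edge=0.0):
    """Bin limits from a bin index, as given by the formulae above."""
    return index * width + edge, (index + 1) * width + edge

def bin_index(x, width=1.0, edge=0.0):
    """Assumed inverse: the index i with bin_min <= x < bin_max."""
    return math.floor((x - edge) / width)

# Bins of width 0.5 with one left edge at 0.25.
i = bin_index(1.0, width=0.5, edge=0.25)
lo, hi = bin_limits(i, width=0.5, edge=0.25)

# Values below the edge get negative indices.
j = bin_index(-0.1, width=0.5, edge=0.25)
```

Because the index depends on both width and edge, indices stored in a binfile are only meaningful together with the values of --width and --edge that produced them.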

It is very important that the indices match those calculated by rdbhist; otherwise the preloaded data will be assigned to the wrong bins. The easiest way to ensure this is to use rdbhist to generate the initial histogram, and to use its output table (--table) as the binfile. It is also possible to rebin an existing histogram using the automatically determined bins, ensure that the output bins are the same as the input ones, and use the output histogram as the binfile.
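
A minimal Python sketch of preloading a previous histogram and adding new items to it follows. It is not rdbhist's implementation: the helper names are hypothetical, rows are modelled as dictionaries using the default column names (idx, min, n, sum), new items are given unity weight, and the index check simply applies the bin_min formula with a small tolerance.

```python
import math

def check_indices(binfile_rows, width, edge, tol=1e-9):
    """Raise ValueError if a stored index disagrees with its bin minimum."""
    for row in binfile_rows:
        expected = row["idx"] * width + edge
        if abs(row["min"] - expected) > tol:
            raise ValueError(f"index {row['idx']} does not match min {row['min']}")

def preload(binfile_rows):
    """Build {idx: [n, sum]} from the rows of a previous histogram."""
    return {row["idx"]: [row["n"], row["sum"]] for row in binfile_rows}

def add_items(hist, values, width, edge):
    """Add unity-weight items to a preloaded histogram."""
    for x in values:
        idx = math.floor((x - edge) / width)
        entry = hist.setdefault(idx, [0, 0.0])
        entry[0] += 1
        entry[1] += 1.0
    return hist

# Hypothetical previous histogram, generated with width 1 and edge 0.
previous = [
    {"idx": 0, "min": 0.0, "max": 1.0, "n": 3, "sum": 3.0},
    {"idx": 1, "min": 1.0, "max": 2.0, "n": 1, "sum": 1.0},
]
check_indices(previous, width=1.0, edge=0.0)
hist = add_items(preload(previous), [0.5, 1.5, 2.5], width=1.0, edge=0.0)
```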

Plotting Pre-Binned Data

To plot a histogram of already binned data without adding to it, tell rdbhist not to bin (--nobin) and to plot (--plot), give it the name of the file containing the binned data (--binfile; if the name is stdin, the data are read from the standard input stream), and specify the column names for the bin edges and the summed values (--binmin, --binmax, and --binsum).

Automatic updating of output products

The output summary table (--table) and plot (--plot) may be periodically updated by specifying the --update parameter, which specifies a refresh period in terms of the number of values being binned. The verbose output table (--vtable) is not affected by this option.
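
The refresh period can be pictured as a counter over the items being binned. The Python below is a hedged sketch of that behavior, not rdbhist's code: bin_with_updates and the refresh callback are hypothetical, and the callback stands in for rewriting the table or redrawing the plot.

```python
import math

def bin_with_updates(values, update, refresh, width=1.0, edge=0.0):
    """Bin values, calling refresh(snapshot) after every `update` items."""
    hist = {}
    for count, x in enumerate(values, start=1):
        idx = math.floor((x - edge) / width)
        hist[idx] = hist.get(idx, 0) + 1
        # A refresh is triggered each time `update` more items are binned.
        if update and count % update == 0:
            refresh(dict(hist))
    return hist

snapshots = []
final = bin_with_updates(
    [0.1, 0.2, 1.3, 1.4, 2.5], update=2, refresh=snapshots.append
)
```

In this sketch five items with a period of two produce two intermediate snapshots; the fifth item appears only in the final histogram.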


AUTHOR

M. Tibbetts (mtibbetts@cfa.harvard.edu)

D. Jerius (djerius@cfa.harvard.edu)