rdbhist - create a histogram table from an RDB table.
rdbhist [--param=value ...]
rdbhist uses the Perl Getopt::Long package which adheres to the POSIX syntax for command line options, with GNU extensions. Options may take optional arguments, which are symbolically shown below enclosed in square brackets. Optional arguments may be appended to the option with an equals sign, or may be separated from it by white space.
RDB file. If the given name is the string stdin,
rdbhist reads from the UNIX standard input stream. Defaults to stdin.
RDB file. The filename may be
omitted, or may be one of the strings stdout or stderr, in which
case it writes to the UNIX standard output or error stream.
The default behavior is to write the histogram to stdout. To turn off
table generation, use --notable.
RDB table containing a single line per
input item, indicating the bin into which the item was sorted. All of
the columns from the input table are written out as well. The table is
not normally produced. If filename is omitted, or is one of the
strings stdout or stderr, the table is written to the UNIX
standard output or error stream. A vtable is written only if rdbhist
bins data.
RDB table containing the
items to be binned
0.
For complete control over bin placement and width, see --binfile and --fixed.
The binfile must always contain two columns specifying the bins'
minimum and maximum edges. The column names default to min and
max, but may be changed with the --binmin and --binmax
options.
Depending upon how it will be used, it may also contain columns containing
the number of items counted into each bin, the sum of the items in each bin,
and a bin index. The column names default to n, sum, and idx,
respectively. The existence of the columns (and, optionally different
names) is flagged by the --binn, --binsum and --binidx options.
If filename is stdin, the data are read from the standard input stream.
max.
min.
idx.
Additionally, if --binfile has been specified, this option indicates
that the binfile contains an index column. If the optional column
name is not provided, it defaults to idx.
sum.
Additionally, if --binfile is specified, this flag indicates that
the binfile contains a column containing pre-summed bin values,
which should be used to preload the histogram. If the optional column
name is not specified, it defaults to sum.
n. This column is written out if either
--wtcol or the combination of --binfile and --binsum is specified.
Additionally, this flag indicates that the binfile contains a
column containing pre-summed bin counts, which should be used to
preload the histogram. If the optional column name is not specified,
it defaults to n.
/xserve.
Weight/Bin vs. Bin, otherwise it defaults to
Counts/Bin vs. Bin.
rdbhist generates a histogram using data read from the input
RDB table. It can operate on unbinned data, previously binned
data, or weighted data. It can automatically generate bins of equal
width, or read in edge values for bins. The resultant histogram may
be produced as an RDB summary table, a plot (via PGPLOT), and
as a verbose RDB table indicating for each input item the bin in
which it is accumulated.
The input RDB table is specified via the --input
parameter. rdbhist defaults to reading from the UNIX standard input
stream. The input file must have at least one column containing the
values to be binned. The name of the column, which defaults to x,
may be specified with the --xcol option. If the values are not to
have unity weights, the input table must have a second column
containing those weights. The --wtcol option should be specified,
and may optionally be used to specify a name for that column;
otherwise it defaults to wt.
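
As an illustration of weighted versus unweighted input, the following Python sketch accumulates a count and a weight sum for each bin. It is not rdbhist's code: the function name accumulate, the use of dictionaries for rows, and the reading of "sum" as the per-bin total of the weights are all assumptions made for this example. The column names x and wt and the default width of 1 and edge of 0 follow the defaults described in this document.

```python
import math
from collections import defaultdict


def accumulate(rows, xcol="x", wtcol=None, width=1.0, edge=0.0):
    """Return {bin index: {"n": count, "sum": total weight}}.

    Hypothetical helper, not part of rdbhist. Each row is a dict keyed
    by column name. If wtcol is None every item gets unity weight.
    """
    bins = defaultdict(lambda: {"n": 0, "sum": 0.0})
    for row in rows:
        # Assumed index rule: floor of the offset from edge, in bin widths.
        idx = math.floor((float(row[xcol]) - edge) / width)
        weight = float(row[wtcol]) if wtcol else 1.0
        bins[idx]["n"] += 1
        bins[idx]["sum"] += weight
    return dict(bins)


rows = [
    {"x": 0.2, "wt": 2.0},
    {"x": 0.7, "wt": 0.5},
    {"x": 1.5, "wt": 1.0},
]
weighted = accumulate(rows, wtcol="wt")
unweighted = accumulate(rows)
```

With unity weights the count and the sum in each bin coincide, which is why a separate weight column is only needed when the weights differ from one.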
To merge the newly created histogram with a previous one, use the --binfile option to specify the name of the file containing the previously binned data.
If a previously constructed histogram is to be read in and plotted, and no new data are to be added to the histogram, use the --nobin option to indicate this.
rdbhist can output the resulting histogram in any of three formats.
The --table option produces an RDB
table, a summary of the histogram, giving the bin minimum and maximum
edges, the number of items in each bin, the weighted sum of the items
(if non-unity weights are specified), and the index of the bin.
The --vtable option delivers an RDB table with a row for each
item being binned, indicating which bin it was accumulated into, with
the same bin parameters as given by the --table option. All of the
information in the input RDB table is passed through.
The --plot option produces a plot of the histogram using PGPLOT.
The axes labels and title may be automatically generated, or may be
overridden via the --xlabel, --ylabel, and --title options.
The presence of vertical bars separating the bins is controlled by
the --bar and --nobar options (the default is to plot them).
rdbhist can either generate bins of a given width or it can read in predefined bins.
The default behavior is to generate bins of the width specified by the
--width option (which defaults to 1), aligned such that the minimum
limit of one of the bins falls at the value specified by the --edge
option (which defaults to 0).
Use the --fixed and --binfile options to have rdbhist use
pre-defined bins. The binfile must have two columns which specify
the bin minimum and maximum limits. These are by default named
min and max, but may be changed with the --binmin and --binmax
options.
Bins' maximum limits are not included in the bin.
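
The placement of automatically generated bins can be sketched in Python as follows. This is a minimal illustration, not rdbhist's implementation: the function names bin_index and bin_limits are hypothetical, and the floor-based index rule is an assumption chosen to agree with the half-open bins described above and with the bin limit formulae given in this document.

```python
import math


def bin_index(x, width=1.0, edge=0.0):
    """Index of the bin holding x (hypothetical helper).

    The floor makes each bin half-open, [min, max): a value equal to a
    bin's maximum limit falls into the next bin.
    """
    return math.floor((x - edge) / width)


def bin_limits(index, width=1.0, edge=0.0):
    """Minimum and maximum limits of the bin with the given index."""
    return index * width + edge, (index + 1) * width + edge


# With the default width of 1 and edge of 0, bin 0 covers [0, 1).
first = bin_index(0.999)
boundary = bin_index(1.0)
negative = bin_index(-0.5)

# With width 0.5 and edge 0.25, bin 1 covers [0.75, 1.25).
limits = bin_limits(1, width=0.5, edge=0.25)
```

Values that sit very close to a bin boundary can land on either side of it because of floating point rounding, so a sketch like this should not be expected to reproduce rdbhist's assignments exactly in those cases.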
rdbhist is capable of loading a previously binned distribution and either adding to it from an input file, or plotting it (for the latter, see Plotting Pre-Binned Data).
When augmenting an existing histogram, the histogram is read from the file specified by the --binfile option. The presence of those data is indicated by specifying the --binsum and/or --binn options. The specified data are preloaded into the bins defined in the binfile. If automatically generated bins are to be used when adding to the histogram, it is recommended that the same values of width and edge be used for the new data as were used to generate the old data. In that case the binfile should also have a column containing the indices of the bins. Bin limits are derived from indices using the following formulae:
bin_min = index * width + edge
bin_max = ( index + 1 ) * width + edge
It is very important that the indices match those calculated by rdbhist; otherwise it will become very confused. The easiest way to do this is to use rdbhist to generate the initial histogram and use its output table (--table) as the binfile. It is also possible to rebin the histogram using the automatically determined bins, ensure that the output bins are the same as the input ones, and use the output histogram as the binfile.
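
One way to check a binfile before reusing it is to recompute each bin's limits from its index and compare them with the stored edges. The Python sketch below does this for rows already loaded into dictionaries. The function name mismatched_rows and the tolerance are assumptions for the example. The column names min, max, and idx are the defaults described in this document, and the formulae are the ones given above.

```python
import math


def mismatched_rows(rows, width, edge, tol=1e-9):
    """Return the rows whose edges disagree with their index.

    Hypothetical helper, not part of rdbhist. Each row is a dict with
    the keys "min", "max" and "idx".
    """
    bad = []
    for row in rows:
        idx = int(row["idx"])
        want_min = idx * width + edge          # bin_min formula
        want_max = (idx + 1) * width + edge    # bin_max formula
        ok = (math.isclose(float(row["min"]), want_min, abs_tol=tol)
              and math.isclose(float(row["max"]), want_max, abs_tol=tol))
        if not ok:
            bad.append(row)
    return bad


rows = [
    {"min": 0.0, "max": 2.0, "idx": 0},
    {"min": 2.0, "max": 4.0, "idx": 1},
    {"min": 4.0, "max": 6.0, "idx": 3},  # index should be 2
]
bad = mismatched_rows(rows, width=2.0, edge=0.0)
```

An empty result suggests that the binfile is consistent with the chosen width and edge. Any rows returned point to indices that would not line up with automatically generated bins.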
To plot a histogram of already binned data without adding to it, tell rdbhist not to bin (--nobin) and to plot (--plot), give it the name of the file (--binfile; otherwise it reads from stdin), and give it the column names for the bin edges and the summed values (--binmin, --binmax, and --binsum).
The output summary table (--table) and plot (--plot) may be periodically updated by specifying the --update parameter, which specifies a refresh period in terms of the number of values being binned. The verbose output table (--vtable) is not affected by this option.
M. Tibbetts (mtibbetts@cfa.harvard.edu)
D. Jerius (djerius@cfa.harvard.edu)