rdbstats - compute statistics on an rdb table
rdbstats [options] [columns]
Options may be abbreviated. Options which take values may be separated
from their values either by white space or the = character.
column(s) to the specified column name(s). Consecutive rows
with the same value in the break column(s) are treated as independent
data sets. This option cannot be used in conjunction with the
-rows option. This option may be repeated.
--percentiles=33 will output the value
which is greater than 33 percent of the data. Because all of the data
must be stored in memory for all of the columns for these statistics,
this is a rather memory-intensive option for large data sets.
ave|median
If median is specified, the
--quartiles option is implicitly turned on unless --percentiles
is specified. Normalization is done in the following fashion:
( Q - ave ) / (ave or median)
where Q is the statistic to be normalized.
eval'd.
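The normalization rule given above for the ave|median option can be sketched in a few lines of Python. This is only an illustration of the formula exactly as written, not rdbstats' own code: the function name normalize and the by parameter are hypothetical, and the numerator is assumed to subtract the average in both modes because that is what the formula states.

```python
from statistics import mean, median


def normalize(q, data, by="ave"):
    """Normalize statistic q as ( Q - ave ) / (ave or median).

    `by` is a hypothetical parameter standing in for the option value;
    it selects the denominator only, following the formula literally.
    """
    if by not in ("ave", "median"):
        raise ValueError("by must be 'ave' or 'median'")
    ave = mean(data)
    denom = ave if by == "ave" else median(data)
    return (q - ave) / denom


data = [1, 2, 3, 6]  # made-up sample: average 3, median 2.5
print(normalize(max(data), data, by="ave"))
print(normalize(max(data), data, by="median"))
```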
rdbstats generates statistics for columns in an rdb table. It reads the rdbtable from STDIN and writes an rdbtable to STDOUT. By default it generates the sum, average, standard deviation, minimum, and maximum values of the data. It can optionally generate the first and last quartiles and the median.
rdbstats can operate on more than one column. It normally operates on the columns specified on the command line, but if the -all option is specified, it works on all numeric columns. New columns containing the data products are created by appending suffixes to the names of the source columns. The suffixes and the contents of the columns are:
_n, _sum, _ave, _dev, _min, _max, _rss, _fq, _median, _lq, and _pNN, where N is specified by the --percentiles option
The input data may be split into independent subsets in one of two ways. Subsets may be identified by one or more break columns. Contiguous data with the same value in the break columns are treated as a subset. For example, given the rdb database,
row dataset data
N N N
1 1 3.1
2 1 4.9
3 1 62.2
4 2 122
5 2 233
6 2 232
If dataset is the break column, rows 1-3 are a subset and rows 4-6
are a subset. Note that the break column may be either numeric or
string. Its value need have no intrinsic meaning.
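The break-column behaviour can be illustrated with a short Python sketch that runs over the example table above. It is not how rdbstats itself is implemented; the function name stats_by_break is hypothetical, and only a few of the statistics are computed. The output keys reuse the _n, _sum, _ave, _min and _max suffixes described earlier.

```python
from itertools import groupby
from statistics import mean

# (row, dataset, data) tuples taken from the example table.
rows = [
    (1, "1", 3.1),
    (2, "1", 4.9),
    (3, "1", 62.2),
    (4, "2", 122),
    (5, "2", 233),
    (6, "2", 232),
]


def stats_by_break(rows):
    """Summarize the data column for each contiguous run of equal break values."""
    results = []
    # groupby only merges adjacent rows, so a break value that reappears
    # later in the table starts a new, independent subset.
    for key, group in groupby(rows, key=lambda r: r[1]):
        values = [r[2] for r in group]
        results.append({
            "dataset": key,
            "data_n": len(values),
            "data_sum": sum(values),
            "data_ave": mean(values),
            "data_min": min(values),
            "data_max": max(values),
        })
    return results


for subset in stats_by_break(rows):
    print(subset)
```

Because the grouping key is compared only for equality, it can be a string or a number, in line with the note that the break value need have no intrinsic meaning.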
The second method is to specify ranges of the record numbers of the rows to be included. Unlike with the break column, subsets need not be contiguous; they may not, however, overlap. Row range lists are specified with the --rows option; multiple instances of the option are allowed to specify multiple subsets. Row ranges have the following syntax:
Range lists are composed of ranges, separated by commas. Ranges in a range list must be disjoint, and must be listed in increasing order.
Valid characters in a range are 0-9, '(', ')', '-' and ','. White space and underscore (_) are ignored. Other characters are not allowed.
Here are some range list examples:
-1-2(-)1-3,4,18-21
Note that record numbers begin with 1.
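As a rough illustration of the range list rules, the following Python sketch validates and parses the simplest forms only: single record numbers and N-M ranges separated by commas. The function name parse_range_list is hypothetical. Forms using parentheses or a missing end point are deliberately rejected here, because their meaning is not described in this text; rdbstats itself accepts more than this sketch does.

```python
import re


def parse_range_list(text):
    """Parse a simple range list such as '1-3,4,18-21' into (start, end) pairs."""
    # White space and underscores are ignored.
    cleaned = re.sub(r"[\s_]", "", text)
    # This sketch accepts digits, '-' and ',' only (no parentheses).
    if not re.fullmatch(r"[0-9,-]+", cleaned):
        raise ValueError("invalid character in range list")
    ranges = []
    prev_end = 0  # record numbers begin with 1
    for part in cleaned.split(","):
        m = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
        if not m:
            raise ValueError(f"unsupported range: {part!r}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        if start < 1 or end < start:
            raise ValueError(f"bad range: {part!r}")
        # Ranges must be disjoint and listed in increasing order.
        if start <= prev_end:
            raise ValueError("ranges must be disjoint and increasing")
        ranges.append((start, end))
        prev_end = end
    return ranges


print(parse_range_list("1-3,4,18-21"))
```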
Diab Jerius ( djerius@cfa.harvard.edu )