rdb
rdb - a Relational Database management System
The programs avaible for manipulation of rdb tables are:
- column - select columns from an rdbtable
- compute - compute data field values
- dataent - interactively enter data into an rdbtable
- fmtcolumn - format columns
- headchg - replace or change the header of an rdbtable
- jointbl - join two rdbtables
- lst2tbl - convert a list format rdb table to table format
- mergetbl - merge two rdbtables
- ptbl - quick and easy printing of an rdbtable
- rdbcat - concatenate rdb tables
- rdbstats - perform statistics on rdb tables
- repair - attempt to repair existing or candidate RDB datafiles
- row - extract rows from an rdbtable
- setfields - set fields in an rdbtable
- tbl2lst - convert an rdb table to "list" format
(Note, there are more, these are only the ones with online
documentation available. See ``OPERATORS''.)
RDB is a fast, portable, relational database management system without
arbitrary limits, (other than memory and processor speed) that runs
under, and interacts with, the UNIX Operating system. It uses the
Operator/Stream DBMS paradigm described in ``Unix Review'', March, 1991,
page 24, entitled ``A 4GL Language''. There are a number of ``operators''
that each perform a unique function on the data. The ``stream'' is
suplied by the UNIX Input/Output redirection mechanism. Therefore
each operator processes some data and then passes it along to the next
operator via the UNIX pipe function. This is very efficient as UNIX
pipes are implemented in memory (at least in versions of UNIX at
RAND). RDB is compliant with the ``Relational Model''. The data is
contained in regular UNIX ACSII files, and so can be manipulated by
regular UNIX utilities, e.g. ls, wc, mv, cp, cat,
more, less, editors like the RAND editor e, head, RCS,
etc. The form of each file of data is that of a relation, or table,
with rows and columns of information. To extract information, a file
of data is fed to one or more ``operators'' via the UNIX Input/Output
redirection mechanism. There are also programs to generate reports,
and to generate, modify, and validate the data.
A good question one could ask is "With all the relational database
management systems available today, why do we need another one?" There
are five reasons. They are:
- RDB is easy to use by non-computer people. The concept is
straight forward and logical. To select rows of data, the 'row'
operator is used; to select columns of data, the 'column'
operator is used.
- The data is highly portable to and from other types of
machines, like Macintoshes or MSDOS computers.
- The system will run on any UNIX machine (that has the Perl
Programming Language).
- The system can work on intermediate data, which will later be
put into a commercial RDBMS, like INGRES.
- RDB essentially has no arbitrary limits, and can work where
INGRES can't. For example there is no limit on data field
size, the number of columns, or file size.
A more through discussion of why this type of relational database.structure makes sense is found in the book, "UNIX Relational Database
Management", Reference #2.
It is assumed that the reader has at least a minimum knowledge of
the UNIX Operating System, including knowledge of Input/Outout
redirection (e.g., STDIN, STDOUT, pipes).
The RDB system was implemented in the PERL programming language on
a SUN Sparc I computer.
This document presents information in the following order: The DATA
section describes the structure of the data, with examples. There is a
general discussion about operators in the section on OPERATORS, followed
by several sub-sections, one for each operator in alphabetic order.
Each has detailed instructions for use, and examples. There are
sections describing selection of information using multiple operators,
producing reports, and generating new rdbtables (data files in RDB
format).
The data is contained in regular UNIX, ASCII files and therefore can
be manipulated by regular UNIX utilities, e.g. grep, ls, wc,
mv, cp, cat, more, less, head, RCS, and editors
like the RAND editor e, vi, etc. A good way to view the data of
course, would be to use the RDB operator that prints such datafiles
ptbl
. The relation, or table structure is achieved by separating
the columns with ASCII TAB characters, and terminating the rows with
ASCII NEWLINE characters. That is, each row of data in a file
contains the data values (a data field) separated by TAB characters
and terminated with a NEWLINE character. Therefore a fundamental rule
is that data values must NOT contain TAB characters.
The first section of the file, called the header, contains the file
structure information used by the operators. The header also contains
optional embedded documentation relating to the entire datafile (table
documentation) and/or each data column (column documentation). The
rest of the file, called the body, contains the actual data values. A
file of data, so structured, is said to be an 'rdbtable'. The header
consists of two or more lines. There is an optional number (zero or
more) of lines of table documentation followed by exactally two lines
that contain the structure information: the column name row and the
column definition row. The table documentation lines start with
either a sharp sign (#) followed by a space character, or one or more
space characters followed by a sharp sign (#). The rest of each line
may contain any documentation desired. Note that the table
documentation lines are the only lines in an rdbtable that are not
required to conform to the table structure defined above. The fields
in the column name row contain the names of each column. The fields
in the column definition row contain the data definitions and optional
column documentation for each column.
The column names are case sensitive, i.e. 'COUNT' is different from
'Count'. The guideline for characters that may be used in column
names is that alphabetic, numeric, and non-alphanumeric characters
that are not special to the UNIX shell are good choices. Column names
must include at least one alphabetic character. It is highly
recommended (but not required) that column names start with an
alphabetic or numeric character.
Non-alphanumeric characters that are acceptable in column names are
the percent sign (%) colon (:) at sign (@) equals (=) comma (,) and
dot (.). The sharp sign (#) underscore (_) and dash (-) characters
may also be used but they must not be the first character in a column
name. The TAB character must never be used in column names, nor
should internal spaces or UNIX I/O redirection characters (<,>,|) be
used. The data definitions include column width, data type, and
justification. The column width must be explicitly specified; the
others are optional and are frequently specified by default. The data
definitions are specified by adjacent characters in a single word.
The width of each field is specified by a numeric count. The type of
data is ``string'', ``numeric'', or ``month''. The types are specified by
an 'S', 'N', or 'M' respectively, and the default is type string.
Printout justification is 'left', or 'right', and is specified by an
'<' or '>' character respectively. If not specified, data types
string and month will be left justified and type numeric will be right
justified.
Note that column width is used primarily by the operator
ptbl
and in
no way limits the actual data size. It is not an error if some actual
data in a column is wider than the defined width; a listing produced
with
ptbl
may be out of alignment however. The optional
documentation for each column follows the data definition word in the
field. There must be one or more space characters after the data
definition word and before the column documentation; the column
documentation may be as long as necessary. Note that the data
definition and the optional column documentation are contained in a
single field in the row.
If the column name and/or column definition rows contain much
information and/or column documentation they can become long and
confusing to read. However the operators
valid
and
headchg
have
options to print the header contents as a 'template' file, an organized
list of information about the header.
A sample rdbtable (named sample) that will be used in later
examples is shown in Table 1. The picture in Table 1 is for
illustrative purposes; what the file would actually look like is shown
in Table 2, where a TAB character is represented by '<T>' and a NEWLINE
character is represented by '<N>'.
RDBTABLE (SAMPLE)
# Table documentation lines. These describe and
# identify the rdbtable contents.
# They may be read by many normal UNIX utilities,
# which is useful to easily identify a file.
# May also contain RCS or SCCS control information.
NAME COUNT TYP AMT OTHER RIGHT
6 5N 3 5N 8 8>
Bush 44 A 133 Another This
Hansen 44 A 23 One Is
Jones 77 X 77 Here On
Perry 77 B 244 And The
Hart 77 D 1111 So Right
Holmes 65 D 1111 On Edge
RDBTABLE (SAMPLE) ACTUAL CONTENT
# Table documentation lines. These describe and
# identify the rdbtable contents.
# They may be read by many normal UNIX utilities,
# which is useful to easily identify a file.
# May also contain RCS or SCCS control information.
NAMECOUNTTYPAMTOTHERRIGHT
65N35N88>
Bush44A133AnotherThis
Hansen44A23OneIs
Jones77X77HereOn
Perry77B244AndThe
Hart77D1111SoRight
Holmes65D1111OnEdge
It is important to note that only actual data is stored in the data
fields, with no leading or trailing space characters. This fact can
(and usually does) have a major effect on the size of the resulting
datafiles (rdbtables) compared to data stored in ``fixed field width''
systems. The datafiles in RDB are almost always smaller, sometimes
dramatically smaller.
The operators are separate program modules that each perform a
unique function on the data. They can be grouped into data movers,
report generators, and utilities.
The data movers are operators that extract or rearrange the data in
some way. They each read an rdbtable via STDIN and write a rdbtable via
STDOUT and so are frequently connected using the UNIX pipe function to
form a larger task. Each operator in such a ``pipeline'' style of
operation gets its input from the output of the previous operator in the
``pipeline''. The data movers include:
-
row
-
Selects rows based on arbitrary expressions.
-
column
-
Selects columns by name, outputs columns in listed order.
-
setfields
-
Selects rows based on arbitrary expressions and changes fields in
those rows.
-
search
-
Selects rows based on a multi-column key of a sorted
or indexed rdbtable.
-
sorttbl
-
Sorts a datafile by one or more columns.
-
jointbl
-
Natural or ``Master/Detail'' join of two rdbtables.
-
mergetbl
-
Merges two like rdbtables.
-
compute
-
Computes an arbitrary expression using column names.
-
uniqtbl
-
Makes an rdbtable unique on specified columns.
-
subtotal
-
Lists subtotals of specified columns.
The report generators each read an rdbtable via STDIN and produce a.report on STDOUT, so when they are in a ``pipeline'' of operators they
will be the operator at the end. The report generators are:
-
ptbl
-
Quick and easy printing of output formatted from
information in the header.
-
reporttbl
-
Best form of output, with definable format.
-
summ
-
Summary/Statistical information about data values in
an rdbtable.
The utilities are used for manipulating the structure and content.of rdbtables and are generally used as separate tasks. The utilities
are:
-
headchg
-
Generates and replaces (or removes) the header of
an rdbtable.
-
dataent
-
An interactive capability for entering data into
an rdbtable.
-
etbl
-
Uses an editor to allow modifications to an rdbtable.
-
valid
-
Verifies the structure of an rdbtable.
-
repair
-
Attempts to repair candidate RDB datafiles.
All operators take a '-help' option to show details of operation.online.
The original versions of the programs were developed by Walter
V. Hobbs. This documentation and some of the new commands are derived
from his work.