rdb


NAME

rdb - a Relational Database management System


SYNOPSIS

The programs avaible for manipulation of rdb tables are:

(Note, there are more, these are only the ones with online documentation available. See ``OPERATORS''.)


SUMMARY

RDB is a fast, portable, relational database management system without arbitrary limits, (other than memory and processor speed) that runs under, and interacts with, the UNIX Operating system. It uses the Operator/Stream DBMS paradigm described in ``Unix Review'', March, 1991, page 24, entitled ``A 4GL Language''. There are a number of ``operators'' that each perform a unique function on the data. The ``stream'' is suplied by the UNIX Input/Output redirection mechanism. Therefore each operator processes some data and then passes it along to the next operator via the UNIX pipe function. This is very efficient as UNIX pipes are implemented in memory (at least in versions of UNIX at RAND). RDB is compliant with the ``Relational Model''. The data is contained in regular UNIX ACSII files, and so can be manipulated by regular UNIX utilities, e.g. ls, wc, mv, cp, cat, more, less, editors like the RAND editor e, head, RCS, etc. The form of each file of data is that of a relation, or table, with rows and columns of information. To extract information, a file of data is fed to one or more ``operators'' via the UNIX Input/Output redirection mechanism. There are also programs to generate reports, and to generate, modify, and validate the data.


DESCRIPTION

A good question one could ask is "With all the relational database management systems available today, why do we need another one?" There are five reasons. They are:

  1. RDB is easy to use by non-computer people. The concept is straight forward and logical. To select rows of data, the 'row' operator is used; to select columns of data, the 'column' operator is used.

  2. The data is highly portable to and from other types of machines, like Macintoshes or MSDOS computers.

  3. The system will run on any UNIX machine (that has the Perl Programming Language).

  4. The system can work on intermediate data, which will later be put into a commercial RDBMS, like INGRES.

  5. RDB essentially has no arbitrary limits, and can work where INGRES can't. For example there is no limit on data field size, the number of columns, or file size.

A more through discussion of why this type of relational database.structure makes sense is found in the book, "UNIX Relational Database Management", Reference #2.

It is assumed that the reader has at least a minimum knowledge of the UNIX Operating System, including knowledge of Input/Outout redirection (e.g., STDIN, STDOUT, pipes).

The RDB system was implemented in the PERL programming language on a SUN Sparc I computer.

This document presents information in the following order: The DATA section describes the structure of the data, with examples. There is a general discussion about operators in the section on OPERATORS, followed by several sub-sections, one for each operator in alphabetic order. Each has detailed instructions for use, and examples. There are sections describing selection of information using multiple operators, producing reports, and generating new rdbtables (data files in RDB format).

DATA (RDBTABLE)

The data is contained in regular UNIX, ASCII files and therefore can be manipulated by regular UNIX utilities, e.g. grep, ls, wc, mv, cp, cat, more, less, head, RCS, and editors like the RAND editor e, vi, etc. A good way to view the data of course, would be to use the RDB operator that prints such datafiles ptbl . The relation, or table structure is achieved by separating the columns with ASCII TAB characters, and terminating the rows with ASCII NEWLINE characters. That is, each row of data in a file contains the data values (a data field) separated by TAB characters and terminated with a NEWLINE character. Therefore a fundamental rule is that data values must NOT contain TAB characters.

The first section of the file, called the header, contains the file structure information used by the operators. The header also contains optional embedded documentation relating to the entire datafile (table documentation) and/or each data column (column documentation). The rest of the file, called the body, contains the actual data values. A file of data, so structured, is said to be an 'rdbtable'. The header consists of two or more lines. There is an optional number (zero or more) of lines of table documentation followed by exactally two lines that contain the structure information: the column name row and the column definition row. The table documentation lines start with either a sharp sign (#) followed by a space character, or one or more space characters followed by a sharp sign (#). The rest of each line may contain any documentation desired. Note that the table documentation lines are the only lines in an rdbtable that are not required to conform to the table structure defined above. The fields in the column name row contain the names of each column. The fields in the column definition row contain the data definitions and optional column documentation for each column.

The column names are case sensitive, i.e. 'COUNT' is different from 'Count'. The guideline for characters that may be used in column names is that alphabetic, numeric, and non-alphanumeric characters that are not special to the UNIX shell are good choices. Column names must include at least one alphabetic character. It is highly recommended (but not required) that column names start with an alphabetic or numeric character.

Non-alphanumeric characters that are acceptable in column names are the percent sign (%) colon (:) at sign (@) equals (=) comma (,) and dot (.). The sharp sign (#) underscore (_) and dash (-) characters may also be used but they must not be the first character in a column name. The TAB character must never be used in column names, nor should internal spaces or UNIX I/O redirection characters (<,>,|) be used. The data definitions include column width, data type, and justification. The column width must be explicitly specified; the others are optional and are frequently specified by default. The data definitions are specified by adjacent characters in a single word. The width of each field is specified by a numeric count. The type of data is ``string'', ``numeric'', or ``month''. The types are specified by an 'S', 'N', or 'M' respectively, and the default is type string. Printout justification is 'left', or 'right', and is specified by an '<' or '>' character respectively. If not specified, data types string and month will be left justified and type numeric will be right justified.

Note that column width is used primarily by the operator ptbl and in no way limits the actual data size. It is not an error if some actual data in a column is wider than the defined width; a listing produced with ptbl may be out of alignment however. The optional documentation for each column follows the data definition word in the field. There must be one or more space characters after the data definition word and before the column documentation; the column documentation may be as long as necessary. Note that the data definition and the optional column documentation are contained in a single field in the row.

If the column name and/or column definition rows contain much information and/or column documentation they can become long and confusing to read. However the operators valid and headchg have options to print the header contents as a 'template' file, an organized list of information about the header.

A sample rdbtable (named sample) that will be used in later examples is shown in Table 1. The picture in Table 1 is for illustrative purposes; what the file would actually look like is shown in Table 2, where a TAB character is represented by '<T>' and a NEWLINE character is represented by '<N>'.

Table 1

RDBTABLE (SAMPLE)

# Table documentation lines. These describe and # identify the rdbtable contents. # They may be read by many normal UNIX utilities, # which is useful to easily identify a file. # May also contain RCS or SCCS control information. NAME COUNT TYP AMT OTHER RIGHT 6 5N 3 5N 8 8> Bush 44 A 133 Another This Hansen 44 A 23 One Is Jones 77 X 77 Here On Perry 77 B 244 And The Hart 77 D 1111 So Right Holmes 65 D 1111 On Edge

Table 2

RDBTABLE (SAMPLE) ACTUAL CONTENT

# Table documentation lines. These describe and<N> # identify the rdbtable contents.<N> # They may be read by many normal UNIX utilities,<N> # which is useful to easily identify a file.<N> # May also contain RCS or SCCS control information.<N> NAME<T>COUNT<T>TYP<T>AMT<T>OTHER<T>RIGHT<N> 6<T>5N<T>3<T>5N<T>8<T>8><N> Bush<T>44<T>A<T>133<T>Another<T>This<N> Hansen<T>44<T>A<T>23<T>One<T>Is<N> Jones<T>77<T>X<T>77<T>Here<T>On<N> Perry<T>77<T>B<T>244<T>And<T>The<N> Hart<T>77<T>D<T>1111<T>So<T>Right<N> Holmes<T>65<T>D<T>1111<T>On<T>Edge<N>

It is important to note that only actual data is stored in the data fields, with no leading or trailing space characters. This fact can (and usually does) have a major effect on the size of the resulting datafiles (rdbtables) compared to data stored in ``fixed field width'' systems. The datafiles in RDB are almost always smaller, sometimes dramatically smaller.

OPERATORS

The operators are separate program modules that each perform a unique function on the data. They can be grouped into data movers, report generators, and utilities.

The data movers are operators that extract or rearrange the data in some way. They each read an rdbtable via STDIN and write a rdbtable via STDOUT and so are frequently connected using the UNIX pipe function to form a larger task. Each operator in such a ``pipeline'' style of operation gets its input from the output of the previous operator in the ``pipeline''. The data movers include:

row
Selects rows based on arbitrary expressions.

column
Selects columns by name, outputs columns in listed order.

setfields
Selects rows based on arbitrary expressions and changes fields in those rows.

search
Selects rows based on a multi-column key of a sorted or indexed rdbtable.

sorttbl
Sorts a datafile by one or more columns.

jointbl
Natural or ``Master/Detail'' join of two rdbtables.

mergetbl
Merges two like rdbtables.

compute
Computes an arbitrary expression using column names.

uniqtbl
Makes an rdbtable unique on specified columns.

subtotal
Lists subtotals of specified columns.

The report generators each read an rdbtable via STDIN and produce a.report on STDOUT, so when they are in a ``pipeline'' of operators they will be the operator at the end. The report generators are:

ptbl
Quick and easy printing of output formatted from information in the header.

reporttbl
Best form of output, with definable format.

summ
Summary/Statistical information about data values in an rdbtable.

The utilities are used for manipulating the structure and content.of rdbtables and are generally used as separate tasks. The utilities are:

headchg
Generates and replaces (or removes) the header of an rdbtable.

dataent
An interactive capability for entering data into an rdbtable.

etbl
Uses an editor to allow modifications to an rdbtable.

valid
Verifies the structure of an rdbtable.

repair
Attempts to repair candidate RDB datafiles.

All operators take a '-help' option to show details of operation.online.


AUTHOR

The original versions of the programs were developed by Walter V. Hobbs. This documentation and some of the new commands are derived from his work.