Next Previous Contents

2. Data formats

2.1 NoSQL table (relation) structure.

A table (or relation) is an ordinary ASCII file, with some additional rules that make it possible to use it as a database table. The file has records (rows) and fields (columns). The relation, or table structure is achieved by separating the columns with ASCII TAB characters, and terminating the rows with ASCII NEWLINE characters. That is, each row of data in a file contains the data values (a data field) separated by TAB characters and terminated with a NEWLINE character. Therefore a fundamental rule is that data values must NOT contain TAB characters.

The first section of the file, called the header, contains the file structure information used by the operators. The rest of the file, called the body, contains the actual data values. A file of data, so structured, is said to be a 'table'.

The header consists of exactly two lines that contain the structure information: the column name row and the dashline. The fields in the column name row contain the names of each column, and are separated from each other by a single TAB character. The dashline is a set of dashed lines, one set for each column, separated by single TAB characters. The dashline signals the start of the actual data rows and its sole purpose is to make the header visually easy to find.

The column names are case sensitive, i.e. 'COUNT' is different from 'Count'. The guideline for characters that may be used in column names is that alphabetic, numeric, and the underscore (_) are good choices. Numeric-only column names are not allowed. No rows, except the dashline, should contain only dashes and tabs.

The TAB character must never be used in column names, nor should spaces or UNIX I/O redirection characters (<,>,|) be used. To be on the safe side, column names should always start with a letter and contain only upper and lower case letters, numbers and the underscore (_). To wear both belt and braces :-), i.e. to be really safe, column names should start with an upper-case letter. The following names are reserved to the awk programming language, and should not be used to indicate column names:

BEGIN, END, break, continue, else, exit, exp, for, getline, gsub, if, in, index, int, length, log, next, print, printf, split, sprintf, sqrt, sub, substr, while, and possibly others, depending on the implementation of your awk (i.e. mawk, gawk, etc.). Refer to the man page and the documentation of you awk interpeter.

For instance, suppose you have a table that maps names to nicknames, then its two columns could be called Name and NName. Some NoSQL operators create new columns that have the same name as pre-existing table columns, with lower-case letters prepended to them. This is why you really sould stick to these rules.

Not abiding by these naming rules may still work, but there may be unexpected results.

A sample table (named SAMPLE) that will be used in later examples is shown in Table 1. The picture in Table 1 is for illustrative purposes; what the file would actually look like is shown in Table 2, where a TAB character is represented by '<T>' and a NEWLINE character is represented by '<N>'.

                          Table 1

                       table (SAMPLE)

                NAME    COUNT   TYP     AMT
                ----    -----   ---     ---
                Bush    44      A       133
                Hansen  44      A       23 
                Jones   77      X       77 
                Perry   77      B       244
                Hart    77      D       1111
                Holmes  65      D       1111
  

                         Table 2
  
                table (SAMPLE) actual content

                NAME<T>COUNT<T>TYP<T>AMT<N>
                ----<T>-----<T>---<T>---<N>
                Bush<T>44<T>A<T>133<N>
                Hansen<T>44<T>A<T>23<N>
                Jones<T>77<T>X<T>77<N>
                Perry<T>77<T>B<T>244<N>
                Hart<T>77<T>D<T>1111<N>
                Holmes<T>65<T>D<T>1111<N>
    

It is important to note that only actual data is stored in the data fields, with no leading or trailing space characters. This fact can (and usually does) have a major effect on the size of the resulting datafiles (tables) compared to data stored in "fixed field width" systems. The datafiles in NoSQL are almost always smaller, sometimes dramatically smaller.

A table can also be represented in a different format, called 'list format'. The list format of the above SAMPLE table is:


      NAME  Bush
      COUNT 44
      TYP   A
      AMT   133     
      
      NAME  Hansen
      COUNT 44
      TYP   A
      AMT   23      
      
      NAME  Jones
      COUNT 77
      TYP   X
      AMT   77      
      
      NAME  Perry
      COUNT 77
      TYP   B
      AMT   244     
      
      NAME  Hart
      COUNT 77
      TYP   D
      AMT   1111    
      
      NAME  Holmes
      COUNT 65
      TYP   D
      AMT   1111
      
    

The actual contents of a table in 'list' format, showing newlines and tabs is:


      <N>
      NAME<T>Bush<N>
      COUNT<T>44<N>
      TYP<T>A<N>
      AMT<T>133<N>
      <N>
      NAME<T>Hansen<N>
      COUNT<T>44<N>
      TYP<T>A<N>
      AMT<T>23<N>
      <N>
      NAME<T>Jones<N>
      COUNT<T>77<N>
      TYP<T>X<N>
      AMT<T>77<N>
      <N>
      NAME<T>Perry<N>
      COUNT<T>77<N>
      TYP<T>B<N>
      AMT<T>244<N>
      <N>
      NAME<T>Hart<N>
      COUNT<T>77<N>
      TYP<T>D<N>
      AMT<T>1111<N>
      <N>
      NAME<T>Holmes<N>
      COUNT<T>65<N>
      TYP<T>D<N>
      AMT<T>1111<N>
      <N>

    

Long lines, i.e. lines that are too long to fit on the width of the screen, may be folded over multiple rows in a file in 'list' format, provided that each continuation row starts with one or more spaces and/or TAB characters. Field (column) names need to be separated by the associated data by one or more spaces and/or TAB characters. The data part may contain TABs, which will be replaced by spaces by the 'listtotable' operator when the list is turned into a table.


      COMMENTS  This is a very looong comment, that I want to fold over
        multiple lines.

    

and the actual content is :


      <N>
      COMMENTS<T>This is a very looong comment, that I want to fold over<N>
      <T>multiple lines.<N>
      <N>

    

As we will see, there are NoSQL operators that convert back and forth between 'table' and 'list' formats.

It is suggested, though not required, that table file names be given the filename extension '.rdb', to make them recognizable right away.

2.2 Notes on similar database packages.

Besides NoSQL and RDB there are other UNIX DBMS's, both commercial and free, that are based on ASCII tables. A commercial implementation is /rdb, by Revolutionary Software, while among the free ones there are Starbase, developed at the Harvard Smithsonian Astrophysical Observatory, and Gunnar Stefansson's reldb, a collection of interesting tools available at sites that carry archives of the comp.sources.unix Usenet newsgroup.

The ASCII table format of those database engines is very close to that of NoSQL, therefore data can easily be converted back and forth between them and NoSQL.

I have no connections whatsoever with Revolutionary Software, but their /rdb is probably the package that I like most, as it combines the flexibility of NoSQL with the speed of the C language, though I think it provides far more operators than are actually useful, and handles internally things that could be more conveniently delegated to the underlying UNIX commands and utilities; and /rdb is not free :-).


Next Previous Contents