Any editor may be used to construct or modify a table, since it is a regular UNIX file. This 'direct editing' method is occasionally used, especially for small amounts of data. Avoid, however, using an editor that destroys TAB or SOH characters.
To generate a new table, the best plan (and usually the safest one) is first to generate a cross reference (xref) file, then convert it to table format and add the rows of data. Any convenient editor may be used to generate an xref file. Next, convert the xref to an empty table with the command 'maketable.' Finally, use the operator 'edittable' to enter rows of data.
A typical xref file is shown below:
# These are lines of table documentation. They can be of any length,
# and any number of such lines may exist.
# Each line must start correctly, e.g. with "#" surrounded by any
# number of spaces and/or TABs.
Name Name of item
Type Type: 1,2,3,7,9,A,B,X
Count Number of items
K Constant modifier
SS7 Special status for type 7
Size In Kilobytes
The above xref file contains the necessary elements to describe a table of six columns: table documentation (the comment lines that each start with a sharp sign '#'), column names ("Name", "Type", "Count", ...), and column documentation for each column (the text at the end of each column line).
To build the final table header, use the command:
maketable < table.xref
where table.xref is the xref file described above. The command will write the proper table header to STDOUT (which can be redirected to a file as usual):
Name Type Count K SS7 Size
The uneven spacing between the output column names depends on the TAB stop settings of your terminal.
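Putting the pieces together, a complete session to create and populate a new table might look like the following. The file names are just examples, and the assumption that 'edittable' takes the table file name as its argument should be checked against its own documentation:

vi table.xref                          # write the xref file with any editor
maketable < table.xref > items.table   # build the empty table
edittable items.table                  # enter the rows of data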
I will now explain how NoSQL tables can be modified, both manually and with automated programs.
Basically there are two ways to manually modify an existing table: either by direct editing with a TAB-conscious text editor, like vi(1), or, better, by using the 'edittable' command. The latter method is recommended, especially in a multi-user environment where multiple concurrent editing sessions may occur against the same table file. Whatever editor is used will also need to preserve the SOH characters at the beginning of the table.
The 'edittable' utility can be used to add new rows and to change or delete existing rows of data in a table. The optional '--list' switch tells 'edittable' to convert the table to 'list' format for editing, which makes working with wide tables much more comfortable. The editor called by 'edittable' defaults to vi, or to whatever is set in the environment variable EDITOR.
After editing, the table is automatically checked for validity by 'edittable,' using 'istable.' If structural errors are detected, the program prompts you for what to do (re-edit, quit, etc.).
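For instance, a wide table could be edited in 'list' format with vi like this ('items.table' is just an example name, and the table-file argument to 'edittable' is assumed as above):

EDITOR=vi edittable --list items.table

Here the EDITOR variable explicitly selects vi, and '--list' makes the wide table easier to work with, as explained above.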
The only way to update a table, both manually and with a program, is to capture the STDOUT stream of an operator and write the result back to a table file. You must take care not to overwrite the original input file while it is being read. Some familiarity with the UNIX shell is expected from the user. For instance, you had better not do:
operator < table > table
or you will end up with 'table' being zeroed out! The correct procedure is:
operator < table > table.tmp
mv table.tmp table
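A slightly safer variant chains the two commands, so that the original table is replaced only if the operator completed successfully:

operator < table > table.tmp && mv table.tmp table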
At the moment NoSQL does not provide a record-level update facility. This may be inefficient if you need to make frequent changes to a large table, for instance with a CGI program behind a Web server, as the whole table needs to be rewritten upon each change. But there are ways to circumvent this apparently major limitation in the NoSQL paradigm.
The obvious "trick" is to keep your tables small. Silly a suggestion as it may sound, before laughing please have a look at section 2.9 of the included 4gl.ps or 4gl.txt paper. Remember that NoSQL works with UNIX, not in addition to it. The underlying UNIX file system, which most commercial databases tend to disregard, can provide, when used creatively, an extremely powerful way to pre-organize your tables (relations) efficiently and keep them small (where small means roughly within a few hundred kilobytes in size).
If you really must make frequent modifications to a big indexed table, then you can still do it efficiently by applying your changes to a separate file rather than to the actual table. Then, merge the changes back into the big table (and re-build its indices) only every now and again, with a batch job that can be run in the background, overnight or when system activity is low. The following example will try to explain this better.
Suppose we have the large indexed table bigtable and we need to insert, change or delete one or more records. What we can do is the following.

Enter the new or changed records into a separate, much smaller table, say bigtable.updates, with exactly the same header as bigtable, but containing only those records that we want to insert into, update in or remove from bigtable. The entries in bigtable.updates will have to be in a format suitable for the 'update' operator.

Whenever we run a query on bigtable we will have to do it in three logical steps. The first step is to use 'search' on bigtable to take advantage of the indices. This will produce an intermediate output that will then have to be merged with bigtable.updates, and the final output will undergo the original query statement again.

The merging of that intermediate output with bigtable.updates will be done with the syntax described in the documentation of the 'update' operator. Unlike bigtable, where records will usually be kept sorted on the primary key field (i.e., on the leftmost table column), the records in bigtable.updates do not need to be sorted in any particular order.

This may seem complicated, but it isn't much, really.
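As for the periodic batch job mentioned earlier, a minimal sketch might look like the following. It assumes, as in the query example below, that 'update' reads the main table on STDIN and takes the name of the updates table as its argument; depending on how 'update' handles insertions, the result may also need to be re-sorted on the primary key, and the indices of bigtable will have to be re-built afterwards with whatever command was used to create them:

update bigtable.updates < bigtable > bigtable.new &&
mv bigtable.new bigtable &&
head -1 bigtable.updates > bigtable.updates.new &&
mv bigtable.updates.new bigtable.updates

The last two commands reset bigtable.updates to an empty table (header only), ready to accumulate the next batch of changes.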
Say bigtable contains two columns, SSN (for Social Security No.) and SURNAME, where the unique table key is on SSN, and we have built a secondary index on the SURNAME field. If we want to extract all the rows whose SURNAME field is equal to "Smith," the query necessary to take advantage of a fast search method on bigtable and to account for the presence of bigtable.updates is:
printf 'SURNAME\nSmith\n' |
search --index bigtable.x.SURNAME | update bigtable.updates |
row 'SURNAME=="Smith"'
As you can see, the trick is first to use an indexed 'search' on bigtable to quickly obtain a much smaller (possibly empty) subset of its data, and then to merge in the contents of bigtable.updates on-the-fly during the query. As shown, these logical steps can be performed in one single line of UNIX shell code!
The 'search' operator is very powerful, but being written in Perl it may take a while to load. Suppose that the following conditions are met with respect to the previous example: the leftmost field of bigtable is "SURNAME," and bigtable is sorted on that field.
If the above is the case -- a common situation -- then here is a faster alternative based on the look(1) shell utility (which is about twice as fast as 'search'):
(head -1 bigtable; look Smith bigtable) |
update bigtable.updates | row 'SURNAME=="Smith"'
For added convenience, I have introduced the 'keysearch' operator, which is a header-aware front end to the look(1) utility, so that the above can be rewritten as:
keysearch Smith bigtable |
update bigtable.updates | row 'SURNAME=="Smith"'
One word of caution: fast access methods, like those provided by 'search' and 'keysearch,' may be worthwhile only if the table being searched is really big, say over several thousand records. With a 10,000-record (300 KB) table a linear search with 'grep' often remains the fastest option, even on my ancient PII-233 box. Actually, with 'grep' a lot depends on the complexity of the pattern to be matched, whether we use an extended regular expression, and so forth. Non-sequential access methods (indexed files, sorted tables) always add complication and should not be used without prior thought.
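For comparison, a plain linear scan along the lines of the previous examples can be written with 'grep' (a sketch only: 'grep' may also match the string in other columns or in the middle of a field, which is why the final 'row' filter is still needed):

(head -1 bigtable; grep Smith bigtable) |
update bigtable.updates | row 'SURNAME=="Smith"'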
As already mentioned, NoSQL works in close relationship with UNIX. Rather than hiding the underlying operating system, NoSQL leverages the facilities that the system already provides. An alternative way of dealing with large datasets, without resorting to the techniques outlined in the previous section, is to subdivide your large table into smaller parts. A large monolithic table can be made much more manageable by turning it into a tree of files. The idea is to distribute the record key space into separate files, or Key-Clusters. Thus, the file system does the work instead of the CPU.
If you are mostly concerned with quickly accessing records based on their primary key field, this may be the way to go. The path to the individual cluster (file) containing a given key will then be a hash function of the relevant key value. For instance, in a two-level clustered structure the relative path to the file containing the keys goofy, goose and goblin could be ./g/o.table. The ".table" suffix on the file name is there just for readability, but apart from that it does not serve any special purpose. I have called this way of splitting larger datasets into subfiles Path Based Clustering (PBC).
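As an illustration, with the two-level scheme just described the cluster path can be derived from the key with a couple of lines of shell. This is only a sketch; it assumes a bash-style shell for the substring syntax, and that the hash is simply the first two characters of the key:

key=goofy
cluster="./${key:0:1}/${key:1:1}.table"    # yields ./g/o.table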
Casual queries on the whole table space will of course involve extra file system overhead and will be even slower than a linear scan of a single large table. But if 90% of the time you are only interested in queries by primary key, this is the way to go. I personally take this approach very often, and it works quite well. Using one or two directory levels is usually the right compromise between individual cluster sizes and retaining the possibility of doing occasional non-key queries on the whole thing (see the 'union' operator). To attain a uniform distribution of records among the clusters, you may need to hash the file path on characters other than the first two of the key: maybe the first and the third, or possibly some other combination. It depends on the nature of the data that will be stored in that table, and some preliminary analysis and experimentation may be necessary before deciding how the tree will have to be organized.
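For the occasional non-key query mentioned above, the clusters can be recombined on the fly. For instance, assuming 'union' accepts a whole list of union-compatible tables as arguments, something along these lines would scan an entire two-level tree:

union ./*/*.table | row 'SURNAME=="Smith"'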
The need to concatenate tables comes up every so often, and, although it is simple to do, the method may not be obvious. The UNIX 'cat' command cannot be used, as it would duplicate the header and thus produce an invalid output table. And, of course, only tables with the same header, i.e. tables that are said to be "union-compatible," should be concatenated, otherwise an invalid table would result. If we have two tables, TABA and TABB, then to concatenate TABB onto the end of TABA we use the command:
union TABA TABB
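As usual the result goes to STDOUT, so to keep it the output must be redirected to a new file (TABAB below is just an example name), taking the same care as before not to overwrite a table while it is still being read:

union TABA TABB > TABAB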