Programmer's Guide to BOW
This manual documents how to install and use `libbow'
(a library of C code for statistical text processing),
version 1.0.
1. Overview
Documentation and updates for `libbow' are available at
http://www.cs.cmu.edu/~mccallum/bow
Rainbow is a C program that performs document classification using one
of several different methods, including naive Bayes, TFIDF/Rocchio,
K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's
Probabilistic Indexing, and a simple-minded form of shrinkage with
naive Bayes.
Rainbow's accompanying library, `libbow', is a library of C code
intended for support of statistical text-processing programs. The
current source distribution includes the library, a text classification
front-end (rainbow), a simple TFIDF-based document retrieval front-end
(arrow), an AltaVista-style document retrieval front-end (archer), and an
unsupported document clustering front-end with hierarchical clustering
and deterministic annealing (crossbow).
The library provides facilities for:
* Recursively descending directories, finding text files.
* Finding `document' boundaries when there are multiple docs per file.
* Tokenizing a text file, according to several different methods.
* Including N-grams among the tokens.
* Mapping strings to integers and back again, very efficiently.
* Building a sparse matrix of document/token counts.
* Pruning vocabulary by occurrence counts or by information gain.
* Building and manipulating word vectors.
* Setting word vector weights according to naive Bayes, TFIDF, and a
simple form of Probabilistic Indexing.
* Scoring queries for retrieval or classification.
* Writing all data structures to disk in a compact format.
* Reading the document/token matrix from disk in an efficient,
sparse fashion.
* Performing test/train splits, and automatic classification tests.
* Operating in server mode, receiving and answering queries over a
socket.
It is known to compile on most UNIX systems, including Linux, Solaris,
SunOS, IRIX and HP-UX. Six months ago, it compiled on Windows NT (with
a GNU build environment); it would probably work again with little
effort. Patches to the code are most welcome.
The library is relatively efficient. Reading, tokenizing and indexing the
raw text of 20,000 UseNet articles takes about 3 minutes. Building a
naive Bayes classifier from 10,000 articles and classifying the other
10,000 takes about 1 minute.
The code conforms to the GNU coding standards. It is released under the
GNU Library General Public License (LGPL).
The library does not:
* Have parsing facilities.
* Do smoothing across N-gram models.
* Claim to be finished.
* Have good documentation.
* Claim to be bug-free.
* ...many other things.
Pronunciation guide: "libbow" rhymes with "lib-low", not "lib-cow".
Notes from Devika:
How to delimit documents.
How to tag things--how to augment the lexers.
Lead in gently, steps. Big picture.... more and more interesting things
Variety of examples.
Guide to sea of command-line references. Structure.
When to consider using which switch.
Sensible defaults.
2. Traversing Directories to find Text Files
3. Getting Words from Text Files
Lexer buffers, Lexers
3.1 The Simple Lexer
3.2 The N-Gram Lexer
3.3 The Email/News Lexer
3.4 The HTML Lexer
3.5 Functions Useful for Writing Lexers
- Function: int bow_stem_porter (char *word)
- Apply the Porter stemming algorithm to modify word. Return 0 on success.
- Function: int bow_isalpha (int character)
- A function wrapper around POSIX's isalpha macro.
- Function: int bow_isgraph (int character)
- A function wrapper around POSIX's isgraph macro.
- Function: int bow_stoplist_present (const char *word)
- Return non-zero if word is on the stoplist.
- Function: int bow_stoplist_add_from_file (const char *filename)
- Add to the stoplist the whitespace-delimited words from
filename. Return the number of words added. If the file could
not be opened, return -1.
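As a rough illustration of how helpers like these fit together in a lexer, the following standalone sketch splits a string into lowercase alphabetic tokens and skips stopwords. It does not call libbow: toy_isalpha, toy_stoplist_present and toy_next_token are hypothetical stand-ins, the three-word stoplist is invented for the example, and stemming is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a function wrapper around isalpha.  A real
   function (rather than a macro) can be passed around as a pointer.
   The cast avoids undefined behavior for negative char values. */
static int toy_isalpha(int character)
{
    return isalpha((unsigned char) character);
}

/* Invented stoplist, for illustration only. */
static const char *toy_stoplist[] = { "the", "a", "of" };

/* Return non-zero if word is on the toy stoplist. */
static int toy_stoplist_present(const char *word)
{
    size_t i;
    for (i = 0; i < sizeof toy_stoplist / sizeof toy_stoplist[0]; i++)
        if (strcmp(word, toy_stoplist[i]) == 0)
            return 1;
    return 0;
}

/* Copy the next non-stopword token from *textp into buf (lowercased),
   advance *textp past it, and return the token length.  Return 0 when
   the text is exhausted.  Tokens longer than bufsize-1 are truncated. */
static size_t toy_next_token(const char **textp, char *buf, size_t bufsize)
{
    const char *p = *textp;
    assert(bufsize > 1);
    for (;;) {
        size_t len = 0;
        while (*p != '\0' && !toy_isalpha(*p))   /* skip separators */
            p++;
        if (*p == '\0') {
            *textp = p;
            return 0;
        }
        while (toy_isalpha(*p)) {                /* collect one word */
            if (len < bufsize - 1)
                buf[len++] = (char) tolower((unsigned char) *p);
            p++;
        }
        buf[len] = '\0';
        if (!toy_stoplist_present(buf)) {
            *textp = p;
            return len;
        }
    }
}
```

A caller would point a const char * at the start of the text and call toy_next_token repeatedly until it returns 0. Each returned token is what a real lexer would then hand to the word/integer mapping described in the next chapter.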
4. Mapping between Words and Integers
4.1 Generic Maps between Integers and Strings
- Type: bow_int4str
- Function: bow_int4str * bow_int4str_new (int capacity)
- Allocate, initialize and return a new int/string mapping structure. The
parameter capacity is used as a hint about the number of words to
expect; if you don't know or don't care about a capacity value,
pass 0, and a default value will be used.
- Function: const char * bow_int2str (bow_int4str *map, int index)
- Given an integer index, return its corresponding string.
- Function: int bow_str2int (bow_int4str *map, const char *string)
- Given the char-pointer string, return its integer
index. If this is the first time we're seeing string, add it to
the mapping, assign it a new index, and return the new index.
- Function: int bow_str2int_no_add (bow_int4str *map, const char *string)
- Given the char-pointer string, return its integer index. If
string is not yet in the mapping, return -1.
- Function: void bow_int4str_write (bow_int4str *map, FILE *fp)
- Write the int-str mapping to file-pointer fp.
- Function: bow_int4str * bow_int4str_new_from_fp (FILE *fp)
- Return a new int-str mapping, created by reading file-pointer fp.
- Function: bow_int4str * bow_int4str_new_from_file (const char *filename)
- Return a new int-string mapping, created by reading filename.
- Function: void bow_int4str_free (bow_int4str *map)
- Free the memory held by the int-string mapping map.
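The difference between the add-on-first-sight behavior of bow_str2int and the lookup-only behavior of bow_str2int_no_add can be sketched with a small standalone map. This is not libbow's implementation: the toy_ names are hypothetical, and the fixed-size array with linear search is chosen only to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAP_CAPACITY 64   /* arbitrary fixed size for the sketch */

typedef struct {
    char *strings[TOY_MAP_CAPACITY];  /* strings[i] is the string for index i */
    int count;                        /* number of entries in use */
} toy_int4str;

/* Return the index of string, or -1 if it is not in the map. */
static int toy_str2int_no_add(const toy_int4str *map, const char *string)
{
    int i;
    assert(map != NULL && string != NULL);
    for (i = 0; i < map->count; i++)
        if (strcmp(map->strings[i], string) == 0)
            return i;
    return -1;
}

/* Return the index of string, adding it with the next free index if it
   has not been seen before.  Return -1 if the map is full or memory
   runs out. */
static int toy_str2int(toy_int4str *map, const char *string)
{
    int index = toy_str2int_no_add(map, string);
    size_t len;
    if (index >= 0)
        return index;
    if (map->count == TOY_MAP_CAPACITY)
        return -1;
    len = strlen(string) + 1;
    map->strings[map->count] = malloc(len);
    if (map->strings[map->count] == NULL)
        return -1;
    memcpy(map->strings[map->count], string, len);
    return map->count++;
}

/* Return the string for index, or NULL if index is out of range. */
static const char *toy_int2str(const toy_int4str *map, int index)
{
    assert(map != NULL);
    if (index < 0 || index >= map->count)
        return NULL;
    return map->strings[index];
}

/* Release every string held by the map and reset it to empty. */
static void toy_int4str_free(toy_int4str *map)
{
    int i;
    for (i = 0; i < map->count; i++)
        free(map->strings[i]);
    map->count = 0;
}
```

In this sketch indices are handed out densely in order of first appearance, so the first new string gets 0, the next gets 1, and so on. Whether libbow numbers its entries the same way is not stated in this manual.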
4.2 The Global Dictionary
- Function: const char * bow_int2word (int wi)
- Given a "word index" wi, return its word, according to the global
word-int mapping.
- Function: int bow_word2int (const char *word)
- Given a word, return its "word index," according to the global
word-int mapping; if it's not yet in the mapping, add it.
- Function: int bow_word2int_add_occurrence (const char *word)
- Like bow_word2int(), except it also increments the occurrence
count associated with word.
- Variable: int bow_word2int_do_not_add
- If this is non-zero, then bow_word2int() will return -1 when
asked for the index of a word that is not already in the mapping.
Essentially, setting this to non-zero makes bow_word2int() and
bow_word2int_add_occurrence() behave like bow_str2int_no_add().
- Function: int bow_words_add_occurrences_from_text_dir (const char *dirname, const char *exception_name)
- Add to the word occurrence counts by recursively descending directory
dirname and lexing all the text files; skip any files matching
exception_name.
- Function: int bow_words_occurrences_for_wi (int wi)
- Return the number of times bow_word2int_add_occurrence() was
called with the word whose index is wi.
- Function: void bow_words_set_map (bow_int4str *map, int free_old_map)
- Replace the current word/int mapping with map.
- Function: void bow_words_remove_occurrences_less_than (int occur)
- Modify the int/word mapping by removing all words that occurred fewer
than occur times. WARNING: This totally changes the
word/int mapping; any wv's, wi2dvf's or barrel's
you built with the old mapping will have bogus word indices afterward.
- Function: int bow_num_words ()
- Return the total number of unique words in the int/word map.
- Function: void bow_words_write (FILE *fp)
- Save the int/word map to file-pointer fp.
- Function: void bow_words_write_to_file (const char *filename)
- Same as above, but with a filename instead of a FILE*.
- Function: void bow_words_read_from_fp (FILE *fp)
- Read the int/word map from file-pointer fp.
- Function: void bow_words_read_from_file (const char *filename)
- Same as above, but with a filename instead of a FILE*.
- Function: void bow_words_reread_from_file (const char *filename, int force_update)
- Same as above, but don't bother rereading unless filename is different
from the last one, or force_update is non-zero.
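The warning attached to bow_words_remove_occurrences_less_than is easier to see with a concrete case. The standalone sketch below keeps a tiny vocabulary with occurrence counts and prunes it by compacting the surviving words. The toy_ names are hypothetical, and compaction is an assumption about how pruning might renumber words, not a description of libbow's internals.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_WORDS 16   /* arbitrary fixed size for the sketch */

typedef struct {
    const char *words[TOY_MAX_WORDS];  /* words[wi] is the word for index wi */
    int counts[TOY_MAX_WORDS];         /* counts[wi] is its occurrence count */
    int num_words;
} toy_vocab;

/* Return the index of word, adding it if needed, and count one
   occurrence.  Return -1 if the vocabulary is full.  The caller keeps
   ownership of the word string. */
static int toy_word2int_add_occurrence(toy_vocab *v, const char *word)
{
    int wi;
    assert(v != NULL && word != NULL);
    for (wi = 0; wi < v->num_words; wi++)
        if (strcmp(v->words[wi], word) == 0)
            break;
    if (wi == v->num_words) {
        if (v->num_words == TOY_MAX_WORDS)
            return -1;
        v->words[wi] = word;
        v->counts[wi] = 0;
        v->num_words++;
    }
    v->counts[wi]++;
    return wi;
}

/* Drop every word seen fewer than occur times, compacting the
   survivors so that indices stay dense.  Surviving words may end up
   with NEW indices. */
static void toy_remove_occurrences_less_than(toy_vocab *v, int occur)
{
    int old_wi, new_wi = 0;
    for (old_wi = 0; old_wi < v->num_words; old_wi++) {
        if (v->counts[old_wi] >= occur) {
            v->words[new_wi] = v->words[old_wi];
            v->counts[new_wi] = v->counts[old_wi];
            new_wi++;
        }
    }
    v->num_words = new_wi;
}
```

If "rare" is added once and "common" twice, "common" has index 1 before pruning with a threshold of 2 and index 0 afterward. Any structure that stored the old index 1 would then refer to a word that no longer exists at that position, which is the kind of bogus index the warning describes.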
5. Word Vectors
5.1 Creating a Word Vector from a Text File
5.2 Writing and Reading Word Vectors as Data Files
6. Vectors of Documents
- Type: bow_dv
7. A Matrix of Document/Word Statistics
- Type: bow_dvf
- Type: bow_wi2dvf
8. Document/Word Models
- Type: bow_barrel
9. Vector-per-Class Models
10. Arrays of Structures
10.1 Arrays indexed by integers
10.2 Arrays indexed by strings
11. Command-line argument processing with Argp
Table of Contents
1. Overview
2. Traversing Directories to find Text Files
3. Getting Words from Text Files
3.1 The Simple Lexer
3.2 The N-Gram Lexer
3.3 The Email/News Lexer
3.4 The HTML Lexer
3.5 Functions Useful for Writing Lexers
4. Mapping between Words and Integers
4.1 Generic Maps between Integers and Strings
4.2 The Global Dictionary
5. Word Vectors
5.1 Creating a Word Vector from a Text File
5.2 Writing and Reading Word Vectors as Data Files
6. Vectors of Documents
7. A Matrix of Document/Word Statistics
8. Document/Word Models
9. Vector-per-Class Models
10. Arrays of Structures
10.1 Arrays indexed by integers
10.2 Arrays indexed by strings
11. Command-line argument processing with Argp
This document was generated by root on April 9, 2005 using texi2html.