[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
`libbow', version 1.0.
Documentation and updates for `libbow' are available at http://www.cs.cmu.edu/~mccallum/bow
Rainbow is a C program that performs document classification using one of several different methods, including naive Bayes, TFIDF/Rocchio, K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's Probabilitistic Indexing, and a simple-minded form a shrinkage with naive Bayes.
Rainbow's accompanying library, `libbow', is a library of C code intended for support of statistical text-processing programs. The current source distribution includes the library, a text classification front-end (rainbow), a simple TFIDF-based document retrieval front-end (arrow), an AltaVista-style document retrieval front-end (archer), and a unsupported document clustering front-end with hierarchical clustering and deterministic annealing (crossbow).
The library provides facilities for: * Recursively descending directories, finding text files. * Finding `document' boundaries when there are multiple docs per file. * Tokenizing a text file, according to several different methods. * Including N-grams among the tokens. * Mapping strings to integers and back again, very efficiently. * Building a sparse matrix of document/token counts. * Pruning vocabulary by occurrence counts or by information gain. * Building and manipulating word vectors. * Setting word vector weights according to NaiveBayes, TFIDF, and a simple form of Probabilistic Indexing. * Scoring queries for retrieval or classification. * Writing all data structures to disk in a compact format. * Reading the document/token matrix from disk in an efficient, sparse fashion. * Performing test/train splits, and automatic classification tests. * Operating in server mode, receiving and answering queries over a socket. |
It is relatively efficient. Reading, tokenizing and indexing the raw text of 20,000 UseNet articles takes about 3 minutes. Building a naive Bayes classifier from 10,000 articles, and classifying the other 10,000 takes about 1 minute.
The code conforms to the GNU coding standards. It is released under the Library GNU Public License (LGPL).
The library does not: Have parsing facilities. Do smoothing across N-gram models. Claim to be finished. Have good documentation. Claim to be bug-free. ...many other things. |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
`Rainbow' is a standalone program that does document classification. Here are some examples:
rainbow -i ./training/positive ./training/negative |
Using the text files found under the directories `./positive' and `./negative', tokenize, build word vectors, and write the resulting data structures to disk.
rainbow --query=./testing/254 |
Tokenize the text document `./testing/254', and classify it, producing output like:
/home/mccallum/training/positive 0.72 /home/mccallum/training/negative 0.28 |
rainbow --test-set=0.5 -t 5 |
Perform 5 trials, each consisting of a new random test/train split and outputs of the classification of the test documents.
Typing `rainbow --help' will give list of all rainbow options.
After you have compiled `libbow' and `rainbow', you can run the shell script `./demo/script' to see an annotated demonstration of the classifier in action.
More information and documentation is available at http://www.cs.cmu.edu/~mccallum/bow
Rainbow improvements coming eventually: Better documentation. Incremental model training. |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
`Arrow' is a standalone program that does document retrieval by TFIDF.
Index all the documents in directory `foo' by typing
arrow --index foo |
Make a single query by typing
arrow --query |
then typing your query, and pressing Control-D.
If you want to make many queries, it will be more efficient to run arrow as a server, and query it multiple times without restarts by communicating through a socket. Type, for example,
arrow --query-server=9876 |
And access it through port number 9876. For example:
telnet localhost 9876 |
In this mode there is no need to press Control-D to end a query. Simply type your query on one line, and press return.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
`Crossbow' is a standalone program that does document clustering. Sorry, there is no documentation yet.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
`Archer' is a standalone program that does document retrieval with AltaVista-type queries, using +, -, "", etc. The commands in the "arrow" examples above also work for archer. See "archer --help" for more information.
[Top] | [Contents] | [Index] | [ ? ] |
1. Bag Of Words Library README
1.1 Rainbow
1.2 Arrow
1.3 Crossbow
1.4 Archer
[Top] | [Contents] | [Index] | [ ? ] |
1. Bag Of Words Library README
[Top] | [Contents] | [Index] | [ ? ] |
Button | Name | Go to | From 1.2.3 go to |
---|---|---|---|
[ < ] | Back | previous section in reading order | 1.2.2 |
[ > ] | Forward | next section in reading order | 1.2.4 |
[ << ] | FastBack | beginning of this chapter or previous chapter | 1 |
[ Up ] | Up | up section | 1.2 |
[ >> ] | FastForward | next chapter | 2 |
[Top] | Top | cover (top) of document | |
[Contents] | Contents | table of contents | |
[Index] | Index | concept index | |
[ ? ] | About | this page |
where the Example assumes that the current position is at Subsubsection One-Two-Three of a document of the following structure: