SWISH-CONFIG - Configuration File Directives

Swish-e version 2.4.5

OVERVIEW

This document lists the configuration directives available in Swish-e.

CONFIGURATION FILE

A configuration file controls which files Swish-e indexes, how they are indexed, and where the index is written.

The configuration file is a text file composed of comments, blank lines, and configuration directives. The order of the directives is not important. Some directives may be used more than once in the configuration file, while others can only be used once (i.e. additional directives will overwrite preceding directives). Directive names are not case sensitive -- you may use upper, lower, or mixed case.

A comment is any line that begins with a "#".

    # This is a comment

As of 2.4.3, lines may be continued by placing a backslash as the last character on the line:

    IgnoreWords \
        am \
        the \
        foo
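The comment and continuation rules above can be sketched in a few lines of Python. This is an illustration only, not Swish-e's own parser. The name logical_lines is hypothetical, and two details are assumptions the text does not spell out: whitespace before "#" is ignored, and a continued line is joined to the next one without inserting or removing whitespace.

```python
def logical_lines(text):
    """Yield directive lines with continuations joined; skip comments and blank lines."""
    pending = ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            # Backslash is the last character: drop it and keep collecting.
            pending += raw[:-1]
            continue
        line = (pending + raw).strip()
        pending = ""
        if not line or line.startswith("#"):
            continue  # blank line or comment
        yield line
    if pending.strip():
        # The input ended on a continued line; emit what was collected.
        yield pending.strip()


example = """# stop words
IgnoreWords \\
    am \\
    the \\
    foo
"""

for line in logical_lines(example):
    print(line.split())  # prints ['IgnoreWords', 'am', 'the', 'foo']
```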

Directives may take more than one parameter. Enclose single parameters that include whitespace in quotes (single or double). Inside of quotes the backslash escapes the next character.

    ReplaceRules append "foo bar"   <- define "foo bar" as a single parameter

If you need to include a quote character in the value either use a backslash to escape it, or enclose it in quotes of the other type.
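As a rough sketch of the quoting rules just described, the following Python function splits one directive line into parameters. It is not taken from the Swish-e source; split_params is a hypothetical name, and the sketch assumes that a backslash outside of quotes is kept literally (which matches the unquoted regular expression example in this document).

```python
def split_params(line):
    """Split one directive line into parameters, honoring quotes and backslash escapes."""
    params, current, quote, started = [], [], None, False
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])  # inside quotes, backslash escapes the next character
                i += 1
            elif ch == quote:
                quote = None                 # closing quote
            else:
                current.append(ch)           # includes quotes of the other type
        elif ch in "'\"":
            quote, started = ch, True        # opening quote
        elif ch.isspace():
            if started:
                params.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)               # assumption: backslash is literal outside quotes
            started = True
        i += 1
    if quote:
        raise ValueError("unterminated quote")
    if started:
        params.append("".join(current))
    return params


print(split_params('ReplaceRules append "foo bar"'))
# Two ways to put a double quote inside a value:
print(split_params(r'ReplaceRules append "say \"hi\""'))
print(split_params("ReplaceRules append 'say \"hi\"'"))
```

Both of the last two calls yield the same third parameter, say "hi", once with a backslash escape and once by switching to single quotes.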

Backslashes also have special meaning in regular expressions.

    FileFilterMatch pdftotext "'%p' -" /\.pdf$/

Here the backslash says that the dot is a real dot (instead of matching any character). If you place the regular expression in quotes then you must use double-backslashes.

    FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"

Swish-e will convert the double backslash into a single backslash before passing the parameter to the regular expression compiler.
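The effect of the two forms can be checked with a short Python sketch. Python's re module stands in for the regular expression compiler Swish-e uses, which is an assumption made for illustration; for a pattern this simple the behavior should be the same. The file names are made up.

```python
import re

quoted_in_config = r"/\\.pdf$/"    # the text between the double quotes
unquoted_in_config = r"/\.pdf$/"   # the unquoted form

# Inside quotes a backslash escapes the next character, so "\\" collapses to "\".
after_unquoting = re.sub(r"\\(.)", r"\1", quoted_in_config)
print(after_unquoting == unquoted_in_config)   # prints True

# Drop the surrounding "/" delimiters before compiling.
literal_dot = re.compile(after_unquoting[1:-1])  # \.pdf$
any_char = re.compile(".pdf$")                   # what you get if the backslash is lost

print(bool(literal_dot.search("manual.pdf")))  # prints True
print(bool(literal_dot.search("manualpdf")))   # prints False
print(bool(any_char.search("manualpdf")))      # prints True: "." matched the "l"
```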

Commented example configuration files are included in the conf directory of the Swish-e distribution.

Some command line arguments can override directives specified in the configuration file. Please see also the SWISH-RUN page for instructions on running Swish-e, and the SWISH-SEARCH page for information and examples on how to search your index.

The configuration file is specified to Swish-e by the -c switch. For example,

    swish-e -c myconfig.conf

You may also split your directives up into different configuration files. This allows you to have a master configuration file used for many different indexes, and smaller configuration files for each separate index. You can specify the different configuration files when running from the command line with the -c switch (see SWISH-RUN), or you may include other configuration files with the IncludeConfigFile directive below.
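One way to picture the master-plus-per-index layout is a small Python helper that generates the per-index file. This is only a sketch: the function name, the file names, and the directory are hypothetical, and it assumes IndexFile is the directive that names the index file to write (check the alphabetical listing before relying on it).

```python
def render_index_config(name, master, index_dir):
    """Return the text of a small per-index config that pulls in a shared master file."""
    lines = [
        f"# Settings for the '{name}' index only",
        f"IncludeConfigFile {master}",    # shared directives live in the master file
        f"IndexDir {index_dir}",          # what to index for this particular index
        f"IndexFile {name}.index",        # assumption: names the index file to write
    ]
    return "\n".join(lines) + "\n"


# Hypothetical names; paths containing whitespace would need quoting.
print(render_index_config("docs", "master.conf", "/var/www/docs"))
```

The generated text would be saved as, say, docs.conf and passed to Swish-e with the -c switch.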

Typically, in a configuration file the directives are grouped together in some logical order -- that is, directives that control the source of the documents would be grouped together first, and directives that control how each document is filtered or its words are indexed in another group of directives. (The directives listed below are grouped in this order.)

The configuration file directives are listed below in these groups:

Alphabetical Listing of Directives

Directives that Control Swish

These configuration directives control the general behavior of Swish-e.

NOTE: The following items are currently not available. These items require Swish-e to parse the configuration file while searching.

Administrative Headers Directives

Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:

Document Source Directives

These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

Document Contents Directives

These directives control what information is extracted from your source documents, and how that information is made available during searching.

Directives for the File Access method only

Some directives have different uses depending on the source of the documents. These directives are only valid when using the File system method of indexing.

Directives for the HTTP Access Method Only

The HTTP Access method is enabled by the "-S http" switch when indexing. It works by running a Perl program called SwishSpider which fetches documents from a web server.

By default, only text files (content-type of "text/*") are indexed with the HTTP Access Method. Other document types (e.g. PDF or MSWord) may be indexed as well: the SwishSpider will attempt to make use of the SWISH::Filter module (included with the Swish-e distribution) to convert such documents into a format that Swish-e can index.

Note: The -S prog method of spidering (using spider.pl) can be a replacement for the -S http method. It offers more configuration options and better spidering speed.

The directives below are available when using the HTTP Access Method of indexing.

Directives for the prog Access Method Only

This section details the directives that are only available for the "prog" document source feature of Swish-e. The "prog" access method runs an external program that "feeds" documents to Swish-e. This allows indexing and filtering of documents from any source.

See prog - general purpose access method in the SWISH-RUN man page for more information.

A number of example programs for use with the "prog" access method are provided in the prog-bin directory. Please see those examples if you have questions about implementing a "prog" input program.
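To give a feel for what a "prog" input program does, here is a minimal Python sketch that formats one document for Swish-e. The header names (Path-Name, Content-Length, Last-Mtime) and the blank line before the content are recalled from the SWISH-RUN page rather than stated in this document, so treat them as assumptions and verify them there. The function name and sample document are hypothetical.

```python
def format_document(path_name, content, last_mtime=None):
    """Return one document as bytes: header lines, a blank line, then the content."""
    body = content.encode("utf-8")
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(body)}",   # length in bytes, not characters
    ]
    if last_mtime is not None:
        headers.append(f"Last-Mtime: {int(last_mtime)}")  # seconds since the epoch
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


# Preview of what would be fed to Swish-e for one document.
print(format_document("docs/hello.txt", "Hello, Swish-e!\n").decode("utf-8"))
```

A real feeder would write the bytes for each document to sys.stdout.buffer, one after another, so that the declared length matches what Swish-e reads.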

Notes when using MS Windows

You should use unix style path separators to specify your external program. Swish-e will convert forward slashes to backslashes before calling the external program. This is only true for the program name specified with IndexDir or the -i command line option.

In addition, Swish-e will make sure the program specified actually exists, which means you need to use the full name of the program.

For example, to run the perl spider program spider.pl you would need a Swish-e configuration file such as:

    IndexDir e:/perl/bin/perl.exe
    SwishProgParameters prog-bin/spider.pl default http://swish-e.org

and run indexing with the command:

    swish-e -c swish.cfg -S prog -v 9

The IndexDir command tells Swish-e the name of the program to run. Under unix you can just specify the name of the script, since unix will figure out the program from the first line of the script.

The SwishProgParameters are the parameters passed to the program specified by IndexDir (perl.exe in this case). The first parameter is the perl script to run (prog-bin/spider.pl). Perl passes the rest of the parameters directly to the perl script. The second parameter, "default", tells the spider.pl program to use default settings for spidering (or you could specify a spider config file -- see perldoc spider.pl for details), and the last parameter is the URL passed to the spider program.
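The way the two directives combine into one command can be sketched in Python. This is an illustration of the description above, not Swish-e's code: build_prog_command is a hypothetical name, and splitting the parameters on whitespace ignores any quoting a real configuration might use.

```python
def build_prog_command(index_dir, prog_parameters, windows=False):
    """Combine IndexDir and SwishProgParameters into an argument list."""
    # Only the program name has its forward slashes converted (MS Windows only).
    program = index_dir.replace("/", "\\") if windows else index_dir
    # Simplification: parameters are split on whitespace, with no quote handling.
    return [program] + prog_parameters.split()


command = build_prog_command(
    "e:/perl/bin/perl.exe",
    "prog-bin/spider.pl default http://swish-e.org",
    windows=True,
)
print(command)
```

Note that prog-bin/spider.pl keeps its forward slashes, since it is a parameter rather than the program name.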

Document Filter Directives

Internally, Swish-e knows how to parse only text, HTML, and XML documents. With "filters" you can index other types of documents. For example, if all your web pages are in gzip format a filter can uncompress these on the fly for indexing.

You may wish to read the Swish-e FAQ question on filtering, "How Do I filter documents?", before continuing here.

There are two suggested methods for filtering.

Filtering with SWISH::Filter

The Swish-e distribution includes a Perl module called SWISH::Filter and individual filters located in the filters directory. This system uses plug-in filters to extend the types of documents that Swish-e can index. The plug-in filters do not actually do the filtering, but rather provide a standard interface for accessing programs that can filter or convert documents. The programs that do the filtering are not part of the Swish-e distribution; they must be downloaded and installed separately.

The advantage of this method is that new filtering methods can be installed easily.

This system is designed to work with the -S http and -S prog methods, but may also be used with the FileFilter feature and -S fs indexing method. See $prefix/share/doc/swish-e/examples/filter-bin/swish_filter.pl for an example.

See the filters/README file for more information.

Filtering with the FileFilter feature

A filter is an external program that Swish-e executes while processing a document of a given type. Swish-e will execute the filter program for each file that matches the file suffix (extension) set in the FileFilter or FileFilterMatch directives. FileFilterMatch matches using regular expressions and is described below.

Filters may be used with any type of input method (i.e. -S fs, -S http, or -S prog).

Swish-e calls the external program passing as default arguments:

Swish-e can also pass other parameters to the filter program. These parameters can be defined using the FileFilter or FileFilterMatch directives. See Filter Options below.

The filter program must open the file, process its contents, and return it to Swish-e by printing to STDOUT.

Note that this can add a significant amount of time to the indexing process if your external program is a perl or shell script. If you have many files to filter you should consider writing your filter in C instead of a shell or perl script, or using the "prog" Access Method along with SWISH::Filter.
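A filter for the gzip case mentioned at the start of this section might look like the following Python sketch. It assumes the file to read arrives as the first command-line argument; the script name gunzip_filter.py is hypothetical. Being a script, it carries the same per-file startup cost as the perl and shell filters discussed above.

```python
import gzip
import shutil
import sys


def filter_file(path, out):
    """Decompress the gzip file at `path` and copy the result to the binary stream `out`."""
    with gzip.open(path, "rb") as src:
        shutil.copyfileobj(src, out)


# Assumption: Swish-e passes the file to read as the first argument.
# A configuration line modeled on the FileFilterMatch example in this document
# (hypothetical install path) might be:
#   FileFilterMatch /usr/local/bin/gunzip_filter.py "'%p'" /\.gz$/
if __name__ == "__main__" and len(sys.argv) > 1:
    # Return the filtered content to Swish-e by writing it to STDOUT.
    filter_file(sys.argv[1], sys.stdout.buffer)
```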

Document Info

$Id: SWISH-CONFIG.pod,v 1.91 2006/10/20 20:18:30 whmoseley Exp $
