Class lucene_fileindexer

Description

The lucene file indexer class.

This class indexes files on disc, either one by one or as a whole file hierarchy tree.

Located in /lucene-defs.php (line 1421)

Variable Summary
 mixed $application
 mixed $host
 mixed $idoffset
 mixed $idprefix
 mixed $idsource
 mixed $indexfields
 mixed $ixid
 mixed $lockfile
 mixed $metascan
 mixed $meta_fields
 mixed $port
 mixed $timeoutsecs
 mixed $timer
Method Summary
 lucene_fileindexer lucene_fileindexer ([string $application = "?"], [string $host = ""], [string $port = ""])
 void avoid_lockfile (string $lockfile, integer $wait_secs)
 void define_field (string $fieldname, string $type, [boolean $stored = STORED], [boolean $indexed = INDEXED])
 void id_generate ([integer $idsource = ID_FROM_INC], [mixed $pfxofs = ""])
 void index_field (string $fieldname, string $fieldvalue)
 void index_file (string $path, string $id, [mixed $fields = false])
 void index_tree (string $path, [string $patt = ""], [string $restart = ""], [string $lockfile = ""], integer $wait_secs)
 void meta_field (string $fieldname, string $type)
 void noscantags ()
 void scantags ()
Variables
mixed $application = "" (line 1424)

Application we are indexing for

mixed $field_definitions = array() (line 1447)

Index fields definitions array. Contains definitions

mixed $host = "" (line 1426)

Host to connect to

mixed $idoffset = 0 (line 1455)

ID generation offset

mixed $idprefix = "" (line 1458)

ID generation prefix

mixed $idsource = ID_FROM_INC (line 1436)

ID generation source

mixed $indexfields = array() (line 1452)

Fields for indexing. This is an array of fieldname/value

mixed $ixid (line 1433)

The index ID

mixed $lockfile = "" (line 1471)

Path to a lockfile we should give way to. If this value

mixed $lockfile_wait_secs = 0 (line 1474)

Number of seconds to wait on a lockfile. If zero, wait forever.

mixed $lucene_indexer (line 1461)

The index object which does the work

mixed $metascan = true (line 1439)

Scan for meta tags as fields in file content. Recommended.

mixed $meta_fields = array() (line 1443)

Meta fields definitions array. Contains definitions

mixed $port = "" (line 1428)

Port to connect to

mixed $timeoutsecs = "" (line 1465)

Timeout for indexing commands in seconds (can usually leave

mixed $timer (line 1477)

Indexing execution timer

Methods
Constructor lucene_fileindexer (line 1486)

Constructor

Create a new lucene indexer

lucene_fileindexer lucene_fileindexer ([string $application = "?"], [string $host = ""], [string $port = ""])
  • string $application: Application name
  • string $host: Hostname or IP of Lucene server
  • string $port: Port of Lucene server
avoid_lockfile (line 1522)

Define a lockfile which we must avoid during indexing. If defined then no indexing will take place while the lockfile exists. The second parameter allows you to specify a limit to the patience of this process, in seconds. Zero means wait forever.

void avoid_lockfile (string $lockfile, integer $wait_secs)
  • string $lockfile: Path to the lockfile. Nullstring = not defined
  • integer $wait_secs: Time to wait for lockfile. Zero means forever.
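
The waiting behaviour described above can be sketched as a small polling loop. This is only an illustration under stated assumptions, not the code in lucene-defs.php: the function name wait_for_lockfile_release() and the $poll_secs parameter are hypothetical.

```php
<?php
// Hypothetical helper, not part of lucene_fileindexer.
// Returns true once the lockfile is gone (or none is defined),
// false if $wait_secs elapsed while the lockfile still existed.
// $wait_secs = 0 means wait forever, as documented for avoid_lockfile().
function wait_for_lockfile_release($lockfile, $wait_secs, $poll_secs = 1) {
    if ($lockfile === "") {
        return true; // nullstring: no lockfile defined
    }
    $waited = 0;
    while (true) {
        clearstatcache(); // file_exists() results are cached by PHP
        if (!file_exists($lockfile)) {
            return true;
        }
        if ($wait_secs > 0 && $waited >= $wait_secs) {
            return false; // patience exhausted
        }
        sleep($poll_secs);
        $waited += $poll_secs;
    }
}
```

A caller would skip or abort indexing when the helper returns false.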
define_field (line 1507)

Define a field. We supply the name of the field, its type (Text, Date or Id), and whether it should be stored and/or indexed by Lucene. Stored fields can be retrieved later in queries. For example, you would not store the raw document content, as this is usually stored elsewhere.

IMPORTANT NOTE: Fields defined here will automatically be included as meta fields.

  • see: meta_field()
void define_field (string $fieldname, string $type, [boolean $stored = STORED], [boolean $indexed = INDEXED])
  • string $fieldname: Name of the field to index
  • string $type: Type of field data: Text, Date or Id.
  • boolean $stored: If true then Lucene will store the content itself
  • boolean $indexed: If true then Lucene will index the field content
id_generate (line 1572)

Set the source for ID generation. Since we are indexing a bunch of files, the IDs have to be generated on demand inside the loop. We provide several ways to do this here, and you can extend this class to provide more if required.

Main ways:

  • ID_FROM_INC: Increment a counter by 1 each time (with offset)
  • ID_FROM_NAME: Take the filename, strip the extension, add prefix
  • ID_FROM_FILENAME: Take the full filename, add prefix
  • ID_FROM_PATH: Take the full file path

NB: These are all defined as integer constants.

void id_generate ([integer $idsource = ID_FROM_INC], [mixed $pfxofs = ""])
  • integer $idsource: Source of ID generation
  • mixed $pfxofs: String prefix, or integer offset
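
As a rough illustration of the four ID sources, the sketch below derives an ID from a file path. It is an assumption-laden reading of the descriptions on this page, not the class's own code: sketch_make_id() is a hypothetical helper, the constant values are stand-ins (the real values are not given here), and $n is assumed to be the zero-based count of files already indexed.

```php
<?php
// Stand-in values: the real integer constants are defined in
// lucene-defs.php and their values are not shown in this documentation.
if (!defined('ID_FROM_INC'))      define('ID_FROM_INC', 0);
if (!defined('ID_FROM_NAME'))     define('ID_FROM_NAME', 1);
if (!defined('ID_FROM_FILENAME')) define('ID_FROM_FILENAME', 2);
if (!defined('ID_FROM_PATH'))     define('ID_FROM_PATH', 3);

// Hypothetical helper mirroring the modes listed above.
function sketch_make_id($idsource, $path, $pfxofs, $n) {
    switch ($idsource) {
        case ID_FROM_NAME:
            // Filename without directory or extension, with prefix.
            return $pfxofs . pathinfo($path, PATHINFO_FILENAME);
        case ID_FROM_FILENAME:
            // Filename including extension, with prefix.
            return $pfxofs . basename($path);
        case ID_FROM_PATH:
            // Full path, used as-is.
            return $path;
        case ID_FROM_INC:
        default:
            // Counter starts at the numeric offset if given, else at 1.
            $start = is_numeric($pfxofs) ? (int)$pfxofs : 1;
            return (string)($start + $n);
    }
}
```

For example, sketch_make_id(ID_FROM_NAME, "docs/intro.html", "doc_", 0) gives "doc_intro".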
index_field (line 1554)

Supply field content for indexing. This causes Lucene to take the given fieldname and index the given value against it.

The field name can have the field type included in the form 'Foo:Date', where 'Date' is the type in this instance. In fact, since 'Text' is the default field type, 'Date' is probably the only one you need to use as the current implementation stands.

void index_field (string $fieldname, string $fieldvalue)
  • string $fieldname: Name of the field to index.
  • string $fieldvalue: Content of the field to index
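
The 'Foo:Date' naming convention can be illustrated with a short parser. This is a minimal sketch, assuming the type is whatever follows the first colon and that 'Text' applies when no type is given; sketch_split_fieldname() is a hypothetical name, not a method of this class.

```php
<?php
// Hypothetical helper: splits 'Name:Type' into array(name, type).
// Falls back to 'Text', the documented default field type.
function sketch_split_fieldname($fieldname) {
    $parts = explode(":", $fieldname, 2);
    $name = $parts[0];
    $type = (count($parts) == 2 && $parts[1] !== "") ? $parts[1] : "Text";
    return array($name, $type);
}
```

With this reading, 'Created:Date' yields the name 'Created' with type 'Date', while a plain 'Title' yields type 'Text'.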
index_file (line 1612)

Index a file located at the given path, using given ID.

You can also use the parameter $fields to supply an array of fieldname/value pairs to index with this file, for one-off indexing of files. If the fieldname is a date field, make sure to define the name as 'Foo:Date', to cause the field definition to be correct.

void index_file (string $path, string $id, [mixed $fields = false])
  • string $path: Path to the file to index
  • string $id: ID to associate with the indexed file content
  • mixed $fields: Array of field/values to index with file
index_tree (line 1773)

Index a tree of files starting at the path given. We index these in one of four modes, which determines how we generate the ID for each item. The mode and its prefix (or offset) are set with id_generate():

  • 'ID_FROM_INC' mode uses an incremental counter starting at 1. If the prefix holds a number, the counter will start at this number instead of one. Each item has an ID incremented by one from the last one.
  • 'ID_FROM_NAME' mode uses the filename, stripped of any path and extension, as the ID. If the prefix is not a nullstring, then it is prefixed to every filename ID.
  • 'ID_FROM_FILENAME' mode uses the filename, including any extension, as the ID. If the prefix is not a nullstring, then it is prefixed to every filename ID.
  • 'ID_FROM_PATH' mode uses the full path to the item being indexed as the ID. If the prefix is not a nullstring, then it is prefixed to every filename ID.

The file will simply be indexed as a single Text field, with the appropriate ID, and no other index fields unless $metascan is set to TRUE. If this is the case, the system will scan the file for HTML meta tags of the form '<meta name="foo" content="bar">'. In this example a field of name 'foo' would be given value 'bar'.

void index_tree (string $path, [string $patt = ""], [string $restart = ""], [string $lockfile = ""], integer $wait_secs)
  • string $path: Path to the head of the file tree to index
  • string $patt: Pattern to match, eg. '*.html'
  • string $restart: If equal to "restart" then treat $path as file of paths
  • string $lockfile: If path is set, we idle whilst this file exists
  • integer $wait_secs: Time to wait for lockfile. Zero means forever.
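
To make the meta tag scan concrete, here is a minimal sketch that extracts name/content pairs from HTML. It is not the scanner used by this class: sketch_scan_meta_tags() is a hypothetical helper, and the regular expression only handles the exact double-quoted, name-then-content form quoted above.

```php
<?php
// Hypothetical helper: returns array(name => content) for tags of the
// form <meta name="foo" content="bar">. Other attribute orders or
// quoting styles are deliberately not handled in this sketch.
function sketch_scan_meta_tags($html) {
    $fields = array();
    $re = '/<meta\s+name="([^"]+)"\s+content="([^"]*)"\s*\/?>/i';
    if (preg_match_all($re, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $fields[$m[1]] = $m[2]; // later duplicates overwrite earlier ones
        }
    }
    return $fields;
}
```

For the example tag above, the result would hold a single entry with key 'foo' and value 'bar'.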
meta_field (line 1540)

Define a field as a meta tag. This ensures that the field will be picked up from the file meta tags, if present. If it is not listed here then it will be ignored.

IMPORTANT NOTE: We define the strict rule that ONLY fields which have been defined here can be added to the indexing via the meta tag scanning. That is, you must define fields here explicitly, or via the define_field() method, or they will be ignored even if they turn up as a meta tag. This is so we can restrict the indexing, and be sure of field types.

void meta_field (string $fieldname, string $type)
  • string $fieldname: Name of the field to process as meta tag
  • string $type: Type of field data: Text, Date or Id.
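
The strict rule above amounts to a whitelist filter over whatever the tag scan finds. The sketch below shows one way to express it, assuming scanned tags arrive as a name => content array and defined meta fields as a name => type array; both shapes and the name sketch_filter_meta() are assumptions for illustration.

```php
<?php
// Hypothetical helper: keeps only scanned tags whose names were
// explicitly defined as meta fields; everything else is dropped.
function sketch_filter_meta($scanned, $meta_fields) {
    return array_intersect_key($scanned, $meta_fields);
}
```

Given scanned tags 'author' and 'robots' with only 'author' defined, just the 'author' entry survives.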
noscantags (line 1598)

Flag that we should NOT do a tag scan on the content of the files.

void noscantags ()
scantags (line 1591)

Flag that we should do a tag scan on the content of the files to try to extract fields to index. Note that any tags thus found will only be used if the field name has been defined with the method define_field(). The scan considers both the <title> tag and <meta> tags.

void scantags ()

Documentation generated by phpDocumentor 1.3.0RC3