
RuleBasedBreakIterator Class Reference

#include <rbbi.h>

Inheritance diagram for RuleBasedBreakIterator:

BreakIterator <- RuleBasedBreakIterator <- DictionaryBasedBreakIterator

Public Methods

 RuleBasedBreakIterator (const RuleBasedBreakIterator &that)
 Copy constructor. More...

virtual ~RuleBasedBreakIterator ()
 Destructor.

RuleBasedBreakIterator & operator= (const RuleBasedBreakIterator &that)
 Assignment operator. More...

virtual UBool operator== (const BreakIterator &that) const
 Equality operator. More...

UBool operator!= (const BreakIterator &that) const
 Not-equal operator. More...

virtual BreakIterator * clone (void) const
 Returns a newly-constructed RuleBasedBreakIterator with the same behavior, and iterating over the same text, as this one.

virtual int32_t hashCode (void) const
 Compute a hash code for this BreakIterator. More...

virtual const UnicodeString & getRules (void) const
 Returns the description used to create this iterator.

virtual const CharacterIterator & getText (void) const
 Return a CharacterIterator over the text being analyzed. More...

virtual void adoptText (CharacterIterator *newText)
 Set the iterator to analyze a new piece of text. More...

virtual void setText (const UnicodeString &newText)
 Set the iterator to analyze a new piece of text. More...

virtual int32_t first (void)
 Sets the current iteration position to the beginning of the text. More...

virtual int32_t last (void)
 Sets the current iteration position to the end of the text. More...

virtual int32_t next (int32_t n)
 Advances the iterator either forward or backward the specified number of steps. More...

virtual int32_t next (void)
 Advances the iterator to the next boundary position. More...

virtual int32_t previous (void)
 Advances the iterator backwards, to the last boundary preceding this one. More...

virtual int32_t following (int32_t offset)
 Sets the iterator to refer to the first boundary position following the specified position. More...

virtual int32_t preceding (int32_t offset)
 Sets the iterator to refer to the last boundary position before the specified position. More...

virtual UBool isBoundary (int32_t offset)
 Returns true if the specified position is a boundary position. More...

virtual int32_t current (void) const
 Returns the current iteration position. More...

virtual UClassID getDynamicClassID (void) const
 Returns a unique class ID POLYMORPHICALLY. More...

virtual BreakIterator * createBufferClone (void *stackBuffer, int32_t &BufferSize, UErrorCode &status)
 Thread-safe client-buffer-based cloning operation. Do NOT call delete on a safeclone, since 'new' is not used to create it. More...


Static Public Methods

UClassID getStaticClassID (void)
 Returns the class ID for this class. More...


Static Public Attributes

const int8_t IGNORE
 A token used as a character-category value to identify ignore characters.


Protected Methods

 RuleBasedBreakIterator (UDataMemory *image)
virtual int32_t handleNext (void)
 This method is the actual implementation of the next() method. More...

virtual int32_t handlePrevious (void)
 This method backs the iterator back up to a "safe position" in the text. More...

virtual void reset (void)
 Dumps caches and performs other actions associated with a complete change in text or iteration position. More...


Protected Attributes

CharacterIterator * text
 The character iterator through which this BreakIterator accesses the text.

RuleBasedBreakIteratorTables * tables
 The data tables this iterator uses to determine the break positions.


Friends

class BreakIterator

Detailed Description

A subclass of BreakIterator whose behavior is specified using a list of rules.

There are two kinds of rules, which are separated by semicolons: substitutions and regular expressions.

A substitution rule defines a name that can be used in place of an expression. It consists of a name, which is a string of characters contained in angle brackets, an equals sign, and an expression. (There can be no whitespace on either side of the equals sign.) To keep its syntactic meaning intact, the expression must be enclosed in parentheses or square brackets. A substitution is visible after its definition, and is filled in using simple textual substitution. Substitution definitions can contain other substitutions, as long as those substitutions have been defined first. Substitutions are generally used to make the regular expressions (which can get quite complex) shorter and easier to read. They typically define either character categories or commonly-used subexpressions.
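The "simple textual substitution" described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not ICU's rule parser: the helper names fillIn and expandSubstitutions are hypothetical, the description is assumed to contain no whitespace, and escaped or bracketed semicolons are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Replaces every occurrence of each defined name in `text` with its expression.
// Hypothetical helper for illustration; not part of ICU.
std::string fillIn(std::string text,
                   const std::map<std::string, std::string>& defs) {
    for (const auto& def : defs) {
        std::size_t pos = 0;
        while ((pos = text.find(def.first, pos)) != std::string::npos) {
            text.replace(pos, def.first.size(), def.second);
            pos += def.second.size();
        }
    }
    return text;
}

// Splits a description at semicolons, records substitution rules of the form
// <name>=(expr) or <name>=[expr], and returns the remaining rules with every
// previously defined name filled in. Hypothetical sketch; not ICU's parser.
std::vector<std::string> expandSubstitutions(const std::string& description) {
    std::map<std::string, std::string> defs;   // "<name>" -> expression
    std::vector<std::string> expressions;
    std::size_t start = 0;
    while (start < description.size()) {
        std::size_t semi = description.find(';', start);
        if (semi == std::string::npos) semi = description.size();
        const std::string rule = description.substr(start, semi - start);
        start = semi + 1;
        if (rule.empty()) continue;

        const std::size_t gt = rule.find('>');
        const bool isDefinition = rule[0] == '<' && gt != std::string::npos &&
                                  gt + 1 < rule.size() && rule[gt + 1] == '=';
        if (isDefinition) {
            // A definition may use names that were defined before it.
            const std::string body = fillIn(rule.substr(gt + 2), defs);
            // The expression must be enclosed in parentheses or square brackets.
            assert(!body.empty() && (body[0] == '(' || body[0] == '['));
            defs[rule.substr(0, gt + 1)] = body;
        } else {
            expressions.push_back(fillIn(rule, defs));
        }
    }
    return expressions;
}
```

With the made-up description "<letter>=[a-z];<word>=(<letter><letter>*);<word>/[:Zs:]*", the sketch yields the single expression "([a-z][a-z]*)/[:Zs:]*".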

There is one special substitution.  If the description defines a substitution called "<ignore>", the expression must be a [] expression, and the expression defines a set of characters (the "ignore characters") that will be transparent to the BreakIterator.  A sequence of characters will break the same way it would if any ignore characters it contains are taken out.  Break positions never occur before ignore characters.

A regular expression uses a subset of the normal Unix regular-expression syntax, and defines a sequence of characters to be kept together. With one significant exception, the iterator uses a longest-possible-match algorithm when matching text to regular expressions. The iterator also treats descriptions containing multiple regular expressions as if they were ORed together (i.e., as if they were separated by |).

The special characters recognized by the regular-expression parser are as follows:

* Specifies that the expression preceding the asterisk may occur any number of times (including not at all).
{} Encloses a sequence of characters that is optional.
() Encloses a sequence of characters.  If followed by *, the sequence repeats.  Otherwise, the parentheses are just a grouping device and a way to delimit the ends of expressions containing |.
| Separates two alternative sequences of characters.  Either one sequence or the other, but not both, matches this expression.  The | character can only occur inside ().
. Matches any character.
*? Specifies a non-greedy asterisk.  *? works the same way as *, except when there is overlap between the last group of characters in the expression preceding the * and the first group of characters following the *.  When there is this kind of overlap, * will match the longest sequence of characters that match the expression before the *, and *? will match the shortest sequence of characters matching the expression before the *?.  For example, if you have "xxyxyyyxyxyxxyxyxyy" in the text, "x[xy]*x" will match through to the last x (i.e., "xxyxyyyxyxyxxyxyx"), but "x[xy]*?x" will only match the first two xes ("xx").
[] Specifies a group of alternative characters.  A [] expression will match any single character that is specified in the [] expression.  For more on the syntax of [] expressions, see below.
/ Specifies where the break position should go if text matches this expression.  (e.g., "[a-z]*/[:Zs:]*1" will match if the iterator sees a run of letters, followed by a run of whitespace, followed by a digit, but the break position will actually go before the whitespace).  Expressions that don't contain / put the break position at the end of the matching text.
\ Escape character.  The \ itself is ignored, but causes the next character to be treated as a literal character.  This has no effect for many characters, but for the characters listed above, this deprives them of their special meaning.  (There are no special escape sequences for Unicode characters, or tabs and newlines; these are all handled by a higher-level protocol.  In a Java string, "\n" will be converted to a literal newline character by the time the regular-expression parser sees it.  Of course, this means that \ sequences that are visible to the regexp parser must be written as \\ when inside a Java string.)  All characters in the ASCII range except for letters, digits, and control characters are reserved characters to the parser and must be preceded by \ even if they currently don't mean anything.
! If ! appears at the beginning of a regular expression, it tells the regexp parser that this expression specifies the backwards-iteration behavior of the iterator, and not its normal iteration behavior.  This is generally only used in situations where the automatically-generated backwards-iteration behavior doesn't produce satisfactory results and must be supplemented with extra client-specified rules.
(all others) All other characters are treated as literal characters, which must match the corresponding character(s) in the text exactly.
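The greedy versus non-greedy distinction and the effect of / can be tried out with the C++ standard library's std::regex. This is a sketch by analogy only: std::regex uses ECMAScript syntax rather than the rule syntax described here (it has no [:Zs:] category, so a plain space stands in for it), and the helpers firstMatch and breakPosition are hypothetical names introduced for the illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text matched by the first occurrence of `pattern` in `text`,
// or an empty string when nothing matches.
std::string firstMatch(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Emulates a rule of the form head/tail: the whole head+tail sequence must
// match, but the reported break position is the end of the head part.
// Returns -1 when the text does not match.
int breakPosition(const std::string& text, const std::string& head,
                  const std::string& tail) {
    const std::regex re("(" + head + ")" + tail);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return -1;
    assert(m.size() >= 2);   // group 0 is the whole match, group 1 the head
    return static_cast<int>(m.position(1) + m.length(1));
}
```

For the text "xxyxyyyxyxyxxyxyxyy", firstMatch with "x[xy]*x" returns "xxyxyyyxyxyxxyxyx", while "x[xy]*?x" returns "xx". For the text "abc  1", breakPosition with head "[a-z]*" and tail " *1" returns 3, the offset just before the whitespace.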

Within a [] expression, a number of other special characters can be used to specify groups of characters:

- Specifies a range of matching characters.  For example "[a-p]" matches all lowercase Latin letters from a to p (inclusive).  The - sign specifies ranges of continuous Unicode numeric values, not ranges of characters in a language's alphabetical order: "[a-z]" doesn't include capital letters, nor does it include accented letters such as a-umlaut.
:: A pair of colons containing a one- or two-letter code matches all characters in the corresponding Unicode category.  The two-letter codes are the same as the two-letter codes in the Unicode database (for example, "[:Sc::Sm:]" matches all currency symbols and all math symbols).  Specifying a one-letter code is the same as specifying all two-letter codes that begin with that letter (for example, "[:L:]" matches all letters, and is equivalent to "[:Lu::Ll::Lo::Lm::Lt:]").  Anything other than a valid two-letter Unicode category code or a single letter that begins a Unicode category code is illegal within colons.
[] [] expressions can nest.  This has no effect, except when used in conjunction with the ^ token.
^ Excludes the character (or the characters in the [] expression) following it from the group of characters.  For example, "[a-z^p]" matches all Latin lowercase letters except p.  "[:L:^[\u4e00-\u9fff]]" matches all letters except the Han ideographs.
(all others) All other characters are treated as literal characters.  (For example, "[aeiou]" specifies just the letters a, e, i, o, and u.)

For a more complete explanation, see http://www.ibm.com/developerworks/unicode/library/boundaries/boundaries.html.   For examples, see the resource data (which is annotated).

Author:
Richard Gillam


Constructor & Destructor Documentation

RuleBasedBreakIterator::RuleBasedBreakIterator const RuleBasedBreakIterator &    that
 

Copy constructor.

Will produce a break iterator with the same behavior, and which iterates over the same text, as the one passed in.


Member Function Documentation

virtual void RuleBasedBreakIterator::adoptText CharacterIterator *    newText [virtual]
 

Set the iterator to analyze a new piece of text.

This function resets the current iteration position to the beginning of the text.

Parameters:
newText  An iterator over the text to analyze. The BreakIterator takes ownership of the character iterator. The caller MUST NOT delete it!

Implements BreakIterator.

virtual BreakIterator* RuleBasedBreakIterator::createBufferClone void *    stackBuffer,
int32_t &    BufferSize,
UErrorCode &    status
[virtual]
 

Thread-safe client-buffer-based cloning operation. Do NOT call delete on a safeclone, since 'new' is not used to create it.

Parameters:
stackBuffer  User-allocated space for the new clone. If NULL, or if the buffer is not large enough, new memory will be allocated.
BufferSize  Reference to the size of the allocated space. If BufferSize == 0, a sufficient size for use in cloning will be returned ("pre-flighting"). If BufferSize is not enough for a stack-based safe clone, new memory will be allocated.
status  Indicates whether the operation succeeded or there were errors. An informational status value, U_SAFECLONE_ALLOCATED_ERROR, is used if any allocations were necessary.
Returns:
pointer to the new clone
@draft ICU 1.8

Implements BreakIterator.

Reimplemented in DictionaryBasedBreakIterator.
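The pre-flighting and buffer-fallback behavior described for the parameters can be illustrated with a toy type. Everything here is an assumption made for the sketch: ToyIterator and cloneInto are hypothetical, the usedHeap flag merely stands in for the informational status value, the buffer is assumed to be suitably aligned, and ICU's actual ownership rules for safe clones are not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical stand-in for an iterator object; not an ICU class.
struct ToyIterator {
    int32_t position;
};

// Sketch of client-buffer-based cloning.
//  - bufferSize == 0: "pre-flight", report the size needed and return nullptr.
//  - buffer missing or too small: fall back to the heap and set usedHeap.
//  - otherwise: construct the clone inside the caller's buffer.
// Assumes `buffer`, when given, is suitably aligned for ToyIterator.
ToyIterator* cloneInto(const ToyIterator& source, void* buffer,
                       int32_t& bufferSize, bool& usedHeap) {
    const int32_t needed = static_cast<int32_t>(sizeof(ToyIterator));
    assert(bufferSize >= 0);
    usedHeap = false;
    if (bufferSize == 0) {
        bufferSize = needed;
        return nullptr;
    }
    if (buffer == nullptr || bufferSize < needed) {
        usedHeap = true;                      // an allocation was necessary
        return new ToyIterator(source);
    }
    return new (buffer) ToyIterator(source);  // placement new, no allocation
}
```

In this toy, a clone built in the caller's buffer must not be passed to delete, because it was created with placement new; only the heap fallback owns separately allocated memory.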

virtual int32_t RuleBasedBreakIterator::current void    const [virtual]
 

Returns the current iteration position.

Returns:
The current iteration position.

Implements BreakIterator.

virtual int32_t RuleBasedBreakIterator::first void    [virtual]
 

Sets the current iteration position to the beginning of the text.

(i.e., the CharacterIterator's starting offset).

Returns:
The offset of the beginning of the text.

Implements BreakIterator.

virtual int32_t RuleBasedBreakIterator::following int32_t    offset [virtual]
 

Sets the iterator to refer to the first boundary position following the specified position.

Parameters:
offset  The position from which to begin searching for a break position.

Returns:
The position of the first break after the specified position.

Reimplemented in DictionaryBasedBreakIterator.

UClassID RuleBasedBreakIterator::getDynamicClassID void    const [inline, virtual]
 

Returns a unique class ID POLYMORPHICALLY.

Pure virtual override. This method is to implement a simple version of RTTI, since not all C++ compilers support genuine RTTI. Polymorphic operator==() and clone() methods call this method.

Returns:
The class ID for this object. All objects of a given class have the same class ID. Objects of other classes have different class IDs.

Implements BreakIterator.

Reimplemented in DictionaryBasedBreakIterator.

UClassID RuleBasedBreakIterator::getStaticClassID void    [inline, static]
 

Returns the class ID for this class.

This is useful only for comparing to a return value from getDynamicClassID(). For example:

    Base* polymorphic_pointer = createPolymorphicObject();
    if (polymorphic_pointer->getDynamicClassID() == Derived::getStaticClassID()) ...

Returns:
The class ID for all objects of this class.

Reimplemented in DictionaryBasedBreakIterator.
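The class-ID comparison can be sketched end to end with toy classes. The names Base, Derived, Other, ClassID and asDerived are hypothetical, and using the address of a static member as the ID is one possible way to obtain a unique value per class, assumed here for illustration rather than taken from the ICU sources.

```cpp
#include <cassert>

typedef const void* ClassID;   // hypothetical stand-in for UClassID

class Base {
public:
    virtual ~Base() {}
    virtual ClassID getDynamicClassID() const = 0;
};

class Derived : public Base {
public:
    // The address of a per-class static member is unique to this class.
    static ClassID getStaticClassID() { return &fgClassID; }
    virtual ClassID getDynamicClassID() const { return getStaticClassID(); }
private:
    static const char fgClassID;
};
const char Derived::fgClassID = 0;

class Other : public Base {
public:
    static ClassID getStaticClassID() { return &fgClassID; }
    virtual ClassID getDynamicClassID() const { return getStaticClassID(); }
private:
    static const char fgClassID;
};
const char Other::fgClassID = 0;

// Checked downcast without compiler RTTI: compares the object's dynamic
// class ID with the static ID of the target class.
const Derived* asDerived(const Base* p) {
    assert(p != nullptr);
    return p->getDynamicClassID() == Derived::getStaticClassID()
               ? static_cast<const Derived*>(p)
               : nullptr;
}
```

Calling asDerived on a pointer to a Derived object returns that object, while a pointer to an Other object yields nullptr.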

virtual const CharacterIterator& RuleBasedBreakIterator::getText void    const [virtual]
 

Return a CharacterIterator over the text being analyzed.

This version of this method returns the actual CharacterIterator we're using internally. Changing the state of this iterator can have undefined consequences. If you need to change it, clone it first.

Returns:
An iterator over the text being analyzed.

Implements BreakIterator.

virtual int32_t RuleBasedBreakIterator::handleNext void    [protected, virtual]
 

This method is the actual implementation of the next() method.

All iteration vectors through here. This method initializes the state machine to state 1 and advances through the text character by character until we reach the end of the text or the state machine transitions to state 0. We update our return value every time the state machine passes through a possible end state.

Reimplemented in DictionaryBasedBreakIterator.

virtual int32_t RuleBasedBreakIterator::handlePrevious void    [protected, virtual]
 

This method backs the iterator back up to a "safe position" in the text.

This is a position that we know, without any context, must be a break position. The various calling methods then iterate forward from this safe position to the appropriate position to return. (For more information, see the description of buildBackwardsStateTable() in RuleBasedBreakIterator.Builder.)

virtual int32_t RuleBasedBreakIterator::hashCode void    const [virtual]
 

Compute a hash code for this BreakIterator.

Returns:
A hash code

virtual UBool RuleBasedBreakIterator::isBoundary int32_t    offset [virtual]
 

Returns true if the specified position is a boundary position.

As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".

Parameters:
offset  the offset to check.
Returns:
True if "offset" is a boundary position.
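The side effect of isBoundary(), and its relationship to following(), can be modeled with a toy iterator over a fixed, sorted list of boundary offsets. ToyBoundaries and kDone are hypothetical names, and the behavior when no boundary qualifies is an assumption of this sketch rather than documented ICU behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model of a boundary iterator; not part of ICU.
class ToyBoundaries {
public:
    enum { kDone = -1 };   // stand-in for "no such boundary"

    explicit ToyBoundaries(std::vector<int32_t> sortedBoundaries)
        : boundaries_(std::move(sortedBoundaries)), current_(0) {
        assert(!boundaries_.empty());
        assert(std::is_sorted(boundaries_.begin(), boundaries_.end()));
        current_ = boundaries_.front();
    }

    int32_t current() const { return current_; }

    // First boundary strictly after `offset`, or kDone if there is none
    // (the iterator is then left on the last boundary).
    int32_t following(int32_t offset) {
        auto it = std::upper_bound(boundaries_.begin(), boundaries_.end(), offset);
        if (it == boundaries_.end()) {
            current_ = boundaries_.back();
            return kDone;
        }
        current_ = *it;
        return current_;
    }

    // True if `offset` is a boundary. Side effect: the iterator ends up on
    // the first boundary at or after `offset` (or on the last one if none).
    bool isBoundary(int32_t offset) {
        auto it = std::lower_bound(boundaries_.begin(), boundaries_.end(), offset);
        current_ = (it == boundaries_.end()) ? boundaries_.back() : *it;
        return it != boundaries_.end() && *it == offset;
    }

private:
    std::vector<int32_t> boundaries_;
    int32_t current_;
};
```

With boundaries {0, 5, 6, 11}, isBoundary(7) returns false and leaves current() at 11, whereas isBoundary(6) returns true and leaves current() at 6.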

virtual int32_t RuleBasedBreakIterator::last void    [virtual]
 

Sets the current iteration position to the end of the text.

(i.e., the CharacterIterator's ending offset).

Returns:
The text's past-the-end offset.

Implements BreakIterator.

virtual int32_t RuleBasedBreakIterator::next void    [virtual]
 

Advances the iterator to the next boundary position.

Returns:
The position of the first boundary after this one.

Implements BreakIterator.

virtual int32_t RuleBasedBreakIterator::next int32_t    n [virtual]
 

Advances the iterator either forward or backward the specified number of steps.

Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().

Parameters:
n  The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
Returns:
The character offset of the boundary position n boundaries away from the current one.

Implements BreakIterator.
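The equivalence between next(n) and repeated calls to next() or previous() can be written out over a toy list of boundaries. ToyCursor and kDone are hypothetical names; stopping early and returning kDone when the end of the list is reached is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy cursor over a fixed list of boundary offsets; not part of ICU.
class ToyCursor {
public:
    enum { kDone = -1 };   // stand-in for "ran off the end"

    explicit ToyCursor(std::vector<int32_t> boundaries)
        : boundaries_(std::move(boundaries)), index_(0) {
        assert(!boundaries_.empty());
    }

    int32_t current() const { return boundaries_[index_]; }

    // Moves one boundary forward, or returns kDone at the last boundary.
    int32_t next() {
        if (index_ + 1 >= boundaries_.size()) return kDone;
        return boundaries_[++index_];
    }

    // Moves one boundary backward, or returns kDone at the first boundary.
    int32_t previous() {
        if (index_ == 0) return kDone;
        return boundaries_[--index_];
    }

    // Positive n steps forward n times, negative n steps backward |n| times,
    // and n == 0 just reports the current position.
    int32_t next(int32_t n) {
        int32_t result = current();
        while (n > 0 && result != kDone) { result = next(); --n; }
        while (n < 0 && result != kDone) { result = previous(); ++n; }
        return result;
    }

private:
    std::vector<int32_t> boundaries_;
    std::size_t index_;
};
```

Starting at the first of the boundaries {0, 3, 4, 9}, next(2) returns 4 and a subsequent next(-1) returns 3.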

UBool RuleBasedBreakIterator::operator!= const BreakIterator &    that const [inline]
 

Not-equal operator.

If operator== returns TRUE, this returns FALSE, and vice versa.

Reimplemented from BreakIterator.

RuleBasedBreakIterator& RuleBasedBreakIterator::operator= const RuleBasedBreakIterator &    that
 

Assignment operator.

Sets this iterator to have the same behavior, and iterate over the same text, as the one passed in.

virtual UBool RuleBasedBreakIterator::operator== const BreakIterator &    that const [virtual]
 

Equality operator.

Returns TRUE if both BreakIterators are of the same class, have the same behavior, and iterate over the same text.

Implements BreakIterator.

virtual int32_t RuleBasedBreakIterator::preceding int32_t    offset [virtual]
 

Sets the iterator to refer to the last boundary position before the specified position.

Parameters:
offset  The position from which to begin searching for a break.

Returns:
The position of the last boundary before the starting position.

Reimplemented in DictionaryBasedBreakIterator.

virtual int32_t RuleBasedBreakIterator::previous void    [virtual]
 

Advances the iterator backwards, to the last boundary preceding this one.

Returns:
The position of the last boundary position preceding this one.

Implements BreakIterator.

Reimplemented in DictionaryBasedBreakIterator.

virtual void RuleBasedBreakIterator::reset void    [protected, virtual]
 

Dumps caches and performs other actions associated with a complete change in text or iteration position.

This function is a no-op in RuleBasedBreakIterator, but subclasses can and do override it.

Reimplemented in DictionaryBasedBreakIterator.

virtual void RuleBasedBreakIterator::setText const UnicodeString &    newText [virtual]
 

Set the iterator to analyze a new piece of text.

This function resets the current iteration position to the beginning of the text.

Parameters:
newText  The text to analyze.

Implements BreakIterator.


The documentation for this class was generated from the following file:
Generated on Sun Mar 3 16:07:06 2002 for ICU 2.0 by doxygen 1.2.14 written by Dimitri van Heesch, © 1997-2002