Normalizer transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. Normalizer supports the standard normalization forms described in Unicode Technical Report #15.

Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character "Á" (A-acute). In Unicode, this can be encoded as a single character (the "composed" form):

    00C1 LATIN CAPITAL LETTER A WITH ACUTE

or as two separate characters (the "decomposed" form):

    0041 LATIN CAPITAL LETTER A
    0301 COMBINING ACUTE ACCENT

To a user of your program, however, both of these sequences should be treated as the same "user-level" character "Á". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.

Similarly, the string "ffi" can be encoded as three separate letters:

    0066 LATIN SMALL LETTER F
    0066 LATIN SMALL LETTER F
    0069 LATIN SMALL LETTER I

or as the single character

    FB03 LATIN SMALL LIGATURE FFI

The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.

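The two kinds of mapping described above can be illustrated with a minimal, self-contained sketch. It does not use Normalizer; the tables `kToyCanonical` and `kToyCompatibility` and the helper `toyDecompose` are hypothetical names, and each table holds only the single example from the text, whereas a real implementation takes its mappings from the Unicode Character Database.

```cpp
#include <map>
#include <string>

// Toy tables holding only the two examples from the text (assumption: a real
// implementation loads the full mappings from the Unicode Character Database).
static const std::map<char32_t, std::u32string> kToyCanonical = {
    {U'\u00C1', U"A\u0301"},  // A-acute -> A + COMBINING ACUTE ACCENT
};
static const std::map<char32_t, std::u32string> kToyCompatibility = {
    {U'\uFB03', U"ffi"},      // ffi ligature -> f, f, i
};

// Hypothetical helper, not part of Normalizer. Canonical mappings are always
// applied; compatibility mappings are applied only when compat is true.
// This is a single pass; real decomposition applies mappings recursively.
std::u32string toyDecompose(const std::u32string& text, bool compat) {
    std::u32string out;
    for (char32_t c : text) {
        auto canonical = kToyCanonical.find(c);
        auto compatibility = kToyCompatibility.find(c);
        if (canonical != kToyCanonical.end()) {
            out += canonical->second;
        } else if (compat && compatibility != kToyCompatibility.end()) {
            out += compatibility->second;
        } else {
            out += c;
        }
    }
    return out;
}
```

With these tables, the composed and decomposed encodings of "Á" produce the same result, so they compare equal after decomposition. The ligature FB03 is left alone unless `compat` is true.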
Normalizer helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, Normalizer rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
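The accent rearrangement mentioned above can be sketched as a stable sort of each run of combining marks by combining class. This is an illustration under stated assumptions, not Normalizer's implementation: `toyCombiningClass` and `toyCanonicalOrder` are hypothetical names, and the lookup knows only two marks (their class values, 220 and 230, are the ones assigned by the Unicode Character Database).

```cpp
#include <algorithm>
#include <string>

// Toy combining-class lookup (assumption: only two marks are known here;
// every other character is treated as a base character with class 0).
int toyCombiningClass(char32_t c) {
    switch (c) {
        case 0x0323: return 220;  // COMBINING DOT BELOW
        case 0x0301: return 230;  // COMBINING ACUTE ACCENT
        default:     return 0;
    }
}

// Hypothetical helper: sorts each run of combining marks by ascending class.
// The sort is stable, so marks that share a class keep their original order,
// which is the case where the order of accents is significant.
std::u32string toyCanonicalOrder(std::u32string text) {
    auto it = text.begin();
    while (it != text.end()) {
        if (toyCombiningClass(*it) == 0) {
            ++it;
            continue;
        }
        auto runEnd = it;
        while (runEnd != text.end() && toyCombiningClass(*runEnd) != 0) {
            ++runEnd;
        }
        std::stable_sort(it, runEnd, [](char32_t a, char32_t b) {
            return toyCombiningClass(a) < toyCombiningClass(b);
        });
        it = runEnd;
    }
    return text;
}
```

For example, "a" followed by 0301 then 0323 is reordered to "a", 0323, 0301, so both input orders yield the same sequence. Two consecutive 0301 marks are left exactly as they were.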
Normalizer adds one optional behavior, {@link #IGNORE_HANGUL}, that differs from the standard Unicode Normalization Forms. This option can be passed to the {@link #Normalizer constructors} and to the static {@link #compose compose} and {@link #decompose decompose} methods. This option, and any that are added in the future, will be turned off by default.
There are three common usage models for Normalizer. In the first, the static {@link #normalize normalize()} method is used to process an entire input string at once. Second, you can create a Normalizer object and use it to iterate through the normalized form of a string by calling {@link #first} and {@link #next}. Finally, you can use the {@link #setIndex setIndex()} and {@link #getIndex} methods to perform random-access iteration, which is very useful for searching.
Note: Normalizer objects behave like iterators and have methods such as setIndex, next, previous, etc. You should note that while setIndex and getIndex refer to indices in the underlying input text being processed, the next and previous methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex. It is for this reason that Normalizer does not implement the {@link CharacterIterator} interface.
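The mismatch between input indices and output characters can be seen in a small stand-in iterator. This sketch is not the Normalizer class: `ToyDecomposingIterator`, its bookkeeping, and the `kToyDone` sentinel are assumptions chosen only to illustrate the point, and the only mapping it knows is the A-acute example.

```cpp
#include <cstddef>
#include <string>
#include <utility>

const char32_t kToyDone = 0xFFFF;  // hypothetical end-of-text sentinel

// Hypothetical stand-in for a decomposing iterator; not the Normalizer class.
class ToyDecomposingIterator {
public:
    explicit ToyDecomposingIterator(std::u32string input)
        : input_(std::move(input)), inputIndex_(0), bufferPos_(0) {}

    // Index into the INPUT text of the next character to be read.
    std::size_t getIndex() const { return inputIndex_; }

    // Returns the next character of the OUTPUT, or kToyDone at the end.
    char32_t next() {
        if (bufferPos_ == buffer_.size()) {
            if (inputIndex_ == input_.size()) {
                return kToyDone;
            }
            buffer_ = expand(input_[inputIndex_++]);  // refill from one input char
            bufferPos_ = 0;
        }
        return buffer_[bufferPos_++];
    }

private:
    // Toy mapping: only A-acute is decomposed (assumption for illustration).
    static std::u32string expand(char32_t c) {
        if (c == 0x00C1) {
            return U"A\u0301";
        }
        return std::u32string(1, c);
    }

    std::u32string input_;
    std::u32string buffer_;
    std::size_t inputIndex_;
    std::size_t bufferPos_;
};
```

Iterating over the two-character input 00C1, "x" yields three output characters ("A", 0301, "x"), and the input index stays at 1 across the first two calls to `next()`.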
Note: Normalizer is currently based on version 2.1.8 of the Unicode Standard. It will be updated as later versions of Unicode are released. If you are using this class on a JDK that supports an earlier version of Unicode, it is possible that Normalizer may generate composed or decomposed characters for which your JDK's {@link java.lang.Character} class does not have any data.
COMPOSE: If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form C.
COMPOSE_COMPAT: If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form KC.
DECOMP: If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form D.
DECOMP_COMPAT: If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form KD.
IGNORE_HANGUL: The Unicode standard treats Hangul to Jamo conversion as a
canonical decomposition, so this option must be turned off if you
wish to transform strings into one of the standard
Unicode Normalization Forms. See setOption.
enum EMode
NO_OP
COMPOSE
COMPOSE_COMPAT
DECOMP
DECOMP_COMPAT
enum
IGNORE_HANGUL
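For background on the Hangul to Jamo conversion that IGNORE_HANGUL refers to: the Unicode Standard defines this decomposition arithmetically rather than through a lookup table. The sketch below follows that arithmetic. The function name `toyDecomposeHangul` is hypothetical, and nothing here is taken from Normalizer's own implementation.

```cpp
#include <string>

// Hypothetical helper, not part of Normalizer. Decomposes one precomposed
// Hangul syllable into its leading consonant, vowel, and optional trailing
// consonant Jamo. Any other character is returned unchanged.
std::u32string toyDecomposeHangul(char32_t syllable) {
    const char32_t SBase = 0xAC00;  // first precomposed syllable
    const char32_t LBase = 0x1100;  // first leading consonant
    const char32_t VBase = 0x1161;  // first vowel
    const char32_t TBase = 0x11A7;  // one before the first trailing consonant
    const int VCount = 21;
    const int TCount = 28;
    const int SCount = 19 * VCount * TCount;  // 11172 syllables in total

    if (syllable < SBase || syllable >= SBase + SCount) {
        return std::u32string(1, syllable);
    }
    const int sIndex = static_cast<int>(syllable - SBase);
    std::u32string jamo;
    jamo += static_cast<char32_t>(LBase + sIndex / (VCount * TCount));
    jamo += static_cast<char32_t>(VBase + (sIndex % (VCount * TCount)) / TCount);
    const int t = sIndex % TCount;
    if (t != 0) {
        jamo += static_cast<char32_t>(TBase + t);  // trailing consonant, if any
    }
    return jamo;
}
```

For example, the syllable D55C decomposes to 1112, 1161, 11AB, and AC00 decomposes to 1100, 1161 with no trailing consonant.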
The options parameter specifies which optional
Normalizer features are to be enabled for this object.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is {@link #IGNORE_HANGUL}.
If you want the default behavior corresponding to one of the standard
Unicode Normalization Forms, use 0 for this argument.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is {@link #IGNORE_HANGUL}.
If you want the default behavior corresponding
to Unicode Normalization Form C or KC,
use 0 for this argument.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is {@link #IGNORE_HANGUL}.
The desired options should be OR'ed together to determine the value
of this argument. If you want the default behavior corresponding
to Unicode Normalization Form D or KD,
use 0 for this argument.
Note: This method sets the position in the input text,
while {@link #next} and {@link #previous} iterate through characters
in the normalized output. This means that there is not
necessarily a one-to-one correspondence between characters returned
by next and previous and the indices passed to and
returned from setIndex and {@link #getIndex}.
Note: This method sets the position in the input, while
{@link #next} and {@link #previous} iterate through characters in the
output. This means that there is not necessarily a one-to-one
correspondence between characters returned by next and
previous and the indices passed to and returned from
setIndex and {@link #getIndex}.
Note: If the normalization mode is changed while iterating
over a string, calls to {@link #next} and {@link #previous} may
return previously buffered characters in the old normalization mode
until the iteration is able to re-sync at the next base character.
It is safest to call {@link #setText setText()}, {@link #first},
{@link #last}, etc. after calling setMode.
Normalizer(const UnicodeString& str, EMode mode, int32_t opt)
mode - The normalization mode.
opt - Any optional features to be enabled.
Currently the only available option is {@link #IGNORE_HANGUL}.
If you want the default behavior corresponding to one of the
standard Unicode Normalization Forms, use 0 for this argument.
Normalizer(const CharacterIterator& iter, EMode mode)
mode - The normalization mode.
Normalizer(const CharacterIterator& iter, EMode mode, int32_t opt)
mode - The normalization mode.
opt - Any optional features to be enabled.
Currently the only available option is {@link #IGNORE_HANGUL}.
If you want the default behavior corresponding to one of the
standard Unicode Normalization Forms, use 0 for this argument.
Normalizer(const Normalizer& copy)
~Normalizer()
static void normalize(const UnicodeString& source, EMode mode, int32_t options, UnicodeString& result, UErrorCode &status)
mode - the normalization mode
options - the optional features to be enabled.
result - The normalized string (on output).
status - The error code.
static void compose(const UnicodeString& source, bool_t compat, int32_t options, UnicodeString& result, UErrorCode &status)
compat - Perform compatibility decomposition before composition.
If this argument is false, only canonical
decomposition will be performed.
options - the optional features to be enabled.
result - The composed string (on output).
status - The error code.
static void decompose(const UnicodeString& source, bool_t compat, int32_t options, UnicodeString& result, UErrorCode &status)
compat - Perform compatibility decomposition.
If this argument is false, only canonical
decomposition will be performed.
options - the optional features to be enabled.
result - The decomposed string (on output).
status - The error code.
UChar current(void) const
UChar first(void)
UChar last(void)
UChar next(void)
UChar previous(void)
UChar setIndex(UTextOffset index)
void reset(void)
UTextOffset getIndex(void) const
UTextOffset startIndex(void) const
UTextOffset endIndex(void) const
bool_t operator==(const Normalizer& that) const
Normalizer* clone(void) const
int32_t hashCode(void) const
void setMode(EMode newMode)
EMode getMode(void) const
void setOption(int32_t option, bool_t value)
value - the new setting for the option. Use true to
turn the option on and false to turn it off.
bool_t getOption(int32_t option) const
void setText(const UnicodeString& newText, UErrorCode &status)
void setText(const CharacterIterator& newText, UErrorCode &status)
void getText(UnicodeString& result)