1.19. Regular Matching

Felix provides support for matching strings against regular expressions.

Unlike commonly used regexp libraries, regular expressions are not strings: instead a first class syntax is used to define them.

Felix allows you to name regular expressions with the syntax:

  regexp <name> = <regexp> ;
The name is an identifier. A string used in a regexp stands for a match of each character of the string in sequence. The following symbols are special, and are given from weakest to strongest binding order:
symbolsyntaxmeaning
|infixalternatives
*postfix0 or more occurences
+postfix1 or more occurences
?postfix0 or 1 occurences
<juxtaposition>infixconcatenation
<name>atomicre denoted by the name in a REGEXP definition
<string>atomicsequence of chars of the string
[<charset>]atomicany char of the charset
[^<charset>]atomicany char not in the charset
.atomicany char other than end of line
_atomicany char
eofatomicend marker
(<regexp>)atomicbrackets
The usual notation for character sets is employed:
symbolmeaning
<string>any character in the string
<char>-<char>any between or including the two chars
Note that a char is represented by a one character string.

Start C++ section to tut/examples/tut121.flx[1 /2 ] Next Last
     1: include "std";
     2: regexp lower = ["abcdefghijklmnopqrstuvwxyz"];
     3: regexp upper = ["ABCDEFGHIJKLMNOPQRSTUVWXYZ"];
     4: regexp digit = ["0123456789"];
     5: regexp alpha = lower | upper | "_";
     6: regexp id = alpha (alpha | digit) *;
     7: 
End C++ section to tut/examples/tut121.flx[1]
Regular expressions are used in regular match expressions. These are like ordinary matches, except that the pattern is a regular expression, and the argument must be a string. If more than one pattern matches the string, the first one is used.
Start C++ section to tut/examples/tut121.flx[2 /2 ] Prev First
     8: print
     9:   regmatch "identifier" with
    10:   | digit+ => "Number"
    11:   | id =>  "Identifier"
    12:   endmatch
    13: ;
    14: endl;
    15: 
    16: print
    17:   regmatch "9999" with
    18:   | digit+ => "Number"
    19:   | id =>  "Identifier"
    20:   endmatch
    21: ;
    22: endl;
    23: 
    24: print
    25:   regmatch "999xxx" with
    26:   | digit+ => "Number"
    27:   | id =>  "Identifier"
    28:   | _* => "Neither"
    29:   endmatch
    30: ;
    31: endl;
    32: 
    33: 
End C++ section to tut/examples/tut121.flx[2]
Note that whilst conceptually regular matches are applied first to last, the actual implementation uses a finite state machine and guarranteed to be linear in the length of the input, and, in particular independent of the number of regular expressions. Each character is processed at most once.

Note: the generated code is *extremely* fast, within one or two memory fetches of the fastest possible code. here is the generated code for the inner loop of a regmatch:

  while(state && start != end)
    state = matrix[*start++][state];