RegEx-Engine

https://github.com/SicroAtGit/RegEx-Engine

About

This RegEx engine compiles a regular expression string into an NFA, and can optionally convert the NFA into a DFA, which executes much faster; the generated NFA/DFA can then be executed against a string.

The RegEx engine will always return the longest possible match among several possible matches. During this process no backtracking is required, because all alternations are checked simultaneously.

RegExes can be assigned unique RegEx ID numbers, which allow to determine which RegEx matched when executing multiple RegExs simultaneously. This feature is useful for creating lexers, which is the main focus of the project. At the same time, the RegEx engine is kept flexible, so that it might be employed in a variety of other contexts, beside lexers creation.

Examples

Simple Match

*regEx = RegEx::Init()
If *regEx
  RegEx::AddNfa(*regEx, "test|example")
  If RegEx::Match(*regEx, @"example")
    Debug "Match!"
  Else
    Debug "No match!"
  EndIf
  RegEx::Free(*regEx)
Else
  Debug "Error!"
EndIf

Multiple RegExes Simultaneously

Enumeration
  #RegExId_Word
  #RegExId_Number
EndEnumeration

*regEx = RegEx::Init()
If *regEx
  RegEx::AddNfa(*regEx, "\w+", #RegExId_Word)
  RegEx::AddNfa(*regEx, "\d+", #RegExId_Number)
  If RegEx::Match(*regEx, @"example", @regExId)
    Select regExId
      Case #RegExId_Word:   Debug "Match is a word!"
      Case #RegExId_Number: Debug "Match is a number!"
    EndSelect
  Else
    Debug "No match!"
  EndIf
  RegEx::Free(*regEx)
Else
  Debug "Error!"
EndIf

More code examples can be found in the Source/Examples/ directory.

Supported Syntax

Syntax Meaning
xy x followed by y (Concatenation)
x\|y x or y (Alternation)
x* Zero or more consecutive x
x+ One or more consecutive x
x? Zero or one x
( ) Groups a regular expression. Groups inherit the active modes of their parent context. Mode changes within a group have no effect on the surrounding contexts.
\* Escapes the metacharacter * and treats it as a literal character.
Works also with the other metacharacters: \| + ? ( ) \
\r Matches the carriage return character (\x0D)
\n Matches the line feed character (\x0A)
\t Matches the horizontal tab character (\x09)
\f Matches the form feed character (\x0C)
[x] Where x can be a combination of: single literal character, escape sequence, or range (a-c)
. Matches any character except \r and \n
\d Matches Unicode characters class Nd
\D Matches any character except those in Unicode characters class Nd
\s Matches Unicode characters class White_Space
\S Matches any character except those in Unicode characters class White_Space
\w Matches Unicode characters classes Alphabetic, M, Nd, Pc and Join_Control
\W Matches any character except those in Unicode characters classes Alphabetic, M, Nd, Pc and Join_Control
\x Matches the character represented by the two digit hex code x (\x01\xFF)
\u Matches the character represented by the four digit hex code u (\u0001\uFFFF)
(?m) Toggles the RegEx mode states. m can be one or more flags. To deactivate a RegEx mode prefix a flag with a minus sign

Unicode Support

Like the native string functions in PureBasic: UCS-2 character encoding, UTF-16 surrogate pairs are interpreted as two single UCS-2 characters.

Case-Insensitive Mode

Flag: i

The implementation uses Unicode’s Simple Case Folding variant, but in reverse: instead of mapping all character variations to a single character (folding), a single character is mapped to all character variations (unfolding). This is necessary because the DFA must know all valid characters.

ASCII Mode

Flag: a

When active, the predefined character classes will only match the corresponding ASCII characters. For example, (?a)\w will match only [a-zA-Z0-9_]. The character encoding remains UCS-2 in this mode, i.e. (?a)\W matches all UCS-2 characters except [a-zA-Z0-9_].

This RegEx mode is also useful in combination with #RegExMode_NoCase when you want to lex for keywords within source code, case-insensitively, but no case-folding should be applied:

Public Constants

EnumerationBinary RegExModes
  #RegExMode_NoCase ; Activates case-insensitive mode
  #RegExMode_Ascii  ; Activates ASCII mode
EndEnumeration
Enumeration NfaStateTypes
  #StateType_EpsilonMove ; Used for NFA epsilon moves
  #StateType_SymbolMove  ; Used for NFA symbol moves
  #StateType_SplitMove   ; Used for NFA unions
  #StateType_Final       ; Used for NFA final state
EndEnumeration
#State_DfaDeadState = 0 ; Index number of the DFA dead state

Public Structures

Structure ByteRangeStruc
  min.a ; Minimum byte value (0-255)
  max.a ; Maximum byte value (0-255)
EndStructure
Structure NfaStateStruc
  stateType.u               ; Type of the NFA state (regExId = stateType - #StateType_NfaFinal)
  byteRange.ByteRangeStruc  ; A byte range is used as a transition symbol
  *nextState1.NfaStateStruc ; Pointer to the first next NFA state
  *nextState2.NfaStateStruc ; Pointer to the second next NFA state
EndStructure
Structure DfaStateStruc
  nextState.u[256] ; Index is the symbol (0-255) and the value is the next DFA state
  isFinalState.u   ; Positive number if the DFA state is a final state, otherwise null
EndStructure
Structure DfaStatesArrayStruc
  states.DfaStateStruc[0] ; Array pointer to the DFA states
EndStructure
Structure NfaPoolStruc
  List nfaStates.NfaStateStruc() ; Holds all NFA states of the NFA pool
  *initialNfaState.NfaStateStruc ; Pointer to the NFA initial state
EndStructure
Structure RegExEngineStruc
  List nfaPools.NfaPoolStruc()       ; Holds all NFA pools
  *dfaStatesPool.DfaStatesArrayStruc ; Holds all DFA states
  isUseDfaFromMemory.b               ; `#True` if `UseDfaFromMemory()` was used, otherwise `#False`
EndStructure

Public Macros

Public Functions

Reduced DFA Matcher Module

The reduced module DfaMatcher provides only a DFA matcher which uses the precompiled DFAs created with the main module.

If only the precompiled DFAs are needed in the software, for matching, and no new NFAs/DFAs are to be created at runtime, then the reduced module can be used. This way the software is not unnecessarily bloated with the large Unicode tables and the rest of the code found in the main module.

Public Constants

#State_DfaDeadState = 0 ; Index number of the DFA dead state

Public Structures

Structure DfaStateStruc
  nextState.u[256] ; Index is the symbol (0-255) and the value is the next DFA state
  isFinalState.u   ; Positive number if the DFA state is a final state, otherwise null
EndStructure
Structure DfaStatesArrayStruc
  states.DfaStateStruc[0] ; Array pointer to the DFA states
EndStructure

Public Macros

Public Functions

Would you like to contribute to the project?

Then please check out CONTRIBUTING for details.

License

The project is licensed under the MIT license.