CRE Syntax

This is the reference for CRE syntax. It describes each construct, and lists the corresponding Perl-like syntax.

Annex, a Python library, supports only a subset of CRE syntax, since it's a wrapper around Python's re.

CREs (and traditional syntax) can be thought of in two parts: a language to specify sets of characters, and a language on top of that to specify sets of strings.

Specifying Characters

Single Metacharacters

These constructs stand for a set of characters.

CRE	Traditional	Definition	Notes
`any`	`.`	any character, possibly including newline (anyall=true)
`chars[x y z]`	`[xyz]`	character class	Insignificant space makes these more readable.
`!chars[x y z]`	`[^xyz]`	negated character class
`digit`	`\d`	Perl character class
`!digit`	`\D`	negated Perl character class
`:alpha`	`[:alpha:]`	ASCII character class
`!:alpha`	`[:^alpha:]`	negated ASCII character class
`::Greek`	`\p{Greek}`	Unicode character class
`!::Greek`	`\P{Greek}`	negated Unicode character class
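
As a rough illustration of the Traditional column, the sketch below runs a few of these constructs through Python's re module. It does not use Annex itself; the variable names are made up for the example. The POSIX and Unicode class rows are left out because Python's re does not accept `[:alpha:]` or `\p{Greek}`.

```python
import re

# `any` -> `.`  (newline is matched only with DOTALL, i.e. the CRE flag anyall)
any_default = re.fullmatch(r".", "\n")
any_anyall = re.fullmatch(r".", "\n", re.DOTALL)

# `chars[x y z]` -> `[xyz]` and `!chars[x y z]` -> `[^xyz]`
in_class = re.findall(r"[xyz]", "xaybz")
not_in_class = re.findall(r"[^xyz]", "xaybz")

# `digit` -> `\d` and `!digit` -> `\D`
digits = re.findall(r"\d", "a1b22")
non_digits = re.findall(r"\D", "a1b22")

print(any_default, bool(any_anyall))   # None True
print(in_class, not_in_class)          # ['x', 'y', 'z'] ['a', 'b']
print(digits, non_digits)              # ['1', '2', '2'] ['a', 'b']
```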

Character Literals

These constructs stand for exactly one character. They're all valid inside or outside a character class. There are no octal escapes.

CRE	Traditional	Definition	Notes
`0x00`	`varies`	Hex escape
`&201c`	`varies`	Unicode code point
`&name`	`e.g. \n`	Named character.	See the section below

Named Characters

This is a list of named characters. There are no equivalents of \b for backspace or \f for form feed, etc. Instead use hex or unicode escapes.

CRE	Traditional	Definition	Notes
`&space`		space character	Traditional regexes just use a literal space, or [ ]. Whitespace is never significant in CREs, so this is necessary to represent a space inside a character class.
`&newline &cr &tab`	`\n \r \t`	whitespace characters	The backslash doesn't mean anything special in CRE. Use the default concatenation operator: `'a' &newline 'b'` instead of `a\nb`.
`&hyphen &bang &hash &lbracket &rbracket`	`\- ! \# \[ \]`	The character named.	Only needed inside character classes. Otherwise use `'-'`.
`&squote &dquote`	`' "`	The character named.	Purely syntactic sugar. You can always use `chars["]` inside a char class, or `'"'` outside of one.

Character Class Elements

These elements may appear inside a character class.

Named classes can be negated with the ! operator: !digit, !:alnum, !::Greek.

CRE	Traditional	Definition	Notes
`x`	`x`	single character
`A-Z`	`A-Z`	character range (inclusive)	ranges must be separated by space, e.g. `chars[a-z A-Z]` not `chars[a-zA-Z]`. Escapes are also allowed, e.g. `chars[0x00 - 0x20]`

In addition, all character literals, as well as named classes and their negations, may appear within a character class. For example: digit, !digit, :alnum, !:alnum, &space, 0x00, and &201c.
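
To make the space-separated element rule concrete, here is a minimal sketch of how a class body such as `a-z A-Z _` could be turned into a traditional class. The function `class_body_to_traditional` is hypothetical and is not Annex's implementation. It handles only single characters and compact `a-z` ranges, not named classes, escapes, or ranges written with spaces around the hyphen.

```python
import re

def class_body_to_traditional(body: str) -> str:
    """Hypothetical helper: 'a-z A-Z _' -> '[a-zA-Z_]'."""
    parts = []
    for element in body.split():          # elements are separated by whitespace
        if len(element) == 1:             # single character
            parts.append(re.escape(element))
        elif len(element) == 3 and element[1] == "-":   # inclusive range
            parts.append(re.escape(element[0]) + "-" + re.escape(element[2]))
        else:
            raise ValueError(f"unsupported element in this sketch: {element!r}")
    return "[" + "".join(parts) + "]"

identifier_chars = class_body_to_traditional("a-z A-Z _")
print(identifier_chars)   # [a-zA-Z_]
```

The whitespace split is why `chars[a-zA-Z]` would not work here: `a-zA-Z` would arrive as one six-character element.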

Perl Named Classes

CRE	Traditional	Definition	Notes
`digit`	`\d`	digits `[0-9]`
`!digit`	`\D`	not digits `[^0-9]`
`whitespace`	`\s`	whitespace `[\t\n\f\r ]`
`!whitespace`	`\S`	not whitespace `[^\t\n\f\r ]`
`wordchar`	`\w`	word characters `[0-9A-Za-z_]`	`word` was confusing since it sounds like it stands for multiple characters, so we use the longer `wordchar`.
`!wordchar`	`\W`	not word characters `[^0-9A-Za-z_]`
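
The character sets in this table can be checked against Python's re, which is what Annex wraps. The sketch below is only a sanity check, with made-up names (`LISTED`, `agrees_with_table`). It uses re.ASCII so that `\d`, `\s` and `\w` are restricted to ASCII. Note that Python's `\s` also accepts the vertical tab, which is not in the set listed above.

```python
import re
import string

# Character sets as listed in the table above.
LISTED = {
    r"\d": string.digits,
    r"\s": "\t\n\f\r ",
    r"\w": string.digits + string.ascii_letters + "_",
}
NEGATED = {r"\d": r"\D", r"\s": r"\S", r"\w": r"\W"}

def agrees_with_table(escape: str) -> bool:
    """True if every listed character matches the class and not its negation."""
    positive = re.compile(escape, re.ASCII)
    negative = re.compile(NEGATED[escape], re.ASCII)
    return all(
        positive.fullmatch(ch) and not negative.fullmatch(ch)
        for ch in LISTED[escape]
    )

results = {escape: agrees_with_table(escape) for escape in LISTED}
# Python-specific difference: \v (vertical tab) is whitespace for Python's \s.
python_extra_whitespace = re.fullmatch(r"\s", "\v", re.ASCII) is not None
print(results, python_extra_whitespace)
```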

POSIX Named Classes

These are not supported in Annex.

CRE	Traditional	Definition	Notes
`:alnum`	`[:alnum:]`	alphanumeric (== [0-9A-Za-z])
`:alpha`	`[:alpha:]`	alphabetic (== [A-Za-z])
`:ascii`	`[:ascii:]`	ASCII (== [\x00-\x7F])
`:blank`	`[:blank:]`	blank (== [\t ])
`:cntrl`	`[:cntrl:]`	control (== [\x00-\x1F\x7F])
`:digit`	`[:digit:]`	digits (== [0-9])
`:graph`	`[:graph:]`	graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{\|}~])
`:lower`	`[:lower:]`	lower case (== [a-z])
`:print`	`[:print:]`	printable (== [ -~] == [ [:graph:]])
`:punct`	`[:punct:]`	punctuation (== [!-/:-@[-`{-~])
`:space`	`[:space:]`	whitespace (== [\t\n\v\f\r ])
`:upper`	`[:upper:]`	upper case (== [A-Z])
`:word`	`[:word:]`	word characters (== [0-9A-Za-z_])
`:xdigit`	`[:xdigit:]`	hex digit (== [0-9A-Fa-f])

Specifying Strings

Literals

CRE	Traditional	Definition	Notes
`'x' or "x"`	`x`	The literal string x.	Each type of string can contain the opposite quote, e.g. `'"'` is a double quote and `"'"` is a single quote. There are no backslash escapes. You can use the concatenation operator to write strings with special characters, e.g. `'a' 0x00 'b'` is like `"a\0b"` in C.

Composites

CRE	Traditional	Definition	Notes
`x y`	`xy`	x followed by y
`either x or y`	`x\|y`	x or y (prefer x)	Prefix syntax makes long alternations easier for humans and computers to parse.
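
The "prefer x" behaviour of alternation can be seen with the traditional syntax in Python's re (Annex is not used here). When both branches can match, the left one wins even if the right one would match more text.

```python
import re

# `either 'a' or 'ab'`  ->  a|ab : the first alternative wins
prefer_first = re.match(r"a|ab", "ab").group()

# `either 'ab' or 'a'`  ->  ab|a : same branches, opposite order
prefer_longer_first = re.match(r"ab|a", "ab").group()

# `'a' 'b'`  ->  ab : plain concatenation
concatenated = re.fullmatch(r"ab", "ab") is not None

print(prefer_first, prefer_longer_first, concatenated)   # a ab True
```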

Repetitions

General repetitions look like x^(modifier repetition). They're only needed in the relatively rare cases of a ranged numbered repetition or possessive repetition. Otherwise, +, *, ?, ++, **, ??, and ^n suffice.

Possessive repetitions (traditionally the + suffix: *+, ++, ?+, ...) are written like the non-greedy repetitions, but with a P instead of an N. For example, x^(P *) or x^(P 2..3). Python's re did not support these before version 3.11.

CRE	Traditional	Definition	Notes
`x*`	`x*`	zero or more x, prefer more
`x+`	`x+`	one or more x, prefer more
`x?`	`x?`	zero or one x, prefer one
`x^(n..m)`	`x{n,m}`	n or n+1 or ... or m x, prefer more
`x^(n..)`	`x{n,}`	n or more x, prefer more
`x^n`	`x{n}`	exactly n x
`x** or x^(N *)`	`x*?`	zero or more x, prefer fewer
`x++ or x^(N +)`	`x+?`	one or more x, prefer fewer
`x?? or x^(N ?)`	`x??`	zero or one x, prefer zero
`x^(N n..m)`	`x{n,m}?`	n or n+1 or ... or m x, prefer fewer
`x^(N n..)`	`x{n,}?`	n or more x, prefer fewer
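
The difference between "prefer more" and "prefer fewer" is easiest to see on a fixed input. The sketch below uses the traditional forms from the table with Python's re (not Annex) and matches each against the string `xxxx`.

```python
import re

text = "xxxx"

# Greedy forms: x*  x+  x?  x^(2..3)  x^(2..)  x^2
greedy = {p: re.match(p, text).group()
          for p in (r"x*", r"x+", r"x?", r"x{2,3}", r"x{2,}", r"x{2}")}

# Non-greedy forms: x**  x++  x??  x^(N 2..3)  x^(N 2..)
non_greedy = {p: re.match(p, text).group()
              for p in (r"x*?", r"x+?", r"x??", r"x{2,3}?", r"x{2,}?")}

print(greedy)
# {'x*': 'xxxx', 'x+': 'xxxx', 'x?': 'x', 'x{2,3}': 'xxx', 'x{2,}': 'xxxx', 'x{2}': 'xx'}
print(non_greedy)
# {'x*?': '', 'x+?': 'x', 'x??': '', 'x{2,3}?': 'xx', 'x{2,}?': 'xx'}
```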

Grouping

CRE	Traditional	Definition	Notes
`{re}`	`(re)`	numbered capturing group
`{re as name}`	`(?P<name>re) or (?<name>re) or (?'name're)`	named & numbered capturing group	no standard traditional syntax
`(re)`	`(?:re)`	non-capturing group
`MyName = expression`		Named subexpression.	Creates a pattern that can be referenced elsewhere. It does not create a capturing group. Naming convention is `CapWords`.
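
For comparison, this is how the three group kinds behave in the traditional syntax under Python's re (Annex is not used, and the date example is made up). The CRE equivalent of the pattern would be roughly `{digit+} '-' {digit+ as month} '-' (digit+)`.

```python
import re

# (re) captures by number, (?P<name>re) by name and number, (?:re) not at all.
pattern = re.compile(r"(\d+)-(?P<month>\d+)-(?:\d+)")
m = pattern.match("2024-05-17")

numbered = m.groups()       # both capturing groups, in order
named = m.groupdict()       # only the named group
print(numbered, named, pattern.groups)   # ('2024', '05') {'month': '05'} 2
```

The named group is also reachable by number (`m.group(2)`), which is what "named & numbered" refers to.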

Zero Width Assertions

These constructs match empty space.

CRE	Traditional	Definition	Notes
`%begin`	`^`	at beginning of text or line (multiline=true)
`%end`	`$`	at end of text (like \z not \Z) or line (multiline=true)
`%begin-text`	`\A`	at beginning of text
`%end-text`	`\z`	at end of text
`%boundary`	`\b`	at word boundary (\w on one side and \W, \A, or \z on the other)
`!%boundary`	`\B`	not a word boundary
`%begin-word`	`\<`	left word boundary
`%end-word`	`\>`	right word boundary
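
A short sketch of where these assertions hold, using the traditional forms in Python's re (not Annex). It records the positions at which each assertion matches. `\z`, `\<` and `\>` are omitted because they are not assumed to be available in Python's re.

```python
import re

def positions(pattern: str, text: str, flags: int = 0) -> list:
    """Offsets at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

text = "one\ntwo"
begin_default = positions(r"^", text)                   # %begin
begin_multiline = positions(r"^", text, re.MULTILINE)   # %begin, multiline=true
begin_text = positions(r"\A", text, re.MULTILINE)       # %begin-text
end_multiline = positions(r"$", text, re.MULTILINE)     # %end, multiline=true

boundaries = positions(r"\b", "hi there")               # %boundary
non_boundaries = positions(r"\B", "hi there")           # !%boundary

print(begin_default, begin_multiline, begin_text, end_multiline)
# [0] [0, 4] [0] [3, 7]
print(boundaries, non_boundaries)
# [0, 2, 3, 8] [1, 4, 5, 6, 7]
```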

Backtracking Constructs

These constructs imply a backtracking implementation. They are identified by keywords in CAPS.

Notes:

Perl has very exotic and "experimental" backtracking constructs. We only list ones that are implemented in at least one other language.
In general, these constructs are not supported by re2, because the backtracking semantics would break its linear-time guarantee.
Asserts begin with % because they're also "zero-width".

CRE	Traditional	Definition	Notes
`%ASSERT(re)`	`(?=re)`	lookahead assertion.
`!%ASSERT(re)`	`(?!re)`	negative lookahead assertion.
`%ASSERTLEFT(re)`	`(?<=re)`	lookbehind assertion.
`!%ASSERTLEFT(re)`	`(?<!re)`	negative lookbehind assertion.
`REF(1)`	`\1`	Backreference to captured group.
`REF(name)`	`(?P=name)`	Backreference to named captured group.
`IF foo THEN a ELSE b`	`(?(foo)yes\|no)`	Pattern conditional on backreference.
`RECURSE(pattern)`	`??{name} or (?n) or (?R)`	Recurse into the pattern. The pattern can be a group identified by name, number, or the entire pattern itself.
`ATOMIC(...)`	`(?>...)`	Atomic grouping.	Java and Ruby both support this.
`%END-PREV`	`\G`	The end of the previous match.	Not supported by Python (or re2).
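
The sketch below exercises the constructs from this table that Python's re accepts, written in the traditional syntax (Annex is not used, and the sample strings are made up). RECURSE and `\G` are omitted because Python's re does not provide them.

```python
import re

# %ASSERT / !%ASSERT: positions of 'q' followed / not followed by 'u'
lookahead = [m.start() for m in re.finditer(r"q(?=u)", "quit iraq")]
neg_lookahead = [m.start() for m in re.finditer(r"q(?!u)", "quit iraq")]

# %ASSERTLEFT / !%ASSERTLEFT: digits preceded / not preceded by '$'
lookbehind = re.findall(r"(?<=\$)\d+", "cost $42, qty 7")
neg_lookbehind = re.findall(r"(?<![\$\d])\d+", "cost $42, qty 7")

# REF(1) and REF(name): a doubled character
backref = re.search(r"(\w)\1", "hello").group()
named_backref = re.search(r"(?P<c>\w)(?P=c)", "hello").group()

# IF ... THEN ...: require '>' only if the optional '<' (group 1) matched
conditional = re.compile(r"(<)?\w+(?(1)>)")
conditional_results = [bool(conditional.fullmatch(s)) for s in ("<a>", "a", "<a")]

print(lookahead, neg_lookahead)          # [0] [8]
print(lookbehind, neg_lookbehind)        # ['42'] ['7']
print(backref, named_backref)            # ll ll
print(conditional_results)               # [True, True, False]
```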

Other

Top Level Syntax

A CRE can be either an expression or a list of named expressions, one of which is Start. For example, this is a valid CRE:

digit+

So is this:

Start = digit+

and this:

D     = digit+
Start = D
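
Traditional syntax has no named subexpressions, so the closest analogue in plain Python is string composition. The sketch below mirrors the last example under that assumption. The variable names are chosen to match it, and wrapping the reference in a non-capturing group reflects the rule that a named subexpression does not create a capturing group. This is not how Annex is implemented.

```python
import re

# D     = digit+
D = r"\d+"
# Start = D   (referenced as a non-capturing group)
Start = f"(?:{D})"

compiled = re.compile(Start)
print(Start, compiled.groups)   # (?:\d+) 0
```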

Flags

Flag syntax is flags(x -y z) (set x and z, clear y).

CRE	Traditional	Definition	Notes
`flags(multiline unicode ...)`	`(?flags)`	set flags. In Python this is only valid at the start of the entire pattern.	In CRE, the names for flags are whole words like `multiline`, not single letters.

List of flags:

CRE	Traditional	Definition	Notes
`ignorecase`	`i`	case-insensitive (default false)
`multiline`	`m`	multi-line mode: `^` and `$` match begin/end line in addition to begin/end text (default false)
`anyall`	`s`	let . match \n (default false)
`ungreedy`	`U`	ungreedy: swap meaning of `x*` and `x*?`, `x+` and `x+?`, etc (default false)	not available in Python.
`unicode`	`u`	Make character classes dependent on Unicode character properties database.	Python's re.UNICODE flag.
`debug`		Display debug information about compiled expression.	Python's re.DEBUG. (May be Python only?)
`locale`	`L`	Make \w, \W, \b, \B, \s and \S dependent on the current locale.	Python's re.LOCALE. (May be Python only?)
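
As a sketch of how the set/clear syntax could map onto Python's re flags, the hypothetical function `python_flags` below interprets the words inside `flags(...)`. The mapping table follows the list above, and `ungreedy` is left out because Python has no equivalent. Neither the function nor the table is part of Annex.

```python
import re

# CRE flag name -> Python re flag (assumed mapping, based on the table above)
CRE_TO_PYTHON_FLAG = {
    "ignorecase": re.IGNORECASE,
    "multiline": re.MULTILINE,
    "anyall": re.DOTALL,
    "unicode": re.UNICODE,
    "debug": re.DEBUG,
    "locale": re.LOCALE,
}

def python_flags(spec: str, initial: int = 0) -> int:
    """Apply 'x -y z' to an initial flag value: set x and z, clear y."""
    flags = initial
    for word in spec.split():
        if word.startswith("-"):
            flags &= ~CRE_TO_PYTHON_FLAG[word[1:]]   # clear
        else:
            flags |= CRE_TO_PYTHON_FLAG[word]        # set
    return flags

combined = python_flags("ignorecase -multiline anyall", initial=re.MULTILINE)
matches = re.compile(r"a.b", combined).fullmatch("A\nB") is not None
print(combined == (re.IGNORECASE | re.DOTALL), matches)   # True True
```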

Comments

CRE	Traditional	Definition	Notes
`# until end of line`	`(?#text)`	comment
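
For reference, Python's re offers both comment styles in the traditional syntax: the inline `(?#text)` form from the table, and `#` to end of line when the VERBOSE flag is set. The second is closer to how CRE comments read. The phone-number pattern below is only an illustration and does not involve Annex.

```python
import re

# Inline comment form: (?#text)
inline = re.compile(r"\d+(?#prefix)-\d+(?#line number)")

# End-of-line comments, available with re.VERBOSE
verbose = re.compile(r"""
    \d+   # prefix
    -     # separator
    \d+   # line number
""", re.VERBOSE)

same_result = (inline.fullmatch("555-1234") is not None
               and verbose.fullmatch("555-1234") is not None)
print(same_result)   # True
```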

Reference

List of Reserved Words

A word without a punctuation prefix (e.g. %begin, :alnum) is assumed to be the name of a subexpression, except if it's one of these reserved words.

either, or -- alternation
as -- capturing
chars -- start a character class, e.g. chars[a-z]
flags -- for compilation flags.
any -- single metacharacter
wordchar, digit, whitespace -- Perl classes
REF, ASSERT, ASSERTLEFT, IF THEN ELSE, ATOMIC, RECURSE -- backtracking constructs in CAPS

List of Punctuation Used

= -- definition of subexpressions
* + ? ^ -- repetitions
' and " -- strings
() -- grouping, for REF(1), for repeats like a^(1..3)
[] -- character classes
{} -- capture
- -- range of characters
.. -- repetition range
# -- comment
! -- negation of character classes and named classes (Perl, POSIX, Unicode)
: -- beginning of named classes
% -- zero width assertions
& -- character literals

Unused: / \ ~ @ $ | ; ` < > , .

Grammar

Parser for CRE -- This is in TPE syntax. It's auto-generated from the source code, so it's the most precise documentation. TPE itself is rigorously specified, like PEG.

TODO: lexer for CRE. Note that the edges are ordered.
- can you generate a state machine of some sort for CRE lexer?