This is the reference for CRE syntax. It describes each construct, and lists the corresponding Perl-like syntax.
Annex, a Python library, supports only a subset of CRE syntax, since it's a wrapper around Python's re module.
CREs (and traditional syntax) can be thought of in two parts: a language to specify sets of characters, and a language on top of that to specify sets of strings.
These constructs stand for a set of characters.
CRE | Traditional | Definition | Notes |
any | . | any character, possibly including newline (anyall=true) | |
chars[x y z] | [xyz] | character class | Insignificant space makes these more readable. |
!chars[x y z] | [^xyz] | negated character class | |
digit | \d | Perl character class | |
!digit | \D | negated Perl character class | |
:alpha | [:alpha:] | ASCII character class | |
!:alpha | [:^alpha:] | negated ASCII character class | |
::Greek | \p{Greek} | Unicode character class | |
!::Greek | \P{Greek} | negated Unicode character class | |
These constructs stand for exactly one character. They're all valid inside or outside a character class. There are no octal escapes.
CRE | Traditional | Definition | Notes |
0x00 | varies | Hex escape | |
&201c | varies | Unicode code point | |
&name | e.g. \n | Named character. | See the list of named characters below. |
This is a list of named characters. There are no equivalents of \b for backspace or \f for form feed, etc. Instead, use hex or Unicode escapes.
CRE | Traditional | Definition | Notes |
&space | | space character | Traditional regexes just use a literal space, or [ ]. Whitespace is never significant in CREs, so this is necessary to represent a space inside a character class. |
&newline &cr &tab | \n \r \t | whitespace characters | The backslash doesn't mean anything special in CRE. Use the default concatenation operator: 'a' &newline 'b' instead of a\nb. |
&hyphen &bang &hash &lbracket &rbracket | \- ! \# \[ \] | The character named. | Only needed inside character classes. Otherwise use a quoted literal such as '-'. |
&squote &dquote | ' " | The character named. | Purely syntactic sugar. You can always use chars["] inside a char class, or '"' outside of one. |
These elements may appear inside a character class.
Named classes can be negated with the ! operator: !digit, !:alnum, !::Greek.
CRE | Traditional | Definition | Notes |
x | x | single character | |
A-Z | A-Z | character range (inclusive) | Ranges must be separated by space, e.g. chars[a-z A-Z] not chars[a-zA-Z]. Escapes are also allowed, e.g. chars[0x00 - 0x20]. |
In addition, all character literals, as well as named classes and their negations, may appear within a character class. For example: digit, !digit, :alnum, !:alnum, &space, 0x00, and &201c.
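To make the mapping between the two class notations concrete, here is a minimal Python sketch that turns the text between chars[ and ] into a traditional bracket expression. It is an illustration only, not Annex's implementation: the function names and lookup tables are hypothetical, and it handles just single characters, ranges, two-digit hex escapes, the named characters, and the Perl classes.

```python
import re

# Hypothetical lookup tables for this sketch; they are not taken from Annex.
NAMED_CHARS = {
    "&space": " ", "&newline": "\n", "&cr": "\r", "&tab": "\t",
    "&hyphen": "-", "&bang": "!", "&hash": "#",
    "&lbracket": "[", "&rbracket": "]", "&squote": "'", "&dquote": '"',
}
PERL_CLASSES = {
    "digit": r"\d", "!digit": r"\D",
    "whitespace": r"\s", "!whitespace": r"\S",
    "wordchar": r"\w", "!wordchar": r"\W",
}


def _char(token):
    """Translate one character literal (named, hex, or plain) for use in [...]."""
    if token in NAMED_CHARS:
        return re.escape(NAMED_CHARS[token])
    if re.fullmatch(r"0x[0-9A-Fa-f]{2}", token):
        return "\\x" + token[2:]
    if len(token) == 1:
        return re.escape(token)
    raise ValueError("unsupported item in this sketch: %r" % token)


def class_body_to_traditional(body):
    """Turn the text between chars[ and ] into a traditional bracket expression."""
    # Allow optional spaces around the range hyphen, as in chars[0x00 - 0x20].
    body = re.sub(r"\s*-\s*", "-", body.strip())
    parts = []
    for item in body.split():
        if item in PERL_CLASSES:
            parts.append(PERL_CLASSES[item])
        elif "-" in item:
            low, high = item.split("-", 1)
            parts.append(_char(low) + "-" + _char(high))
        else:
            parts.append(_char(item))
    return "[" + "".join(parts) + "]"


pattern = class_body_to_traditional("a-z A-Z digit &hyphen")
```

Because a literal hyphen is always written &hyphen, the sketch can treat every bare - as a range operator, which is what lets the space-separated form stay unambiguous.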
CRE | Traditional | Definition | Notes |
digit | \d | digits [0-9] | |
!digit | \D | not digits [^0-9] | |
whitespace | \s | whitespace [\t\n\f\r ] | |
!whitespace | \S | not whitespace [^\t\n\f\r ] | |
wordchar | \w | word characters [0-9A-Za-z_] | word was confusing since it sounds like it stands for multiple characters, so we use the longer wordchar. |
!wordchar | \W | not word characters [^0-9A-Za-z_] | |
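For readers who want to try the Traditional column directly, the following sketch uses Python's re module (which Annex wraps) rather than CRE itself. The sample text is made up. Note that in Python 3, str patterns treat \d and \w as Unicode-aware by default, so re.ASCII is passed here to get the ASCII definitions shown in the table.

```python
import re

text = "id_42 = 7\n"

# digit / \d: single digits, restricted to [0-9] by re.ASCII.
digits = re.findall(r"\d", text, re.ASCII)

# !whitespace / \S: runs of non-whitespace characters.
non_space_runs = re.findall(r"\S+", text)

# wordchar / \w: runs of [0-9A-Za-z_].
word_runs = re.findall(r"\w+", text, re.ASCII)

# !wordchar / \W: every character outside [0-9A-Za-z_].
non_word = re.findall(r"\W", text, re.ASCII)
```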
These are not supported in Annex.
CRE | Traditional | Definition | Notes |
:alnum | [:alnum:] | alphanumeric (== [0-9A-Za-z]) | |
:alpha | [:alpha:] | alphabetic (== [A-Za-z]) | |
:ascii | [:ascii:] | ASCII (== [\x00-\x7F]) | |
:blank | [:blank:] | blank (== [\t ]) | |
:cntrl | [:cntrl:] | control (== [\x00-\x1F\x7F]) | |
:digit | [:digit:] | digits (== [0-9]) | |
:graph | [:graph:] | graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]) | |
:lower | [:lower:] | lower case (== [a-z]) | |
:print | [:print:] | printable (== [ -~] == [ [:graph:]]) | |
:punct | [:punct:] | punctuation (== [!-/:-@[-`{-~]) | |
:space | [:space:] | whitespace (== [\t\n\v\f\r ]) | |
:upper | [:upper:] | upper case (== [A-Z]) | |
:word | [:word:] | word characters (== [0-9A-Za-z_]) | |
:xdigit | [:xdigit:] | hex digit (== [0-9A-Fa-f]) | |
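Python's re module has no POSIX bracket classes, which is presumably why Annex does not support these. One workaround is to expand each name into the explicit ranges given in the Definition column. The sketch below does that; POSIX_BODIES and posix_class are hypothetical names for illustration and are not part of Annex or re.

```python
import re

# Bracket-expression bodies copied from the Definition column above.
# The "[" in punct is escaped so Python does not warn about a nested set.
POSIX_BODIES = {
    "alnum": r"0-9A-Za-z",
    "alpha": r"A-Za-z",
    "ascii": r"\x00-\x7F",
    "blank": r"\t ",
    "cntrl": r"\x00-\x1F\x7F",
    "digit": r"0-9",
    "graph": r"!-~",
    "lower": r"a-z",
    "print": r" -~",
    "punct": r"!-/:-@\[-`{-~",
    "space": r"\t\n\v\f\r ",
    "upper": r"A-Z",
    "word": r"0-9A-Za-z_",
    "xdigit": r"0-9A-Fa-f",
}


def posix_class(name, negate=False):
    """Return a traditional [...] class for a POSIX class name such as 'alpha'."""
    body = POSIX_BODIES[name]
    return "[" + ("^" if negate else "") + body + "]"


hex_word = re.fullmatch(posix_class("xdigit") + "+", "DeadBeef")
```

Passing negate=True corresponds to the ! prefix, so posix_class("alpha", negate=True) plays the role of !:alpha.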
CRE | Traditional | Definition | Notes |
'x' or "x" | x | The literal string x. | Each type of string can contain the opposite quote, e.g. '"' is a double quote and "'" is a single quote. There are no backslash escapes. You can use the concatenation operator to write strings with special characters, e.g. 'a' 0x00 'b' is like "a\0b" in C. |
CRE | Traditional | Definition | Notes |
x y | xy | x followed by y | |
either x or y | x|y | x or y (prefer x) | Prefix syntax makes long alternations easier for humans and computers to parse. |
General repetitions look like x^(modifier repetition). They're only needed in the relatively rare cases of a ranged numbered repetition or possessive repetition. Otherwise, +, *, ?, ++, **, ??, and ^n suffice.
Possessive repetitions (traditionally the + suffix: *+, ++, ?+, ...) are like the non-greedy repetitions, but with a P instead of an N. For example, x^(P *) or x^(P 2..3). These are not supported by Python's re module before Python 3.11.
CRE | Traditional | Definition | Notes |
x* | x* | zero or more x, prefer more | |
x+ | x+ | one or more x, prefer more | |
x? | x? | zero or one x, prefer one | |
x^(n..m) | x{n,m} | n or n+1 or ... or m x, prefer more | |
x^(n..) | x{n,} | n or more x, prefer more | |
x^n | x{n} | exactly n x | |
x** or x^(N *) | x*? | zero or more x, prefer fewer | |
x++ or x^(N +) | x+? | one or more x, prefer fewer | |
x?? or x^(N ?) | x?? | zero or one x, prefer zero | |
x^(N n..m) | x{n,m}? | n or n+1 or ... or m x, prefer fewer | |
x^(N n..) | x{n,}? | n or more x, prefer fewer | |
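The difference between "prefer more" and "prefer fewer" is easiest to see on a concrete input. This sketch uses the Traditional column with Python's re module. The sample strings are made up, and the comments name the CRE form each pattern would correspond to according to the table.

```python
import re

tags = "<a><b>"

# any+ in CRE: one or more, prefer more, so the match runs to the last ">".
greedy = re.match(r"<.+>", tags).group()

# any++ or any^(N +) in CRE: one or more, prefer fewer, so it stops at the first ">".
non_greedy = re.match(r"<.+?>", tags).group()

number = "12345"

# digit^(2..3): two or three digits, prefer more.
ranged = re.match(r"\d{2,3}", number).group()

# digit^(N 2..3): two or three digits, prefer fewer.
ranged_non_greedy = re.match(r"\d{2,3}?", number).group()
```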
CRE | Traditional | Definition | Notes |
{re} | (re) | numbered capturing group | |
{re as name} | (?P<name>re) or (?<name>re) or (?'name're) | named & numbered capturing group | no standard traditional syntax |
(re) | (?:re) | non-capturing group | |
MyName = expression | | Named subexpression. | Creates a pattern that can be referenced elsewhere. It does not create a capturing group. Naming convention is CapWords. |
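As a rough illustration of the three grouping forms, here is how they behave in Python's re module, which uses the (?P<name>re) spelling for named groups. The date string and the group name year are made up for the example.

```python
import re

# (?P<year>...) is a named and numbered group, like {re as year} in CRE.
# (?:...) groups without capturing, like (re) in CRE.
# (\d{2}) is a plain numbered group, like {re} in CRE.
match = re.match(r"(?P<year>\d{4})-(?:0[1-9]|1[0-2])-(\d{2})", "2013-01-27")

by_name = match.group("year")   # the named group is also group 1
by_number = match.group(2)      # the non-capturing group does not get a number
all_groups = match.groups()
```

The month alternation needs grouping but is not interesting to extract, which is the typical case for a non-capturing group.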
These constructs match empty space.
CRE | Traditional | Definition | Notes |
%begin | ^ | at beginning of text or line (multiline=true) | |
%end | $ | at end of text (like \z not \Z) or line (multiline=true) | |
%begin-text | \A | at beginning of text | |
%end-text | \z | at end of text | |
%boundary | \b | at word boundary (\w on one side and \W, \A, or \z on the other) | |
!%boundary | \B | not a word boundary | |
%begin-word | \< | left word boundary | |
%end-word | \> | right word boundary | |
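The following sketch exercises a few of these assertions through Python's re module, using the Traditional spellings. The sample text is made up. It covers only ^ with the multiline flag, \A, \b and \B.

```python
import re

text = "cat concat\ncatalog"

# %begin with multiline set: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# %begin-text: \A matches only at the very start, even in multiline mode.
text_start = re.findall(r"\A\w+", text, re.MULTILINE)

# %boundary on both sides: "cat" as a whole word only.
whole_word = re.findall(r"\bcat\b", text)

# !%boundary: "cat" that does not start at a word boundary (inside "concat").
inside = [m.start() for m in re.finditer(r"\Bcat", text)]
```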
These constructs imply a backtracking implementation. They are identified by keywords in CAPS.
Notes: These constructs are not supported by re2, because the backtracking semantics would break its linear-time guarantee.
CRE | Traditional | Definition | Notes |
%ASSERT(re) | (?=re) | lookahead assertion. | |
!%ASSERT(re) | (?!re) | negative lookahead assertion. | |
%ASSERTLEFT(re) | (?<=re) | lookbehind assertion. | |
!%ASSERTLEFT(re) | (?<!re) | negative lookbehind assertion. | |
REF(1) | \1 | Backreference to captured group. | |
REF(name) | (?P=name) | Backreference to named captured group. | |
IF foo THEN a ELSE b | (?(foo)yes|no) | Pattern conditional on backreference. | |
RECURSE(pattern) | ??{name} or (?n) or (?R) | Recurse into the pattern. The pattern can be a group identified by name, number, or the entire pattern itself. | |
ATOMIC(...) | (?>...) | Atomic grouping. | Java and Ruby both support this. |
%END-PREV | \G | The end of the previous match. | Not supported by Python (or re2). |
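A short sketch of the assertion and backreference forms, again written in the Traditional syntax accepted by Python's re module. The input strings and the group name q are made up, and the comments name the CRE construct each pattern would correspond to according to the table.

```python
import re

# %ASSERT(re) / (?=re): digits only when followed by "px".
ahead = re.findall(r"\d+(?=px)", "10px 20em 30px")

# %ASSERTLEFT(re) / (?<=re): digits only when preceded by "$".
behind = re.findall(r"(?<=\$)\d+", "$5 and 6")

# !%ASSERTLEFT(re) / (?<!re): whole numbers not preceded by "$".
not_behind = re.findall(r"(?<!\$)\b\d+", "$5 and 6")

# REF(1) / \1: the same word twice in a row.
doubled = re.search(r"\b(\w+) \1\b", "it was the the cat").group(1)

# REF(name) / (?P=name): the closing quote must equal the opening quote.
quoted = re.search(r"""(?P<q>['"]).*?(?P=q)""", """say "hi" or 'bye'""").group()
```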
A CRE can be either an expression or a list of named expressions, one of which is Start. For example, this is a valid CRE:
digit+
So is this:
Start = digit+
and this:
D = digit+
Start = D
Flag syntax is flags(x -y z) (set x and z, clear y).
CRE | Traditional | Definition | Notes |
flags(multiline unicode ...) | (?flags) | set flags. In Python this is only valid at the start of the entire pattern. | In CRE, the names for flags are whole words like multiline, not single letters. |
List of flags:
CRE | Traditional | Definition | Notes |
ignorecase | i | case-insensitive (default false) | |
multiline | m | multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false) | |
anyall | s | let . match \n (default false) | |
ungreedy | U | ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false) | not available in Python. |
unicode | u | Make character classes dependent on Unicode character properties database. | Python's re.UNICODE flag. |
debug | | Display debug information about compiled expression. | Python's re.DEBUG. (May be Python only?) |
locale | L | Make \w, \W, \b, \B, \s and \S dependent on the current locale. | Python's re.LOCALE. (May be Python only?) |
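To show what the most common flags change, here is a small sketch using Python's re module, where anyall corresponds to re.DOTALL, ignorecase to re.IGNORECASE, and multiline to re.MULTILINE. The sample strings are made up. The last pattern uses the single-letter inline form from the Traditional column.

```python
import re

# anyall / s: by default "." does not match a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

lines = "Start\nSTART x"

# ignorecase / i and multiline / m passed as compile-time flags...
by_argument = re.findall(r"^start", lines, re.IGNORECASE | re.MULTILINE)

# ...or set with an inline (?flags) group at the very start of the pattern.
inline = re.findall(r"(?im)^start", lines)
```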
CRE | Traditional | Definition | Notes |
# until end of line | (?#text) | comment | |
A word without a punctuation prefix (such as the % in %begin or the : in :alnum) is assumed to be the name of a subexpression, except if it's one of these reserved words.
either, or -- alternation
as -- capturing
chars -- start a character class, e.g. chars[a-z]
flags -- for compilation flags
any -- single metacharacter
wordchar, digit, whitespace -- Perl classes
REF, ASSERT, ASSERTLEFT, IF THEN ELSE, ATOMIC, RECURSE -- backtracking constructs in CAPS
= -- definition of subexpressions
* + ? ^ -- repetitions
' and " -- strings
() -- grouping, for REF(1), for repeats like a^(1..3)
[] -- character classes
{} -- capture
- -- range of characters
.. -- repetition range
# -- comment
! -- negation of character classes, named classes (perl posix unicode)
: -- beginning of named classes
% -- zero width assertions
& -- character literals
Unused: / \ ~ @ $ | ; ` < > , .
Parser for CRE -- This is in TPE syntax. It's auto-generated from the source code, so it's the most precise documentation. TPE itself is rigorously specified, like PEG.
Last modified: 2013-01-27 10:40:07 -0800