Status: Draft / Needs Editing
The goal of this doc is to make it hard to say this with a straight face: "regular expression syntax is maybe a little ugly, but basically OK." It's not really OK.
Regexes are actually composed of two little languages: a way to specify
strings and a way to specify sets of characters. Inside a character
class, ^ and - have special meanings. Outside a character class, characters
like . and * are metacharacters. In Python's verbose mode, a space is
insignificant outside a character class, but still significant inside one.
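A minimal sketch of the two little languages, assuming Python's re module; the variable names are illustrative only. The same characters switch between metacharacter and literal depending on whether they sit inside a character class.

```python
import re

# Outside a character class, '.' means "any character" and '*' means "repeat".
outside = re.fullmatch(r"a.*", "abcd")

# Inside a character class, the same two characters are plain literals.
inside_dot = re.fullmatch(r"[.*]", ".")
inside_star = re.fullmatch(r"[.*]", "*")
inside_other = re.fullmatch(r"[.*]", "x")   # no match: 'x' is neither '.' nor '*'

# '^' and '-' are special inside a class: negation and range.
negated_upper = re.fullmatch(r"[^a-z]", "Q")
negated_lower = re.fullmatch(r"[^a-z]", "q")   # no match: 'q' is in a-z
```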
- (?:...) is a hard syntax to read for non-capturing groups.
- ^ and $ are arbitrary in what they stand for, and hard to remember.
- (?P<foo>...) -- for named capturing groups and named backreferences, the P
  is somewhat random.

Significant space makes regular expressions very hard to read. Even with
Python's re.VERBOSE or Perl's /x, space within character classes is still
significant. Reading expressions like [^"'\\\[\]] is hard.
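The whitespace rule can be checked directly. This is a small sketch using Python's re.VERBOSE; the pattern and variable names are made up for illustration.

```python
import re

# In verbose mode the spaces around the tokens are ignored...
spaced = re.compile(r" a   b ", re.VERBOSE)
spaced_matches_ab = spaced.fullmatch("ab") is not None
spaced_matches_a_b = spaced.fullmatch("a b") is not None

# ...but a space inside a character class still matches a literal space.
in_class = re.compile(r" a [ ] b ", re.VERBOSE)
class_matches_a_b = in_class.fullmatch("a b") is not None
class_matches_ab = in_class.fullmatch("ab") is not None
```

So the one flag meant to make patterns readable still leaves a region where every space counts.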
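\ is overloaded. It can:
- Escape a metacharacter: \+ is the + character.
- Denote a special character: \n or \010
- Denote a backreference: \1
- Denote a character class: \D stands for any character that's not
  a digit.
- Denote a zero-width assertion (\A stands for the beginning of the string).

It also means something to the programming language (in Python, this is disabled
with raw strings, like r'foo\\').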
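The roles of the backslash can be put side by side. The snippet below is only a sketch with Python's re module; the sample strings and variable names are illustrative.

```python
import re

# Escaping a metacharacter: \+ matches a literal plus sign.
literal_plus = re.fullmatch(r"1\+1", "1+1")

# Special characters: \n is a newline, \010 is the character with octal code 10.
newline = re.fullmatch(r"\n", "\n")
octal_char = re.fullmatch(r"\010", "\x08")

# Backreference: \1 repeats whatever group 1 matched.
doubled = re.fullmatch(r"(ab)\1", "abab")

# Character class: \D is any non-digit.
non_digit = re.fullmatch(r"\D", "x")

# Zero-width assertion: \A only matches at the start of the string.
not_at_start = re.search(r"\Ab", "ab")   # no match: 'b' is not at the start

# The host language also interprets backslashes unless a raw string is used.
same_pattern = "1\\+1" == r"1\+1"
```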
This creates some interesting quirks:
- \001 and \01 both signify an octal character ... but \1 is a backreference!
- \b is a word boundary outside a character class, while it's a backspace
  character inside one! (That is Python's behavior; re2 chooses to avoid it.)
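Both quirks are easy to reproduce. A short sketch, assuming Python's re module, with illustrative variable names:

```python
import re

# \01 and \001 are octal escapes for the character with code 1...
octal_two = re.fullmatch(r"\01", "\x01")
octal_three = re.fullmatch(r"\001", "\x01")

# ...but \1 is a backreference, and an error here because there is no group 1.
try:
    re.compile(r"\1")
    backref_error = None
except re.error as exc:
    backref_error = exc

# \b is a word boundary outside a class and a backspace inside one.
boundary = re.findall(r"\bcat\b", "cat concat cat")
backspace = re.findall(r"[\b]", "a\x08b")
```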
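? is overloaded.
- As a quantifier it means "optional": aa? matches one or two a's.
- After another quantifier it requests a non-greedy match, as in +?
+ is overloaded.
- Besides "one or more", a doubled + forms a possessive quantifier in some
  engines, as in ++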
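The two meanings of ? can be seen in a small sketch with Python's re module (names are illustrative). The doubled ++ form is left out here because support for it varies between engines and versions.

```python
import re

# '?' as a quantifier: the second 'a' is optional.
optional = [s for s in ("a", "aa", "aaa") if re.fullmatch(r"aa?", s)]

# '?' as a modifier: it turns the greedy '+' into a non-greedy one.
greedy = re.match(r"<.+>", "<a><b>").group()
lazy = re.match(r"<.+?>", "<a><b>").group()
```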
The extension syntax encompasses unrelated concepts.
- (?:...) is a non-capturing group
- (?=...) is a lookahead assertion
- (?P=name) is a named backreference
- (?P<name>pat) is a named group
- (?i) turns on a flag
- (?#comment) is a comment

^ means 3 different things:
- At the front of a character class, it negates the class.
- Outside a character class, it stands for the start of the string.
- In multiline mode it actually stands for the start of a line too.

A hyphen (-) creates a range in a character class, except at the end, where it
stands for a literal hyphen. (NOTE: I guess \- is a hyphen.)
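The meanings of ^ and - described above can be sketched as follows, assuming Python's re module; the sample text and variable names are illustrative.

```python
import re

text = "ab\ncd"

# 1. At the front of a character class, '^' negates the class.
not_a = re.findall(r"[^a]", "banana")

# 2. Outside a class, '^' anchors at the start of the string.
start_only = re.findall(r"^\w", text)

# 3. With re.MULTILINE it also anchors at the start of every line.
each_line = re.findall(r"^\w", text, re.MULTILINE)

# '-' forms a range in the middle of a class, but is a literal hyphen
# at the end of the class or when escaped.
in_range = re.findall(r"[a-c]", "a-b-d")
at_end = re.findall(r"[ac-]", "a-b-d")
escaped = re.findall(r"[a\-c]", "a-b-d")
```

At least in Python, the escaped form behaves like the trailing hyphen, which answers the NOTE above for that engine.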
Very subtle rules:
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a \b \f \n \r \t \v \x \\
(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)
Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
Negation has 3 syntaxes:
- [^a] vs [a] -- negate a character class by adding ^ at the front.
- \d vs \D -- negate a Perl-style character class by capitalizing.
- (?!...) -- negative lookahead assertion uses !

In CRE, ~ is the only syntax for negation (~chars[], ~digit, or ~ASSERT).
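For comparison, here are the three traditional negation syntaxes in one place. This is only a sketch with Python's re module; it does not show CRE syntax, and the names are illustrative.

```python
import re

sample = "a1 b2"

# 1. '^' at the front of a character class.
not_letters = re.findall(r"[^a-z]", sample)

# 2. Capitalizing a Perl-style class: \D is "not \d".
not_digits = re.findall(r"\D", sample)

# 3. '!' in a lookahead: letters that are NOT followed by '1'.
not_before_1 = re.findall(r"[a-z](?!1)", sample)
```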
Zero-width assertions have multiple syntax styles (a historical accident):
- ^ and $ are the traditional style
- \A and \Z are closely related but look completely different. They look
  like character classes.

Another historical accident:
- \1 is a numbered backreference, but (?P=name) is a named backreference.
  (CRE is consistent with REF(1) and REF(name).)

On a related note, constructs have been added that conflate regular expressions with a backtracking implementation.
CRE tries to keep the two apart by using one syntax for constructs defined by traditional automata theory and a different syntax for constructs defined by the backtracking algorithm.
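The assertion and backreference styles listed above can be compared in Python. A minimal sketch, assuming the re module; the group name ch and the sample text are made up for illustration.

```python
import re

text = "one\ntwo\n"

# '^' changes meaning under re.MULTILINE; \A does not.
caret_hits = re.findall(r"^\w+", text, re.MULTILINE)
slash_a_hits = re.findall(r"\A\w+", text, re.MULTILINE)

# '$' also matches just before a trailing newline; Python's \Z only
# matches at the very end of the string.
dollar = re.search(r"two$", text) is not None
slash_z = re.search(r"two\Z", text) is not None

# Numbered and named backreferences do the same job with unrelated syntax.
numbered = re.fullmatch(r"(\w)\1", "aa") is not None
named = re.fullmatch(r"(?P<ch>\w)(?P=ch)", "aa") is not None
```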
Again, the backslash is too heavily overloaded.
- \g<0> means the whole match, and \g<1> means group 1.
- \0 does NOT mean the whole match, but \1 means group 1.
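In Python these forms show up in re.sub replacement templates. The sketch below uses an illustrative pattern and input to show how they differ.

```python
import re

# \g<0> is the whole match and \g<1> is group 1.
whole = re.sub(r"(b)c", r"[\g<0>]", "abcd")
group1 = re.sub(r"(b)c", r"[\g<1>]", "abcd")

# \1 is also group 1, but \0 is an octal escape (NUL), not the whole match.
short1 = re.sub(r"(b)c", r"[\1]", "abcd")
nul = re.sub(r"(b)c", r"[\0]", "abcd")
```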
Extraneous syntax.
TODO: mention the match vs. search confusion.
Also TODO: document exceptions. You have to import another module to catch an exception;
sre.parse_template throws them.
Last modified: 2013-01-29 10:42:48 -0800