What's Wrong with Regular Expression Syntax

Status: Draft / Needs Editing

The goal of this doc is to make it hard to say this with a straight face: "regular expression syntax is maybe a little ugly, but basically OK." It's not really OK.

Regexes are actually composed of two little languages: one for specifying strings and one for specifying sets of characters. Inside a character class, ^ and - have special meanings. Outside a character class, characters like . and * are metacharacters. In Python's verbose mode, a space is insignificant outside a character class, but still significant inside one.
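
A minimal sketch of the two little languages in Python's re module, assuming a current CPython 3. The variable names are only for illustration. The same characters are read differently depending on whether they sit inside or outside a character class:

```python
import re

# "String" language: . and * are metacharacters outside a character class.
outside = re.compile(r".*")
# "Set" language: inside a class, . and * are ordinary members of the set.
inside = re.compile(r"[.*]")

print(bool(outside.fullmatch("xyz")))  # True: any run of characters
print(bool(inside.fullmatch("x")))     # False: only "." or "*" is in the set
print(bool(inside.fullmatch("*")))     # True

# Verbose mode drops the space outside a class but keeps it inside one.
verbose_outside = re.compile(r"a b", re.VERBOSE)
verbose_inside = re.compile(r"[a b]", re.VERBOSE)

print(bool(verbose_outside.fullmatch("ab")))   # True: the space was ignored
print(bool(verbose_outside.fullmatch("a b")))  # False
print(bool(verbose_inside.fullmatch(" ")))     # True: the space is a set member
```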

General Awkwardness

(?:...) is a hard-to-read syntax for non-capturing groups.
Characters like ^ and $ are arbitrary in what they stand for, and hard to remember.
(?P<foo>...) -- for named capturing groups and named backreferences, the P is somewhat random.

Significant space makes regular expressions very hard to read. Even with Python's re.VERBOSE or Perl's /x, space within character classes is still significant. Reading expressions like [^"'\\\[\]] is hard.

Same syntax for different concepts

\ is overloaded. It can:
- take away the special meaning of the following character -- escaping, e.g. \+ is the + character.
- write an unprintable character in ASCII -- e.g. \n or \010.
- refer to a previously captured group -- backreference, e.g. \1.
- name a class of characters -- e.g. \D stands for any character that's not a digit.
- refer to a position in the string -- zero-width assertion, e.g. \A stands for the beginning of the string.

It also means something to the programming language (in Python, this is disabled with raw strings, like r'foo\\').
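
The five roles of the backslash listed above, plus the string-literal layer, can be seen in one short Python sketch. The dictionary keys are labels made up for this example, not terminology from the re module:

```python
import re

# The Python string layer sees the backslash first unless the literal is raw.
print(len("\n"), len(r"\n"))  # 1 2

roles = {
    "escape": re.fullmatch(r"1\+1", "1+1"),            # \+ is a literal plus
    "unprintable": re.fullmatch(r"\010", "\x08"),      # octal 010 is chr(8)
    "backreference": re.fullmatch(r"(a|b)\1", "bb"),   # \1 repeats group 1
    "class": re.fullmatch(r"\D", "x"),                 # \D is any non-digit
    "position": re.search(r"\Afoo", "foo bar"),        # \A is start of string
}

for role, match in roles.items():
    print(role, match is not None)
```

Every entry matches, even though the character after the backslash is the only clue to which of the five meanings applies.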

This creates some interesting quirks:

\001 and \01 both signify an octal character ... but \1 is a backreference!
\b is a word boundary outside a character class, while it's a backspace character inside one! (That is in Python; re2 chooses to avoid this.)
? is overloaded.
- It originally meant "0 or 1". aa? matches one or two a's.
- Adding it after a repetition operator turns it into a non-greedy match, e.g. +?.
- After an open paren it is the extension syntax. See below.
+ is overloaded.
- Originally it meant "1 or more"
- Adding it after another repetition operator is a "possessive" repetition, e.g. ++.
The extension syntax encompasses unrelated concepts.
- (?:...) is a non-capturing group
- (?=...) is a lookahead assertion
- (?P=name) is a named backreference
- (?P<name>pat) is a named group
- (?i) turns on a flag
- (?#comment) is a comment
^ means 4 different things:
- Usually it stands for the "start" of a string
- in multiline mode it actually stands for the start of a line too.
- At the beginning of a character class, it means the entire class is negated.
- Elsewhere in a character class, it stands for itself.
The hyphen (-) creates a range in a character class, except at the start or end, where it stands for a literal hyphen. (NOTE: \- is also a literal hyphen.)
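
Several of the overloads above can be checked directly in Python. This is a sketch with illustrative variable names. It leaves out the possessive ++ form, which older versions of Python's re module do not accept:

```python
import re

# ? as "0 or 1", as a non-greedy modifier, and as the start of an extension.
optional = [bool(re.fullmatch(r"aa?", s)) for s in ("a", "aa", "aaa")]
greedy = re.match(r"a+", "aaa").group()
lazy = re.match(r"a+?", "aaa").group()
group_count = re.compile(r"(?:ab)(c)").groups  # (?:...) is not counted

# ^ as start of string, start of line, class negation, and a literal.
start_of_string = re.findall(r"^\w+", "one\ntwo")
start_of_line = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
negated = re.findall(r"[^a]", "a^b")
literal_caret = re.findall(r"[a^]", "a^b")

# - as a range operator and as a literal hyphen.
as_range = re.findall(r"[a-c]", "abc-")
at_end = re.findall(r"[ac-]", "abc-")
escaped = re.findall(r"[a\-c]", "abc-")

print(optional, greedy, lazy, group_count)
print(start_of_string, start_of_line, negated, literal_caret)
print(as_range, at_end, escaped)
```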

Very subtle rules:

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a \b \f \n \r \t \v \x \\

(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
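
The octal rule and the \b rule quoted above can be exercised in Python as follows. This is a sketch rather than an exhaustive test, and the variable names are invented for the example:

```python
import re

# A single nonzero digit is a group reference.
backref = re.fullmatch(r"(a)\1", "aa")
# A leading 0 makes an octal escape for chr(1), even though group 1 exists.
leading_zero = re.fullmatch(r"(a)\01", "a\x01")
# Three octal digits are also an octal escape (0o101 is "A").
three_digits = re.fullmatch(r"(a)\101", "aA")

# \b is a word boundary outside a class and a backspace inside one.
boundary = re.search(r"\bcat\b", "a cat sat")
backspace = re.fullmatch(r"[\b]", "\x08")

for name, match in [("backref", backref), ("leading_zero", leading_zero),
                    ("three_digits", three_digits), ("boundary", boundary),
                    ("backspace", backspace)]:
    print(name, match is not None)
```

Each of the five patterns matches. Adding or removing a single digit after the backslash silently changes a backreference into a character code.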

Different syntax for the same concepts

Negation has 3 syntaxes:

[^a] vs [a] -- negate a character class by adding ^ at the front.
\d vs \D -- negate a Perl-style character class by capitalizing.
(?!...) -- negative lookahead assertion uses !

In CRE, ~ is the only syntax for negation (~chars[], ~digit, or ~ASSERT).
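
For comparison, here are the three negation spellings side by side in Python's re. It is a small sketch, and the sample strings are arbitrary:

```python
import re

# Negated character class: ^ at the front.
not_a = re.findall(r"[^a]", "abc")
# Negated Perl-style class: capital letter.
not_digit = re.findall(r"\D", "a1b2")
# Negative lookahead: ! after (? -- matches a "q" not followed by "u".
q_without_u = [m.start() for m in re.finditer(r"q(?!u)", "quit qat")]

print(not_a)        # ['b', 'c']
print(not_digit)    # ['a', 'b']
print(q_without_u)  # [5]
```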

Zero-width assertions have multiple syntax styles (a historical accident):

^ and $ are the traditional style
\A and \Z are closely related but look completely different. They look like character classes.
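
A short Python sketch of how the two styles differ in practice: re.MULTILINE changes ^ and $ but leaves \A and \Z alone. The text value is arbitrary:

```python
import re

text = "one\ntwo"

# Without MULTILINE, ^...$ has to span the whole string, so nothing matches.
plain = re.findall(r"^\w+$", text)
# With MULTILINE, ^ and $ also match at line boundaries.
multiline = re.findall(r"^\w+$", text, re.MULTILINE)
# \A and \Z ignore MULTILINE and only match at the ends of the string.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)

print(plain)         # []
print(multiline)     # ['one', 'two']
print(string_start)  # ['one']
print(string_end)    # ['two']
```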

Another historical accident:

\1 is a numbered backreference, but (?P=name) is a named backreference. (CRE is consistent with REF(1) and REF(name)).
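
To make the contrast concrete, here are the numbered and named backreference forms in Python, both finding a doubled letter. The group name ch is chosen only for this sketch:

```python
import re

# Numbered backreference: backslash plus digit.
numbered = re.compile(r"(\w)\1")
# Named backreference: the (?P=...) extension syntax.
named = re.compile(r"(?P<ch>\w)(?P=ch)")

print(numbered.search("hello").group())  # ll
print(named.search("hello").group())     # ll
```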

On a related note, constructs have been added that conflate regular expressions with a backtracking implementation.

CRE tries to make this distinction by creating a different syntax for constructs defined using the traditional theory of automata, and constructs defined using the backtracking algorithm.

Python substitution syntax quirks

Again, the backslash is too heavily overloaded.

\g<0> means the whole match, \g<1> means group 1

\0 does NOT mean the whole match, but \1 means group 1.
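
A quick check of the substitution quirk with re.sub. This is a sketch assuming current CPython behavior, where \0 in a replacement template is read as an octal escape for the NUL character:

```python
import re

whole_match = re.sub(r"(b)", r"[\g<0>]", "abc")
group_one_g = re.sub(r"(b)", r"[\g<1>]", "abc")
group_one_short = re.sub(r"(b)", r"[\1]", "abc")
# \0 is not the whole match; it inserts chr(0).
backslash_zero = re.sub(r"(b)", r"[\0]", "abc")

print(whole_match)           # a[b]c
print(group_one_g)           # a[b]c
print(group_one_short)       # a[b]c
print(repr(backslash_zero))  # 'a[\x00]c'
```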

Extraneous syntax.

Python re API quirks

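The match vs. search confusion, as mentioned: re.match only matches at the beginning of the string, while re.search scans for a match anywhere in it.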
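
The difference between the two functions in a few lines of Python, with an arbitrary sample string:

```python
import re

# re.match is anchored at the start of the string.
match_b = re.match(r"b", "abc")
match_a = re.match(r"a", "abc")
# re.search looks for the pattern at any position.
search_b = re.search(r"b", "abc")

print(match_b)           # None
print(match_a.group())   # a
print(search_b.start())  # 1
```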

Also TODO: document exceptions. Have to import another module to catch an exception.

sre.parse_template throws them

Last modified: 2013-01-29 10:42:48 -0800