Introducing CREs and Annex

Table of Contents

CRE is a new syntax for regular expressions, designed for readability and maintainability. It looks a little like a BNF grammar, but for a different class of language.

Annex is a Python library which implements the CRE language. It's a wrapper and drop-in replacement for Python's re module. If you follow the normal idiom of compiling the regex once at program startup, there's no runtime overhead.

CREs can in theory be used with any programming language and regular expression engine -- there's nothing Python-specific about the syntax.

One way to think of CRE is as CoffeeScript for regexes. There is a one-to-one correspondence with traditional regex syntax, so it's easy to debug if necessary. JavaScript is pretty inconsistent, but it actually needs much less rehabilitation than regexes.

Why?

Regular expressions are a powerful idea hobbled by an ancient and obscure syntax.

I believe the first use of regexes was to enter something like [ ]+$ into an editor and then throw it away. The traditional syntax is acceptable if the goal is to minimize the number of characters typed to accomplish a task.

But today, regular expressions have grown many more features, and are part of large programs. We should treat them as readable and maintainable code, not cryptic utterances to be puzzled over once and never touched again.

I've also noticed that both beginning and expert programmers have trouble with regular expressions.

When I teach an intro to Python at Google, explaining the full syntax isn't possible in the time we have, and I get confused looks when saying things like:

The caret (^) stands for the beginning of the string. Except when you're in a character class -- then it means to negate everything in the class, but only if it's the first character after the opening [. If it's not, then it stands for an actual caret character.

If you put yourself in the mindset of a beginner, you'll realize that this is ridiculously obscure.

On the expert side, regular expressions are used extensively throughout the implementation of Google's web search -- in crawling, indexing, and quality. Most engineers working on these systems wouldn't find implementing a regular expression engine too challenging. But that doesn't mean that they don't make errors simply writing regexes.

If you're still not convinced, see What's Wrong with Regular Expression Syntax.

Short Examples

Here is a Perl-style regex to match a US currency string like like $10.99:

\$\d+\.\d{2}

Here's how you would write it in the CRE language:

'$' digit+ '.' digit^2

You can see a few differences here:

Words like digit are used instead of one character abbreviations like \d.
Literals are quoted, e.g. '$' and '.'.
CRE is insensitive to the amount of whitespace between tokens. Whitespace does not stand for itself.
x^2 means "x repeated twice", instead of x{2}.

Here is a Perl-style regex to match an IP address like 192.168.0.1:

\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}

And here's one way of writing it as a CRE:

D     = digit^(1..3)
Start = D '.' D '.' D '.' D

Notice that you can define subexpressions, which makes large regexes managable.

D is defined as a sequence of 1 to 3 digits.
Start is defined as D, followed by a literal period, etc. The Start rule is special; it's where the CRE -> RE compilation process starts. (The API allows you choose another name as the start symbol, which is useful for debugging).

Regexes are often avoided past a certain size because they aren't readable or maintainable. However, this isn't an inherent limitation of regular expressions; it's only a limitation of their syntax.

Since literals are quoted, all other constructs can have improved syntax, as they aren't crowded into a small space of metacharacters. It also makes the syntax more future proof.

(note: I purposely made a typo in the Perl-style version -- did you notice?)

Another Example

Here's a regular expression to extract the URL out of a string like <a href="http://example.com/foo">:

(?i)<\s*a\s+href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'"\s>]+))\s*>

If we write it in /x format, it looks a little better, but still causes squinting:

(?i)
< \s* a \s+ href \s* = \s*
(?:
  "([^"]*)"
  |
  '([^']*)'
  |
  ([^'"\s>]+)
)
\s* >

Here is the CRE version:

flags(ignorecase)

_     = whitespace*    # optional whitespace
__    = whitespace+    # mandatory whitespace

# double quote, capture non-double quote chars with {}, double quote
DQ    = '"' { !chars["]* } '"'
SQ    = "'" { !chars[']* } "'"           # same for single quotes
NQ    = { !chars[ ' " whitespace > ]+ }  # no quotes, capture the whole thing

# The top level pattern -- now it should be self-explanatory.
Start = '<' _ 'a' __ 'href' _ '=' _ (either DQ or SQ or NQ) _ '>'

It uses the following features:

flags(...) -- equivalent of (?i) in Perl/Python
Subexpressions like Name = Expression -- not possible with traditional syntax
Named classes like whitespace -- equivalent of \s
Repetition operators like *, +, and ? -- same syntax
Alternation: either a or b or c
Literal strings, e.g. 'href'
Character classes like chars[a-z 0-9], and negated with !chars
Capturing group {} -- () in traditional syntax
Non-capturing group () -- (?:...) in traditional syntax

A couple more notes:

The case-insensitive flag is set, so we can match <A HREF...> too.
To be more explicit, DQ is defined as:
- A double quote character ",
- Zero or more non-" characters, which are captured,
- A closing ".

Long Example

If you want to see more, here's an even longer example, using a real regex found "in the wild".

Syntax Overview

CRE has a complete syntax reference, but here is an overview.

Common constructs like *, +, and ? are unchanged. \d{3} is changed to digit^3, and \d{3,4} is changed to digit^(3..4). Non-greedy repetition is specified by doubling the operators (** ++ ??), rather than adding a ? (*? +? ??).

CRE avoids cryptic one-letter abbreviations, like \d or \D. Instead we use keywords like digit, which is negated with !digit. The ! operator is used whenever a construct can be logically negated.

Literal strings are surrounded with single or double quotes. So "ab" is a CRE that matches the characters ab. The CRE '"' matches a double quote character.

Character classes use the chars keyword and square brackets, e.g. chars[a-z A-Z]. A space must separate character ranges, but not single characters.

CRE doesn't use \ for escaping, so it's easy to embed in C-style literal strings. Instead, you can use byte literals like 0x00 or unicode character literals like &201c inside ranges. Since patterns compose by concatenation, 'abc' 0x00 "def" is a CRE that matches a literal 7 character string. (Don't confuse 0x00 with '0x00'; the latter is of course a string literal with 4 characters.)

There are also named characters like &newline and &cr for \n and &cr. The literals &space, &hyphen and &bang (- and !) can be used inside character classes. Outside character classes you would just use quoted strings: ' -!'.

Unquoted strings are generally identifiers, which refer to subexpressions to be expanded inline. In the first example above, D and Start are identifiers. Identifiers should generally use the CapWords naming convention, as lower case words like flags and either are keywords.

Alternation, traditionally a|b|c, is denoted either 'a' or 'b' or 'c'. The prefix syntax makes it easier for humans and computers to parse (i.e. the grammar is LL(k)).

Capturing groups are {} instead of (). Non-capturing groups are () instead of the obscure (?:...).

Named capturing groups use the as keyword: {digit+ as year} instead of the obscure (?<year>\d+) or (?P<year>\d+).

Zero-width assertions are enclosed in angle brackets, e.g. <begin> and <end> instead of ^ and $ [1].

Top Level Syntax

A CRE is either an expression, or a list of Name = expression definitions, with Start being the top-level expression.

So digit+ is a valid CRE. It is equivalent to Start = digit+. It is also equivalent to:

D     = digit+
Start = D

although this is of course unnecessary.

Flags are specified at the front of a CRE, like flags(ignorecase) 'www' instead of (?i)www.

Backtracking Language

CRE goes out of its way to split "regular expressions" into two languages:

The constructs that specify formal regular languages.
The informal backtracking language. These constructs rely on implementation details of the regular expression matching algorithm, and in general use exponential time. Many of these constructs were popularized by Perl.

re2 is a C++ regex engine which uses automata theory to guarantee linear time execution. It is necessarily limited to the first class of constructs. The distinction is explained well in these articles.

The constructs in the backtracking language are specified using capital letters, so they are easily identified. In CRE, syntax is consistent with semantics. They are:

Backreferences
- REF(1) in CRE, \1 traditionally
- REF(name) in CRE, (?P=name) traditionally
The 4 kinds of assertions. They are surrounded in angle brackets because they're also zero-width assertions.
- <ASSERT 'a'> in CRE, (?=a) traditionally
- <!ASSERT 'a'> in CRE, (?!a) traditionally
- <ASSERTLEFT 'a'> in CRE, (?<=a) traditionally
- <!ASSERTLEFT 'a'> in CRE, (?<!a) traditionally
Conditional: IF group THEN yes ELSE no in CRE, (?(cond)yes|no) traditionally

The following are not implemented in Python and hence Annex, but CRE specifies a syntax for them:

End of the previous match: <PREV> in CRE, \G traditionally. (Also a zero-width assertion)
Recurse into pattern: RECURSE(a) in CRE, ??{...} traditionally
Atomic grouping: ATOMIC(a) in CRE, (?>...) traditionally

(Note: internally, the Annex library uses a PEG-like abstraction called TPE to parse CREs. Here backtracking is used a simpler and more principled way, e.g. with deterministic choice. The implementation document describes this language; it may be exposed at some point.)

The design page goes into some detail about the principles behind the syntax. The most important one is that similar syntax should imply similar semantics, and vice versa. This is decidedly not true of traditional regex syntax.

Python API Overview

The Annex API is intended to be more convenient than re for common cases, and also more orthogonal.

Creation

CRE pattern objects are instantiated with annex.Regex(...), which is the equivalent of re.compile(...). Following the Python philosophy of having "one way to do it", the constructor doesn't accept flags as arguments. Flags are always written in the pattern string.

Simple Matching and Capturing

Annex mostly does away with match objects, which many people find confusing or verbose. With re, the typical pattern is:

digit_re = re.compile(r'\d+')

def extract_digit(s):
  m = digit_re.match(s)
  if m:
    return m.group(0)  # the matched string
  else:
    return None

In Annex, the match() method returns the matched string, or None if there is no match:

>>> r = annex.Regex("{digit+ as month} '/' {digit+ as year}")
>>> print r.match('Date: 2/2013')
2/2013
>>> print r.match('Age: 20')
None

The capture() method returns a dictionary of captured groups. The dictionary is empty if the string doesn't match.

>>> print r.capture('Date: 2/2013')
{'MATCH': '2/2013', 'month': '2', 'year': '2013'}
>>> print r.capture('Age: 20')
{}

Named groups are preferred. They no longer have the awkward (?P<name>\d+) syntax. If you don't name the groups, then the numeric indices are used as dictionary keys, e.g. {1: 'foo'}.

Because the MATCH key appears in the dictionary, capture() returns strictly more information than match(). The match() method is convenient for the case where the pattern has no capturing groups.

Substitution

Annex uses the new replace() and replacen() functions instead of re's sub() and subn(). They allow using Python syntax you already know for replacement, rather than quirky re-specific syntax.

(With re.sub(), you use \1 or \g<1> to substitute a numbered group, or \g<name> to substitute a named group. Wart: \g<0> stands for the entire match, but \0 is a null byte!)

In Annex, you pass the target string, and then one of four keyword arguments to specify the replacement:

pattern = annex.Regex(...)
result = pattern.replace(s, template=None, format=None, repl=None, func=None, count=0)

template -- replacement as a string.Template string (Python 2.4). This has a shell-like syntax, e.g. $year/$month/$day.
format -- replacement as a str.format() string (Python 2.6). This looks like {year}/{month}/{day}.
func -- a replacement function. Just like re -- you receive the match object, and return a replacement string.
repl -- legacy re syntax for replacement string, using \.

Option 1 is prefered, and will keep your code compatible with versions back to Python 2.4.

If your codebase is already using Python 3-style str.format() instead of fmt % args, then it won't work with Python 2.4 anyway, so option 2 is appropriate.

Option 4 is included for really old Python versions.

The string to search comes first in Annex's replace(). In re.sub(), it comes second. You should always pass one of four keywords arguments to replace() rather than a positional second argument, so there shouldn't be any confusion.

Advanced

Disabling Searching

A frequent confusion with Python's API is the difference between searching and matching. Annex is consistent with Perl and JavaScript semantics, in that the target string is searched for the pattern when using match(), capture(), and family.

Usually you can simply add <begin> (^) to your pattern to anchor it on the left.

There are some rare cases where this doesn't suffice. In those cases, pass search=False to one of the methods:

>>> print r.match('Date: 2/2013', search=False)  # don't skip over prefix
None
>>> print r.match('2/2013', search=False)
2/2013

In other words, Annex by default uses the re.search method to implement match() and capture(). If search=False is passed, it uses re.match instead.

Positional Information

If you need the positional information for the matches, use matchspan() or capturespans() instead of match() and capture(). (Note the s on the end of capturespans).

These return Span instances, which are objects with 3 attributes:

value -- the substring that was captured
start -- the start index of the substring
end -- the end index of the substring

For example,

>>> print r.matchspan('Date: 2/2013')
Span(value='2/2013', ...) TODO
>>> print r.capturespans('Date: 2/2013')
TODO

More Methods

iterate(s, span=False, capture=False) takes the place of finditer() and findall(), and is more general. With no arguments, it behaves like finditer(), and yields a list of strings.

>>> for date in r.iterate('Dates: 2/2013, 10/2013'):
...   print date
2/2013
10/2013

With keyword arguments, it will return more information. capture=True returns dictionaries like capture(), and span=True returns spans like matchspan() and capturespans().

>>> for c in r.iterate('Dates: 2/2013, 10/2013', capture=True, span=True):
>>>   print c
{'MATCH': Span(value='2/2013', ...), 'month': Span(value='2', ...)}
{'MATCH': Span(value='10/2013', ...), 'month': Span(value='10', ...) }
TODO

split() remains the same.

execute() is a low-level method that is an alias for the old re.match. It returns a match object. It is meant for repeatedly matching the same string, e.g. like JavaScript's exec or re2's FindAndConsume. (Note that in many of these cases, the iterate() method will suffice.)

Testing API

TODO: Annex includes a unit testing helper. The best way to write regular expressions is in a test-driven style!

Conclusion

This was a short overview of Annex and CREs. If you find it useful, please send some feedback to the group.

You may want to see the CRE Examples page, or go back to the documentation index for more information.

[1] In what's become a bit of Google lore, Unix co-creator Ken Thompson confessed to inventing the relatively obscure ^ syntax:

it was i. sorry.
$ comes from 1,$p in earlier editors.
. for the same reason.
^ is about the only unused character on the 33 teletype.

Last modified: 2013-02-01 21:37:43 -0800