CRE is a new syntax for regular expressions, designed for readability and maintainability. It looks a little like a BNF grammar, but for a different class of language.
Annex is a Python library which implements the CRE language. It's a wrapper
and drop-in replacement for Python's re
module. If you follow the normal
idiom of compiling the regex once at program startup, there's no runtime
overhead.
CREs can in theory be used with any programming language and regular expression engine -- there's nothing Python-specific about the syntax.
One way to think of CRE is as CoffeeScript for regexes. There is a one-to-one correspondence with traditional regex syntax, so it's easy to debug if necessary. JavaScript is pretty inconsistent, but it actually needs much less rehabilitation than regexes.
Regular expressions are a powerful idea hobbled by an ancient and obscure syntax.
I believe the first use of regexes was to enter something like [ ]+$
into an
editor and then throw it away. The traditional syntax is acceptable if the
goal is to minimize the number of characters typed to accomplish a task.
But today, regular expressions have grown many more features, and are part of large programs. We should treat them as readable and maintainable code, not cryptic utterances to be puzzled over once and never touched again.
I've also noticed that both beginning and expert programmers have trouble with regular expressions.
When I teach an intro to Python at Google, explaining the full syntax isn't possible in the time we have, and I get confused looks when saying things like:
The caret (
^
) stands for the beginning of the string. Except when you're in a character class -- then it means to negate everything in the class, but only if it's the first character after the opening[
. If it's not, then it stands for an actual caret character.
If you put yourself in the mindset of a beginner, you'll realize that this is ridiculously obscure.
On the expert side, regular expressions are used extensively throughout the implementation of Google's web search -- in crawling, indexing, and quality. Most engineers working on these systems wouldn't find implementing a regular expression engine too challenging. But that doesn't mean that they don't make errors simply writing regexes.
If you're still not convinced, see What's Wrong with Regular Expression Syntax.
Here is a Perl-style regex to match a US currency string like like $10.99
:
\$\d+\.\d{2}
Here's how you would write it in the CRE language:
'$' digit+ '.' digit^2
You can see a few differences here:
digit
are used instead of one character abbreviations like \d
.'$'
and '.'
.x^2
means "x repeated twice", instead of x{2}
.Here is a Perl-style regex to match an IP address like 192.168.0.1
:
\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}
And here's one way of writing it as a CRE:
D = digit^(1..3) Start = D '.' D '.' D '.' D
Notice that you can define subexpressions, which makes large regexes managable.
D
is defined as a sequence of 1 to 3 digits.Start
is defined as D
, followed by a literal period, etc. The Start
rule is special; it's where the CRE -> RE compilation process starts. (The
API allows you choose another name as the start symbol, which is useful for
debugging).Regexes are often avoided past a certain size because they aren't readable or maintainable. However, this isn't an inherent limitation of regular expressions; it's only a limitation of their syntax.
Since literals are quoted, all other constructs can have improved syntax, as they aren't crowded into a small space of metacharacters. It also makes the syntax more future proof.
(note: I purposely made a typo in the Perl-style version -- did you notice?)
Here's a regular expression to extract the URL out of a string like
<a href="http://example.com/foo">
:
(?i)<\s*a\s+href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'"\s>]+))\s*>
If we write it in /x
format, it looks a little better, but still causes
squinting:
(?i) < \s* a \s+ href \s* = \s* (?: "([^"]*)" | '([^']*)' | ([^'"\s>]+) ) \s* >
Here is the CRE version:
flags(ignorecase) _ = whitespace* # optional whitespace __ = whitespace+ # mandatory whitespace # double quote, capture non-double quote chars with {}, double quote DQ = '"' { !chars["]* } '"' SQ = "'" { !chars[']* } "'" # same for single quotes NQ = { !chars[ ' " whitespace > ]+ } # no quotes, capture the whole thing # The top level pattern -- now it should be self-explanatory. Start = '<' _ 'a' __ 'href' _ '=' _ (either DQ or SQ or NQ) _ '>'
It uses the following features:
flags(...)
-- equivalent of (?i)
in Perl/PythonName = Expression
-- not possible with traditional syntaxwhitespace
-- equivalent of \s
*
, +
, and ?
-- same syntaxeither a or b or c
'href'
chars[a-z 0-9]
, and negated with !chars
{}
-- ()
in traditional syntax()
-- (?:...)
in traditional syntaxA couple more notes:
<A HREF...>
too.DQ
is defined as:
"
,"
characters, which are captured,"
.If you want to see more, here's an even longer example, using a real regex found "in the wild".
CRE has a complete syntax reference, but here is an overview.
Common constructs like *
, +
, and ?
are unchanged. \d{3}
is changed to
digit^3
, and \d{3,4}
is changed to digit^(3..4)
. Non-greedy repetition
is specified by doubling the operators (**
++
??
), rather than adding a
?
(*?
+?
??
).
CRE avoids cryptic one-letter abbreviations, like \d
or \D
. Instead we use
keywords like digit
, which is negated with !digit
. The !
operator is used
whenever a construct can be logically negated.
Literal strings are surrounded with single or double quotes. So "ab"
is a
CRE that matches the characters ab
. The CRE '"'
matches a double quote
character.
Character classes use the chars
keyword and square brackets, e.g. chars[a-z
A-Z]
. A space must separate character ranges, but not single characters.
CRE doesn't use \
for escaping, so it's easy to embed in C-style literal
strings. Instead, you can use byte literals like 0x00
or unicode character
literals like &201c
inside ranges. Since patterns compose by concatenation,
'abc' 0x00 "def"
is a CRE that matches a literal 7 character string. (Don't
confuse 0x00
with '0x00'
; the latter is of course a string literal with 4
characters.)
There are also named characters like &newline
and &cr
for \n
and &cr
.
The literals &space
, &hyphen
and &bang
(-
and !
) can be used inside
character classes. Outside character classes you would just use quoted
strings: ' -!'
.
Unquoted strings are generally identifiers, which refer to subexpressions
to be expanded inline. In the first example above, D
and Start
are
identifiers. Identifiers should generally use the CapWords
naming
convention, as lower case words like flags
and either
are keywords.
Alternation, traditionally a|b|c
, is denoted either 'a' or 'b' or 'c'
. The
prefix syntax makes it easier for humans and computers to parse (i.e. the
grammar is LL(k)).
Capturing groups are {}
instead of ()
. Non-capturing groups are ()
instead of the obscure (?:...)
.
Named capturing groups use the as
keyword: {digit+ as year}
instead of the
obscure (?<year>\d+)
or (?P<year>\d+)
.
Zero-width assertions are enclosed in angle brackets, e.g. <begin>
and
<end>
instead of ^
and $
[1].
A CRE is either an expression, or a list of Name = expression
definitions,
with Start
being the top-level expression.
So digit+
is a valid CRE. It is equivalent to Start = digit+
. It is also
equivalent to:
D = digit+ Start = D
although this is of course unnecessary.
Flags are specified at the front of a CRE, like flags(ignorecase) 'www'
instead of (?i)www
.
CRE goes out of its way to split "regular expressions" into two languages:
The constructs that specify formal regular languages.
The informal backtracking language. These constructs rely on implementation details of the regular expression matching algorithm, and in general use exponential time. Many of these constructs were popularized by Perl.
re2 is a C++ regex engine which uses automata theory to guarantee linear time execution. It is necessarily limited to the first class of constructs. The distinction is explained well in these articles.
The constructs in the backtracking language are specified using capital letters, so they are easily identified. In CRE, syntax is consistent with semantics. They are:
REF(1)
in CRE, \1
traditionallyREF(name)
in CRE, (?P=name)
traditionally<ASSERT 'a'>
in CRE, (?=a)
traditionally<!ASSERT 'a'>
in CRE, (?!a)
traditionally<ASSERTLEFT 'a'>
in CRE, (?<=a)
traditionally<!ASSERTLEFT 'a'>
in CRE, (?<!a)
traditionallyIF group THEN yes ELSE no
in CRE, (?(cond)yes|no)
traditionallyThe following are not implemented in Python and hence Annex, but CRE specifies a syntax for them:
<PREV>
in CRE, \G
traditionally. (Also a
zero-width assertion)RECURSE(a)
in CRE, ??{...}
traditionallyATOMIC(a)
in CRE, (?>...)
traditionally(Note: internally, the Annex library uses a PEG-like abstraction called TPE to parse CREs. Here backtracking is used a simpler and more principled way, e.g. with deterministic choice. The implementation document describes this language; it may be exposed at some point.)
The design page goes into some detail about the principles behind the syntax. The most important one is that similar syntax should imply similar semantics, and vice versa. This is decidedly not true of traditional regex syntax.
The Annex API is intended to be more convenient than re
for common cases, and
also more orthogonal.
CRE pattern objects are instantiated with annex.Regex(...)
, which is the
equivalent of re.compile(...)
. Following the Python philosophy of having
"one way to do it", the constructor doesn't accept flags as arguments. Flags
are always written in the pattern string.
Annex mostly does away with match objects, which many people find confusing
or verbose. With re
, the typical pattern is:
digit_re = re.compile(r'\d+')
def extract_digit(s):
m = digit_re.match(s)
if m:
return m.group(0) # the matched string
else:
return None
In Annex, the match()
method returns the matched string, or None
if there
is no match:
>>> r = annex.Regex("{digit+ as month} '/' {digit+ as year}")
>>> print r.match('Date: 2/2013')
2/2013
>>> print r.match('Age: 20')
None
The capture()
method returns a dictionary of captured groups. The dictionary
is empty if the string doesn't match.
>>> print r.capture('Date: 2/2013')
{'MATCH': '2/2013', 'month': '2', 'year': '2013'}
>>> print r.capture('Age: 20')
{}
Named groups are preferred. They no longer have the awkward (?P<name>\d+)
syntax. If you don't name the groups, then the numeric indices are used as
dictionary keys, e.g. {1: 'foo'}
.
Because the MATCH
key appears in the dictionary, capture()
returns
strictly more information than match()
. The match()
method is convenient
for the case where the pattern has no capturing groups.
Annex uses the new replace()
and replacen()
functions instead of re
's
sub()
and subn()
. They allow using Python syntax you already know for
replacement, rather than quirky re
-specific syntax.
(With re.sub()
, you use \1
or \g<1>
to substitute a numbered group, or
\g<name>
to substitute a named group. Wart: \g<0>
stands for the entire
match, but \0
is a null byte!)
In Annex, you pass the target string, and then one of four keyword arguments to specify the replacement:
pattern = annex.Regex(...)
result = pattern.replace(s, template=None, format=None, repl=None, func=None, count=0)
$year/$month/$day
.{year}/{month}/{day}
.re
-- you receive the match
object, and return a replacement string.re
syntax for replacement string, using \
.Option 1 is prefered, and will keep your code compatible with versions back to Python 2.4.
If your codebase is already using Python 3-style str.format()
instead of fmt
% args
, then it won't work with Python 2.4 anyway, so option 2 is appropriate.
Option 4 is included for really old Python versions.
The string to search comes first in Annex's replace()
. In re.sub()
, it
comes second. You should always pass one of four keywords arguments to
replace()
rather than a positional second argument, so there shouldn't be any
confusion.
A frequent confusion with Python's API is the difference between searching and
matching. Annex is
consistent with Perl and JavaScript semantics, in that the target string is
searched for the pattern when using match()
, capture()
, and family.
Usually you can simply add <begin>
(^
) to your pattern to anchor it on the
left.
There are some rare cases where this doesn't suffice. In those cases, pass
search=False
to one of the methods:
>>> print r.match('Date: 2/2013', search=False) # don't skip over prefix
None
>>> print r.match('2/2013', search=False)
2/2013
In other words, Annex by default uses the re.search method to implement
match()
and capture()
. If search=False
is passed, it uses re.match
instead.
If you need the positional information for the matches, use matchspan()
or
capturespans()
instead of match()
and capture()
. (Note the s
on the
end of capturespans
).
These return Span
instances, which are objects with 3 attributes:
For example,
>>> print r.matchspan('Date: 2/2013')
Span(value='2/2013', ...) TODO
>>> print r.capturespans('Date: 2/2013')
TODO
iterate(s, span=False, capture=False)
takes the place of finditer()
and
findall()
, and is more general. With no arguments, it behaves like
finditer()
, and yields a list of strings.
>>> for date in r.iterate('Dates: 2/2013, 10/2013'):
... print date
2/2013
10/2013
With keyword arguments, it will return more information. capture=True
returns dictionaries like capture()
, and span=True
returns spans like
matchspan()
and capturespans()
.
>>> for c in r.iterate('Dates: 2/2013, 10/2013', capture=True, span=True):
>>> print c
{'MATCH': Span(value='2/2013', ...), 'month': Span(value='2', ...)}
{'MATCH': Span(value='10/2013', ...), 'month': Span(value='10', ...) }
TODO
split() remains the same.
execute()
is a low-level method that is an alias for the old
re.match. It returns a match object. It is meant for
repeatedly matching the same string, e.g. like JavaScript's exec
or re2's
FindAndConsume
. (Note that in many of these cases, the iterate()
method
will suffice.)
TODO: Annex includes a unit testing helper. The best way to write regular expressions is in a test-driven style!
This was a short overview of Annex and CREs. If you find it useful, please send some feedback to the group.
You may want to see the CRE Examples page, or go back to the documentation index for more information.
[1] In what's become a bit of Google lore, Unix co-creator Ken Thompson
confessed to inventing the relatively obscure ^
syntax:
it was i. sorry.
$
comes from1,$p
in earlier editors.
.
for the same reason.
^
is about the only unused character on the 33 teletype.
Last modified: 2013-02-01 21:37:43 -0800