CRE is a new syntax for regular expressions. Annex is a Python library that implements this syntax.
Start with the Introduction to CREs and Annex if you'd like to learn more.
More documentation/resources:
Shorter pieces of documentation appear inline, below.
Download the tarball, or use pip / easy_install (which I presume grab it from PyPI).
Test that it works:
$ python
...
>>> import annex
>>> r = annex.Regex('digit+')
>>> print r.match('weight: 123 lbs')
'123'
You can download a stable release from PyPI, or get the latest code from Google Code.
There's also a Github mirror.
Always use unit tests for regular expressions.
Here is a short template.
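A minimal sketch of such a template, shown with Python's standard re module (an annex.Regex pattern would slot in the same way; the pattern and names below are illustrative):

```python
import re

# One named pattern per concept, with positive and negative cases
# asserted right next to it.
DIGITS = re.compile(r'\d+')  # CRE equivalent: digit+

def test_digits():
    assert DIGITS.search('weight: 123 lbs').group(0) == '123'
    assert DIGITS.search('no numbers here') is None

test_digits()
```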
Split your regexes up into named expressions. Don't repeat stuff.
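For example, named pieces can be combined with ordinary string interpolation (the names and pattern here are made up for illustration):

```python
import re

# Build one pattern out of named pieces instead of repeating them.
YEAR = r'\d{4}'
TWO  = r'\d{2}'
DATE = re.compile('%s-%s-%s' % (YEAR, TWO, TWO))

assert DATE.match('2016-07-09')
assert not DATE.match('07/09/2016')
```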
If you get a syntax error on a regex, then try commenting out parts.
You can use r.as_python_re() to debug the translation. See pyre.py as well.
Annex behaves just like Python's re module with regard to unicode. The pattern and the string being matched can both be either bytes or unicode.
Unlike Python REs, CREs are always representable in pure ASCII, since you can use a code point like &201c to represent a unicode character.
Here's a CRE that matches smart quotes:
annex.Regex(" &201c {any+} &201d ")
The Python alternative would be:
annex.Regex(ur" '\u201c' {any+} '\u201d' ")
The outer double quotes are for Python; the inner single quotes are CRE syntax. The backslash escapes are Python syntax.
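For comparison, here is roughly the same match using Python's built-in re module alone, with \u escapes for the code points (a sketch, not Annex's output):

```python
import re

# Match text between left/right smart quotes (U+201C, U+201D).
# The group corresponds roughly to the {any+} capture in the CRE.
SMART = re.compile(u'\u201c(.+?)\u201d')

m = SMART.search(u'He said \u201chello\u201d to me.')
assert m.group(1) == u'hello'
```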
Here are some ideas for projects that continue this line of thinking.
grep. Or BREs for grep and sed, since sed doesn't appear to have the equivalent of -E.

Is anyone interested in extending the syntax for Perl's backtracking constructs? There are obscure/exotic ones like "branch reset", written as (?|...). They would be in CAPS to indicate that they require backtracking, like ASSERT and REF.
Personally I prefer to stick with the regular constructs, but some people may be fluent with them. The book Mastering Regular Expressions goes into great detail about optimizing backtracking with some of these constructs.
Implementations in other languages. Any language with a regex engine may benefit from this. There are some notes on the [implementation](implementation.html) that may help.
Writing some debugging/comprehension tools like [this] or [that]. The lexer and grammar are implemented pretty cleanly, so this should be possible. You could use Poly to write a CGI-like script.
Does Unicode need more work?
Better tracing -- for debugging parsing.
Python: re2 has a multi-engine architecture. It uses different backends for different patterns. Most (all?) of the backends guarantee linear-time execution, and it doesn't support assertions or backreferences because of this.
There are definitely cases where I want to accept arbitrary regexps from users in Python, and be guaranteed that they don't blow up. (Although this has mostly been for matching against short strings like URLs, not long strings like source code, as was done in Google Code Search.)

To implement this in a backward-compatible way, Python could add a linear-time engine as well. It could use this for any pattern that doesn't have assertions or backreferences. And it could have an option at regex compile time to limit an expression to a particular backend.
Or maybe I am mistaken and Python is already fast enough on all patterns in that set. It should be easy to do some benchmarks like re2's.
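To see why the linear-time guarantee matters, here is a classic pathological case for a backtracking engine (a toy benchmark of my own, not from the Annex docs):

```python
import re
import time

# Nested quantifiers force exponential retries when the match fails.
PATHOLOGICAL = re.compile(r'(a+)+$')

for n in (10, 14, 18):
    s = 'a' * n + 'b'   # the trailing 'b' guarantees failure
    start = time.time()
    assert PATHOLOGICAL.match(s) is None
    print('n=%2d: %.4fs' % (n, time.time() - start))

# The time roughly doubles with each extra 'a'; a linear-time engine
# like re2 stays flat on the same input.
```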
Should POSIX char classes like :alnum be supported in Python? People can trivially include them in their patterns. There could be a module system for definitions too.
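A hypothetical sketch of that kind of definition module: classes written once as plain strings and spliced into patterns (the names and ASCII-only coverage here are illustrative, not Annex's actual definitions):

```python
import re

# POSIX-style classes as reusable constants.
ALNUM = '[0-9A-Za-z]'     # :alnum (ASCII-only here)
SPACE = r'[ \t\r\n\v\f]'  # :space

WORD = re.compile('%s+' % ALNUM)
assert WORD.match('abc123').group(0) == 'abc123'
assert WORD.match('  x') is None
```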
Reverse parsers. There could be parsers for the various flavors of regexes, with converters to CRE. The common sexpr format will help with this. This conversion should always succeed, as CRE aims to define a superset of all regex implementations. (This sounds ambitious but is not very hard, since the syntax is designed to be extensible, i.e. it doesn't use up all the punctuation characters.)
I plan to use the annex +Parser+ (TPE) and +Lexer+ abstractions for some other parsing tasks. They are described somewhat in the implementation doc, but will become an official part of the API at some point. If you play around with them and have any feedback, I'd be interested.
Last modified: 2016-07-09 14:19:28 -0700