CRE is a new syntax for regular expressions. Annex is a Python library that implements this syntax.
Start with the Introduction to CREs and Annex if you'd like to learn more.
More documentation/resources:
Shorter pieces of documentation appear inline, below.
Download the tarball, or use pip / easy_install (which I presume grab it from PyPI).
Test that it works:
$ python
...
>>> import annex
>>> r = annex.Regex('digit+')
>>> print r.match('weight: 123 lbs')
'123'
You can download a stable release from PyPI, or get the latest code from Google Code.
There's also a Github mirror.
Always use unit tests for regular expressions.
Here is a short template.
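A minimal sketch of such a template, shown with Python's standard re module (an annex.Regex pattern would slot in the same way; the pattern and names below are illustrative):

```python
import re

# One named pattern per concept, with positive and negative cases
# asserted right next to it.
DIGITS = re.compile(r'\d+')  # CRE equivalent: digit+

def test_digits():
    assert DIGITS.search('weight: 123 lbs').group(0) == '123'
    assert DIGITS.search('no numbers here') is None

test_digits()
```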
Split your regexes up into named expressions. Don't repeat stuff.
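For example, named pieces can be combined with ordinary string interpolation (the names and pattern here are made up for illustration):

```python
import re

# Build one pattern out of named pieces instead of repeating them.
YEAR = r'\d{4}'
TWO  = r'\d{2}'
DATE = re.compile('%s-%s-%s' % (YEAR, TWO, TWO))

assert DATE.match('2016-07-09')
assert not DATE.match('07/09/2016')
```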
If you get a syntax error on a regex, then try commenting out parts.
You can use r.as_python_re() to debug the translation. See pyre.py as well.
Annex behaves just like Python's re module with regard to unicode. The pattern and the string being matched can both be either bytes or unicode.
Unlike Python REs, CREs are always representable in pure ASCII, since you can use a code point like &201c to represent a unicode character.
Here's a CRE that matches smart quotes:
annex.Regex(" &201c {any+} &201d ")
The Python alternative would be:
annex.Regex(ur" '\u201c' {any+} '\u201d' ")
The outer double quotes are for Python; the inner single quotes are CRE syntax. The backslash escapes are Python syntax.
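For comparison, here is roughly the same match using Python's built-in re module alone, with \u escapes for the code points (a sketch, not Annex's output):

```python
import re

# Match text between left/right smart quotes (U+201C, U+201D).
# The group corresponds roughly to the {any+} capture in the CRE.
SMART = re.compile(u'\u201c(.+?)\u201d')

m = SMART.search(u'He said \u201chello\u201d to me.')
assert m.group(1) == u'hello'
```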
Here are some ideas for projects that continue this line of thinking.
grep. Or BREs for grep and sed, since sed doesn't appear to have the equivalent of -E.

Is anyone interested in extending the syntax for Perl's backtracking constructs? There are obscure/exotic ones like "branch reset", written as (?|...). They would be in CAPS to indicate that they require backtracking, like ASSERT and REF.
Personally I prefer to stick with the regular constructs, but some people may be fluent with them. The book Mastering Regular Expressions goes into great detail about optimizing backtracking with some of these constructs.
Implementations in other languages. Any language with a regex engine may benefit from this. There are some notes on the [implementation](implementation.html) that may help.
Writing some debugging/comprehension tools like [this] or [that]. The lexer and grammar are implemented pretty cleanly, so this should be possible. You could use Poly to write a CGI-like script.
Does Unicode need more work?
Better tracing -- for debugging parsing.
Python: re2 has a multi-engine architecture. It uses different backends for different patterns. Most (all?) of the backends guarantee linear-time execution, and it doesn't support assertions or backreferences because of this.
There are definitely cases where I want to accept arbitrary regexps from users in Python, and be guaranteed that they don't blow up. (Although this has mostly been for matching against short strings like URLs, not long strings like source code, as was done in Google Code Search.)

To implement this in a backward-compatible way, Python could add a linear-time engine as well. It could use this for any pattern that doesn't have assertions or backreferences. And it could have an option at regex compile time to limit an expression to a particular backend.
Or maybe I am mistaken and Python is already fast enough on all patterns in that set. It should be easy to do some benchmarks like re2's.
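To see why the linear-time guarantee matters, here is a classic pathological case for a backtracking engine (a toy benchmark of my own, not from the Annex docs):

```python
import re
import time

# Nested quantifiers force exponential retries when the match fails.
PATHOLOGICAL = re.compile(r'(a+)+$')

for n in (10, 14, 18):
    s = 'a' * n + 'b'   # the trailing 'b' guarantees failure
    start = time.time()
    assert PATHOLOGICAL.match(s) is None
    print('n=%2d: %.4fs' % (n, time.time() - start))

# The time roughly doubles with each extra 'a'; a linear-time engine
# like re2 stays flat on the same input.
```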
Should POSIX char classes like :alnum be supported in Python? People can trivially include them in their patterns. There could be a module system for definitions too.
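A hypothetical sketch of that kind of definition module: classes written once as plain strings and spliced into patterns (the names and ASCII-only coverage here are illustrative, not Annex's actual definitions):

```python
import re

# POSIX-style classes as reusable constants.
ALNUM = '[0-9A-Za-z]'     # :alnum (ASCII-only here)
SPACE = r'[ \t\r\n\v\f]'  # :space

WORD = re.compile('%s+' % ALNUM)
assert WORD.match('abc123').group(0) == 'abc123'
assert WORD.match('  x') is None
```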
Reverse parsers. There could be parsers for the various flavors of regexes, with converters to CRE. The common sexpr format will help with this. This conversion should always succeed, as CRE aims to define a superset of all regex implementations. (This sounds ambitious but is not very hard, since the syntax is designed to be extensible, i.e. it doesn't use up all the punctuation characters.)
I plan to use the annex +Parser+ (TPE) and +Lexer+ abstractions for some other parsing tasks. They are described somewhat in the implementation doc, but will become an official part of the API at some point. If you play around with them and have any feedback, I'd be interested.
Last modified: 2016-07-09 14:19:28 -0700