CRE Design

Here is some design rationale. It's not necessary for understanding how to use CRE.

Requirements for CRE Syntax

Different semantics implies different syntax, and vice-versa.
Similar semantics implies similar syntax, and vice-versa.
Should be easy for people already familiar with regular expressions to learn. Common constructs like * and ? aren't changed.
Should be easy for complete beginners to learn: less cryptic punctuation and fewer one-letter abbreviations, so that patterns don't look like line noise.
Should generally be more readable: allow splitting definitions across multiple lines, allow comments, etc.
Should be easy for both computers and humans to parse (LL(k) grammar).
Get rid of cryptic one-letter abbreviations.
Within that constraint, common constructs should be shorter.
Avoid excessive punctuation, but use it where appropriate.
Should be extensible to cover all regex syntax anywhere. As long as literals are quoted, this isn't very hard.
Avoid ambiguity with host languages like C++ and Java that don't have a syntax for strings without \ escaping. In these languages, a regex matching a literal backslash is written as "\\\\", which is confusing.
All expressions should be expressible in ASCII. Python's RE engine doesn't appear to have this property: you can't read a regex from an ASCII file and use it to match u'\u03bc' (although you can of course write u'\u03bc' in a pattern in Python source code).
Should be only one way to do it.
- Some things are similar but have slightly different behavior, e.g. :alnum and wordchar?
- ++ and ^(N +) are redundant, but the full repetition syntax is rarely needed.
Constructs not implementable in linear time (i.e. not in re2) should be visually distinctive. This is the backtracking sublanguage.
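
The double-escaping problem named in the host-language requirement above can be reproduced in Python by deliberately avoiding raw strings. This is only an illustrative sketch using the standard re module; it does not show CRE syntax, and the variable names are made up for the example.

```python
import re

# Without raw strings, the host language and the regex engine each
# consume one level of backslash escaping.
pattern = "\\\\"    # four characters in source, two characters in the string
subject = "a\\b"    # the three-character string: a, backslash, b

# The regex engine reads the two-character pattern as "one literal backslash".
match = re.search(pattern, subject)
matched_text = match.group(0)

# A Python raw string removes one of the two escaping levels.
raw_pattern = r"\\"
```

Here pattern and raw_pattern are the same two-character string; only the source spelling differs, which is the kind of confusion a syntax with quoted literals is meant to avoid.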

Requirements for Python API

Expose all the functionality of the re module.
- All Python regex syntax.
- All Python methods.
- Clean up things that are error-prone, but provide a drop-in replacement too.
Make the common case shorter. For example, m = r.match(); if m: return m.group(0) is too much code just to get the matched text.
Clean up things that are non-orthogonal.
Most people don't like match objects, especially people coming from other languages like Perl.
Substitution syntax should reuse existing Python syntax.
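
To make the "common case" complaint above concrete, here is a minimal sketch of the kind of shortcut it suggests, written against the standard re module. The helper name match_text is hypothetical and is not part of CRE or of re; it only illustrates folding the match-then-check-then-group idiom into one call.

```python
import re

def match_text(pattern, text):
    """Hypothetical helper: return the text matched at the start of
    `text`, or None when the pattern does not match there."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# The caller no longer has to handle a match object directly.
found = match_text(r"\d+", "42 apples")
missing = match_text(r"\d+", "apples")
```

Returning None on failure keeps the result usable in a plain if test, at the cost of hiding the match object's other fields (groups, spans), so a fuller API would presumably still offer access to them.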

Last modified: 2013-01-29 10:42:48 -0800