Here is some design rationale. It's not necessary for understanding how to use CRE.
Different semantics implies different syntax, and vice-versa.
Similar semantics implies similar syntax, and vice-versa.
Should be easy for people already familiar with regular expressions to learn. Common constructs like * and ? aren't changed.
Should be easy for complete beginners to learn. Less cryptic punctuation and fewer one-letter abbreviations. It doesn't look like line noise.
Should generally be more readable: allow splitting definitions across multiple lines, allow comments, etc.
Should be easy for both computers and humans to parse (LL(k) grammar).
Get rid of cryptic one-letter abbreviations.
Within that constraint, common constructs should be shorter.
Avoid excessive punctuation, but use it where appropriate.
Should be extensible to cover all regex syntax anywhere. As long as literals are quoted, this isn't very hard.
Avoid ambiguity with host languages like C++ and Java that don't have a syntax for strings without \ escaping. In these languages, a regex matching a literal backslash is written as "\\\\", which is confusing.
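To make the double escaping concrete, here is a minimal sketch. Python is used only as a stand-in host language: with ordinary (non-raw) string literals it behaves like C++ or Java on this point. The variable names are illustrative.

```python
import re

# Source code: four backslash characters inside an ordinary string literal...
pattern = "\\\\"

# ...which the host language collapses to a two-character regex: \\
assert len(pattern) == 2

# The regex engine then reads \\ as "one literal backslash".
subject = "a\\b"  # three characters: a, backslash, b
match = re.search(pattern, subject)
matched_text = match.group(0)
assert len(matched_text) == 1
```

Two layers of escaping (host language, then regex engine) are what turn one backslash into four. Quoting literals inside the pattern language removes the second layer.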
All expressions should be expressible in ASCII. Python's RE engine doesn't appear to have this property: you can't read a regex from an ASCII file and use it to match u'\u03bc' (although you can of course write u'\u03bc' in a pattern in Python source code).
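The u'...' literals suggest this observation was made against Python 2. As a hedged check against a newer interpreter: the re module documentation states that \u and \U escapes in str patterns were added in Python 3.3. The sketch below assumes Python 3.3 or later and may not reflect the engine this note was written against. The variable names are illustrative.

```python
import re

# Pattern text as it would arrive from an ASCII file: the six characters \ u 0 3 b c.
pattern_text = r"\u03bc"
assert all(ord(ch) < 128 for ch in pattern_text)

# The subject contains the actual GREEK SMALL LETTER MU code point.
subject = "5 \u03bcm"

# Assumption: Python 3.3+, where the re module itself interprets \uXXXX.
found = re.search(pattern_text, subject)
mu_matched = found is not None and found.group(0) == "\u03bc"
```

Here the escape is resolved by the regex engine, not by the host language's string literal syntax, which is the property the goal above asks for.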
Should be only one way to do it.
:alnum and wordchar
?++ and ^(N +) are redundant, but the full repetition syntax is rarely needed.
Constructs not implementable in linear time (i.e. not in re2) should be visually distinctive. This is the backtracking sublanguage.
m = r.match(); if m: return m.group(0) is too much.
Last modified: 2013-01-29 10:42:48 -0800