When regular expressions get longer, the advantage of CREs becomes even more apparent.
This blog post describes how to extract URLs from free-form text. Consider these 3 lines:
```
(Something like http://foo.com/blah_blah_(wikipedia))
A url with parentheses: http://foo.com/blah_(wikipedia)#cite-1
Period ends this sentence: http://foo.com/blah_blah.
```
In the first case, the inner parentheses are part of the URL, but the outer ones aren't. In the third case, the period is not part of the URL.
Something like this is used in Markdown to auto-linkify URLs. (I'm writing this article in Markdown.)
Even with Perl's `/x` option, which allows insignificant whitespace and extensive comments, the following regex is pretty hard to read.
```
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    https?://                           # http or https protocol
    |                                   # or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." ... "www999."
    |                                   # or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   # or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   # or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
)
```
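To check this against the 3 example lines, here's a sketch of running the pattern from Python. The inline `(?xi)` is replaced by the `re.VERBOSE` and `re.IGNORECASE` flag arguments, which Python prefers; everything else is the regex as given.

```python
import re

# The URL regex above, compiled with flag arguments instead of inline (?xi).
URL_RE = re.compile(r'''
\b
(                                       # Capture 1: entire matched URL
  (?:
    https?://                           # http or https protocol
    |
    www\d{0,3}[.]                       # "www.", "www1.", ... "www999."
    |
    [a-z0-9.\-]+[.][a-z]{2,4}/          # domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # run of non-space, non-()<>
    |
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these chars
  )
)
''', re.VERBOSE | re.IGNORECASE)

for line in [
    '(Something like http://foo.com/blah_blah_(wikipedia))',
    'A url with parentheses: http://foo.com/blah_(wikipedia)#cite-1',
    'Period ends this sentence: http://foo.com/blah_blah.',
]:
    print(URL_RE.search(line).group(0))

# Prints:
#   http://foo.com/blah_blah_(wikipedia)
#   http://foo.com/blah_(wikipedia)#cite-1
#   http://foo.com/blah_blah
```

Note that the outer parenthesis and the trailing period are correctly excluded, while the inner `(wikipedia)` is kept.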
There's also a lot of duplication: the pattern `[^\s()<>]` appears in five places.
How many capturing groups are there in this regex? It's hard to tell, but trying it in an interpreter reveals that there are five. As far as I can tell, the intention is to have one group -- the top-level one -- so this may be a mistake. Non-capturing groups `(?:...)` are used in some places, but there are also literal parentheses `\( \)` and capturing groups `()`, which are very hard to distinguish.
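You can verify the count in Python. The "balanced parens" subpattern alone contributes two capturing groups, and it appears twice in the full regex, plus the intended top-level group:

```python
import re

# The "balanced parens, up to 2 levels" subpattern from the regex above.
balanced = re.compile(r'\(([^\s()<>]+|(\([^\s()<>]+\)))*\)')
print(balanced.groups)  # 2 -- this subpattern alone adds two capturing groups

# It appears twice in the full regex, plus the outer group: 2 + 2 + 1 = 5.
```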
Here's how you write this as a CRE:
```
flags(ignorecase)

Protocol = 'http' 's'? '://'   # http or https protocol

WWW = 'www' digit^(..3) '.'    # "www.", "www1.", "www2." ... "www999."

# looks like a domain name followed by a slash, e.g. "foo.cn/"
# TODO: change to use .. for char ranges?
DomainLike = chars[ a-z 0-9 . &hyphen ]+ '.' chars[a-z]^(2..4) '/'

# Balanced text, doesn't have () or <>
Balanced = !chars[ whitespace () <> ]+

# Up to 1 pair of balanced parens (foo)
Balanced1 = { '(' Balanced ')' }

# Up to 2 pair of balanced parens (foo) or ((foo))
Balanced12 = '(' { either Balanced or Balanced1 }* ')'

# None of these characters can end a URL.
EndUrl = !chars[ whitespace ` &bang () &lbracket &rbracket {} ;: '" ., <> ?
                 0xab 0xbb &201c &201d &2018 &2019 ]

Start = %boundary
        { ( either Protocol or WWW or DomainLike )
          ( either Balanced or Balanced12 )+
          ( either Balanced12 or EndUrl ) }
```
It uses the following features:

- `flags(...)` -- the equivalent of `(?xi)` in Perl/Python
- Named rules with `Name = Expression` -- not possible with Perl-style syntax
- String literals like `'www'`
- `*`, `+`, and `?` -- same syntax
- Repetition with `^(2..4)` -- the equivalent of `{2,4}`
- Character classes like `chars[ a-z 0-9 ]`, negated with `!chars`
- Named character classes like `whitespace` -- the equivalent of `\s`
- Named characters like `&bang` and `&lbracket`
- Characters by hex code, like `0xab`, and by Unicode code point, like `&201c`
- `()` for non-capturing groups -- `(?:...)` in Perl
- `{}` for capturing groups -- `()` in Perl

Here's how to read it:
```
Protocol = 'http' 's'? '://'   # http or https protocol
```

Match the literal `http`, then an optional `s` (http or https), then the literal `://`.

```
WWW = 'www' digit^(..3) '.'    # "www.", "www1.", "www2." ... "www999."
```

Match the literal `www`, then up to 3 digits, then a literal `.`.

```
DomainLike = chars[ a-z 0-9 . &hyphen ]+ '.' chars[a-z]^(2..4) '/'
```

Match one or more letters, digits, periods, or hyphens (`&hyphen` is used to escape the hyphen, which is an operator in `a-z`), then a `.`, then 2 to 4 letters, then a `/`.
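As a cross-check, these three rules correspond to the Perl-style patterns from the original regex. Here's a sketch in Python (the constant names are mine, not part of CRE):

```python
import re

# Perl-style equivalents of the three CRE rules above.
PROTOCOL    = r'https?://'                   # 'http' 's'? '://'
WWW         = r'www\d{0,3}[.]'               # 'www' digit^(..3) '.'
DOMAIN_LIKE = r'[a-z0-9.\-]+[.][a-z]{2,4}/'  # chars[...]+ '.' chars[a-z]^(2..4) '/'

assert re.fullmatch(PROTOCOL, 'https://')
assert re.fullmatch(WWW, 'www2.')
assert re.fullmatch(DOMAIN_LIKE, 'foo.cn/')
assert not re.fullmatch(WWW, 'www1234.')  # at most 3 digits
```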
Let's skip down to the `Start` symbol and see how these 3 rules are used.
```
Start = %boundary
        { ( either Protocol or WWW or DomainLike )
          ( either Balanced or Balanced12 )+
          ( either Balanced12 or EndUrl ) }
```
It starts with `%boundary`, which is the equivalent of `\b` (a word boundary). CRE represents zero-width assertions like `^`, `$`, and `\b` with words that start with `%`.

Then there's a capturing group `{}`. Normal parens are non-capturing (the equivalent of `(?:...)`). Inside this group is a sequence of 3 disjunctions:

1. `either Protocol or WWW or DomainLike` -- those are our 3 rules.
2. `( either Balanced or Balanced12 )+` -- one or more runs of `Balanced` text or balanced parentheses.
3. `( either Balanced12 or EndUrl )` -- the URL ends with balanced parentheses or a character that may not end a URL (according to this heuristic).

Notes:
- The five occurrences of `[^\s()<>]` are collapsed into the `Balanced*` subexpressions.

Last modified: 2013-01-29 10:42:48 -0800