Long URL Capture Example

When regular expressions get longer, the advantage of CREs becomes even more apparent.

This blog post describes how to extract URLs from free-form text. Consider these 3 lines:

(Something like http://foo.com/blah_blah_(wikipedia))
A url with parentheses: http://foo.com/blah_(wikipedia)#cite-1
Period ends this sentence: http://foo.com/blah_blah.

In the first case, the inner parentheses are part of the URL, but the outer ones aren't. In the third case, the period is not part of the URL.

Something like this is used in Markdown to auto-linkify URLs. (I'm writing this article in Markdown.)

Even with Perl's /x option, which allows insignificant whitespace and comments, the following heavily commented regex is pretty hard to read.

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://                   # http or https protocol
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." ... "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                                 # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                 # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
)

There's also a lot of duplication. The pattern [^\s()<>] appears in five places.
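To see what the regex actually does with the three sample lines, here is a minimal Python sketch using the standard re module, which accepts the same (?xi) inline flags. The names URL_PATTERN, SAMPLES, and extract_urls are mine, chosen only for illustration.

```python
import re

# The regex from above, unchanged. (?xi) = verbose + case-insensitive.
URL_PATTERN = r"""(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://                   # http or https protocol
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." ... "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                                 # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                 # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
)
"""

SAMPLES = [
    "(Something like http://foo.com/blah_blah_(wikipedia))",
    "A url with parentheses: http://foo.com/blah_(wikipedia)#cite-1",
    "Period ends this sentence: http://foo.com/blah_blah.",
]


def extract_urls(text):
    """Return capture group 1 (the whole URL) for every match in text."""
    return [m.group(1) for m in re.finditer(URL_PATTERN, text)]


for line in SAMPLES:
    print(extract_urls(line))
# ['http://foo.com/blah_blah_(wikipedia)']
# ['http://foo.com/blah_(wikipedia)#cite-1']
# ['http://foo.com/blah_blah']
```

The outer parenthesis in the first line and the trailing period in the third are left out, while the inner (wikipedia) parens are kept.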

How many capturing groups are there in this regex? It's hard to tell, but trying it in an interpreter reveals that there are five. As far as I can tell, the intention is to have one group -- the top-level one -- so this may be a mistake. Non-capturing groups (?:...) are used in some places, but there are also literal parentheses \( \) and capturing groups ( ), and the three are very hard to distinguish.
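One way to do that count, sketched in Python: a compiled pattern reports its number of capturing groups as .groups. FRAGMENT below is the balanced-parens piece copied from the regex; NONCAPTURING is my own hypothetical rewrite, shown only for contrast.

```python
import re

# The balanced-parens fragment, exactly as it appears (twice) in the regex.
FRAGMENT = r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\)"

# Hypothetical rewrite with every grouping paren made non-capturing.
NONCAPTURING = r"\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)"

fragment_groups = re.compile(FRAGMENT).groups
noncapturing_groups = re.compile(NONCAPTURING).groups

# Two copies of the fragment plus the outer "Capture 1" group.
total_groups = 2 * fragment_groups + 1

print(fragment_groups, noncapturing_groups, total_groups)  # 2 0 5
```

Each copy of the fragment contributes two groups that only differ from the literal \( \) by a backslash, which is how the total reaches five.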

Here's how you write this as a CRE:

flags(ignorecase)

Protocol   = 'http' 's'? '://'        # http or https protocol

WWW        = 'www' digit^(..3) '.'    # "www.", "www1.", "www2." ... "www999."

# looks like a domain name followed by a slash, e.g. "foo.cn/"
# TODO: change to use .. for char ranges?
DomainLike = chars[ a-z 0-9 . &hyphen ]+ '.' chars[a-z]^(2..4) '/'

# Balanced text, doesn't have () or <>
Balanced   = !chars[ whitespace () <> ]+

# Up to 1 pair of balanced parens    (foo)
Balanced1  = { '(' Balanced ')' }

# Up to 2 levels of balanced parens    (foo) or ((foo))
Balanced12  = '(' { either Balanced or Balanced1 }* ')'

# None of these characters can end a URL.
EndUrl     = !chars[ whitespace ` &bang
                     ()
                     &lbracket &rbracket
                     {}
                     ;: '" ., <> ?
                     0xab 0xbb
                     &201c &201d
                     &2018 &2019
                   ]

Start      = %boundary
             {
                 ( either Protocol or WWW or DomainLike )
                 ( either Balanced or Balanced12 )+
                 ( either Balanced12 or EndUrl )
             }

It uses the following features:

flags(...) -- equivalent of (?xi) in Perl/Python
subexpressions like Name = Expression -- not possible with Perl-style syntax
literal strings, e.g. 'www'
Repetition operators like *, +, and ? -- same syntax
Ranged repetition like ^(2..4) -- equivalent of {2,4}
Character classes like chars[ a-z 0-9 ], and negated with !chars
named char classes like whitespace -- equivalent of \s
character literals
- named character literals like &bang and &lbracket
- hex literals like 0xab
- unicode character literals like &201c
Non-capturing group () -- (?:...) in Perl
Capturing group {} -- () in Perl

Here's how to read it:

Protocol = 'http' 's'? '://'        # http or https protocol

the literal http
an optional s (http or https)
then the literal ://

then

WWW = 'www' digit^(..3) '.'    # "www.", "www1.", "www2." ... "www999."

The literal 'www'
up to 3 digits
then a literal period

then

DomainLike = chars[ a-z 0-9 . &hyphen ]+ '.' chars[a-z]^(2..4) '/'

one or more of the characters a to z, 0 to 9, a literal period, or a hyphen. (The named literal &hyphen stands for the hyphen itself, since a bare - is the range operator, as in a-z.)
a literal period
a lower case letter, repeated 2 to 4 times
a literal /
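For readers more at home with Perl-style syntax, the three rules can be tried out one at a time in Python. The translations below are my reading of the rules, not output from a CRE tool, and the helper name matches is made up for this sketch.

```python
import re

# Assumed Perl-style equivalents of the three CRE rules.
PROTOCOL = r"https?://"                      # 'http' 's'? '://'
WWW = r"www\d{0,3}[.]"                       # 'www' digit^(..3) '.'
DOMAIN_LIKE = r"[a-z0-9.\-]+[.][a-z]{2,4}/"  # chars[...]+ '.' chars[a-z]^(2..4) '/'


def matches(rule, text):
    """True if the whole of text matches rule, ignoring case."""
    return re.fullmatch(rule, text, re.IGNORECASE) is not None


print(matches(PROTOCOL, "https://"))        # True
print(matches(WWW, "www12."))               # True
print(matches(WWW, "www1234."))             # False: more than 3 digits
print(matches(DOMAIN_LIKE, "foo.cn/"))      # True
print(matches(DOMAIN_LIKE, "foo.museum/"))  # False: last label longer than 4 letters
```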

Let's skip down to the Start symbol and see how these 3 rules are used.

Start      = %boundary
             {
                 ( either Protocol or WWW or DomainLike )
                 ( either Balanced or Balanced12 )+
                 ( either Balanced12 or EndUrl )
             }

First we have %boundary, which is the equivalent of \b (a word boundary). CRE represents the zero-width assertions like ^, $, and \b with words that start with %.
Then we have a capturing group {}. Normal parens are non-capturing (the equivalent of (?:...)). Inside this group is a sequence of 3 disjunctions:
- either Protocol or WWW or DomainLike -- Those are our 3 rules.
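- one or more of: a string with no parens (Balanced), or a string with up to 2 levels of balanced parens (Balanced12)
- either a string with up to 2 levels of balanced parens (Balanced12), or a single character that is allowed to end a URL (EndUrl, according to this heuristic).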
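Python's re module has no named subexpressions, but string interpolation gets part of the way there. The following is a rough sketch of the same structure, not a description of how CRE is implemented; the upper-case names mirror the CRE rules, and the translation of each rule into regex syntax is my assumption.

```python
import re

PROTOCOL = r"https?://"
WWW = r"www\d{0,3}[.]"
DOMAIN_LIKE = r"[a-z0-9.\-]+[.][a-z]{2,4}/"

# Each Balanced* rule is defined once and reused by name.
BALANCED = r"[^\s()<>]+"
BALANCED1 = rf"(\({BALANCED}\))"
BALANCED12 = rf"\(({BALANCED}|{BALANCED1})*\)"

# Negated class: any single character that is allowed to end a URL.
END_URL = r"""[^\s`!()\[\]{};:'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]"""

# Same shape as the Start rule: boundary, then one capturing group
# wrapping three non-capturing alternations.
START = (
    r"\b("
    rf"(?:{PROTOCOL}|{WWW}|{DOMAIN_LIKE})"
    rf"(?:{BALANCED}|{BALANCED12})+"
    rf"(?:{BALANCED12}|{END_URL})"
    r")"
)

URL_RE = re.compile(START, re.IGNORECASE)

match = URL_RE.search("Period ends this sentence: http://foo.com/blah_blah.")
print(match.group(1))  # http://foo.com/blah_blah
print(URL_RE.groups)   # 5
```

The pattern [^\s()<>] is now written once, but the result is still an opaque string: the escaping rules and the capturing versus literal parens are just as easy to confuse as before.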

Notes:

Notice that 5 repetitions of [^\s()<>] are collapsed into the Balanced* subexpressions.
The extra capturing groups noted above are preserved for comparison.

Last modified: 2013-01-29 10:42:48 -0800