Long URL Capture Example

When regular expressions get longer, the advantage of CREs becomes even more apparent.

This blog post describes how to extract URLs from free-form text. Consider these 3 lines:

(Something like http://foo.com/blah_blah_(wikipedia))
A url with parentheses: http://foo.com/blah_(wikipedia)#cite-1
Period ends this sentence: http://foo.com/blah_blah.

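In the first case, the inner parentheses are part of the URL, but the outer ones aren't. In the second case, the parentheses in the middle of the URL and the #cite-1 fragment after them are both part of it. In the third case, the trailing period is not part of the URL.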

Something like this is used in Markdown to auto-linkify URLs. (I'm writing this article in Markdown.)

Even with Perl's /x option, which allows insignificant whitespace and comments inside the pattern, the following heavily commented regex is pretty hard to read.

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://                   # http or https protocol
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." ... "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                                 # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                 # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
)
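To see what the regex does with the three sample lines, here is a minimal sketch that ports it to Python's `re` module. It assumes Python's verbose mode is close enough to Perl's /x for this pattern, and it passes the flags as arguments instead of the inline `(?xi)`. The names `URL_RE`, `SAMPLES`, and `extract_url` are made up for this illustration.

```python
import re

# The regex from above, unchanged except that the x and i flags
# are passed to re.compile instead of being written inline.
URL_RE = re.compile(
    r"""
    \b
    (                       # Capture 1: entire matched URL
      (?:
        https?://                   # http or https protocol
        |
        www\d{0,3}[.]               # www., www1., www2. ... www999.
        |
        [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
      )
      (?:                                 # One or more:
        [^\s()<>]+                          # run of non-space, non-()<>
        |
        \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
      )+
      (?:                                 # End with:
        \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
        |
        [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
      )
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)

SAMPLES = [
    "(Something like http://foo.com/blah_blah_(wikipedia))",
    "A url with parentheses: http://foo.com/blah_(wikipedia)#cite-1",
    "Period ends this sentence: http://foo.com/blah_blah.",
]


def extract_url(line):
    """Return the first URL found in line, or None if there is none."""
    match = URL_RE.search(line)
    return match.group(1) if match else None


urls = [extract_url(line) for line in SAMPLES]
for url in urls:
    print(url)
```

Run against the three lines, this should drop the outer closing parenthesis in the first case, keep `(wikipedia)#cite-1` in the second, and drop the trailing period in the third.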

There's also a lot of duplication. The pattern [^\s()<>] appears in five places.

How many capturing groups are there in this regex? It's hard to tell, but trying it in an interpreter reveals that there are five. As far as I can tell, the intention is to have one group -- the top level one -- so this may be a mistake. Non-capturing groups (?:...) are used in some places, but there are also literal parentheses \(\), and capturing groups (), which are very hard to distinguish.
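One way to check the group count is to compile the pattern and ask the interpreter. The sketch below uses Python's `re` module as the interpreter, which is an assumption (the original check may have used something else), and writes the regex in a compact form without comments. `PATTERN` and `compiled` are names chosen for this example.

```python
import re

# Same regex as above, written compactly and split into its three parts.
PATTERN = (
    r"\b("
    r"(?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])"""
    r")"
)

compiled = re.compile(PATTERN, re.IGNORECASE)

# .groups is the number of capturing groups in the compiled pattern.
print(compiled.groups)  # 5
```

The outer group accounts for one; the other four come from the two copies of the balanced-parens fragment, each of which contains two plain `(...)` groups.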

Here's how you write this as a CRE:

flags(ignorecase)

Protocol   = 'http' 's'? '://'        # http or https protocol

WWW        = 'www' digit^(..3) '.'    # "www.", "www1.", "www2." ... "www999."

# looks like a domain name followed by a slash, e.g. "foo.cn/"
# TODO: change to use .. for char ranges?
DomainLike = chars[ a-z 0-9 . &hyphen ]+ '.' chars[a-z]^(2..4) '/'

# Balanced text, doesn't have () or <>
Balanced   = !chars[ whitespace () <> ]+

# Up to 1 pair of balanced parens    (foo)
Balanced1  = { '(' Balanced ')' }

# Up to 2 levels of balanced parens  (foo) or ((foo))
Balanced12 = '(' { either Balanced or Balanced1 }* ')'

# None of these characters can end a URL.
EndUrl     = !chars[ whitespace ` &bang
                     ()
                     &lbracket &rbracket
                     {}
                     ;: '" ., <> ?
                     0xab 0xbb
                     &201c &201d
                     &2018 &2019
                   ]

Start      = %boundary
             {
                 ( either Protocol or WWW or DomainLike )
                 ( either Balanced or Balanced12 )+
                 ( either Balanced12 or EndUrl )
             }

It uses the following features:

Here's how to read it:

Protocol = 'http' 's'? '://'        # http or https protocol

then

WWW = 'www' digit^(..3) '.'    # "www.", "www1.", "www2." ... "www999."

then

DomainLike = chars[ a-z 0-9 . &hyphen ]+ '.' chars[a-z]^(2..4) '/'

Let's skip down to the Start symbol and see how these 3 rules are used.

Start      = %boundary
             {
                 ( either Protocol or WWW or DomainLike )
                 ( either Balanced or Balanced12 )+
                 ( either Balanced12 or EndUrl )
             }
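For readers without a CRE implementation at hand, the same decomposition can be roughly approximated in plain Python by building the regex from named string fragments. This is only an analog of the idea, not CRE itself: the constant names mirror the rule names above, `either` is a hypothetical helper, and the sketch deliberately keeps a single capturing group (the whole URL), which is an assumption about the intended behavior.

```python
import re

# Named fragments mirroring the CRE rules.
PROTOCOL = r"https?://"
WWW = r"www\d{0,3}[.]"
DOMAIN_LIKE = r"[a-z0-9.\-]+[.][a-z]{2,4}/"

BALANCED = r"[^\s()<>]+"                          # no whitespace, () or <>
BALANCED1 = rf"\({BALANCED}\)"                    # (foo)
BALANCED12 = rf"\((?:{BALANCED}|{BALANCED1})*\)"  # (foo) or ((foo))
END_URL = r"""[^\s`!()\[\]{};:'".,<>?«»“”‘’]"""   # chars that may end a URL


def either(*alternatives):
    """Join alternatives into one non-capturing group (hypothetical helper)."""
    return "(?:" + "|".join(alternatives) + ")"


START = (
    r"\b("                                   # the only capturing group
    + either(PROTOCOL, WWW, DOMAIN_LIKE)
    + either(BALANCED, BALANCED12) + "+"
    + either(BALANCED12, END_URL)
    + ")"
)

URL_RE = re.compile(START, re.IGNORECASE)

line = "(Something like http://foo.com/blah_blah_(wikipedia))"
print(URL_RE.groups)                 # 1
print(URL_RE.search(line).group(1))  # http://foo.com/blah_blah_(wikipedia)
```

Each fragment is written once and reused by name, so the `[^\s()<>]` class is no longer repeated by hand, and the only capturing parenthesis is the one spelled out in `START`.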

Notes:


Last modified: 2013-01-29 10:42:48 -0800