Egg Expressions (YSH Regexes)

These patterns are intended to be familiar, but they differ from POSIX or Perl expressions in important ways. So we call them eggexes rather than regexes!

Table of Contents

Why Invent a New Language?

Example of Pattern Reuse

Design Philosophy

The Expression Language Is Consistent

Expression Primitives

. Is Now dot

Classes Are Unadorned: word, w, alnum

Zero-width Assertions Look Like %this

Single-Quoted Strings

Compound Expressions

Sequence and Alternation Are Unchanged

Repetition Is Unchanged In Common Cases, and Better in Rare Cases

Negation Consistently Uses !

Splice Other Patterns @var_name or UpperCaseVarName

Group With ()

Capture with <capture ...>

Character Class Literals Use []

Backtracking Constructs Use !! (Discouraged)

Outside the Expression language

Flags and Translation Preferences (;)

Use character literals rather than C-Escaped strings

POSIX ERE Limitations

Repetition of Strings Requires Grouping

Unicode char literals are limited in range

Don't put non-ASCII bytes in string sets in char classes

Char class literals: ^ - ] \

Critiques

Regexes Are Hard To Read

YSH is Shorter Than Bash

... and Perl

Design Notes

Eggexes In Other Languages

Backward Compatibility

FAQ

The Name Sounds Funny.

How Do Eggexes Compare with Perl 6 Regexes and the Rosie Pattern Language?

What About Eggex versus Parsing Expression Grammars? (PEGs)

Why Don't dot, %start, and %end Have More Precise Names?

Where Do I Send Feedback?

Why Invent a New Language?

Example of Pattern Reuse

Design Philosophy

The Expression Language Is Consistent

For example, it's easy to see that these patterns all match three characters:

And that you have to look up the definition of HexDigit to know how many characters this matches:
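The same point can be checked against a regex engine once patterns are translated. The sketch below is illustrative only: Python's `re` module stands in for the target engine, the regex strings are hand-written assumptions rather than actual eggex output, and `HEX_DIGIT` is a hypothetical definition, since the width of a spliced pattern depends entirely on how it was defined.

```python
import re

# Hand-written regexes standing in for three-item patterns (assumed translations).
FIXED_WIDTH = [r"\d\d\d", r"...", r"abc"]
SAMPLES = ["123", "x-y", "abc"]

def match_length(pattern, text):
    """Return the length of a full match of pattern against text, or None."""
    m = re.fullmatch(pattern, text)
    return len(m.group(0)) if m else None

# Each pattern lists three single-character items, so a full match has length 3.
lengths = [match_length(p, s) for p, s in zip(FIXED_WIDTH, SAMPLES)]

# Hypothetical definition of a spliced sub-pattern: here it is one character
# wide, but nothing at the point of use tells you that.
HEX_DIGIT = r"[0-9a-fA-F]"
spliced = "^" + HEX_DIGIT + "$"
```

With this particular definition, `spliced` matches `f` but not `ff`; a different definition of `HEX_DIGIT` would change that without any change at the point of use.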

Expression Primitives

. Is Now dot

But . is still accepted. It usually matches any character except a newline, although this changes based on flags (e.g. dotall, unicode).
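As a quick illustration of how flags change the meaning of `.`, here is a minimal sketch using Python's `re` module as one possible target engine. Other engines spell the flag differently, so treat the flag name as specific to Python.

```python
import re

# By default, "." does not match a newline.
default_match = re.fullmatch(r".", "\n")

# With the DOTALL flag, "." matches any character, including a newline.
dotall_match = re.fullmatch(r".", "\n", flags=re.DOTALL)

print(default_match is None)     # True
print(dotall_match is not None)  # True
```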

Classes Are Unadorned: word, w, alnum

Zero-width Assertions Look Like %this

Single-Quoted Strings

Note: instead of using double-quoted strings like "xyz $var", you can splice a string into an eggex:
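For comparison, this is roughly what the equivalent step looks like when done by hand in Python. It is a sketch under the assumption that a spliced string is matched literally; `splice_literal` and `user_text` are hypothetical names, and in an eggex the language does this escaping for you.

```python
import re

def splice_literal(prefix_pattern, literal):
    """Append a literal string to a pattern, escaping any regex operators in it."""
    return prefix_pattern + re.escape(literal)

user_text = "a.b"  # hypothetical variable holding arbitrary text
pattern = splice_literal(r"^x=", user_text)

# The "." from user_text is matched literally, not as "any character".
print(re.search(pattern, "x=a.b") is not None)  # True
print(re.search(pattern, "x=aXb") is not None)  # False
```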

Compound Expressions

Sequence and Alternation Are Unchanged

Repetition Is Unchanged In Common Cases, and Better in Rare Cases

Negation Consistently Uses !

Splice Other Patterns @var_name or UpperCaseVarName

Group With ()

See note below: When translating to POSIX ERE, grouping becomes a capturing group. POSIX ERE has no non-capturing groups.
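The difference between the two kinds of group can be seen in an engine that supports both. The sketch below uses Python's `re` module for illustration; the patterns are hand-written, not eggex output.

```python
import re

non_capturing = re.compile(r"(?:ab)+c")  # Perl-style group, not available in ERE
capturing = re.compile(r"(ab)+c")        # the only kind of group ERE offers

# Both accept the same strings...
same_language = (non_capturing.fullmatch("ababc") is not None
                 and capturing.fullmatch("ababc") is not None)

# ...but only the second one records a numbered group.
print(non_capturing.groups)  # 0
print(capturing.groups)      # 1
```

So a pattern that only groups still shifts the numbering of any captures that come after it once it is translated to ERE.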

Capture with <capture ...>

Character Class Literals Use []

Backtracking Constructs Use !! (Discouraged)

Since they all begin with !!, you can visually audit your code for potential performance problems.

Outside the Expression language

Flags and Translation Preferences (;)

reg_icase / i (Ignore Case)

Use this flag to ignore case when matching. For example, /'foo'; i/ matches 'FOO', but /'foo'/ doesn't.
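The same behavior can be sketched in Python, whose `re.IGNORECASE` flag plays the role of the ignore-case flag here. The regex `foo` is an assumed translation of the quoted literal.

```python
import re

with_flag = re.search(r"foo", "FOO", flags=re.IGNORECASE)
without_flag = re.search(r"foo", "FOO")

print(with_flag is not None)     # True
print(without_flag is not None)  # False
```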

reg_newline (Multiline)

With this flag, %end will match before a newline and %start will match after a newline.

Without the flag, %start and %end match only at the start and end of the string, respectively.
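A minimal sketch of the anchor behavior, using Python's `re.MULTILINE` as a stand-in. This only approximates the POSIX flag: it shows the effect on `^` and `$`, which are the assumed translations of %start and %end.

```python
import re

text = "one\ntwo"

# Without the flag, the anchors only apply to the string as a whole,
# so no single run of word characters spans from start to end.
whole_string = re.findall(r"^\w+$", text)

# With the flag, the anchors also match after and before each newline.
per_line = re.findall(r"^\w+$", text, flags=re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['one', 'two']
```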

Multiline Syntax

The YSH API

YSH also has an explicit and powerful Python-like API: the search() and leftMatch() methods on strings.
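Since the API is described as Python-like, the closest Python analogs may help. This sketch assumes that search() finds the first match anywhere in the string and that leftMatch() only matches at the starting position, as the names suggest; it shows Python's `re` methods, not the YSH methods themselves.

```python
import re

digits = re.compile(r"\d+")

# Analog of search(): scan forward for the first match.
anywhere = digits.search("abc 123")

# Analog of leftMatch(): the match must begin exactly at the start position.
at_left = digits.match("abc 123")     # position 0 is "a", so no match
at_pos = digits.match("abc 123", 4)   # position 4 is "1", so it matches

print(anywhere.group(0))  # 123
print(at_left)            # None
print(at_pos.group(0))    # 123
```

The anchored form is the one typically used to write lexers, where each token must start exactly where the previous one ended.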

Language Reference

Usage Notes

Use character literals rather than C-Escaped strings

POSIX ERE Limitations

Repetition of Strings Requires Grouping

Repetitions like * + ? apply only to the last character, so literal strings need extra grouping:

This is necessary because ERE doesn't have non-capturing groups like Perl's (?:...), and Eggex only does "dumb" translations. It doesn't silently insert constructs that change the meaning of the pattern.
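The effect of a missing group is easy to reproduce in any engine. The sketch below uses Python's `re` module with hand-written regexes; `foo+` stands for the translation of a repeated string without grouping, and `(foo)+` for the grouped version.

```python
import re

ungrouped = r"foo+"   # "+" applies only to the final "o"
grouped = r"(foo)+"   # "+" applies to the whole string "foo"

print(re.fullmatch(ungrouped, "fooo") is not None)    # True
print(re.fullmatch(ungrouped, "foofoo") is not None)  # False
print(re.fullmatch(grouped, "foofoo") is not None)    # True
```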

Unicode char literals are limited in range

They happen to be identical when translated to ERE, but may not be when translated to PCRE.

Don't put non-ASCII bytes in string sets in char classes

Char class literals: ^ - ] \

The literal characters ^ - ] \ are problematic because they can be confused with operators.
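To see the ambiguity concretely, here is a small sketch using Python's `re` module. Python resolves it with backslash escapes; POSIX bracket expressions have their own rules for these characters, so treat the escaping shown here as specific to Python rather than as what eggex emits.

```python
import re

range_class = r"[a-c]"     # "-" is the range operator: matches a, b, or c
literal_class = r"[a\-c]"  # escaped "-" is a literal: matches a, "-", or c

negated_class = r"[^a]"    # leading "^" negates: anything except a
caret_class = r"[\^a]"     # escaped "^" is a literal: matches "^" or a

# All four problematic characters as literals in one class.
all_four = r"[\^\-\]\\]"
matched = [ch for ch in ["^", "-", "]", "\\", "a"] if re.fullmatch(all_four, ch)]
print(matched)  # ['^', '-', ']', '\\']
```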

Critiques

Regexes Are Hard To Read

YSH is Shorter Than Bash

... and Perl

Design Notes

Eggexes In Other Languages

The eggex syntax can be incorporated into other tools and shells. It's designed to be separate from YSH -- hence the separate name.

Backward Compatibility

Eggexes aren't backward compatible in general, but they retain some legacy operators like ^ . $ to ease the transition. These expressions are valid eggexes and valid POSIX EREs:

FAQ

The Name Sounds Funny.

If "eggex" sounds too much like "regex" to you, simply say "egg expression". It won't be confused with "regular expression" or "regex".

How Do Eggexes Compare with Perl 6 Regexes and the Rosie Pattern Language?

All three languages support pattern composition and have quoted literals. They also share the goal of improving upon Perl 5 regex syntax, which has made its way into every major programming language (Python, Java, C++, etc.).

The main difference is that Eggexes are meant to be used with existing regex engines. For example, you translate them to a POSIX ERE, which is executed by egrep or awk. Or you translate them to a Perl-like syntax and use them in Python, JavaScript, Java, or C++ programs.

Perl 6 and Rosie have their own engines, which are more powerful than PCRE, Python's regex engine, and similar engines. That means their patterns cannot be translated to run on existing engines in this way.

What About Eggex versus Parsing Expression Grammars? (PEGs)

The short answer is that they can be complementary: PEGs are closer to parsing, while eggex and regular languages are closer to lexing. Related:

The PEG model is more resource intensive, but it can recognize more languages, and it can recognize recursive structure (trees).

Why Don't dot, %start, and %end Have More Precise Names?

Because the meanings of . ^ and $ are usually affected by regex engine flags, like dotall, multiline, and unicode.

As a result, the names mean nothing more than "however your regex engine interprets . ^ and $".

As mentioned in the "Design Philosophy" section above, eggex only does a superficial, one-to-one translation. It doesn't understand the details of which characters will be matched under which engine.

Where Do I Send Feedback?

Please try them, as described in this post and the README, and send us feedback!
