blog | oilshell.org

Regular Languages, Part 2: Ideas and Questions

2020-07-22

The last post was about Eggex and regular languages, in theory and in practice.

It had a long appendix, which I've moved to this post. It's unpolished, but may spark some ideas. Feel free to leave comments!

Table of Contents
A Shortcut for Implementing an Engine
Regular Expression Derivatives: Papers and Code
Research Questions
More Powerful Linear Time, Constant Space Languages?
Ad-Hoc Context in Common Lexing Tools? Recursion?
Conclusion

A Shortcut for Implementing an Engine

In the long term, Oil may grow its own regular language / pattern matching engine. I've tried to avoid it, but it seems inevitable because most shells do it to a degree.

For example, rather than using libc's glob() and fnmatch(), which are restricted forms of regular languages, they implement their own. One reason is to implement the globstar option (**) and extended globs like @(foo|bar). (Unfortunately, I found that these implementations backtrack.)

Another reason to have a pattern matching engine: I've found re2c very useful in implementing Oil, and I'd like "lift" its functionality to the user level.

But writing one from scratch is a huge amount of work! So here's a crazy idea to compile re2c to WebAssembly and embed a WebAssembly VM in Oil:

Summary:

  1. re2c could grow a new WebAssembly backend, and
  2. The tool itself could be compiled to WebAssembly

(These T or J diagrams may help in thinking about that.)

Another reason for WebAssembly: I want let users avoid the tortured string manipulation syntax in shell. Letting people write arbitrary string functions in fast, portable code is one way to do that.

Problem: I don't like opaque binary blobs, in either the browser or in the shell. Even if they're sandboxed. I want the code to have provenance. To start, maybe this feature can be disabled by default.

This idea feels like a pipe dream, since there are so many other things to do in Oil. But it does make sense because regular languages are a crucial part of shell.

On the other hand, there's a concrete way to start it. Let me know if you have ever compiled C code to WebAssembly. I haven't! re2c is plain C++ with few dependencies, and we actually compile a specific version of it in Oil's continuous build.

Regular Expression Derivatives: Papers and Code

A different way to implement an engine would be through the derivatives technique. I don't have experience with it, so I'm just dropping a few links. Let me know if you do.

Some implementations:

(I recovered these links from an draft of "Dev Log #11" last year, which I never published. Hat tip to Santi from Recurse Center for sending me down this rabbithole of research :-) )

Research Questions

More Powerful Linear Time, Constant Space Languages?

As mentioned in the last post, I use regular languages as a linear-time, constant space programming language. They do a lot of work for "free", performance-wise.

Is there a more powerful model that's still constrained in resource usage? For example, Langdale mentioned that many usages of backreferences aren't exponential.

"Powerful" can be defined either:

Related:

Ad-Hoc Context in Common Lexing Tools? Recursion?

All practical parsing tools have developed ad-hoc mechanisms to deal with the problems that lexer modes solve, e.g. recognizing string interpolation syntax.

These questions are complicated by the interaction with the parsing model. Oil relies on explicit mode changes in its hand-written parsers, which doesn't work well with many parser generators.

Here are some notes on the mechanisms that different tools have come up with:

Conclusion

Let me know if any of those ideas are interesting. Especially if you want help write a pattern matching engine!

I think it would be fun to compile some C code to WebAssembly. Though we can likely do something much simpler (maybe with derivatives) that will address the deficiencies of libc glob().