blog | oilshell.org

How to Parse Shell Like a Programming Language

2019-02-07

After success running completion scripts, I started writing a post titled The Interactive Shell Needs a Principled Parser.

It led me to revisit my 2016 posts on #parsing-shell. This post links to them, summarizes them, and provides recent updates.

I lightly edited the posts and improved the formatting. And I replaced "Oil" with "OSH", because I wrote them before clarifying the OSH language vs. the Oil language.

The ideas in these posts will re-appear in the blog, so they're worth another look. Let me know if anything doesn't make sense.

One lesson I've learned over the last few years: If you can parse shell, then you can implement it!

That is, parsing correctly and efficiently is the hardest part of the Oil project. Managing file descriptors is a close second.

Table of Contents
Three Nice Properties of a Parser
Applying Them to the Shell Language
Parsing
Lexing
Representing Source Code
Missing Posts
2019 Updates
List of Lexer Modes
List of Sublanguages
Where These Ideas Will Re-appear
Appendix A: Minor Sublanguages
Appendix B: Other Shell Parsers

Three Nice Properties of a Parser

Roughly speaking, I want the OSH parser to behave like a Python or JavaScript parser. I aimed for these properties:

  1. Static Parsing. Every part of your program is recognized without reference to runtime data. The parser only looks at the source files!
  2. One-Pass Lexing. After characters are turned into tokens, the parser never looks at them again. Instead, it uses token IDs to make all decisions.
  3. One-Pass Parsing with Limited Lookahead. The parser looks at a couple tokens at a time to decide which nodes to output. It doesn't backtrack or look arbitrarily far ahead. It doesn't look at a token more than once.

There are several quirks in the shell language that prevent us from strictly following these principles. And "real" parsers violate them: for example, most parsers re-examine the contents of tokens for C-like string and number literals (e.g. Python's parser).

But I claim that they're good aspirations. They improve the quality of your parsing code, and can improve the design of your language.

(More concretely, I discussed a security issue related to dynamic parsing at BayLISA. I plan to write about it.)

Applying Them to the Shell Language

These posts explain how the OSH parser works, guided by the three principles above.

Parsing

OSH Parses Shell Scripts Up Front in a Single Pass. Other Shells Don't. A demonstration of the difference between static and dynamic parsing.

Parsing Bash is Undecidable. An example of undesirable dynamic parsing, and how I defined it out of the OSH language. Key point: almost all of bash can be statically parsed.

Grammar for Variable Substitutions. I divide shell into the command, word, arith, and bool sublanguages. The grammar helped me statically parse what's inside ${}. In contrast, bash, zsh, and mksh dynamically parse ${}.

The Five Meanings of #. And What Does ${####} Mean? Our strict parsing philosophy helps make sense of an ill-defined language.

OSH Can Be Parsed With Two Tokens of Lookahead. The single pass of parsing can be done with a simple and efficient algorithm. (Relatively harmless exceptions: assignments, tilde recognition, and brace expansion.)

Pratt Parsing Index and Updates. We use Pratt Parsing to recognize the arith sublanguage.

Lexing

How OSH Uses Lexer Modes. Parsing interleaved sublanguages in a single pass requires support from the lexer. Also see later posts tagged #lexing.

Representing Source Code

What is Zephyr ASDL? The OSH parser returns an elaborate data structure, which is expressed in ASDL, a DSL borrowed from CPython. It lets us use the model of algebraic data types in Python.

From AST to Lossless Syntax Tree. Translating shell to the Oil language requires a detailed representation the code.

The LST has benefits beyond translation, which I'll explain in future posts:

Missing Posts

I've documented most of the OSH parser, but there are some holes:

re2c is a Useful, High-Quality, Tool. We've gotten a lot of mileage out of re2c. How does it work?

Recursive Descent Parsing. Almost all hand-written parsers are recursive descent parsers. OSH uses their simple structure to support the interactive shell.

Parsing Retrospective. The single-pass, static parsing model worked very well over the years, but there are peculiaries that fall outside of it. For example:

2019 Updates

In 2016, I had a fairly complete shell parser, but the interpreter couldn't run real programs. It can now run thousands of lines of distro scripts and completion scripts, which implicitly validates that the language is correct.

So now is a good time for two updates:

List of Lexer Modes

The 10 lexer modes I listed have new names:

Outer SQ DQ Dollar_SQ Arith
VS_1 VS_2 VS_ArgUnquoted VS_ArgDQ BashRegex

Since then, OSH has required 4 more modes, for a total of 14:

Comment To recognize # until the end of the line. Comments aren't totally straightforward in shell.
Backtick To treat quotes and backslashes differently in "`echo hi`" versus "$(echo hi)".
DBracket For what's inside [[, aka the bool sublanguage.
ExtGlob For extended globs like *.@(cc|cpp|h).

Interestingly, most of these modes support the Word sublanguage. The Oil language should dramatically simplify this situation.

List of Sublanguages

In both the grammar post and the lookahead post, I mentioned 4 statically parsed, mutually recursive sublanguages:

Command echo hi, loops, conditionals, functions, ...
Word "name=${1:-default}", many kinds of quotes, variable substitutions, string operators, ...
Arith $(( x * 2 )), a[i++], ...
Bool [[ $name =~ [a-z]+ ]], [[ $LEN -eq 0 ]], ...

I've since implemented the following features and now consider them to be sublanguages. One reason is that they all have recursive structure.

Brace Expansion For example, {andy,bob,user{9..12}}@foo.com. It's recognized in a static "metaprogramming" pass. Parsing it doesn't depend on runtime data.
Globs and Extended Globs Globs don't have recursive structure, but extended globs do. They're dynamically parsed in OSH, but should be statically parsed in Oil. OSH sometimes parses globs itself, and sometimes uses libc, especially for extended globs.
Regexes (POSIX ERE) Regexes need a special lexer mode for compatibility with bash. After lexing, they're dynamically parsed by invoking libc.
test builtin aka [ The dynamically parsed variant of the bool sublanguage, aka [[. Dynamic parsing causes ambiguities that don't appear in the statically parsed [[.

Where These Ideas Will Re-appear

When I wrote these posts on parsing shell, I was concerned with:

  1. Discovering what the shell language really is,
  2. Defining a better version of it, and
  3. Providing better error messages to improve productivity.

But these ideas also relate to other parts of the project:

  1. The interactive shell is intimately tied to the OSH parser.
  2. The Oil language will adhere to the three principles more closely than OSH does. It will have fewer lexer modes and sublanguages, which will make it easier to learn and remember.

The next post will illustrate the relationship between the interactive shell and the parser. Being principled about text has paid off in unexpected ways!

Appendix A: Minor Sublanguages

Non-recursive:

I discussed these possibilities in Dev Log #9:

Appendix B: Other Shell Parsers

This isn't necessarily the only good way to parse shell, but the fact that OSH runs many real programs means that it's one good way. In 2016, I wasn't sure if the model I chose would hold up over time, but it's now clear that it has.

I maintain a list of shell-related projects on the wiki, including other shell parsers.

ShellCheck is one of the earliest and most impressive projects, written with Haskell's Parsec library. Though it's not clear to me if it recognizes shell well enough to execute it, or if it only looks for patterns to warn about.

I'm also interested in Morbig and Smoosh, the academic projects that study shell. The main difference between them and Oil is that I'm interested in the dialect of shell used in the wild (including bash), as opposed to POSIX shell.

On a related note, I have a list of questions regarding signals and file descriptors that some formalization may help with. You may see these questions in a future blog post!