Status Update and Blog Backlog

2017-02-26

Two days before I went on vacation, I described how I transformed the OSH AST into what I call the Lossless Syntax Tree. This was motivated by the requirement to translate shell to Oil (part one, part two).

The post generated quality discussion on Hacker News, Lobsters, and Reddit, which is what I was hoping for. I wanted to "crowdsource" my research into how different language platforms represent code losslessly.

I made a wiki page called Lossless Syntax Tree Pattern to distill the responses, planning to turn it into a blog post. I also drafted a post that showed more examples of the AST versus the LST.

Then I went on vacation. When I got back on Wednesday, full of renewed energy for the project, I directed it at coding instead of blog posts.

That was the right thing to do, but unfortunately it means that the blog is backlogged. Drafts are being neglected and TODOs are piling up.

In this post I'll summarize what I had planned to write about, without making a promise to do so any time soon. Tomorrow I'll talk about the coding tasks that have higher priority.

Leave a comment if you want to see more on any of these topics.

Blog Backlog

In the Blog TODO Stack, I grouped future blog posts into four themes:

Shell: The Good Parts. Features that a modern shell should preserve and extend.
ASDL. A schema language using the model of algebraic data types, which forms the backbone of the interpreter architecture.
The Difficulty of Parsing. General-purpose parsing tools are not suitable for production-quality interpreters and compilers.
Metaprogramming. It's important and widespread.

I managed to knock off two posts: Pretty Printing ASTs with ASDL and The Thinner Waist of the Interpreter, but there are still many loose ends.

It should take three or four posts to wrap up the first two themes. I don't feel as much urgency with the third and fourth themes, since they'll benefit from future experience in implementing Oil.

There are at least three more themes in play. Here's a list of possible posts:

(5) The Lossless Syntax Tree Pattern

(a) Lossless Syntax Tree, Part Two. As mentioned, this draft goes into more detail on the AST vs. Lossless Syntax Tree for OSH.

(b) An Algorithm for Style-Preserving Source Code Translation. The algorithm I used in translating shell to Oil is worth describing.

(c) One of the best documents is the design doc for Microsoft's Roslyn platform for C# and Visual Basic. Clang is also powerful and mature, but its documentation isn't as good.

(d) Lossless Syntax Tree Conclusions. Make the following arguments:

The pattern indeed deserves a new name. Many language platforms use it, but they're using different names like AST, syntax tree, full syntax tree, etc.
Lossless syntax trees are a hard design problem. Evidence from Go, Dart, Roslyn, and Caml.
Language ecosystems that use a Lossless Syntax Tree often have two separate parsers: one that produces an AST for interpretation or compilation, and one that produces an LST for automated refactoring and auto-formatting.
- I think Clang and Roslyn are the only ones that unify the two parsers, but I'd love to be corrected on this.

(6) Shell: The Bad Parts

There have been several posts about parsing problems in shell.

There are an equal number of problems related to execution. A few that come to mind:

Shell has Dynamic Scope. Dynamic scope means that the callee can see all of the caller's variables, not just the arguments it passed. In other words, "local" variable lookup traverses the call stack!
Most people are not familiar with this discredited idea in programming languages.

Run dynamic-scope/run.sh and see what happens.
Bash Evaluates Code in Strings Without Eval. This is issue 3, an undocumented feature which is bizarre even for experienced shell users. It also relates to an infinite variable name evaluation rule that you won't see in any other language.

Bash has Separate Expression Languages for Strings, Ints, and Booleans. This design has bad consequences:
- [[ a = b ]] tests for equality of strings, while (( a = b )) does assignment of variables.
- [[ $x == $y ]], [[ $x -eq $y ]], and (( $x == $y )) are three more ways to test for equality.
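
Two of the points above, dynamic scope and the different meanings of = in [[ ]] and (( )), can be checked directly in bash. Here's a minimal sketch. The function names (inner, outer, string_equal, arith_assign) are made up for illustration, and this is not the dynamic-scope/run.sh script mentioned above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; all function names here are hypothetical.

# --- Dynamic scope ---
inner() {
  # 'secret' is neither an argument nor a global. Bash finds it by
  # looking at the locals of the functions further up the call stack.
  echo "inner sees: $secret"
}

outer() {
  local secret='from outer'
  inner
}

# --- '=' means different things in [[ ]] and (( )) ---
string_equal() {
  # Inside [[ ]], '=' compares the literal strings "a" and "b".
  if [[ a = b ]]; then echo same; else echo different; fi
}

arith_assign() {
  local a=1 b=5
  # Inside (( )), '=' assigns the value of variable b to variable a.
  (( a = b ))
  echo "a=$a"
}

outer          # prints: inner sees: from outer
string_equal   # prints: different
arith_assign   # prints: a=5
```

In a lexically scoped language, inner would have no way to read outer's local variable without it being passed explicitly.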

Shell is so confusing that experts are wrong about it:

Quoting the Right Hand Side of Assignments Isn't Necessary. Word splitting and globbing only happen within commands, not assignments. Authoritative shell advice doesn't mention this:
- Google Shell Style Guide on Quoting
- http://mywiki.wooledge.org/Quotes
- Try !qefs in #bash on FreeNode
NOTE: Besides inhibiting word splitting and globbing, quoting also inhibits tilde expansion. If you know of other reasons to quote the RHS of an assignment, leave a comment.
Word Elision Leads to Command Elision. Word elision is when an empty, unquoted word is omitted from an argv array. It works in tandem with word splitting (which is a poor substitute for arrays).

Command elision is when word elision leads to an empty argv array.

This came up in the thread Evaluations of backticks in if statements on the help-bash mailing list. More than one bash expert was confused by this. It boils down to [] vs. [""].

After some testing, I figured out the command elision rule, which is missing from OSH. Bash maintainer Chet Ramey pointed out the section of POSIX that covers it.
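
The assignment rule, word elision, and command elision described above can each be shown in a few lines of bash. This is a sketch under my own naming: count_args and the other functions are hypothetical helpers, not code from OSH or from the mailing list thread.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; all function names here are hypothetical.

count_args() { echo "$#"; }

# Assignment: the unquoted RHS is neither split nor globbed.
assign_unquoted() {
  local src='a  b   *'
  local dest
  dest=$src
  if [[ $dest == "$src" ]]; then echo identical; else echo changed; fi
}

# Word elision: an empty, unquoted word disappears from argv.
word_elision() {
  local empty=''
  count_args $empty x     # argv is [x]     -> prints 1
  count_args "$empty" x   # argv is ["", x] -> prints 2
}

# Command elision: argv is [] so no command runs, and the status is 0.
elided_status() {
  local empty=''
  $empty
  echo "status=$?"
}

# Quoted: argv is [""], so the shell tries to run a command named ""
# and fails with "command not found" (status 127).
quoted_status() {
  local empty=''
  "$empty" 2>/dev/null
  echo "status=$?"
}

# With a command substitution and no command name, the status is that
# of the last substitution. This is the rule POSIX specifies.
subst_status() {
  $(false)
  echo "status=$?"
}

assign_unquoted   # prints: identical
word_elision      # prints: 1 then 2
elided_status     # prints: status=0
quoted_status     # prints: status=127
subst_status      # prints: status=1
```

The difference between elided_status and quoted_status is the [] vs. [""] distinction: one elides the command entirely, the other tries to execute an empty command name.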

(7) The Oil Language Design

This is the most important theme. I'm writing about the good and bad parts of shell to motivate the design of a new shell language.

It deserves a separate roadmap, but here's what I'm thinking right now:

More Shell Features. Translating Shell to Oil talked about funcs, procs, subshells, if, case, etc. There are more features to talk about: multiline strings in place of here docs, arrays, globs, brace expansion, regular expressions. Command vs. Expression Mode.
Macro Languages Don't Age Gracefully. Shell shares a design flaw with Make and regular expressions: the fact that there are no or few invalid programs leaves little room for the language to evolve. In contrast, JavaScript is a language with mistakes, but it has room to grow.
Syntactic Puns. A correspondence between syntax and semantics makes languages more usable. A pun is a syntax with multiple meanings. static is a pun in C; [ a = b ] is a sort of "inter-language" pun in shell. Method calls in Java/Python have multiple meanings.
Influence from Python. Python is my favorite language, and I plan to steal many of its features for Oil. More notable are the places where we'll diverge from Python:
- Homogeneous arrays vs. heterogeneous lists.
- Mutability: var, const, tuples vs. lists.
- Declaration/Initialization vs. mutation.
- Unicode: code points vs. UTF-8.
- Syntax: empty tuple; empty set vs. empty dictionary.
- Classes and modules are more static.
- "Accidentally quadratic" problems.
Influence from R. The Oil language will have tables. The slogan is that the output of ls and the output of ps are both tables. R is a language designed around tables: it's the only language without the "ORM problem".

Conclusion

I've written short blurbs for more than a dozen possible blog posts in three themes. The most important theme is #7: the Oil language design.

If you're interested in anything in particular, leave a comment.

In the next post, I'll describe what coding tasks I'm prioritizing over blog posts. The main goal is to attract contributors. If that works, I may have more time for blogging!