Home

From AST to Lossless Syntax Tree, Part Two

2017-02-12

(We'll see what I mean by "ordered" in the next post.)

More Differences

In this section, I'll explain this messy diff of osh.asdl. Recall that this schema was originally designed for execution, but now it's become a Lossless Syntax Tree for translation to Oil.

Differences:

(1) As mentioned, I introduced the concept of a Sentence, and removed the Fork node. A sentence can end with ; or &. The former will be compiled to synchronous execution, and the latter asynchronous.

(2) I preserved the difference between $var and ${var}, which are translated to $var and $(var) in Oil. They of course mean the same thing.

(3) The Assignment node now stores only a keyword: either declare, export, local, or readonly. Prior to this I was trying to represent the semantics of each statement, but now we only represent the syntax.

(4) The body of a loop like do; echo hi; done and a block like { echo hi; } are now represented differently. We translate them to a uniform block syntax.

(5) And if statement now has an optional else_action. Prior to this, we would represent the else statement like elif true; echo else action, but this isn't good enough for translation.

(6) Case statements now have a sequence of case_arm. This preserves the "ordering" property of the Lossless Syntax Tree -- we should be able to traverse the children of a node in syntactic order. We shouldn't have to jump back and forth.

I made note of a couple differences that aren't represented:

These haven't proved necessary for translation so I cheated a bit. If we really needed to recover the original source rather than just translate to Oil, I would fully represent these differences.

This change also means that Oil is starting to like a compiler, because I need to "lower" osh.asdl to a simpler form for execution (called ovm.asdl for now).

Links

Recap of Three Priorities

  1. We tested the oil language design
  2. Shell executor in Python still works. It interprets the lossless syntax tree.
  3. I haven't worked on the C++ code in awhile. ovm.asdl isn't fleshed out, but I have oheap.

More components needed:

Switching gears:

I need to parallelize all this work.

Conclusion

Next post:

Annotated Lossless Syntax Tree.

Algorithm for

  1. Testing the oil language design by converting shell scripts

I started the oil project by writing a tree-walking interpreter for osh, a dialect of bash. All shells that I've looked are tree-walking interpreters, although they interleave parsing an execution, so they cannot have a single thin waist.

In this post, I made note of three use cases for "the" AST:

  1. Execution
  2. Converting the code from osh to oil

I made

. They must have multiple rephave multiple
rather than having a single

at use this pat Like all shells, Oil started as a tree-walking 0interpreter.

I had an AST for oil

"Transpilers" like CoffeeScript don't need to do this, because the target JavaScript code won't be edited.

A Source Code Representation: the Annotated Lossless Syntax Tree

In the last two posts, I showed examples of the osh to oil translation. The translation algorithm must preserve comments and formatting, i.e. indentation and line breaks.

I expect the upgrade process from bash to be:

  1. Run the osh to oil translator
  2. Fix up any bugs/warnings
  3. Fix it up for style, optionally running an oil-format (like clang-format).

The algorithms and data structures to accomplish this style-preserving translation took some thinking.

In this post, I'll describe the data structure I think of as the "Lossless Syntax Tree". It's different than both an AST (abstract syntax tree) and a parse tree/concrete syntax tree.

using the lossless syntax tree. In the next post, I'll describe the algorithm for

What can you do with this data structure?

  1. Style-preserving translation to another language. C++ 11 to C++ 14.
  2. Reformatting -- changing the whitespace.
  3. "Lowering" to a more abstract tree for interpretation or compilation.
  4. Direct execution. This is what we do right now.

Formatting

https://clang.llvm.org/docs/LibFormat.html

FIXME: Write up design. Interface

The core routine of LibFormat is reformat():

tooling::Replacements reformat(const FormatStyle &Style, Lexer &Lex, SourceManager &SourceMgr, std::vector Ranges);

I think that this is compatible with my representation, but I don't know.

Python DOES USE a Concrete syntax tree. Python.asdl is the abstract syntax tree.

But Grammar/Grammar creates a pure cnocrete syntax tree. There are NO SEMANTIC ACTIONS. But that's why Python needs .pyc files. That is shown in Haberman's post.

I want a FAST PARSER. And although I haven't measured it, creating 3 extra nodes for every command is bound to make it slower!

ANTLR 3 vs ANTLR 4. Always creates CST. I believe this is the wrong choice, or at least it makes me want to use ANTLR even less.

ANTLR is dumb; shouldn't use.

Questions

Wiki page: Source to Source Translation

CoffeeScript Go


Discuss this post on Reddit.
Get notified about new posts via @oilshellblog on Twitter.