From AST to Lossless Syntax Tree

2017-02-11

The Oil project encompasses two shell languages supported by the same runtime:

OSH, a bash dialect, and
Oil, a more usable and consistent language, with room to add features.

As noted in ASDL Implementation and Use Cases, I began with a simple AST interpreter for OSH, thinking that the two languages could share the same AST.

However, in addition to executing OSH, we need to translate it to Oil. The last two posts (part one, part two) demonstrated this style-preserving translation on real shell scripts.

This translation requires a different representation than execution. That is, osh.asdl used to be an AST, but it's now what I call a Lossless Syntax Tree. This post explains what that is, how it differs from both an AST and a parse tree, and shows examples.

Table of Contents

Related Work

What is a Lossless Syntax Tree?

Example

What about Whitespace and Comments?

Related Work

Are we solving the same problem as CoffeeScript? It also translates source code to source code.

No, because the resulting JavaScript only needs to be executed, while the output of our OSH-to-Oil translator will be subsequently edited by a human.

This means that CoffeeScript can just print the AST, because the AST represents exactly what's necessary for execution. In contrast, we must preserve comments and formatting with the Lossless Syntax Tree.

A better comparison is Python's 2to3, which converts Python 2 to Python 3.

Clang also uses such a representation. An observation that stuck in my head is that Clang preserves physical parentheses while GCC doesn't. That is, GCC represents 1 + 2 * 3 and 1 + (2 * 3) the same way, since they both evaluate to 7, but Clang represents them differently.

I believe proprietary IDEs like the ones from Microsoft have used such representations for a long time. I'd like to hear the terms they use to describe their data structures.

In summary, when generating source code for human consumption, it's important to represent features that are irrelevant to a computer. I hope that authors of other language front ends will comment on this post and provide examples of different approaches, and perhaps argue with my terminology.

What is a Lossless Syntax Tree?

It's a representation for source code that can be converted back to the original text. In Oil, I implement it with a combination of:

An ordered ASDL tree structure, and
An array of line spans. Concatenating these spans gives you back the original source file.

It's not an AST because it preserves details of the syntax that are irrelevant to interpretation and compilation. For example, these two lines of shell do the same thing, but look different:

echo foo > stdout.txt
echo foo 1> stdout.txt  # file descriptor 1 is stdout

As with the example of parentheses in Clang, we want to preserve these differences rather than ignore them.

But a Lossless Syntax Tree isn't a parse tree / concrete syntax tree either, because it has fewer nodes.

In What is Zephyr ASDL?, I linked to this StackOverflow post, which has a graphic at the bottom that compares an AST and a parse tree.

Eli Bendersky also has a good article on it: Abstract vs. Concrete Syntax Trees. He points out two good definitions from the Dragon Book there:

A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.

Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.

So I think this representation deserves its own name. You would say that GCC has an AST but not a Lossless Syntax Tree.

Example

To be concrete, consider the program:

echo one; echo two

When we only wanted to execute the code, we represented it in abbreviated format like this:

(CommandList
  children: [
    (C {(echo)} {(one)})
    (C {(echo)} {(two)})
  ]
)

But translation imposes different requirements. Oil doesn't require semi-colons in the same places that shell does, so sometimes we remove them. To do this without disturbing the surrounding code, I introduced the Sentence node type, which includes a terminator like ; or &:

(CommandList
  children: [
    (Sentence command:(C {(echo)} {(one)}) terminator:<Op_Semi ";">)
    (C {(echo)} {(two)})
  ]
)

This way we know which semi-colon goes with which statement, and can decide whether to skip each one during translation.

In contrast, a parse tree produced with the POSIX shell grammar would look something like this:

complete_command
  list
    and_or
      pipeline
        pipe_sequence
          command
            simple_command
              (Sentence command:(C {(echo)} {(one)}) terminator:<Op_Semi ";">)
              
    and_or
      pipeline
        pipe_sequence
          command
            simple_command
              (C {(echo)} {(two)})

I don't use this representation because it's bigger and thus slower. It's not necessary for either translation or execution.

But some programs do use such a representation:

In contrast to ANTLR v3, ANTLR v4 always materializes a parse tree. Transforming it into an AST requires another stage.
Python creates a parse tree using Grammar/Grammar before transforming it into an AST. (This was mentioned in in What is Zephyr ASDL?.)

I can't think of a good reason for doing this. I think some people want "pure declarative" syntax, i.e. they don't want to clutter the grammar with semantic actions, but I see this as a weak reason. Please leave a comment if you have insight into this.

What about Whitespace and Comments?

We saw a tree that's more detailed than an AST, but less cluttered than a parse tree. However, the trees have no concept of whitespace or comments. There's not really a "natural" place to put such tokens because they're not part of the language proper.

The OSH parser handles this as follows:

Use the lexer to divide the program into an array of spans. A span is a triple describing a substring of a line: (int line_id, int col, int length). Concatenating the spans reproduces the original source text.
Number the spans from 0 to N-1. Call this integer the span ID.
Decorate the tree with span IDs of significant tokens like words and operators. This isn't an exact science — you may save more or fewer span IDs depending on the transformations you need to make.

Because span IDs are ascending, we now know where insignificant tokens like whitespace and comments are placed relative to the significant ones.

For example, this program:

echo one  # comment
echo two

is represented as follows:

(CommandList
  children: [
    (SimpleCommand
      words: [
        (CompoundWord parts:[(LiteralPart token:(token id:Lit_Chars val:echo span_id:0))])
        (CompoundWord parts:[(LiteralPart token:(token id:Lit_Chars val:one span_id:2))])
      ]
    )
    (SimpleCommand
      words: [
        (CompoundWord parts:[(LiteralPart token:(token id:Lit_Chars val:echo span_id:7))])
        (CompoundWord parts:[(LiteralPart token:(token id:Lit_Chars val:two span_id:9))])
      ]
    )
  ]
)

Spans (stored in an Arena):

    0 'echo'
    1 ' '
    2 'one'
    3 '  '
    4 '#'
    5 ' comment'
    6 '\n'
    7 'echo'
    8 ' '
    9 'two'
   10 '\n'
(11 spans)

Notice that span IDs 0, 2, 7, and 9, are significant and saved in the tree, but other spans aren't. To perform style-preserving source translation, we replace the significant spans, while leaving the insignificant ones in tact.

Although I haven't written this code yet, I believe that we can do the converse to perform AST-preserving reformatting. That is, we leave the significant tokens alone and change the insignificant ones. Please leave a comment if you've worked on a source code reformatter.

The next post will show more examples of the evolution from AST to Lossless Syntax Tree. I'll talk about how this change affects the program architecture.

I'll also go into more detail on the source translation algorithm, include related links, and ask some questions.

Spelling note: I'm now writing osh as OSH and oil as Oil, because languages are proper nouns and OSH is pronounced as an acronym.