Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.

Notes on OSH Architecture

This doc is written for contributors or users who want to understand the Oil codebase. These internal details are subject to change.

Table of Contents
List of Regex-Based Lexers
Parser Issues
Where We Re-parse Previously Parsed Text (Unfortunately)
Where VirtualLineReader is Used
Extra Passes Over the LST
Parser Lookahead
Lexer Unread
Where the Arena Invariant is Broken
Where Parsers are Instantiated
Runtime Issues
Where OSH Parses Code in Strings Formed at Runtime
Where Bash Parses Code in Strings Formed at Runtime (perhaps unintentionally)
Parse Errors at Runtime (Need Line Numbers)
Other Cross-Cutting Observations
Where $IFS is Used
Shell Function Callbacks
Where Unicode is Respected
Parse-time and Runtime Pairs
Other Pairs
Build Time
Dependencies
Borrowed Code
Generated Code
More
The OSH Parser
State Machines
Links

List of Regex-Based Lexers

Oil uses regex-based lexers, which are turned into efficient C code with re2c. We intentionally avoid hand-written code that manipulates strings char-by-char, since that strategy is error-prone: it's inevitable that rare cases will be mishandled.
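To make the regex-based approach concrete, here is a minimal sketch in Python. The token names and patterns are hypothetical and chosen only for illustration; they are not Oil's lexer definitions, and nothing here involves re2c. The point is that the whole lexer is a table of patterns rather than a hand-written character loop.

```python
import re

# Hypothetical token names and patterns, for illustration only.
# They are NOT the definitions used by Oil.
TOKEN_SPEC = [
    ("VAR_SUB", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("SPACE", r"[ \t]+"),
    ("OP", r"[|;&()]"),
    ("LITERAL", r"[^ \t\n|;&()$]+"),
    ("OTHER", r"[\s\S]"),  # catch-all so that every character lands in a token
]
MASTER_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(line):
    """Return a list of (token_type, text) pairs covering the whole line."""
    return [(m.lastgroup, m.group()) for m in MASTER_RE.finditer(line)]
```

Because the catch-all pattern matches any single character, joining the token texts reproduces the input exactly, which is an easy property to test.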

The list of lexers can be found by looking at native/fastlex.c.

Parser Issues

This section is about extra passes ("irregularities") at parse time. In the "Runtime Issues" section below, we discuss cases that involve parsing after variable expansion, etc.

Where We Re-parse Previously Parsed Text (Unfortunately)

This makes it harder to produce good error messages with source location info. It also has implications for translation, because we break the "arena invariant".

(1) Array L-values like a[x+1]=foo. bash allows splitting arithmetic expressions across word boundaries: a[x + 1]=foo. But I don't see this used, and it would significantly complicate the OSH parser.

(in _MakeAssignPair in osh/cmd_parse.py)

(2) Backticks. There is an extra level of backslash quoting that may happen compared with $().

(in _ReadCommandSubPart in osh/word_parse.py)

Where VirtualLineReader is Used

This isn't necessarily re-parsing, but it's re-reading.

Extra Passes Over the LST

These are handled up front, but not in a single pass.

Parser Lookahead

Lexer Unread

osh/word_parse.py calls lexer.MaybeUnreadOne() to handle right parens in this case:

(case x in x) ;; esac )

This is sort of like the ungetc() I've seen in other shell lexers.
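The idea of a one-token unread can be sketched as follows. This is a hypothetical class written for illustration, not Oil's lexer, and the method name maybe_unread_one only loosely mirrors MaybeUnreadOne(); the real method's behavior may differ.

```python
class OneTokenLexer:
    """Hypothetical lexer wrapper with a single-token unread (not Oil's class)."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0
        self._can_unread = False

    def read(self):
        """Return the next token, or None at end of input."""
        if self._pos >= len(self._tokens):
            return None
        tok = self._tokens[self._pos]
        self._pos += 1
        self._can_unread = True
        return tok

    def maybe_unread_one(self):
        """Push back the most recently read token.

        Returns False if there is nothing to unread, or if the last token
        was already pushed back (only one level of unread is supported).
        """
        if not self._can_unread:
            return False
        self._pos -= 1
        self._can_unread = False
        return True
```

A caller that reads a right paren and then decides it belongs to an enclosing construct can push it back and let the outer parser read it again.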

Where the Arena Invariant is Broken

Where Parsers are Instantiated

Runtime Issues

Where OSH Parses Code in Strings Formed at Runtime

(1) Alias expansion like alias foo='ls | wc -l'. Aliases are like "lexical macros".

(2) Prompt strings. $PS1 and family first undergo \ substitution, and then the resulting strings are parsed as words, with $ escaped to \$.

(3) Builtins.
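To illustrate why aliases in item (1) behave like lexical macros, here is a simplified sketch: the first word is replaced textually, and the result is treated as new source text that still has to be parsed. The function name and the whitespace-based word splitting are assumptions for illustration, and this is not how OSH implements alias expansion.

```python
def expand_alias(command, aliases, _seen=None):
    """Textually expand an alias in the first word of a command string.

    The result is plain source text; operators such as | inside the alias
    value only become syntax when that text is parsed afterwards.
    """
    seen = set() if _seen is None else _seen
    first, sep, rest = command.partition(" ")
    if first not in aliases or first in seen:
        return command  # not an alias, or already expanded (avoids loops)
    seen.add(first)
    return expand_alias(aliases[first] + sep + rest, aliases, seen)
```

With aliases = {"foo": "ls | wc -l"}, the call expand_alias("foo -x", aliases) produces a string containing a pipeline, even though the original command had none.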

Where Bash Parses Code in Strings Formed at Runtime (perhaps unintentionally)

All of the cases above, plus:

(1) Recursive Arithmetic Evaluation:

$ a='1+2'
$ b='a+3'
$ echo $(( b ))
6

This also happens for the operands to [[ x -eq x ]].

NOTE that a='$(echo 3)' results in a syntax error. I believe this was due to the ShellShock mitigation.

(2) The unset builtin takes an LValue. (not yet implemented in OSH)

$ a=(1 2 3 4)
$ expr='a[1+1]'
$ unset "$expr"
$ argv "${a[@]}"
['1', '2', '4']

(3) printf -v takes an "LValue".

(4) Var refs with ${!x} take a "cell". (Not yet implemented in OSH. Relied on by bash-completion, as discovered by Greg Price.)

$ a=(1 2 3 4)
$ expr='a[$(echo 2 | tee BAD)]'
$ echo ${!expr}
3
$ cat BAD
2

(5) test -v takes a "cell".

(6) ShellShock (removed from bash): export -f, all variables were checked for a certain pattern.
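Items (1) and (2) above can be modeled with a short sketch. It uses Python's ast module as a stand-in for a shell arithmetic grammar (integers, names, and the operators +, - and * only), and models a sparse array as a dict from index to value. The names eval_arith and unset_lvalue are hypothetical; this shows the general mechanism of re-parsing strings at runtime, not bash's or OSH's implementation.

```python
import ast
import operator
import re

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_LVALUE_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\[(.+)\]")


def eval_arith(expr, env, depth=0):
    """Evaluate expr; a variable's string value is itself parsed as code."""
    if depth > 100:
        raise RecursionError("variable references nest too deeply")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            # The recursive step: re-parse the variable's value at runtime.
            # Unset variables evaluate to 0 in this sketch.
            return eval_arith(env.get(node.id, "0"), env, depth + 1)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported syntax in %r" % expr)

    return walk(ast.parse(expr.strip(), mode="eval"))


def unset_lvalue(lvalue, arrays, env):
    """Parse a string like 'a[1+1]' at runtime and delete that element."""
    m = _LVALUE_RE.fullmatch(lvalue)
    if m is None:
        raise ValueError("not an array lvalue: %r" % lvalue)
    name, index_src = m.groups()
    index = eval_arith(index_src, env)  # the index is parsed as code too
    arrays[name].pop(index, None)
```

With env = {"a": "1+2", "b": "a+3"}, evaluating "b" re-parses two strings before producing a number. Likewise, unset_lvalue("a[1+1]", ...) has to parse its argument before it knows which element to remove. Both are places where a parse error can only be reported at runtime.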

Parse Errors at Runtime (Need Line Numbers)

Other Cross-Cutting Observations

Where $IFS is Used

Shell Function Callbacks

Where Unicode is Respected

See the doc on Unicode.

Parse-time and Runtime Pairs

Other Pairs

Build Time

Dependencies

Borrowed Code

Generated Code

More

The OSH Parser

TODO: Move this

The OSH parser is better than other shell parsers:

Bad: it's slower! This needs to be fixed.

Where the parser is reused:

State Machines

The point of a state machine is to make sure all cases are handled!

Links


Generated on Sat Dec 7 23:40:10 PST 2019