oilshell.org

Oil Cross Reference

A list of topics and anchors that the blog and other docs link to.

Table of Contents
Project Components
Oil Terms
Adjacent Projects
Research Projects
Unix System Calls
Relevant Software Libraries
Tools
For Shell Scripts
For Implementing Programming Languages
For Code Improvement
The Unix Shell
Useful Documents
Shell Language Terms
Shell Implementations
Programming Languages
Concepts
Little Languages / DSLs
Related Languages
Algorithms and Data Structures
Books

Project Components

#oil-language
Oil Language — A new dialect of shell that's available within the osh binary. It's parsed and evaluated like Python or Javascript, as opposed to being a "macro processor". It has:

Prior to August 2019: A new shell language, to which bash programs can be automatically translated. It has a superset of bash functionality, with a syntax designed all at once instead of evolved. It will also incorporate elements of Awk and Make.

#osh-language
OSH Language — A statically-parsable language based on the common use of shell, in particular bash. The design criteria for the language are:

#mycpp
mycpp — mycpp translates a large part of the Oil interpreter from statically-typed Python to C++. It's not a general-purpose translator.

It depends on MyPy, and you can think of it as a hybrid between the recent mypyc compiler and the old Shed Skin compiler.

#opy
OPy — A Python bytecode compiler based on pgen2 and compiler2. This small piece of code allows us to adapt Python to the needs of the Oil project. See Building Oil with the OPy Bytecode Compiler.

As of December 2019, we expect OPy to be replaced by mycpp, which generates faster code.

#boil
Boil — The working name for the part of Oil that subsumes GNU Make. No code for this exists yet.

#OVM
OVM — The virtual machine that OSH and Oil will run on. As an implementation detail, it's a fork of the CPython VM.

#OVM2
OVM2 — A nascent VM to replace Oil's use of the CPython VM.

#OHeap2
OHeap2 — A data format for OVM2 that is like a SmallTalk image or v8 snapshot. Inspired by the first version of oheap.

#readline
readline — A line-editing library derived from bash. It has emacs and vi modes.

#pylibc
pylibc — An extension module to expose libc functions to Python. Python implements its own glob() and fnmatch(), which differ from the ones in libc. We may also need libc's locale-aware string functions.
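
To make the difference concrete, here is a minimal sketch using only Python's standard library. It shows that the fnmatch module works by translating the glob pattern to a regular expression, and that it does not treat `/` specially. The file names are made up for illustration.

```python
import fnmatch
import re

# fnmatch.translate() turns a glob pattern into a Python regex string,
# so the matching is done by Python's re module rather than by libc.
regex = fnmatch.translate('*.py')

matches_py = re.match(regex, 'build.py') is not None
matches_txt = re.match(regex, 'notes.txt') is not None

# The path separator is not special to this module: '*' also matches '/'.
crosses_slash = fnmatch.fnmatchcase('src/build.py', '*.py')

print(regex, matches_py, matches_txt, crosses_slash)
```

The exact regex string printed depends on the Python version, so treat it as an implementation detail.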

Oil Terms

#nice-translation
Nice Translation — A shell-to-Oil translation that only uses recommended concepts in Oil.

#compatible-translation
Compatible Translation — A shell-to-Oil translation that uses Oil features that exist only for the sake of bash compatibility.

#naive-style
Naive Style — TODO

#pedantic-style
Pedantic Style — TODO

Adjacent Projects

#aboriginal-linux
Aboriginal Linux — Shell scripts that implement the minimal Linux system that can rebuild itself (discontinued as of April 2017).

#abuild
abuild — A 2500-line shell script that builds Alpine Linux packages.

#alpine-linux
Alpine Linux — A minimal Linux distribution based on musl libc and busybox.

#bash-completion
bash-completion — A companion project to bash that provides interactive completion for the common Unix commands. Most Linux distros use it, including Debian and Ubuntu. It consists of tens of thousands of lines of bash code.

#bwk
bwk — Some software archaeology I did on Kernighan's Awk, to research how Awk relates to the shell. (One interesting thing: neither implements first-class compound data structures, and thus both lack garbage collection.)

#autotools
GNU autotools — A meta-build system that generates configure shell scripts and Makefiles from m4 macros.

#busybox
BusyBox — A reimplementation of standard Unix command line utilities, commonly used on embedded Linux systems.

#debian
debian — One of the oldest and most popular Linux distributions. It uses the apt package manager, which wraps dpkg. Ubuntu is based on Debian.

#debootstrap
debootstrap — Debian uses this large shell program to construct its base image from binary packages.

#nix
Nix — A purely-functional package manager and Linux distribution. As with nearly all distributions, bash plays a fundamental role in building binary packages.

#pypy
PyPy — A Python interpreter written in Python (including a restricted subset called RPython). It has novel JIT technology and a focus on speed.

#tinypy
tinypy — An interpreter for a subset of Python written in just ~2K lines of C and ~2K lines of Python (using a very dense style). I used some tinypy code for my pratt-parsing-demo, and it inspired the plan for Oil to have a Python interpreter.

#toybox
Toybox — A reimplementation of standard Unix command line utilities, by the former maintainer of busybox.

#ninja
Ninja — A "low-level" build system focused on incremental build speed. High level languages like CMake generate Ninja build files.

#tmux
tmux — A Unix terminal multiplexer which provides a better interactive interface than shell job control. GNU Screen is another popular option.

Research Projects

#smoosh
Smoosh - The Symbolic, Mechanized, Observable, Operational Shell — A formalization of the POSIX shell standard. Source code (in Lem and OCaml) is available.

Unix System Calls

#chroot
chroot — A system call that gives a process a view of its own "virtual" file system. Linux container technology like Docker or LXC can be thought of as a "chroot on steroids".

Relevant Software Libraries

#libc
The C Standard Library — The shell communicates with the kernel through the C standard library. Popular implementations include GNU libc and musl libc.

#tokenize
Python tokenize module — A reimplementation of Parser/tokenizer.c in pure Python. Part of the Python standard library.
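
A small usage sketch of the module, assuming a one-line source string chosen for illustration:

```python
import io
import tokenize

source = "x = 1 + 2\n"

# generate_tokens() takes a readline callable that returns str lines
# and yields TokenInfo tuples (type, string, start, end, line).
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for name, text in tokens:
    print(name, repr(text))
```

Operators such as `=` and `+` are all reported with the generic type OP; the stream ends with NEWLINE and ENDMARKER tokens.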

#pgen2
pgen2 — A reimplementation of Parser/pgen.c in Python, done for lib2to3.

#compiler2
compiler2 — compiler2 is my name for the deprecated Python 2.7 compiler module. It does the same thing as Python/compile.c, but in Python.

#byterun
byterun — A Python bytecode interpreter loop written in Python, described in the AOSA Book. It does the same thing as ceval.c in CPython.

#dplyr
dplyr — A "modern" data frame library for R. Part of the Tidyverse. I use it to analyze Oil's code and dependencies.

#tidyverse
Tidyverse — Hadley Wickham created this set of R packages. They reinvent R's data structures and standard library through metaprogramming!

#yajl
Yet Another JSON Library — Oil uses this C library to parse and print JSON. Because Oil has Python's data structures, we use a fork of the py-yajl Python binding to wrap yajl's nice streaming API.

Tools

For Shell Scripts

#coreutils
coreutils — The GNU implementation of ls, cp, mv, etc. It also has versions of test, time, and kill, which are typically shadowed by similar-but-different shell builtins.

#find
find — A classic Unix tool that walks a directory tree, filters its entries, and performs actions. GNU findutils implements it.

Many users don't realize that find is an expression language like expr or test. It looks nothing like Awk, but they both apply predicates and actions to a stream.

#xargs
xargs — A tool that builds and executes command lines from stdin. A very useful GNU extension is xargs -P, which starts processes in parallel.

#expr
expr — An external tool that implements mathematical expressions for shell. It has been mostly subsumed by the POSIX $((1+2)) construct, and the [[ $mystr =~ $myregex ]] construct. GNU autotools still generates code that uses it.

For Implementing Programming Languages

#antlr
ANTLR — A tool to generate top-down parsers (LL(k), LL(*)). I ported the POSIX shell grammar to ANTLR to machine check it, but it's not used to generate code.

#yacc
yacc — A tool to generate bottom-up parsers. Bash uses yacc, which is a mistake discussed in this AOSA Book chapter on Bash.

#semantic-action
Semantic Action — The "right hand side" of a rule in a parser specification is a semantic action. It's typically a block of code in the host language, e.g. C or OCaml.

Yacc and re2c both use the model of semantic actions. ANTLR and Python's pgen.c and pgen2 prefer to materialize a parse tree. This means that there's an extra step to construct an AST.
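
The two models can be contrasted with a hand-written sketch. This is not how yacc or pgen2 work internally; the function names and the tiny "sum of numbers" grammar are assumptions made for illustration. The same rule-matching loop is driven once with actions that compute a value directly, and once with actions that only record what matched, which then needs a second pass.

```python
import re

def sum_tokens(s):
    return re.findall(r'\d+|\+', s)

def parse_sum(toks, on_num, on_add):
    """Match NUM ('+' NUM)*, running a callback ("action") per rule match."""
    result = on_num(toks[0])
    i = 1
    while i < len(toks):
        assert toks[i] == '+', 'expected +'
        result = on_add(result, on_num(toks[i + 1]))
        i += 2
    return result

toks = sum_tokens('1+2+3')

# Model 1: semantic actions compute the final value while parsing.
value = parse_sum(toks, on_num=int, on_add=lambda a, b: a + b)

# Model 2: only record which rule matched (a parse tree) ...
tree = parse_sum(
    toks,
    on_num=lambda t: ('num', t),
    on_add=lambda a, b: ('sum', a, '+', b),
)

# ... and walk the tree in a separate step.
def evaluate(node):
    if node[0] == 'num':
        return int(node[1])
    _, left, _, right = node
    return evaluate(left) + evaluate(right)

print(value, tree, evaluate(tree))
```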

#re2c
re2c — A tool that compiles regular expressions first to a DFA, and then to efficient C code consisting of mostly switch and goto statements. I use it to express multiple lexers in the Oil project.

The best part of it is that it's a library and not a framework.

#zephyr-asdl
Zephyr ASDL — Oil uses this domain-specific language to declare algebraic data types in Python and C++. We use it to represent both the syntax of shell programs and the interpreter's runtime data structures. See What is Zephyr ASDL? and posts tagged ASDL.

This article describes its use in Python. This SourceForge project contains the code.

#clang
Clang — A modular front end for C and C++ that supports IDEs and other tools (as well as the code-generating compiler). Oil has some similarities because we have multiple use cases for the parser: execution, interactive completion, a tool to convert the osh language to the oil language, and more.

#protobuf
Protocol Buffers — A schema language, serialization format, and set of APIs created and open-sourced by Google.

#spec-test
sh_spec.py — A test framework written for osh that runs shell snippets against many shells. See How I Use Tests.

#wild-test
Wild Tests — A test framework that tortures the OSH parser with real-world shell scripts.

#gold-test
Gold Tests — A type of test that compares the output of OSH and bash (or another existing shell). The assertions are implicit so you don't have to write them.

For Code Improvement

Themes: Correctness, security, performance.

#asan
AddressSanitizer — A compiler tool for detecting memory errors at runtime. That is, it's a kind of dynamic analysis. It solves roughly the same problem as Valgrind, but it's faster. Also known as ASAN.

#afl
American Fuzzy Lop — A fuzzer that uses compiler technology to efficiently explore code paths. In the last few years, it's been used to surface hundreds of bugs in ubiquitous and already well-tested pieces of open-source software. Its Wikipedia page is also helpful.

#perf
Linux perf — User-space tools and kernel APIs for Linux performance analysis. Uses CPU-specific features for accurate measurements.

#flame-graph
Flame Graph — A relatively new technique for visualizing profiler output. It shows how much execution time can be attributed to a particular call stack. Note that a set of function call stacks forms a tree: a function may call multiple functions.

This explains why flame graphs can also be used like treemaps, i.e. to visualize space used in a file system hierarchy.
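
The observation that call stacks form a tree can be sketched as follows. The sample stacks and function names are invented for illustration, and real profilers use their own formats. Each node's count is what a flame graph would draw as the width of that frame.

```python
def build_tree(samples):
    """Merge call stacks (outermost frame first) into one tree of counts."""
    root = {'count': 0, 'children': {}}
    for stack in samples:
        node = root
        node['count'] += 1
        for frame in stack:
            node = node['children'].setdefault(
                frame, {'count': 0, 'children': {}})
            node['count'] += 1
    return root

# Hypothetical profiler output: one list per sampled call stack.
samples = [
    ['main', 'parse', 'lex'],
    ['main', 'parse', 'lex'],
    ['main', 'parse'],
    ['main', 'eval'],
]

tree = build_tree(samples)
main = tree['children']['main']
parse = main['children']['parse']
print(main['count'], parse['count'], parse['children']['lex']['count'])
```

Replacing sample counts with byte counts and frames with path components gives the treemap-like use for file system hierarchies.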

#bloaty
Bloaty McBloatyFace — A code size profiler for compiled binaries. I used it to measure progress in stripping down the CPython interpreter.

#mypy
mypy — A type checker for Python. You can gradually add types to Python 2 or 3 code, and MyPy will check them for consistency before execution. There are some limitations to the code it understands, but many Python idioms are supported.

#pyannotate
PyAnnotate — A tool that records the types of Python variables at runtime, and then generates approximate static type annotations.

The Unix Shell

Useful Documents

#posix-shell-spec
POSIX Shell Spec: POSIX specification for the shell (sh). It seems that ksh was the dominant shell at the time of standardization, so bash implemented POSIX + a lot of ksh.

#posix-grammar
POSIX Shell Grammar: Subsection of the spec which has a BNF-style grammar.

#google-style-guide
Google Shell Style Guide -- Unofficial shell style guide at Google, which points out some deficiencies in the shell language. (Not all shell scripts at Google attempt to conform to this style.)

#aosa-book-bash
Chapter on Bash in the Architecture of Open Source Applications — An excellent article by bash maintainer Chet Ramey on bash's internal structure.

Shell Language Terms

Trivia about the Unix shell language, including the common ksh/bash extensions.

#here-doc
Here Document — A construct in shell for writing lines of text to be fed to stdin of a process. Perl, Ruby, and PHP borrowed here docs from shell.

#shell-builtin
Shell Builtin — A shell builtin is just like an external command, e.g. /bin/ls, except it's linked into the sh binary. It takes an argv array, returns an exit code, and uses stdin, stdout, and stderr.
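
A minimal sketch of that interface, assuming a hypothetical dispatch table named BUILTINS. It is not taken from any real shell: each builtin is just a function that takes argv and an output stream and returns an exit code.

```python
import io
import sys

def builtin_echo(argv, stdout):
    stdout.write(' '.join(argv[1:]) + '\n')
    return 0

def builtin_false(argv, stdout):
    return 1

# Hypothetical table of commands that are linked into the shell itself.
BUILTINS = {'echo': builtin_echo, 'false': builtin_false}

def run(argv, stdout=None):
    if stdout is None:
        stdout = sys.stdout
    builtin = BUILTINS.get(argv[0])
    if builtin is not None:
        return builtin(argv, stdout)
    # A real shell would fork and exec an external command here.
    # 127 is the conventional "command not found" status.
    return 127

out = io.StringIO()
status = run(['echo', 'hello', 'world'], out)
print(status, repr(out.getvalue()), run(['false'], out))
```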

#dynamic-scope
Dynamic Scope — A method of resolving variable names. In the case of Unix shell, it means that you look up the stack for variable references, rather than looking only in the current stack frame. Early Lisps used these semantics, but later Lisps switched to lexical scope.
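
The lookup rule can be sketched like this. The Frame class and variable names are assumptions for illustration, not any shell's actual data structures. The callee defines no `x`, so the lookup walks up to the caller's local, as it would for a shell function `g` called from a function `f` that declared `local x`.

```python
class Frame:
    def __init__(self, parent=None):
        self.vars = {}
        self.parent = parent  # the frame of whoever called us

def lookup_dynamic(frame, name):
    """Search the current frame, then each caller's frame in turn."""
    while frame is not None:
        if name in frame.vars:
            return frame.vars[name]
        frame = frame.parent
    return None  # unset

global_frame = Frame()
global_frame.vars['x'] = 'global'

caller = Frame(parent=global_frame)   # like f() { local x=...; g; }
caller.vars['x'] = 'local to caller'

callee = Frame(parent=caller)         # like g() { echo "$x"; }

found = lookup_dynamic(callee, 'x')
print(found)
```

With lexical scope, `g` would see the global `x` instead, because resolution would follow where `g` was defined rather than who called it.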

Shell Implementations

#bash
GNU Bash — The most popular shell implementation.

#dash
Debian Almquist Shell — A fork of the Almquist Shell that Debian and Ubuntu use for shell scripts, but not the default login shell. If you look at the busybox ash source code, it is apparent that they are similar. The things I notice most about it are that kebab-case function names aren't allowed, and it has a bug related to readonly and tilde expansion.

#fish
fish — Probably the most popular non-POSIX shell. It has a rich interactive experience.

#mksh
MirBSD Korn Shell — A fork of pdksh (Public Domain Korn Shell). This is the default shell on Android. Testing this shell against others has taught me that many "bash-isms" are actually "ksh-isms". bash implemented many ksh extensions for compatibility.

#zsh
zsh — zsh is probably the second most popular interactive shell, after bash. It's not POSIX-compliant by default, although it has options to make it POSIX compliant. Apparently, it doesn't split words by default.

#ksh
Korn Shell — ksh was an extension of the Bourne shell, developed at Bell Labs. pdksh and bash cloned many of its features.

#pdksh
Public Domain Korn Shell — A defunct clone of AT&T's Korn shell that survives in at least two forks: the OpenBSD shell and mksh.

Programming Languages

Concepts

#language-composition
Language Composition — When parsing almost any language, it's useful to think of it as a composition of sublanguages. Shell is an extreme case of this, but it's true for Python, JavaScript, HTML, etc.

#DSL
Domain-Specific Language — The Unix shell is glue for DSLs like sed, awk, find, expr, regexes, globs, and more. Oil is implemented with DSLs like re2c and Zephyr ASDL.

Little Languages / DSLs

#sed
sed — A text stream editor using a batch execution model.

#awk
Awk — A classic Unix programming language for text processing.

#extended-glob
Extended Glob — An unusual syntax in ksh and bash that gives globs the power of regular expressions.

#ERE
POSIX Extended Regular Expressions — The flavor of regex that bash supports.

#make
Make — A classic Unix build tool that is also a Turing-complete programming language.

#shell
Shell — An interactive program to control the Unix operating system, as well as a programming language. Oil aims to treat shell as a serious programming language.

#M4
M4 — GNU Autotools is written in the text preprocessor language M4. It's similar to the C preprocessor, except that it's Turing-complete. It was designed to support a dialect of Fortran.

Related Languages

#algol-like
ALGOL Family of Languages — C-like imperative languages with functions, loops, conditionals, etc.

#tcl
Tcl — An embedded scripting language that's influenced some alternative shells. It has Lisp-like properties.

#lua
Lua — Lua is an embedded scripting language, which means that the interpreter is a library. It has no global variables, and requires explicit capabilities for I/O. While I don't like Lua the language, this aspect of Lua will influence Oil.

#r-language
R language — A language for statistical computing, including data manipulation, modelling, and visualization.

#ML
ML — ML stands for "meta-language": a language for manipulating languages. The ML family of languages includes OCaml and Haskell, and its distinguishing feature is the data model of algebraic data types. The domain-specific language ASDL uses this data model.

#cpython
CPython — The standard implementation of the Python programming language, written in C.

#python
Python — The popular language that I wrote OSH in.

#ocaml
OCaml — A popular modern implementation of ML. If I hadn't prototyped OSH in Python, OCaml would have been a good choice. The compiler and runtime are well-engineered and well-documented. They may influence OPy.

Algorithms and Data Structures

#cfg
Context-Free Grammar -- A formalism for expressing the syntax of programming languages. Shell can only be partially specified using a CFG; the POSIX grammar is incomplete.

#DFA
DFA — A deterministic finite automaton is a mathematical notion of a state machine. A regular expression can be translated to a DFA via an NFA. You feed the string to the DFA and see if you end up in an "accept" state. That happens if and only if the string matches the regular expression.
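
As a sketch, here is a hand-built DFA for the regular expression `a(b|c)*d`. The state numbering is an arbitrary choice for illustration; a tool would derive the table automatically.

```python
# (state, input character) -> next state. Missing entries mean "reject".
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 1,
    (1, 'c'): 1,
    (1, 'd'): 2,
}
START = 0
ACCEPT = {2}

def dfa_match(s):
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False  # no transition: the string cannot match
    return state in ACCEPT

print(dfa_match('abcbd'), dfa_match('ad'), dfa_match('ab'), dfa_match('abda'))
```

There is exactly one current state at every step, which is why the loop needs no backtracking.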

#NFA
NFA — Every regular expression can be translated to an equivalent nondeterministic finite automaton. You can think of it as a state machine which magically "knows" which transition to take at each step. It's unintuitive to many programmers; a DFA is closer to our notion of computation.
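
One way to remove the "magic" is to track every state the NFA could be in at once. The sketch below does that for a hand-built NFA for `(a|b)*ab`; the state numbers are chosen for illustration. In state 0, the input `a` can either stay in state 0 or move to state 1, which is the nondeterministic choice.

```python
# (state, input character) -> set of possible next states.
NFA = {
    (0, 'a'): {0, 1},   # either keep looping, or guess this 'a' starts "ab"
    (0, 'b'): {0},
    (1, 'b'): {2},
}
START = 0
ACCEPT = 2

def nfa_match(s):
    states = {START}
    for ch in s:
        # Follow every possible transition from every current state.
        states = set().union(*(NFA.get((st, ch), set()) for st in states))
    return ACCEPT in states

print(nfa_match('ab'), nfa_match('aab'), nfa_match('ba'), nfa_match('abb'))
```

Each distinct set of states that shows up here corresponds to a single state of an equivalent DFA.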

#peg
Parsing Expression Grammar -- An alternative formalism to context-free grammars, which may be better-suited to expressing shell syntax.

#lexical-state
Lexical State — A simple parsing technique for dealing with language composition, i.e. "sublanguages" or "dialects". Renamed to lexer modes (because the lexer has other unrelated state).

#lexer-modes
Lexer Modes — A simple parsing technique for dealing with language composition, i.e. "sublanguages" or "dialects". Formerly lexical state. See posts on #lexing.
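
A minimal sketch of the idea, assuming two invented modes (OUTER and DQ) and token names for a shell-like double-quoted string. To keep it short, the lexer flips modes itself when it sees a quote; a fuller implementation might let the parser choose the mode.

```python
import re

# Each mode has its own ordered list of (token name, regex).
MODES = {
    'OUTER': [
        ('DQUOTE', r'"'),
        ('SPACE', r'\s+'),
        ('WORD', r'[^\s"]+'),
    ],
    'DQ': [
        ('DQUOTE', r'"'),
        ('VAR', r'\$[A-Za-z_]\w*'),
        ('LITERAL', r'[^"$]+'),
    ],
}

def lex(source):
    mode, pos, out = 'OUTER', 0, []
    while pos < len(source):
        for name, pattern in MODES[mode]:
            m = re.compile(pattern).match(source, pos)
            if m:
                out.append((mode, name, m.group()))
                pos = m.end()
                if name == 'DQUOTE':  # a quote enters or leaves DQ mode
                    mode = 'DQ' if mode == 'OUTER' else 'OUTER'
                break
        else:
            raise ValueError('unexpected character at offset %d' % pos)
    return out

result = lex('echo "hi $name"')
for item in result:
    print(item)
```

The same space character is a SPACE token outside the quotes but part of a LITERAL inside them, which is the point of having separate modes.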

#precedence-climbing
Precedence Climbing -- A simple algorithm for top-down parsing of expressions. It's a special case of top-down operator precedence parsing.
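
A compact sketch of the algorithm for integer arithmetic. The operator table and the tuple-based tree are illustrative choices, and the sketch does no error handling.

```python
import re

# operator -> (precedence, associativity)
BINARY = {
    '+': (1, 'L'), '-': (1, 'L'),
    '*': (2, 'L'), '/': (2, 'L'),
    '^': (3, 'R'),
}

def tokenize_expr(s):
    return re.findall(r'\d+|[-+*/^()]', s)

def parse_atom(toks):
    tok = toks.pop(0)
    if tok == '(':
        node = parse_expr(toks, 1)
        toks.pop(0)  # consume ')'
        return node
    return int(tok)

def parse_expr(toks, min_prec=1):
    left = parse_atom(toks)
    # Keep consuming operators that bind at least as tightly as min_prec.
    while toks and toks[0] in BINARY and BINARY[toks[0]][0] >= min_prec:
        op = toks.pop(0)
        prec, assoc = BINARY[op]
        # Left-associative operators require a strictly higher precedence
        # on the right; right-associative ones allow the same precedence.
        next_min = prec + 1 if assoc == 'L' else prec
        right = parse_expr(toks, next_min)
        left = (op, left, right)
    return left

print(parse_expr(tokenize_expr('1+2*3')))
print(parse_expr(tokenize_expr('8-3-2')))
print(parse_expr(tokenize_expr('2^3^2')))
```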

#tdop-parsing
Top-Down Operator Precedence Parsing -- Also called Pratt parsing, this is a general algorithm for parsing expressions with multiple levels of precedence.
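
A sketch in the traditional nud/led vocabulary, where each token has a "null denotation" (it starts an expression) and a "left denotation" (it continues one). The binding-power numbers and the class name are arbitrary choices for illustration.

```python
import re

def lex_expr(s):
    return re.findall(r'\d+|[-+*/^()]', s) + ['<eof>']

# Left binding power of infix tokens. Anything else, such as ')' or
# '<eof>', binds at 0 and therefore ends the current expression.
LBP = {'+': 10, '-': 10, '*': 20, '/': 20, '^': 30}

class PrattParser:
    def __init__(self, toks):
        self.toks = toks
        self.pos = 0

    def advance(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def peek(self):
        return self.toks[self.pos]

    def nud(self, tok):
        if tok == '-':  # prefix minus: tighter than '*', looser than '^'
            return ('neg', self.expression(25))
        if tok == '(':
            node = self.expression(0)
            self.advance()  # consume ')'
            return node
        return int(tok)

    def led(self, tok, left):
        bp = LBP[tok]
        # Right-associative '^' recurses with a slightly lower power.
        right = self.expression(bp - 1 if tok == '^' else bp)
        return (tok, left, right)

    def expression(self, rbp=0):
        left = self.nud(self.advance())
        while LBP.get(self.peek(), 0) > rbp:
            left = self.led(self.advance(), left)
        return left

def pratt_parse(s):
    return PrattParser(lex_expr(s)).expression()

print(pratt_parse('-2^2'), pratt_parse('1+2*3'))
```

The same `-` token has both a nud (prefix) and a led (infix), which is how this style handles tokens whose meaning depends on position.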

#recursive-descent
Recursive Descent Parsing -- The most widely-used parsing technique. Recursive descent parsers are written by hand, often following a grammar. Each recursive procedure in the parser corresponds to a "production" in a context-free grammar.

They are flexible, e.g. in accommodating ad hoc parsing rules and good error messages.

Recursive descent parsing is "top-down" parsing.
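
The correspondence between procedures and productions can be sketched with a small invented grammar for nested lists of integers. Each method is commented with the production it implements, and the errors show the kind of specific message a hand-written parser can give.

```python
import re

def lex_list(s):
    return re.findall(r'\d+|[\[\],]', s)

class ListParser:
    def __init__(self, toks):
        self.toks = toks
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError('expected %r at token %d, got %r'
                              % (tok, self.pos, self.peek()))
        self.pos += 1

    def parse_value(self):
        # value : NUMBER | list
        tok = self.peek()
        if tok == '[':
            return self.parse_list()
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError('expected a number or "[" at token %d, got %r'
                          % (self.pos, tok))

    def parse_list(self):
        # list : '[' (value (',' value)*)? ']'
        self.expect('[')
        items = []
        if self.peek() != ']':
            items.append(self.parse_value())
            while self.peek() == ',':
                self.expect(',')
                items.append(self.parse_value())
        self.expect(']')
        return items

parsed = ListParser(lex_list('[1, [2, 3], []]')).parse_value()
print(parsed)
```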

#top-down-parsing
Top-Down Parsing -- Parsing algorithms can be categorized as either top-down or bottom-up. ANTLR uses top-down algorithms, while yacc uses bottom-up algorithms. Pratt parsing is a top-down algorithm and recursive descent is a top-down technique. See LL and LR Parsing Demystified.

#AST
Abstract Syntax Tree — In contrast to an AST, a parse tree is derived only from the rules of the grammar for a language. You don't need to annotate your parser with nontrivial "semantic actions". The exact definition is debatable, but in my usage, an AST has some simplifications or annotations over a parse tree, depending on what you need to do with it: source-to-source translation, interpretation, code generation, etc.

#LST
Lossless Syntax Tree — A syntax tree with enough detail to reproduce the original source code.
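
The key property can be sketched at the token level: if whitespace and comments are kept as tokens instead of being discarded, concatenating the tokens gives back the source exactly. The regex and the sample line are assumptions for illustration, not Oil's lexer.

```python
import re

# Whitespace runs, comments, words, or any other single character.
# With re.S, '.' matches every character, so nothing is ever skipped.
TOKEN_RE = re.compile(r'\s+|#[^\n]*|\w+|.', re.S)

def lossless_tokens(source):
    return TOKEN_RE.findall(source)

source = 'echo  hi   # greet\n'
toks = lossless_tokens(source)

# Round trip: the token stream reproduces the input byte for byte.
round_trip = ''.join(toks)

# An AST would typically keep only the "significant" tokens.
significant = [t for t in toks if not t.isspace() and not t.startswith('#')]

print(toks, significant)
```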

#adt
Algebraic Data Types — A data model of sum and product types. This model is particularly convenient for representing the structure of programming languages.
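
A sketch of the model in plain Python, assuming a tiny invented expression language. Each dataclass is a product type (a fixed set of fields), and the Union of them is the sum type (exactly one of several alternatives).

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Const:
    value: int

@dataclass
class Add:
    left: 'Expr'
    right: 'Expr'

@dataclass
class Mul:
    left: 'Expr'
    right: 'Expr'

# The sum type: an Expr is a Const, an Add, or a Mul.
Expr = Union[Const, Add, Mul]

def evaluate(e: Expr) -> int:
    # One branch per alternative of the sum type.
    if isinstance(e, Const):
        return e.value
    if isinstance(e, Add):
        return evaluate(e.left) + evaluate(e.right)
    if isinstance(e, Mul):
        return evaluate(e.left) * evaluate(e.right)
    raise TypeError('not an Expr: %r' % (e,))

tree = Add(Const(1), Mul(Const(2), Const(3)))
print(evaluate(tree))
```

Handling a tree then amounts to one case per alternative, which is why the model suits syntax trees so well.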

#data-frame
Data Frame — A table data structure with dynamically typed columns. The R language is built around data frames, and the Pandas library borrowed this idea. It's similar to an SQL table, except that it generally lives in memory, rather than on a remote server's disk.

Books

#dsl-book
Domain Specific Languages by Martin Fowler — A book of patterns for implementing DSLs. Discusses lexical state.