A list of topics and anchors that the blog and other docs link to.
Oil Language — A legacy-free dialect of shell with:
echo $myfunc(x, y)
For a taste of the syntax, see The Simplest Explanation of Oil and A Tour of the Oil Language.
It shares the same runtime as OSH, so it's a smooth upgrade from both bash and OSH. Compatibility is selectively broken with Oil Options.
OSH Language — A compatible shell language based on the common use of shell (including POSIX, bash, and others). The design criteria for the language are:
In addition, it has four features that justify a new shell: reliable error handling, safe processing of user-supplied data, lack of "quoting hell", and better error messages and tools. These features are opt-in, as OSH is compatible by default.
Headless Shell — A mechanism to move the interactive shell into another process, outside of Oil's core. The Oil project is focused on a language for automation and glue, as opposed to a user interface.
Egg Expression — Oil's regular expression syntax, which has pattern composition and seamless integration with egrep, awk, and other Unix tools. It resembles Perl-style regex syntax, but literals are quoted and you can use whitespace to make patterns more readable.
mycpp — A tool that translates a subset of statically-typed Python to C++. It translates a large part of the Oil interpreter, but it's not a general-purpose translator.
It depends on MyPy, and you can think of it as a hybrid between the recent mypyc compiler and the old Shed Skin compiler.
OPy — A Python bytecode compiler based on pgen2 and compiler2. This small piece of code allows us to adapt Python to the needs of the Oil project. See Building Oil with the OPy Bytecode Compiler.
As of December 2019, we expect OPy to be replaced by mycpp, which generates faster code.
Boil — (obsolete) The working name for the part of Oil that subsumes GNU Make. No code for this exists yet.
oil-native — The build of Oil translated to C++ with mycpp. The resulting shell is 100% native code: i.e. there's no bytecode. When it's done, it will be the only Oil build, and we'll just call it "Oil".
OVM — A slice of the CPython interpreter, used as Oil's VM while it's being prototyped. It will be replaced with C++ code "metaprogrammed" with Python.
OVM2 — (obsolete) A nascent VM to replace Oil's use of the CPython VM.
OHeap2 — A data format for OVM2 that is like a Smalltalk image or V8 snapshot. Inspired by the first version of oheap.
readline — A line-editing library derived from bash. It has
pylibc — An extension module to expose libc functions to Python. Python implements its own versions of functions such as fnmatch(), which differ from the ones in libc. We may also need libc's locale-aware string functions.
wwz — A FastCGI program that serves the contents of a zip file. It makes it easy and fast to deploy thousands of small files to a web server, and back them up. We use it for test results, benchmarks, and continuous build logs. This Hacker News comment provides some color. It's a simple Unix-y solution.
Aboriginal Linux — Shell scripts that implement the minimal Linux system that can rebuild itself (discontinued as of April 2017).
abuild — A 2500-line shell script that builds Alpine Linux packages.
Alpine Linux — A minimal Linux distribution based on musl libc and busybox.
bash-completion — A companion project to bash that provides interactive completion for the common Unix commands. Most Linux distros use it, including Debian and Ubuntu. It consists of tens of thousands of lines of bash code.
Bash Line Editor — ble.sh gives you a fish-like interactive experience in bash, with syntax highlighting, completion, and vim-style editing. It's written in pure bash, and is likely the biggest and most sophisticated shell program in the world!
A long-term goal for Oil is to allow users to customize their shell this way, rather than hard-coding the UI in C++ or Python.
bwk — Some software archaeology I did on Kernighan's Awk, to research how Awk relates to the shell. (One interesting thing: neither implements first-class compound data structures, so neither needs garbage collection.)
GNU autotools — A meta-build system that generates configure shell scripts and Makefiles.
BusyBox — A reimplementation of standard Unix command line utilities, commonly used on embedded Linux systems.
Debian — One of the oldest and most popular Linux distributions. It uses the apt package manager, which wraps dpkg. Ubuntu is based on Debian.
debootstrap — Debian uses this large shell program to construct its base image from binary packages.
Nix — A purely-functional package manager and Linux distribution. As with nearly all distributions, bash plays a fundamental role in building binary packages.
PyPy — A Python interpreter written in Python (including a restricted subset RPython). It has novel JIT technology and a focus on speed.
tinypy — An interpreter for a subset of Python written in just ~2K lines of C and ~2K lines of Python (using a very dense style). I used some tinypy code for my pratt-parsing-demo, and it inspired the plan for Oil to have a Python interpreter.
Toybox — A reimplementation of standard Unix command line utilities, by the former maintainer of busybox.
Ninja — A "low-level" build system focused on incremental build speed. Higher-level build systems like CMake generate Ninja build files.
tmux — A Unix terminal multiplexer which provides a better interactive interface than shell job control. GNU Screen is another popular option.
Smoosh - The Symbolic, Mechanized, Observable, Operational Shell — A formalization of the POSIX shell standard. Source code (in Lem and OCaml) is available.
chroot — A system call that gives a process a view of its own "virtual" file system. Linux container technology like Docker or LXC can be thought of as a "chroot on steroids".
The C Standard Library — The shell communicates with the kernel through the C standard library. Popular implementations include GNU libc and musl libc.
Python tokenize module — A reimplementation of Parser/tokenizer.c in pure Python. Part of the Python standard library.
pgen2 — A reimplementation of Parser/pgen.c in Python, done for lib2to3.
compiler2 is my name for the deprecated Python 2.7 compiler module. It does the same thing as Python/compile.c, but in Python.
byterun — A Python bytecode interpreter loop written in Python, described in the AOSA Book. It does the same thing as ceval.c in CPython.
dplyr — A "modern" data frame library for R. Part of the Tidyverse. I use it to analyze Oil's code and dependencies.
TidyVerse — Hadley Wickham created this set of R packages. They reinvent R's data structures and standard library through metaprogramming!
Yet Another JSON Library — Oil uses this C library to parse and print JSON. Because Oil has Python's data structures, we use a fork of the py-yajl Python binding to wrap yajl's nice streaming API.
pexpect — A Python library to automate terminal applications like shells, passwd, etc. We use it to test the interactive shell.
coreutils — The GNU implementation of basic utilities such as mv. It also has versions of commands such as kill, which are typically shadowed by similar-but-different shell builtins.
grep — A tool to search files for patterns. Prefer the extended syntax (grep -E) to plain grep, because repetition looks like [0-9]+ rather than [0-9]\+. The former is more consistent with most other regular expression dialects.
find — A classic Unix tool that walks a directory tree, filters its entries, and performs actions. GNU findutils implements it.
Many users don't realize that find is an expression language like expr or test. It looks nothing like Awk, but they both apply predicates and actions to a stream.
xargs — A tool that builds and executes command lines from stdin. A very useful GNU extension is xargs -P, which starts processes in parallel.
expr — An external tool that implements mathematical expressions for shell. It has been mostly subsumed by the POSIX $((1+2)) construct, and the bash [[ $mystr =~ $myregex ]] construct. (GNU autotools still generates code that uses it.)
strace — A tool that prints the system calls that another process makes. For example, strace echo hi will show the write() syscall, among others. The -e flag contains a small expression language to filter what's printed.
ANTLR — A tool to generate top-down parsers (LL(*)). I ported the POSIX shell grammar to ANTLR to machine check it, but it's not used to generate code.
yacc — A tool to generate bottom-up parsers. Bash uses yacc, which is a mistake discussed in this AOSA Book chapter on Bash.
Semantic Action — The "right hand side" of a rule in a parser specification is a semantic action. It's typically a block of code in the host language, e.g. C or OCaml.
Yacc and re2c both use the model of semantic actions. ANTLR and Python's pgen.c and pgen2 prefer to materialize a parse tree. This means that there's an extra step to construct an AST.
re2c — A tool that compiles regular expressions first to a DFA, and then to efficient C code consisting mostly of goto statements. I use it to express multiple lexers in the Oil project.
The best part of it is that it's a library and not a framework.
Zephyr ASDL — Oil uses this domain-specific language to declare algebraic data types in Python and C++. We use it to represent both the syntax of shell programs and the interpreter's runtime data structures. See What is Zephyr ASDL? and posts tagged ASDL.
This article describes its use in Python. This SourceForge project contains the code.
Clang — A modular front end for C and C++ that supports IDEs and other tools (as well as the code-generating compiler). Oil has some similarities because we have multiple use cases for the parser: execution, interactive completion, a tool to convert the OSH language to the Oil language, and more.
Protocol Buffers — A schema language, serialization format, and set of APIs created and open-sourced by Google.
sh_spec.py — A test framework written for osh that runs shell snippets against many shells. See How I Use Tests.
Wild Tests — A test framework that tortures the OSH parser with real-world shell scripts.
Gold Tests — A type of test that compares the output of OSH and bash (or another existing shell). The assertions are implicit so you don't have to write them.
Themes: Correctness, security, performance.
AddressSanitizer — A compiler tool for detecting memory errors at runtime. That is, it's a kind of dynamic analysis. It solves roughly the same problem as Valgrind, but it's faster. Also known as ASAN.
American Fuzzy Lop — A fuzzer that uses compiler technology to efficiently explore code paths. In the last few years, it's been used to surface hundreds of bugs in ubiquitous and already well-tested pieces of open-source software. Its Wikipedia page is also helpful.
Linux perf — User-space tools and kernel APIs for Linux performance analysis. Uses CPU-specific features for accurate measurements.
Flame Graph — A relatively new technique for visualizing profiler output. It shows how much execution time can be attributed to a particular call stack. Note that a set of function call stacks forms a tree: a function may call multiple functions.
This explains why flame graphs can also be used like treemaps, i.e. to visualize space used in a file system hierarchy.
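The observation that call stacks form a tree can be sketched in a few lines of Python. This is only an illustration with made-up sample data and hypothetical names (build_call_tree, the "count" and "children" keys); it is not how any particular flame graph tool is implemented.

```python
from typing import Dict, List


def build_call_tree(samples: List[List[str]]) -> Dict:
    """Merge sampled call stacks (outermost function first) into one tree.

    Each node records how many samples passed through it, which is what
    the width of a flame graph box represents.
    """
    root = {"count": 0, "children": {}}
    for stack in samples:
        node = root
        node["count"] += 1
        for func in stack:
            # Stacks that share a prefix share the same path in the tree.
            node = node["children"].setdefault(
                func, {"count": 0, "children": {}})
            node["count"] += 1
    return root


# Hypothetical profiler samples: three stacks that all start in main().
SAMPLES = [
    ["main", "parse", "lex"],
    ["main", "parse"],
    ["main", "eval"],
]
TREE = build_call_tree(SAMPLES)
```

The same structure works for file system usage if you treat each path component as a "function" and each byte as a "sample".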
Bloaty McBloatyFace — A code size profiler for compiled binaries. I used it to measure progress in stripping down the CPython interpreter.
mypy — A type checker for Python. You can gradually add types to Python 2 or 3 code, and MyPy will check them for consistency before execution. There are some limitations to the code it understands, but many Python idioms are supported.
PyAnnotate — A tool that records the types of Python variables at runtime, and then generates approximate static type annotations.
uftrace — A unique and useful tool for user-space function tracing. You tell your C compiler to instrument a binary, run it under uftrace record, and query the results. I used it to speed up Oil's parser. I use shell so I can use and automate tools like uftrace. Shell helps you write better native code.
Open Container Initiative — A standard for containers based on Docker. Docker is being "refactored away" into something less monolithic and more Unix-y.
Docker — A monolithic toolkit for containers. It has a build tool based on a shell-like DSL, registry push/pull, and a container runtime.
Podman — A container runtime that's part of Red Hat's rewrite / refactoring of the Docker ecosystem. They are making Docker more modular and Unix-y, e.g. by eliminating a superfluous daemon.
POSIX Shell Spec: POSIX specification for the shell (sh). It seems that ksh was the dominant shell at the time of standardization, so bash implemented POSIX + a lot of ksh.
POSIX Shell Grammar: Subsection of the spec which has a BNF-style grammar.
Google Shell Style Guide -- Unofficial shell style guide at Google, which points out some deficiencies in the shell language. (Not all shell scripts at Google attempt to conform to this style.)
Chapter on Bash in the Architecture of Open Source Applications — An excellent article by bash maintainer Chet Ramey on bash's internal structure.
Trivia about the Unix shell language, including the common ksh/bash extensions.
Here Document — A construct in shell for writing lines of text to be fed to stdin of a process. Perl, Ruby, and PHP borrowed here docs from shell.
Shell Builtin — A shell builtin is just like an external command, e.g. /bin/ls, except it's linked into the sh binary. It takes an argv array and returns an exit code.
Dynamic Scope — A method of resolving variable names. In the case of Unix shell, it means that you look up the stack for variable references, rather than looking only in the current stack frame. Early Lisps used these semantics, but later Lisps switched to lexical scope.
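A small Python model may make the lookup rule concrete. It is a sketch under simplifying assumptions: the DynamicScope class and the inner/outer functions are hypothetical, and real shells add details such as exported and readonly variables.

```python
from typing import Dict, List


class DynamicScope:
    """Variables live in a stack of frames, one frame per function call."""

    def __init__(self) -> None:
        self.frames: List[Dict[str, str]] = [{}]  # frames[0] is global

    def push(self) -> None:
        self.frames.append({})

    def pop(self) -> None:
        self.frames.pop()

    def set_local(self, name: str, value: str) -> None:
        self.frames[-1][name] = value

    def lookup(self, name: str) -> str:
        # Walk UP the call stack, newest frame first. Lexical scope would
        # instead consult only the scopes enclosing the function's text.
        for frame in reversed(self.frames):
            if name in frame:
                return frame[name]
        return ""  # assumption: unset variables read as the empty string


def inner(scope: DynamicScope) -> str:
    scope.push()
    try:
        return scope.lookup("x")  # inner() never defines x itself
    finally:
        scope.pop()


def outer(scope: DynamicScope) -> str:
    scope.push()
    try:
        scope.set_local("x", "from outer")
        return inner(scope)  # inner() sees the caller's local x
    finally:
        scope.pop()
```

Calling inner() directly finds the global x, while calling it through outer() finds the caller's local. The result depends on the call path, not on where the function was written.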
Oil Procs — In Oil, shell-like functions are declared with the proc keyword. Think of them as "procedures" or "processes": they can write to stdout, and return an exit code.
Thompson Shell — The first Unix shell, written by Ken Thompson. It had pipelines and redirects, but it's not a programming language. It's an interactive tool that is notably separate from the Unix kernel.
See the paper in Unix Shell: History and Trivia.
Bourne Shell — A seminal upgrade to the Thompson shell, written by Stephen Bourne. It turned shell into a programming language with loops, conditionals, and functions. It allows you to redirect and pipe the I/O of these compound structures.
All modern Unix shells are descendants of the Bourne shell. That is, it "won" over other efforts like Bill Joy's C shell.
Stephen Bourne: Early Days of Unix and design of sh (2015, YouTube) is a nice historical overview of the project.
GNU Bash — The most popular implementation of Unix shell. It was the first program to run on the Linux kernel, circa 1991. Oil is largely compatible with it. Also see the Wikipedia page for bash.
Debian Almquist Shell — A fork of the Almquist Shell that Debian and Ubuntu use for shell scripts, but not as the default login shell. If you look at the busybox ash source code, it is apparent that they are similar. The things I notice most about it are that kebab-case function names aren't allowed, and it has a bug related to readonly and tilde expansion.
fish — Probably the most popular non-POSIX shell. It has a rich interactive experience.
MirBSD Korn Shell — A fork of pdksh (Public Domain Korn Shell). This is the default shell on Android. Testing this shell against others has taught me that many "bash-isms" are actually "ksh-isms". bash implemented many ksh features.
zsh is probably the second most popular interactive shell, after bash. It's not POSIX-compliant by default, although it has options to make it POSIX compliant. Apparently, it doesn't split words by default.
Korn Shell — ksh was an extension of the Bourne shell, developed at Bell Labs. pdksh and bash cloned many of its features.
Public Domain Korn Shell — A defunct clone of AT&T's Korn shell that survives in at least two forks: the OpenBSD shell and mksh.
Metaprogramming — A very general term for code that operates on code. Textual code generation, C macros, C++ templates, Python reflection, non-standard evaluation in R, and Lisp macros are all examples of metaprogramming.
In dynamic languages, the metaprogramming language is typically the language itself, while statically-typed languages require a different metaprogramming language. See Type Checking vs. Metaprogramming; ML vs. Lisp.
Metalanguage — In programming, a metalanguage is the language used to describe or implement another language. DSLs are often used as metalanguages. For example,
remodule. It's an abstract program but we cobbled together some concrete tools to express it.
Domain-Specific Language — The Unix shell is glue for DSLs like sed, awk, find, expr, regexes, globs, and more. Oil is implemented with DSLs like re2c and Zephyr ASDL.
Dependency Inversion — A style of programming that makes programs more modular. Most of the program is initialized in main() and "wired together".
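As a rough illustration of "wiring together" in main(), here is a minimal Python sketch. The Greeter class and its parameters are hypothetical, not taken from Oil's source.

```python
import io
import sys
from typing import TextIO


class Greeter:
    """Receives its output stream instead of reaching for sys.stdout."""

    def __init__(self, out: TextIO) -> None:
        self.out = out

    def greet(self, name: str) -> None:
        self.out.write("hello %s\n" % name)


def main(out: TextIO) -> None:
    # All dependencies are created and connected here, in one place.
    greeter = Greeter(out)
    greeter.greet("world")


if __name__ == "__main__":
    main(sys.stdout)
```

Because Greeter never names a global, a test can pass an io.StringIO() to main() and inspect what was written.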
String Hygiene -- A property of programs that means that code isn't confused with data. This is critical for security in distributed systems. Shell injection, SQL injection, and HTML injection (XSS) are examples of security problems arising from the lack of string hygiene. Solutions to the problem include avoiding string concatenation and applying proper language-specific escaping.
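Both solutions can be shown with Python's standard shlex module. The function names and the hostile filename below are made up for illustration; shlex.quote() and shlex.split() are real standard library calls.

```python
import shlex
from typing import List


def build_command_unsafe(filename: str) -> str:
    # Data is concatenated into code, so a shell would parse it as code.
    return "ls -l " + filename


def build_command_escaped(filename: str) -> str:
    # Language-specific escaping: quote the data for a POSIX shell.
    return "ls -l " + shlex.quote(filename)


def build_argv(filename: str) -> List[str]:
    # Avoiding concatenation: pass an argv array (e.g. to subprocess.run)
    # so that no shell ever parses the data.
    return ["ls", "-l", filename]


HOSTILE = "x; rm -rf ~"  # hypothetical attacker-controlled input
```

With the escaped form, splitting the command the way a POSIX shell would yields the hostile string as a single argument. With the unsafe form it breaks into several words, and a real shell would also treat the semicolon as a command separator.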
sed — A text stream editor using a batch execution model.
Awk — A classic Unix programming language for text processing.
Extended Glob — An unusual syntax in ksh and bash that gives globs the power of regular expressions. For example, *.@(sh|py) matches names ending in .sh or .py; the @(foo|bar) construct allows alternation.
POSIX Extended Regular Expressions — The flavor of regex that bash supports. grep supports it with the -E flag.
Make — A classic Unix build tool that is also a Turing-complete programming language.
Shell — An interactive program to control the Unix operating system, as well as a programming language. Oil aims to treat shell as a serious programming language.
M4 — GNU Autotools is written in the text preprocessor language M4. It's similar to the C preprocessor, except that it's Turing-complete. It was designed to support a dialect of Fortran.
ALGOL Family of Languages — C-like imperative languages with functions, loops, conditionals, etc.
Tcl — An embedded scripting language that's influenced some alternative shells. It has Lisp-like properties.
Lua — Lua is an embedded scripting language, which means that the interpreter is a library. Its implementation has no global variables, and I/O requires explicit capabilities. While I don't like Lua the language, this aspect of Lua will influence Oil.
R language — A language for statistical computing, including data manipulation, modelling, and visualization.
ML — ML stands for "meta-language": a language for manipulating languages. The ML family of languages includes OCaml and Haskell, and its distinguishing feature is the data model of algebraic data types. The domain-specific language ASDL uses this data model.
CPython — The standard implementation of the Python programming language, written in C.
Python — The popular language that I wrote OSH in.
OCaml — A popular modern implementation of ML. If I hadn't prototyped OSH in Python, OCaml would have been a good choice. The compiler and runtime are well-engineered and well-documented. They may influence OPy.
Context-Free Grammar -- A formalism for expressing the syntax of programming languages. Shell can only be partially specified using a CFG; the POSIX grammar is incomplete.
DFA — A deterministic finite automaton is a mathematical notion of a state machine. A regular expression can be translated to a DFA via an NFA. You feed the string to the DFA and see if you end up in an "accept" state. That happens if and only if the string matches the regular expression.
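To make "feed the string to the DFA" concrete, here is a hand-written DFA for the regular expression [0-9]+ (ASCII digits only). It is a sketch; the state names are arbitrary, and a tool like re2c would generate the equivalent of the step function for you.

```python
START, DIGITS, DEAD = 0, 1, 2
ACCEPT_STATES = {DIGITS}


def step(state: int, ch: str) -> int:
    """The transition function: exactly one next state per (state, char)."""
    if state == DEAD:
        return DEAD  # once the input has failed, it stays failed
    return DIGITS if "0" <= ch <= "9" else DEAD


def dfa_matches(s: str) -> bool:
    state = START
    for ch in s:  # each character is examined exactly once
        state = step(state, ch)
    return state in ACCEPT_STATES
```

The only memory the matcher has is the current state, which is why matching runs in time linear in the input and in constant space.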
NFA — Every regular expression can be translated to an equivalent nondeterministic finite automaton. You can think of it as a state machine which magically "knows" which transition to take at each step. It's unintuitive to many programmers; a DFA is closer to our notion of computation.
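One way to remove the "magic" is to track every state the NFA could be in at once. The sketch below does this for an NFA that recognizes (a|b)*ab. The state numbering and transition table are my own illustration, not taken from any library.

```python
from typing import Dict, Set, Tuple

# State 0 loops on 'a' and 'b'. On 'a' it may ALSO guess that the final
# "ab" has started and move to state 1. State 2 is the accept state.
TRANSITIONS: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
ACCEPT_STATE = 2


def nfa_matches(s: str) -> bool:
    current = {0}  # the set of all states the NFA could be in
    for ch in s:
        following: Set[int] = set()
        for state in current:
            following |= TRANSITIONS.get((state, ch), set())
        current = following
    return ACCEPT_STATE in current
```

Each distinct set of NFA states corresponds to one state of the equivalent DFA, which is the idea behind the usual NFA-to-DFA translation.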
Regular Language — The class of formal languages that "regexes" are based on. Perl-style regexes have many non-regular constructs, making them harder to recognize than regular languages.
Every regular language corresponds to a finite automaton that recognizes it. Roughly speaking, a DFA has no memory and looks at each byte of input exactly once.
Eggex encourages the use of regular languages, but it also has clear syntax for Perl-style backtracking constructs.
Parsing Expression Grammar -- An alternative formalism to context-free grammars, which may be better-suited to expressing shell syntax.
Lexical State — A simple parsing technique for dealing with language composition, i.e. "sublanguages" or "dialects". Renamed to lexer modes (because the lexer has other unrelated state).
Lexer Modes — A simple parsing technique for dealing with language composition, i.e. "sublanguages" or "dialects". Formerly lexical state. See posts on #lexing.
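A toy example of the technique, written in Python: the lexer below switches rule sets when it enters and leaves a double-quoted string. The mode names, token names, and patterns are invented for this sketch and are far simpler than Oil's real lexer.

```python
import re
from typing import List, Tuple

# Each mode has its own ordered list of (token name, pattern) rules.
RULES = {
    "outer": [
        ("DQ_BEGIN", r'"'),
        ("SPACE", r"[ \t]+"),
        ("WORD", r'[^ \t"]+'),
    ],
    "dq": [
        ("DQ_END", r'"'),
        ("VAR", r"\$[A-Za-z_][A-Za-z0-9_]*"),
        ("LITERAL", r'[^"$]+'),
        ("LITERAL", r"\$"),  # a lone $ that doesn't start a variable
    ],
}
COMPILED = {
    mode: [(name, re.compile(pat)) for name, pat in rules]
    for mode, rules in RULES.items()
}


def lex(s: str) -> List[Tuple[str, str]]:
    mode = "outer"
    pos = 0
    tokens = []
    while pos < len(s):
        for name, pattern in COMPILED[mode]:
            m = pattern.match(s, pos)
            if m:
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((name, m.group()))
        pos = m.end()
        # The same characters lex differently depending on the mode.
        if name == "DQ_BEGIN":
            mode = "dq"
        elif name == "DQ_END":
            mode = "outer"
    return tokens
```

Note that a space is a SPACE token in the outer mode but part of a LITERAL inside double quotes, which is the kind of sublanguage difference that modes are meant to capture.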
Precedence Climbing -- A simple algorithm for top-down parsing of expressions. It's a special case of top-down operator precedence parsing.
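The algorithm fits in a short Python sketch. The operator table, the tuple-based tree, and the tokenizer (which silently skips characters it doesn't know) are simplifying assumptions for illustration.

```python
import re
from typing import List, Optional

# operator -> (precedence, associativity); higher binds tighter
BINARY_OPS = {
    "+": (1, "left"), "-": (1, "left"),
    "*": (2, "left"), "/": (2, "left"),
    "^": (3, "right"),
}


def tokenize(s: str) -> List[str]:
    return re.findall(r"\d+|[-+*/^()]", s)


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self) -> Optional[str]:
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_atom(self):
        tok = self.advance()
        if tok == "(":
            node = self.parse_expr(1)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError("expected a number, got %r" % tok)
        return int(tok)

    def parse_expr(self, min_prec: int):
        left = self.parse_atom()
        while True:
            op = self.peek()
            if op not in BINARY_OPS or BINARY_OPS[op][0] < min_prec:
                return left
            prec, assoc = BINARY_OPS[op]
            self.advance()
            # Left-associative operators require a strictly higher
            # precedence on the right; right-associative ones don't.
            next_min = prec + 1 if assoc == "left" else prec
            right = self.parse_expr(next_min)
            left = (op, left, right)


def parse(s: str):
    parser = Parser(tokenize(s))
    tree = parser.parse_expr(1)
    if parser.peek() is not None:
        raise SyntaxError("unexpected token %r" % parser.peek())
    return tree
```

Adding an operator means adding a row to the table rather than a new grammar rule and function, which is the main appeal over plain recursive descent for expressions.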
Top-Down Operator Precedence Parsing -- Also called Pratt parsing, this is a general algorithm for parsing expressions with multiple levels of precedence.
Recursive Descent Parsing -- The most widely-used parsing technique. Recursive descent parsers are written by hand, often following a grammar. Each recursive procedure in the parser corresponds to a "production" in a context-free grammar.
They are flexible, e.g. in accommodating ad hoc parsing rules and good error messages.
Recursive descent parsing is "top-down" parsing.
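Here is a small Python sketch in which each procedure corresponds to one production. The grammar (nested lists of integers) and all names are invented for the example:

    value : NUMBER | list
    list  : '[' (value (',' value)*)? ']'

```python
import re
from typing import List, Optional


class ParseError(Exception):
    pass


def tokenize(text: str) -> List[str]:
    # Numbers, brackets, commas; any other non-space character becomes
    # its own token so the parser can report it.
    return re.findall(r"\d+|[\[\],]|\S", text)


class ListParser:
    def __init__(self, text: str) -> None:
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected: str) -> None:
        tok = self.peek()
        if tok != expected:
            raise ParseError("expected %r at token %d, got %r"
                             % (expected, self.pos, tok))
        self.pos += 1

    def parse_value(self):
        """value : NUMBER | list"""
        tok = self.peek()
        if tok == "[":
            return self.parse_list()
        if tok is not None and tok.isdecimal():
            self.pos += 1
            return int(tok)
        raise ParseError("expected a number or '[' at token %d, got %r"
                         % (self.pos, tok))

    def parse_list(self):
        """list : '[' (value (',' value)*)? ']'"""
        self.eat("[")
        items = []
        if self.peek() != "]":
            items.append(self.parse_value())
            while self.peek() == ",":
                self.eat(",")
                items.append(self.parse_value())
        self.eat("]")
        return items


def parse(text: str):
    parser = ListParser(text)
    value = parser.parse_value()
    if parser.peek() is not None:
        raise ParseError("unexpected trailing token %r" % parser.peek())
    return value
```

Because the parser is ordinary code, the error messages can say exactly what was expected at each point, which is harder to arrange with a generated parser.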
Top-Down Parsing -- Parsing algorithms can be categorized as either top-down or bottom-up. ANTLR uses top-down algorithms, while yacc uses bottom-up algorithms. Pratt parsing is a top-down algorithm and recursive descent is a top-down technique. See LL and LR Parsing Demystified.
Abstract Syntax Tree — In contrast to an AST, a parse tree is derived only from the rules of the grammar for a language. You don't need to annotate your parser with nontrivial "semantic actions". The exact definition is debatable, but in my usage, an AST has some simplifications or annotations over a parse tree, depending on what you need to do with it: source-to-source translation, interpretation, code generation, etc.
Lossless Syntax Tree — A syntax tree with enough detail to reproduce the original source code.
Algebraic Data Types — A data model of sum and product types. This model is particularly convenient for representing the structure of programming languages.
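Python has no built-in sum types, but dataclasses plus a Union give a reasonable approximation. The tiny expression language below is hypothetical and is not Oil's actual ASDL schema.

```python
from dataclasses import dataclass
from typing import Union


# Product types: each class bundles a fixed set of fields.
@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


# Sum type: an Expr is exactly one of these alternatives.
Expr = Union[Const, Add, Mul]


def evaluate(e: Expr) -> int:
    """Dispatch on the variant, the way an interpreter walks a tree."""
    if isinstance(e, Const):
        return e.value
    if isinstance(e, Add):
        return evaluate(e.left) + evaluate(e.right)
    if isinstance(e, Mul):
        return evaluate(e.left) * evaluate(e.right)
    raise TypeError("not an Expr: %r" % (e,))


# 1 + 2 * 3, written directly as a tree
TREE = Add(Const(1), Mul(Const(2), Const(3)))
```

A type checker like MyPy can then check that code handling Expr only touches fields that exist on the variant being handled.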
Data Frame — A table data structure with dynamically typed columns. The R language is built around data frames, and the Pandas library borrowed this idea. It's similar to an SQL table, except that it generally lives in memory, rather than on a remote server's disk.
Perlis-Thompson Principle — A software architecture concept distilled from statements by Alan Perlis and Ken Thompson. Short definition: Software with fewer concepts composes, scales, and evolves more easily. This is a tradeoff, not a hard rule.
Narrow Waist — The narrow waist (of an hourglass) is a software concept that solves an interoperability problem, avoiding an O(M × N) explosion. All of these are narrow waists:
O(M × N) code explosion — A system may need bespoke code to fill in every cell of a grid, like M algorithms and N data structures, or M languages and N operating systems. This problem can often be mitigated by better software architecture, e.g. with protocols, interchange formats, or intermediate representations.
Application Programming Interface (API) — A software interface specified in a programming language, often with static linking. Contrast with ABI: Application Binary Interface.
Application Binary Interface (ABI) — The "runtime reality" of a software interface, often derived from an API. The Actually Portable Executable project takes this idea to an extreme, building on the x86-64 Linux ABI. It essentially ignores the APIs and "puns" multiple ABIs.
Inter-Process Communication — A type of software composition that involves messages exchanged between processes. It differs from composition via APIs in that the programs on each side of the "wire" aren't compiled and deployed together, aren't synchronized in the same "thread", and may be written in different programming languages.
IPC is similar to networking, but the links are reliable rather than unreliable. RPC abstractions can be built on top of IPC or networking.
Common Gateway Interface — A Unix-y protocol for creating dynamic web content. It was more popular in the 90's, but is still used today. The more complex FastCGI protocol can fix performance problems.
UTF-8 — The best and most popular Unicode encoding. It's backward-compatible with ASCII, so less code has to be rewritten to support Unicode. See blog posts tagged
Quoted String Notation (QSN) — A data format for strings which looks like 'foo \x00 bar\n'. It's an adaptation of Rust's string literal syntax with two main use cases:
Quoted, Typed Tables — An enhancement of TSV and CSV that is built on QSN. This is the foundation for structured data in Oil. Any language that has a JSON library should also have a QTT library.
QTSV — An old name for QTT.
YAML — A human-editable configuration file syntax that's a superset of JSON. It's quirky, but widely used in the cloud. It confuses values like the string "NO" and the boolean false.
Domain Specific Languages by Martin Fowler — A book of patterns for implementing DSLs. Discusses lexical state.
Zulip Chat — Zulip is a hybrid of e-mail and chat that Oil users and developers can use. Log in to oilshell.zulipchat.com with Github or Google. I sometimes summarize Zulip threads in blog posts tagged #zulip-links.