|
blog | oilshell.org
Notes for Houston Functional Programmers Talk
2024-05-14
This post has notes for this event:
I'm publishing it mainly so the audience has something to follow along with.
This material isn't that polished, and I hope that the audience will help me improve it! Ask me questions during the talk, and feel free to send feedback to andy at oilshell.org.
This event won't be recorded, but please invite me to speak to small groups interested in the internals of programming languages and Unix.
Right now my main goal is to attract contributors. You can even be paid to work on Oils!

Diagram: re2c output for recognizing a C-style string with backslash escapes
Note to self: skip some of the sections below! There's a lot of detail, and many opportunities for tangents / diversions.
Talk Intro
This talk has both:
- Stories and rationale
- History / software design / philosophy / Unix / cloud / ... (which could take up many hours)
- Code Snippets / Demos
- Concrete experience, lessons learned
- "Things we learned the good / hard way" - a tidbit
Specifically:
- Regular Languages with re2c - background, tidbits, demos
- Algebraic Data Types with Zephyr ASDL
- Unix shell to glue all the generated code together
- Textual metaprogramming (contrast with Lisp metaprogramming)
I think these two topics should appeal to functional programmers - reasoning by sets, rather than by states (over time).
I will probably lean toward the latter, since the former is on the blog.
- Audience question: How many people have heard of the Oils project (aka "Oil shell")?
Links
Output from wc -l for first demo:
54 demo/houston-fp/favorite.re2c.cc
Second demo:
26 demo/houston-fp/demo.asdl
26 demo/houston-fp/demo_main.py
52 total
Automation
197 demo/houston-fp/run.sh
What to take away?
A feeling for what it's like to work on Oils.
Anyone who works on this project will learn some things! I certainly have. (Not in this talk: translating Python to C++, garbage collected runtime, ...)
Tidbits/slogans about regular languages and algebraic data types.
Project Intro
What is Oils? A few slogans
On the home page https://www.oilshell.org/:
- Our upgrade path from bash to a better language and runtime.
- OSH runs your existing shell scripts. It's the most bash-compatible shell, by a mile.
- YSH is for Python and JavaScript users who avoid shell.
- There's an upgrade path from OSH to YSH. But you can use OSH by itself, or YSH by itself.
- New slogan: A Small Tool That Unifies Shell, Python, Regex, JSON, and YAML.
- Old Unix "Sludge" → New Cloud "Sludge"
- Discuss status of each part
Why work on it?
Warning: the project is a weird mix of practical and theoretical.
- What could be more practical than a better bash?
- "Programmer fantasy" of rewriting it all to be simpler
- But reusing existing software via coarse-grained processes
- After a career in Silicon Valley, I'm not happy with how software works. Both the amount and the quality.
- Let's look to the past, for systems that worked over long periods of time.
- A big experiment in #software-architecture.
The project has a lot of metaprogramming for "leverage" -- this has upsides and downsides. Does Oils have the curse of Lisp?, etc.
About me:
- In high school/college, I liked math more than programming.
- Masters in CS, but took math instead (geometry, combinatorics, ...)
- Electronic Arts - 2002-2004
- Google - 2005-2016
- The first time I used Unix!
- So much Unix at Google. Why are we still using Unix? Isn't it archaic and old?
- Posts tagged #software-architecture
Project Status
Briefly:
- 2016-2022 - personal project
- 2022-2024 - essential funding from NLnet, and essential contributors
Demo
Audience Survey
- How many people use shell?
- ... have written 10 lines, 100 lines, 1000 lines?
- How many people have used either JavaScript or Python?
- Oils as a "second language"
- How many people are familiar with how languages are implemented, e.g. interpreters or compilers?
- Anyone who has worked professionally on a language/compiler/interpreter?
- Statically typed vs. dynamic functional programming?
- Oils has influences from both, at the language level, and the metalanguage level
OSH - Two complete implementations
osh -n - totally different internals than bash
- neofetch - 11.5K lines of bash
- running
bin/osh (Python) and osh-native
- Make-a-Lisp - Lisp Interpreter in bash
andy@hoover:~/git/oilshell/oil$ time bin/osh ~/git/other/neofetch/neofetch
_,met$$$$$gg. andy@hoover
,g$$$$$$$$$$$$$$$P. -----------
,g$$P" """Y$$.". OS: Debian GNU/Linux 12 (bookworm) x86_64
,$$P' `$$$. Host: Intel Corporation NUC11PABi5
',$$P ,ggs. `$$b: Kernel: 6.1.0-9-amd64
`d$$' ,$P"' . $$$ Uptime: 58 days, 13 hours, 29 mins
$$P d$' , $$P Packages: 2160 (dpkg)
$$: $$. - ,d$$' Shell: bash 5.2.15
$$; Y$b._ _,d$P' Resolution: 3840x2160
Y$$. `.`"Y$$$$P"' DE: GNOME 43.4 (Wayland)
`$$b "-.__ Theme: Adwaita [GTK2/3]
`Y$$ Icons: Adwaita [GTK2/3]
`Y$$. Terminal: tmux
`$$b. CPU: 11th Gen Intel i5-1135G7 (8) @ 4.200GHz
`Y$$b. GPU: Intel TigerLake-LP GT2 [Iris Xe Graphics]
`"Y$b._ Memory: 12792MiB / 15639MiB
`"""
real 0m1.726s
user 0m1.145s
sys 0m0.357s
~/git/languages/mal/impls/bash$ ./stepA_mal.sh ../../tests/incA.mal
9
~/git/languages/mal/impls/bash$ osh ./stepA_mal.sh ../../tests/incA.mal
9
~/git/languages/mal/impls/bash$ osh ./stepA_mal.sh ../../tests/print_argv.mal a 42 'b c d\'
("a" "42" "b c d\\")
~/git/languages/mal/impls/bash$ ./stepA_mal.sh ../../tests/print_argv.mal a 42 'b c d\'
("a" "42" "b c d\\")
YSH - Polished core, more work ahead
Demos:
- Garbage-collected data structures
- JSON, pretty printing
Two major pain points gone:
- Removed silent string splitting - the "wrong default" - which causes data-dependent bugs
- Shell ignores errors, even if you ask it to handle them (POSIX spec)
Upgrade path:
shopt --set ysh:upgrade
shopt --set ysh:all
shopt --set strict:all
- Analogies to C++, TypeScript, Kotlin, etc. - Oils 2023 FAQ
- Anecdotes: I was on a project that switched from C to C++ in ~2003, >20 years after it was started
- GCC switched to C++ in ~2013, >30 years after it was created
The "Middle Out" Style
Questions that motivate the project
(Background for the "middle out" style - Go through this section QUICKLY. Why is Oils big? Why is it taking along time?)
Remember Oils is a mix of practical and theoretical. Scope has always been a problem. Some open questions:
- Can we statically parse shell? Yes.
- Was not obvious, was the subject of a research paper - Menhir issues
- Treesitter issues
- Our solution: "Lexer Modes" and interleaved parsers
- Will an interpreter written against that code representation be compatible with POSIX and bash?
- Yes. And we have identified the sources of dynamism (aliases, backticks, etc.)
- Can we write the interpreter in a high level language?
- All shells are written in C. Can it be as fast as code written in C?
- What language do we write it in? (A collection of DSLs)
- What's the difference between an executable language spec, and a tool people use? (a "canonical" shell, a better POSIX shell)
- What's the difference between a textbook implementation, and a tool people use? (pedagogy)
- Can we upgrade bash to a language familiar to JS and Python users? (including the experience of the last 30 years)
- All signs point to yes, OSH to YSH. More usage and feedback needed.
Related - https://www.oilshell.org/blog/2021/12/backlog-project.html#oil-is-a-bunch-of-experiments-that-succeeded
The third question turned out to be harder than the first question:
- It involved more "false starts"
- It involved gradual typing -> static typing, writing a garbage collector (not covered in this talk).
A collection of languages to implement
Many interleaved / mutually recursive languages, many interleaved parser / evaluators.
- Old "sludge" - Bourne shell + ksh/bash
- Make, autoconf/m4, CMake, ...
- New "sludge" - shell, JSON, YAML, Go templates, etc.
Oils puts them under the same roof. (Paradox of the project: encourage polyglot programming, but also reduce language cacophony from tiny DSLs.)
Analogy: HTML used to contain Flash code and Java applets, now it contains <video> and WebAssembly
A collection of evolved DSLs, for "compressing" languages
Oil Is Being Implemented "Middle Out" (2022)
A collection of DSLs:
- Regular Languages
- Zephyr ASDL
- pgen2 for the YSH grammar, also borrowed from CPython
- Typed Python with mycpp
All these little compilers/translators are in our source tree:
How did I arrive at this? Write the simplest possible code that works, then refactor.
I think of it as "compression" or "vertical factoring". To reduce repetition and gain consistency.
Oils (OSH + YSH + ...) is 50K-60K lines of source code, compared to 140K lines of C for bash.
Nearly all language implementations use at least 1 or 2 internal DSLs (CPython, Go, etc.) But most don't have two complete runnable implementations. (Exception: PyPy)
Thought experiment: implement as many parsers and evaluators as you can, and then refactor the code to be smaller. What do you end up with?
Oils vs. Crafting Interpreters
(Short section for BACKGROUND)
Audience questions:
- How many people have heard of this book? Read it?
- What's the difference between part 1 and part 2?
Comparison:
- lexer: generated with regular languages, vs. hand-written
- parsing
- AST node types:
- Crafting interpreters uses a little script to generate node types (likewise with dash shell, etc.)
- This is just like ASDL, but there are no sum and product types
- ... (material for later)
Regular Languages - with re2c
What is it? / Slogans
re2c is a tool that generates C state machines (switch and goto) from regular expressions. I heard about from Performance of Open Source Applications: Ninja. (Also used in CommonMark reference implementation, PHP, ...)
- Avoid crawling through backslashes and braces one at time
- If you like algebraic data types, you should like regular languages
- Audience question: What do they have in common?
- Regular Languages Give Non-Deterministic Lookahead
- I've noticed that some programmers aren't comfortable with non-determinism, reasoning by sets rather than by states
$( vs. $(( etc.
- Showing state machine
Demo: My Favorite Regular Expression
My favorite regex is:
"([^"\]|\\.)*"
Many years ago, when reading CPython's tokenize.py module, I was surprised to learn that C-style strings with backslash escapes are regular languages.
(Audience question: Perl-style regexes vs. regular languages?)
Recently: Storing Data in Control Flow (Russ Cox, 2023)
Demos:
- A tiny self-contained C program
- Input that's not handled - exhaustiveness check
- Code that's not executed - unreachable check
What could be improved about the demo:
- Show equivalent imperative C code
- Show exact bash code vs. Oils code
Diversion: In Eggex Syntax
How about explaining it like this?
osh$ var pat = / DQ ( ![DQ Backslash] | Backslash dot )* DQ /
Aside: in Eggex — one level up — it's
osh$ var pat = / '"' ( !['"' r'\'] | r'\' dot )* '"' /
osh$ echo $pat
"([^"\\]|\\.)*"
Matching:
osh$ = r'"foo\n"' => leftMatch(pat)
<Match 0x1c23e>
$ = r'"' => leftMatch(pat)
(Null) null
Refactored:
osh$ var Backslash = r'\'
osh$ var pat = / '"' ( !['"' Backslash] | Backslash dot )* '"' /
then
osh$ var DQ = r'"'
osh$ var pat = / DQ ( ![DQ Backslash] | Backslash dot )* DQ /
osh$ echo $pat
"([^"\\]|\\.)*"
History - Using this abstraction all over
And evolving the abstraction.
- 2016 - Started the project in C++ with re2c
- 2017 - Translator from Python regexes to C extension which wraps code generated by re2c
- 2021 - "Triumph of Lexer Modes" (extended globs)
- 2024 - Unifying JSON and shell string literals with yet more lexer modes ...
- Nice way of writing a JSON parser, and upgrading it.
- Tidbit: Surrogate pairs recognized with regular languages (compare with other JSON parsers)
- Some zsh compatibility (non-deterministic lookahead / "free" lookahead)
Aside: Line between lexing and parsing isn't obvious: https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate
Key idea - Lexer Modes for Interleaved Languages
Algebraic Data Types in Imperative Languages - Zephyr ASDL
What is it? / Slogans
Aside: re2c is also from the 90's, which postdates both Unix shell (~1970) and the complaints quoted in 2019 above (1993-1994).
Tidbit: Case Classes / Union Types
Can use audience help in explaining this.
I also "need" this metalanguage feature now, based on experience from implementing many little languages. We want an "executable spec", so it should be short, but:
- Efficient
- Have good user experience, especially error handling
What's the difference between a textbook implementation and a implementation people use? Contrast with "textbook" Standard ML, e.g. PL Zoo
Technical descriptions:
- Decoupling static type and dynamic tag
loc_t is a static type, Token is a "struct" with a static type, loc_e.Token is a dynamic tag
- Tree vs. Graph shape
Other reference points:
- GADTs in OCaml?
- Rust
into() - not the same I think
- Rich Hickey: Maybe Not - talks about union types vs. sum types
- modeling data first, before types
- static models vs. dynamic reality
Example: Modeling the Structure of Languages
Audience question: What's decl expr stmt ?
Python is roughly expr stmt.
OSH is
command_t (similar to stmt)
word_part_t word_t
arith_expr_t
bool_expr_t
- plus non-recursive sublanguages / dialects, sometimes dynamically parsed
YSH is
- Borrowed from above:
command_t word_part_t word_t
expr_t
re_t aka Eggex
Examples of free-floating / first class variants:
- Token
- SingleQuoted, DoubleQuoted, CommandSub - in words and expressions
But not just these dialects. Also error handling, word evaluation, more.
Demo in Python
- Usages in Oils Code
- Small Python example
- generate code in Python and C++
- Type check
- Run
History: CPython is pretty dynamic, Oils Dynamic → Static
CPython's use of ASDL is pretty dynamic:
- Target language C doesn't have subtyping - e.g.
Token < loc_t
- It's used to reflect on Python's syntax, in Python -
PyObject*
Oils was a dynamic program. Again, we started with the simplest possible code.
- ASDL started out as a schema language for the AST / "lossless syntax tree"
- translated to untyped Python
- Then it became used for many other data structures
- "the missing half" of Python
- Then it I wrote a compiler to statically typed MyPy
- added
Dict[K, V] type
- added "first class variants" / "case classes"
- And then a compiler to C++
Now we have "pleasant refactorings" with static types.
But Algebraic data types without static typing is still useful! Illegal states still not representable.
(Tangent: Why not OCaml on wiki)
Why dynamic?
Possible Future Work
- Ambitious transformations of the entire interpreter, for performance
- Tagged pointers / Collapsing interpreter in space, in addition to time
Conclusion
A Metaprogrammed Shell
We grew the language and the metalanguage at the same time. It can be thought of as:
- An executable design
- Two complete implementations
The "middle out" style is a bunch of custom and evolved DSLs, for code compression:
- Regular languages with re2c
- Algebraic data types with Zephyr ASDL
- Not covered
- mycpp - Typed Python
- pgen2 - Grammar for YSH
- small code generators, e.g. for flag parsing, documentation
Memes to remember:
- Regular Languages
- If you like algebraic data types, you should like regular languages (conditional logic is data, not code)
- My favorite regex
- Eggex makes regexes composable, with nicer syntax
- Algebraic Data Types
- Oils was a dynamic program before it was a static program!
- Case Classes Better Than Sum Types.
- Union Types - dynamic data before static types / models vs. reality. (Repeating the experiment: write static type definitions for shell ahead of time)
Help Wanted
We need help from people interested in language implementation and design.
Major things "done":
- OSH (still can change based on feedback)
- C++ translation
- Core data model of YSH
- The language is converging
- People are writing YSH without me asking
Still TODO:
- Pure functions! In addition to shell-like procs.
- Hay Ain't YAML - Adding the missing declarative part to shell.
- All distros have a declarative part.
- Module System, similar to Python
- Deployment Tools, Tree Shaking
- Tools - using the "lossless syntax tree"
- Formatter
- Syntax Highlighters (vim, TextMate, TreeSitter, etc.)
- Documentation
- Polish
- Interactive Shell / Completion / Headless Shell
(I really want to make an distributed OS / computer with a language-centric interface.)
Links