Reviewing YSH

2023-06-08

I just released Oils 0.16.0, and started writing an announcement called Breaking Renames and YSH.

It's full of minimal breaking changes to prepare for big features in the future. Details like "Changed mydict->key to mydict.key" may confuse readers, so let's talk about the big picture first.

I started writing a "backlog" post to explain YSH, but it turned into three:

Reviewing YSH — this post reminds us where we are.
Sketches of YSH Features — concrete syntax demos.
Oils Is Exterior-First — our #software-architecture ideas resurface and help us with design problems.

Remember that YSH was called the Oil language until a few months ago, and I've blogged about it for years under the #oil-language tag.

Table of Contents

Shell Should Be More Like Python, JavaScript, and Ruby

Diagram

This is Hard

Sketch: Seven Features of YSH

Design Issue: Too Many Units of Code?

Four Years of Blog Posts

It Stalled in 2022

Three Arguments for Structured Data in Shell

Implementation Status

What's Next?

Shell Should Be More Like Python, JavaScript, and Ruby

This may be our slogan for YSH.

In contrast, Four Features That Justify a New Unix Shell (2020) described why the more conservative OSH is necessary:

It handles errors reliably.

It lets you safely process user-supplied data, like filenames.

It relieves you of quoting hell.

It's statically parsed, which enables better error messages and tools.

YSH naturally extends OSH, and to repeat the slogan:

Shell should be more like Python, JavaScript, and Ruby.

The whole project is called Oils. It consists of OSH, YSH, and JSON-like data languages. It's our upgrade path from bash to a better language and runtime.

Diagram

To explain this, I would draw a diagram with two sets of languages:

String-ish: Bourne shell, Batch files, Make, CMake, M4/autoconf, ...
With Structured Data and GC: Python, JavaScript, Ruby, Perl, Lua, PHP, R, Erlang, ...

Oils is moving shell from the left to the right, with OSH → YSH. (If you're good at drawing diagrams, let me know in the comments!)

If you look at the Alternative Shells wiki page, there's a pretty strong consensus on this design. There are multiple shells influenced by more powerful dynamic languages.

But Oils is the only project that's doing it as an extension of a POSIX- and bash-compatible shell.

Crucially, we're not breaking the "string-ish" property of shell. This is because Strings and Bytes are Essential Narrow Waists. I even made diagrams to emphasize this point.

One way to think about it is that serialized structured data is primary, and in-memory representations are secondary, so we don't lose interoperability and composition.

Your shell scripts will remain short, and new shell users will continue to be amazed by its "whipupitude". Shell is universal glue.

The narrow waist post in this series will elaborate on this.

This is Hard

I also want to say that I underestimated how hard creating a good dynamic language is. The problem is that once you add structured data, you have to add a whole bunch of other stuff, like garbage collection, functions with splats, named args, optional args, destructuring assignment, and more. Otherwise the language doesn't "close" algebraically.

String-ish languages are a surprisingly stable point in the language design space — that's probably why you see so many of them (e.g. CMake).

From another perspective: Python and Ruby are much bigger languages than shell, and they each required at least a decade of hard engineering to produce something useful. JavaScript also grew at least 3x in the 2010s to become more useful.

Nevertheless, I think we can make something nice, with help from many contributors and the "middle out" implementation style. The codebase is more open to contribution, and translating YSH to C++ will remove a major blocking "smell".

Sketch: Seven Features of YSH

What does Python-like structured data imply? I described 7 features of YSH in Backlog: Rough Progress Assessments (December 2021), and it's still a good way to describe the language.

1. Python-like expressions on typed data: 42 + a[i] + mydict.key.
   - It subsumes 2 weak and malformed languages in bash: arithmetic $((1 + 2)) and boolean [[ x =~ $regex ]].
2. Eggex: readable, composable regular expressions.
3. Ruby-like code blocks, including Hay: Custom Languages for Unix Systems.
4. Procs are enhanced shell "functions", which take argv and return an integer.
5. Functions operate on typed data, and use the func keyword.
   - As of this month, we have a design for functions and methods! #language-design > Pure Functions: List of Changes
6. More shell builtins: argparse, describe.
   - These are blocked on fully implementing procs and funcs. I think we need to do the harder thing of making the language extensible like Python and Ruby, rather than "hard-coded" like shell.
   - Related posts in the blog backlog: Growing a Language and Why did Python become the Language of Machine Learning? (2021)
7. Data Languages.
   - I realized late last year that the shell- and Rust-compatible Quoted String Notation was a "version 1" to be discarded. We now have a more JSON-compatible design called "J8 notation".
   - "Packle" also makes sense: binary serialization of object graphs, not just trees.

So all these features relate to structured data, and they're a lot of work!
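To make the first feature concrete, here is a minimal bash sketch of the two sublanguages that YSH expressions are meant to subsume. It is plain bash, not YSH, and the variable names are made up for illustration.

```bash
#!/usr/bin/env bash

# Arithmetic sublanguage: $(( )) evaluates integers that are stored as strings.
a=(10 20 30)
i=1
sum=$(( 42 + a[i] ))
echo "sum=$sum"                  # prints sum=62

# Boolean sublanguage: [[ ]] has its own syntax, including =~ for regex matching.
regex='^[0-9]+$'
if [[ $sum =~ $regex ]]; then
  is_number=yes
else
  is_number=no
fi
echo "is_number=$is_number"      # prints is_number=yes
```

Each construct has its own quoting and evaluation rules, and that is the duplication a single expression language on typed data would remove.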

The next post will "tease" all these features with examples. It won't be complete, because I want to save energy for actually finishing the implementation, and writing YSH reference docs.

Design Issue: Too Many Units of Code?

The profusion of code units (blocks, procs, and funcs) occupied me for some time, and they still need to be fully implemented. I mentioned this issue in the narrow waist posts. Notes:

- Shell already has blocks like { echo 1; echo 2; } > out, and we extend them to be unevaluated arguments to procs: cd /tmp { echo $PWD }.
- Procs add named parameters to shell functions, with default values. They remove dynamic scope.
  - Procs now also take typed arguments, like functions. A block is a typed value (value_t), so it's no longer a special case.
  - Procs should lazily evaluate their arguments to support expressions on streaming data like Awk, and tables like R.
- If we add typed data, we need functions to provide "closure".
  - A key breakthrough was using a new error builtin (keyword?) to return 1 for error. Then we extend return with a typed arg like return (a[i + 1]). Remember that () means expression over typed data.
  - Key use cases for functions: escaping HTML, argparse, manipulating lists of files. These are firmly within the domain of shell.
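For readers who haven't run into the two shell behaviors mentioned above, here is a small sketch in plain bash (not YSH). It shows a brace block with a redirect, and the dynamic scope that procs are meant to remove. The function names caller and callee are hypothetical.

```bash
#!/usr/bin/env bash

# A brace block is a unit of code; the redirect applies to the whole block.
out=$(mktemp)
{ echo 1; echo 2; } > "$out"
content=$(cat "$out")
rm -f "$out"

# Dynamic scope: a callee can see and mutate the caller's local variable.
callee() {
  x='changed by callee'
}

caller() {
  local x='set by caller'
  callee
  echo "$x"
}

result=$(caller)
echo "$result"    # prints: changed by callee
```

The assignment in callee silently rewrites a variable that caller declared local, which is the kind of action at a distance that lexically scoped procs avoid.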

It's not clear if functions should always eagerly evaluate their arguments, or if we should provide control for both procs and funcs.

I believe functions should be pure, and that procs can call funcs, but not the other way around. This gives scripts a "functional-core / imperative-shell" structure, which should simplify the usage of procs and funcs.

Normally, I'm skeptical of language designers trying to "force" users to write code in a certain way. But I think this is an exception that will reduce cacophony.

Also, we already introduced a notion of restricted evaluation with Hay. The evaluation of declarative data is also pure in some sense.
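The "functional-core / imperative-shell" split can be approximated in today's bash by convention only. The sketch below is an assumption-laden illustration, not YSH: escape_html and write_page are hypothetical names, and bash cannot enforce that the first one stays pure.

```bash
#!/usr/bin/env bash

# "Functional core": depends only on its argument and writes only to stdout.
# This purity is a convention here; nothing in bash enforces it.
escape_html() {
  printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# "Imperative shell": performs I/O and calls the pure helper, never the reverse.
write_page() {
  local path=$1
  local title=$2
  printf '<h1>%s</h1>\n' "$(escape_html "$title")" > "$path"
}

page=$(mktemp)
write_page "$page" 'Tom & Jerry <3'
html=$(cat "$page")
rm -f "$page"
echo "$html"    # prints: <h1>Tom &amp; Jerry &lt;3</h1>
```

The design idea is that a language-level distinction between procs and funcs would turn this convention into a rule: the escaping helper could not touch the file system even by accident.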

Four Years of Blog Posts

Those 7 points describe YSH currently, but reviewing blog posts tagged #oil-language may help contributors understand the project. Here are a few that stand out:

Oil Language Design Notes #1 (August 2019)

I started a concrete prototype, building on ideas I'd sketched and blogged about for years. Originally OSH and YSH were more separate, but I realized that the shopt mechanism was enough to provide a smooth upgrade path.
You Can Now Try the Oil Language (October 2019) - I released the first version!

At this point, I had prototyped YSH with a "cheap" metacircular evaluator, reusing CPython itself. I thought we could cut down the huge amount of work this way.

But I quickly realized that it wasn't technically feasible for "production". One issue is the PyObject* → value_t problem, which Melvin is dealing with now.

Implementing it from scratch would take years, so I mentally removed structured data from the project!
Ambitions for Unix Shell > The Biggest Cut (January 2020) - So the biggest change is that Oil will be based on strings (and arrays of strings) like shell, rather than having structured data types like Python.

This cut allowed me to focus on just OSH:
Four Features That Justify a New Unix Shell (October 2020).

The idea was to just "finish" OSH, and call it a project. We would finish translating it to C++, and finish the garbage collector. Ironically, I got stuck on that problem, and went back to working on Oil/YSH, mostly because it was fun.
More Changes to Oil's Syntax (November 2020) - Lots of fundamental work with
- read --line, write --qsn, ...
- shopt --set { code block }, ...
- ### doc comments
- Removing pass based on user feedback
Your feedback has been essential this whole time! The progress has been intermittent because the project has many big parts, but YSH would be dead without feedback like this. The 0.16.0 release reflects recent feedback, which you'll see shortly.
Recent Progress on the Oil Language (June 2021)

The headless shell makes its first appearance. This isn't technically part of YSH, but it's another thing that "justifies" a brand new shell implementation.

We added lightweight modules. I analyze years- and decades-old complaints about shell, and show how YSH fixes them.
Oil Has Multi-line Commands and String Literals (September 2021) - Features and simplifications that make YSH too good not to "be real"!

It Stalled in 2022

We spent most of 2022 working on the C++ translation and garbage collector. As a result, there was little progress on YSH in 2022.

We finally got the GC working in December. This felt really good, and I got unanimous feedback that readers and contributors still want YSH.

They don't just want the modest, cleaned-up shell OSH. They want structured data. They want shell to be more like Python, JavaScript, and Ruby.

Three Arguments for Structured Data in Shell

Structured data now feels inevitable, because shell has dozens of basic features that can use it. I don't see any design for a string-ish language that would be powerful enough, or familiar enough. (Tcl is probably the best design in this category.)

This reminds me that I sketched a blog post in January, with 3 arguments:

1. The kernel accepts argv arrays, which are structured data, not just strings. This is what Thirteen Incorrect Ways and Two Awkward Ways to Use Arrays is about. (People still remember this post, years later!)
2. Make and Ninja are forms of shell, because they use the process interface. They're both centered around build graphs. A graph is not only structured data, but it also requires a garbage collector!
   - This came from a conversation with Zack Weinberg regarding a real build system problem he had. If you don't have structured data, then you're left poorly manipulating graphs with strings, as autoconf does.
3. Annotations on code are structured data. I made this argument at the HotOS Unix Shell Panel in 2021. PaSH, POSH, and their successors use YAML for such annotations.
   - There's a pretty easy way to associate (distributable) procs and declarative metadata with Hay. See issue #1179. We can still use help!
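The first argument, that argv is an array and not a string, can be seen in a few lines of bash. This is a minimal sketch with made-up file names; count_args is a hypothetical helper that reports how many argv entries it received.

```bash
#!/usr/bin/env bash

# Reports the number of arguments it was called with.
count_args() {
  echo "$#"
}

files=('my notes.txt' 'report.pdf')

# Flattened to a single string and word-split again: the boundary between
# the two file names is lost, because the first name contains a space.
as_string="${files[*]}"
split_count=$(count_args $as_string)

# Passed as an array: each element becomes exactly one argv entry.
array_count=$(count_args "${files[@]}")

echo "string: $split_count, array: $array_count"   # prints string: 3, array: 2
```

Two file names turn into three arguments once the data goes through a string, which is why the array has to be treated as structured data from end to end.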

So again, structured data feels inevitable. We already started implementing it, and that leads to "the whole enchilada", with rich functions and methods.

Implementation Status

I addressed this a bit in the Oils 2023 FAQ. To be honest, the project still feels pretty big.

But we're knocking down problems at a good rate, and the language feels too good not to implement. Contributors are excited about it, and they're moving the project forward. There would be no YSH without contributors like Melvin, Aidan, and Chris.

Our translation process is more solid, with mycpp and ASDL being coordinated by a nice Ninja-based build system. The garbage collector and C++ runtime are not fast, but they can be, and they're definitely enough to support YSH.

I'm pleased that we can run all of Oils with only ~7K lines of native code. I really wouldn't want to maintain 200K lines or even 40K lines of C++. That would pretty much kill the project.

I don't think we have enough help on the C++ runtime, but it's OK if optimization comes somewhat after YSH. If people use it, they will want to make it faster.

I mentioned what's left on the Oils 2023 Roadmap. Looking back on it, many things are done, and many aren't done :-) As usual. The good news is that the 0.16.0 release did almost all the breaking changes to prepare for an ambitious version of YSH.

I'll repeat that issue 636: Oil expression evaluator shouldn't be "metacircular" is a big deal. The YSH evaluator needs to be statically typed, and divorced from CPython. It will grow by several thousand lines.

Melvin has been doing great work on this, and I want to jump in and help. But I still need to write about a month's worth of documentation, which is also important. After writing the docs, we need to implement the new designs for significant parts of YSH:

I don't know how long this will take, but again I think they're too good not to implement. And we have immediate use cases for YSH in the Oils project itself — namely, our extensive #Soil automation, which I've mentioned a few times. At some point I will publish this draft:

#blog-ideas > Programmers Adopt Platforms, Not Languages. Based on this comment and others.

What's Next?

The second post in this series will show concrete examples of existing features and new designs. We've gotten great feedback recently, and are open to more.

The upcoming 0.16.0 release notes will credit many people for their great feedback.

That is, you can still influence the language, either by contributing, or by trying the shell and posting a message on #oil-discuss-public or #oil-discuss on Zulip. I know it's hard with the state of the docs, but the upcoming "month of docs" should help a bit.

Feel free to ask me questions in the comments!