
Sketches of YSH Features

2023-06-11 (Last updated 2023-11-19)

This is the second of three posts about YSH.

  1. Reviewing YSH explained the seven parts of the language, and reviewed its history.

  2. Sketches of YSH Features — This post shows concrete examples, consolidating a few months of brainstorming on our #language-design Zulip stream.

    It's long and detailed, because there's a lot to cover!

  3. Oils Is Exterior-First — our #software-architecture ideas resurface and help us with design problems.


I focus on new designs that still need to be implemented.

I also enumerate fourteen use cases for Ruby-like blocks. This leads me to conclude that YSH must be extensible and reflective like Python and Ruby, not "hard-coded" like shell and awk.

Table of Contents
Preliminaries
YSH Upgrades Each Part of Shell With Typed Data
New case statement
String Notation: Code and Data
8 Kinds of String Literal vs. 4
Let's Deconstruct and Augment JSON
Procs and Funcs
Julia-like Signatures "Fix" Python
Proc Arguments Have 4 Styles, But 1 Meaning
-> is a Pun: "Bind Self" or "Threading" Operator
Where Ruby-like Blocks Can Be Useful in Shell
Handle errors: try
Save and restore state: fopen cd shopt shvar
Execute code later: trap describe awk make find xargs
Declare data: Hay ain't YAML, argparse
Structured HTML templating like Markaby
Summary and Conclusions
TODO List
"Shelling Out" Is Still Idiomatic
Appendix
Review of Awk and Make After 6 Years
2023-11 Update: 2 More Use Cases, Giving 16

Preliminaries

For this post, the most salient principle is that syntax and semantics should correspond: things that are the same should look the same, and things that are different should look different.

This sounds obvious, but as most languages grow, they inevitably break the rule (related links in the appendix).

YSH Upgrades Each Part of Shell With Typed Data

Let's review some material in A Tour of YSH first.

We've consistently followed the rule that typed data appears within parentheses (). Luckily, unquoted ( is a syntax error in most parts of shell, which creates a "hole" for us to upgrade the language without breaking it!

YSH has commands with typed arguments:

$ var d = {name: 'bob', age: 30}
$ json write (d)  # d is an expression/name, not a string
{
  "name": "bob",
  "age": 30
}

If statements with rich conditions:

if (len(mydict) > 0) {  # an expression on typed data
  echo 'non-empty'
}

While loops:

var x = 5
while (x > 0) {  # ditto
  echo $x
  setvar x -= 1
}

And enhanced for loops:

for name in *.py {  # no parens
  echo $name
} 
for i, item in (mylist[1:]) {  # index, value; Python-like slice
  echo "$i $item"
}

New case statement

Aidan just implemented the parser for a new case statement. It lets you write this traditional shell:

case $mystr in  # unbalanced ) is bad for syntax highlighters
  *.cc | *.h) echo 'C++' ;;
  *.py)       echo 'Python' ;;
  *)          echo 'other' ;;
esac

in a nicer way:

case (mystr) {  # parens start the new case statement
  *.cc | *.h { echo 'C++' }
  *.py       { echo 'Python' }
  *          { echo 'other' }
}

It also upgrades shell with typed data, which again appears within ():

var x = 42  # an integer, not a string
case (x) {  
  (41)   { echo "doesn't match" }
  (42)   { echo 'matches' }
  (else) { echo 'something else' }
}

And egg expressions, which produce a syntax reminiscent of awk:

var time = '3:14'  # a string
case (time) {
  /   d ':' d d / { echo  'M:SS' }
  / d d ':' d d / { echo 'MM:SS' }
  (else)          { echo 'neither' }
}

The case statement isn't fully implemented, because it depends on divorcing the YSH evaluator from CPython.

String Notation: Code and Data

A couple years ago, I designed Quoted String Notation. It "unified" bash's C-style $'\n' literals and Rust string literals into a data language. That page lists the ways you can use it.

After using it, I noticed at least 2 problems with the design.

So we have a new design called J8 Notation, based on two simple tweaks of JSON strings. We add \yff escapes for arbitrary bytes, and \u{3bc} style escapes for Unicode code points.

J8 strings have the j prefix when these extensions are used, which is similar to the r'raw string' prefix.

j"nul = \yff  mu = \u{3bc}"

An upcoming breaking change is renaming write --qsn to write --j8, and so forth.

We will also add J8 strings to the YSH language itself, deprecating $'\n'. This aids code generation, and makes the language easier to remember.
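For example, here's a hypothetical before/after (assuming the j prefix is accepted in YSH source code, not just in data):

echo $'mu = \u03bc tab = \t'   # bash-style C-escaped string, to be deprecated
echo j"mu = \u{3bc} tab = \t"  # the J8-style replacement sketched above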

8 Kinds of String Literal vs. 4

I sketched the solution, so now let's elaborate on the problem. Bash scripts often contain JSON, which means scripts can have up to eight kinds of string literal.

  1. 'single quoted'
  2. "double quoted with $var $(cmd) interpolation"
  3. $'C-escaped with \n'

There are two kinds of here docs:

  4. With interpolation <<EOF, i.e. double quoted evaluation
  5. Without interpolation <<'EOF', i.e. single quoted evaluation.

(I didn't know about this difference before implementing a shell — the 'EOF' rule is confusing.)
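To make the difference concrete, here's plain old shell (not new YSH syntax):

cat <<EOF        # with interpolation: $USER is expanded
hello $USER
EOF

cat <<'EOF'      # without interpolation: $USER stays literal
hello $USER
EOF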

Two shell builtins have dynamically parsed string syntaxes:

  6. echo -e '\n'
  7. printf 'x = %s' "$x"

And

  8. JSON embedded in shell scripts

In YSH, we'll have just four syntaxes:

  1. r'single quoted' -- r for "raw" is optional when \ isn't in the string
  2. "double quoted with $var $(cmd) $[expr]"
  3. j"JSON-like with \n \yff and \u{123456}"
  4. Multi-line strings strip leading whitespace, are consistent with single-line literals, and subsume here docs.
cat <<< r'''
hi
'''

cat """
$(echo hi)
"""

# also in expressions, unlike here docs!
var z = j"""
  \yff \u{123456}
  """

Further,

To summarize, we have 3 kinds of string literal, and one rule for multi-line strings. No here docs.


Technically, we could augment J8 Notation so that it can express any kind of string:

echo j"newline \n \$(echo hi) \$[42 + a[i]]"

But I still expect idiomatic YSH to use normal single- and double-quoted strings.

Let's Deconstruct and Augment JSON

Here's some more detail on our proposed data language design. I explained that we'll "extract" JSON strings and augment them with \yff and \u{123456}, giving "J8 strings".

From there, we can derive 2 new formats: JSON8, which is JSON with J8 strings, and TSV8, which is TSV with J8 strings and a column type row.

I successfully use regular, untyped TSV in shell, and I expect it to be idiomatic in YSH. So TSV8 can use a "gutter column" to distinguish it from TSV:

!tsv8   name    age
!type   Str     Int
        alice   42
        bob     35

That is, data columns are always indented by 1 tab.


Together, JSON8, TSV8, and J8 strings are called J8 Notation. In practice, it should be easy to create a J8 Notation library from a JSON library:

  1. Add an option to the string parser to recognize the j prefix and \yff \u{123456} extensions.
  2. JSON8 comes for free.
  3. Build a TSV8 parser by splitting on tabs and newlines. Decode cells that start with " or j".

The first step may lead to a tricky strings vs. bytes decision in certain languages, but that's fundamental for correctness.


I mentioned the slogan Tables, Records, and Documents in A Sketch of the Biggest Idea in Software Architecture. What happened to documents?

I don't think we need a custom syntax for them. We should add string escaping to YSH:

# hm, slightly annoying \"
echo "<a href=\"${url|html}\"> click here </a>"

# trailer says to apply HTML escaping to every substitution
echo "<a href=\"$url\"> $anchor </a>"html

Or we can generate HTML from an internal DSL, which I'll show in the section on blocks.

Procs and Funcs

Now let's review new designs for shell-like procs and Python-like functions. Recall that I was on the fence about this profusion of code units. But we have a few simplifications, including:

  1. Making proc and func signatures largely the same.
  2. A new error builtin, which reduces the confusion of returning an error vs. returning a value.
  3. Making funcs pure. They can't call procs.

Julia-like Signatures "Fix" Python

Procs had a specialized and quirky signature syntax, but I now believe both proc and func signatures should look like Julia's.

Julia's signatures and argument binding are as powerful as Python's, but without the historical warts. For example, Python 3 introduced both keyword-only params with *, and positional-only params with /.

Julia's solution is to separate positional and named params with a semicolon. This makes named/positional and required/optional into orthogonal dimensions. YSH can use the same syntax:

func f(pos1, pos2=2, ...args ; named1=4, ...kwargs) {
  return (pos1 + pos2 + sum(args) + named1 + sum(kwargs->values()))
}
var s1 = f(1)                             # =>  7 is 1 + 2 + 4
var s2 = f(1, 3, named1=10)               # => 14 is 1 + 3 + 10
var s3 = f(1, 2, 3, 4, named1=10, foo=5)  # => 25

Most signatures will be much simpler, but we'll retain all the power of Python and Julia.

Proc signatures are the same, except they can declare a block param after another semicolon:

proc my-cd (dest ; ; block) {  # no keyword args
  cd $dest (block)  # call the builtin
}

my-cd /tmp {  # trailing block literal arg
  echo $PWD
}

Proc Arguments Have 4 Styles, But 1 Meaning

There are 4 ways to pass arguments to procs and YSH builtins:

my-cp src /tmp                # string args denoted with "words"

json write ({x: 42})          # typed argument

error (status=2, msg="oops")  # named/keyword arg, also typed

cd /tmp {                     # block argument
  echo $PWD
}

But these invocations can all be written in their "desugared" form, as typed arguments:

my-cp ('src', '/tmp')      # strings are quoted in expressions
json ('write', {x: 42})    # also valid
error (2; msg="oops")      # ; separates positional and keyword args
cd ('/tmp', ^(echo $PWD))  # block expression ^() looks like $()

So the syntax is rich, but the semantics are simple. Arguments are bound left to right, with splats like ... picking up the rest, except for the block argument. It's always the last argument, so it's neither positional nor named.
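Here's a rough sketch of left-to-right binding with a rest param (the proc below is illustrative, following the signature style above):

proc show-args (first, ...rest) {  # ...rest picks up the remaining args
  echo "first: $first"
  echo @rest                       # splice the rest back out: "b c"
}

show-args a b c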


In contrast, funcs have only two kinds of arguments — positional and named, both typed:

var parts = split('spam/eggs/ham', sep='/')

-> is a Pun: "Bind Self" or "Threading" Operator

I like how this turned out. We use dot (.) to access Dict members, and call methods with arrow (->):

var s = mydict.key
if (s->startswith('prefix')) {
  echo 'yes'
}

You can also think of -> as the "threading" operator, like Clojure or Elixir:

var s = 'mystring' -> upper() -> replace('X', '_')

Under the hood, -> works like Python:

var m = s->startswith  # bound method, not called
var mybool = m('foo')  # call bound method

It's also similar to Lua, where a colon (:) is roughly the "bind self" operator. JavaScript seems to be more "hard-coded" and non-orthogonal.

Where Ruby-like Blocks Can Be Useful in Shell

This section is long, with fourteen use cases.

But it's still in "sketch" form. After we implement proc argument binding, I should turn it into a blog post. Then contributors can test YSH by writing these DSL-like APIs in YSH itself.

Handle errors: try

Our error handling primitive is a builtin, not a keyword:

try {  # the block is an argument
  run-test-1
  run-test-2
} 
if (_status !== 0) {
  report-failure
}

I've been thinking of adding a "catch" form, which is shorter when you have a simple command:

try run-test-1 {
  report-failure
}

Save and restore state: fopen cd shopt shvar

Shell uses blocks for redirects, which save and restore the file descriptor table:

{ echo 1
  echo 2 
} > out

YSH has syntactic sugar, putting the filename first:

fopen > out {
  echo 1
  echo 2
}

cd saves and restores the current directory:

cd /tmp {
  echo $PWD
}

shopt saves and restores global options:

shopt --unset errexit {
  false
}

shvar saves and restores variables:

shvar PATH=. {
  my-command
}

Design notes:

Execute code later: trap describe awk make find xargs

The following constructs all save unevaluated code for executing "later".

trap registers event handlers:

trap INT {
  echo "SIGINT $myglobal"
}  

This isn't implemented, and we can use help. We also want to add zsh-like precmd and postexec hooks.
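For example, a precmd hook might reuse the same block style (the event name and mechanism here are speculative):

trap precmd {                    # speculative: 'precmd' as an event name
  echo "last status: $_status"   # e.g. annotate the next prompt
}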

describe could register test blocks:

proc my-cp (src, dest) { cp --verbose $src $dest }

describe my-cp {
  my-cp src dest  # should we turn off error handling here?
  assert ($? === 0) 
}

Note that assert's argument should be lazily evaluated so it can print a good error message.
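One hypothetical way to get that is an unevaluated expression literal, by analogy with the ^( ) block expression shown earlier (the syntax is speculative):

describe my-cp {
  my-cp src dest
  assert ^[_status === 0]  # speculative: assert receives the expression unevaluated,
                           # so it can both evaluate it and print its source text
}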

An awk-like DSL can filter streams with blocks:

BEGIN {
  var x = 50
}
# condition and block are typed args to when
when (weight > 10) {
  setvar x += weight
}
END {
  echo "x = $x"
}

A make-like DSL can specify blocks to execute when files are out of date:

rule {
  outputs = ['grammar.cc', 'grammar.h']
  inputs = ['grammar.y']
  command {
    yacc -C $[_inputs[0]]
  }
}

A find-like DSL could specify blocks to execute when the "cursor" visits a file system location:

fs ~/src ~/git {
  name .git && prune ||  # optimize file I/O with short-circuit
  name '*.py' && { echo "$_name $_mtime" }
}

I'm not sure if we can "pun" && || like this, since in shell they don't have the usual precedence. It might be better to use the expression style:

fs ~/src ~/git (
  name('.git') and prune() or
  name('*.py') and ^( echo "$_name $_mtime" )
)

Related: find and test: How To Read And Write Them

each could be an xargs-like builtin:

fs ~/src ~/git { name '*.pyc' } | each {
  rm --verbose @items
}

I mentioned this in An Opinionated Guide to xargs.

Declare data: Hay ain't YAML, argparse

You can use YSH syntax to declare data, and build it up with code. See Hay - Custom Languages for Unix Systems.

A key point is that YSH is Lisp-influenced — you can interleave code and data, and you have control over evaluation.


Parsing args could follow the same pattern:

Args :myspec {  # myspec is data created with code
  flag -v --verbose "Show verbose output"
  flag -R --recursive "Copy recursively"
  arg src
  arg dest      
}

var arg, pos = argparse(myspec, ARGV)
cp $[arg.src] $[arg.dest]

We should also auto-generate --help, and support auto-completion. We will need help from users.

Structured HTML templating like Markaby

Thanks to technomancy for pointing out Ruby's Markaby templating in a debate about string templating on lobste.rs.

I think that string templating with escaping will be common in YSH. But, since YSH can interleave code and data like Lisp, it can also express more "structured" solutions.

table id=$myid {  # start HTML-like data
  thead {
    tr {
      td class='x' { 'Name' }
      td           { 'Age' }
    }
  } 
  for p in (people) {  # interleave arbitrary code with data
    tr {
      td { $[p.name] }
      td { $[p.age] }
    }
  }
}

In contrast, structured solutions in Python are awkward because there isn't a clear way to call back into arbitrary code.

So this is basically what Markaby does, but I think it's a bit cleaner! (Or at least this fictional, non-running code looks clean.)

Summary and Conclusions

That was fourteen different use cases for blocks!

So clearly we can't do all this ourselves. YSH has to be a language extensible by users. It needs a smaller "core".

TODO List

Building on the Oils 2023 Roadmap, here's a rough list of things to do:

Funcs, procs, and blocks:

APIs using blocks:

Data languages:


Looking back on this list, it's pretty concrete (although I probably underestimated the translation work). If these features were the only things left, we could implement them in short order. Using typed Python and ASDL is pleasant and productive.

But I still need to write "a month of docs", overhaul the help builtin, produce some kind of demo for the headless shell, and more.

It also feels like the C++ runtime may be put on the back burner, which is unfortunate. I want to look at hash tables and string interning, since that performance issue recently came up with ble.sh.

"Shelling Out" Is Still Idiomatic

I want to conclude this post by reminding readers that composition with OS processes is still the main idiom in YSH, and gives it much of its power.

For example, I like using Ninja with shell. You don't need to write everything in YSH with funcs and blocks.

This is a main point of #software-architecture posts and the narrow waist idea. Every language supports the Unix process interface — whether they like it or not!

The next post will return to these ideas. They help us design YSH and the systems we build with it.

Appendix

Review of Awk and Make After 6 Years

I mentioned awk and make above, so here are comments about my experiments with them.

To summarize:

2023-11 Update: 2 More Use Cases, Giving 16

We're deep in the middle of YSH language design, so here are two more use cases.

Docker-like layers. We use more than 10 Docker images in our Soil continuous build (and Red Hat's podman runtime, for diversity). It would be nice to have more control over layer sharing, which would result in faster image transfer, reduced disk usage, and possibly faster incremental builds. Copying Docker images is expensive for us.

Image oilshell/soil-app-tests {

  FROM oilshell/soil-common

  COPY deps/from-apt.sh /home/uke/tmp/deps/from-apt.sh

  Layer apt {
    mount var-cache-apt {
      type = 'cache'
      target = '/var/cache/apt'
      sharing = 'locked'
    }

    mount var-lib-apt {
      type = 'cache'
      target = '/var/lib/apt'
      sharing = 'locked'
    }

    proc run {
      deps/from-apt.sh layer-locales
      deps/from-apt.sh app-tests
    }
  }

  USER uke

  # ... more layers

}

(adapted from our own deps/Dockerfile.app-tests).

This isn't just a cosmetic change -- we can also use procs to factor code and share layers.
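Here's a rough sketch of that factoring (the apt-layer proc and its package arguments are illustrative, not a real API):

proc apt-layer (name, ...pkgs) {   # a reusable layer definition
  Layer $name {
    proc run {
      apt-get update
      apt-get install -y @pkgs
    }
  }
}

Image oilshell/soil-app-tests {
  FROM oilshell/soil-common
  apt-layer app-tests locales time   # share the same layer logic across images
}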

Exterior Schemas vs. Interior Types.

Schemas, type definitions, and IDLs (interface description languages) are common in distributed systems. Our Ruby-like block syntax is flexible enough to support them:

Type Person {
  field name (Str, id=1)
  field age (Int, id=2)
}

Evolution of exterior schemas is a huge design issue. It probably makes sense to have multiple type systems — to be "meta" with respect to types.

YSH is intended to "meet you where you are", e.g. with shell and JSON, rather than forcing new technology and constraints onto your systems.