
Sketches of YSH Features

2023-06-11 (Last updated 2023-11-19)

This is the second of three posts about YSH.

  1. Reviewing YSH explained the seven parts of the language, and reviewed its history.

  2. Sketches of YSH Features — This post shows concrete examples, consolidating a few months of brainstorming on our #language-design Zulip stream.

    It's long and detailed, because there's a lot to cover!

  3. Oils Is Exterior-First — our #software-architecture ideas resurface and help us with design problems.


I focus on new designs that still need to be implemented.

I also enumerate fourteen use cases for Ruby-like blocks. This leads me to conclude that YSH must be extensible and reflective like Python and Ruby, not "hard-coded" like shell and awk.

Table of Contents
Preliminaries
YSH Upgrades Each Part of Shell With Typed Data
New case statement
String Notation: Code and Data
8 Kinds of String Literal vs. 4
Let's Deconstruct and Augment JSON
Procs and Funcs
Julia-like Signatures "Fix" Python
Proc Arguments Have 4 Styles, But 1 Meaning
-> is a Pun: "Bind Self" or "Threading" Operator
Where Ruby-like Blocks Can Be Useful in Shell
Handle errors: try
Save and restore state: fopen cd shopt shvar
Execute code later: trap describe awk make find xargs
Declare data: Hay ain't YAML, argparse
Structured HTML templating like Markaby
Summary and Conclusions
TODO List
"Shelling Out" Is Still Idiomatic
Appendix
Review of Awk and Make After 6 Years
2023-11 Update: 2 More Use Cases, Giving 16

Preliminaries

For this post, the most salient principle is that syntax and semantics should correspond: things that are the same should look the same, and things that are different should look different.

This sounds obvious, but as most languages grow, they inevitably break the rule (related links in the appendix).

YSH Upgrades Each Part of Shell With Typed Data

Let's review some material in A Tour of YSH first.

We've consistently followed the rule that typed data appears within parentheses (). Luckily, unquoted ( is a syntax error in most parts of shell, which creates a "hole" for us to upgrade the language without breaking it!

YSH has commands with typed arguments:

$ var d = {name: 'bob', age: 30}
$ json write (d)  # d is an expression/name, not a string
{
  "name": "bob",
  "age": 30
}

If statements with rich conditions:

if (len(mydict) > 0) {  # an expression on typed data
  echo 'non-empty'
}

While loops:

var x = 5
while (x > 0) {  # ditto
  echo $x
  setvar x -= 1
}

And enhanced for loops:

for name in *.py {  # no parens
  echo $name
} 
for i, item in (mylist[1:]) {  # index, value; Python-like slice
  echo "$i $item"
}

New case statement

Aidan just implemented the parser for a new case statement. It lets you write this traditional shell:

case $mystr in  # unbalanced ) is bad for syntax highlighters
  *.cc | *.h) echo 'C++' ;;
  *.py)       echo 'Python' ;;
  *)          echo 'other' ;;
esac

in a nicer way:

case (mystr) {  # parens start the new case statement
  *.cc | *.h { echo 'C++' }
  *.py       { echo 'Python' }
  *          { echo 'other' }
}

It also upgrades shell with typed data, which again appears within ():

var x = 42  # an integer, not a string
case (x) {  
  (41)   { echo "doesn't match" }
  (42)   { echo 'matches' }
  (else) { echo 'something else' }
}

And egg expressions, which produce a syntax reminiscent of awk:

var time = '3:14'  # a string
case (time) {
  /   d ':' d d / { echo  'M:SS' }
  / d d ':' d d / { echo 'MM:SS' }
  (else)          { echo 'neither' }
}

The case statement isn't fully implemented, because it depends on divorcing the YSH evaluator from CPython.

String Notation: Code and Data

A couple years ago, I designed Quoted String Notation. It "unified" bash's C-style $'\n' literals and Rust string literals into a data language. That page lists the ways you can use it.

After using it, I noticed at least 2 problems with the design.

So we have a new design called J8 Notation, based on two simple tweaks of JSON strings. We add \yff escapes for arbitrary bytes, and \u{3bc} style escapes for Unicode code points.

J8 strings have the j prefix when these extensions are used, which is similar to the r'raw string' prefix.

j"nul = \yff  mu = \u{3bc}"

An upcoming breaking change is renaming write --qsn to write --j8, and so forth.

We will also add J8 strings to the YSH language itself, deprecating $'\n'. This aids code generation, and makes the language easier to remember.
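For example, here's a hypothetical before/after (assuming the j prefix is accepted in YSH source code, not just in data):

echo $'mu = \u03bc tab = \t'   # bash-style C-escaped string, to be deprecated
echo j"mu = \u{3bc} tab = \t"  # the J8-style replacement sketched above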

8 Kinds of String Literal vs. 4

I sketched the solution, so now let's elaborate on the problem. Bash scripts often contain JSON, which means scripts can have up to eight kinds of string literal.

  1. 'single quoted'
  2. "double quoted with $var $(cmd) interpolation"
  3. $'C-escaped with \n'

There are two kinds of here docs:

  4. With interpolation <<EOF, i.e. double quoted evaluation
  5. Without interpolation <<'EOF', i.e. single quoted evaluation.

(I didn't know about this difference before implementing a shell — the 'EOF' rule is confusing.)
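To make the difference concrete, here's plain old shell (not new YSH syntax):

cat <<EOF        # with interpolation: $USER is expanded
hello $USER
EOF

cat <<'EOF'      # without interpolation: $USER stays literal
hello $USER
EOF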

Two shell builtins have dynamically parsed string syntaxes:

  6. echo -e '\n'
  7. printf 'x = %s' "$x"

And

  8. JSON embedded in shell scripts

In YSH, we'll have just four syntaxes:

  1. r'single quoted' -- r for "raw" is optional when \ isn't in the string
  2. "double quoted with $var $(cmd) $[expr]"
  3. j"JSON-like with \n \yff and \u{123456}"
  4. Multi-line strings strip leading whitespace, are consistent with single-line literals, and subsume here docs.
cat <<< r'''
hi
'''

cat """
$(echo hi)
"""

# also in expressions, unlike here docs!
var z = j"""
  \yff \u{123456}
  """

Further,

To summarize, we have 3 kinds of string literal, and one rule for multi-line strings. No here docs.


Technically, we could augment J8 Notation so that it can express any kind of string:

echo j"newline \n \$(echo hi) \$[42 + a[i]]"

But I still expect idiomatic YSH to use normal single- and double-quoted strings.

Let's Deconstruct and Augment JSON

Here's some more detail on our proposed data language design. I explained that we'll "extract" JSON strings and augment them with \yff and \u{123456}, giving "J8 strings".

From there, we can derive 2 new formats: JSON8, which is JSON with J8 strings, and TSV8, which is TSV with J8 strings and a column type row.

I successfully use regular, untyped TSV in shell, and I expect it to be idiomatic in YSH. So TSV8 can use a "gutter column" to distinguish it from TSV:

!tsv8   name    age
!type   Str     Int
        alice   42
        bob     35

That is, data columns are always indented by 1 tab.


Together, JSON8, TSV8, and J8 strings are called J8 Notation. In practice, it should be easy to create a J8 Notation library from a JSON library:

  1. Add an option to the string parser to recognize the j prefix and \yff \u{123456} extensions.
  2. JSON8 comes for free.
  3. Build a TSV8 parser by splitting on tabs and newlines. Decode cells that start with " or j".

The first step may lead to a tricky strings vs. bytes decision in certain languages, but that's fundamental for correctness.


I mentioned the slogan Tables, Records, and Documents in A Sketch of the Biggest Idea in Software Architecture. What happened to documents?

I don't think we need a custom syntax for them. We should add string escaping to YSH:

# hm, slightly annoying \"
echo "<a href=\"${url|html}\"> click here </a>"

# trailer says to apply HTML escaping to every substitution
echo "<a href=\"$url\"> $anchor </a>"html

Or we can generate HTML from an internal DSL, which I'll show in the section on blocks.

Procs and Funcs

Now let's review new designs for shell-like procs and Python-like functions. Recall that I was on the fence about this profusion of code units. But we have a few simplifications, including:

  1. Making proc and func signatures largely the same.
  2. A new error builtin, which reduces the confusion of returning an error vs. returning a value.
  3. Making funcs pure. They can't call procs.

Julia-like Signatures "Fix" Python

Procs had a specialized and quirky signature syntax, but I now believe both proc and func signatures should look like Julia's.

Julia's signatures and argument binding are as powerful as Python's, but without the historical warts. For example, Python 3 introduced both keyword-only params with *, and positional-only params with /.

Julia's solution is to separate positional and named params with a semicolon. This makes named/positional and required/optional into orthogonal dimensions. YSH can use the same syntax:

func f(pos1, pos2=2, ...args ; named1=4, ...kwargs) {
  return (pos1 + pos2 + sum(args) + named1 + sum(kwargs->values()))
}
var s1 = f(1)                             # =>  7 is 1 + 2 + 4
var s2 = f(1, 3, named1=10)               # => 14 is 1 + 3 + 10
var s3 = f(1, 2, 3, 4, named1=10, foo=5)  # => 25

Most signatures will be much simpler, but we'll retain all the power of Python and Julia.

Proc signatures are the same, except they can declare a block param after another semicolon:

proc my-cd (dest ; ; block) {  # no keyword args
  cd $dest (block)  # call the builtin
}

my-cd /tmp {  # trailing block literal arg
  echo $PWD
}

Proc Arguments Have 4 Styles, But 1 Meaning

There are 4 ways to pass arguments to procs and YSH builtins:

my-cp src /tmp                # string args denoted with "words"

json write ({x: 42})          # typed argument

error (status=2, msg="oops")  # named/keyword arg, also typed

cd /tmp {                     # block argument
  echo $PWD
}

But these invocations can all be written in their "desugared" form, as typed arguments:

my-cp ('src', '/tmp')      # strings are quoted in expressions
json ('write', {x: 42})    # also valid
error (2; msg="oops")      # ; separates positional and keyword args
cd ('/tmp', ^(echo $PWD))  # block expression ^() looks like $()

So the syntax is rich, but the semantics are simple. Arguments are bound left to right, with splats like ... picking up the rest, except for the block argument. It's always the last argument, so it's neither positional nor named.
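Here's a rough sketch of left-to-right binding with a rest param (the proc below is illustrative, following the signature style above):

proc show-args (first, ...rest) {  # ...rest picks up the remaining args
  echo "first: $first"
  echo @rest                       # splice the rest back out: "b c"
}

show-args a b c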


In contrast, funcs have only two kinds of arguments — positional and named, both typed:

var parts = split('spam/eggs/ham', sep='/')

-> is a Pun: "Bind Self" or "Threading" Operator

I like how this turned out. We use dot (.) to access Dict members, and call methods with arrow (->):

var s = mydict.key
if (s->startswith('prefix')) {
  echo 'yes'
}

You can also think of -> as the "threading" operator, like Clojure or Elixir:

var s = 'mystring' -> upper() -> replace('X', '_')

Under the hood, -> works like Python:

var m = s->startswith  # bound method, not called
var mybool = m('foo')  # call bound method

It's also similar to Lua, where a colon (:) is roughly the "bind self" operator. JavaScript seems to be more "hard-coded" and non-orthogonal.

Where Ruby-like Blocks Can Be Useful in Shell

This section is long, with fourteen use cases.

But it's still in "sketch" form. After we implement proc argument binding, I should turn it into a blog post. Then contributors can test YSH by writing these DSL-like APIs in YSH itself.

Handle errors: try

Our error handling primitive is a builtin, not a keyword:

try {  # the block is an argument
  run-test-1
  run-test-2
} 
if (_status !== 0) {
  report-failure
}

I've been thinking of adding a "catch" form, which is shorter when you have a simple command:

try run-test-1 {
  report-failure
}

Save and restore state: fopen cd shopt shvar

Shell uses blocks for redirects, which save and restore the file descriptor table:

{ echo 1
  echo 2 
} > out

YSH has syntactic sugar, putting the filename first:

fopen > out {
  echo 1
  echo 2
}

cd saves and restores the current directory:

cd /tmp {
  echo $PWD
}

shopt saves and restores global options:

shopt --unset errexit {
  false
}

shvar saves and restores variables:

shvar PATH=. {
  my-command
}

Design notes:

Execute code later: trap describe awk make find xargs

The following constructs all save unevaluated code for executing "later".

trap registers event handlers:

trap INT {
  echo "SIGINT $myglobal"
}  

This isn't implemented, and we can use help. We also want to add zsh-like precmd and postexec hooks.
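For example, a precmd hook might reuse the same block style (the event name and mechanism here are speculative):

trap precmd {                    # speculative: 'precmd' as an event name
  echo "last status: $_status"   # e.g. annotate the next prompt
}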

describe could register test blocks:

proc my-cp (src, dest) { cp --verbose $src $dest }

describe my-cp {
  my-cp src dest  # should we turn off error handling here?
  assert ($? === 0) 
}

Note that assert's argument should be lazily evaluated so it can print a good error message.
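One hypothetical way to get that is an unevaluated expression literal, by analogy with the ^( ) block expression shown earlier (the syntax is speculative):

describe my-cp {
  my-cp src dest
  assert ^[_status === 0]  # speculative: assert receives the expression unevaluated,
                           # so it can both evaluate it and print its source text
}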

An awk-like DSL can filter streams with blocks:

BEGIN {
  var x = 50
}
# condition and block are typed args to when
when (weight > 10) {
  setvar x += weight
}
END {
  echo "x = $x"
}

A make-like DSL can specify blocks to execute when files are out of date:

rule {
  outputs = ['grammar.cc', 'grammar.h']
  inputs = ['grammar.y']
  command {
    yacc -C $[_inputs[0]]
  }
}

A find-like DSL could specify blocks to execute when the "cursor" visits a file system location:

fs ~/src ~/git {
  name .git && prune ||  # optimize file I/O with short-circuit
  name '*.py' && { echo "$_name $_mtime" }
}

I'm not sure if we can "pun" && || like this, since in shell they don't have the usual precedence. It might be better to use the expression style:

fs ~/src ~/git (
  name('.git') and prune() or
  name('*.py') and ^( echo "$_name $_mtime" )
)

Related: find and test: How To Read And Write Them

each could be an xargs-like builtin:

fs ~/src ~/git { name '*.pyc' } | each {
  rm --verbose @items
}

I mentioned this in An Opinionated Guide to xargs.

Declare data: Hay ain't YAML, argparse

You can use YSH syntax to declare data, and build it up with code. See Hay - Custom Languages for Unix Systems.

A key point is that YSH is Lisp-influenced — you can interleave code and data, and you have control over evaluation.


Parsing args could follow the same pattern:

Args :myspec {  # myspec is data created with code
  flag -v --verbose "Show verbose output"
  flag -R --recursive "Copy recursively"
  arg src
  arg dest      
}

var arg, pos = argparse(myspec, ARGV)
cp $[arg.src] $[arg.dest]

We should also auto-generate --help, and support auto-completion. We will need help from users.

Structured HTML templating like Markaby

Thanks to technomancy for pointing out Ruby's Markaby templating in a debate about string templating on lobste.rs.

I think that string templating with escaping will be common in YSH. But, since YSH can interleave code and data like Lisp, it can also express more "structured" solutions.

table id=$myid {  # start HTML-like data
  thead {
    tr {
      td class='x' { 'Name' }
      td           { 'Age' }
    }
  } 
  for p in (people) {  # interleave arbitrary code with data
    tr {
      td { $[p.name] }
      td { $[p.age] }
    }
  }
}

In contrast, structured solutions in Python are awkward because there isn't a clear way to call back into arbitrary code.

So this is basically what Markaby does, but I think it's a bit cleaner! (Or at least this fictional, non-running code looks clean.)

Summary and Conclusions

That was fourteen different use cases for blocks!

So clearly we can't do all this ourselves. YSH has to be a language extensible by users. It needs a smaller "core".

TODO List

Building on the Oils 2023 Roadmap, here's a rough list of things to do:

Funcs, procs, and blocks:

APIs using blocks:

Data languages:


Looking back on this list, it's pretty concrete (although I probably underestimated the translation work). If these features were the only things left, we could implement them in short order. Using typed Python and ASDL is pleasant and productive.

But I still need to write "a month of docs", overhaul the help builtin, produce some kind of demo for the headless shell, and more.

It also feels like the C++ runtime may be put on the back burner, which is unfortunate. I want to look at hash tables and string interning, since that performance issue recently came up with ble.sh.

"Shelling Out" Is Still Idiomatic

I want to conclude this post by reminding readers that composition with OS processes is still the main idiom in YSH, and gives it much of its power.

For example, I like using Ninja with shell. You don't need to write everything in YSH with funcs and blocks.

This is a main point of #software-architecture posts and the narrow waist idea. Every language supports the Unix process interface — whether they like it or not!

The next post will return to these ideas. They help us design YSH and the systems we build with it.

Appendix

Review of Awk and Make After 6 Years

I mentioned awk and make above, so here are comments about my experiments with them.

To summarize:

2023-11 Update: 2 More Use Cases, Giving 16

We're deep in the middle of YSH language design, so here are two more use cases.

Docker-like layers. We use more than 10 Docker images in our Soil continuous build (and Red Hat's podman runtime, for diversity). It would be nice to have more control over layer sharing, which would result in faster image transfer, reduced disk usage, and possibly faster incremental builds. Copying Docker images is expensive for us.

Image oilshell/soil-app-tests {

  FROM oilshell/soil-common

  COPY deps/from-apt.sh /home/uke/tmp/deps/from-apt.sh

  Layer apt {
    mount var-cache-apt {
      type = 'cache'
      target = '/var/cache/apt'
      sharing = 'locked'
    }

    mount var-lib-apt {
      type = 'cache'
      target = '/var/lib/apt'
      sharing = 'locked'
    }

    proc run {
      deps/from-apt.sh layer-locales
      deps/from-apt.sh app-tests
    }
  }

  USER uke

  # ... more layers

}

(adapted from our own deps/Dockerfile.app-tests).

This isn't just a cosmetic change -- we can also use procs to factor code and share layers.
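Here's a rough sketch of that factoring (the apt-layer proc and its package arguments are illustrative, not a real API):

proc apt-layer (name, ...pkgs) {   # a reusable layer definition
  Layer $name {
    proc run {
      apt-get update
      apt-get install -y @pkgs
    }
  }
}

Image oilshell/soil-app-tests {
  FROM oilshell/soil-common
  apt-layer app-tests locales time   # share the same layer logic across images
}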

Exterior Schemas vs. Interior Types.

Schemas, type definitions, and IDLs (interface description languages) are common in distributed systems. Our Ruby-like block syntax is flexible enough to support them:

Type Person {
  field name (Str, id=1)
  field age (Int, id=2)
}

Evolution of exterior schemas is a huge design issue. It probably makes sense to have multiple type systems — to be "meta" with respect to types.

YSH is intended to "meet you where you are", e.g. with shell and JSON, rather than forcing new technology and constraints onto your systems.