blog | oilshell.org

Oils 0.21.0 - Flags, Integers, Starship Bug, and Speed

2024-03-21

This is the latest version of Oils, a Unix shell. It's our upgrade path from bash to a better language and runtime:

Oils version 0.21.0 - Source tarballs and documentation.

If you're new to the project, see the Oils 2023 FAQ and posts tagged #FAQ.


Big news! As of this release, the C++ tarball passes all spec tests. It's 2x to 50x faster than the Python tarball.

I will have a lot to say about this, but the takeaway is that I now recommend using and testing the C++ tarball. See README-native.txt in oils-for-unix-0.21.0.tar.gz.

The release page also emphasizes that the project is called Oils or "Oils for Unix". It's no longer "Oil". I haven't done a good job of renaming everything (including this site), so I understand the confusion!

Table of Contents
Intro
Contributions
Breaking Changes
Flag Parsing API
Built on top of the ctx builtin
Review: Three language features we needed
Integers - Don't do whatever Python or C++ does
Division
64-bit Integers
Starship Prompt Bug - Pipelines crash after using it
Related zed/zsh bug
Three Features Interact
Performance Mystery - Why can't we celebrate?
Using Every Performance Tool
Here Docs
Summary and Roadmap
More Work Under the Hood
Closed Issues
Conclusion
What's Next?
Appendix: Related Articles
Appendix: Metrics for the 0.21.0 Release
Spec Tests
Benchmarks
Code Size

Intro

Shouldn't this announcement be all about how we finished translating Oil to C++? It's a big milestone!

I planned to write something like Oils 0.21.0 - Two Megabytes of Pure, Native Code.

But I'm not "declaring victory" yet. Despite being 2x to 50x faster, the C++ translation is 5-10% slower than bash and dash on a realistic configure workload.

I'll talk about the mystery at the end of this long post, which goes like this:

  1. A new flag parser, built on a new ctx builtin.
  2. Integers
  3. The "Starship Bug"
  4. Performance Mystery

Also:

  1. Compatibility Fixes, for OSH
  2. Breaking Changes, for YSH
  3. New Docs
  4. Pleasant improvements under the hood

Contributions

Before diving into Flags, Integers, Starship Bug, and Speed, let's credit contributors.

Also:

Breaking Changes

I try to call out breaking changes in each release. OSH is stable, but YSH is still under development.

(1) Removed read --line.

This is because the buffered version is incorrect, and the unbuffered version is slow. You can use shell's read -r line for now, which is unbuffered.

Also see

(2) Rename toJ8() / fromJ8() to toJson8() / fromJson8()

This is more accurate because J8 Notation is a set of data languages, which includes JSON8.


(3) Tilde substitution is respected in combination with brace expansion, like ~{alice,bob}.

This is technically a breaking change, but it improves compatibility:

The "Closed Issues" section at the end of this post links to more fixes, as does the raw change log.


Now let's talk about Flags, Integers, Starship Bug, and Speed.

Flag Parsing API

You should be able to easily parse command line flags in a shell! POSIX shell has a weak getopts builtin, and Python has a nice but verbose argparse API.

Here's what Python looks like:

import argparse

parser = argparse.ArgumentParser()

parser.add_argument('filename')           # positional argument
parser.add_argument('-c', '--count')      # option that takes a value
parser.add_argument('-v', '--verbose',
                    action='store_true')  # on/off flag

args = parser.parse_args()
print(args.filename, args.count, args.verbose)

We can emulate this "OOP builder" style in YSH, but I find it too clunky:

var parser = ArgumentParser()

call parser->addArgument('filename')  # call keyword for expressions
call parser->addArgument('-c', '--count')
call parser->addArgument('-v', '--verbose',
                         action='store_true')

var args = parser=>parse_args()
echo "$[args.filename] $[args.count] $[args.verbose]"

Here's our new API, which is still in progress:

parser (&spec) {                # describe the CLI syntax
  arg filename                  # positional argument
  flag -c --count (type=Int)    # flag (TODO: builtin type Int)
  flag -v --verbose
}

var args = parseArgs(spec, ARGV)
echo "$[args.filename] $[args.count] $[args.verbose]"

What do you think of this style? It's a bit like Hay Ain't YAML: we're reusing the syntax of shell for declarative data. This means you can also mix code in:

parser (&spec) {
  flag --first
  if (debug) {
    flag --other
  }
}

Built on top of the ctx builtin

What I really like about the flag parser:

  1. It's implemented mostly in YSH, except for a low-level ctx builtin.
  2. It was implemented based on a #language-design discussion

Why is this good? First, there's a lot of variation in CLI syntax, so we want you to be able to write your own flag parsers in YSH. We provide something that should handle 90% of cases, but YSH is powerful enough to handle the remaining 10%.

Second, having other people involved makes YSH feel more real! I'm also glad that my Zulip ramblings were actually implementable, and there wasn't a hidden flaw.

Review: Three language features we needed

I've been saying that we're going for the "whole enchilada" with YSH — a language that's powerful enough to write libraries and internal DSLs.

So what did it take to implement declarative flag parsing in YSH?

  1. value.Place - from Oils 0.19.0 - Dicts, Procs, Funcs, and Places
  2. Procs with block args - mentioned in the same announcement
  3. The ctx builtin

Integers - Don't do whatever Python or C++ does

Division

I re-implemented the / and % operators so that they don't simply delegate to / and % in Python and C++. In particular, we don't inherit the behavior for negative operands, since Python and C++ differ!

Specifically:

(1) Integer division rounds toward zero (not negative infinity)

(2) The % operator disallows a negative divisor, because it's confusing, and arguably not useful.

ysh$ = -5 % 2
(Int)   -1

ysh$ = -5 % -2
  = -5 % -2
       ^
[ interactive ]:3: fatal: Divisor can't be negative

This means that YSH is different than Python, which chooses to round toward negative infinity:

Related story:

We could provide a separate divrem() with Python-like semantics, but I don't think these operations are common in shell. I think divrem() is a better name than Python's divmod(), because:

64-bit Integers

Like the / and % operators, the size of integers also depended on the host language prior to this release:

That was obviously bad, so we now use a mylib::BigInt type for shell operations, which is distinct from C int. It's currently 64 bits, even on 32-bit platforms, but I'd like to make it arbitrary precision.


Nevertheless, Oils already behaves better than other shells. I wrote a little demo here:

In each language, we generate a program to print the literal 11112222333344445555666677778888999, which is larger than 64 bits.

Look at all this exciting behavior!

$ demo/survey-arith.sh bigint

python3
11112222333344445555666677778888999

node.js
1.1112222333344446e+34

Lua
1.1112222333344e+34

Perl
1.11122223333444e+34

awk
11112222333344446376450007752507392

dash
9223372036854775807

bash
8402588706881157415

mksh
-77842137

zsh
zsh:1: number truncated after 20 digits: 11112222333344445555666677778888999
-7334521740365106061

_bin/cxx-dbg/osh
  echo $(( 11112222333344445555666677778888999 ))
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ -c flag ]:1: fatal: Invalid integer constant '11112222333344445555666677778888999'

Summary:

Starship Prompt Bug - Pipelines crash after using it

This bug was reported by several people. It was most common after initializing the Starship prompt, but it also happened without it. The symptom was either these job control error messages:

$ echo foo | head
osh: parent failed to set process group for PID 42996 ...
osh: child failed to set process group for PID 42996 ...
foo

Or a crash like this:

$ echo hi|head
oils I/O error (main): No such process

If you don't care about a single hairy bug that's already fixed, you can skip the whole section below, and move on to the performance mystery.

But I'm writing it up because it touches on #shell-runtime topics that I haven't explained on this blog.

Related zed/zsh bug

It's hard to understand the Starship bug without that background, but luckily Thorsten Ball just explained a similar bug on his blog.

This other bug is: when the Zed editor shells out to zsh -i, why does Ctrl-C stop working in Zed? This is surprising because you wouldn't expect a child process to be able to mess up its parent.

Just like the Starship bug, the zed/zsh bug touches on POSIX job control, and the syscalls you need to implement it.

Thorsten dug deep to the root cause: a "missing" job control syscall after execve(). But the question is why shells behave like that.


I wrote three long comments explaining why. I won't repeat them here, but the outline is:

(If the lobste.rs comments aren't interesting, again feel free to skip to the next section on performance.)

Three Features Interact

The Starship bug is different, but it's another interaction between features in the #shell-runtime. It's actually an interaction between three features, one of them being job control:

  1. OSH pipelines behave like zsh pipelines: perhaps running the last process in the shell itself (shopt -s lastpipe)
  2. Job control
  3. OSH implements the bash-like DEBUG trap that runs before commands, and Starship uses it.

The interaction was that user-provided code in the DEBUG trap could start and wait() for other processes in between the first and last part of the pipeline. This caused a stray getpgid() call to return ESRCH errno, causing the shell to "crash".

There were a few fixes:

There are some things that are not entirely satisfying about this. But if you read the lobste.rs threads above, you'll see why there don't appear to be clean solutions. Shells, and even kernels, are a bit under-specified.

I would like to explore more exhaustive testing/reasoning strategies in the shell runtime, just like we have exhaustive reasoning with Zephyr ASDL and re2c in the front end.


Nonetheless, it now works:

starship works

OK, so that was a big hairy bug that took several days to diagnose and fix. Now we have a hairy performance problem that also took several days to diagnose! Writing a shell is hard, at least if you want it to be good.

Performance Mystery - Why can't we celebrate?

I was really happy after fixing the last C++ translation bug. What a long journey!

But I was dismayed that there was a lingering performance issue: Running Python's configure may take 11.3 or 11.4 seconds with dash or bash, but 12.9 seconds with OSH.

This is a 12-14% slowdown, which is actually worse than the 5-6% we see in the CI (for unknown reasons).

You could argue: who cares about 1.6 seconds? But there's a mystery here, and I want to get to the bottom of it.


By the way, I haven't explained all the C++ translation bugs yet. I'll summarize these threads when we "declare victory":

Using Every Performance Tool

I sometimes tell people that one reason I started this project was so I could use every code quality tool available, including performance tools.

It's honestly fascinating to learn what you need to speed "naive" Python code up by more than 50x, by translating it to predictable C++.

So this section is going to show the tools I used, since I think that will be interesting to readers. But I won't explain everything, to save space. Original thread:


First, I ran the perf sampling profiler, which is able to profile across processes:

#
# Overhead       Samples  Command     Shared Object            Symbol
# ........  ............  ..........  .......................

     2.44%          3241  ld          libbfd-2.40-system.so    [.] bfd_link_hash_traverse
     1.45%          1908  cc1         [kernel.kallsyms]        [k] 0xffffffffb34bf8f0
     1.42%          1871  ld          libbfd-2.40-system.so    [.] bfd_elf_link_add_symbols
     1.26%          1740  cc1         [kernel.kallsyms]        [k] 0xffffffffb347a135
     1.22%          1805  osh         [kernel.kallsyms]        [k] 0xffffffffb34bf8f0
     1.19%          1614  cc1         cc1                      [.] _cpp_lex_direct
     1.14%          1658  osh         [kernel.kallsyms]        [k] 0xffffffffb34dded0
     1.09%          1484  cc1         cc1                      [.] ht_lookup_with_hash
     1.08%          1415  ld          libc.so.6                [.] __memmove_evex_unaligned_erms
     1.00%          1341  cc1         cc1                      [.] ira_init_register_move_cost
     0.98%          1326  cc1         cc1                      [.] ira_init
     0.97%          1289  ld          libbfd-2.40-system.so    [.] bfd_hash_lookup
     0.96%          1385  osh         oils-for-unix            [.] MarkSweepHeap::MaybeMarkAndPush

(This is benchmarks/perf.sh profile-cpython-configure - source)

The way I read this is:

I zoomed into the OSH process:

    17.25%          1118  osh      [.] MarkSweepHeap::MaybeMarkAndPush
    11.48%           737  osh      [.] MarkSweepHeap::TraceChildren
     8.22%           543  osh      [.] state::Mem::GetExported
     7.08%           511  osh      [.] MarkSweepHeap::Allocate
     5.75%           355  osh      [.] MarkSweepHeap::Sweep
     1.93%           132  osh      [.] posix::execve

We've seen this problem a lot: allocation and garbage collection are expensive!

It's not the bottleneck in this problem, since osh itself is taking so little time. That is, 40% of 5% is only 2%. Less than 2% of the configure time is spent in Oils memory management.


Don't ask me why, but I then got side-tracked fixing allocation issues, despite what's clear in the profile.

To profile allocations, I've found callgrind very useful. Rather than sampling what the CPU is doing like perf, it traces every function call. Then you can load the output into the kcachegrind GUI and click around:

callgrind one-byte string allocs

(This is benchmarks/autoconf.sh measure-callgrind - source)

We also use other tools for precise tracing: uftrace for measurement, and the R language for analysis. It's easier to put in the CI than the GUI, and the reports are custom:

                      task      ex.compute-fib  ex.bashcomp-excerpt         parse.abuild
              total_bytes_           2,206,788            4,025,766            4,173,385
               num_allocs_              70,159              117,374              102,837
      sum_typed_strs_slabs              70,159              117,649              102,837
         num_reserve_calls              32,497               64,900               61,229

       percent_list_allocs               33.1%                21.0%                29.4%
       percent_slab_allocs               23.7%                24.0%                22.6%
     percent_string_allocs                6.3%                24.8%                 8.4%
percent_other_typed_allocs               36.9%                30.4%                39.6%

(This is benchmarks/uftrace.sh)

Still, clicking around in the GUI is nice when you're in the middle of debugging something unknown. (I do wish I could use one tool instead of two.)


I found 3 hotspots with more than 400K allocations of one-byte strings, out of 2.3M total. I used a mycpp trick to turn them into integers, eliminating all those GC objects. (We're approaching the point where we are not "hilariously unoptimized"!)

This was a nice win on some workloads, but it was not the bottleneck in the configure workload!

If it's not garbage collection, why are we slow? I had a hunch to add a call to the times builtin at the end of the configure script.

==> _tmp/elapsed/bash.times.txt <==
0m0.213s 0m0.477s
0m8.233s 0m2.931s

==> _tmp/elapsed/dash.times.txt <==
0m0.217s 0m0.463s
0m8.281s 0m2.922s

==> _tmp/elapsed/osh.times.txt <==
0m0.360s 0m0.720s
0m8.790s 0m2.960s

The times builtin has a terrible format, which goes like this:

[shell process - user space time]   [shell process - system/kernel time]
[child processes - user space time] [child processes - system/kernel time]

(Contribution idea: add times --tsv8 to Oils)

So the configure workload only spends 360 ms in OSH user space, which is everything our interpreter does, including garbage collection.

But we spend 8.79 seconds in child processes, versus ~8.2 for bash and dash. This 500-600 ms difference is closer to the 1.6 second difference in elapsed time.


Why would child processes take longer under OSH? I had some random theories, like dynamic linking against libstdc++, which shells written in C don't do.

Here Docs

But the problem was clarified even more by our trusty tool strace. I wrote benchmarks/autoconf.sh measure-syscalls, which runs strace -ff, spewing thousands of files:

Total processes:

~/git/oilshell/oil/_tmp/strace$ ls bash.txt* | wc -l
6022
~/git/oilshell/oil/_tmp/strace$ ls dash.txt* | wc -l
6063
~/git/oilshell/oil/_tmp/strace$ ls osh.txt* | wc -l
6336

OSH is starting >300 more processes? Huh?

I actually optimized that, as explained in the lobste.rs thread above. In our test/syscall.sh harness, OSH starts fewer processes than bash, dash, zsh, and every other shell except for yash (from Japan).

I further narrowed it down to processes that don't call execve():

~/git/oilshell/oil/_tmp/strace/all$ ls bash*|wc -l
1723
~/git/oilshell/oil/_tmp/strace/all$ ls dash*|wc -l
1760
~/git/oilshell/oil/_tmp/strace/all$ ls osh*|wc -l
2039

So those 300 processes are not starting executables like cc1 or ld? What's going on?


Well, I know exactly when we fork processes, and HereDocWriter is one of them! I wrote about how shells implement here docs many years ago, when we were younger, the air was cleaner, and the sun brighter:

[bash, zsh, and mksh use temp files], which surprised me.

dash does something more expected and elegant, which is to start cat with one end of a pipe() as stdin, rather than a temp file. Strings longer than PIPE_SIZE will cause write() to block, but I think that just requires a little extra care in the implementation.

So OSH follows dash, forking a process for each here doc, rather than writing it to a temp file and reading it back in. The HereDocWriter and the main shell communicate through a pipe().

And the configure script has a bunch of here docs (C code snippets), so I think this must be the root cause of the slowdown. I had thought that temp files would be slower, but now I can see that there's probably forking and context switch overhead.

Why isn't dash slower too? Not sure, but it could relate to the PIPE_SIZE optimization.


Here's a little intuition. Forking zero processes:

$ time for i in {1..1000}; do true; done

real    0m0.003s
user    0m0.003s
sys     0m0.000s

Forking a thousand processes, without execve(). The ( true ) line is a subshell:

$ time for i in {1..1000}; do ( true ); done

real    0m0.702s
user    0m0.465s
sys     0m0.270s

So each fork() takes ~0.7 milliseconds in bash.

Summary and Roadmap

I want to fix this bug soon, and evaluate performance again. But the bottom line is that I'm fairly sure that we'll meet the performance of bash and dash, with a codebase in typed Python. That's pretty cool, because the codebase is small.

And we may even exceed their speed. A promising sign is that OSH makes fewer syscalls than bash or dash on the configure workload, not just the synthetic workloads. This is despite the here doc problem.

But Oils in C++ is already much faster than Oils in Python, so I recommend using it as of this release. For example, Oils in Python takes 27.7 seconds running CPython configure, and we got all the way down to 12.9 by translating the interpreter to C++!

You can see this from the OSH Runtime benchmarks, which I review in the appendix, as always.


Near- and medium-term work:

More Work Under the Hood

We're almost done with this announcement! Let me mention these things, which will come up in the future:

These files used to be crappy shell scripts, and now they're pretty decent. They find bugs, and they'll help improve our error catalog.

$ wc -l test/*-errors.sh
  868 test/parse-errors.sh
 1131 test/runtime-errors.sh
 1406 test/ysh-parse-errors.sh
  899 test/ysh-runtime-errors.sh
 4304 total

Their output is published on the release quality page.

Closed Issues

A subset of the work we did:

#1854 Brace expansion combined with tilde expansion should behave like bash and zsh, not mksh
#1846 `try eval` is inconsistent
#1826 ysh-0.20.0$ echo b'byte \xff' #=> Traceback
#1825 single backslash \ at end of line could be disallowed within $'...' (and maybe others)
#1822 Simple pipe crashes after initialising starship
#1785 try { } should be the only form for simplicity
#1784 'const c' without null on RHS can be re-declared
#1762 Disallow proc / func keywords in OSH to avoid misleading syntax errors
#1342 / and % operators behave like Python and not bash

Conclusion

This post had something for everybody!

  1. Language and API design - Flag parsing
  2. Language semantics - Integers
  3. Two hairy bugs, which give a peek into the #shell-runtime
  4. Demos of performance tools

What's Next?

Done! Let me know what you think in the comments!

Appendix: Related Articles

These great conversations inspired the pretty printing design doc:

I also think we need to explain UTF-8, since comments over many months show great confusion:

Appendix: Metrics for the 0.21.0 Release

Let's compare this release with the previous one, version 0.20.0.

Spec Tests

A little progress on OSH:

We finished the C++ translation, and blew past the mark! :-)

The difference is that the C++ version correctly allows NUL bytes in strings, while the Python version inherits its behavior from CPython.


New YSH features are reflected here:

Again, we blew past the mark with the C++ translation:

The difference is again inheriting integer behavior from CPython. We may smooth over this difference at some point.

Benchmarks

I did some frontend refactoring, which may have caused a slight slowdown:

However, I already optimized the parser some more after this release.

The Fibonacci benchmark is stable:

After resolving the big mystery above, we are going to close this gap:

Code Size

Oils is still a small program in terms of source code:

And generated C++:

And compiled binary size: