I haven't published a post in several weeks, but the Oil project continues!
This post started as a free-form project update, but it ended up focusing on recent work to break free of the CPython interpreter.
Let's review the project status before getting into those details.
After the FAQ, these two posts summarize the current state of OSH:
Success with Aboriginal, Alpine, and Debian Linux (January 2018). OSH can run real, hairy shell scripts. If I hadn't achieved this milestone in 21 months, I might have abandoned the project! Oil is a big project, and I'm feeling that right now.
Running Bash Completion Scripts with OSH (October 2018). This is the latest "user-facing" milestone. I discovered that completion scripts are written in a different shell dialect than the difficult Linux scripts above are. OSH can run parts of completion scripts, but there's still more work to do.
Why is this post #7? I wrote Roadmap #5 last year, but I mentioned a new roadmap in August.
So I skipped #6. And I don't think there will be more roadmap posts: they got us to the January milestone, but the project now has more design choices that can't be planned out ahead of time.
I've only published 4 posts since the summer. These drafts are unpublished:
The rest of the lexer series. Unfortunately I'm abandoning this for now. I was eager to publish it because several language implementers told me that they changed their lexers based on the first two posts. And Jacquin at RC read the draft and got something out of it.
But I have to cut something, and this isn't essential to completing the shell.
Book Reviews. I mentioned this idea while at Recurse Center. These drafts are newer than the lexer series, so I hope to publish them. I want to introduce some higher-level ideas in the blog.
In my mind, shell isn't just a program you type commands into. It's also about distributed software architecture. I addressed that in this lobste.rs comment on "systems" programming.
Shell Implementation Difficulties. An outline of the hardest problems I've solved. These include:
echo -e '\n' vs. $'\n'.
Waiter(): this is a nice abstraction that solves a tricky problem.
A Status Update which addressed:
I've followed through on releasing more often. They're all pre-releases because I want to make version 0.6.0 a release that's meaningful to users.
$PS1 expansion. PS1 is a special variable that specifies the interactive prompt. See notes below.
cd -L and cd -P. The difference has to do with symlinks.
${!myarray[@]} to get the keys of an array. (This means something completely different than ${!mystring}, which has also been a problem lately.)
Describing the semantics of $PS1 would be a good episode of #shell-the-bad-parts. (Or really "bash-the-bad-parts", because the issue is bash-specific.)
There are two separate steps to the expansion:
Replacing \ codes. For example, \h becomes your hostname. This is described in the bash manual.
Escaping $ as \$ in the value of any variables. This is necessary because the resulting string is parsed as code and re-evaluated, so that variables like ${debian_chroot} can be expanded.
Instead, bash should work like this:
${hostname} should evaluate to the host name in the context of $PS1.
There are no special \ codes.
Then there would be a single phase of expansion, and a single syntax for substitution.
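Here is a minimal Python sketch of the two-phase expansion described above. It is not bash's or Oil's actual implementation. The function names, the sample prompt, and the values passed in are all hypothetical, and only the ${name} form of substitution is handled.

```python
import re

def escape_dollars(value):
    # Escape $ so the second phase treats it as a literal character.
    return value.replace('$', '\\$')

def replace_backslash_codes(ps1, codes):
    # Phase 1: replace backslash codes such as \h with their values.
    # `codes` is a hypothetical dict mapping a code letter to its value.
    def repl(match):
        letter = match.group(1)
        if letter in codes:
            return escape_dollars(codes[letter])
        return match.group(0)  # leave unknown codes untouched
    return re.sub(r'\\([A-Za-z])', repl, ps1)

def expand_vars(text, env):
    # Phase 2: expand ${name}, and turn \$ back into a literal $.
    def repl(match):
        name = match.group(1)
        if name is None:  # the pattern matched an escaped dollar sign
            return '$'
        return env.get(name, '')
    return re.sub(r'\\\$|\$\{(\w+)\}', repl, text)

def expand_prompt(ps1, codes, env):
    return expand_vars(replace_backslash_codes(ps1, codes), env)

prompt = expand_prompt(
    r'${debian_chroot}\u@\h:\w> ',
    # The working directory deliberately contains text that looks like a
    # variable, to show why phase 1 must escape $ in substituted values.
    codes={'u': 'me', 'h': 'box', 'w': '/tmp/${HOME}'},
    env={'debian_chroot': '(sid)', 'HOME': '/home/me'},
)
print(prompt)  # (sid)me@box:/tmp/${HOME}>
```

Without the escaping step, the directory name would be re-evaluated in phase 2 and ${HOME} would be expanded by accident. With a single phase and a single syntax, that escaping would not be needed.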
Most other changes in these three releases have been under the hood.
Oil ships with a subset of the CPython interpreter, and I went to great lengths to cut down its size. In effect, I was auditing which parts of CPython we use.
Building Oil with the OPy Bytecode Compiler gives background on this strategy, and motivates it.
Let's explain the slimming-down of Oil through metrics. As usual, there's slightly more code in the core, to implement new features:
But we depend on far less of the Python standard library:
The runpy.py module is an unnecessary CPython implementation detail, and I replaced it with some C code. This broke many dependencies on the standard library, so we're shipping ~23K lines of Python total (including the 19K lines above, not in addition to them).
This is a great result! OSH can run real shell scripts, but it's still a small program.
Removing Python source also lets us remove native C code that implements certain Python features.
I reduced the lines of native dependencies by ~9K lines:
This still feels too big, but it doesn't tell the whole story. Due to build changes in this release, many of those ~130K lines are no longer used.
The larger drop in compiled code size shows this:
Here's another view with Bloaty McBloatFace, a nice tool that accounts for code size in ELF files:
The reduction has two components:
The .format() method, and thousands of lines of code that implement floating point to string conversions.
(I also enabled the compiler flags for removing unused code, e.g. --gc-sections. I'm not sure why this isn't done by default in GCC and Clang.)
I wrote a recursive-descent parser for PyMethodDef declarations, which look like this:
static PyMethodDef marshal_methods[] = {
{"dump", marshal_dump, METH_VARARGS, dump_doc},
{"load", marshal_load, METH_O, load_doc},
...
};
There's at least one of these structures for every object like stringobject.c and every module like posixmodule.c.
After parsing these definitions, I re-printed them in build/oil-defs/ with two filters:
Some metrics about the filtered methods:
OSH 0.6.pre5 used about 431 methods, and this process stripped it down to 128 methods. Most methods that we use are in posixmodule.c, e.g. posix.chdir(), which makes sense for a shell!
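The parse, filter, and re-print idea can be sketched in a few dozen lines of Python. This is an illustration under assumptions, not the parser used in the Oil build: the grammar is simplified, every helper name is hypothetical, and the {NULL, NULL, 0, NULL} sentinel row is the usual C convention rather than something shown in the snippet above.

```python
import re

# One token: a C string literal, a punctuation character, or a word/number.
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[{},|]|\w+)')

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError('unexpected input at offset %d' % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser(object):
    # Grammar (simplified):
    #   table := '{' (entry ','?)* '}'
    #   entry := '{' field (',' field)* '}'
    #   field := any tokens up to the next ',' or '}'
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise ValueError('expected %r, got %r' % (expected, self.peek()))
        self.i += 1

    def field(self):
        parts = []
        while self.peek() not in (',', '}', None):
            parts.append(self.peek())
            self.i += 1
        return ' '.join(parts)

    def entry(self):
        self.eat('{')
        fields = [self.field()]
        while self.peek() == ',':
            self.eat(',')
            fields.append(self.field())
        self.eat('}')
        return fields

    def table(self):
        self.eat('{')
        entries = []
        while self.peek() == '{':
            entries.append(self.entry())
            if self.peek() == ',':
                self.eat(',')
        self.eat('}')
        return entries

def filter_methods(entries, used):
    # Keep the methods in `used`, plus the NULL sentinel that ends the array.
    return [f for f in entries if f[0] == 'NULL' or f[0].strip('"') in used]

def print_table(entries):
    return '{\n%s\n}' % '\n'.join('  {%s},' % ', '.join(f) for f in entries)

SOURCE = '''{
  {"dump", marshal_dump, METH_VARARGS, dump_doc},
  {"load", marshal_load, METH_O, load_doc},
  {NULL, NULL, 0, NULL}
}'''

entries = Parser(tokenize(SOURCE)).table()
kept = filter_methods(entries, used={'load'})
print(print_table(kept))
```

Here the `used` set stands in for whatever list of methods the real filters consult.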
Since Oil still passes all its tests, and our test coverage is very high, I'm sure that all this code was unused. And I'm now more convinced that it's possible to write a small Python interpreter for Oil, which I'm calling OVM2. More on that later.
The file ceval.c in CPython is the core of the bytecode interpreter. It dispatches about 120 bytecodes like:
LOAD_FAST to load local variables
BINARY_ADD to do arithmetic on numbers, or concatenate strings
I reduced Oil's usage of certain bytecodes, so that we're using a simpler subset of Python. This was done in two ways:
Python supports import *, implemented with the IMPORT_STAR bytecode, but Oil doesn't use it.
Removing closures (LOAD_CLOSURE and MAKE_CLOSURE). I generally prefer explicit state with classes over closures.
Another example was rewriting the exponentiation operator in shell with multiplication:
$ echo $(( 2 ** 30))  # 1 GB
1073741824
It used to rely on Python's ** operator (the BINARY_POWER bytecode).
However, Python's implementation accepts arbitrary floats, while shell has only integers. Exponentiation on floating point numbers depends on thousands of lines of C code, so we can remove all of that.
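As a rough sketch of the idea (not the code in Oil, and with a hypothetical function name), integer exponentiation needs nothing more than a loop of multiplications:

```python
def int_pow(base, exponent):
    # Shell arithmetic only has integers, so negative exponents are rejected
    # in this sketch rather than producing a fraction.
    if exponent < 0:
        raise ValueError('exponent must be non-negative')
    result = 1
    while exponent > 0:
        result *= base
        exponent -= 1
    return result

print(int_pow(2, 30))  # 1073741824
```

Because it only uses integer multiplication, a loop like this never touches the float code paths behind Python's ** operator.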
Metrics:
I've further reduced this number below 80 on the master branch.
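One way to count the unique bytecodes a piece of Python code uses is the standard dis module. The sketch below assumes Python 3, where dis.get_instructions() is available. Opcode names vary between interpreter versions, so the names it prints won't match the Python 2 names quoted in this post, and it is not the tool behind the metrics here.

```python
import dis

def unique_opnames(code):
    # Collect the names of all opcodes in a code object, including the
    # code objects of nested functions and classes stored in its constants.
    names = set()
    for instruction in dis.get_instructions(code):
        names.add(instruction.opname)
    for const in code.co_consts:
        if hasattr(const, 'co_code'):
            names |= unique_opnames(const)
    return names

def add(a, b):
    return a + b

opnames = unique_opnames(add.__code__)
print(len(opnames), sorted(opnames))
```

Running the same function over every module in an application gives a rough picture of which opcodes an interpreter has to support.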
(Related reading: Floating Point to Decimal Conversion is Easy. Python's implementation doesn't look easy! Again, I removed all of this code.)
Not only did I reduce the native code size and number of unique bytecodes, I also reduced the total bytecode size:
This reflects the smaller amount of Python source described above, as well as removing Python docstrings.
Some Python docstrings are specified in C, and some are specified in Python. I removed the latter ones by adding an -emit-docstring=0 flag to the OPy bytecode compiler.
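To illustrate what dropping docstrings at compile time means, here is a small sketch using Python 3's standard ast module. This is an assumption-laden stand-in: OPy is a different compiler, the function name is hypothetical, and the sketch needs Python 3.8+ for ast.Constant.

```python
import ast

def compile_without_docstrings(source, filename='<stripped>'):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.Module, ast.ClassDef,
                                 ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body = node.body
        # A docstring is a bare string constant as the first statement.
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            # Keep the body non-empty so the tree still compiles.
            node.body = body[1:] or [ast.Pass()]
    ast.fix_missing_locations(tree)
    return compile(tree, filename, 'exec')

SOURCE = '''
def greet():
    """Return a greeting."""
    return 'hi'
'''

namespace = {}
exec(compile_without_docstrings(SOURCE), namespace)
print(namespace['greet'].__doc__)  # None
```

The string no longer appears in the compiled code object's constants, which is where the bytecode size saving comes from.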
(NOTE: The "bytecode size" column in the ovm-build metrics should be "architecture-independent files", as there are data files and .py source files in the app bundle.)
There are many tables of metrics in the links above. I should mention that they're manipulated and analyzed with the R language, and in particular the dplyr library.
dplyr provides general data manipulation "verbs" on data frames and is part of the Tidyverse by Hadley Wickham.
In this release, I expanded my usage to include stringr, a related library for string manipulation. Surprisingly, it has some string manipulation concepts that may be useful in shell and the Oil language. Hadley's libraries are an inspiration in composable API design.
I will write more about this later.
This post was a "catch-up" on several topics, but here's a concise takeaway:
Oil is now more like a C program, and less like a Python program.
I removed many parts of CPython that we're not using, in the interpreter itself as well as in the standard library.
This gives me confidence that Oil can be moved to a VM that we control, which I'm calling OVM2.
There is still a lot to talk about, but I'll leave it for Dev Log #8:
pystrcmp.c, pymath.c.
Replacing socket.gethostname() with our own copy. It's needed for \h in $PS1.
Using xrange() everywhere so we can remove range(). This is probably overly aggressive, but I really want to simplify OVM and eventually reimplement it.
Removing the os module in favor of the lower-level posix module. This removes a compatibility layer in Python that's unnecessary for a Unix shell.
Calls to os.chdir() are now posix.chdir(). The posix module is implemented in posixmodule.c, and os is a thin wrapper around it on Unix.
dtoa.c and pystrtod.c, which convert floats to strings and back.
Moving %.3f from Python to C. Python's % is more general than C's printf(), but we don't need all of its power in Oil.
The ** exponentiation operator from Python (mentioned above).
Already mentioned above:
'foo = {}'.format(foo).
The runpy module and its many dependencies.
Removing the os module in favor of the underlying posix module.
PyMethodDef in dozens of objects and modules, like listobject.c, import.c, and posixmodule.c.
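As a small illustration of the os versus posix point in the list above, here is a sketch that assumes CPython on a Unix system (the posix module does not exist on Windows). The helper name is hypothetical.

```python
import os
import posix

# On Unix, os re-exports the functions implemented in posixmodule.c,
# so both names refer to the same built-in function object.
print(os.chdir is posix.chdir)

def cwd_after_chdir(path):
    # Change directory using only the low-level posix module, report the
    # resulting working directory, then restore the original one.
    old = posix.getcwd()
    posix.chdir(path)
    try:
        return posix.getcwd()
    finally:
        posix.chdir(old)

print(cwd_after_chdir('/'))
```

Calling posix directly skips the portability layer in os.py, which a Unix-only shell has no use for.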