A Problem with Type Checking

2016-11-30

For the oil prototype, I used Python 3 rather than Python 2 to take advantage of both type annotations and type checking with mypy. The mypy type system seems fairly similar to the C++ type system, so these annotations should help when porting the parser, whether by hand or automatically.

Type annotations would also describe the all-important AST more clearly. Though I lean on the side of dynamic languages, I admit that prior to type annotations, Python offered you essentially no help in the task of structuring and describing an AST. It looks particularly bad if you compare it to ML-stye languages.

Unfortunately, before mypy provided any value, I hit a snag:

$ mypy osh/lex.py
osh/lex.py:51: error: "Id" has no attribute "Lit_EscapedChar"
osh/lex.py:52: error: "Id" has no attribute "Ignored_LineCont"
osh/lex.py:60: error: "Id" has no attribute "VSub_Name"
... more than 100 errors ...

This is because the core/id_kind.py file, which I wrote about yesterday as the backbone of the AST, uses metaprogramming. Specifically, it's a simple form of dynamic code generation: construct a variable name as a string, and then call setattr(Id, id_name, id_value) and setattr(Kind, kind_name, kind_value) to attach constant values to an "enum".

I use this technique not only for consistency and readability, but also:

to annotate boolean [[ operator tokens with operand types: e.g. -eq takes integers; -z takes a string; -f takes a path
to annotate IDs with raw characters, e.g. Ignored_LineCont goes with '\\\n'.

I like that Python lets me do these things dynamically, using Python as its own meta-language. Unfortunately, mypy restricts Python and removes this benefit.

A Possible Solution

It appears that I can silence these errors by generating mypy stubs. But I would prefer a solution with a better user experience, i.e. one that doesn't require modifying a working program purely to satisfy the type checker.

I found an interesting mechanism in Bob Nystrom's blog post Type Checking a Dynamic Language. In it, he writes:

Magpie is primarily a dynamic language, so it follows in Python’s footsteps. There is no special definition/statement dichotomy, and classes are created, extended, and modified imperatively at runtime.

Then he describes the evaluation model like this:

Evaluate the scripts dynamically
Type-check.

Once the interpreter has finished evaluating the scripts, it looks in the global scope to see if you’ve defined a function called main(). Doing so is the trigger that says, "I want to type-check."

Run main()

In other words, code run at the top level is dynamic metaprogramming, and then code called from main() is statically checked.

I like this approach a lot, and as far as I know, it's novel. I can also see other use cases for it. A few years ago, I wrote a toy protobuf library in Python that uses metaprogramming instead of textual code generation, and it would benefit from exactly this model.

I speculate that it would also be useful for typed data analysis in the style of the R language, where the names of fields and their types may come from data, and thus can't be known at "compile-time".

Lessons for Oil

Shell is neither statically- nor dynamically-typed because it's untyped. To a first approximation, everything is a string.

Although oil borrows its semantics from shell, it will have at least these types:

Str -- a string
Array -- because arrays of strings are fundamental to the problem domain
Int -- because bash uses arithmetic expressions in nine places
Bool -- because the boolean expression language [[ is part of the shell language

And because it has the philosophy of catching errors early (static parsing), optional type checking is a natural fit.

In summary, this post showed a downside of static checking: state of the art type checkers can't understand an elementary metaprogramming pattern without help. I hope to use the idea from Magpie in the design of the oil language itself.