Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.

Notes on Unicode in Shell

Table of Contents
Philosophy
A Mental Model
Program Encoding
Data Encoding
List of Unicode-Aware Operations in Shell
Tips
Implementation Notes

Philosophy

Oil is UTF-8 centric, unlike bash and other shells.

That is, its Unicode support is like that of Go, Rust, Julia, and Swift, as opposed to JavaScript and Python (despite Oil's Python heritage). The former languages use UTF-8, and the latter have the notion of "multibyte characters".

A Mental Model

Program Encoding

Shell programs should be encoded in UTF-8 (or its ASCII subset). Unicode characters can be encoded directly in the source:

echo 'μ'

or denoted in ASCII with C-escaped strings:

echo $'[\u03bc]'

(Such strings are preferred over echo -e because they're statically parsed.)
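As a rough check that a literal character and an ASCII-only spelling denote the same data, the sketch below compares the two byte for byte. It assumes the script file itself is saved as UTF-8. Because how a shell expands \u inside $'...' can vary with the shell and locale, this sketch spells out the UTF-8 bytes of U+03BC with printf octal escapes instead; that substitution is an illustration, not something this doc prescribes.

```shell
# U+03BC (Greek small letter mu) is encoded in UTF-8 as the bytes 0xCE 0xBC,
# which are 316 and 274 in octal.
literal='μ'                        # encoded directly in the UTF-8 source file
escaped=$(printf '\316\274')       # the same two bytes, written in pure ASCII

if [ "$literal" = "$escaped" ]; then
  echo 'same bytes'
else
  echo 'different bytes'
fi
```

If the file is not UTF-8 encoded, the literal will contain different bytes and the comparison will fail, which is one reason to keep shell programs in UTF-8.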

Data Encoding

Strings in OSH are arbitrary sequences of bytes, which may be valid UTF-8. Details:
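One way to see the byte-oriented view is to inspect strings with tools that count and dump bytes regardless of locale. The helpers below, byte_count and hex_bytes, are hypothetical names made up for this sketch; they are not part of OSH or bash, and they rely only on the POSIX utilities printf, wc, od, and tr.

```shell
# Hypothetical helper: number of bytes in the first argument.
byte_count() {
  # wc -c counts bytes; tr strips the padding some wc implementations add.
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: the bytes of the first argument as lowercase hex.
hex_bytes() {
  # od -An -tx1 prints one hex pair per byte; tr removes spaces and newlines.
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

echo "$(byte_count 'μ')"     # one character, two bytes
echo "$(hex_bytes 'μ')"      # the UTF-8 encoding of U+03BC

# A string can also hold bytes that are not valid UTF-8, such as a lone 0xFF.
invalid=$(printf '\377')
echo "$(hex_bytes "$invalid")"
```

Since both helpers work on raw bytes, they should give the same answer whatever the locale is set to, unlike operations that count characters.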

List of Unicode-Aware Operations in Shell

Where bash respects it:

Locale-aware operations:

Other:

Tips

Implementation Notes

Unlike bash and CPython, Oil doesn't call setlocale(). (Although GNU readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most distros. If not, then some string operations will support UTF-8 and some won't.

For example:

TODO: Oil should support LANG=C for some operations, but not LANG=X for other X.


Generated on Tue Jun 21 17:20:28 EDT 2022