source | all docs for version 0.8.1 | all versions | oilshell.org

Warning: Work in progress! Leave feedback on Zulip or Github if you'd like this doc to be updated.

Notes on Unicode in Shell

Philosophy

Oil's Unicode support is unlike that of other shells because it's UTF-8-centric.

In other words, it resembles newer languages such as Go, Rust, Julia, and Swift, as opposed to JavaScript and Python (despite its Python heritage). The latter languages use the notion of "multibyte characters".

In particular, Oil doesn't have global variables like LANG for libc or a notion of "default encoding". In my experience, these types of globals cause correctness problems.

List of Unicode-Aware Operations in Shell

${#s} -- length in code points
${s:1:2} -- offsets in code points
${x#?} and family (not yet implemented)
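
As a rough illustration of the difference between code points and bytes, here is a small sketch written for bash. The helper name byte_len is hypothetical, and the behavior described is bash's: under a UTF-8 locale bash reports a length of 5 and the slice "él" for this string, while forcing the C locale makes it count bytes. Per the notes above, OSH is meant to count code points without consulting such globals.

```shell
# "héllo": 5 code points, but 6 bytes in UTF-8 (é is 0xC3 0xA9).
# printf with octal escapes keeps the bytes explicit.
s=$(printf 'h\303\251llo')

# byte_len: hypothetical helper. It forces the C locale inside the
# function so that bash counts bytes instead of characters.
byte_len() {
  local LC_ALL=C
  echo "${#1}"
}

echo "length in the current locale: ${#s}"
echo "length in bytes: $(byte_len "$s")"
echo "slice at offset 1, length 2: ${s:1:2}"
```

The first and third lines of output depend on the locale the script runs in, which is exactly the kind of global state the Philosophy section argues against.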

Where bash respects it:

[[ a < b ]] and [ a '<' b ] for sorting
${foo,} and ${foo^} for lowercase / uppercase
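
A minimal bash sketch of these two operations follows. It sticks to ASCII input so the results do not depend on which locales are installed; the helper name c_order_less is hypothetical and exists only to pin the comparison to the C locale (plain byte order).

```shell
word='shell'
upper_first=${word^}          # uppercase the first letter
lower_first=${upper_first,}   # lowercase it again
echo "$upper_first $lower_first"   # prints: Shell shell

# c_order_less: hypothetical helper. The function body is a subshell,
# so forcing the C locale here does not leak into the caller.
c_order_less() (
  LC_ALL=C
  [[ $1 < $2 ]]
)

# In byte order, uppercase ASCII letters sort before lowercase ones.
c_order_less B a && echo 'B sorts before a in the C locale'
```

In other locales bash may order the same pair differently, and non-ASCII letters are converted by ${foo^} only when the locale knows about them.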

This is a list of operations that SHOULD be aware of Unicode characters. OSH doesn't implement all of them yet, e.g. the glob-based operations.

Length operator counts code points: ${#s}
- TODO: provide an option to count bytes.
String slicing counts code points: ${s:0:1}
Any operation that uses glob, because it has ? for a single character, character classes like [[:alpha:]], etc.
- case $x in ?) echo 'one char' ;; esac
- [[ $x == ? ]]
- ${s#?} (remove one character)
- ${s/?/x} (note: this uses our glob to ERE translator for position)
printf '%d' \'c where c is an arbitrary character. This is an obscure syntax for ord(), i.e. getting an integer from an encoded character.
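
The glob and printf items above can be sketched in a few lines of shell. The function names ord and is_one_char are hypothetical, chosen only for this illustration. The ASCII cases behave the same everywhere; for the two-byte character the result depends on the shell and, in bash, on the locale, so no output is claimed for those lines.

```shell
# ord: hypothetical helper wrapping the obscure printf syntax.
# A leading quote makes printf print the numeric value of the
# character that follows it.
ord() { printf '%d' "'$1"; }

# is_one_char: hypothetical helper. Succeeds if $1 matches the
# glob ?, i.e. exactly one character as the shell sees it.
is_one_char() {
  case $1 in
    ?) return 0 ;;
    *) return 1 ;;
  esac
}

ord A; echo                           # prints 65
is_one_char x && echo 'one char'      # prints: one char

# U+00E9 (é) encoded as two UTF-8 bytes.
e_acute=$(printf '\303\251')

# A Unicode-aware shell should treat this as a single character.
# A byte-oriented shell or locale sees two bytes instead.
ord "$e_acute"; echo
is_one_char "$e_acute" && echo 'one char' || echo 'not one char'
```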

List of operations that depend on the locale (not implemented):

String ordering: [[ $a < $b ]] -- should use current locale? TODO: compare with sort command.
Lowercase and uppercase operators: ${s^} and ${s,}
Prompt string has time, which is locale-specific.
In bash, printf also has time.
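
To make the printf case concrete, here is a small sketch that assumes bash 4.2 or later, which added the %(...)T format. The helper name epoch_weekday is hypothetical. It formats time 0 (the Unix epoch) in UTC, so the only thing that varies is the locale passed in.

```shell
# epoch_weekday: hypothetical helper. Runs in a subshell so the
# locale and time zone settings do not affect the caller.
epoch_weekday() (
  export LC_ALL="$1" TZ=UTC
  printf '%(%A)T\n' 0
)

epoch_weekday C    # prints Thursday
```

Passing another locale name, if it is installed on the system, would print the weekday in that language, which is why this operation belongs on the locale-dependent list.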

Other:

The prompt width is calculated with wcswidth(), which doesn't just count code points. It calculates the display width of characters, which in general differs from the number of code points.

Generated on Mon Sep 28 00:26:23 PDT 2020