QSN: A Familiar String Interchange Format

QSN ("quoted string notation") is an interchange format for byte strings. Examples:

''                           # empty string
'my favorite song.mp3'
'bob\t1.0\ncarol\t2.0\n'     # tabs and newlines
'BEL = \x07'                 # byte escape
'mu = \u{03bc}'              # char escapes are encoded in UTF-8
'mu = μ'                     # represented literally, not escaped

It's meant to be emitted and parsed by programs written in different languages, as UTF-8 or JSON are. It's both human- and machine-readable.

Oil understands QSN, and other Unix shells should too.

Table of Contents
Why?
More Use Cases
Specification
A Short Description
An Analogy
Full Spec
Advantages over JSON Strings
Implementation Issues
Compatibility With Shell Strings
How does a QSN Encoder Deal with Unicode?
Design Notes
Example: set -x format (xtrace)

Why?

QSN has many uses. One is that Oil needs a way to display filenames in a terminal safely and readably.

Filenames may contain arbitrary bytes, including ones that will change your terminal color, and more. Most command line programs need something like QSN, or they'll have subtle bugs.

For example, as of 2016, coreutils quotes unusual filenames to avoid this problem. However, the quoting format isn't specified precisely enough to be parsed back. In contrast, QSN can be both printed and parsed, like JSON.

More Use Cases

Specification

A Short Description

  1. Start with Rust String Literal Syntax
  2. Use single quotes instead of double quotes to surround the string. This is mainly to avoid confusion with JSON.
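
To make the two rules above concrete, here is a minimal decoder sketch in Python. It is not the Oil implementation. It assumes the escape set borrowed from Rust string literals (\n, \r, \t, \0, \\, \', \", \xHH, \u{...}), allows \x escapes for any byte value as in the \xff example elsewhere in this document, and rejects \0 followed by a digit so that octal-looking escapes are errors. The names qsn_decode, _SIMPLE and _TOKEN are hypothetical.

```python
import re

# Assumed one-character escapes, following Rust string literals.
_SIMPLE = {
    "n": b"\n", "r": b"\r", "t": b"\t", "0": b"\0",
    "\\": b"\\", "'": b"'", '"': b'"',
}

_TOKEN = re.compile(
    r"""\\x([0-9a-fA-F]{2})         # byte escape, e.g. \x07
      | \\u\{([0-9a-fA-F]{1,6})\}   # code point escape, e.g. \u{3bc}
      | \\(.)                       # one-character escape, e.g. \n
      | ([^\\'])                    # ordinary character
    """,
    re.VERBOSE | re.DOTALL,
)


def qsn_decode(s: str) -> bytes:
    """Decode a single-quoted QSN string into the bytes it denotes."""
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError("QSN strings are surrounded by single quotes")
    body = s[1:-1]
    out = bytearray()
    pos = 0
    while pos < len(body):
        m = _TOKEN.match(body, pos)
        if m is None:
            raise ValueError("invalid QSN at offset %d" % pos)
        hex_byte, code_point, simple, literal = m.groups()
        if hex_byte is not None:
            out.append(int(hex_byte, 16))
        elif code_point is not None:
            cp = int(code_point, 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("invalid code point %x" % cp)
            # Char escapes denote the UTF-8 encoding of the code point.
            out += chr(cp).encode("utf-8")
        elif simple is not None:
            if simple not in _SIMPLE:
                raise ValueError("unknown escape \\%s" % simple)
            nxt = body[m.end():m.end() + 1]
            if simple == "0" and nxt and nxt in "0123456789":
                # Assumption: treat octal-looking escapes as errors.
                raise ValueError("octal escapes are not valid")
            out += _SIMPLE[simple]
        else:
            # Literal characters also denote their UTF-8 bytes.
            out += literal.encode("utf-8")
        pos = m.end()
    return bytes(out)
```

With this sketch, 'mu = \u{03bc}' and 'mu = μ' decode to the same byte string, which is the point of the examples at the top of this page.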

An Analogy


     JavaScript Object Literals   are to    JSON
as   Rust String Literals         are to    QSN

But QSN is not tied to either Rust or shell, just like JSON isn't tied to JavaScript.

It's a language-independent format like UTF-8 or HTML. We're only borrowing a design, so that it's well-specified and familiar.

Full Spec

TODO: The short description above should be sufficient, but we might want to write it out.

Advantages over JSON Strings

Implementation Issues

Compatibility With Shell Strings

In bash, C-escaped strings are displayed $'like this\n'. A subset of QSN is compatible with this syntax. Examples:

$'\x01\n'  # removing $ makes it valid QSN

$'\0065'   # octal escape is invalid in QSN

How does a QSN Encoder Deal with Unicode?

The input to a QSN encoder is a raw byte string. However, the string may have additional structure, like being UTF-8 encoded.

The encoder has three options to deal with this structure:

  1. Don't decode UTF-8. Walk through bytes one-by-one, showing unprintable ones with escapes like \xce\xbc. Never emit escapes like \u{3bc} or literals like μ. This option is OK for machines, but isn't friendly to humans who can read Unicode characters.

Or speculatively decode UTF-8. After decoding a valid UTF-8 sequence, there are two options:

  2. Show escaped code points, like \u{3bc}. The encoded string is limited to the ASCII subset, which is useful in some contexts.

  3. Show them literally, like μ.

QSN encoding should never fail; it should only fall back to byte escapes like \xff. TODO: Show the state machine for detecting and decoding UTF-8.

Note: With strategies 2 and 3, the output indicates whether the input was valid UTF-8, because byte escapes of \x80 and above appear only where decoding failed.
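
The three strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not Oil's encoder or the state machine mentioned in the TODO. It leans on Python's "surrogateescape" error handler to do the speculative decoding, falls back to \u{...} for unprintable code points in literal mode, and picks its own set of named escapes. The function name qsn_encode and the mode names "bytes", "escape" and "literal" are hypothetical.

```python
# Assumed named escapes; other control bytes use \xHH.
_NAMED = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t", 0x5C: "\\\\", 0x27: "\\'"}


def _encode_ascii(b: int) -> str:
    """Encode one byte or code point below 0x80."""
    if b in _NAMED:
        return _NAMED[b]
    if b < 0x20 or b == 0x7F:
        return "\\x%02x" % b
    return chr(b)


def qsn_encode(data: bytes, mode: str = "literal") -> str:
    """Encode bytes as QSN.

    mode "bytes":   strategy 1, never decode UTF-8
    mode "escape":  strategy 2, show decoded code points as \\u{...}
    mode "literal": strategy 3, show decoded code points literally
    """
    if mode not in ("bytes", "escape", "literal"):
        raise ValueError("unknown mode %r" % mode)
    parts = []
    if mode == "bytes":
        for b in data:
            parts.append(_encode_ascii(b) if b < 0x80 else "\\x%02x" % b)
    else:
        # Invalid bytes decode to lone surrogates U+DC80..U+DCFF, which
        # valid UTF-8 can never produce, so they are unambiguous markers.
        text = data.decode("utf-8", errors="surrogateescape")
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                parts.append(_encode_ascii(cp))
            elif 0xDC80 <= cp <= 0xDCFF:
                # Fall back to a byte escape instead of failing.
                parts.append("\\x%02x" % (cp - 0xDC00))
            elif mode == "literal" and ch.isprintable():
                parts.append(ch)
            else:
                parts.append("\\u{%x}" % cp)
    return "'" + "".join(parts) + "'"
```

For the UTF-8 bytes of "mu = μ", the three modes give 'mu = \xce\xbc', 'mu = \u{3bc}' and 'mu = μ' respectively. The input b'\xff' gives '\xff' in every mode, so encoding never fails.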

Design Notes

The general idea: Rust string literals are like C and JavaScript string literals, without cruft like octal (\755 or \0755 — which is it?) and vertical tabs (\v).

Comparison with shell strings:

Comparison with Python's repr():

In-band signaling is the fundamental problem with filenames and terminals.

Example: set -x format (xtrace)

When arguments don't have any spaces, there's no ambiguity:

$ set -x
$ echo two args
+ echo two args

Here we need quotes to show that the argv array has 3 elements:

$ set -x
$ x='a b'
$ echo "$x" c
+ echo 'a b' c

And we want the trace to fit on a single line, so we print a QSN string with \n:

$ set -x
$ x=$'a\nb'
$ echo "$x" c
+ echo $'a\nb' c

Here's an example with unprintable characters:

$ set -x
$ x=$'\e\001'
$ echo "$x"
+ echo $'\x1b\x01'
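
The traces above suggest a simple quoting rule, sketched here in Python. This is a guess at the behavior shown, not Oil's actual xtrace code: words made only of "safe" characters are printed bare, other printable ASCII words are single-quoted, and anything else is printed in $'...' form with byte escapes. The set of safe characters and the names quote_arg and format_xtrace are assumptions.

```python
import re

# Assumed set of characters that never need quoting in a trace.
_SAFE = re.compile(rb"[A-Za-z0-9_./=:@%+-]+")
_NAMED = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t", 0x5C: "\\\\", 0x27: "\\'"}


def quote_arg(arg: bytes) -> str:
    """Quote one argv element so the trace stays on a single line."""
    if _SAFE.fullmatch(arg):
        return arg.decode("ascii")
    printable = all(0x20 <= b < 0x7F and b not in (0x27, 0x5C) for b in arg)
    if printable:
        return "'" + arg.decode("ascii") + "'"
    parts = []
    for b in arg:
        if b in _NAMED:
            parts.append(_NAMED[b])
        elif 0x20 <= b < 0x7F:
            parts.append(chr(b))
        else:
            # Unprintable and non-ASCII bytes become byte escapes.
            parts.append("\\x%02x" % b)
    return "$'" + "".join(parts) + "'"


def format_xtrace(argv: list) -> str:
    """Format an argv array (a list of bytes) as one trace line."""
    return "+ " + " ".join(quote_arg(a) for a in argv)
```

Called with [b"echo", b"a\nb", b"c"], format_xtrace returns the line + echo $'a\nb' c, matching the third trace above.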

Generated on Fri Aug 21 12:40:56 PDT 2020