QSN: A Familiar String Interchange Format

QSN ("quoted string notation") is an interchange format for byte strings. Examples:

''                           # empty string
'my favorite song.mp3'
'bob\t1.0\ncarol\t2.0\n'     # tabs and newlines
'BEL = \x07'                 # byte escape
'mu = \u{03bc}'              # char escapes are encoded in UTF-8
'mu = μ'                     # represented literally, not escaped

It's meant to be emitted and parsed by programs written in different languages, as UTF-8 or JSON are. It's both human- and machine-readable.

Oil understands QSN, and other Unix shells should too.

Table of Contents
Why?
More Use Cases
Specification
A Short Description
An Analogy
Full Spec
Advantages Over JSON Strings
Implementation Issues
How Does a QSN Encoder Deal with Unicode?
Which Bytes Should Be Hex-Escaped?
Reference Implementation in Oil
Appendices
Design Notes
Related Links
set -x example

Why?

QSN has many uses. One is that Oil needs a way to display filenames in a terminal safely and readably.

Filenames may contain arbitrary bytes, including escape sequences that change your terminal's color or do worse. Most command line programs need something like QSN, or they'll have subtle bugs.

For example, as of 2016, coreutils quotes funny filenames to avoid the same problem. However, the output format isn't specified, so it can't be reliably parsed. In contrast, QSN can be parsed and printed like JSON.

More Use Cases

Specification

A Short Description

  1. Start with Rust String Literal Syntax
  2. Use single quotes instead of double quotes to surround the string. This is mainly to avoid confusion with JSON.
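
To illustrate how little is needed to parse this syntax, here is a minimal decoder sketch in Python. The name qsn_decode is hypothetical, and the escape set it accepts (\n \r \t \0 \\ \' \" \xNN \u{...}) is an assumption taken from Rust's string literal syntax rather than from a finished QSN spec. It is not Oil's implementation.

```python
import re

# Simple escapes, assumed to match Rust string literals.
_SIMPLE = {
    'n': b'\n', 'r': b'\r', 't': b'\t', '0': b'\0',
    '\\': b'\\', "'": b"'", '"': b'"',
}

_TOKEN = re.compile(r"""
    \\x([0-9a-fA-F]{2})          # byte escape, e.g. \x07
  | \\u\{([0-9a-fA-F]{1,6})\}    # code point escape, e.g. \u{3bc}
  | \\(.)                        # simple escape, e.g. \n
  | ([^\\']+)                    # run of literal characters
""", re.VERBOSE | re.DOTALL)


def qsn_decode(s):
    """Decode a QSN string (a Python str) into bytes.  Hypothetical helper."""
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError('QSN strings are surrounded by single quotes')
    body = s[1:-1]
    out = []
    pos = 0
    while pos < len(body):
        m = _TOKEN.match(body, pos)
        if m is None:  # unescaped quote or trailing backslash
            raise ValueError('invalid QSN at offset %d' % pos)
        hex_byte, code_point, simple, literal = m.groups()
        if hex_byte is not None:
            out.append(bytes([int(hex_byte, 16)]))
        elif code_point is not None:
            # Code point escapes are encoded as UTF-8 in the result.
            out.append(chr(int(code_point, 16)).encode('utf-8'))
        elif simple is not None:
            if simple not in _SIMPLE:
                raise ValueError('unknown escape \\%s' % simple)
            out.append(_SIMPLE[simple])
        else:
            out.append(literal.encode('utf-8'))
        pos = m.end()
    return b''.join(out)
```

With this sketch, the escaped and literal forms of mu from the examples at the top decode to the same two bytes, 0xce 0xbc.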

An Analogy


     JavaScript Object Literals   are to    JSON
as   Rust String Literals         are to    QSN

But QSN is not tied to either Rust or shell, just like JSON isn't tied to JavaScript.

It's a language-independent format like UTF-8 or HTML. We're only borrowing a design, so that it's well-specified and familiar.

Full Spec

TODO: The short description above should be sufficient, but we might want to write it out.

Advantages Over JSON Strings

Implementation Issues

How Does a QSN Encoder Deal with Unicode?

The input to a QSN encoder is a raw byte string. However, the string may have additional structure, like being UTF-8 encoded.

The encoder has three options to deal with this structure:

  1. Don't decode UTF-8. Walk through bytes one-by-one, showing unprintable ones with escapes like \xce\xbc. Never emit escapes like \u{3bc} or literals like μ. This option is OK for machines, but isn't friendly to humans who can read Unicode characters.

Or speculatively decode UTF-8. After decoding a valid UTF-8 sequence, there are two options:

  2. Show escaped code points, like \u{3bc}. The encoded string is limited to the ASCII subset, which is useful in some contexts.

  3. Show them literally, like μ.

QSN encoding should never fail; it should only fall back to byte escapes like \xff. TODO: Show the state machine for detecting and decoding UTF-8.

Note: With strategies 2 and 3, the output indicates whether the string is valid UTF-8: a byte escape for a byte >= 0x80, like \xff, appears only where decoding failed.
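
The three strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in qsn_/qsn.py: the names qsn_encode, BYTES, U_ESCAPE and LITERAL are hypothetical, and instead of a hand-written state machine it leans on Python's strict UTF-8 decoder to detect valid sequences. It also makes no attempt to escape unprintable non-ASCII code points in the literal strategy.

```python
BYTES, U_ESCAPE, LITERAL = 'bytes', 'u_escape', 'literal'

_SIMPLE = {0x0a: r'\n', 0x0d: r'\r', 0x09: r'\t', 0x5c: r'\\', 0x27: r"\'"}


def _utf8_len(data, i):
    """Length of the valid UTF-8 sequence starting at data[i], or 0 if none."""
    for n in (1, 2, 3, 4):
        try:
            data[i:i + n].decode('utf-8')  # strict: rejects overlongs, surrogates
        except UnicodeDecodeError:
            continue
        return n  # the first slice that decodes is exactly one code point
    return 0


def _encode_byte(b):
    """Encode one byte using only ASCII output."""
    if b in _SIMPLE:
        return _SIMPLE[b]
    if 0x20 <= b < 0x7f:
        return chr(b)
    return '\\x%02x' % b


def qsn_encode(data, strategy=LITERAL):
    """Encode bytes as a QSN string.  Never fails; falls back to \\xNN."""
    parts = ["'"]
    i = 0
    while i < len(data):
        # Strategy 1 never decodes; strategies 2 and 3 decode speculatively.
        n = 1 if strategy == BYTES else _utf8_len(data, i)
        if n <= 1:  # ASCII byte, or a byte that is not valid UTF-8 here
            parts.append(_encode_byte(data[i]))
            i += 1
            continue
        ch = data[i:i + n].decode('utf-8')
        if strategy == U_ESCAPE:
            parts.append('\\u{%x}' % ord(ch))
        else:
            parts.append(ch)
        i += n
    parts.append("'")
    return ''.join(parts)
```

For the input bytes 'mu = ' followed by 0xce 0xbc, the three strategies in this sketch produce 'mu = \xce\xbc', 'mu = \u{3bc}' and 'mu = μ' respectively. A stray 0xff byte comes out as \xff under all three.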

Which Bytes Should Be Hex-Escaped?

The reference implementation has two functions:

In theory, only escapes like \' \n \\ are strictly necessary, and no bytes need to be hex-escaped. But that strategy would defeat the purpose of QSN for many applications, like printing filenames in a terminal.
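
The difference can be made concrete with a small Python sketch. The two function names below are hypothetical and are not the reference implementation's functions. The choice of which bytes count as unsafe (ASCII control bytes, DEL, and everything >= 0x80, with no UTF-8 decoding) is an assumption made to keep the example short.

```python
# Escapes that the quoting syntax itself requires.
_REQUIRED = {0x27: b"\\'", 0x5c: b'\\\\', 0x0a: b'\\n'}


def encode_minimal(data):
    """Escape only \\' \\\\ \\n; every other byte passes through raw."""
    body = b''.join(_REQUIRED.get(b, bytes([b])) for b in data)
    return b"'" + body + b"'"


def is_terminal_unsafe(b):
    """Assumed policy: control bytes, DEL, and non-ASCII bytes are unsafe."""
    return b < 0x20 or b >= 0x7f


def encode_terminal_safe(data):
    """Like encode_minimal, but hex-escape bytes a terminal might interpret."""
    out = []
    for b in data:
        if b in _REQUIRED:
            out.append(_REQUIRED[b])
        elif is_terminal_unsafe(b):
            out.append(b'\\x%02x' % b)
        else:
            out.append(bytes([b]))
    return b"'" + b''.join(out) + b"'"
```

Given a filename that starts with the bytes ESC [ 3 1 m, encode_minimal leaves the raw ESC byte in its output, so printing it would still change the terminal color. encode_terminal_safe emits the four ASCII characters \x1b instead.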

Reference Implementation in Oil

You can see Oil's implementation in qsn_/qsn.py. It has an encoder with the various UTF-8 strategies. As of August 2020, the decoder is incomplete.

Note that we also have options to emit shell-compatible strings, which you probably don't need.

That is, C-escaped strings in bash look $'like this\n'. A subset of QSN is compatible with this syntax. Example:

$'\x01\n'  # A valid bash string.  Removing $ makes it valid QSN.

Something like $'\0065' is never emitted, because QSN doesn't contain octal escapes. It can be encoded with hex or character escapes.

Appendices

Design Notes

The general idea: Rust string literals are like C and JavaScript string literals, without cruft like octal (\755 or \0755 — which is it?) and vertical tabs (\v).

Comparison with shell strings:

Comparison with Python's repr():

Related Links

In-band signaling is the fundamental problem with filenames and terminals. Code (control codes) and data are intermingled.

set -x example

When arguments don't have any spaces, there's no ambiguity:

$ set -x
$ echo two args
+ echo two args

Here we need quotes to show that the argv array has 3 elements:

$ set -x
$ x='a b'
$ echo "$x" c
+ echo 'a b' c

And we want the trace to fit on a single line, so we print a QSN string with \n:

$ set -x
$ x=$'a\nb'
$ echo "$x" c
+ echo $'a\nb' c

Here's an example with unprintable characters:

$ set -x
$ x=$'\e\001'
$ echo "$x"
+ echo $'\x1b\x01'
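
The quoting decisions in the four traces above can be sketched in Python. This is a simplified illustration, not Oil's tracing code: the name trace_arg, the set of characters treated as safe to print bare, and the choice to hex-escape every non-ASCII byte are all assumptions.

```python
import string

# Characters assumed safe to print without quotes.
_BARE = set((string.ascii_letters + string.digits + '_./-').encode('ascii'))

_NAMED = {0x0a: r'\n', 0x0d: r'\r', 0x09: r'\t', 0x5c: r'\\', 0x27: r"\'"}


def trace_arg(arg):
    """Format one argv entry (bytes) for a single-line trace."""
    # 1. No ambiguity: print the word as-is.
    if arg and all(b in _BARE for b in arg):
        return arg.decode('ascii')
    # 2. Printable ASCII without quote or backslash: plain single quotes,
    #    which read the same way as a shell string and as QSN.
    if all(0x20 <= b < 0x7f and b not in (0x27, 0x5c) for b in arg):
        return "'" + arg.decode('ascii') + "'"
    # 3. Otherwise use escapes, with the $ prefix that bash requires.
    parts = []
    for b in arg:
        if b in _NAMED:
            parts.append(_NAMED[b])
        elif 0x20 <= b < 0x7f:
            parts.append(chr(b))
        else:
            parts.append('\\x%02x' % b)
    return "$'" + ''.join(parts) + "'"


def trace_line(argv):
    """Join the formatted arguments into a '+ cmd args' trace line."""
    return '+ ' + ' '.join(trace_arg(a) for a in argv)
```

Under these assumptions, trace_line reproduces the trace lines shown in the examples, including the escaped forms for the newline and the unprintable bytes.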

Generated on Thu Sep 3 10:28:33 PDT 2020