QSN: A Familiar String Interchange Format

QSN ("quoted string notation") is a data format for byte strings. Examples:

''                           # empty string
'my favorite song.mp3'
'bob\t1.0\ncarol\t2.0\n'     # tabs and newlines
'BEL = \x07'                 # byte escape
'mu = \u{03bc}'              # Unicode char escape
'mu = μ'                     # represented literally, not escaped

It's an adaptation of Rust's string literal syntax with a few use cases:

Oil uses QSN because it's well-defined and parsable. It's both human- and machine-readable.

Any programming language or tool that understands JSON should also understand QSN.

Table of Contents
Important Properties
More QSN Use Cases
Specification
A Short Description
An Analogy
Full Spec
Advantages Over JSON Strings
Implementation Issues
How Does a QSN Encoder Deal with Unicode?
Which Bytes Should Be Hex-Escaped?
List of Syntax Errors
Reference Implementation in Oil
Appendices
Design Notes
Related Links
set -x example

Important Properties

More QSN Use Cases

Specification

A Short Description

  1. Start with Rust String Literal Syntax
  2. Use single quotes instead of double quotes to surround the string. This is mainly to avoid confusion with JSON.
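
The two rules above can be sketched as a small decoder. This is only an illustration, not the reference implementation: the name qsn_decode and the exact escape set (\n \r \t \\ \0 \' \" plus \xNN and \u{...}) are assumptions based on Rust's string literal syntax, and the sketch accepts any literal character other than a backslash or single quote, which a stricter decoder might not.

```python
import re

# Simple backslash escapes, assumed to match Rust's string literal syntax.
_SIMPLE = {"n": b"\n", "r": b"\r", "t": b"\t", "\\": b"\\",
           "0": b"\0", "'": b"'", '"': b'"'}

_TOKEN = re.compile(r"""
      \\x([0-9a-fA-F]{2})          # byte escape like \x07
    | \\u\{([0-9a-fA-F]{1,6})\}    # code point escape like \u{3bc}
    | \\(.)                        # any other escape, validated below
    | ([^\\']+)                    # run of literal characters
    """, re.VERBOSE | re.DOTALL)


def qsn_decode(s: str) -> bytes:
    """Decode a QSN string (hypothetical helper) into raw bytes."""
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError("QSN string must be surrounded by single quotes")
    body = s[1:-1]
    out = bytearray()
    pos = 0
    while pos < len(body):
        m = _TOKEN.match(body, pos)
        if m is None:
            # Unescaped single quote or trailing backslash.
            raise ValueError("invalid QSN at offset %d" % (pos + 1))
        hex_byte, code_point, simple, literal = m.groups()
        if hex_byte is not None:
            out.append(int(hex_byte, 16))
        elif code_point is not None:
            # chr() and encode() raise ValueError for out-of-range
            # code points and surrogates.
            out += chr(int(code_point, 16)).encode("utf-8")
        elif simple is not None:
            if simple not in _SIMPLE:
                raise ValueError("invalid escape \\%s" % simple)
            out += _SIMPLE[simple]
        else:
            out += literal.encode("utf-8")
        pos = m.end()
    return bytes(out)
```

Under these assumptions, the escaped and literal spellings of mu from the examples at the top decode to the same two UTF-8 bytes.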

An Analogy


     JavaScript Object Literals   are to    JSON
as   Rust String Literals         are to    QSN

But QSN is not tied to either Rust or shell, just like JSON isn't tied to JavaScript.

It's a language-independent format like UTF-8 or HTML. We're only borrowing a design, so that it's well-specified and familiar.

Full Spec

TODO: The short description above should be sufficient, but we might want to write it out.

Advantages Over JSON Strings

Implementation Issues

How Does a QSN Encoder Deal with Unicode?

The input to a QSN encoder is a raw byte string. However, the string may have additional structure, like being UTF-8 encoded.

The encoder has three options to deal with this structure:

  1. Don't decode UTF-8. Walk through bytes one-by-one, showing unprintable ones with escapes like \xce\xbc. Never emit escapes like \u{3bc} or literals like μ. This option is OK for machines, but isn't friendly to humans who can read Unicode characters.

Alternatively, the encoder can speculatively decode UTF-8. After decoding a valid UTF-8 sequence, there are two options:

  2. Show escaped code points, like \u{3bc}. The encoded string is limited to the ASCII subset, which is useful in some contexts.

  3. Show them literally, like μ.

QSN encoding should never fail; it should only fall back to byte escapes like \xff. TODO: Show the state machine for detecting and decoding UTF-8.

Note: Strategies 2 and 3 indicate whether the string is valid UTF-8.
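
The three strategies can be sketched as one encoder with a mode switch. This is a minimal illustration under stated assumptions, not the reference implementation: the name qsn_encode, the unicode_mode parameter, and the choice to hex-escape every non-printable ASCII byte are all hypothetical. In place of the state machine mentioned in the TODO, it leans on Python's strict UTF-8 decoder to validate one sequence at a time.

```python
# Escapes the sketch always emits; every other unprintable byte is hex-escaped.
_ESCAPES = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t", 0x5C: "\\\\", 0x27: "\\'"}


def _utf8_len(lead: int) -> int:
    """Length implied by a UTF-8 lead byte, or 0 if it can't start a sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _try_decode(data: bytes, i: int):
    """Return (char, width) if a valid multi-byte sequence starts at i, else None."""
    n = _utf8_len(data[i])
    if n == 0 or i + n > len(data):
        return None
    try:
        return data[i:i + n].decode("utf-8"), n
    except UnicodeDecodeError:
        return None


def qsn_encode(data: bytes, unicode_mode: str = "literal") -> str:
    """unicode_mode: 'bytes' (strategy 1), 'escape' (2), or 'literal' (3)."""
    out = ["'"]
    i = 0
    while i < len(data):
        b = data[i]
        if b in _ESCAPES:
            out.append(_ESCAPES[b])
            i += 1
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))
            i += 1
        else:
            decoded = None if unicode_mode == "bytes" else _try_decode(data, i)
            if decoded is None:
                # Never fail: fall back to a byte escape for this one byte.
                out.append("\\x%02x" % b)
                i += 1
            else:
                ch, n = decoded
                out.append(ch if unicode_mode == "literal" else "\\u{%x}" % ord(ch))
                i += n
    out.append("'")
    return "".join(out)
```

For the UTF-8 bytes of mu, the three modes in this sketch give \xce\xbc, \u{3bc}, and μ respectively, while a lone invalid byte such as 0xff comes out as \xff in every mode.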

Which Bytes Should Be Hex-Escaped?

The reference implementation has two functions:

In theory, only escapes like \' \n \\ are strictly necessary, and no bytes need to be hex-escaped. But that strategy would defeat the purpose of QSN for many applications, like printing filenames in a terminal.

List of Syntax Errors

QSN decoders must enforce (at least) these syntax errors:

Separate messages aren't required for each error; the only requirement is that they not accept these sequences.

Reference Implementation in Oil

The encoder has options to emit shell-compatible strings, which you probably don't need. That is, C-escaped strings in bash look $'like this\n'.

A subset of QSN is compatible with this syntax. Example:

$'\x01\n'  # A valid bash string.  Removing $ makes it valid QSN.

Something like $'\0065' is never emitted, because QSN doesn't contain octal escapes. It can be encoded with hex or character escapes.

Appendices

Design Notes

The general idea: Rust string literals are like C and JavaScript string literals, without cruft like octal (\755 or \0755 — which is it?) and vertical tabs (\v).

Comparison with shell strings:

Comparison with Python's repr():

Related Links

set -x example

When arguments don't have any spaces, there's no ambiguity:

$ set -x
$ echo two args
+ echo two args

Here we need quotes to show that the argv array has 3 elements:

$ set -x
$ x='a b'
$ echo "$x" c
+ echo 'a b' c

And we want the trace to fit on a single line, so we print a QSN string with \n:

$ set -x
$ x=$'a\nb'
$ echo "$x" c
+ echo $'a\nb' c

Here's an example with unprintable characters:

$ set -x
$ x=$'\e\001'
$ echo "$x"
+ echo $'\x1b\x01'
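
The quoting choices in these traces can be sketched as a small function. It is only an illustration of the behavior shown above, not Oil's code: the name trace_word and the set of characters treated as safe to print unquoted are assumptions, and this sketch hex-escapes every byte outside printable ASCII instead of decoding UTF-8.

```python
import re

# Hypothetical set of characters that need no quoting in a trace.
_PLAIN = re.compile(r"[A-Za-z0-9_./=:,+@%-]+")

_ESCAPES = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t", 0x5C: "\\\\", 0x27: "\\'"}


def trace_word(arg: bytes) -> str:
    """Format one argv element for a single-line set -x style trace."""
    if all(0x20 <= b < 0x7F for b in arg):
        s = arg.decode("ascii")
        if _PLAIN.fullmatch(s):
            return s                      # no ambiguity, print as-is
        if "'" not in s and "\\" not in s:
            return "'" + s + "'"          # plain single quotes suffice
    # Anything else gets bash's C-escaped form; dropping $ leaves valid QSN.
    body = "".join(
        _ESCAPES.get(b) or (chr(b) if 0x20 <= b < 0x7F else "\\x%02x" % b)
        for b in arg
    )
    return "$'" + body + "'"


def trace_line(argv) -> str:
    """Join formatted words the way the traces above are printed."""
    return "+ " + " ".join(trace_word(a) for a in argv)
```

Called as trace_line([b"echo", b"a\nb", b"c"]), this produces the same single-line form as the third trace above.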

Generated on Wed May 3 15:38:09 EDT 2023