
J8 Notation - Fixing the JSON-Unix Mismatch

J8 Notation is a set of text interchange formats. It's a syntax for:

strings / bytes
tree-shaped records (like JSON)
line-based streams (like Unix)
tables (like TSV)

It's part of the Oils project, and is intended to solve the JSON-Unix Mismatch: the Unix kernel deals with bytes, while JSON deals with Unicode strings (plus UTF-16 errors).

It's backward compatible with JSON, and built on top of it.

But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils. Any language that understands JSON should also understand J8 Notation.

(Note: J8 replaced the similar QSN design in January 2024. QSN wasn't as compatible with both JSON and YSH code.)

Quick Picture

There are 3 styles of J8 strings:

 "hi 🙂 \uD83D\uDE42"      # JSON-style, with surrogate pair

b'hi 🙂 \yF0\y9F\y99\y82'  # Can be ANY bytes, including UTF-8

u'hi 🙂 \u{1F642}'         # nice alternative syntax

They all denote the same decoded string — "hi" and two U+1F642 smiley faces:

hi 🙂 🙂
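
As a rough illustration (not part of the Oils implementation), the equivalence of the three styles can be checked with Python's standard library. Python has no J8 decoder built in, so this sketch assumes the closest Python spellings: `\xNN` in a bytes literal stands in for `\yNN`, and `\U0001F642` stands in for `\u{1F642}`.

```python
import json

# JSON-style: the \uD83D\uDE42 surrogate pair decodes to one code point, U+1F642.
json_style = json.loads('"hi 🙂 \\uD83D\\uDE42"')

# b'' style: Python spells byte escapes \xNN where J8 uses \yNN.
byte_style = "hi 🙂 ".encode("utf-8") + b"\xf0\x9f\x99\x82"

# u'' style: Python spells \u{1F642} as \U0001F642.
unicode_style = "hi 🙂 \U0001F642"

assert json_style == unicode_style == "hi 🙂 🙂"
assert byte_style.decode("utf-8") == unicode_style
```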

Why did we add these u'' and b'' strings?

We want to represent any string that a Unix kernel can emit (argv arrays, env variables, filenames, file contents, etc.)
- J8 encoders emit b'' strings to avoid losing information.
u'' strings are like b'' strings, but they can only express valid Unicode strings.

Now, starting with J8 strings, we define the formats JSON8:

{ name: "Alice",
  signature: b'\y01 ... \yff',  # binary data
}

J8 Lines:

  doc/hello.md
 "doc/with spaces.md"
b'doc/with byte \yff.md'

and TSV8:

!tsv8   size    name
!type   Int     Str
        42        doc/hello.md
        55       "doc/with spaces.md"
        99      b'doc/with byte \yff.md'

Together, these are called J8 Notation.

(JSON8 and TSV8 are still to be fully implemented in Oils.)

Goals

Fix the JSON-Unix mismatch: all text formats should be able to express byte strings.
- But it's OK to use plain JSON in Oils, e.g. when filenames are known to be strings.
Provide an option to avoid the surrogate pair / UTF-16 legacy of JSON.
Allow expressing metadata about strings vs. bytes.
Turn TSV into an exterior data frame format.
- Unix tools like awk, cut, and sort already understand tables informally.

Non-goals:

"Replace" JSON. JSON8 is backward compatible with JSON, and sometimes the lossy encoding is OK.
Resolve the strings vs. bytes dilemma in all situations.
- Like JSON, our spec is syntactic. We don't specify a mapping from J8 strings to interior data types in any particular language.

Reference

See the Data Notation Table of Contents in the Oils Reference.

TODO / Diagrams

Diagram of Evolution
- JSON strings → J8 Strings
- J8 strings as a building block → JSON8 and TSV8
Venn Diagrams of Data Language Relationships
- If you add the left "gutter" column, every TSV is valid TSV8.
- Every TSV8 is also syntactically valid TSV. For example, you can import it into a spreadsheet, and remove/ignore the gutter column and type row.
- TODO: make a screenshot and test it
Doc: How to turn a JSON library into a J8 Notation library.
- Issue: an interior type that can represent byte strings.

J8 Strings - Unicode and bytes

Let's review JSON strings, and then describe J8 strings.

Review of JSON strings

JSON strings are enclosed in double quotes, and may have these escape sequences:

\"   \\   \/
\b   \f   \n   \r   \t
\u1234

Properties of JSON:

The encoded form must also be valid UTF-8.
The encoded form can't contain literal control characters, including literal tabs or newlines. (This is good for TSV8, because it means a literal tab is always a field separator.)

J8 Description

There are 3 styles of J8 strings:

JSON strings j"", which may be written ""
b'' strings
u'' strings

b'' strings have these escapes:

\yff                # byte escape
\u{1f926}           # code point escape.  UTF-16 escapes like \u1234
                    # are ILLEGAL
\'                  # single quote, in addition to \"
\"  \\  \/          # same as JSON
\b  \f  \n  \r  \t

(JSON-style double-quoted strings do not add the \' escape. Except for the optional j prefix, they remain the same.)

Examples:

b''
b'hello'
b'\\'
b'"double" \'single\''
b'nul byte \y00, unicode \u{1f642}'

u'' strings have all the same escapes, but not \yff. This implies that they're always valid unicode strings. (If JSON-style \u1234 escapes were allowed, they wouldn't be.)

Examples:

u''
u'hello'
u'unicode string \u{1f642}'

A string without a prefix, like 'foo', is equivalent to u'foo':

 'this is a u string'  # discouraged, unless the context is clear

u'this is a u string'  # better to be explicit
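
The escape rules above can be sketched as a small decoder. This is a minimal sketch, not the Oils implementation: the name `decode_j8_style` is hypothetical, it returns bytes for every style, and it assumes that `\u{...}` escapes naming a surrogate or a value above U+10FFFF are errors in both b'' and u'' strings.

```python
import re

# One token per match: a \y byte escape, a \u{...} code point escape,
# a one-character escape, or a run of literal characters.
_TOKEN = re.compile(
    r"\\y([0-9a-fA-F]{2})"         # group 1: byte escape
    r"|\\u\{([0-9a-fA-F]{1,6})\}"  # group 2: code point escape
    r"|\\([\\'\"/bfnrt])"          # group 3: one-character escape
    r"|([^\\']+)"                  # group 4: literal text
)
_SIMPLE = {"\\": "\\", "'": "'", '"': '"', "/": "/",
           "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}


def decode_j8_style(encoded: str) -> bytes:
    """Decode one b'' / u'' / '' string into bytes (hypothetical helper)."""
    m = re.fullmatch(r"([bu]?)'(.*)'", encoded, re.DOTALL)
    if m is None:
        raise ValueError("expected b'...', u'...', or '...'")
    prefix, body = m.group(1), m.group(2)
    out = bytearray()
    pos = 0
    while pos < len(body):
        tok = _TOKEN.match(body, pos)
        if tok is None:
            # Covers JSON-style \u1234 escapes and unescaped single quotes.
            raise ValueError(f"bad escape or unescaped quote at offset {pos}")
        byte_hex, code_hex, simple, literal = tok.groups()
        if byte_hex is not None:
            if prefix != "b":
                raise ValueError("\\y escapes are only legal in b'' strings")
            out.append(int(byte_hex, 16))
        elif code_hex is not None:
            code = int(code_hex, 16)
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            out += chr(code).encode("utf-8")
        elif simple is not None:
            out += _SIMPLE[simple].encode("utf-8")
        else:
            if any(ord(c) < 0x20 for c in literal):
                raise ValueError("literal control character")
            out += literal.encode("utf-8")
        pos = tok.end()
    return bytes(out)


assert decode_j8_style(r"b'nul byte \y00'") == b"nul byte \x00"
assert decode_j8_style(r"u'unicode string \u{1f642}'").decode("utf-8") == "unicode string 🙂"
```

Returning bytes for u'' strings too keeps the sketch short. A language with separate string and bytes types might instead return a Unicode string for u'' and '' inputs.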

What's representable by each style?

These relationships might help you understand the 3 styles of strings:

Strings representable by u''
= All Unicode Strings (no more and no less)

⊂

Strings representable by "" (JSON-style)
= All Unicode Strings ∪ Surrogate Half Errors

⊂

Strings representable by b''
= All Byte Strings

Examples:

The JSON message "\udd26" represents a string that's not Unicode — it has a surrogate half error. This string is not representable with u'' strings.
The J8 message b'\yff' represents a byte string. This string is not representable with JSON strings or u'' strings.
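
Both examples can be reproduced with Python's standard library, which happens to accept surrogate halves in JSON input. The helper names `is_valid_unicode` and `is_valid_utf8` are hypothetical, introduced only for this sketch.

```python
import json


def is_valid_unicode(text: str) -> bool:
    """True if the string has no surrogate halves, so it can be UTF-8 encoded."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "\udd26" is legal JSON, but it decodes to a lone surrogate half.
half = json.loads('"\\udd26"')
assert not is_valid_unicode(half)

# The byte string denoted by b'\yff' is not valid UTF-8,
# so no JSON string or u'' string can express it.
assert not is_valid_utf8(b"\xff")
```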

Asymmetry of Encoders and Decoders

A few things to notice about J8 encoders:

They may choose to emit only "" strings, possibly using the Unicode replacement char U+FFFD. Such an encoder is a strict JSON encoder.
They must emit b'' strings to preserve all information, because U+FFFD replacement is lossy.
They never need to emit u'' strings.
- This is because "" strings (and b'' strings) can represent all values that u'' strings can. Still, u'' strings may be desirable in some situations, like when you want \u{1f642} escapes, or to assert that a value must be a valid Unicode string.

On the other hand, J8 decoders must accept all 3 kinds of strings.
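
The two encoder choices described above might look like this in Python. This is a sketch under assumptions, not the Oils encoder: the function names are hypothetical, and `encode_b` takes the simplest route of writing every byte outside printable ASCII as a `\y` escape, even when the bytes form valid UTF-8.

```python
import json


def encode_json_lossy(data: bytes) -> str:
    """Strict JSON encoder: invalid bytes become U+FFFD, which loses information."""
    return json.dumps(data.decode("utf-8", errors="replace"), ensure_ascii=False)


def encode_b(data: bytes) -> str:
    """Encode any byte string as a b'' string."""
    parts = []
    for byte in data:
        ch = chr(byte)
        if ch in "\\'":
            parts.append("\\" + ch)          # escape backslash and single quote
        elif 0x20 <= byte < 0x7F:
            parts.append(ch)                 # printable ASCII, written literally
        else:
            parts.append(f"\\y{byte:02x}")   # everything else as a byte escape
    return "b'" + "".join(parts) + "'"


def encode_j8(data: bytes) -> str:
    """Emit a "" string when that is lossless, and a b'' string otherwise."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return encode_b(data)
    return json.dumps(text, ensure_ascii=False)


assert encode_j8(b"doc/with spaces.md") == '"doc/with spaces.md"'
assert encode_j8(b"doc/with byte \xff.md") == "b'doc/with byte \\yff.md'"
assert encode_json_lossy(b"\xff") == '"\ufffd"'
```

Note that `encode_j8` never produces a u'' string, which matches the observation that encoders never need to.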

YSH has 2 of the 3 styles

A nice property of YSH is that the u'' and b'' strings are valid code:

echo u'hi \u{1f642}'  # u respected in YSH, but not OSH

var myBytes = b'\yff\yfe'

This is useful for correct code generation, and simplifies the language.

But JSON-style strings aren't valid in YSH. The two usages of double quotes can't really be reconciled, because JSON looks like "line\n" and shell looks like "x = ${myvar}".

J8 Strings vs. POSIX Shell Strings

When the encoded form of a J8 string doesn't contain a backslash, it's identical to a POSIX shell string.

In this case, it can make sense to omit the u'' prefix. Example:

shell_string='hi 🙂'

var ysh_str = u'hi 🙂'

var ysh_str =  'hi 🙂'  # same thing

An encoded J8 string has no backslashes when the original string has all these properties:

Valid Unicode (no non-UTF-8 bytes).
No ASCII control characters. All bytes are 0x20 and greater.
No backslashes or single quotes. (All other required escapes are control characters.)
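
The three properties above translate directly into a check. The function name `needs_no_backslash` is hypothetical, and the sketch works on raw bytes because that is what a Unix kernel hands back.

```python
def needs_no_backslash(data: bytes) -> bool:
    """True if the J8 encoding of `data` would contain no backslash."""
    try:
        data.decode("utf-8")                  # must be valid Unicode
    except UnicodeDecodeError:
        return False
    if any(byte < 0x20 for byte in data):     # no ASCII control characters
        return False
    return b"\\" not in data and b"'" not in data


assert needs_no_backslash("hi 🙂".encode("utf-8"))
assert not needs_no_backslash(b"tab\there")
assert not needs_no_backslash(b"it's")
```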

JSON8 - Tree-Shaped Records

Now that we've defined J8 strings, we can define JSON8, an obvious extension of JSON.

(Not implemented yet.)

Review of JSON

See https://json.org

[primitive]     null   true   false
[number]        42  -1.2e-4
[string]        "hello\n"
[array]         [1, 2, 3]
[object]        {"key": 42}

JSON8 Description

JSON8 is like JSON, but:

All strings can be J8 strings — one of the 3 styles described above.
Object/Dict keys may be unquoted, like {age: 42}
- Unquoted keys must be a valid JS identifier name matching the pattern [a-zA-Z_][a-zA-Z0-9_]*.
Trailing commas are allowed on objects and arrays: {"d": 42,} and [42,]
End-of-line comments. We use # to be consistent with shell.

Example:

{ name: "Bob",  # comment
  age: 30,
  sig: b'\y00\y01 ... \yff',  # trailing comma, binary data
}
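
The unquoted-key rule is simple enough to check with the pattern quoted above. The helper name `can_be_unquoted` is hypothetical. An encoder could use a check like this to decide whether a key needs quotes.

```python
import re

# The pattern given for unquoted keys in the description above.
_UNQUOTED_KEY = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def can_be_unquoted(key: str) -> bool:
    """True if `key` may appear without quotes in a JSON8 object."""
    return _UNQUOTED_KEY.fullmatch(key) is not None


assert can_be_unquoted("age")
assert not can_be_unquoted("first-name")
```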

J8 Lines - Lines of Text

J8 Lines is another format built on J8 strings. Each line is either:

An unquoted string, which must be valid UTF-8. Whitespace is allowed, but not other ASCII control chars.
A quoted J8 string (JSON style "" or J8-style b'' u'')
An ignored empty line

In all cases, leading and trailing whitespace is ignored.

For example, 6 strings with weird characters could be represented like this:

  dir/with spaces.txt       # unquoted string must be UTF-8
 "dir/with newline \n.txt"  # JSON-style 
b'dir/with bytes \yff.txt'  # J8-style
u'dir/unicode \u{3bc}'
                            # ignored empty line
 ''                         # empty string, not ignored
 'dir/unicode \u{3bc}'      # no prefix implies u''

Note that J8 strings always occupy one physical line, because they can't contain unescaped control characters, including newlines.
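
A partial decoder for J8 Lines might look like the following. It is a sketch, not the Oils implementation: `decode_j8_lines` is a hypothetical name, it handles only unquoted and JSON-style lines, and it assumes that tabs count as allowed whitespace inside unquoted lines. The `#` annotations in the example above are treated as commentary on the example, not as part of the format.

```python
import json
from typing import Iterator


def decode_j8_lines(text: str) -> Iterator[str]:
    """Yield one decoded string per non-empty line (hypothetical helper)."""
    for raw in text.split("\n"):
        line = raw.strip(" \t\r")  # leading and trailing whitespace is ignored
        if not line:
            continue  # ignored empty line
        if line.startswith('"'):
            yield json.loads(line)  # JSON-style string
        elif line.startswith(("'", "u'", "b'")):
            # J8-style strings need a \y and \u{...} aware decoder, omitted here.
            raise NotImplementedError("J8-style quoted string")
        else:
            if any(ord(ch) < 0x20 and ch != "\t" for ch in line):
                raise ValueError("control character in unquoted line")
            yield line  # unquoted string, taken literally


lines = '  doc/hello.md\n "doc/with spaces.md"\n\n "dir/with newline \\n.txt"\n'
assert list(decode_j8_lines(lines)) == [
    "doc/hello.md", "doc/with spaces.md", "dir/with newline \n.txt"]
```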

J8 Lines can be viewed as a simpler case of TSV8, described in the next section.

Related

https://jsonlines.org/ allows not just strings, but any value like {} and []. We could define an obvious "JSON8 Lines" format, which is different than "J8 Lines".

TSV8 - Table-Shaped Text

Let's review TSV, and then describe TSV8.

Review of TSV

TSV has a very short specification:

https://www.iana.org/assignments/media-types/text/tab-separated-values

Example:

name<TAB>age
alice<TAB>44
bob<TAB>33

Limitations:

Fields can't contain tabs or newlines.
There's no escaping, so unprintable bytes in field values result in an unprintable TSV file.
Spaces are easy to confuse with tabs.

TSV8 Description

TSV8 is like TSV with:

A !tsv8 prefix and required column names.
An optional !type line, with types Bool Int Float Str.
Other optional column attributes.
Rows of data, each starting with an empty "gutter" column.

Example:

!tsv8   age     name    
!type   Int     Str     # optional types
!other  x       y       # more column metadata
        44        alice
        33        bob
         1       "a\tb"
         2      b'nul \y00'
         3      u'unicode \u{3bc}'

Types:

[Bool]      false   true
[Int]       JSON numbers, restricted to [0-9]+
[Float]     same as JSON
[Str]       J8 string (any of the 3 styles)

Rules for cells:

They can be any of 4 forms in J8 Lines:
1. Unquoted
2. JSON-style ""
3. u''
4. b''
Leading and trailing whitespace must be stripped, as in J8 Lines.
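
Since TSV8 is not yet implemented, the following is only a sketch of how a reader might apply the rules above. The names `parse_cell` and `parse_tsv8` are hypothetical. The sketch assumes that columns default to Str when there is no !type row, ignores other attribute rows, does not decode u'' or b'' cells, and does not address the open question of empty cells.

```python
import json
import re


def parse_cell(cell: str, type_name: str):
    """Convert one cell to a Python value (hypothetical helper)."""
    cell = cell.strip(" ")  # leading and trailing whitespace is stripped
    if type_name == "Bool":
        if cell not in ("true", "false"):
            raise ValueError(f"bad Bool: {cell!r}")
        return cell == "true"
    if type_name == "Int":
        if not re.fullmatch(r"[0-9]+", cell):
            raise ValueError(f"bad Int: {cell!r}")
        return int(cell)
    if type_name == "Float":
        return float(cell)  # note: accepts more spellings than JSON does
    if cell.startswith('"'):
        return json.loads(cell)  # JSON-style string
    return cell  # unquoted; u'' and b'' cells are not decoded in this sketch


def parse_tsv8(text: str) -> list:
    """Parse TSV8 text into a list of dicts, one per data row."""
    lines = [ln for ln in text.split("\n") if ln]
    if not lines or lines[0].split("\t")[0] != "!tsv8":
        raise ValueError("missing !tsv8 header")
    names = [name.strip(" ") for name in lines[0].split("\t")[1:]]
    types = ["Str"] * len(names)  # assumption: untyped columns are Str
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        gutter, values = cells[0].strip(" "), cells[1:]
        if len(values) != len(names):
            raise ValueError("wrong number of cells")
        if gutter == "!type":
            types = [v.strip(" ") for v in values]
        elif gutter.startswith("!"):
            continue  # other column attributes are ignored here
        elif gutter == "":
            rows.append({n: parse_cell(v, t)
                         for n, v, t in zip(names, values, types)})
        else:
            raise ValueError("data rows must start with an empty gutter column")
    return rows


table = '!tsv8\tage\tname\n!type\tInt\tStr\n\t44\t  alice\n\t 1\t "a\\tb"\n'
assert parse_tsv8(table) == [{"age": 44, "name": "alice"},
                             {"age": 1, "name": "a\tb"}]
```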

TODO: What about empty cells? Are they equivalent to null? TSV apparently can't have empty cells, as the rule is [character]+, not [character]*.

Column attributes:

!format could be Instant / Duration?

Design Notes

TODO: This section will be filled in as we implement TSV8.

Null Issues:
- Are bools nullable? Seems like no reason, but you could be missing
- Are ints nullable? In SQL they probably are
- Are floats nullable? Yes, like NA in R.
- Decoders can use a parallel typed column to indicate nulls?
It's OK to use plain TSV in YSH programs as well. You don't have to add types if you don't want to.

Summary

This document described an upgrade of JSON strings:

J8 Strings (in 3 styles)

And data formats built on top of these strings:

JSON8 - tree-shaped records
J8 Lines - Unix streams
TSV8 - table-shaped data

Appendix

Future Work

We could have an SEXP8 format for:

Concrete syntax trees, with location information
Textual IRs like WebAssembly

FAQ

Why are byte escapes spelled \yff, and not \xff as in C?

Because in JavaScript and Python, \xff is a code point, not a byte. That is, it's a synonym for \u00ff, which is encoded in UTF-8 as the 2 bytes 0xc3 0xbf.

This is exactly the confusion we want to avoid, so \yff is explicitly different.

One of Chrome's JSON encoders also has this confusion.
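
The distinction is easy to see in Python, where `\xff` in a string literal is a code point, while the same spelling in a bytes literal is a single byte:

```python
# In a str literal, \xff is the code point U+00FF.
code_point = "\xff"
assert code_point == "\u00ff"

# Encoding that code point as UTF-8 yields 2 bytes, not the byte 0xff.
assert code_point.encode("utf-8") == b"\xc3\xbf"

# Only in a bytes literal does \xff mean the single byte 0xff.
single_byte = b"\xff"
assert len(single_byte) == 1
assert single_byte != code_point.encode("utf-8")
```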

Why have both u'' and b'' strings, if only b'' is technically needed?

A few reasons:

Apps in languages like Python and Rust could make use of the distinction. Oils doesn't have a string/bytes distinction (on the "interior"), but many languages do.
Using u'' strings can avoid hacks like WTF-8, which is often required for round-tripping arbitrary JSON messages. Our u'' strings don't require WTF-8 because they can't represent surrogate halves.
u'' strings add trivial weight to the spec, since compared to b'' strings, they simply remove \yff. This is true because encoded J8 strings must be valid UTF-8.

Why not use double quotes like u"" and b""?

J8-style strings could have used double quotes. But single quotes make the new styles more visually distinct from "", and it allows '' as a synonym for u''.

Compared to "" strings, '' strings don't have a UTF-16 legacy.

How do I write a J8 encoder and decoder?

The list of errors at ref/chap-errors.html may be a good starting point.

TODO: describe the Oils implementation.

Should a J8 number be mapped to an Int, Float, or Decimal type?

J8 Notation is like JSON: it only specifies the syntax of messages on the wire.

The mapping of text to types is left to implementers, and depends on the programming language:

Languages like C, C++, and Rust have different sizes of ints and floats
Languages like JavaScript favor floats
It's valid to map to a Decimal type, if the language runtime supports it

OSH and YSH happen to use Int and Float, but this is logically separate from J8 Notation.
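
Python's json module shows how one message can map to different types. By default a number with a fraction becomes a float, and the standard `parse_float` hook lets the caller pick Decimal instead. The message below is made up for illustration.

```python
import json
from decimal import Decimal

message = '{"price": 0.1, "count": 42}'

# Default mapping: fractional numbers become float, whole numbers become int.
as_float = json.loads(message)
assert isinstance(as_float["price"], float)
assert isinstance(as_float["count"], int)

# Same text, different mapping: fractional numbers become Decimal.
as_decimal = json.loads(message, parse_float=Decimal)
assert as_decimal["price"] == Decimal("0.1")
```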

Glossary

J8 Strings - the building block for JSON8 and TSV8. There are 3 similar syntaxes: "foo" and b'foo' and u'foo'.
JSON strings - double quoted strings "foo".
J8-style strings - either b'foo' or u'foo'.

Formats built on J8 strings:

J8 Lines - unquoted and J8 strings, one per line.
JSON8 - An upgrade of JSON.
TSV8 - An upgrade of TSV.

Generated on Sun, 25 Aug 2024 12:30:01 -0400
