
Oil: Structured Data and Escaping

2017-09-29

Based on the last post, here is my recommendation:

Idea for binary data: \0

multiple dispatch:

read(n)      # default stdin, equivalent to read(stdin, n)
read(fd, n)

semantics are: dynamic dispatch based on var types. But can also be done statically. Because arity can be determined statically?

Left-to-right syntax might be better:

git log > tmp.bin
open tmp.bin | expectNulBytes(20) 
open tmp.bin | escape-segments

The od ... | grep ... | wc -l solution is clever in a way, but ultimately obscure and inefficient.

Of course, you can write an external tool in Python or Perl to perform such checks, but it's not a big stretch for the shell to handle this. The task falls firmly within the domain of string processing.
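
For instance, a check like `expectNulBytes(20)` can be approximated with standard tools today. This is a minimal sketch, not the pipeline from the last post; `expect_nul_bytes` is a hypothetical helper name, not an existing command:

```shell
# expect_nul_bytes N: read stdin and succeed only if it contains exactly
# N NUL bytes.
expect_nul_bytes() {
  # tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
  actual=$(tr -cd '\000' | wc -c | tr -d ' ')
  [ "$actual" -eq "$1" ]
}
```

It could be used as `expect_nul_bytes 20 < tmp.bin || exit 1`. The data never passes through a shell variable, which matters because bash variables cannot hold NUL bytes.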

Another alternative:

set n += s | count('\0')  # Pythonic methods
set n += count(s, '\0')  

Oil Goals

PROBLEM WITH BASH:

$ listing=$(find . -print0 )
$ echo $listing  # the NUL bytes are dropped, so the paths run together
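
In bash today, one way around this is to keep the NUL-delimited listing in a pipe instead of a variable. A minimal sketch, assuming an `xargs` that supports `-0` (GNU and BSD versions do); `list_files` is a name made up for illustration:

```shell
# list_files DIR: print one "[path]" line per regular file under DIR.
# The NUL-delimited listing stays in the pipe and is never stored in a
# shell variable, so names with spaces or newlines survive intact.
list_files() {
  find "$1" -type f -print0 | xargs -0 -n 1 printf '[%s]\n'
}
```

This works, but every consumer has to know that the stream is NUL-delimited, which is the kind of convention Oil could make explicit.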

How Oil Can Help

Also you can do JSON output like this:

--format '{"subject": "%s", "author": "%an"}'

# json-escaped strings, which is pretty much like c-escaped
--format '{"subject": "%Js", "author": "%Jan"}'

# json-escaped strings, which is pretty much like c-escaped
--format '{"subject": "%J{subject}", "author": "%J{author_name}"}'

This feels like too much logic to put in every tool.  There are more ugly mini-languages.  It's harder

The % language is simple enough for C.

In Oil, you can wrap up the %s with a spec maybe
stat --help-format stat --format-schema

How Unix Tools Can Interface With Oil

What's the minimum set of things they need to support?

- I think nothing. find and git log already support it.
- The only unfortunate thing is that git uses %xHH while find uses \0 and octal \001.
- I noted that stat already does it, but let's get rid of stat.
- What about binary data? I think the minimum solution is for binary data to be pointers in tmpfs.

What is the nice thing about this? If there are any tools that don't follow the style, I can submit patches to add %0 or \0 or whatever. I can even maintain such patches.

Of course, I'm not ruling out JSON. Shell is about heterogeneity. So if someone wants to make a complete universe of JSON tools, that is fine. In fact these complete universes already exist:

The nice thing is that because Oil is a shell, it gets all these tools "for free".

I think of these as algebraic monoids (not monads): a single data type and an operation on it.

Monoids support point-free programming.

Hacker News Comments: They were too extreme.

Dialectic:

Sys Admins vs. Software Engineers. I want to bridge these worlds.

Data Science vs. Software Engineers.

Addendum: Unicode utf-8 is good

Problem:

JSON Deficiencies

Solutions:

find could be a builtin

untrusted = $[git log --pretty="format:\x01%s\x02"]
assert(untrusted.count('\x01') == 2)
assert(untrusted.count('\x02') == 2)

I don't like method format?

But then | needs very high precedence?

assert(untrusted|count('\x01') == 2)
assert(untrusted|count('\x02') == 2)

assert(untrusted->count('\x01') == 2)
assert(untrusted->count('\x02') == 2)

No, this looks more like a namespace?

assert(untrusted:count('\x01') == 2)
assert(untrusted:count('\x02') == 2)

But this doesn't work unless you know exactly how many commits there are! You could do that by counting NUL bytes I guess.

quotation of string

fs . ( type == 'f' && printf "-- $size $path -- " )

what about git log though?

git log --json | while read entry {
  with (entry) {
    echo "<tr>"
    echo "  <td>$(description -> htmlEscape)</td>"
    echo "  <td>$(author -> htmlEscape)</td>"
    echo "</tr>"
  }
}

desc|htmlEscape

desc -> htmlEscape

-> could be the vectorized one? Fill it?

Discredited / fallen out of favor:

- using an "API" for structured data
- structured HTML templates.  Rewriting HTML in s-expressions syntax, etc.
  JSX is the best of both worlds?

Resolving ambiguity incorrectly leads to security bugs, for example the GIFAR
bug: a GIF that is also a JAR.

Open problem: should Oil involve %s strings?

I think I want to provide quotations of strings:

gitLog = $[git log --structured]
for (entry in gitLog) {
  echo "<tr>"
  echo "  <td>$(description -> htmlEscape)</td>"   
  echo "  <td>$(author -> htmlEscape)</td>"   
  echo "</tr>"
}

The challenge here is that I don't know what format git log should output.  It
could output JSON I suppose.

On the one hand, JSON has existed for over a decade, yet nobody has added it to
git or coreutils.  There have been dozens of attempts (including my own) to add
structured data, but none have panned out.

(PowerShell does it wrong)

On the other hand, if there were a shell (like Oil), that natively supported
JSON, then maybe there would be some incentive.

Honestly CSV over pipes might be more useful.  It is a table after all.

I have a wiki page called "structured data over pipes" that addresses this.


From this problem, you might generalize three different styles:

(1)  The Unix way: Just use `git log`.  Hope that you don't have any special
  characters.  Suffer from security problems.

(2) Pedantic way:

- Use some kind of "API".  This takes forever, and is also slow.

(3) Oil way

Bash is almost there!  To preserve whipuptitude, I would use my solution, but
then abort the program if there aren't the right number of % characters.

Admittedly, Oil did nothing in this example.  I solved the problem entirely
using git and bash.  But I would like idiomatic Oil shell scripts to respect
the meaning of strings, and not confuse code and data.
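
The abort step can be sketched in plain shell. This assumes the delimiters are the 0x01 and 0x02 bytes from the git log example, and that the expected number of fragments is known in advance; `count_byte` and `require_fragments` are hypothetical helper names:

```shell
# count_byte OCTAL: count occurrences on stdin of the byte with that
# three-digit octal code.
count_byte() {
  tr -cd "\\$1" | wc -c | tr -d ' '
}

# require_fragments FILE N: succeed only if FILE holds exactly N start
# bytes (0x01) and exactly N end bytes (0x02).
require_fragments() {
  starts=$(count_byte 001 < "$1")
  ends=$(count_byte 002 < "$1")
  [ "$starts" -eq "$2" ] && [ "$ends" -eq "$2" ]
}
```

A script would then run something like `require_fragments log.bin 20 || exit 1` before escaping. Matching counts don't prove the delimiters are properly paired, so this is a sanity check rather than a guarantee.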


I probably won't get to this post, because I have a bunch backlogged.  But I
still would like structured data over pipes.

Structured Data Over Pipes.  This is something Oil should have.  Elvish doesn't
do it over pipes?  Channels are pipes of structured data?



TODO: bash 4.4 might allow quoting/escaping?  Avoid eval?

But I think the issue is more subtle than that.  We can have the debate after I
explain the trick.



Bash supports C-style escapes that you can use instead.  As always, they come
with a weird syntax:

--> syntax sh
local format=$'<td>\x00%s\x01</td>'

The leading $ means that C-style escapes are respected.

So here is a way to safely print filenames to an HTML page.

$ touch '<script>alert("hi")</script>'

$ find . -printf $'<td>\x01%P\x02</td>' | escape-fragments

Although, notice that I changed the special bytes to 0x01 and 0x02. This is because bash strings are NUL-terminated, so $'\x00abc' has length zero, not 4.
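
`escape-fragments` is not defined in this post. Here is a rough sketch of what such a filter might look like, assuming it HTML-escapes only the bytes between 0x01 and 0x02 and drops the two delimiters; the function name and the choice of awk are illustrative:

```shell
# escape_fragments: hypothetical stand-in for the escape-fragments filter.
# Text between byte 0x01 and byte 0x02 is HTML-escaped; the delimiter bytes
# are removed and everything else passes through unchanged.
escape_fragments() {
  awk '
    BEGIN { RS = "\001"; ORS = "" }
    NR == 1 { print; next }            # text before the first 0x01 is trusted
    {
      end = index($0, "\002")
      if (end == 0) {
        print "escape_fragments: unterminated fragment\n" > "/dev/stderr"
        exit 1
      }
      untrusted = substr($0, 1, end - 1)
      trusted = substr($0, end + 1)
      gsub(/&/, "\\&amp;", untrusted)  # escape & first to avoid double escaping
      gsub(/</, "\\&lt;", untrusted)
      gsub(/>/, "\\&gt;", untrusted)
      gsub(/"/, "\\&quot;", untrusted)
      print untrusted trusted
    }
  '
}
```

With this definition, the `find ... | escape_fragments` pipeline would turn the `<script>` filename into inert text while leaving the surrounding `<td>` tags alone. A 0x01 or 0x02 byte inside a filename would still confuse it.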

I've noticed that software engineers tend to think of shell-style text processing as the "wrong way", while using an API with structured data is the "right way". After seeing some of the [horrible shell scripts][debootstrap] at the foundation of every Linux distro, I can see why they feel that way.

An HTML changelog is nice to have, not an essential part of the project, so let's just get it done!

I like my solution, at least for this particular task. It may or may not be secure against adversarial input, but Git commit descriptions will be reviewed, and it wouldn't be a bad idea to write a pre-commit hook that rejects unprintable bytes like 0x01 in the description.