
Oil Has Multi-line Commands and String Literals

2021-09-19

This post describes two new syntaxes that make Oil programs easier to read and write. Let me know what you think in the comments!

Table of Contents
Multi-Line Commands With A ... Prefix
Multi-Line String Literals: """ and ''' and $'''
An Orthogonal Design Is Easier to Use and Remember
But Nothing Is Perfect
They Can Be Combined
Reminder: Doc Comments With ###
What's Next?
Please Test Oil 0.9.2
Appendices
How Are Multi-Line Commands Parsed?
How Are Multi-Line Strings Parsed?
Review of Syntax Proposals

Multi-Line Commands With A ... Prefix

In Proposed Changes to Oil's Syntax (November 2020), I mentioned this problem with shell:

cat file.txt \    
  | sort \          # I can't put a comment here
  | cut -f 1 \
    # And I can't put one here
  | grep foo

That is, documenting long commands is hard because you can't mix \ line continuations and comments. I just released Oil 0.9.2, which solves this problem:

... cat file.txt    
  | sort            # Comment to the right is valid
  | cut -f 1 
    # Comment on its own line is valid
  | grep foo
  ;                 # Explicit terminator required

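In the multi-line context started by the ... prefix, as the example shows, comments are valid both at the end of a line and on a line of their own, and the command ends with an explicit ; terminator.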

The appendix describes how this is implemented.
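As a rough mental model only (this is not how Oil parses, and the function name is made up for illustration), the effect of the ... prefix can be imitated in Python. The sketch assumes that newlines act as plain whitespace and comments are dropped, and it uses the standard shlex module as a stand-in tokenizer:

```python
import shlex

def join_multiline_command(text: str) -> str:
    """Toy model: collapse a '...' command into one logical line.

    Hypothetical helper, not part of Oil. shlex treats newlines as
    whitespace, and comments=True drops '#' comments.
    """
    tokens = shlex.split(text, comments=True)
    if not tokens or tokens[0] != '...':
        raise ValueError("expected the '...' prefix")
    if tokens[-1] != ';':
        raise ValueError("expected an explicit ';' terminator")
    # Drop the prefix and the terminator; join the rest with spaces.
    return ' '.join(tokens[1:-1])

example = """\
... cat file.txt
  | sort            # Comment to the right is valid
  | cut -f 1
    # Comment on its own line is valid
  | grep foo
  ;                 # Explicit terminator required
"""

print(join_multiline_command(example))
# cat file.txt | sort | cut -f 1 | grep foo
```

Note that shlex also removes quotes, so this toy loses information that a real shell parser keeps. It only shows why comments and line breaks stop being a problem once the whole command is read in one mode.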

I've tagged this post #real-problems, since this mechanism solves a problem that multiple shell users have encountered. For example, see this January Reddit thread on Shell Scripts Are Executable Documentation.

Multi-Line String Literals: """ and ''' and $'''

In June's post Recent Progress on the Oil Language, I wrote that Oil has Python-like multi-line string literals, with enhancements like those in the Julia language.

Here are examples from the Oil Language Tour.

Double-quoted multi-line strings allow interpolation with $:

sort <<< """
  var sub: $x
  command sub: $(echo hi)
  expression sub: $[x + 3]
  """
# =>
# command sub: hi
# expression sub: 9
# var sub: 6

In single-quoted multi-line strings, every character is literal, including $:

sort <<< '''
  $2.00  # literal $, no interpolation
  $1.99
  '''
# =>
# $1.99
# $2.00

C-style multi-line strings interpret character escapes:

sort <<< $'''
  C\tD
  A\tB
  '''
# =>
# A        B
# C        D

An Orthogonal Design Is Easier to Use and Remember

(This section is long and relies on shell expertise. If you only care about using Oil, as opposed to understanding the design, feel free to skip it.)

These string literals are better than shell's here doc syntax in three ways:

(1) Leading whitespace is stripped in a more useful way.

(2) Multi-line strings are consistent with regular strings with respect to $var interpolation and character escapes like \n.

(3) Multi-line strings can be used in either commands or redirects.

In contrast, here docs can't be used directly with commands like echo, and the alternative causes too much I/O.

To elaborate, recall that this use of the <<< "here string" operator works in bash and OSH:

$ tr a-z A-Z <<< 'hello'
HELLO

And remember that the sort examples above used the <<< operator and not the << "here doc" operator. This is because Oil's multi-line strings are actually string literals!

Another consequence of this is that you can use a multi-line string directly in a command, as part of argv:

echo '''
  one
  two
  three
  '''
# =>
# one
# two
# three

In shell, regular strings can span multiple lines, but there's no way to strip leading whitespace, which makes code hard to read:

echo 'one
two
three'
# =>
# one
# two
# three

You could use a here doc and cat:

# This does too much I/O for a simple task
cat <<EOF
one
two
three
EOF

For such a simple task, this is inefficient in two ways:

  1. It causes I/O because Shells Use Temp Files to Implement Here Documents. (Oil doesn't use disk I/O, but it does start a "here doc writer" process.)
  2. It starts an external process cat rather than using the echo builtin.

To recap, I like this design because it's more orthogonal in at least 3 dimensions:

  1. Whether whitespace is stripped
  2. Whether $var and \n are respected
  3. Whether the string is used in a command or redirect

Also note:

But Nothing Is Perfect

However, Oil's string literal syntax still has a "wart": you can't put (statically-parsed) character escapes like \n in double-quoted strings.

Unfortunately, this is not orthogonal design. (We even document the warts for you; most languages don't.)

I've lived with this for a while and think it's OK. I believe it's important to keep not just the Oil language small, but also the combined OSH+Oil "surface area". In other words, I'm happy with 6 kinds of string literal (3 x 2 for the multiline variants), but I would not like 8, 10, or 12 kinds.

As always, I welcome contributions in this direction. However, I'd also suggest that this isn't the issue to start with, since it's one of the most difficult design issues.

They Can Be Combined

This ugly example combines multi-line commands and multi-line strings, and gives our parsing algorithms a workout! There's no reason for this in production code, but it illustrates the principle.

var x = 'one'

# print 3 args without separators
... write --sep '' --end '' --  
    """
    $x
    """         # 1. Double Quoted Multi-Line String
    '''
    two
    three
    '''         # 2. Single Quoted Multi-Line String
    $'four\n'   # 3. C-style string with explicit newline
  | tac         # Reverse
  | tr a-z A-Z  # Uppercase
  ;
# =>
# FOUR
# THREE
# TWO 
# ONE

Reminder: Doc Comments With ###

I also described Oil's doc comment feature in November of last year:

The line below a proc can have a special ### comment, and its value can be retrieved with pp proc.

proc restart(pid) {
  ### Restart server by sending it a signal
  
  kill $pid
}

What's Next?

A Tour of the Oil Language describes both of these features, and it was discussed on Hacker News a few days ago.

A few familiar questions about the project came up, so I drafted Blog Backlog: FAQ, Project Review, and the Future.

But I might just cut to the chase with What To Expect From Oil in the Near Future.

Please Test Oil 0.9.2

Try this feature out and tell me if there are any bugs! That is the main purpose of these blog posts.

Oil version 0.9.2 - Source tarballs and documentation.

Appendices

How Are Multi-Line Commands Parsed?

These notes are for contributors and people who want to reimplement the Oil language. I used our style of #parsing-shell to implement the subtle multi-line command syntax. It falls slightly outside what you'll see in textbooks on parsing.

First, here's an unusual fact: Oil has two levels of tokenization due to the inherent structure of the shell language.

  1. The Lexer outputs Token objects, and the WordParser consumes them.
  2. The WordParser outputs word_t objects (compound_word or Token), and the CommandParser consumes them.
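The two levels above can be pictured with a deliberately tiny Python sketch. Everything in it (the token kinds, the regex, the function names) is invented for illustration and is far simpler than Oil's real Lexer and WordParser. It only shows the shape: one stage emits tokens, the next groups adjacent tokens into words.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

class Token(NamedTuple):
    kind: str   # 'Space', 'SQ' (single-quoted), 'Op', or 'Lit'
    text: str

# Hypothetical toy grammar: whitespace, single-quoted strings,
# operators, and runs of literal characters.
TOKEN_RE = re.compile(r"""
    (?P<Space>\s+)
  | (?P<SQ>'[^']*')
  | (?P<Op>[|;&]+)
  | (?P<Lit>[^\s'|;&]+)
""", re.VERBOSE)

def lex(line: str) -> Iterator[Token]:
    """Level 1: characters -> tokens."""
    for m in TOKEN_RE.finditer(line):
        yield Token(m.lastgroup, m.group())

def parse_words(tokens: Iterable[Token]) -> List[List[Token]]:
    """Level 2: tokens -> words (a word is a list of adjacent tokens)."""
    words: List[List[Token]] = []
    current: List[Token] = []
    for tok in tokens:
        if tok.kind in ('Space', 'Op'):
            if current:               # whitespace or an operator ends a word
                words.append(current)
                current = []
            if tok.kind == 'Op':      # an operator is passed on by itself
                words.append([tok])
        else:
            current.append(tok)
    if current:
        words.append(current)
    return words

words = parse_words(lex("echo a'b c'd | wc"))
print([[t.text for t in w] for w in words])
# [['echo'], ['a', "'b c'", 'd'], ['|'], ['wc']]
```

The second word is built from three tokens, which is the kind of compound word a command parser would then consume as a single unit.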

To parse multi-line commands, we look for the ... prefix word at the start of an AndOr production in the shell grammar. This production handles chains like cd / && ls | wc -l && echo OK.

If we see ..., then we use a Python context manager to flip a flag on the WordParser to enter multi-line mode. When it's in this mode, it treats newlines and blank lines differently. (Python context managers are translated to C++ constructors and destructors by mycpp.)
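The flag-flipping pattern might look roughly like the following. The class and attribute names are hypothetical, not Oil's actual code. A class-based context manager is used because its __enter__ and __exit__ pair maps naturally onto a constructor and destructor.

```python
class ToyWordParser:
    """Hypothetical stand-in for a word parser with a mode flag."""
    def __init__(self) -> None:
        self.multiline = False

class MultilineMode:
    """Context manager that turns on multi-line mode for one block."""
    def __init__(self, w_parser: ToyWordParser) -> None:
        self.w_parser = w_parser
        self.saved = False

    def __enter__(self) -> None:
        # Remember the old value so nested uses restore correctly.
        self.saved = self.w_parser.multiline
        self.w_parser.multiline = True

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Runs on normal exit and on exceptions, like a C++ destructor.
        self.w_parser.multiline = self.saved

wp = ToyWordParser()
with MultilineMode(wp):
    print(wp.multiline)   # True
print(wp.multiline)       # False
```

The benefit of this shape is that the flag cannot stay set by accident: a parse error raised inside the block still restores the previous mode.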

Because ... is an unusual command prefix, I don't expect this to break existing shell code. So multi-line commands are valid in both bin/osh and bin/oil.

(Productivity note: I search the code for symbols like WordParser with grep $name */*.py.)

How Are Multi-Line Strings Parsed?

On the other hand, '''foo''' already has a meaning in shell. It's three string literals side by side using implicit concatenation.

  1. ''
  2. 'foo'
  3. ''
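The implicit concatenation listed above can be checked without a shell, using Python's standard shlex module. It is only a stand-in here: in POSIX mode it applies shell-style quoting rules, but it is not a shell.

```python
import shlex

# Three adjacent single-quoted literals ('' + 'foo' + '') form one word.
words = shlex.split("'''foo'''")
print(words)   # ['foo']

# An empty quoted string is still a word, just an empty one.
empty = shlex.split("''")
print(empty)   # ['']
```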

We take advantage of this to parse multi-line string literals when shopt --set parse_triple_quote is on. That is, we do not have tokens for ''', """, and $'''. Instead, we actually look for an empty string at the start of a word, then switch into another WordParser mode, and strip whitespace when we're done.
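The final whitespace-stripping step can be sketched in Python as follows. This is a guess at the general idea rather than Oil's algorithm: it assumes the newline after the opening delimiter is dropped, and that the indentation of the closing delimiter's line is what gets removed from every line. The function name is hypothetical.

```python
def strip_multiline(body: str) -> str:
    """Strip indentation from the raw text between triple quotes.

    Assumed rule (not confirmed against Oil's source): the whitespace
    before the closing delimiter is the prefix removed from each line.
    """
    if body.startswith('\n'):
        body = body[1:]              # drop the newline after the opener
    lines = body.split('\n')
    indent = lines[-1]               # text before the closing delimiter
    if indent.strip():
        return body                  # closer is not on its own line
    stripped = [
        line[len(indent):] if line.startswith(indent) else line
        for line in lines[:-1]
    ]
    return ''.join(line + '\n' for line in stripped)

raw = "\n  one\n  two\n  three\n  "   # as written between ''' and '''
print(repr(strip_multiline(raw)))
# 'one\ntwo\nthree\n'
```

Under this assumed rule, a closing delimiter at column 0 strips nothing, so the indentation of the closing quotes controls how the whole block reads.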

This is unusual, but it means that OSH and Oil share the same command and word lexer modes. This is a desirable property for keeping the upgrade path from OSH to Oil smooth, and I think it will make syntax highlighters and other tools easier to write.

Review of Syntax Proposals

This post described two syntax features, which happen to be the first two in Proposed Changes to Oil's Syntax (November 2020).

What about the others?