Sexpr Encoding

00001012920516 · Info · (manual) · #api #manual #reference #zettelstore

A zettel representation that is an s-expression (also known as a symbolic expression).

It is (relatively) easy to parse and contains all relevant information of a zettel: metadata and content. For example, take a look at the Sexpr encoding of this page, which is available via the Info sub-page of this zettel.

If transferred via HTTP, the content type will be text/plain.

Syntax of s-expressions

There are only two types of elements: atoms and lists.

A list always starts with a left parenthesis ((, U+0028) and ends with a right parenthesis (), U+0029). A list may contain a possibly empty sequence of elements, i.e. lists and/or atoms. In a list of at least two elements, a full stop character (., U+002E) before the last element signals that the last two elements form a pair. This allows a more space-efficient storage of data.
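To illustrate the difference between a list and a pair, here is a minimal Python sketch. It assumes the conventional Lisp model in which a list is a chain of two-slot cells, which this page does not spell out. The names Pair and render are hypothetical and not part of Zettelstore.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Pair:
    """A two-slot cell: `car` holds an element, `cdr` holds the rest."""
    car: Any
    cdr: Any  # another Pair, None (end of list), or an atom (dotted tail)


def render(value: Any) -> str:
    """Render a Pair structure as s-expression text."""
    if value is None:
        return "()"
    if not isinstance(value, Pair):
        return str(value)  # atoms are assumed to be already formatted
    parts = []
    node = value
    while isinstance(node, Pair):
        parts.append(render(node.car))
        node = node.cdr
    if node is not None:
        # The chain ends in an atom instead of the empty list: dotted pair.
        parts.extend([".", render(node)])
    return "(" + " ".join(parts) + ")"


two_element_list = Pair("A", Pair("B", None))  # needs two cells
single_pair = Pair("A", "B")                   # needs one cell
print(render(two_element_list))  # (A B)
print(render(single_pair))       # (A . B)
```

Under this assumption, the pair (A . B) needs one cell while the list (A B) needs two, which is one way to read the remark about storage.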

There are three syntactic forms for an atom: numbers, symbols and strings.

A number is a non-empty sequence of digits (0 ... 9). The smallest number is 0; there are no negative numbers.

A symbol is a non-empty sequence of printable characters, except left or right parenthesis. Unicode characters of the following categories count as printable characters in the above sense: letter (L), number (N), punctuation (P), symbol (S). Symbols are case-insensitive, i.e. ZETTEL and zettel denote the same symbol.
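The symbol rule can be checked with Python's standard unicodedata module. The following is only a sketch that follows the text literally. The function names is_symbol and same_symbol are hypothetical, and using casefold for the case-insensitive comparison is an assumption, since the exact folding rule is not stated here.

```python
import unicodedata

# Major Unicode categories that count as printable for symbols.
PRINTABLE_MAJOR_CATEGORIES = {"L", "N", "P", "S"}


def is_symbol(text: str) -> bool:
    """Return True if `text` satisfies the symbol rule as described."""
    if not text:
        return False  # a symbol must be non-empty
    for ch in text:
        if ch in "()":
            return False  # parentheses are punctuation, but excluded
        if unicodedata.category(ch)[0] not in PRINTABLE_MAJOR_CATEGORIES:
            return False
    return True


def same_symbol(a: str, b: str) -> bool:
    """Compare two symbols ignoring case (assumption: casefold)."""
    return a.casefold() == b.casefold()


print(is_symbol("ZETTEL"))              # True
print(is_symbol("two words"))           # False, the space is a separator
print(same_symbol("ZETTEL", "zettel"))  # True
```

Note that a pure digit sequence such as 42 also passes this check. A reader would presumably test for a number first.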

A string starts with a quotation mark (", U+0022), contains a possibly empty sequence of Unicode characters, and ends with a quotation mark. To allow a string to contain a quotation mark, the quotation mark must be prefixed by one backslash (\, U+005C). To allow a string to contain a backslash, the backslash must also be prefixed by one backslash. Unicode characters with a code less than U+FF are encoded by the sequence \xNM, where NM is the hex encoding of the character. Unicode characters with a code less than U+FFFF are encoded by the sequence \uNMOP, where NMOP is the hex encoding of the character. Unicode characters with a code less than U+FFFFFF are encoded by the sequence \UNMOPQR, where NMOPQR is the hex encoding of the character. In addition, the sequence \t encodes a horizontal tab (U+0009), and the sequence \n encodes a line feed (U+000A).
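The escape sequences can be decoded as in the following Python sketch. The function name decode_string is hypothetical, and the error handling (raising ValueError on unknown or truncated escapes) is an assumption, since the behavior for malformed input is not described here.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "t": "\t", "n": "\n"}
HEX_ESCAPE_LENGTHS = {"x": 2, "u": 4, "U": 6}  # \xNM, \uNMOP, \UNMOPQR


def decode_string(literal: str) -> str:
    """Decode a quoted s-expression string literal into its text value."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("string must start and end with a quotation mark")
    body = literal[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            raise ValueError("unescaped quotation mark inside string")
        if ch != "\\":
            out.append(ch)  # ordinary character, copied as is
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of string")
        kind = body[i + 1]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif kind in HEX_ESCAPE_LENGTHS:
            n = HEX_ESCAPE_LENGTHS[kind]
            digits = body[i + 2 : i + 2 + n]
            if len(digits) != n or not set(digits) <= HEX_DIGITS:
                raise ValueError("malformed hex escape")
            out.append(chr(int(digits, 16)))
            i += 2 + n
        else:
            raise ValueError("unknown escape sequence")
    return "".join(out)


print(decode_string(r'"say \"hi\""'))  # say "hi"
print(decode_string(r'"\x41\u00e9"'))  # Aé
```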

Atoms are separated by Unicode characters of category separator (Z).
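Putting the rules together, a reader can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Zettelstore implementation. The names tokenize and parse are hypothetical, the sample input is made up, string literals are kept undecoded, a quotation mark is assumed to end a symbol, and the full stop is simply passed through as an ordinary token.

```python
import unicodedata


def is_separator(ch: str) -> bool:
    """Separators are Unicode characters of category Z (Zs, Zl, Zp)."""
    return unicodedata.category(ch).startswith("Z")


def tokenize(text: str) -> list[str]:
    """Split text into parentheses, string literals and other atoms."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_separator(ch):
            i += 1
        elif ch in "()":
            tokens.append(ch)
            i += 1
        elif ch == '"':
            j = i + 1
            while j < len(text) and text[j] != '"':
                j += 2 if text[j] == "\\" else 1  # skip escaped character
            if j >= len(text):
                raise ValueError("unterminated string")
            tokens.append(text[i : j + 1])
            i = j + 1
        else:
            j = i
            # Assumption: a quotation mark also ends a number or symbol.
            while j < len(text) and text[j] not in '()"' and not is_separator(text[j]):
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


def parse(tokens: list[str]):
    """Build nested Python lists; atoms become int or str."""
    def read(pos: int):
        tok = tokens[pos]
        if tok == "(":
            items = []
            pos += 1
            while pos < len(tokens) and tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            if pos >= len(tokens):
                raise ValueError("missing right parenthesis")
            return items, pos + 1
        if tok == ")":
            raise ValueError("unexpected right parenthesis")
        if tok.isascii() and tok.isdigit():
            return int(tok), pos + 1      # number: digits 0 ... 9 only
        if tok.startswith('"'):
            return tok, pos + 1           # string literal, kept undecoded
        return tok.upper(), pos + 1       # symbols are case-insensitive

    if not tokens:
        raise ValueError("empty input")
    value, pos = read(0)
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the first element")
    return value


print(parse(tokenize('(para (text "Hello") 42 . rest)')))
# ['PARA', ['TEXT', '"Hello"'], 42, '.', 'REST']
```

Because tab and line feed belong to category Cc rather than Z, this literal sketch does not treat them as separators. A practical reader may want to accept them as well.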