2024-11-02 - No YAML, No Recutils
=================================
Back around 2011 i wrote a private database, and at some point
migrated it to a file-based format. Each record is a YAML file. The
data is presented in a vertical format with one line per field, plus
some multi-line blocks defined by indentation. Trivial to edit in any
text editor. I used Tcl and the yaml module from tcllib to process
the data. Recently i found the No YAML web site and i decided it was
time for a change.

No YAML

I briefly considered CSV, TSV, and JSON. They are all mature,
standardized formats, but in my opinion they fall short when it comes
to editing in a vertical format in a plain text editor.

I looked at GNU Recutils, since several folks wrote about it on their
phlogs. Like my YAML files, recutils presents the data in a vertical
format with one line per field, plus it can do multi-line blocks via
line continuations. The format is fine for my purposes.

One of my requirements is that i want this to work on FreeDOS too.
Recutils requires filesystem support for ACL, which is too fancy for
DOS. The format is simple enough, but the source code is surprisingly
complex. It does a fraction of what sqlite3 does, and in a less
portable, less robust way.

I tried making my own format based on ASCII control codes. I could
use Control-^ (the RS character) as the record separator, Control-_
(the US character) as the unit separator, AKA the field separator,
and Control-X (the CAN or Cancel character) to discard all text since
the beginning of the field. This format works in ed(1) and the calvin
vi clone on DOS. However, the control characters are a little
ridiculous to look at and type in. I did not want to foist such an
eyesore on my future self.

csvtofsv also uses ASCII control codes as delimiters

I tried another format based on something i found online. Like my
YAML files, this format has one file per record and one line per
field, presented in a vertical format.
The record separator is an empty line. The field separator is the EOL
(end of line). Each field has a name, a colon character, a space, and
optionally a value. Any line that begins with whitespace is a
continuation of the previous field.

This format is trivial to process in AWK. No special parser required.
I converted my private database to this format. Exporting the whole
database to CSV took 0.3 seconds in the AWK version, compared to 3
seconds in the Tcl & YAML version.

I added one feature: inline blocks of multi-line text. The block
format is the same as the line continuation format, except the
initial value is a backslash character.

For example, here is a line continuation.

    fieldname: First sentence.
      Second sentence.
      Third sentence.

When this value is read, the EOL and indentation are removed. That's
why this can also be represented without continuation.

    fieldname: First sentence. Second sentence. Third sentence.

Here is an inline block of multi-line text.

    fieldname: \
      Line 1 of 3.
      Line 2 of 3.
      Line 3 of 3.

When this value is read, the indentation is removed, but the EOL is
preserved. The value contains multiple lines.

Time for me to shut up and show them the code. Below are two small
AWK scripts to convert from gopher lawn format to TSV and back.

lawn2tsv.awk
tsv2lawn.awk

p.s.

In theory, if i wanted to migrate the data, i could use uncsv to
convert between TSV and CSV. GNU recutils can import and export CSV.

I was told that the gopher lawn format resembles the Header Fields
format in email standards. See section 2.2 of RFC 5322.

The vCard format is also similar. See section 6.10 of RFC 6350 for
Extended Properties and Parameters. I could have abused this format
but i think it is too complex for my purposes.

tags: bencollver,technical,unix
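
Appendix: a reader sketch

The format rules in this post (empty line between records, "name:
value" fields, indented continuation lines, and a backslash value to
start a multi-line block) are small enough to sketch in a few lines.
What follows is a minimal illustration in Python, not a port of
lawn2tsv.awk. The function name parse_records is hypothetical. The
sketch also makes a few assumptions the post does not settle:
continuation lines are joined with a single space, whitespace-only
lines count as record separators, and a repeated field name within a
record keeps only the last value.

```python
def parse_records(text):
    """Parse the one-line-per-field format into a list of dicts.

    Assumptions of this sketch (not confirmed by the original scripts):
    continuation lines are joined with one space, whitespace-only lines
    end a record, and a repeated field name keeps the last value.
    """
    records = []
    fields = []  # entries are [name, pieces, joiner] for the current record

    def flush():
        # Finish the current record, if it has any fields.
        if fields:
            records.append({name: joiner.join(pieces)
                            for name, pieces, joiner in fields})
            fields.clear()

    for line in text.splitlines():
        if not line.strip():
            # Empty line: record separator.
            flush()
        elif line[0] in " \t":
            # Leading whitespace: continuation of the previous field.
            # Indentation is dropped in both plain and block mode.
            if fields:
                fields[-1][1].append(line.strip())
        else:
            # "name: value", where the value is optional.
            name, _, value = line.partition(":")
            value = value.strip()
            if value == "\\":
                # Backslash value: multi-line block, line breaks are kept.
                fields.append([name, [], "\n"])
            else:
                # Ordinary field: continuation lines collapse to one line.
                fields.append([name, [value] if value else [], " "])

    flush()
    return records
```

Calling parse_records on the two examples from the post should give a
single-line value for the continuation and a three-line value for the
block. Splitting at the first colon means a value may itself contain
colons, as in a URL.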