Notes on writing a LaTeX(-like) beautifier

S. Gilles

2017-02-17

Background

  These are my notes written while creating llf[0]. They may be
  interesting, but they are mostly to jog my memory in case I ever have
  a chance to extend it or rewrite it. There is no overarching moral. At
  most, this is an anecdote.

  Back when I used Emacs, AUCTeX was invaluable, mostly for the
  auto-formatting. I require that my LaTeX files be visually consistent,
  and I do not have the memory or desire to do this myself; I'd prefer
  a computer to do it for me. So I would regularly run C-c C-q C-e
  (or whatever the command is).

  To make a long story short, Emacs is slow, so I needed a standalone
  reformatter.

  The classic solution is latexindent.pl. But that won't make an
  intelligent effort to break lines longer than 72 characters, so
  that's out. A few searches for ‘latex beautifier’ and ‘latex
  reformatter’ later, and it seems that such a thing doesn't exist.
  Well, that shouldn't be hard to make, right?

  This is a collection of examples which are sufficiently irritating
  that they caused me to make design decisions.

Roadblock 1: Turing Completeness

  The first thing I discovered was a bunch of posts saying that parsing
  TeX is Turing-complete. It's obligatory for anyone who writes anything
  about parsing TeX to give an example, so here's mine. It's not clever.

    \documentclass{article}
    \usepackage{forloop}
    \begin{document}
    
    \newcommand{\sievetest}[1]{
        \newcounter{j}
        \newcount\t
        \catcode`\~=3
        \forloop[1]{j}{2}{\value{j} < #1}{
            \t=#1
            \divide\t by \value{j}
            \multiply\t by \value{j}
            \ifnum\t=#1 \catcode`\~=9 \fi
        }
    }
    
    \sievetest{101}
    
    ~\pi~
    \end{document}

  If you \sievetest a prime, the ‘\pi’ is in math mode (because the
  catcode of ‘~’ makes it act the way you would expect ‘$’ to),
  and everything is fine. If you \sievetest a non-prime, then the document
  fails to parse, because the catcode of ‘~’ got reset.

  The point is that parsing TeX is at least as hard as evaluating it,
  and evaluation is Turing-complete; there exist[1] BASIC interpreters
  written in TeX.

Solution 1: “So don't do that”

  I've never needed to manipulate a catcode raw in my life. I've used
  \makeatletter and \makeatother a few times, but I've never wanted to
  use @ when it wasn't a letter. So my solution is to not beautify LaTeX
  or TeX, but a language which is sort of like them. It just so happens
  that the .tex files I write are at the intersection of this language
  and LaTeX.

Roadblock 2: Brackets, verbatim, and also verbatim

  The first set of problems lies simply in picking out the grammar.
  For example, the \left and \right control sequences introduce some
  rather irritating challenges. Without hardcoding names like \left
  and \right, how can a parser correctly figure out what to do with the
  square brackets in the following situation?

    \documentclass{article}
    \begin{document}
      \[ \left[ \int_A \omega \right] \]

      \begin{itemize}
      \item[$\left. \big| \right]$] Foo].
      \end{itemize}
    \end{document}

  (That's a trick question. The above isn't valid LaTeX: you'd need to
  put the name of the item in {} for it to work.)

  Furthermore, whatever the beautifier is, it has to handle this:

    \documentclass{article}

    \begin{document}
      \begin{verbatim}
    These lines,
    which would not \normally{ be valid LaTeX,
    must
    be preserved. And this must show: \begin{verbatim}.
      \end{verbatim}

      And this line, which stretches out so long it should split, \verb( must be split at the right place(, which is actually way back there.
    \end{document}

  The contents of the first verbatim block must be preserved as-is,
  including the unclosed bit that looks like a control sequence but
  actually isn't. The inline \verb command also needs special handling,
  not least because of its custom delimiter, which in this case is ‘(’.

  Exercise for the reader: Why does this document work (at least under
  pdflatex from texlive-2016, when encoded as UTF-8)

    \documentclass{article}
    \begin{document}
    My verbatim delimiters \verb⑨ ARE WRONG ⑼ !
    \end{document}

  while the following doesn't?

    \documentclass{article}
    \begin{document}
    My verbatim delimiters \verbℂ ARE WRONG 𝔸 !
    \end{document}

  (Hint: it has nothing to do with combining, BIDI, or invisibility.)

  Answer (ROT13): Fvapr qbphzrag vachg vf abg fcrpvsvrq, cqsyngrk vf
  gerngvat gur punenpgref olgr ol olgr. Gurersber, gur qryvzvgre vf
  gur svefg olgr bs gur rapbqrq pbqrcbvag, abg gur jubyr pbqrcbvag.
  PVEPYRQ QVTVG AVAR naq CNERAGURFVMRQ QVTVG AVAR obgu unir gur fnzr
  svefg olgr jura rapbqrq nf HGS-8, ohg QBHOYR-FGEHPX PNCVGNY P naq
  ZNGURZNGVPNY QBHOYR-FGEHPX PNCVGNY N qb abg. Guvf vf cebonoyl orpnhfr
  pbzcyrk ahzoref jrer zber jvqryl qrznaqrq guna nqryr evatf va 1993,
  jura Havpbqr 1.1 jnf fgnaqneqvmrq, fb gur qbhoyr-fgehpx nycunorg jnf
  nqqrq cvrprzrny naq vf fpnggrerq bire gur pbqrcbvag enatr.
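
  Spoiler warning: running the following gives the answer away. It is
  a small Python sketch (not part of llf; the helper name utf8_hex is
  made up for illustration) that prints the UTF-8 encoding of each of
  the four delimiter characters used in the two documents above.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

# The four delimiters from the two documents above, written as escapes:
# CIRCLED DIGIT NINE, PARENTHESIZED DIGIT NINE,
# DOUBLE-STRUCK CAPITAL C, MATHEMATICAL DOUBLE-STRUCK CAPITAL A.
DELIMITERS = "\u2468\u247c\u2102\U0001d538"

for ch in DELIMITERS:
    # Print the code point next to the bytes an 8-bit engine would read.
    print(f"U+{ord(ch):04X}  {utf8_hex(ch)}")
```

  Compare the first byte within each pair.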

Solution 2: LPEG

  There's only a minor amount of black magic here. LPEG makes writing
  grammars relatively painless. Verbatim environments are the easy part;
  treating them specially is as simple as the following.

    (Cg("\\begin{verbatim}", "beginbit") *
      Cg((P(1) - #P("\\end{verbatim}"))^0, "content") *
      Cg("\\end{verbatim}", "endbit"))
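
  For readers without LPEG at hand, roughly the same idea can be
  sketched in Python with a regular expression. This is not llf's code,
  and split_verbatim is a hypothetical name; the non-greedy ‘.*?’ plays
  the role of ‘(P(1) - #P("\\end{verbatim}"))^0’ above, so the block
  ends at the first \end{verbatim} and nothing inside it is interpreted.

```python
import re

# Everything from \begin{verbatim} to the first \end{verbatim}.
VERBATIM = re.compile(r"\\begin\{verbatim\}.*?\\end\{verbatim\}", re.DOTALL)

def split_verbatim(text):
    """Yield (kind, chunk) pairs, where kind is 'text' or 'verbatim'."""
    pos = 0
    for m in VERBATIM.finditer(text):
        if m.start() > pos:
            # Ordinary material before the verbatim block.
            yield ("text", text[pos:m.start()])
        # The verbatim block itself, to be passed through untouched.
        yield ("verbatim", m.group(0))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])
```

  Only the ‘text’ chunks would be handed to the reformatter; joining
  all chunks back together reproduces the input exactly.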

  Next comes the \verb handling, probably the second ugliest piece of
  code in llf today. It looks something like:

    function f (s, i, delim, c) return delim == c end
    backslash_verb =
      (Cg("\\verb", "beginbit") *
       Cg(V"verbposdelims", "delim") *
       Cg((P(1) - Cmt(Cb("delim") * C(1), f))^0, "content") *
       Cg(Cmt(Cb("delim") * C(1), f), "endbit"))
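
  The matching rule itself is simple once stated outside LPEG: the
  character right after \verb is the delimiter, and the span runs to
  the next occurrence of that same character. A minimal Python sketch
  of that rule follows. It is not llf's code; find_verb_spans is a
  hypothetical name, and the sketch assumes the delimiter is a single
  non-letter character and ignores the \verb* variant.

```python
def find_verb_spans(text):
    """Return (start, end) index pairs, one per inline verb span found."""
    marker = "\\verb"
    spans = []
    i = 0
    while True:
        start = text.find(marker, i)
        if start == -1:
            return spans
        delim_at = start + len(marker)
        # A letter here means a longer name such as \verbatim, not \verb.
        if delim_at >= len(text) or text[delim_at].isalpha():
            i = delim_at
            continue
        delim = text[delim_at]
        close = text.find(delim, delim_at + 1)
        if close == -1:
            # No closing delimiter: leave this occurrence alone.
            i = delim_at
            continue
        spans.append((start, close + 1))
        i = close + 1
```

  A line breaker could then refuse to split anywhere inside one of the
  returned spans.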

  Finally, the bracket mess. It turns out that \left and \right are,
  in fact, very special control sequences, so hardcoding them is
  appropriate. Detecting when ‘[’ and ‘]’ need to be balanced
  is a little harder: a beautifier needs to handle

    \begin{theorem}[Euclid]
    \newcommand{\foo}[3][2]{bar}
    \item[\texttt{const char *foo[10]}]
    \item[foo]bar]

  This last case inspires the ugliest piece of code in llf: when
  recursing into the contents of balanced square brackets, the entire
  body match is duplicated, but without allowing square brackets
  (recursing back into curly brackets puts the parser back into its
  normal state). This allows greedy matching to work for all four cases.
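
  The effect of that rule on the four cases can be sketched in a few
  lines of Python. This is a simplification, not llf's grammar:
  optional_arg_end is a hypothetical name, brackets inside curly braces
  are simply skipped rather than re-parsed, a backslash is assumed to
  escape the next character, and \left and \right are not handled.

```python
def optional_arg_end(text, start):
    """Given text[start] == '[', return the index just past its ']'.

    Inside the brackets a bare ']' always closes; any '[' or ']' nested
    inside curly braces is treated as ordinary text. Returns None if no
    closing bracket is found.
    """
    assert text[start] == "["
    depth = 0  # curly-brace nesting depth
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # assumption: skip an escaped character such as \{
            continue
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
        elif ch == "]" and depth == 0:
            return i + 1
        i += 1
    return None
```

  On ‘\item[foo]bar]’ this returns the position just after ‘[foo]’,
  leaving ‘bar]’ as ordinary text, while on the \texttt example the
  inner ‘[10]’ is skipped because it sits inside curly braces.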

Roadblock 3: TikZ and pseudocode

  Some LaTeX environments don't really work well as blobs of
  paragraphs. Personally, I notice this most when working with TikZ and
  pseudocode. Whatever the beautifier is, it has to handle something like

    \documentclass{article}
    \usepackage{algorithm}
    \usepackage{algpseudocode}
    \usepackage{tikz}
    \usepackage{pgfplots}
    \usepgfplotslibrary{patchplots}
    \begin{document}
      \begin{algorithm}
        \begin{algorithmic}
          \Function{Foo}{$a$}
            \If{$a = 3$}
              \State \Return 0
            \Else
              \State \Call{Bar}{}
              \State \Return -1
            \EndIf
          \EndFunction
        \end{algorithmic}
      \end{algorithm}
      \begin{tikzpicture}
        \begin{axis}[xmin=-3, xmax=5, ymin=-10, ymax=10]
          \addplot[color=black,dashed,domain=-10:10, samples=50]
          plot ({-1}, {\x});
          \addplot[color=red,domain=0.1:10, samples=50]
          plot ({-1 - \x}, {1/(\x*\x)});
          \addplot[patch,mesh,patch type=quadratic spline,color=red]
          coordinates { (-0.6,10) (0.7,-5) (0,0) };
          \addplot[color=red,domain=0.7:1, samples=50] plot ({\x},
          {-31.18 + 102.22*\x - 128.88*\x*\x + 51.84*\x*\x*\x});
          \addplot[color=red,domain=1:3, samples=50] plot ({\x}, {-6 +
            (\x-1)*(\x-1)});
          \addplot[color=red,domain=0.25:5, samples=50] plot
          ({\x+2.75}, {-1/(2*\x)});
          \draw (axis cs: -3,0) -- (axis cs:5,0);
          \draw (axis cs: 0,-10) -- (axis cs:0,10);
          \draw (axis cs: 3,-2) node[circle,fill,label=below right:IP,
            inner sep=1pt] {};
          \draw (axis cs: 1,-6) node[circle,fill,label=below right:min,
            inner sep=1pt] {};
        \end{axis}
      \end{tikzpicture}
    \end{document}

Solution 3: “Make the user do it”

  Since LPEG practically forced the choice of Lua, it's extremely easy
  to expose configuration to the user. The user can then fill in the
  control sequences that should adjust indentation, the control
  sequences that should always start their own lines, and so on.
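
  To make the idea concrete, here is a minimal Python sketch of
  indentation driven by user-supplied control sequences. llf's real
  configuration is Lua and its option names are not shown in these
  notes, so every key and function name below (indent_after,
  dedent_before, indent_width, reindent) is hypothetical.

```python
import re

# Hypothetical user configuration for the algorithmic example above.
CONFIG = {
    "indent_after": {r"\Function", r"\If", r"\Else"},
    "dedent_before": {r"\EndFunction", r"\EndIf", r"\Else"},
    "indent_width": 2,
}

CONTROL_SEQUENCE = re.compile(r"\\[A-Za-z]+")

def leading_control_sequence(line):
    """Return the control sequence a line starts with, or None."""
    m = CONTROL_SEQUENCE.match(line)
    return m.group(0) if m else None

def reindent(lines, config=CONFIG):
    """Re-indent lines according to their leading control sequence."""
    out = []
    level = 0
    for raw in lines:
        line = raw.strip()
        word = leading_control_sequence(line)
        if word in config["dedent_before"]:
            level = max(level - 1, 0)
        pad = " " * (config["indent_width"] * level)
        out.append((pad + line) if line else "")
        if word in config["indent_after"]:
            level += 1
    return out
```

  Listing \Else in both sets makes it step out one level for its own
  line and back in for the lines that follow it.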

Conclusion

  This is actually the fourth tool I've written to do this task.
  The other three became untenable because of performance issues (a
  Perl version with an unfortunate choice of grammar library), naivety
  in design (a Flex version with some fundamental misconceptions about
  LaTeX structure), and overambition (an attempt to implement the TeX
  parsing algorithm from scratch).

[0] git://repo.or.cz/llf.git
[1] http://tug.org/TUGboat/tb11-3/tb29greene.pdf