Notes on writing a LaTeX(-like) beautifier

S. Gilles
2017-02-17

Background

These are my notes written while creating llf[0]. They may be
interesting, but they are mostly to jog my memory in case I ever have a
chance to extend it or rewrite it. There is no overarching moral. At
most, this is an anecdote.

Back when I used Emacs, AUCTeX was invaluable, mostly for the
auto-formatting. I require that my LaTeX files be visually consistent,
and I do not have the memory or desire to do this myself; I'd prefer a
computer to do it for me. So I would regularly run C-c C-q C-e (or
whatever the command is).

To make a long story short, Emacs is slow, so I needed a standalone
reformatter. The classic solution is latexindent.pl. But that won't
make an intelligent effort to break lines longer than 72 characters, so
that's out. A few searches for ‘latex beautifier’ and ‘latex
reformatter’ later, and it seems that such a thing doesn't exist. Well,
that shouldn't be hard to make, right?

This is a collection of examples which are sufficiently irritating that
they caused me to make design decisions.

Roadblock 1: Turing Completeness

The first thing I discovered is a bunch of posts saying that parsing
TeX is Turing Complete. It's obligatory for anyone who writes anything
about parsing TeX to give an example, so here's mine. It's not clever.

    \documentclass{article}
    \usepackage{forloop}
    \begin{document}
    \newcommand{\sievetest}[1]{
      \newcounter{j}
      \newcount\t
      \catcode`\~=3
      \forloop[1]{j}{2}{\value{j} < #1}{
        \t=#1
        \divide\t by \value{j}
        \multiply\t by \value{j}
        \ifnum\t=#1 \catcode`\~=9 \fi
      }
    }
    \sievetest{101}
    ~\pi~
    \end{document}

If you \sievetest a prime, the ‘\pi’ is in math mode (because the
catcode of ‘~’ makes it act the way you would expect ‘$’ to), and
everything is fine. If you \sievetest a non-prime, then the document
fails to parse, because the catcode of ‘~’ got reset.
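To make the dependency concrete, here is a minimal Python sketch of the arithmetic that \sievetest performs. It is not TeX and not part of llf; the function name and the two constants are made up for illustration. It only mirrors the loop in the example: try every j from 2 up to n - 1, and flip the catcode of ‘~’ as soon as one of them divides n.

```python
# Illustrative only: mirrors the \sievetest loop, not a TeX implementation.
MATH_SHIFT = 3  # category code that makes a character behave like '$'
IGNORED = 9     # category code for characters TeX silently drops


def tilde_catcode_after_sievetest(n):
    """Return the catcode '~' would have after \\sievetest{n}."""
    catcode = MATH_SHIFT          # \catcode`\~=3
    for j in range(2, n):         # \forloop[1]{j}{2}{\value{j} < n}
        t = n                     # \t=n
        t //= j                   # \divide truncates
        t *= j                    # \multiply
        if t == n:                # j divides n exactly
            catcode = IGNORED     # \catcode`\~=9
    return catcode
```

Under these assumptions the function returns 3 for 101 and 9 for 100, so whether ‘~\pi~’ is math or an error depends on a computation a parser would have to run.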
The point is that parsing is at least as hard as evaluating, and
evaluating is Turing-complete; there exist[1] BASIC interpreters
written in TeX.

Solution 1: “So don't do that”

I've never needed to manipulate a catcode raw in my life. I've used
\makeatletter and \makeatother a few times, but I've never wanted to
use @ when it wasn't a letter.

So my solution is to not beautify LaTeX or TeX, but a language which is
sort of like them. It just so happens that the .tex files I write are
at the intersection of this language and LaTeX.

Roadblock 2: Brackets, verbatim, and also verbatim

The first set of problems is simply in picking out the grammar. For
example, the \left and \right control sequences introduce some rather
irritating challenges. Without hardcoding names like \left and \right,
how can a parser correctly figure out what to do with the square
brackets in the following situation?

    \documentclass{article}
    \begin{document}
    \[ \left[ \int_A \omega \right] \]
    \begin{itemize}
    \item[$\left. \big| \right]$] Foo].
    \end{itemize}
    \end{document}

(That's a trick question. The above isn't valid LaTeX: you'd need to
put the name of the item in {} for it to work.)

Furthermore, whatever the beautifier is, it has to handle this:

    \documentclass{article}
    \begin{document}
    \begin{verbatim}
    These lines, which would not \normally{
    be valid LaTeX, must be preserved. And this
    must show: \begin{verbatim}.
    \end{verbatim}
    And this line, which stretches out so long it should split, \verb( must be split at the right place(, which is actually way back there.
    \end{document}

The contents of the first verbatim block must be preserved as-is,
including the unclosed bit that looks like a control sequence, but
actually isn't. The inline \verb command also needs to be handled
specially, not least because of its custom delimiter, which in this
case is ‘(’.
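The two verbatim cases can be sketched with ordinary regular expressions. The Python below is only an illustration under stated assumptions, not llf's code: the names VERBATIM_ENV, INLINE_VERB and protected_spans are hypothetical, and the \verb delimiter is assumed to be a single character that is not a letter, whitespace, or ‘*’. It returns the character ranges a reformatter would have to copy through untouched.

```python
import re

# Hypothetical helper, not llf: find text that must not be reflowed.
# Non-greedy match, so a stray "\begin{verbatim}" inside the block is
# treated as content and the first "\end{verbatim}" closes it.
VERBATIM_ENV = re.compile(
    r"\\begin\{verbatim\}.*?\\end\{verbatim\}", re.DOTALL)

# \verb takes the next character as its delimiter; the backreference
# \1 finds the matching close. No DOTALL: \verb may not span lines.
INLINE_VERB = re.compile(r"\\verb\*?([^A-Za-z\s*])(.*?)\1")


def protected_spans(source):
    """Return sorted (start, end) offsets of text to leave unchanged."""
    spans = [m.span() for m in VERBATIM_ENV.finditer(source)]
    for m in INLINE_VERB.finditer(source):
        # A \verb inside a verbatim environment is just content.
        if not any(s <= m.start() < e for s, e in spans):
            spans.append(m.span())
    return sorted(spans)
```

On the example document, this would protect the whole verbatim environment and the span from ‘\verb(’ to the second ‘(’, which leaves the trailing ‘, which is actually way back there.’ free to be wrapped onto a new line.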
Exercise for the reader: Why does this document work (at least under
pdflatex from texlive-2016, when encoded as UTF-8)

    \documentclass{article}
    \begin{document}
    My verbatim delimiters \verb⑨ ARE WRONG ⑼ !
    \end{document}

while the following doesn't?

    \documentclass{article}
    \begin{document}
    My verbatim delimiters \verbℂ ARE WRONG 𝔸 !
    \end{document}

(Hint: it has nothing to do with combining, BIDI, or invisibility.)

Answer (ROT13): Fvapr qbphzrag vachg vf abg fcrpvsvrq, cqsyngrk vf
gerngvat gur punenpgref olgr ol olgr. Gurersber, gur qryvzvgre vf gur
svefg olgr bs gur rapbqrq pbqrcbvag, abg gur jubyr pbqrcbvag. PVEPYRQ
QVTVG AVAR naq CNERAGURFVMRQ QVTVG AVAR obgu unir gur fnzr svefg olgr
jura rapbqrq nf HGS-8, ohg QBHOYR-FGEHPX PNCVGNY P naq ZNGURZNGVPNY
QBHOYR-FGEHPX PNCVGNY N qb abg. Guvf vf cebonoyl orpnhfr pbzcyrk
ahzoref jrer zber jvqryl qrznaqrq guna nqryr evatf va 1993, jura
Havpbqr 1.1 jnf fgnaqneqvmrq, fb gur qbhoyr-fgehpx nycunorg jnf nqqrq
cvrprzrny naq vf fpnggrerq bire gur pbqrcbvag enatr.

Solution 2: LPEG

There's only a minor amount of black magic here. LPEG makes writing
grammars relatively painless. Verbatim environments are the easy part;
treating them specially is as simple as the following.

    (Cg("\\begin{verbatim}", "beginbit") *
     Cg((P(1) - #P("\\end{verbatim}"))^0, "content") *
     Cg("\\end{verbatim}", "endbit"))

Next comes the \verb handling, probably the second ugliest piece of
code in llf today. It looks something like:

    function f (s, i, delim, c)
      return delim == c
    end

    backslash_verb = (Cg("\\verb", "beginbit") *
                      Cg(V"verbposdelims", "delim") *
                      Cg((P(1) - Cmt(Cb("delim") * C(1), f))^0, "content") *
                      Cg(Cmt(Cb("delim") * C(1), f), "endbit"))

Finally, the bracket mess. It turns out that \left and \right are, in
fact, very special control sequences, so hardcoding them is appropriate.
Detecting when ‘[’ and ‘]’ need to be balanced is a little harder: a
beautifier needs to handle

    \begin{theorem}[Euclid]
    \newcommand{\foo}[3][2]{bar}
    \item[\texttt{const char *foo[10]}]
    \item[foo]bar]

This last inspires the ugliest piece of code in llf: when recursing
into the contents of balanced braces, the entire body match is
duplicated, but without allowing square brackets (recursing back into
curly brackets puts the parser back into normal state). This allows
greedy matching to work for all four cases.

Roadblock 3: TikZ and pseudocode

Some LaTeX environments don't really work well as blobs of paragraphs.
Personally, I notice this most when working with TikZ and pseudocode.
Whatever the beautifier is, it has to handle something like

    \documentclass{article}
    \usepackage{algorithm}
    \usepackage{algpseudocode}
    \usepackage{tikz}
    \usepackage{pgfplots}
    \usepgfplotslibrary{patchplots}
    \begin{document}
    \begin{algorithm}
      \begin{algorithmic}
        \Function{Foo}{$a$}
          \If{$a = 3$}
            \State \Return 0
          \Else
            \State \Call{Bar}{}
            \State \Return -1
          \EndIf
        \EndFunction
      \end{algorithmic}
    \end{algorithm}
    \begin{tikzpicture}
      \begin{axis}[xmin=-3, xmax=5, ymin=-10, ymax=10]
        \addplot[color=black,dashed,domain=-10:10, samples=50]
          plot ({-1}, {\x});
        \addplot[color=red,domain=0.1:10, samples=50]
          plot ({-1 - \x}, {1/(\x*\x)});
        \addplot[patch,mesh,patch type=quadratic spline,color=red]
          coordinates { (-0.6,10) (0.7,-5) (0,0) };
        \addplot[color=red,domain=0.7:1, samples=50]
          plot ({\x}, {-31.18 + 102.22*\x - 128.88*\x*\x + 51.84*\x*\x*\x});
        \addplot[color=red,domain=1:3, samples=50]
          plot ({\x}, {-6 + (\x-1)*(\x-1)});
        \addplot[color=red,domain=0.25:5, samples=50]
          plot ({\x+2.75}, {-1/(2*\x)});
        \draw (axis cs: -3,0) -- (axis cs:5,0);
        \draw (axis cs: 0,-10) -- (axis cs:0,10);
        \draw (axis cs: 3,-2)
          node[circle,fill,label=below right:IP, inner sep=1pt] {};
        \draw (axis cs: 1,-6)
          node[circle,fill,label=below right:min, inner sep=1pt] {};
      \end{axis}
    \end{tikzpicture}
    \end{document}

Solution 3: “Make the user do it”

Since LPEG practically forced the choice of Lua, it's extremely easy to
expose configuration to the user. That makes it extremely easy to allow
the user to fill in the control sequences that should adjust
indentation, the control sequences that should always start their own
lines, and so on.

Conclusion

This is actually the fourth tool I've written to do this task. The
other three became untenable for performance issues (a Perl version
with an unfortunate choice of grammar library), naivety in design (a
Flex version with some fundamental misconceptions about LaTeX
structure), or overambition (an attempt to implement the TeX parsing
algorithm from scratch).

[0] git://repo.or.cz/llf.git
[1] http://tug.org/TUGboat/tb11-3/tb29greene.pdf