EPUB sucks
S. Gilles
2017-07-01

I have a large pile of epub files sitting around. After encountering a few
rendering glitches on my physical reader (as well as some typos), I decided
to recompile a few of them to perform some minor edits, and maybe add some
covers to the Project Gutenberg ones while I was at it. I thought that since
EPUB was built from well-known, widely used technology like ZIP and XHTML,
it would probably be usable.

As it turns out, EPUB is a terrible, terrible format. In no particular order:

 o Conceptually, EPUB is “just a bunch of (X)HTML files in a ZIP, with some
   metadata”. But there are absurdly strict restrictions on that ZIP. The
   “mimetype” file has to come first in the ordering, it must not be
   compressed, and it must have no metadata. For the rest of the file,
   directory entries must be suppressed. This is so that libmagic(5)-style
   detection works more easily, and is how we get the magic

       zip -X0 foo.epub mimetype
       zip -Xur9D foo.epub *

   sequence of commands.

   However, the structure of ZIP files already provides a trivial solution:
   arbitrary archive comments can be added at a predictable location (the
   end) of a ZIP file. These have been abused to create hosts of polyglot
   files already. It's just as easy to check for an uncompressed
   mimetypeapplication/epub+zip sequence at the beginning of the file as it
   is to ensure that the file ends with the bytes
   ^@this_is_an_epub_file_okay?, so the EPUB spec could simply have mandated
   a particular comment and stayed out of micromanaging ZIP arguments.

   The EPUB choice sucks because it makes recreating the epub a nightmare:
   generic tools like archivemount can't be trusted to faithfully preserve
   an EPUB file while operating on it.

 o Absurd redundancy of structure. The file I'm looking at now has:

   - a META-INF/container.xml file, which contains a hierarchy of files,
     but the hierarchy is trivial and just points to content.opf.
   - inside content.opf, there is a <manifest> object, containing a list of
     all content-bearing files in the EPUB
   - below that, there is a <spine> object, which is just a list of ids
     previously established by the manifest.
   - right below that is a <guide> object, which doesn't use the ids from
     the manifest.
   - a toc.ncx file with exactly one entry,
   - and that entry is a link to a table of contents created in HTML as
     part of the book, bypassing the whole point of toc.ncx.

   The <manifest> lists all objects in the file, which is rather useless
   even in a streaming situation. The <spine> designates an ordering of a
   subset of the <manifest>, and the <guide> applies special types to
   elements of the <manifest>. These could easily be compressed into one
   object. That object could be the toc.ncx file.

   EPUB 3 makes this even worse, I hear.

 o Poorly constructed website. Searching for epub specifications brings up
   the page for an older version, but since EPUB 3 exists, I tried changing
   that URL to match. But the new URL gives a 301 redirect pointing to
   itself, for an endless loop. I'm not even sure how someone managed to do
   that. (The actual specification for 3.0.1 lives at a different URL.)

 o Investigating that specification yields an example for the title of
   LOTR. Instead of something straightforward, the EPUB spec has opted for
   the following:

       <dc:title id="t1">The Fellowship of the Ring</dc:title>
       <meta refines="#t1" property="title-type">main</meta>
       <dc:title id="t2">The Lord of the Rings</dc:title>
       <meta refines="#t2" property="title-type">collection</meta>
       <dc:title id="t3">THE LORD OF THE RINGS, Part One: The Fellowship of the Ring</dc:title>
       <meta refines="#t3" property="title-type">expanded</meta>
       …

   Those jury-rigged id= and refines= attributes are nothing short of
   insane genius: it takes skill to start with XML and build a
   specification in which it is possible to write a property that modifies
   the wrong object, or no object at all.

 o Specifying a cover for a music album, to a music player, is pretty
   simple. You put a file called cover.jpg in the directory, or perhaps
   folder.jpg if you belong to that camp. For an EPUB document, the fastest
   way I can figure out is the following:

   Put cover.jpg in the root, add it to the <manifest> of content.opf, and
   add something like <meta name="cover" content="cover"/> to the
   <metadata> of content.opf, making sure that the content attribute is the
   same as the id from the manifest.
   When that doesn't work, make a cover.xhtml file with appropriate CSS to
   reference cover.jpg as an image element (make sure to see whether
   ‘width=100%’ or ‘height=100%’ is appropriate, since there's no way to
   easily scale-to-fit while preserving aspect ratios in
   lowest-common-denominator HTML+CSS), add THAT to the manifest, then add
   it to the <guide> element of content.opf via something like
   <reference type="cover" href="cover.xhtml"/>.

   Just to save you some time, you can't put cover.jpg in the guide
   directly: each entry must be an “OPS Content Document”, which is their
   name for “An XHTML document that conforms to our DTD”.

This format was not meant for humans to work with; it was meant for
companies to charge other companies to churn out.
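
For readers who want to see the ZIP complaint from the first point concretely, here is a rough Python sketch of both detection schemes: the mimetype-first layout that EPUB mandates, and the archive-comment alternative suggested above. It is only an illustration built on the standard zipfile module, not any official tooling; the function names (write_epub, has_mimetype_magic, write_commented_zip, has_comment_magic) are made up for this example, and the comment string is the joke one from the text rather than anything a spec defines.

```python
import io
import struct
import zipfile

MIMETYPE = b"application/epub+zip"
COMMENT = b"this_is_an_epub_file_okay?"  # hypothetical marker from the text


def write_epub(files: dict[str, bytes]) -> bytes:
    """Pack files (archive path -> content) with the mimetype entry first."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # First entry: "mimetype", stored uncompressed, empty extra field.
        first = zipfile.ZipInfo("mimetype")
        first.compress_type = zipfile.ZIP_STORED
        zf.writestr(first, MIMETYPE)
        # Everything else may be compressed; no directory entries are added.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def has_mimetype_magic(blob: bytes) -> bool:
    """Check the fixed-offset bytes that the mimetype-first rule produces."""
    if len(blob) < 30 or blob[:4] != b"PK\x03\x04":
        return False
    # Local file header: method at offset 8, name/extra lengths at 26.
    (method,) = struct.unpack_from("<H", blob, 8)
    name_len, extra_len = struct.unpack_from("<HH", blob, 26)
    return (
        method == 0
        and extra_len == 0
        and blob[30:30 + name_len] == b"mimetype"
        and blob[38:38 + len(MIMETYPE)] == MIMETYPE
    )


def write_commented_zip(files: dict[str, bytes]) -> bytes:
    """Pack files in any order and mark the archive with a trailing comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
        zf.comment = COMMENT  # written at the very end of the archive
    return buf.getvalue()


def has_comment_magic(blob: bytes) -> bool:
    """Check for the marker comment at the end of the archive."""
    return blob.endswith(COMMENT)


if __name__ == "__main__":
    book = {"META-INF/container.xml": b"<container/>", "content.opf": b"<package/>"}
    print(has_mimetype_magic(write_epub(book)))
    print(has_comment_magic(write_commented_zip(book)))
```

Both checks are a handful of byte comparisons. The difference is that the first one only holds if every tool that ever rewrites the archive keeps the entry order, compression method, and extra field intact, whereas the second only depends on the archive comment surviving.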