Parallelizing EPUB polishing

With Ubuntu 12.04 came a usable version of Calibre (0.8.38). It included the plugin that I wrote for extracting text from OCR-ed DJVU files, so there was no real need to run Calibre from source anymore.

I was in general very happy using Calibre from the .deb installation, although I did use my own scripts for copying the contents to the three different ebook readers that my children, my girlfriend and I use. Those scripts take care, for example, of transforming the author name so that “Anthon van der Neut” comes under the N in the ebook-reader index, as it is supposed to.
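A minimal sketch of such a name transformation, not the actual script: it assumes the simplest possible rule, namely that the last word of the name is the sort key and everything before it (first names and particles like “van der”) follows after a comma. The function name sort_name is made up for this illustration.

```python
def sort_name(author):
    """Turn 'Firstname [particles] Lastname' into 'Lastname, Firstname [particles]'.

    Assumption: the last whitespace-separated word is the surname to sort on.
    """
    words = author.split()
    if len(words) < 2:
        # Single-word names (or empty strings) have nothing to reorder.
        return author
    last = words[-1]
    rest = " ".join(words[:-1])
    return "{}, {}".format(last, rest)


if __name__ == "__main__":
    print(sort_name("Anthon van der Neut"))
```

This rule breaks for authors with a multi-word surname that has no particle, so a real script would need an exception list for those.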

Not everyone is in the habit of storing the author name as it is supposed to be stored (“Firstname Lastname”) and converting it on the fly for the ebook reader. Nor do many ebook readers seem to understand the opf:file-as="Lastname, Firstname" attribute for the creator entry. Therefore many EPUB files for free books have the author name in the incorrect “Lastname, Firstname” format. This is of course not the author’s name, but a mangling of it to allow proper sorting on stupid ebook devices.

BibTeX did a better job at sorting names back in 1985...

Unfortunately these EPUB files with mangled names also end up in repositories of free books and eventually in my Calibre database.

The problem with Calibre’s metadata editing

One thing has annoyed me over the last couple of months: when you edit the metadata of an EPUB file in Calibre, the changes don’t end up in the EPUB file itself. That is, unless you convert from EPUB to EPUB, which is a time-consuming process.

So my corrected or added author and title information did end up in the metadata.opf files but most of the time not in the corresponding .epub.

As I had done parsing and updating of these (XML) files in the past, I had had an item on my to-do list for some time to look into updating the EPUB file from the metadata.opf.

So when I finally got annoyed enough to tackle this item, I first looked into updating Calibre, as there might have been file format changes in the meantime that could invalidate anything I was planning to do (been there, done that).

Installing Calibre 1.2

The new Calibre gets installed in a nice way: you copy a (long) command line that first downloads a small Python program and executes it. This program then determines which version you need (32-bit or 64-bit), downloads that and does the install.

But the old version (which I had de-installed using apt-get remove calibre) kept being selected when running calibre. I can only assume this was because the old calibre was still running in the background and came to the foreground even when I started the newly installed version.

Parallel to this I had been reading through the Changelog and once more came across ebook-polish. I remember that the first time I read that, I thought: “Calibre’s author (Kovid Goyal) doesn’t have a name that sounds Polish. Why a special Polish version of the software?”

ebook-polish

Closer reading explained that this is not about a Polish version of anything, but about polishing ebooks. One of the options to that program is:

-o OPF, --opf=OPF     Path to an OPF file. The metadata in the book is
                      updated from the OPF file.

Bingo, exactly what I needed. But what looked even better was that, after finally seeing version 1.2 displayed in the lower left corner, I noticed that you can add the “Polish books” option to the toolbar.

Polishing gone wrong

For extracting just the reviewed EPUB books (and not the broken/ugly ones, nor the PDFs and other formats) the scripts mentioned before use the custom metadata tag EPUB.

After installing the polishing option in the toolbar (it is not in the menu and I have not looked if you can add it there as well), I selected the ebooks by this tag and hit the polish button.

The scheduling takes a bit of time, but the conversion itself seemed to be fast. But 1300 books or so take a while, so I switched desktops and continued working. An hour later I found that Calibre had stopped working because a message queue had overrun...

Running things in parallel

The backgrounding mechanism of Calibre did not seem up to the task of doing this. So I got back to my original idea of doing this myself, now with the help of ebook-polish to do the hard work.

One of the relatively easy optimisations I had thought about was only doing the polishing when the metadata.opf is newer than the .epub file. Another was using GNU parallel to make better use of the 8-core processor in my desktop machine.
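A minimal sketch of such a timestamp check could look like this. It is not the code of ebook_parallel_polish itself: the function names is_outdated and outdated_epubs are made up for illustration, and it assumes the metadata.opf sits in the same directory as the .epub, as it does in a Calibre library.

```python
import os


def is_outdated(epub_path, opf_name="metadata.opf"):
    """Return True when the metadata file next to the EPUB is newer than the EPUB."""
    opf_path = os.path.join(os.path.dirname(epub_path), opf_name)
    if not os.path.exists(opf_path):
        # Without a metadata file there is nothing to update from.
        return False
    return os.path.getmtime(opf_path) > os.path.getmtime(epub_path)


def outdated_epubs(library_dir):
    """Walk the library and yield every .epub that is older than its metadata.opf."""
    for root, _dirs, files in os.walk(library_dir):
        for name in files:
            if name.lower().endswith(".epub"):
                path = os.path.join(root, name)
                if is_outdated(path):
                    yield path
```

Comparing modification times works here because polishing rewrites the .epub, which makes it newer than the metadata.opf again until the next metadata edit.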

You can do some smart replacing on the input lines to parallel, and I probably could have used that to create the filename to be passed to the -o option of ebook-polish using something like:

find EPUBs_to_be_updated | parallel ebook-polish -o {//}/metadata.opf {}

which would probably need proper quoting etc.

However, generating the list of EPUBs to be updated required some more intelligence in comparing file timestamps, and I was going to write a Python program for that anyway. I might as well use that script itself (with a command-line option to distinguish it from the ‘normal’ invocation) to do the appropriate call to ebook-polish.

The result is a relatively small program ebook_parallel_polish, that creates the list of outdated EPUB files, pipes that list into:

parallel ebook_parallel_polish -u {}

(actually the program is filled in from sys.argv[0]) and with the -u option the program expects one filename as parameter, determines the full path of the metadata file and hands both to ebook-polish.
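The two roles described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual ebook_parallel_polish: the function names are made up, the metadata.opf is assumed to live next to the .epub, and the ebook-polish call simply mirrors the form used in the find example earlier (how ebook-polish names its output when only an input file is given is not checked here).

```python
import os
import subprocess
import sys


def polish_command(epub_path):
    """Build the ebook-polish call for one EPUB, using the metadata.opf beside it."""
    epub_path = os.path.abspath(epub_path)
    opf_path = os.path.join(os.path.dirname(epub_path), "metadata.opf")
    # An argument list (instead of a shell string) sidesteps quoting problems
    # with spaces and brackets in book paths.
    return ["ebook-polish", "-o", opf_path, epub_path]


def polish_one(epub_path):
    """What the -u invocation would do: run ebook-polish for a single file."""
    return subprocess.call(polish_command(epub_path))


def polish_in_parallel(epub_paths):
    """Pipe the list of files, one per line, into GNU parallel.

    parallel then calls this same script with -u for every line it reads.
    Assumes the script is executable and that parallel is on the PATH.
    """
    cmd = ["parallel", sys.argv[0], "-u", "{}"]
    return subprocess.run(cmd, input="\n".join(epub_paths), text=True).returncode
```

Having the script call itself keeps all path handling in Python, so the parallel command line stays trivial.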

The program gets the Library base path from Calibre’s configuration file, and has an -f option to do all files (and not just the outdated ones).

The result of half an hour of trying, hacking and testing is that 1329 .epub files got converted in 3 minutes and 25 seconds, and that an update of 44 files, after correcting some more authors, took less than 7 seconds of real time.

I now run the update before copying any (new) files to the ebook reader, so they are now always as up to date as the information in Calibre itself.

Posted on 2013-09-10.