11 February 2025

Reformatting paragraphs in Project Gutenberg ebooks

I would be the last person to criticise the vast repository of public domain literature at Project Gutenberg. However, some titles there are formatted in a way I don’t much like. Paragraphs may either be block formatted (as in this post) or have white space between each pair of indented paragraphs.

The fix is simple and needs only a modicum of computer skills.

There are two basic file formats for ebooks. The industry standard is epub; the Amazon Kindle uses mobi. (The latter is an older format, but readable by all Kindle devices.) An epub is just a zip file containing text and the instructions for displaying it. The innards of an epub are complicated; luckily we do not have to delve into them too deeply.

If you use a Kindle, there is an extra step: converting your newly tweaked epub to mobi (more on which later).

Sigil is a free-to-use application for editing epub files. The left-hand column shows all the files contained in the epub. The middle column shows the editable material (whether the text of the book or the instructions for its display). The right-hand column, when preview mode is selected, shows how the text will look on an ereader; it can also display the Table of Contents.


The files we are interested in are in the folders ‘Text’ and ‘Styles’. As one would expect, ‘Text’ contains the text of the book. It is formatted as HTML – a file extracted from this folder will display in any browser, but will use only that browser’s defaults for text size, heading style, etc.

The instructions for displaying the text in an ereader reside in the ‘Styles’ folder. The file or files there are in CSS (cascading style sheets) format. We need to edit part of a style sheet in order to modify the appearance of the text.

Having downloaded your epub from Project Gutenberg, load it into Sigil. Open the ‘Text’ folder. Click on one of the files there to make its content appear in the Preview window on the right. If the formatting of the paragraphs is not to your taste (in this example, there is extraneous space between them), open the ‘Styles’ folder and find the style sheet which defines the properties of paragraphs – Styles/OEBPS/0.css in this case.

In HTML, each paragraph is enclosed in <p> tags. The statement we are looking for is the style-sheet rule for p, which sets the paragraph’s indent and its top and bottom margins.
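
As a rough sketch, a rule matching the values discussed in this post (a 1 em indent with 0.25 em of space above and below) would look something like the following. The bare p selector and the layout are assumptions; the rule in your own style sheet may use a different selector or carry extra properties.

```css
/* Hypothetical reconstruction: selector and property order are assumed */
p {
    text-indent: 1em;       /* indent the first line of each paragraph */
    margin-top: 0.25em;     /* space above each paragraph */
    margin-bottom: 0.25em;  /* space below each paragraph */
}
```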

It causes the first line of each paragraph to be indented by 1 em. The ‘em’ is a printer’s measure adopted for CSS, where it equals the current font size; you may find ‘px’ (pixels) used instead. Here the top (‘margin-top’) and bottom (‘margin-bottom’) of the paragraph are each given 0.25 em of added space. Changing both values to 0em makes the extraneous white space between paragraphs disappear.
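
After the edit, a paragraph rule with the values described above might read as follows. This is only a sketch, again assuming a bare p selector; the two margin values change and the indent is left alone.

```css
/* Assumed selector; your style sheet may name the rule differently */
p {
    text-indent: 1em;    /* unchanged: first line still indented */
    margin-top: 0em;     /* was 0.25em */
    margin-bottom: 0em;  /* was 0.25em */
}
```

Sigil's Preview pane should show the change as soon as you return to one of the files in ‘Text’.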

The margin property is explained here; you may find ‘margin-top’, ‘margin-right’, ‘margin-bottom’ and ‘margin-left’ abbreviated to a single ‘margin’ declaration, as the article explains.
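
For illustration, the shorthand takes one to four values, applied clockwise from the top. The p selector and the values here are examples only, not taken from any particular Project Gutenberg file.

```css
/* Four values: top, right, bottom, left */
p { margin: 0.25em 0 0.25em 0; }

/* Two values: top-and-bottom, then left-and-right (same effect as the rule above) */
p { margin: 0.25em 0; }

/* One value: all four sides; this closes the gap between paragraphs */
p { margin: 0; }
```

If your style sheet uses the shorthand, it is the first value (or the first and third, in the four-value form) that needs changing to 0.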

Kindle users will need to convert the epub file to the mobi format. This is easily done with Kindle Previewer (Windows and Mac) or calibre (Windows, Mac, Linux).
