More progress with Sandall

[posted by Gavin Robinson, 3:26 pm, 5 January 2008]

My project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War has made very good progress this week. I’ve now uploaded a new HTML version. This features links to page images and a working index: if you click on a page number in the index it takes you to the corresponding part of the text. The whole book is still on one page as I haven’t worked out how to split it yet but it’s an improvement over the previous interim version. Below are more details of what I’ve done and how I’ve done it.

The first part of getting the index to work was adding <ref> tags to the page numbers. This was easy to do using a regular expression in jEdit.

Find: (?<=, )[0-9]+(?=,|<)

Replace with: <ref target="p$0">$0</ref>

This was based on the assumption that every page reference in the index would be preceded by a comma and a space, and followed by a comma or a closing tag (thankfully index terms were usually followed by commas, which made things much easier). In practice a few page numbers got missed because of exceptions to this rule. Where the whole index term was in double quotes, the closing quote came after the comma, so the page number was preceded by a quote and a space instead. That's easy to allow for in future. In one case the comma was missing in the original text, and in another it had been mis-scanned as a full stop. There were a few other inexplicable cases where the regex should have matched but apparently didn't. I'm not sure what was going on there but it didn't happen very often.
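The same find and replace can be tried outside jEdit. Here is a minimal sketch in Python, where \g<0> plays the role of jEdit's $0. The <item> element, the sample entries, the function name and the looser second pattern are all made up for illustration; they are not taken from the real index.

```python
import re

# Same pattern as the jEdit search: digits preceded by ", "
# and followed by "," or "<".
PAGE_REF = re.compile(r'(?<=, )[0-9]+(?=,|<)')

# Hypothetical looser variant that also accepts a closing double quote
# before the space, for index terms like "Term," 30.
PAGE_REF_QUOTED = re.compile(r'(?<=[,"] )[0-9]+(?=,|<)')

def tag_page_refs(line, pattern=PAGE_REF):
    """Wrap every matched page number in a <ref> pointing at p<number>."""
    return pattern.sub(r'<ref target="p\g<0>">\g<0></ref>', line)

plain = '<item>Armentieres, 12, 45</item>'
quoted = '<item>"Daily Mail," 30, 41</item>'

print(tag_page_refs(plain))                    # both numbers tagged
print(tag_page_refs(quoted))                   # 30 is missed, 41 is tagged
print(tag_page_refs(quoted, PAGE_REF_QUOTED))  # both numbers tagged
```

The second print shows the quoted-term exception described above: the strict pattern skips the first page number because it follows a quote rather than a comma.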

The regex generated <ref> tags with target attributes pointing to the relevant page numbers. Next I had to give the <pb> elements a corresponding id so that the refs would have something to point to. There was a bit of a dilemma about whether <pb> tags at the start of a chapter should go inside or outside the chapter <div1>. In the end I avoided it by adding the id attribute to the <div1> instead and getting rid of the <pb>. As well as ids for internal linking I added facs attributes to allow linking to page images. This was all done automatically with a Python script which parses an XML document, pulls out all the <pb> elements, loops through them adding consecutively numbered id and facs attributes to each one that doesn’t already have them (I did the front matter manually because it’s very short and has its own sequence of roman numerals rather than page numbers in the main sequence). Then the attributes of the first <pb> in each chapter are copied to the <div1> and the <pb> is deleted. Finally the XML is written back to a file.

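from xml.dom import minidom
import codecs

book = minidom.parse('sandall2.xml')

# Number every <pb> that has neither attribute yet
# (the front matter was done by hand, so it is skipped here).
i = 1
for pb in book.getElementsByTagName('pb'):
    if not (pb.hasAttribute('xml:id') or pb.hasAttribute('facs')):
        pb.setAttribute('xml:id', 'p' + str(i))
        pb.setAttribute('facs', str(i).zfill(3) + '.png')
        i += 1

# Copy the attributes of the first <pb> in each chapter to its <div1>,
# then delete that <pb>.
for chapter in book.getElementsByTagName('div1'):
    divbreaks = chapter.getElementsByTagName('pb')
    if not divbreaks:
        continue  # a chapter with no page break is left alone
    first = divbreaks[0]
    chapter.setAttribute('xml:id', first.getAttribute('xml:id'))
    chapter.setAttribute('facs', first.getAttribute('facs'))
    # The <pb> may not be a direct child of the <div1>,
    # so remove it through its own parent.
    first.parentNode.removeChild(first)

# codecs.open() so that non-ASCII characters are written out as UTF-8.
bookfile = codecs.open('sandall3.xml', 'w', 'utf-8')
book.writexml(bookfile)
bookfile.close()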

There were two things that caught me out here. First, in Python 2 a file opened with the built-in open() function can only take ASCII text when you write unicode strings to it; any other character raises a UnicodeEncodeError. If you want to write unicode XML to a file you need to use codecs.open() and specify utf-8 in the parameters. Second, I forgot that I'd omitted a chapter, so the pages in the back matter went out of sync, but that was easy to fix manually as there aren't very many pages there.
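For anyone trying this on Python 3, the encoding problem looks a bit different: the built-in open() accepts an encoding argument there, so codecs isn't needed. A minimal sketch, assuming Python 3; the function name, the one-line sample document and the temporary file are only for illustration.

```python
import os
import tempfile
from xml.dom import minidom

def write_utf8(doc, path):
    """Serialise a minidom document to path as UTF-8 (Python 3 only)."""
    with open(path, 'w', encoding='utf-8') as out:
        out.write(doc.toxml())

# Tiny stand-in document containing one non-ASCII character.
doc = minidom.parseString('<p>Armenti\u00e8res</p>')
path = os.path.join(tempfile.mkdtemp(), 'sample.xml')
write_utf8(doc, path)

# Read it back with the same encoding to confirm nothing was lost.
with open(path, encoding='utf-8') as check:
    content = check.read()
```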

Once I’d remembered that Major Teall’s epilogue was missing I added a <gap> element to represent it and indicate why it’s missing.
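A slip like the out-of-sync back matter could probably be caught with a quick check that every <ref> target has a matching xml:id somewhere in the document. This is only a sketch, not something from the original script: the function name and the sample markup (including the <text> and <index> wrappers) are invented, and it assumes targets are written as bare ids like p12, the way the regex above generates them.

```python
from xml.dom import minidom

def dangling_targets(xml_text):
    """Return the target of every <ref> that has no matching xml:id."""
    doc = minidom.parseString(xml_text)
    # Collect every xml:id in the document, whatever element carries it.
    ids = set()
    for node in doc.getElementsByTagName('*'):
        if node.hasAttribute('xml:id'):
            ids.add(node.getAttribute('xml:id'))
    # Keep the targets that point at nothing.
    return [ref.getAttribute('target')
            for ref in doc.getElementsByTagName('ref')
            if ref.getAttribute('target') not in ids]

sample = (
    '<text>'
    '<div1 xml:id="p1" facs="001.png"><p>One<pb xml:id="p2" facs="002.png"/>two</p></div1>'
    '<index><ref target="p2">2</ref>, <ref target="p9">9</ref></index>'
    '</text>'
)
missing = dangling_targets(sample)
print(missing)
```

An empty list means every index entry has somewhere to land. It would not notice ids that exist but point at the wrong page, so it complements a manual check rather than replacing it.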

Next I wrote some XSL to transform the XML document into HTML, and some CSS to style the HTML. It’s very easy to get HTML that works but getting completely valid HTML is going to be very difficult and might not be worth bothering with. One particular problem is that I’ve transformed TEI <pb> elements into HTML <a> elements which link to corresponding page images. These can occur anywhere in the text, including in the middle of a list. HTML doesn’t like <a> being inside <ul> but not inside <li>. Moving the <a> inside the next <li> would be inconvenient and not true to the structure of the original document. Putting the <a> inside its own <li> would be even worse and would screw up the layout. So I might just have to live with slightly invalid HTML.
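To get a feel for how many links end up in the wrong place, a rough check could list the page breaks that sit directly inside a list in the TEI source. This sketch assumes the lists are marked up as <list> with <item> children; the function name and sample markup are hypothetical.

```python
from xml.dom import minidom

def pagebreaks_directly_in_lists(xml_text, list_tag='list'):
    """Return the xml:id of each <pb> whose parent is a list element."""
    doc = minidom.parseString(xml_text)
    return [pb.getAttribute('xml:id')
            for pb in doc.getElementsByTagName('pb')
            if pb.parentNode.nodeName == list_tag]

sample = (
    '<div1>'
    '<list>'
    '<item>First<pb xml:id="p10"/> entry</item>'  # inside an item: fine
    '<pb xml:id="p11"/>'                          # between items: the problem case
    '<item>Second entry</item>'
    '</list>'
    '</div1>'
)
problem_ids = pagebreaks_directly_in_lists(sample)
print(problem_ids)
```

Each id in the result corresponds to one <a> that would land directly inside a <ul> after the transformation.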

I also needed to clean up the HTML output file afterwards. Removing superfluous white space reduced the file size by about 25%! As this is from my own XSLT I can’t blame anyone else this time. I think part of the problem is the way that oXygen prettifies XML by indenting with spaces and inserting line breaks instead of wrapping lines.
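The kind of clean-up involved can be sketched in a few lines of Python. This is a guess at a reasonable approach rather than a record of what was actually done, and it assumes the HTML contains no <pre> blocks or other places where white space matters. The function name and the sample are made up.

```python
import re

def squeeze_whitespace(html):
    """Collapse indentation and blank lines left behind by pretty-printing."""
    html = re.sub(r'[ \t]+', ' ', html)   # runs of spaces or tabs become one space
    html = re.sub(r' ?\n ?', '\n', html)  # drop spaces either side of a line break
    html = re.sub(r'\n{2,}', '\n', html)  # drop blank lines
    return html

sample = '<ul>\n        <li>A   B</li>\n\n        <li>C</li>\n</ul>'
cleaned = squeeze_whitespace(sample)
print(len(sample), len(cleaned))
```

Line breaks are kept rather than removed, because deleting all white space between inline elements can change how the text renders.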

The next step is to work out how to get the XSLT to split the output into separate files for each chapter, and how to keep the internal links working. Then I can start marking up people, places, dates, and abbreviations. As well as the HTML version, I’ve uploaded the XML file, XSLT style sheet, CSS style sheet, and XML schema in case anyone is interested. I’ve now deleted some unused elements from the schema.
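One possible way to keep the internal links working after splitting is to build a lookup from each page id to the file that will contain it, and use that when writing the index links. The sketch below is in Python rather than XSLT and is only an idea, not a worked-out solution: the chapterNN.html naming scheme, the function names and the sample markup are all assumptions.

```python
from xml.dom import minidom

def page_to_file_map(xml_text):
    """Map every page id to the (hypothetical) chapter file that holds it."""
    doc = minidom.parseString(xml_text)
    mapping = {}
    for n, div in enumerate(doc.getElementsByTagName('div1'), start=1):
        filename = 'chapter%02d.html' % n
        # The chapter's first page id lives on the <div1> itself.
        if div.hasAttribute('xml:id'):
            mapping[div.getAttribute('xml:id')] = filename
        for pb in div.getElementsByTagName('pb'):
            if pb.hasAttribute('xml:id'):
                mapping[pb.getAttribute('xml:id')] = filename
    return mapping

def href_for(target, mapping):
    """Build a cross-file link such as chapter02.html#p4."""
    return '%s#%s' % (mapping[target], target)

sample = (
    '<body>'
    '<div1 xml:id="p1"><p>a<pb xml:id="p2"/>b</p></div1>'
    '<div1 xml:id="p3"><p>c<pb xml:id="p4"/>d</p></div1>'
    '</body>'
)
mapping = page_to_file_map(sample)
print(href_for('p4', mapping))
```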
