Unexpected Progress

[posted by Gavin Robinson, 5:35 pm, 11 July 2007]

It’s been a long time since I wrote anything about my First World War digitization projects, but I now have some progress to report: today I published an interim version of Sandall’s History of 5th Lincolnshire Regiment. It’s still a work in progress, and there’s a lot more to be done, but you can see it here. It’s just a plain HTML version (and not strictly valid HTML), and the whole text is on one page (at least that makes it easy to search the whole text with your browser’s Find feature!). There’s no name linkage yet, no page images online, and no mechanism for submitting corrections. However, even in this form it should be useful to people who are researching the battalion and can’t get hold of the original book. More details on what I’ve done and how I’ve done it below.

That I haven’t posted about this project since February might suggest that it’s been slow and difficult, but actually I was just busy with other things (writing articles, applying for jobs, organising the Military History Carnival). Until then, things were going surprisingly smoothly and quickly. I was hoping that I’d be able to get everything finished before my Oxygen free trial finished, but because I was doing other things it’s now expired, and although a licence doesn’t cost much, I need the money for other things (especially work on peer reviewed things which might get published in a “proper” journal and look better on my CV than a self-published digital edition). Fortunately the last thing I did on the project in February was try an XSLT transform to make a test HTML version of the text. Initially this just showed me that I needed to play around with the transform parameters to get what I wanted, but I’ve now decided that it’s good enough to form the basis of an interim edition. (In theory I could use the free version of Saxon to do more transforms, but it’s a scary command line application!)

The HTML file that came out of the transform was over 900K but I cleaned it up using Find/Replace and regular expressions in jEdit, getting it down to 385K. That might still take a while to download if you’re on dial-up but it’s not too bad for a whole book. Every paragraph element had been given a name attribute. These aren’t necessary for this version, as only chapters are linked to from the contents, so I stripped them all out. There were also some xmlns attributes which didn’t appear to be serving any purpose and must have added a lot to the file size, along with a huge amount of superfluous white space.
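For anyone who would rather script this kind of clean-up than do it by hand, here is a rough Python sketch of the three steps described above. It is not what I actually ran in jEdit: the function name `slim_html` and the exact patterns are made up for illustration, and it assumes attribute values are always double-quoted and that the page has no `<pre>` blocks where white space matters.

```python
import re

def slim_html(html: str) -> str:
    """Hypothetical clean-up pass; patterns are illustrative assumptions."""
    # Drop name="..." attributes from <p> tags only, so chapter anchors survive
    html = re.sub(r'(<p\b[^>]*?)\s+name="[^"]*"', r'\1', html)
    # Drop xmlns and xmlns:prefix declarations wherever they appear
    html = re.sub(r'\s+xmlns(?::\w+)?="[^"]*"', '', html)
    # Collapse runs of spaces and tabs into a single space
    html = re.sub(r'[ \t]+', ' ', html)
    # Collapse runs of blank lines into a single newline
    html = re.sub(r'\n\s*\n+', '\n', html)
    return html

sample = (
    '<p name="p1"   xmlns="http://www.w3.org/1999/xhtml">Some    text</p>'
    '\n\n\n<a name="ch1">Chapter</a>'
)
print(slim_html(sample))
```

Regular expressions are a blunt tool for HTML, so a pass like this is only reasonable on machine-generated markup whose shape is already known.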

As well as getting the file size down, I needed to make some other adjustments. Some of these were down to the transform settings that I’d used in Oxygen for the test run, but others showed up some possible limitations of TEI/XML. The master XML document preserves the original line breaks and hyphenation. I also left these in during the test transform but in practice an HTML version without line breaks is more useful so I used jEdit to take out all the <br> elements. Here I encountered a potential problem. Following the advice of the TEI guidelines I used the &shy; entity to represent soft hyphens. In practice this isn’t very helpful. If the HTML version is to keep the original line breaks then the soft hyphens need to be converted to hard hyphens so that they display properly, but if the original line breaks aren’t kept then the hyphens need to be removed along with the breaks. I’m not sure if XSLT can actually do this. My understanding is that it only deals with XML tags, but I could be wrong there. If it can’t do anything with entity references then there would be a need for some extra finding and replacing, but there might be anyway depending on how good I can get the output of the XSLT. Maybe it would be better to have an XML element which represents a soft hyphen, or add an attribute to the <lb> elements to indicate that they’re preceded by a soft hyphen. If the line breaks and hyphens could be dealt with properly by XSLT then it would make sense to not have a space before any <lb> elements in the source XML.
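The two cases for soft hyphens can be sketched as a small post-processing step in Python, run on the HTML after the transform. This is only an illustration under some assumptions: the function name `unwrap_lines` is invented, soft hyphens are assumed to appear either as the entity reference or as the literal U+00AD character, and line breaks are assumed to be `<br>` or `<br/>` tags.

```python
import re

SOFT_HYPHEN = "\u00ad"  # the character that the soft hyphen entity stands for

def unwrap_lines(html: str, keep_breaks: bool = False) -> str:
    """Hypothetical helper covering both cases described above."""
    # Treat the entity reference and the literal character alike
    html = html.replace("&shy;", SOFT_HYPHEN)
    if keep_breaks:
        # Lines stay as printed, so show each soft hyphen as a hard hyphen
        return html.replace(SOFT_HYPHEN, "-")
    # Soft hyphen followed by a break: rejoin the two halves of the word
    html = re.sub(SOFT_HYPHEN + r"\s*<br\s*/?>\s*", "", html)
    # Any remaining break just becomes a single space
    return re.sub(r"\s*<br\s*/?>\s*", " ", html)

sample = "the bat&shy; <br>talion moved <br/>north"
print(unwrap_lines(sample))                    # original breaks dropped
print(unwrap_lines(sample, keep_breaks=True))  # original breaks kept
```

The `\s*` before the break absorbs the stray space that the source XML has in front of each line break, so the rejoined word doesn't end up with a gap in the middle.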

I found some erroneous paragraph breaks in the middle of sentences. These turned out to be mistakes in the master XML file. I’m not sure how they came about, but they’re easy to track down and fix with a regexp (just find any <p> element immediately preceded by a comma or a letter). The XSLT had used <em> tags to highlight words and phrases marked up with <term> or <distinct> but I changed these to double quotes to match the text. In future (if I ever digitize another book) I’m intending to leave in the original quotes rather than marking up their meaning.
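A search like the one described for stray paragraph breaks might look something like this in Python. The pattern follows the description literally (a `<p>` tag straight after a letter or a comma); allowing white space in between is my own assumption, and the names `SUSPECT_BREAK` and `find_suspect_breaks` are invented for the example.

```python
import re

# A letter or comma, optional white space, then an opening <p> tag.
# The \b stops this matching other elements that start with "p", such as <pb/>.
SUSPECT_BREAK = re.compile(r"[A-Za-z,]\s*<p\b")

def find_suspect_breaks(xml: str) -> list[int]:
    """Return 1-based line numbers of <p> tags that follow a letter or comma."""
    hits = []
    for match in SUSPECT_BREAK.finditer(xml):
        tag_pos = match.end() - 2  # index of the "<" in "<p"
        hits.append(xml.count("\n", 0, tag_pos) + 1)
    return hits

sample = (
    "<p>First paragraph.</p>\n"
    "<p>The battalion moved,\n"
    "<p>and then halted.</p>\n"
)
print(find_suspect_breaks(sample))
```

This only reports where the suspect tags are. Each one still needs checking by eye before it is removed, since the pattern knows nothing about whether the sentence really does carry on.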

It was quite ironic to discover that the XSLT had automatically inserted a copyright statement at the end of the text! (Copyright is more than just a law: it’s an insidious ideology) I corrected it manually to make it clear that the work is in the public domain. Other manual adjustments included moving the contents to before the preface (more convenient for readers) and adding some CSS to limit the width of the text divs.

So now there’s a basic version of the text online in a form which should be of some use to at least some people. It’s certainly an improvement on taking requests on the Great War Forum and posting the excerpts that people ask for. There’s still a lot to be done but it’s good to be making some tangible progress again.
