Unexpected Progress
It’s been a long time since I wrote anything about my First World War digitization projects, but I now have some progress to report: today I published an interim version of Sandall’s History of 5th Lincolnshire Regiment. It’s still a work in progress, and there’s a lot more to be done, but you can see it here. It’s just a plain HTML version (and not strictly valid HTML), with the whole text on one page (at least that makes it easy to search the whole text with your browser’s Find feature!). There’s no name linkage yet, no page images online, and no mechanism for submitting corrections. However, even in this form it should be useful to people who are researching the battalion and can’t get hold of the original book. More details on what I’ve done and how I’ve done it are below.
That I haven’t posted about this project since February might suggest that it’s been slow and difficult, but actually I was just busy with other things (writing articles, applying for jobs, organising the Military History Carnival). Until then, things were going surprisingly smoothly and quickly. I was hoping that I’d be able to get everything finished before my Oxygen free trial finished, but because I was doing other things it’s now expired, and although a licence doesn’t cost much, I need the money for other things (especially work on peer reviewed things which might get published in a “proper” journal and look better on my CV than a self-published digital edition). Fortunately the last thing I did on the project in February was try an XSLT transform to make a test HTML version of the text. Initially this just showed me that I needed to play around with the transform parameters to get what I wanted, but I’ve now decided that it’s good enough to form the basis of an interim edition. (In theory I could use the free version of Saxon to do more transforms, but it’s a scary command line application!)
The HTML file that came out of the transform was over 900K but I cleaned it up using Find/Replace and regular expressions in jEdit, getting it down to 385K. That might still take a while to download if you’re on dial-up but it’s not too bad for a whole book. Every paragraph element had been given a name attribute, which isn’t necessary for this version as only chapters are linked to from the contents, so I stripped them all out. There were also some xmlns attributes which didn’t appear to be serving any purpose and must have added a huge amount to the file size, and a huge amount of superfluous white space.
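For anyone who would rather script this kind of clean-up than do it by hand, here is a minimal Python sketch of the same three steps: stripping name attributes from paragraphs, dropping xmlns attributes, and collapsing white space. It is not the set of jEdit searches actually used; the function name shrink_html and the exact patterns are assumptions, and it assumes double-quoted attributes and no <pre> blocks whose spacing matters.

```python
import re

def shrink_html(html: str) -> str:
    """Hypothetical helper: reduce file size with three regex passes."""
    # 1. Drop name="..." attributes, but only inside <p ...> start tags.
    html = re.sub(r'(<p\b[^>]*?)\s+name="[^"]*"', r'\1', html)
    # 2. Drop xmlns="..." and xmlns:prefix="..." declarations on any tag.
    html = re.sub(r'\s+xmlns(?::\w+)?="[^"]*"', '', html)
    # 3. Collapse runs of spaces and tabs to a single space.
    html = re.sub(r'[ \t]+', ' ', html)
    # 4. Remove blank lines and any indentation around line ends.
    html = re.sub(r' *\n\s*', '\n', html)
    return html.strip()
```

Running it over a copy of the file and comparing sizes before and after is a quick way to see which pass saves the most.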
As well as getting the file size down, I needed to make some other adjustments. Some of these were down to the transform settings that I’d used in Oxygen for the test run, but others showed up some possible limitations of TEI/XML. The master XML document preserves the original line breaks and hyphenation. I also left these in during the test transform, but in practice an HTML version without line breaks is more useful, so I used jEdit to take out all the <br> elements. Here I encountered a potential problem. Following the advice of the TEI guidelines I used the &shy; entity to represent soft hyphens. In practice this isn’t very helpful. If the HTML version is to keep the original line breaks then the soft hyphens need to be converted to hard hyphens so that they display properly, but if the original line breaks aren’t kept then the hyphens need to be removed along with the breaks. I’m not sure if XSLT can actually do this. My understanding is that it only deals with XML tags, but I could be wrong there. If it can’t do anything with entity references then there would be a need for some extra finding and replacing, but there might be anyway depending on how good I can get the output of the XSLT. Maybe it would be better to have an XML element which represents a soft hyphen, or to add an attribute to the <lb> elements to indicate that they’re preceded by a soft hyphen. If the line breaks and hyphens could be dealt with properly by XSLT then it would make sense not to have a space before any <lb> elements in the source XML.
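The two cases described here (keep the breaks and show the hyphens, or drop both) can be sketched as a pair of find-and-replace passes done outside XSLT. This is only an illustration in Python: it assumes the soft hyphen appears either as the &shy; entity reference or as the U+00AD character, that line breaks are written <lb/>, and the function names are made up for the example.

```python
import re

SOFT_HYPHEN = r'(?:&shy;|\u00ad)'   # entity reference or the literal character
LB = r'<lb\s*/>'                    # TEI line break, written as an empty tag

def keep_line_breaks(text: str) -> str:
    """Keep the original lineation: show hyphens, turn <lb/> into <br/>."""
    text = re.sub(SOFT_HYPHEN, '-', text)
    return re.sub(r'\s*' + LB + r'\s*', '<br/>\n', text)

def reflow(text: str) -> str:
    """Drop the original lineation so the browser can wrap the text."""
    # A soft hyphen before a break: remove both so the word halves rejoin.
    text = re.sub(SOFT_HYPHEN + r'\s*' + LB + r'\s*', '', text)
    # Any remaining break just becomes a single space.
    return re.sub(r'\s*' + LB + r'\s*', ' ', text)
```

Because both functions swallow the white space around each <lb/>, it does not matter whether the source XML has a space before the break. Words that end a line with a real (hard) hyphen are not treated specially and would need checking by hand.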
I found some erroneous paragraph breaks in the middle of sentences. These turned out to be mistakes in the master XML file. I’m not sure how they came about, but they’re easy to track down and fix with a regexp (just find any <p> element immediately preceded by a comma or a letter). The XSLT had used <em> tags to highlight words and phrases marked up with <term> or <distinct>, but I changed these to double quotes to match the text. In future (if I ever digitize another book) I’m intending to leave in the original quotes rather than marking up their meaning.
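As a rough illustration of that regexp check, here is a short Python sketch that reports the line numbers of suspect paragraph starts. The pattern and the name find_suspect_breaks are assumptions; it treats a <p> as suspect when the text before it (ignoring white space and an optional closing </p>) ends in a comma or an unaccented letter.

```python
import re

# A comma or letter, optional white space, an optional </p>, then a <p ...> tag.
SUSPECT = re.compile(r'[,A-Za-z]\s*(?:</p>\s*)?(?P<tag><p\b[^>]*>)')

def find_suspect_breaks(xml: str) -> list[int]:
    """Return 1-based line numbers of <p> tags that look mid-sentence."""
    return [xml.count('\n', 0, m.start('tag')) + 1
            for m in SUSPECT.finditer(xml)]
```

It only lists candidates. Each one still needs a look before the paragraphs are merged, since a heading or a list item could also end in a letter.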
It was quite ironic to discover that the XSLT had automatically inserted a copyright statement at the end of the text! (Copyright is more than just a law: it’s an insidious ideology.) I corrected it manually to make it clear that the work is in the public domain. Other manual adjustments included moving the contents to before the preface (more convenient for readers) and adding some CSS to limit the width of the text divs.
So now there’s a basic version of the text online in a form which should be of some use to at least some people. It’s certainly an improvement on taking requests on the Great War Forum and posting the excerpts that people ask for. There’s still a lot to be done but it’s good to be making some tangible progress again.

Comment by Ben Brumfield — 9:22 pm, 11 July 2007 [permanent link to this comment]
Congratulations! The resulting document looks great — I found myself engaged by Private Gross and the Stokes mortar.
I wonder if you might post the XSLT you used? Other people converting TEI to HTML might either be able to learn from it or suggest ways to incorporate some of your size reduction code into it.
Regarding your line-break issue, I’ve run into something similar. In my intermediate XML format, I use TEI’s lb tags to represent line breaks in the original document. At render time, I have the choice of converting those lb tags to either br tags or nothing, depending on user setting or view. (For per-page rendering I preserve line breaks, but for bloggy-style multi-page rendering I wrap lines.) So this intermediate code
<?xml version='1.0' encoding='ISO-8859-15'?><page>
<p>A <link link_id='188' target_id='26' target_title='clear'>clear</link> day &amp; <link link_id='189' target_id='19' target_title='cold'>colde</link>
<lb/><link link_id='190' target_id='16' target_title='Benjamin Franklin Brumfield, Sr.'>Ben</link> &amp; <link link_id='191' target_id='9' target_title='Sally Joseph Carr Brumfield'>Josie</link> and Hellen &amp; Virginia
<lb/>come to the House Half passed
<lb/>two oclock this morning</p></page>
Is rendered (link goes to dev site and may be problematic) as
A clear day & colde
Ben & Josie and Hellen & Virginia
come to the House Half passed
two oclock this morning
Comment by Ben Brumfield — 9:25 pm, 11 July 2007 [permanent link to this comment]
One other line-break suggestion — would you consider replacing the LB tags with a newline when you’re switching to word-wrap mode? The HTML should render the same, but anyone who does a “view source” on the document will find a much more readable text.
Comment by Ben Brumfield — 9:28 pm, 11 July 2007 [permanent link to this comment]
Another suggestion: could you add page break anchors to the document, then hotlink the page numbers in the index? I found myself wanting to click the numbers to find out where people were mentioned. Your eventual markup-based indexing may supplant the index within the document, but it seems like stripping out the page numbers loses information.
Comment by Gavin Robinson — 9:10 am, 12 July 2007 [permanent link to this comment]
I am intending to include page numbers and links from the original index but so far I’ve found it’s more difficult than I thought. I’ve kept pb elements in the XML source to mark the original pages, and giving them number attributes should be easy, but I haven’t worked out how to do it automatically. Oxygen doesn’t seem to have a tool for doing it. I thought I might be able to write a beanshell macro in jEdit but couldn’t get it to work as the replace function seems to screw up if you call it inside a loop. And that was before I realised that I haven’t used pb elements at the start of chapters. Now I have to decide whether to put pb elements at the start of chapters, or give the chapter div elements an id corresponding to the page number. I also need to include page numbers to make it easier to link to page images. I’m going to start learning Python soon anyway because I need it for other things, so that might offer a way of doing it. I also wonder if there’s any way of doing it with XSLT but I don’t know enough about that yet.
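A minimal sketch of how the numbering might be done in Python, assuming every page break is an attribute-less <pb/>, that pages run consecutively, and that the number goes in TEI’s n attribute. The function name and the first_page parameter are hypothetical, and it does nothing about chapter openings that have no pb element.

```python
import re
from itertools import count

def number_page_breaks(xml: str, first_page: int = 1) -> str:
    """Give each bare <pb/> an n attribute, counting up from first_page."""
    pages = count(first_page)
    # The replacement is a function, so each match takes the next number.
    return re.sub(r'<pb\s*/>', lambda m: f'<pb n="{next(pages)}"/>', xml)
```

For example, number_page_breaks(text, first_page=2) would label the first break as page 2, on the assumption that the text before it is page 1. Passing a function as the replacement avoids calling a replace routine inside a loop.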
I just used a default TEI XSLT stylesheet which came bundled with Oxygen. I haven’t looked at it, but from this experience I’ll probably need to change a few things in it (and even more when I get to linking names together and linking to page images). Oxygen also allows you to set some parameters before you do a transform so I need to do some more experiments with those.
Comment by Ross Mahoney — 11:35 am, 12 July 2007 [permanent link to this comment]
Looks good. You might want to contact John Bourne at the Centre for First World War Studies as he is running an eBooks project. You can see the works he is looking at bringing back at http://www.firstworldwar.bham.ac.uk/ebooks/index.htm
Comment by Gavin Robinson — 12:24 pm, 12 July 2007 [permanent link to this comment]
Thanks, I didn’t know about that but it looks really interesting. I’m hoping that when I’ve got this book finished I can work on ways of scaling things up, involving more people, and digitizing more books. I’m also wondering whether I could get a grant from anywhere to help with the costs.
Pingback by Brave new world « Mercurius Politicus — 6:50 pm, 18 July 2007 [permanent link to this comment]
[...] post was partly inspired by Investigations of a Dog, where some really interesting thoughts on the internet and history have been posted. Thank you [...]
Comment by mercuriuspoliticus — 7:06 pm, 18 July 2007 [permanent link to this comment]
This post and its predecessors were absolutely fascinating. It’s great to be able to read both about the theoretical and practical issues of transferring sources online in this way. As someone who is just finding his way into learning HTML and figuring out other possibilities the internet has to offer, it was really inspiring to learn what you’re up to. I’ll definitely be following the project with interest.
Comment by Gavin Robinson — 5:45 pm, 19 July 2007 [permanent link to this comment]
Thanks for your encouragement. There are so many exciting possibilities that one of the major problems is deciding where to start and finding time to do it.
Pingback by Investigations of a Dog » TEI Update — 11:29 am, 2 January 2008 [permanent link to this comment]
[...] first attempt at transforming the text to HTML using the TEI style sheets wasn’t very promising. I thought [...]