TEI Update
Nearly a year ago I started a project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War, and in the summer I published an interim version. I would’ve finished it a long time ago if I’d had nothing else to do, but original work in peer-reviewed printed journals has to come first because it’ll look better on my CV. Now I’ve got time to do some more work on it, and having had a break from it I can reassess what I’m trying to do. Below is an update on what’s new.
First of all, I discovered from searching the Times Online archive that Major Teall died in 1939. UK copyright lasts for 70 years after the author’s death, so I’ll be able to publish his epilogue to the book when it comes into the public domain in 2010. I’m still not sure about the photos. I’ve fought my way through the 1988 Copyright Act, which seems to say that copyright in any photos published before 1988 or taken before 1957 lasts until it would have expired under the 1956 Copyright Act: 50 years after publication for published works, or 50 years after the death of the author for unpublished works. If I’ve understood correctly, that would put the photos from this book (published in 1922) in the public domain. I’m still going to leave them out for now because they’re an extra complication, and because they didn’t scan very well.
Since I last did some work on this project, the new version of the TEI guidelines (P5) has been released. Not everyone will need to upgrade from P4 to P5, but I want to use some of the new features. There are now elements and attributes designed for linking to page images, which will be very useful. The upgrade was quite easy as most existing elements haven’t changed much. P5 is based on a schema rather than a DTD. oXygen 9.1 comes with TEI P5 schemas, style sheets, and templates pre-loaded (I’ve just bought a licence for it: only $48 for the academic version, and well worth it). Because I need to use the modules for names and dates, and for digital facsimiles, I need to generate my own schema, but this is easy using Roma, a user-friendly online tool.
As well as making the move to P5, I decided to make some other changes to the XML before moving on to the next stage. I removed the original line breaks, which were marked up with <lb> elements, and soft hyphens, which were represented with entity references. This means I’ve lost something from the original text which won’t be easy to get back, but I don’t really have any use for these line breaks and they were causing more trouble than they were worth. Rather than define a new element to represent soft hyphens just in case someone somewhere might want to see the text with original line breaks, I decided to get rid of the problem. Anyone who does want to see the original layout will be able to look at the page images anyway. Removing these elements was easy using Find and Replace with regular expressions in jEdit.
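For anyone who would rather script this kind of clean-up than use an editor, here is a minimal Python sketch of the same idea. It is not the jEdit search I actually ran. It assumes that the soft hyphens are written as the entity reference &shy; immediately before a line break, and that line breaks are empty <lb/> elements; both are assumptions, so adjust the patterns to whatever your own files contain.

```python
import re

# Assumption: a soft hyphen is the entity reference &shy; sitting directly
# before a line break, and every line break is an empty <lb/> element.
SOFT_HYPHEN_BREAK = re.compile(r"&shy;\s*<lb\s*/>\s*")
LINE_BREAK = re.compile(r"\s*<lb\s*/>\s*")


def strip_line_breaks(xml_text: str) -> str:
    """Remove soft hyphens and <lb/> elements from a string of TEI markup."""
    # Rejoin words that were hyphenated across a line break.
    text = SOFT_HYPHEN_BREAK.sub("", xml_text)
    # Replace each remaining line break (and the whitespace around it)
    # with a single space so neighbouring words stay separate.
    return LINE_BREAK.sub(" ", text)


sample = "<p>The Bat&shy;<lb/>talion moved<lb/>\nto the front.</p>"
print(strip_line_breaks(sample))
```

Running the soft-hyphen pattern first matters: if the plain line-break pattern ran first, the two halves of a hyphenated word would end up separated by a space. Any &shy; that is not followed by a line break is left alone and would need checking by hand.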
I also decided to get rid of the various elements which were used to represent double quotes. These were <term>, <distinct>, <foreign>, and <soCalled>. I already knew that I wasn’t going to be using these elements for anything. There are so many possibilities in TEI that it’s easy to get carried away and start marking up features just because you can. Now I’ve decided to only use markup which is required by the TEI guidelines, or which has a specific purpose for the resource I’m trying to create.
My first attempt at transforming the text to HTML using the TEI style sheets wasn’t very promising. I thought then that I just needed to play around with the parameters to make things better, but now, having played around with the parameters quite a lot, I don’t think it’s going to be much use. I was always going to have to write some custom XSL to deal with people, places, and dates, but it looks like the TEI XSL won’t even give me a basic HTML version split into separate files for each chapter. Although it supposedly does this, it looks like it’s based on the assumption that internal links will use query strings, whereas I just want plain old relative links to plain old HTML files at this stage. It also puts a lot of junk into the HTML whatever parameters you set. For example, if you want an id for each paragraph it doesn’t add an id attribute to the <p> tag, it inserts an empty anchor at the start of every paragraph! And every <a> tag has an XML namespace declaration, even if you’ve asked for HTML 4! So I’ll have to get better at XSL and write my own style sheet from the ground up.
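To make it clearer what kind of output I’m after, here is a rough sketch in Python rather than XSL: one HTML file per chapter, an id attribute directly on each <p> tag, and plain relative links between files. It only illustrates the target, it is not the style sheet itself. The div type="chapter" markup, the chapterN.html file names, and the id pattern are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def chapters_to_html(root):
    """Return {filename: html} with one entry per <div type="chapter">."""
    pages = {}
    # Assumption: each chapter is a <div> with type="chapter".
    chapters = [d for d in root.iter(f"{TEI}div") if d.get("type") == "chapter"]
    for number, div in enumerate(chapters, start=1):
        body = ET.Element("body")
        for count, p in enumerate(div.iter(f"{TEI}p"), start=1):
            # The id goes on the <p> itself, not on an empty anchor.
            html_p = ET.SubElement(body, "p", id=f"c{number}p{count}")
            # Flatten any inline markup to plain text for this sketch.
            html_p.text = "".join(p.itertext())
        if number < len(chapters):
            # A plain relative link to the next chapter's file.
            link = ET.SubElement(body, "a", href=f"chapter{number + 1}.html")
            link.text = "Next chapter"
        pages[f"chapter{number}.html"] = ET.tostring(
            body, encoding="unicode", method="html"
        )
    return pages


sample = (
    '<text xmlns="http://www.tei-c.org/ns/1.0"><body>'
    '<div type="chapter"><p>First.</p></div>'
    '<div type="chapter"><p>Second.</p></div>'
    "</body></text>"
)
pages = chapters_to_html(ET.fromstring(sample))
for name, html in pages.items():
    print(name, html)
```

The sketch throws away all inline markup and produces only a <body> fragment, so a real version would still need a proper HTML wrapper and handling for names, places, and dates.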
The next step is to generate id and facs attributes based on page numbers for every <pb> element (in P5 the id attribute is xml:id), and to put <ref> tags in the index pointing to these page numbers. I also need to decide whether to keep the original table of contents or generate one automatically using XSLT. After that I can either write some XSLT and publish another interim version, or move straight on to marking up people, places, and dates.
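Generating the page-break attributes could be sketched along these lines in Python. This is only one possible approach under stated assumptions: that every <pb> already carries its page number in an n attribute, that the ids should look like "page12", and that the page images live at paths like images/page12.jpg. The id pattern and the image path are hypothetical and would need to match the real files.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree spells xml:id using the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_page_attributes(root, image_pattern="images/page{n}.jpg"):
    """Give every <pb> with an n attribute an xml:id and a facs attribute."""
    for pb in root.iter(f"{{{TEI_NS}}}pb"):
        n = pb.get("n")
        if n is None:
            continue  # no page number to work from, so leave it untouched
        # An xml:id may not start with a digit, hence the "page" prefix.
        pb.set(XML_ID, f"page{n}")
        # Hypothetical image location; change the pattern to suit.
        pb.set("facs", image_pattern.format(n=n))
    return root


sample = (
    f'<text xmlns="{TEI_NS}"><body><p>One<pb n="12"/>two</p></body></text>'
)
root = add_page_attributes(ET.fromstring(sample))
pb = root.find(f".//{{{TEI_NS}}}pb")
print(pb.get(XML_ID), pb.get("facs"))
```

With ids in this form, a <ref> in the index could point at a page with a target such as "#page12".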
