Digital History Project: Update
Another project update. Things have been slightly delayed because I have an article to rewrite (which means I’m slightly closer to getting published), but I’ve still been making some progress. This weekend I’ll be proofreading Sandall’s book. When that’s done I’ll be able to export the text and start tagging it with XML. But first I’ve been looking through the TEI guidelines, picking out the tags I think I’ll need, and working out how I’m going to use them. This is crucial because there are often different ways to mark up the same text and it’s important to be consistent. It’s also important to apply only tags which will actually be useful to users, because there’s an awful lot of potential to waste time marking up text in microscopic detail that no-one has any use for. As I do the proofreading I’ll also be looking at the structure of the text and the features in it that will need marking up, and revising the provisional tagging guidelines if necessary. Once I’m happy with the tag set and the guidelines for using it I’ll post them both (but be warned: they won’t be very interesting!). Even then I expect to run into some unforeseen situations once I start trying to insert the tags.

Comment by Ben Brumfield — 8:30 pm, 7 March 2007
What’s been your impression of the TEI stuff? I’ve been fairly impressed, but further investigation has shown me that the guidelines just don’t seem to apply to the collaborative manuscript transcription software I’m working on.
Comment by Gavin Robinson — 10:03 am, 8 March 2007
TEI is very much aimed at printed books, and the tags for manuscript primary sources seem to be mostly for manuscript drafts of literary works. I think it would need a lot of extending to be able to cope with the whole variety of historical manuscripts out there. Even then, the basic assumptions about the structure of a text might not be appropriate for every kind of manuscript. I can see stand-alone documents consisting of a single folio (eg letters, pay warrants, bills, receipts, petitions) being quite problematic. If each one is treated as a text in its own right it needs a TEI header with all the necessary information which leads to a lot of unnecessary duplication. You could put related documents into collections within a single file with one header. I’ll probably do that with my great-grandad’s letters as there aren’t very many of them, but it becomes a problem when you’re dealing with a large body of material. For example the class SP28 (aka Commonwealth Exchequer Papers) at the UK National Archives, which was the main source for my PhD, probably contains around 400,000 folios, at least 25% of them single folio documents (although many were bound up into volumes in the 20th century). I think putting the whole class into one TEI document would be impractical so it would have to be split up into volumes. That’s disappointing because one of the attractions of digitization is that it promises freedom from having to arbitrarily collect documents into volumes.
Comment by Ben Brumfield — 7:18 pm, 19 March 2007
What’s your opinion about automating the creation of the TEI header for small works that are related? That way you could keep each folio separate as its own work, but only specify the particular differences between one document and another.
In other words, while I understand the problems of duplicating the header entry information when you’re creating a document, are there really problems with the file-per-letter approach from the perspective of the reader manipulating the transcribed documents?
Comment by Gavin Robinson — 12:14 pm, 20 March 2007
It’s easy enough to automatically copy the info into the header. I’m just thinking that it would be a waste of space, especially with the kind of detail you need for manuscripts, eg explaining all the transcription and encoding decisions. With something like a pay warrant, the header would be much bigger than the text, and multiplied by thousands of them could amount to a lot of extra storage space. It might be more convenient to have an external header which could be shared by many texts, kind of like a CSS style sheet, but I don’t think you can do that with the present TEI, and it might create problems of its own.
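As a rough illustration of the automatic copying discussed here, the following is a minimal Python sketch, not a real TEI toolchain. It assumes a hypothetical dictionary of header information shared by a whole collection, merges in the differences for one document, and writes both into a simplified header. The element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc) come from the TEI guidelines, but the root element name, the sample values, and the helper functions are assumptions, and the output is not validated against any TEI schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical header information shared by every document in a collection.
SHARED = {
    "publication": "Unpublished transcription",
    "source": "Example collection of pay warrants",
}


def el(parent, tag, text=None):
    """Create a child element, optionally with text, and return it."""
    child = ET.SubElement(parent, tag)
    if text is not None:
        child.text = text
    return child


def build_document(shared, overrides, transcription):
    """Build one simplified TEI-style document.

    The header is the shared information with the per-document
    overrides applied on top, so only the differences need to be
    supplied for each document.
    """
    info = {**shared, **overrides}

    tei = ET.Element("TEI")  # assumed root element name
    header = el(tei, "teiHeader")
    file_desc = el(header, "fileDesc")

    title_stmt = el(file_desc, "titleStmt")
    el(title_stmt, "title", info["title"])

    publication_stmt = el(file_desc, "publicationStmt")
    el(publication_stmt, "p", info["publication"])

    source_desc = el(file_desc, "sourceDesc")
    el(source_desc, "p", info["source"])

    text = el(tei, "text")
    body = el(text, "body")
    el(body, "p", transcription)
    return tei


if __name__ == "__main__":
    doc = build_document(
        SHARED,
        {"title": "Pay warrant, folio 12"},  # the only per-document difference
        "Pay to the bearer ...",
    )
    print(ET.tostring(doc, encoding="unicode"))
```

Even in this stripped-down form the header is longer than the one-line text, which is the storage trade-off at issue: every generated file carries its own full copy of the shared information.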
Comment by Ben Brumfield — 1:58 pm, 22 March 2007
Perhaps it’s because of my background — I’m a computer programmer and not a researcher — but I really don’t think that optimizing for filesize is the right thing to do here. Storage is cheap, but the labor of disassembling bad design isn’t.
Would you mind if I sent you email with some of my thoughts and questions about TEI?
Comment by Gavin Robinson — 2:16 pm, 22 March 2007
No, go ahead.
All of this is speculation because I haven’t actually tried using TEI on manuscripts yet. I’m just trying to imagine what some of the drawbacks might be. It seems to work really well for books so far.
Comment by Ben Brumfield — 8:46 pm, 27 March 2007
Thanks — I’ve sent you a note via your webform.