Sandall: The End of the Beginning

[posted by Gavin Robinson, 4:14 pm, 1 February 2008]

Having made good progress with my project to digitize Sandall’s History of 5th Lincolnshire Regiment in the last month, I’m going to leave it for a while. This month I haven’t read any books or articles, haven’t written anything other than blog posts and computer code, and have only occasionally thought about historiography and theory. I kind of like it like that but I have other things to get on with now.

I’ve made some small changes since the last post. Dates now have tool tips, so if you hover over them you can see the full date. The place name index is a bit more user-friendly. I’ve replaced the hash values with query strings in the incoming links so that the Exhibit page filters the list down to the place passed in the query instead of displaying a box with the details. This means that you just have to click on “Map” to go straight to map view with only that place displayed. Once you’re there you can easily take the filter off again to see all the other places. The map view is also zoomed out further by default so that you can see Britain and Egypt. That means that you have to zoom in a long way to get to France and Flanders but I think it’s less confusing than not being able to see Grimsby or Alexandria unless you zoom out.

So the site is now in a satisfactory condition with lots of cool features, and now that I’ve worked out how to do everything I could probably get another book to the same stage within a few weeks. But there are still lots of features that could, and probably should, be added. See below for more details.

Text Snippets

Ben Brumfield pointed out that the lists of references to a person or place in the index would be much better if they included a snippet of surrounding text from the target page (like Google does) to provide some context. I definitely agree with this, but I haven’t worked out how to do it yet. Ideally I’d like to do it in the XSLT so that I don’t have to introduce an extra step in the process, but I haven’t yet looked into whether that’s possible, or whether it’s unnecessarily difficult compared to running a Python script over the HTML output files.
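A rough sketch of the Python route, for comparison with doing it in XSLT. It assumes each reference in a page is an <a> element with a unique id attribute, which may not match the real output files; the names AnchorTextParser, snippet and the id "ref1" are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the page text and note where a given anchor id starts."""

    def __init__(self, anchor_id):
        super().__init__(convert_charrefs=True)
        self.anchor_id = anchor_id
        self.chunks = []        # pieces of plain text, in document order
        self.length = 0         # number of text characters seen so far
        self.anchor_pos = None  # text offset of the target anchor, if found

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("id") == self.anchor_id:
            self.anchor_pos = self.length

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def snippet(html, anchor_id, width=40):
    """Return up to `width` characters of text either side of the anchor."""
    parser = AnchorTextParser(anchor_id)
    parser.feed(html)
    parser.close()
    if parser.anchor_pos is None:
        return None
    text = "".join(parser.chunks)
    start = max(0, parser.anchor_pos - width)
    end = min(len(text), parser.anchor_pos + width)
    # Collapse runs of whitespace so the snippet fits on one line.
    return " ".join(text[start:end].split())


# Made-up sample markup, not taken from the real site.
page = '<p>The battalion marched to <a id="ref1">Grimsby</a> on the next day.</p>'
print(snippet(page, "ref1", width=12))
```

This cuts at a fixed character count, so a snippet can start or end in the middle of a word. Trimming to word boundaries would be an obvious refinement.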

Feedback Mechanism

This is one of the most important features, and something that I’ve wanted for a long time. The experience of digitizing a text has just reinforced my view that it’s futile to try to eliminate all scanning errors before publication. The perfect is the enemy of the good. In this case the output from FineReader was good enough. I spotted a few more errors during markup, but not many. I really think the way to go is to use the OCR’d text with whatever automated checks you have time for and not worry about whether the text is perfect. The chances are that end users will spot the scannos in the course of reading. If you give them an easy-to-use feedback mechanism that lets them report errors with the minimum of trouble, it’s good for everyone.

The feedback mechanism that I originally had in mind was like a cross between Distributed Proofreaders and a wiki. The correction page would present users with a page image and a text box containing the page text, which they could edit and submit. Once approved by a moderator the changes would go live. Now I’m not sure that this is exactly what I need. Although the text almost certainly contains scannos that I’ve missed, there might not be enough to justify developing such a system. The major technical challenge would be feeding the changes back into the TEI XML file so that the website and the original file stay in sync. If I could accept branching between the XML and HTML versions it wouldn’t be too difficult to adapt a wiki to do something like this. However, this kind of feedback is only good for simple errors like scannos. I don’t think it could help people to point out possible errors in record linkage and name identification, where it’s easier to tell than to show what’s wrong (unless the users know TEI quite well).

I was also thinking about using CommentPress as a feedback mechanism. I was really excited about this when I first heard about it but it’s given me so much trouble that I’ve lost all confidence in it. Is it a good idea badly implemented? Or is the idea fundamentally flawed? Maybe it’ll get better in time, but right now it’s not for me. I have some ideas for a WordPress hack of my own, but I’m not sure if I’ll ever have time to try it.

Wiki

Whatever I use for the feedback mechanism for the text itself, I want a wiki for biography pages. This would allow for a lot more information than is currently displayed by Exhibit on the index page. On reflection it would have been useful to have a wiki earlier in the process, especially when I was researching and disambiguating people, as I could have put biographical details straight into it as soon as I found them.

Valid XHTML

The main obstacle to this is the problem I found with <a> elements inside <ul> elements but not in <li> elements. This is a case where the HTML specification conflicts with the semantics of the documents I’m trying to represent, and something’s got to give. If I decide that validity is more important then I should be able to adjust the XSLT code to move the <a> inside the preceding <li>. I’ve already had to do a similar thing where a page break occurs inside a name because one <a> inside another doesn’t work.
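To make the idea concrete, here is a minimal sketch of the rearrangement done as a Python post-processing step rather than in the XSLT itself. It assumes the output is well-formed XML with no namespace prefix on the element names (real XHTML would need the namespace handled), and move_stray_anchors is a made-up name.

```python
import xml.etree.ElementTree as ET


def move_stray_anchors(root):
    """Move any <a> that is a direct child of a <ul> into the preceding <li>."""
    for ul in list(root.iter("ul")):
        previous_li = None
        for child in list(ul):
            if child.tag == "li":
                previous_li = child
            elif child.tag == "a" and previous_li is not None:
                # The anchor keeps its own tail text when it is moved.
                ul.remove(child)
                previous_li.append(child)
    return root


# Made-up sample: an anchor sitting between two list items.
doc = ET.fromstring(
    '<ul><li>Grimsby</li><a id="page2"/><li>Alexandria</li></ul>'
)
move_stray_anchors(doc)
print(ET.tostring(doc, encoding="unicode"))
```

An <a> that comes before the first <li> is left where it is, so that case would still need a decision of its own.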

Icons

It would look nicer, and take up less space, if the “view page image” links were icons instead of text.

Dates

All the dates in the text are already marked up in the XML but I haven’t done very much with them yet. As I’m already using Exhibit it might be possible to make a timeline.
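As a first step towards a timeline, the marked-up dates could be pulled out of the XML into the kind of JSON item list that Exhibit loads. This is only a sketch under several assumptions: that dates are <date> elements carrying an ISO value in a when attribute (the TEI P5 convention, which may not be what the file actually uses), and that "start" is the property name the timeline would be pointed at. The function name and the sample text are made up.

```python
import json
import xml.etree.ElementTree as ET


def timeline_items(tei_text):
    """Build an Exhibit-style {"items": [...]} structure from <date when="...">."""
    root = ET.fromstring(tei_text)
    items = []
    for el in root.iter():
        # Ignore any namespace so this works with or without the TEI namespace.
        if el.tag.rsplit("}", 1)[-1] != "date":
            continue
        when = el.get("when")  # assumed attribute name
        if when is None:
            continue
        label = " ".join("".join(el.itertext()).split())
        items.append({
            "id": "date-%d" % (len(items) + 1),  # hypothetical id scheme
            "type": "Date",
            "label": label,
            "start": when,
        })
    return {"items": items}


# Made-up sample, not a quotation from Sandall.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><p>On '
    '<date when="1916-01-01">1st January</date> nothing happened.</p></TEI>'
)
print(json.dumps(timeline_items(sample), indent=2))
```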

Metadata

Although I’ve made it easy for humans to find people and places mentioned in the text, you’d have to be quite clever with scrapers and regular expressions to get them automatically (or you could just download the XML source). I’d like to follow best practice for providing metadata but I’m not sure what that is yet (is anyone?). A good start might be to embed some microformats in the HTML pages. And as it’s a book I should really put in some COinS data so that Zotero can grab it.
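For the Zotero part, the embedded record is an empty <span> with class Z3988 whose title attribute holds an OpenURL context object. A minimal sketch of generating one in Python follows; the function name coins_span is made up, and only the title and author are filled in here because the other bibliographic details would need checking against the book.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, author, date=None):
    """Return an empty span carrying book metadata in its title attribute."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.genre", "book"),
        ("rft.btitle", title),
        ("rft.au", author),
    ]
    if date is not None:
        fields.append(("rft.date", date))
    # urlencode percent-encodes the values and joins the pairs with "&";
    # escape() then turns each "&" into "&amp;" for use inside an attribute.
    query = urlencode(fields)
    return '<span class="Z3988" title="%s"></span>' % escape(query, quote=True)


print(coins_span("History of 5th Lincolnshire Regiment", "Sandall"))
```

The span would go somewhere in the body of each HTML page, or just on the front page if one record for the whole book is enough.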

Less Mark-up

I’ve already taken out some of the XML tags that I originally put in but which weren’t really any use. I might take that even further. Considering the number of blatant mistakes I found when marking up and linking people, places, and dates, it seems increasingly futile to try to correct Sandall even for internal consistency. Therefore <sic> and <corr> tags are likely to disappear in future as my editorial policy changes to preserving the original text regardless of how wrong it is. Correct identifications are supplied in name attributes anyway, so there’s no need to correct the content of these tags.

And More Mark-up

So far I’ve marked up dates, people, and places. Right now places are limited to named settlements that can be seen on Google Maps, but this could all change in future. There are many other things mentioned, such as hills, woods, farms, and fortifications, which could potentially be geotagged and displayed on the map. This would need a lot of extra research as many of these things can’t be found using Google alone. I could also mark up organizations such as regiments and battalions, and provide an index similar to the person and place indexes.

Better Name Identification

Talking of extra research, there is still work to be done on identifying and disambiguating people and places. Some of the personal names in the index have question marks after them because they might be the same as a similarly named person in another entry, or they might not. There are also several people and places which don’t appear in the index yet as I haven’t been able to identify them at all. With people, these are referring strings rather than actual names (e.g. “a wounded man”, “his platoon officer”). If I can identify them at all it will only be through checking the battalion war diary. The places are all named, but don’t appear on Google Maps. In many cases I can work out roughly where they should be from the context, but not with enough certainty to display them on the map. Finally there are some very obscure people who are named and can be disambiguated from anyone else in the text but who I can’t positively identify in any other source. Most of them are junior officers who are only mentioned briefly (and often only by surname) as joining the battalion on a certain day but aren’t mentioned again because they never did anything noteworthy like winning a medal or getting killed. I would at least like to know their forenames. Getting to the bottom of this is going to take a lot of digging in officers’ correspondence files at Kew (if they ever finish the building work…).

More Precise Schema

Now that I’m further on with markup I have a better idea of which tags I will and won’t need in future. I should remove all unused tags from the schema as this would make it smaller. It would also make mark-up more efficient as oXygen displays a list of every tag that is allowed at a certain point, which you can double click on to insert that tag around the current selection. This is obviously easier if the list isn’t full of superfluous tags that you have no intention of using.
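One way to find out which tags are actually in use is to count the element names in the XML file and compare them with the names the schema allows. The sketch below does that in Python; the function names are made up, and the list of schema names is assumed to be supplied by hand rather than read from the schema file.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_usage(xml_text):
    """Count how many times each element name occurs, ignoring namespaces."""
    counts = Counter()
    for el in ET.fromstring(xml_text).iter():
        counts[el.tag.rsplit("}", 1)[-1]] += 1
    return counts


def unused_elements(schema_names, counts):
    """Return the schema element names that never occur in the document."""
    return sorted(set(schema_names) - set(counts))


# Made-up sample document and schema list.
sample_doc = "<text><p>At <name>Grimsby</name> on <date>1st January</date>.</p></text>"
usage = element_usage(sample_doc)
print(usage.most_common())
print(unused_elements(["text", "p", "name", "date", "sic", "corr"], usage))
```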

And then…

I also need to think about how to digitize my great-grandad’s letters, as this was always going to be part of this project. It won’t take very long as there isn’t a large amount of material, and I can re-use a lot of the techniques and code that I’ve developed for Sandall’s book. The main differences are manual typing instead of OCR, and deciding how to represent the structure of the postcards.

Then what? I think I should write a detailed step-by-step guide to digitization using these techniques to help anyone else who wants to follow in my footsteps. It would also be good to digitize another book in the same way and see how quickly it can be done. I’ve probably spent more time working out how to do things than I have actually doing them. The history of 1/5th Leicestershire Regiment, which was in the same brigade as 1/5th Lincs, is available on Project Gutenberg and would make a nice companion to Sandall. The techniques that I’ve used would also transfer to a battalion war diary quite easily. The main differences are that it’s a manuscript, so OCR isn’t an option, and that it’s very long. Therefore it might be useful to introduce an element of collaboration to the text capture.

But now I need to get back to the English Civil War for a while. Even there I should be able to make good use of some of the things I’ve learnt doing this project.
