Digital Express

[posted by Gavin Robinson, 8:00 pm, 8 February 2008]

Having decided to leave my 5th Lincolnshire First World War project for a while, I got an offer I couldn’t refuse: someone from the Great War Forum sent me a transcript of the battalion’s medal citations from the regimental archive so that I could publish them on my site and link them in to the index of people that I’d created for the book. The document contains information that can’t be found elsewhere, as although awards of the Military Medal were listed in the London Gazette, full citations were not normally published. There are also three awards not mentioned in Sandall’s list, and citations for 10 people who were recommended for awards but turned down.

I received the list as a Word file with no semantic markup on Wednesday morning, started working on it on Thursday morning, and published it on the web this afternoon. It looks very basic but it’s not bad for two days, and it’s all linked in to the index of people for Sandall’s book. First of all I copied the text into jEdit and used Find and Replace to insert some basic TEI XML markup. Then I pasted it into a new TEI document in oXygen. With the automatic validation it was easy to track down and correct errors in the markup, so by lunch time I had a completely valid TEI file. In the afternoon I spent about 3 or 4 hours on linking records by inserting key attributes into <persName> tags. In most cases I already had the keys that I used for linking names in Sandall, but sometimes I had to change them in the light of new evidence from the citations, such as full names of people I had previously known only by their initials. This also allowed me to clear up some ambiguities. This morning I finished the linkage by creating new keys for the 13 people not mentioned by Sandall, then got started on writing some XSLT. That was easy as I could copy or adapt a lot of the code from the style sheet for Sandall. As well as generating the HTML version of the citations, this XSLT generates an extra JSON file which is imported into the Sandall index of people so that the index can link to the citations. Again this only required some minor adjustments to the Exhibit page. After some testing and corrections I had a live site up this afternoon.
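To give a rough idea of the Find and Replace step, here is a minimal sketch in Python rather than jEdit. It wraps each blank-line-separated paragraph in <p> and each name that follows a rank abbreviation in <persName>. The list of ranks and the name pattern are assumptions made for illustration; the real transcript may follow different conventions, and the output would still need checking against a validator.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical rank abbreviations; the transcript's real conventions are not known.
RANKS = r"(?:Pte|Cpl|Sgt|Lieut|Capt)\."
# A rank, one or more initials, then a capitalised surname, e.g. "Pte. J. Smith".
NAME = re.compile(rf"({RANKS} (?:[A-Z]\. )+[A-Z][a-z]+)")


def mark_up(text: str) -> str:
    """Turn plain text into a string of <p> elements with <persName> tags."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    marked = []
    for paragraph in paragraphs:
        # Escape &, < and > first so the result stays well-formed XML.
        paragraph = escape(paragraph)
        paragraph = NAME.sub(r"<persName>\1</persName>", paragraph)
        marked.append(f"<p>{paragraph}</p>")
    return "\n".join(marked)
```

A pattern like this will miss names that don't fit the assumed shape, which is one reason a validating editor and a manual pass are still needed afterwards.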

This demonstrates the potential value of the techniques I’ve been using for marking up texts, but it also raises some problems for digital history. I decided to trust a transcript from a random person off the internet. I have no way of knowing how accurate the transcript is, or even if the source document really exists! It could be Hugh Trevor-Roper and the “Hitler Diaries” all over again. Therefore I’m going to think more carefully before putting myself in this situation again. There’s also a possibility that I’ve miscalculated the copyright situation. Based on internal evidence and comparison with other documents my best guess is that the list was created by the army and is therefore under Crown Copyright (and being unpublished and available for inspection in a public record repository should come under waiver of Crown Copyright), but without seeing the original it’s hard to be sure. I might be wrong, and even if I’m right the holders of the manuscript might not agree. So technology makes some things easier, but there are other problems that it can’t solve.

Sandall: The End of the Beginning

[posted by Gavin Robinson, 4:14 pm, 1 February 2008]

Having made good progress with my project to digitize Sandall’s History of 5th Lincolnshire Regiment in the last month I’m going to leave it for a while. This month I haven’t read any books or articles, haven’t written anything other than blog posts and computer code, and have only occasionally thought about historiography and theory. I kind of like it like that but I have other things to get on with now.

I’ve made some small changes since the last post. Dates now have tool tips, so if you hover over them you can see the full date. The place name index is a bit more user-friendly. I’ve replaced the hash values with query strings in the incoming links so that the Exhibit page filters the list down to the place passed in the query instead of displaying a box with the details. This means that you just have to click on “Map” to go straight to map view with only that place displayed. Once you’re there you can easily take the filter off again to see all the other places. The map view is also zoomed out further by default so that you can see Britain and Egypt. That means that you have to zoom in a long way to get to France and Flanders but I think it’s less confusing than not being able to see Grimsby or Alexandria unless you zoom out.
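The difference between the two kinds of incoming link can be sketched in a few lines of Python. The page name places.html and the parameter name place are hypothetical, chosen only for illustration; the actual names used on the site are not given here.

```python
from urllib.parse import quote, urlencode


def hash_link(page: str, name: str) -> str:
    """Old style: the place name goes in the fragment after '#'."""
    return f"{page}#{quote(name)}"


def query_link(page: str, name: str, param: str = "place") -> str:
    """New style: the place name goes in a query string after '?'."""
    return f"{page}?{urlencode({param: name})}"
```

For example, query_link("places.html", "Saint-Quentin") gives places.html?place=Saint-Quentin, which the index page can read as a filter rather than as a request to open one record.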

So the site is now in a satisfactory condition with lots of cool features, and now that I’ve worked out how to do everything I could probably get another book to the same stage within a few weeks. But there are still lots of features that could, and probably should, be added. See below for more details. (more…)

Places

[posted by Gavin Robinson, 12:01 pm, 28 January 2008]

Following on from adding an interactive index of people to my digital edition of Sandall’s history of 5th Lincs, I’ve now added a similar feature for place names. It works in exactly the same way as the person index, but it also has a map view. Again this uses the Exhibit API, which makes it very easy to mash up data with Google Maps without even having to know anything about the Google Maps API. The map view is a bit slower than the normal view, especially if the list isn’t filtered, but that’s an inherent limitation of using maps.

One of the many cool things about the map is that it strikingly illustrates the allied advances in the last months of the First World War. If you go into the map view and click “The Beginning of the Great Advance” on the list of chapters, you’ll see the battalion holding the line in Flanders, then moving behind the lines for rest near Amiens, then moving up to the front line at Saint-Quentin. Then click on each of the following chapters in turn and watch the markers surge forward as 46th Division breaks through the Hindenburg Line and pushes towards Belgium.

Adding the place index was mostly similar to adding the person index: I added a unique id to each <placeName> tag using a Python script, pulled out the place names into an SQLite database, identified/disambiguated them and added a regularized name, then used another Python script to pull the regularized names out of the database and put them into the key attributes in the XML file. Identifying the places was easier than identifying people, and took a couple of days, although there are a few that I couldn’t find. As with people I added some code to the XSLT to generate a JSON file of all the places. Then following the map view tutorial I used the Exhibit API to pull latitude and longitude co-ordinates from Google Maps and put them into another JSON file. This turned out to be a bit unreliable as about 10 per cent of the places had their co-ordinates missing. It seems to be random, as running the script again with the same set of data produced a similar error rate but with different places. I had to take the missing places from the output file, put them into another input file and run the script over them again, which produced a similar 10 per cent error rate, but the remaining few co-ordinates could be put in manually. Once I had a JSON file with all the correct geocodes it was easy to copy code from the tutorial to add a map view to the Exhibit page. In a few cases it turned out that Google had given me the wrong co-ordinates. Mostly this was because there are two or more places with the same name and it had picked the wrong one. I thought I’d put in enough information from my manual searches to disambiguate them but it seems that the results of a Google Map search can be a bit unpredictable, and don’t necessarily give you the full address of a place.
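The round trip from XML to SQLite and back could look roughly like the sketch below. This is not the original scripts: the table layout, the pl1, pl2 id scheme, the use of xml:id and the assumption that the file uses the standard TEI namespace are all guesses made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumes the standard TEI P5 namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
PLACE = f"{{{TEI_NS}}}placeName"


def add_ids(root):
    """Give every <placeName> a sequential xml:id (hypothetical scheme: pl1, pl2, ...)."""
    for number, element in enumerate(root.iter(PLACE), start=1):
        element.set(XML_ID, f"pl{number}")


def export_places(root, db):
    """Copy each id and the text of the tag into a table for manual identification."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS place (id TEXT PRIMARY KEY, text TEXT, regular TEXT)"
    )
    rows = [
        (element.get(XML_ID), "".join(element.itertext()).strip(), None)
        for element in root.iter(PLACE)
    ]
    db.executemany("INSERT INTO place VALUES (?, ?, ?)", rows)


def write_keys(root, db):
    """Put the regularized names from the database into key attributes."""
    regular = dict(
        db.execute("SELECT id, regular FROM place WHERE regular IS NOT NULL")
    )
    for element in root.iter(PLACE):
        key = regular.get(element.get(XML_ID))
        if key:
            element.set("key", key)
```

The manual step sits between export_places and write_keys: the regular column is filled in by hand (or with ordinary SQL updates) as each place is identified, and anything left empty simply gets no key.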

I’ve now done most of what I planned to do in this phase. There are still some features that could be added, especially a feedback mechanism, but I’ll be giving this project a rest soon so I can do some English Civil War work.

Marking Up Names: Part 3

[posted by Gavin Robinson, 12:52 pm, 23 January 2008]

My digital edition of Sandall’s History of 5th Lincolnshire Regiment now has a new improved index of people. This uses the Exhibit API to make an interactive list which can be filtered, sorted, and searched. Exhibit provides features that would normally need a database driven back-end but it’s all done on the client side using JavaScript. The two disadvantages of this are that it doesn’t scale up very far, and that it isn’t very Google friendly. In this case there’s no problem because there are only ever going to be 350 records in the list, and there is no unique content on this page – it’s just an index to point users to other pages, which are Google friendly.

I’ve also made every occurrence of a name in the text into a link which points to the index. My worries about illegal characters in id attributes turned out to be unfounded. With Exhibit I can use the standardized names from the TEI @key attribute as hashes to make permalinks to individual records. Clicking on the link takes you to the index and displays a dialog box with all of that person’s details, including links back to every mention in the text. The dialog box is also displayed by clicking on a person’s name on the index page. I just need to work out a way to display it without having to reload the page.

Exhibit is really easy to use and makes it possible to add some fairly advanced features with surprisingly little effort. It took some searching, copying examples, trial and error, and asking on the mailing list before I worked out how to do everything, but as the project is documented by a wiki I’ve been able to update it whenever I find out how to do something that isn’t already explained there. The JSON data file for my index page is generated automatically by XSLT which loops through every <persName> and <rs> tag in the TEI document, and pulls out extra details (date of death, links to medal cards and CWGC) from another XML file.
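The loop over name tags can be illustrated in Python instead of XSLT. The sketch below groups every keyed <persName> and <rs> by its key and lists the ids of the mentions. The mentions property, the Person type and the use of xml:id are assumptions for illustration, and the extra details from the second XML file are left out.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumes the standard TEI P5 namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
NAME_TAGS = {f"{{{TEI_NS}}}persName", f"{{{TEI_NS}}}rs"}


def build_exhibit_data(root):
    """Collect one item per key, with the ids of every mention in document order."""
    mentions = defaultdict(list)
    for element in root.iter():
        key = element.get("key")
        # Tags without a key have not been linked to a person yet, so skip them.
        if element.tag in NAME_TAGS and key:
            mentions[key].append(element.get(XML_ID))
    items = [
        {"label": key, "type": "Person", "mentions": ids}
        for key, ids in sorted(mentions.items())
    ]
    return {"items": items}


def to_json(root):
    """Serialise the items as the text of a JSON data file."""
    return json.dumps(build_exhibit_data(root), indent=2)
```

Using the key as the item label is what lets the same string serve as both the record's name and its permalink.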

Now that person names are more or less fully implemented, it’s time to move on to place names. These should be easier to disambiguate, and with Exhibit I can do some even cooler things with them, such as generating a Google map.

Marking Up Names: Part 2

[posted by Gavin Robinson, 3:01 pm, 19 January 2008]

My digital edition of Sandall’s History of 1/5th Lincolnshire Regiment now has a new index of people. In my last post I described how names were marked up in the text. This post is about how I linked them together.

(more…)

Marking Up Names: Part 1

[posted by Gavin Robinson, 3:55 pm, 15 January 2008]

On to the next stage of digitizing Sandall’s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations.

(more…)

Sandall Update

[posted by Gavin Robinson, 7:32 pm, 9 January 2008]

I’ve now uploaded a new version of Sandall’s history of 5th Lincs with each chapter on a separate page. I thought splitting the pages and getting the internal links to work would be difficult but it turned out to be easier than I thought, although it involves some quite complicated XPath expressions. I’ve also uploaded the new XSLT to show how I did it. This could probably be better in some ways but for now I’m just pleased that it works. While doing this I decided to change the n attribute of the chapter divs from a number to a slug that could be used to make a Google friendly permalink.
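A slug of this kind can be derived from a chapter title with a few lines of Python. This is only a sketch of one common approach (lower-case, strip accents, hyphens between words); the slugs on the site may well have been chosen by hand.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Reduce a title to lower-case ASCII words joined by hyphens."""
    # Decompose accented letters, then drop anything that is not plain ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
```

With this function a title such as “The Beginning of the Great Advance” becomes the-beginning-of-the-great-advance, which can go straight into a file name or URL.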

Now I’m waiting for Google to re-index the site so that the custom search actually works. Meanwhile I’ve started tagging people, places, dates and abbreviations. More on that when I’ve finished. I’m also increasingly confident that the photos are in the public domain (see these guidelines, which make things a bit clearer, if they’re right), so they’ll probably be added soon.

More progress with Sandall

[posted by Gavin Robinson, 3:26 pm, 5 January 2008]

My project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War has made very good progress this week. I’ve now uploaded a new HTML version. This features links to page images and a working index: if you click on a page number in the index it takes you to the corresponding part of the text. The whole book is still on one page as I haven’t worked out how to split it yet but it’s an improvement over the previous interim version. Below are more details of what I’ve done and how I’ve done it.

(more…)

TEI Update

[posted by Gavin Robinson, 11:29 am, 2 January 2008]

Nearly a year ago I started a project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War, and in summer I published an interim version. I would’ve finished it a long time ago if I didn’t have anything else to do, but original work in peer reviewed printed journals has to come first because it’ll look better on my CV. Now I’ve got time to do some more work on it, and having had a break from it I can reassess what I’m trying to do. Below is an update on what’s new.

(more…)

Unexpected Progress

[posted by Gavin Robinson, 5:35 pm, 11 July 2007]

It’s been a long time since I wrote anything about my First World War digitization projects, but I now have some progress to report: today I published an interim version of Sandall’s History of 5th Lincolnshire Regiment. It’s still a work in progress, and there’s a lot more to be done, but you can see it here. It’s just a plain HTML version (and not strictly valid HTML), and the whole text is on one page (at least it makes it easy to search the whole text with your browser’s Find feature!), there’s no name linkage yet, no page images online, and no mechanism for submitting corrections. However, even in this form it should be useful to people who are researching the battalion and can’t get hold of the original book. More details on what I’ve done and how I’ve done it below.

(more…)
