Marking Up Names: Part 2

[posted by Gavin Robinson, 3:01 pm, 19 January 2008]

My digital edition of Sandall’s History of 1/5th Lincolnshire Regiment now has a new index of people. In my last post I described how names were marked up in the text. This post is about how I linked them together.

I’d already used a Python script to generate a unique id for every <persName> and <rs> tag, and to build a key using components of the name. Next I wrote another Python script to pull all the names out of the XML document and put them into a SQLite database so that I could standardize the key values. The SQLite clients I have are good for running queries but not so good for manually editing data, so I exported the table to CSV and opened it in Excel. That was possibly a mistake: when I came to feed the data back into the XML I ran into some nasty character encoding problems, but it was partly my own fault for letting some unnecessary non-ASCII characters creep in through copying and pasting text from web pages. Apart from that, editing the data in a spreadsheet was easy and convenient.
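For illustration, the extraction step might look something like the sketch below. It is not the script used for the edition: the table layout, the column names, and the sample fragment are all assumptions, and it simply supposes that every name element already carries an xml:id and a draft @key.

```python
import sqlite3
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
NAME_TAGS = {f"{{{TEI_NS}}}persName", f"{{{TEI_NS}}}rs"}


def extract_names(xml_text):
    """Return one (id, tag, key, text) tuple per <persName> or <rs> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        if elem.tag in NAME_TAGS:
            local_name = elem.tag.split("}", 1)[1]
            # Collapse line breaks and runs of spaces inside the name.
            text = " ".join("".join(elem.itertext()).split())
            rows.append((elem.get(XML_ID), local_name, elem.get("key", ""), text))
    return rows


def load_names(conn, rows):
    """Store the extracted names in a table called 'names' (assumed layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS names "
        "(id TEXT PRIMARY KEY, tag TEXT, key TEXT, text TEXT)"
    )
    conn.executemany("INSERT INTO names VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Invented sample fragment, for demonstration only.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    '<persName xml:id="n1" key="Sandall">Colonel Sandall</persName> met '
    '<rs xml:id="n2" key="Stuart-Wortley">the General</rs>.'
    '</p></body></text></TEI>'
)

conn = sqlite3.connect(":memory:")
rows = extract_names(SAMPLE)
load_names(conn, rows)
```

With the names in a table, the draft keys can be queried, exported, and corrected without touching the XML until the end.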

I decided to standardize the keys in the following form: surname, full forenames (or initials where the names are not known), service number (Other Ranks only; the post-1917 six-digit number is preferred, as these were unique within the regiment), rank (officers only; the highest rank known to have been held during the First World War), and years of birth and death (if known). Officers commissioned from the ranks can have both a rank and a number. Here are a few examples:

Sandall, Thomas Edward, Lt-Col (d 1930)
Stuart-Wortley, Edward James Montagu, Maj-Gen (1857-1934)
Pickard, Herbert, 240010 (d 1917)
Leadbeater, Conrad, 240252 Capt
Howard, William, 241683
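The pattern in these examples can be expressed as a small function. This is only a sketch of the format: the function name and parameters are hypothetical, and the "(b 1857)" form for a known birth year with no death year is a guess, since none of the examples above covers that case.

```python
def build_key(surname, forenames, number=None, rank=None, born=None, died=None):
    """Assemble a standardized key from whichever components are known."""
    if born and died:
        dates = f"({born}-{died})"
    elif died:
        dates = f"(d {died})"
    elif born:
        dates = f"(b {born})"  # assumed form; not attested in the examples
    else:
        dates = ""
    # Number, rank and dates share the last comma-separated field.
    tail = " ".join(str(part) for part in (number, rank, dates) if part)
    return ", ".join(part for part in (surname, forenames, tail) if part)


print(build_key("Sandall", "Thomas Edward", rank="Lt-Col", died=1930))
print(build_key("Leadbeater", "Conrad", number=240252, rank="Capt"))
print(build_key("Howard", "William", number=241683))
```

Each call reproduces the corresponding example key from the list above.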

In the first pass I just wanted to disambiguate names as quickly as possible. In most cases this could be done from internal evidence, particularly the original index. There were some cases where I needed to check online sources, but even so the process only took a few hours and was completed within one day. There were a few names which couldn’t be disambiguated with any certainty so I’ve had to ask the 5th Lincs experts for help.

In the second pass I tried to make the keys as full and detailed as possible by searching online sources for full names, service numbers, and dates of death. This was much more labour-intensive and took several days. It wasn’t helped by the fact that both Sandall’s book and the online sources contain many errors and inconsistencies, especially with names and service numbers. This led to lots of lateral thinking, deduction, trial and error, and ultimately frustration. There are still some cases where I couldn’t positively identify an individual or couldn’t find any trace of any possible candidates in the available records. Most of these are junior officers who are only mentioned briefly. All this research wasn’t strictly necessary at this stage, but I’m looking ahead to the possibility of using a wiki for biography pages, which will need page names in the above form.

Once the keys were consistent and as detailed as possible I re-imported the CSV into the database and wrote yet another Python script to loop through every name element in the XML, pull out the database record with the matching id, and write the new key value from the database to the @key attribute of the XML element. This went smoothly once I’d got rid of all the non-ASCII characters! Then I added some code to the XSLT to generate the index of people by pulling out all the name elements, grouping them by key value, and creating links to each occurrence of that name. The index still looks quite basic, but it works.
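As a rough sketch of that round trip, the code below reads corrected keys from CSV, writes them to the @key attribute of each matching element, and then groups occurrences by key. It is an illustration under assumptions, not the original script: it reads the CSV directly rather than going back through the database, the column names "id" and "key" are invented, the sample data is made up, and the grouping function is a Python stand-in for logic that the edition itself implements in XSLT.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
NAME_TAGS = {f"{{{TEI_NS}}}persName", f"{{{TEI_NS}}}rs"}


def read_keys(csv_file):
    """Map each element id to its corrected key (columns 'id' and 'key' assumed)."""
    return {row["id"]: row["key"] for row in csv.DictReader(csv_file)}


def apply_keys(root, keys):
    """Write corrected keys to @key; return how many elements were updated."""
    updated = 0
    for elem in root.iter():
        if elem.tag in NAME_TAGS:
            new_key = keys.get(elem.get(XML_ID))
            if new_key is not None:
                elem.set("key", new_key)
                updated += 1
    return updated


def group_by_key(root):
    """Group the ids of name occurrences under their shared key, sorted by key."""
    index = defaultdict(list)
    for elem in root.iter():
        if elem.tag in NAME_TAGS and elem.get("key"):
            index[elem.get("key")].append(elem.get(XML_ID))
    return dict(sorted(index.items()))


# Invented sample data, for demonstration only.
CSV_TEXT = (
    "id,key\n"
    'n1,"Sandall, Thomas Edward, Lt-Col (d 1930)"\n'
    'n2,"Sandall, Thomas Edward, Lt-Col (d 1930)"\n'
    'n3,"Howard, William, 241683"\n'
)
SAMPLE_XML = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    '<persName xml:id="n1" key="Sandall">Col. Sandall</persName>, '
    '<rs xml:id="n2" key="">the Commanding Officer</rs> and '
    '<persName xml:id="n3" key="Howard">Pte. Howard</persName>'
    '</p></body></text></TEI>'
)

root = ET.fromstring(SAMPLE_XML)
keys = read_keys(io.StringIO(CSV_TEXT))
count = apply_keys(root, keys)
index = group_by_key(root)

# Keep TEI as the default namespace when serializing the updated document.
ET.register_namespace("", TEI_NS)
output = ET.tostring(root, encoding="unicode")
```

When reading a real CSV file rather than an in-memory string, opening it with an explicit encoding (for example `open(path, encoding="utf-8", newline="")`) makes it harder for encoding problems to slip through unnoticed.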

The next step is to make each occurrence of a name link back to the index. There’s a complication here because I need the entries in the index to have valid unique ids so that I can link to them, but I also need the full regularized forms of the names to appear in the index list. This places conflicting demands on the @key attribute, because xml:id values can’t contain spaces, brackets, or commas. TEI used to have a @reg attribute for names which could contain a regularized form, but this was removed in P5. The way you’re supposed to do it now is to put the regularized form in a <reg> element and wrap it, along with the original form, in a <choice> element. I could do that and use a modified form for the @key. But I’m also thinking about putting the name data into a JSON file so I can do cool things with Exhibit. That would mean that I wouldn’t have to embed the regularized form of the name in every occurrence: I could just have an abbreviated form that can be used as a key for name occurrences and as an id for the index entry. It would also allow me to store and use more data than can be accommodated in TEI XML. For example, during the research/disambiguating phase I collected a lot of links to medal cards and CWGC database entries. If these were in a JSON file I could easily link to them from the index page without much extra work. Now I need to work out exactly what I want to do with this and how to structure the data.
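One possible shape for this, offered purely as a sketch since the structure is still undecided: derive an id-safe string from each full key, and keep the full form alongside it in a JSON record. The function names, the "p_" prefix, and the field names ("label", "occurrences", "links") are all assumptions, and the "links" list is left empty rather than filled with invented URLs.

```python
import json
import re
import unicodedata


def key_to_id(key):
    """Reduce a full key to letters, digits and underscores, usable as an xml:id."""
    ascii_key = (
        unicodedata.normalize("NFKD", key).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_key).strip("_")
    # An xml:id cannot start with a digit, so prefix anything that does.
    if not slug or not slug[0].isalpha():
        slug = "p_" + slug
    return slug


def build_items(index):
    """Turn {full key: [occurrence ids]} into a JSON-ready structure."""
    return {
        "items": [
            {
                "id": key_to_id(key),
                "label": key,                # full regularized form for display
                "occurrences": occurrences,  # ids of the name elements in the text
                "links": [],                 # placeholder for external records
            }
            for key, occurrences in index.items()
        ]
    }


data = build_items({
    "Sandall, Thomas Edward, Lt-Col (d 1930)": ["n1", "n2"],
    "Howard, William, 241683": ["n3"],
})
print(json.dumps(data, indent=2))
```

Two different keys could in principle reduce to the same id, so a real script would need to check the generated ids for collisions before using them.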
