Google Base and Great War Soldiers

[posted by Gavin Robinson, 1:08 pm, 27 December 2007]

I’ve just been looking into Google Base, which lets you upload structured data in XML format and make it searchable on Google (although so far Base pages don’t seem to show up in the standard web search). The data is described using item-types and attributes, and although Google provides recommended types and attributes you can also make up your own, for just about any purpose you want. This kind of semantic markup gives the potential for much more specific and accurate search results than a normal web search.

Now I’m wondering if this could be a solution to a big problem that I’ve been thinking about for a while: pulling together a list of all the British soldiers who served in the First World War and everything that’s known about them. This would be in the region of several million names. Many of the details are already available online in various places but they’re not linked together. The CWGC has a more or less complete database of personnel who meet their criteria of having died as a result of the war, although new names are discovered every so often (their own search engine can only search by name, not by regiment or service number). Surviving service records (only about 30 to 40% survived the Blitz!) are being put online by Ancestry, although it’s subscription only and the indexing and transcription are reputed to be really terrible. The UK National Archives has made the medal index cards available online (I’ve seen several transcription errors in the index but it’s apparently not as bad as Ancestry). This collection contains nearly 5.5 million records and should mention every soldier who qualified for a campaign medal by serving overseas (although there are unsubstantiated rumours on the Great War Forum that some cards were lost in transit). The medal cards also include men who were awarded a Silver War Badge for being discharged as unfit for service, even if they hadn’t served overseas. Officers are more problematic because if they survived the war they had to apply for their campaign medals and there are many known examples of officers who didn’t make a claim and so have no medal card. Commissions and gallantry medals are shown in the London Gazette, which is available online, but its search engine is notoriously difficult to use. Then there are various personal websites of people who are researching their families or a particular unit. And there’s the Great War Forum, which contains a huge number of posts on individual soldiers, often pulling together information from many different sources, from the most well known online databases to obscure local newspapers and family collections.

In theory something like Google Base could help to pull all this stuff together and make it easier to find information on specific people. For example, you could create an item type for soldiers and give it attributes like name, rank, regiment, battalion, service number etc. First of all a lot of thought and consultation would need to go into defining the item and attributes to make it as useful as possible to as many people as possible. This is definitely something to think about for the future.
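
As a rough illustration of what a soldier item with those attributes might look like, here is a minimal Python sketch. The element names, the `soldier_item` helper and the sample values are all invented for this example; it is not Google Base's actual schema or upload format, just one way the idea could be expressed as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names based on the list in the text above;
# this is NOT Google Base's real schema or upload format.
SOLDIER_ATTRIBUTES = ("name", "rank", "regiment", "battalion", "service_number")


def soldier_item(**values):
    """Build an <item> element of the made-up type 'soldier'."""
    unknown = set(values) - set(SOLDIER_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    item = ET.Element("item", {"type": "soldier"})
    for attribute in SOLDIER_ATTRIBUTES:
        if attribute in values:  # leave out anything that isn't known
            ET.SubElement(item, attribute).text = str(values[attribute])
    return item


# Invented example record, not a real soldier.
example = soldier_item(
    name="John Smith",
    rank="Private",
    regiment="Lincolnshire Regiment",
    battalion="1/5th",
    service_number="1234",
)
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

Rejecting unknown attribute names is a deliberate choice in this sketch: if many people were contributing records, an agreed and enforced attribute list would matter more than flexibility.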

However, there are some limitations which mean it isn’t going to happen soon. The biggest problem I can see is that you have to manually upload the records to your account, and edit them whenever they change. You can use the API to automate this but I think it would be much better if you could just embed Google Base metadata in a webpage and let Google’s spiders pull it out automatically. Another thing is that there doesn’t seem to be any scope for collaboration. Once you’ve uploaded your data no-one else can edit it. This is quite disappointing because sharing is a big part of Google Docs. In my experience many expert Great War researchers do not have advanced IT skills, so we need things to be as simple as possible, with easy ways of helping less IT-literate people by editing their stuff directly. The Your Archives wiki has shown that this can work really well: it doesn’t matter if people haven’t formatted their pages properly or don’t know how to insert a link. As long as you put up some relevant information, someone else can sort it out.

But these are changes that Google could make in the future, so it’s something to watch out for. There must be lots of other ways that historians could use Google Base. It’s already good enough for smaller data sets which have already been compiled by one person, so I might be able to put up some of my English Civil War data.

Zotero, XML, Python, and SP28

[posted by Gavin Robinson, 7:43 pm, 20 December 2007]

Since my last post I’ve been doing some more experiments to see how Zotero can be used for cataloguing previously uncatalogued administrative records from the English Civil War. I’ve now put some more of my ideas into practice in demo form and they seem to work. Linking images to Zotero items and adding metadata went very smoothly. The idea of adding extra data by putting XML tags in notes also works, although this is just a stopgap until they implement custom fields. Once you have data in Zotero it’s very easy to export it as XML and do whatever you want with it. More details below, but it gets a bit technical and even includes some sample code (formatting code in WordPress is hard, and it’ll probably screw up the layout for some people). If you’re not A. Nerd and you’re not doing the shopping for your mum you might want to stop reading now.
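
To give a flavour of the XML-tags-in-notes idea, here is a minimal Python sketch that pulls simple tagged values out of a note's text. The tag names (`county`, `amount`), the `extract_fields` helper and the sample note are all invented for illustration, and the sketch assumes the note is available as a plain string; it is not the code from the full post.

```python
import re

# Matches simple, non-nested pairs such as <county>Kent</county>.
TAG_PATTERN = re.compile(
    r"<(?P<tag>[A-Za-z_][\w.-]*)>(?P<value>.*?)</(?P=tag)>", re.DOTALL
)


def extract_fields(note):
    """Return a dict of tag name -> text for each tagged value in a note."""
    return {
        match.group("tag"): match.group("value").strip()
        for match in TAG_PATTERN.finditer(note)
    }


# Invented note text: free prose mixed with made-up tags.
note = (
    "Warrant for payment. <county>Kent</county> "
    "<amount>£20</amount> Damaged at the foot."
)
fields = extract_fields(note)
print(fields)
```

A regular expression is used here rather than an XML parser because a note that mixes free text with a few tags is not a well-formed XML document on its own. The trade-off is that nested or repeated tags are not handled.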

We Are The MODS

[posted by Gavin Robinson, 1:59 pm, 20 July 2007]

I haven’t mentioned Zotero for a long time. I was really excited when I first heard about it, and tentatively started using it last year, but then I accidentally wiped my Firefox profile and lost all the stuff I’d put in. It wasn’t much – mainly books from EEBO and notes for my posts on cavalry charges – but after that I got out of the habit. Now I need to manage the bibliographies for some articles I’m working on, so I’ve decided to start using Zotero properly. That involves importing over 1,000 records from my old database (which I wrote about here). I decided to use MODS XML as an intermediate format as Zotero can import and export MODS, and it’s also used as an intermediate format by bibutils. So far I’ve written a PHP script to pull records out of the MySQL database and display them as MODS XML. This bit went smoothly but while I was testing it I found what I think is a bug in the Zotero MODS translator (read all about it on the Zotero forum). Until that’s sorted out I can’t do the import unless I want to spend a lot of extra time manually changing creator types from author to editor.

I also need to think about adding new records. I’m trying to get on top of the debates about the causes and outbreak of the English Civil War, something which I’ve previously tried to avoid. Some of the literature is already in my old database from my PhD research but I need to find more. The most obvious place to look is the RHS Bibliography of British and Irish History as this is a more or less complete database of academic works with good search facilities (including subject headings and dates covered). Another potential advantage is the option to select records from the search results and display them as XML. The problem here is that they’ve chosen Adlib XML, which doesn’t seem to be very well supported outside the proprietary Adlib software. There isn’t a Zotero translator for it yet and I’m not really capable of writing one myself – if I couldn’t fix the bug in the MODS translator then it’s unlikely that I’d be able to adapt it to handle Adlib instead. What I might be able to do is write some XSLT to transform Adlib XML into MODS XML, which I can then import into Zotero. I’m not sure if it’s worth doing this. In practice most records in the RHS database are only a couple of clicks away from a record which Zotero can scrape. All records have a link to COPAC, which is fine for scraping books. Journal articles have a link to GetCopy, which usually leads to a record that can be scraped. Essay collections are a potential problem because the RHS has a separate record for each essay but there are no links to any other pages with these details, as COPAC and other sites only list the volumes as a whole. So it’s a choice between entering these manually or getting to grips with XSLT (without the benefit of oXygen).
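
For anyone weighing up the same choice, the shape of such a transformation can be sketched in Python instead of XSLT (a swapped-in technique, not what the post proposes). The input below is an invented, simplified record: the element names `record`, `title`, `author` and `year` are assumptions and are not taken from real Adlib XML. The output uses a handful of standard MODS elements (`titleInfo`, `name`, `role`, `originInfo`).

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def record_to_mods(record, role="author"):
    """Convert one simplified, hypothetical source record into a MODS element."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = record.findtext("title", default="")
    for creator in record.findall("author"):
        name = ET.SubElement(mods, q("name"), {"type": "personal"})
        ET.SubElement(name, q("namePart")).text = creator.text
        role_element = ET.SubElement(name, q("role"))
        ET.SubElement(role_element, q("roleTerm"), {"type": "text"}).text = role
    year = record.findtext("year")
    if year:
        origin = ET.SubElement(mods, q("originInfo"))
        ET.SubElement(origin, q("dateIssued")).text = year
    return mods


# Invented input; real Adlib XML will use different element names.
source = ET.fromstring(
    "<record><title>An Invented Essay</title>"
    "<author>Other, A. N.</author><year>1999</year></record>"
)
mods = record_to_mods(source)
print(ET.tostring(mods, encoding="unicode"))
```

The `role` parameter is there because the author/editor distinction is exactly where the import problem described above arose; passing `role="editor"` would label every creator in the record as an editor.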

However I do it, I should have time to write about something interesting once I get it out of the way…

Unexpected Progress

[posted by Gavin Robinson, 5:35 pm, 11 July 2007]

It’s been a long time since I wrote anything about my First World War digitization projects, but I now have some progress to report: today I published an interim version of Sandall’s History of 5th Lincolnshire Regiment. It’s still a work in progress, and there’s a lot more to be done, but you can see it here. It’s just a plain HTML version (and not strictly valid HTML) with the whole text on one page (at least that makes it easy to search the whole text with your browser’s Find feature!). There’s no name linkage yet, no page images online, and no mechanism for submitting corrections. However, even in this form it should be useful to people who are researching the battalion and can’t get hold of the original book. More details on what I’ve done and how I’ve done it below.

XML Tagging: Phase 1

[posted by Gavin Robinson, 1:04 pm, 26 February 2007]

Having proofread and corrected the digital text captured from Sandall’s history of 1/5th Lincolnshire (corrected to an adequate standard anyway — I can’t claim that it’s perfect), I was ready to start inserting XML tags. The first phase of markup involves the use of TEI XML tags to describe the basic structure of the text. There was nothing too difficult here, and a lot of it could be done automatically rather than reading through the text and manually inserting tags at every feature. Before I started I had to decide which tags to use and where to use them, then make sure I applied them consistently. This post gives more details of the tags I used, what I used them for, and how I got them into the text with minimal effort.
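
As an illustration of the kind of automatic structural tagging described here, the following Python sketch wraps blank-line-separated blocks of plain text in TEI `div`, `head` and `p` elements. The rule it uses (any block starting with "CHAPTER" opens a new chapter), the `tag_structure` helper and the sample text are assumptions made for this example, not the method or text from the full post.

```python
from xml.sax.saxutils import escape


def tag_structure(plain_text):
    """Wrap chapter headings and paragraphs in basic TEI structural tags."""
    # Blocks are separated by blank lines; whitespace inside a block is collapsed.
    blocks = [" ".join(b.split()) for b in plain_text.split("\n\n") if b.strip()]
    lines = []
    open_div = False
    for block in blocks:
        if block.upper().startswith("CHAPTER"):  # assumed heading convention
            if open_div:
                lines.append("</div>")
            lines.append('<div type="chapter">')
            lines.append(f"<head>{escape(block)}</head>")
            open_div = True
        else:
            lines.append(f"<p>{escape(block)}</p>")
    if open_div:
        lines.append("</div>")
    return "\n".join(lines)


# Invented sample text, not a quotation from Sandall.
sample = (
    "CHAPTER I\n\nThe battalion paraded\nat dawn.\n\n"
    "Rations & kit were issued.\n\nCHAPTER II\n\nA second chapter."
)
tagged = tag_structure(sample)
print(tagged)
```

Escaping each block before wrapping it matters because characters such as `&` in the captured text would otherwise make the result invalid XML.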

Digital History Project: Update

[posted by Gavin Robinson, 7:24 pm, 15 February 2007]

Another project update. Things have been slightly delayed because I have an article to rewrite (which means I’m slightly closer to getting published) but I’ve still been making some progress. This weekend I’ll be proofreading Sandall’s book. When that’s done I’ll be able to export the text and start tagging it with XML. But first I’ve been looking through the TEI guidelines, picking out the tags I think I’ll need, and working out how I think I’m going to use them. This is crucial because there are often different ways to mark up the same text and it’s important to be consistent. It’s also important to only apply tags which will actually be useful to users, because there’s an awful lot of potential to waste time marking up text in microscopic detail that no-one has any use for. As I do the proofreading I’ll also be looking at the structure of the text and the features in it that will need marking up, and revising the provisional tagging guidelines if necessary. Once I’m happy with the tag set and the guidelines for using them I’ll post it all (but be warned: it won’t be very interesting!). Even then I’m expecting to find some unexpected situations once I start trying to insert the tags.

Text Theories: Meaning

[posted by Gavin Robinson, 4:49 pm, 5 February 2007]

In my previous post about theories of digital text, I used Shannon’s communication theory to divide text into information and meaning, and then talked exclusively about text as information: a sequence of characters selected from a finite set. That allowed me to concentrate on one part of the problem, while excluding the more difficult problems associated with meaning. In this post, I’ll be trying to tackle some of the problems of meaning, while still trying to avoid as many as I can. I will also continue to avoid offering concrete definitions of “text” and “a text”, mainly because I haven’t found any satisfactory definitions yet, but I won’t be able to avoid using the word “text”.

Text Theories: Information

[posted by Gavin Robinson, 5:07 pm, 2 February 2007]

As the next stage of my Digital History Projects I’ve been doing background reading and thinking about the theory of text. This week I’ve read Schreibman, Siemens, and Unsworth, A Companion To Digital Humanities (2004); Burnard, O’Brien, O’Keeffe, and Unsworth, Electronic Textual Editing (2006); Susan Hockey, Electronic Texts in the Humanities (2000); and C. E. Shannon, ‘A Mathematical Theory of Communication’ (1948). I can’t say that I understood everything (especially Shannon’s equations and Jerome McGann’s pretentious jargon) but it’s given me a lot to think about, and things are nowhere near as simple as I first assumed.

Digital History Projects: Planning

[posted by Gavin Robinson, 7:30 pm, 10 January 2007]

In my New Year post, I mentioned that I’m thinking about carrying out a couple of digital history projects in connection with my First World War research. These projects are very small and should be relatively easy to carry out on my own, but there will almost certainly be challenges. Overcoming these will give me more experience of carrying out a digital history project (this is starting to sound like a job application again!), and produce useful resources. After that, I can move on to consider some more advanced issues, such as collaborating with other people, and dealing with seventeenth-century manuscripts. To make the experience even more useful, I’m trying to blog it as I go. This post is an outline of my plans so far. Now that I’ve published my plans I’ll have to carry them out!

