Google Base and Great War Soldiers
I’ve just been looking into Google Base, which lets you upload structured data in XML format and make it searchable on Google (although so far Base pages don’t seem to show up in the standard web search). The data is described using item-types and attributes, and although Google provides recommended types and attributes you can also make up your own, for just about any purpose you want. This kind of semantic markup gives the potential for much more specific and accurate search results than a normal web search.
Now I’m wondering if this could be a solution to a big problem that I’ve been thinking about for a while: pulling together a list of all the British soldiers who served in the First World War and everything that’s known about them. This would be in the region of several million names. Many of the details are already available online in various places but they’re not linked together.
The CWGC has a more or less complete database of personnel who meet their criteria of having died as a result of the war, although new names are discovered every so often. Their own search engine can only search by name, not by regiment or service number. Surviving service records (only about 30 to 40% survived the Blitz!) are being put online by Ancestry, although it’s subscription only and the indexing and transcription are reputed to be really terrible.
The UK National Archives has made the medal index cards available online (I’ve seen several transcription errors in the index but it’s apparently not as bad as Ancestry). This collection contains nearly 5.5 million records and should mention every soldier who qualified for a campaign medal by serving overseas (although there are unsubstantiated rumours on the Great War Forum that some cards were lost in transit). The medal cards also include men who were awarded a Silver War Badge for being discharged as unfit for service, even if they hadn’t served overseas. Officers are more problematic because if they survived the war they had to apply for their campaign medals, and there are many known examples of officers who didn’t make a claim and so have no medal card.
Commissions and gallantry medals are shown in the London Gazette, which is available online, but its search engine is notoriously difficult to use. Then there are various personal websites of people who are researching their families or a particular unit.
And there’s the Great War Forum, which contains a huge number of posts on individual soldiers, often pulling together information from many different sources, from the most well known online databases to obscure local newspapers and family collections.
In theory something like Google Base could help to pull all this stuff together and make it easier to find information on specific people. For example, you could create an item type for soldiers and give it attributes like name, rank, regiment, battalion, service number etc. First of all a lot of thought and consultation would need to go into defining the item and attributes to make it as useful as possible to as many people as possible. This is definitely something to think about for the future.
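As a very rough sketch of what such an item type might look like, here is a soldier record with the attributes mentioned above, serialised as flat XML. The field names, the `item` element and the example soldier are all made up for illustration; this is not Google Base's actual schema or upload format, just a way of thinking about the structure.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class Soldier:
    # Hypothetical attribute names; a real scheme would need consultation.
    name: str
    rank: str
    regiment: str
    service_number: str
    battalion: Optional[str] = None  # often unknown, so optional


def soldier_to_xml(soldier: Soldier) -> str:
    """Serialise one record as a flat <item> with one child per attribute."""
    item = ET.Element("item", {"type": "soldier"})
    for field, value in asdict(soldier).items():
        if value is not None:  # leave out unknown details instead of empty tags
            ET.SubElement(item, field).text = value
    return ET.tostring(item, encoding="unicode")


# Placeholder data, not a real soldier.
example = Soldier(name="John Smith", rank="Private",
                  regiment="Example Regiment", service_number="12345")
xml_text = soldier_to_xml(example)
print(xml_text)
```

Making battalion optional reflects the fact that many sources give only a regiment; deciding which attributes are required is exactly the kind of question that would need wider consultation.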
However, there are some limitations which mean it isn’t going to happen soon. The biggest problem I can see is that you have to manually upload the records to your account, and edit them whenever they change. You can use the API to automate this but I think it would be much better if you could just embed Google Base metadata in a webpage and let Google’s spiders pull it out automatically. Another thing is that there doesn’t seem to be any scope for collaboration: once you’ve uploaded your data no-one else can edit it. This is quite disappointing because sharing is a big part of Google Docs. In my experience many expert Great War researchers do not have advanced IT skills, so things need to be as simple as possible, and there need to be easy ways for other people to help less IT-literate researchers by editing their data directly. The Your Archives wiki has shown that this can work really well: it doesn’t matter if people haven’t formatted their pages properly or don’t know how to insert a link. As long as you put up some relevant information, someone else can sort it out.
But these are changes that Google could make in the future, so it’s something to watch out for. There must be lots of other ways that historians could use Google Base. It’s already good enough for smaller data sets which have already been compiled by one person, so I might be able to put up some of my English Civil War data.

Comment by Gavin Robinson — 7:44 pm, 27 December 2007
Another thing I forgot to mention: I don’t think you can nest attribute tags, which would be quite a useful thing to do. For example, the peculiarities of the British Army numbering system in this period mean that a service number needs to be associated with a regiment (or sometimes even a battalion) to make any sense.
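Without nesting, one possible workaround is to treat the regiment and the service number together as a single key when indexing or searching. The sketch below uses made-up records and hypothetical field names to show why the number alone is ambiguous.

```python
from collections import defaultdict

# Made-up records: the same number can turn up in different regiments.
records = [
    {"name": "A. Example", "regiment": "Regiment A", "service_number": "1234"},
    {"name": "B. Example", "regiment": "Regiment B", "service_number": "1234"},
]

by_number = defaultdict(list)   # number alone: may match several men
by_regiment_and_number = {}     # (regiment, number): treated as unique here

for rec in records:
    by_number[rec["service_number"]].append(rec["name"])
    key = (rec["regiment"], rec["service_number"])
    by_regiment_and_number[key] = rec["name"]

ambiguous = by_number["1234"]
unique = by_regiment_and_number[("Regiment A", "1234")]
print(ambiguous)
print(unique)
```

Where numbers were issued per battalion the key would need a third element, which is the sort of case that nested attributes would express more naturally.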
Even if Base isn’t quite up to pulling in complete datasets from big sites like CWGC and TNA, it could still be valuable for people who have compiled a database of men who served in a battalion and want to make it available to other people on the web but who don’t have the technical skills or the money to host it themselves. I might try some small scale experiments and see where it goes (although I have lots of other things to finish first).
Another alternative is Freebase, but that seems to be the other extreme: absolutely anyone can edit your data.
Comment by Ben Brumfield — 12:38 am, 28 December 2007
“just embed Google Base metadata in a webpage”
This sounds reminiscent of microformats. Have you looked to see if any connectors between the two exist?
Comment by Gavin Robinson — 10:54 am, 28 December 2007
Thanks for pointing that out. I’d somehow never heard of microformats but it looks like a brilliant idea - even easier than embedding XML. For people who have already published details of soldiers on their websites this is likely to be a better solution than trying to sync up with Google Base. On the other hand Google Base would be best for people who haven’t already published their data on the web and don’t know how to. It should be possible to define a Google Base type and a microformat which map to each other and contain all the necessary details of soldiers. Then someone could write a scraper which pulls it all together and allows searches across Google Base, microformatted websites, and possibly the big sites (CWGC, TNA) even if they don’t have any metadata. I know that someone is already working on a Python script which scrapes the CWGC database and allows searches by regiment and service number, which the site’s own search engine won’t do.
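To give an idea of how simple the scraping side could be, here is a minimal sketch that pulls soldier details out of microformat-style HTML using only the Python standard library. The class names (`soldier`, `name`, `rank`, `regiment`, `service-number`) are invented for illustration and are not an existing microformat, and the sample page is made up.

```python
from html.parser import HTMLParser

# Hypothetical class names for the fields of a "soldier" microformat.
FIELDS = {"name", "rank", "regiment", "service-number"}


class SoldierParser(HTMLParser):
    """Collects one dict per element marked class="soldier".

    Simplification: it does not track where a soldier element ends, so it
    assumes every field element follows the soldier it belongs to.
    """

    def __init__(self):
        super().__init__()
        self.soldiers = []
        self._current = None  # record currently being filled in
        self._field = None    # field whose text is expected next

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if "soldier" in classes:
            self._current = {}
            self.soldiers.append(self._current)
        elif self._current is not None:
            matched = classes & FIELDS
            if matched:
                self._field = matched.pop()

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] = data.strip()
            self._field = None


page = """
<div class="soldier">
  <span class="name">John Smith</span>,
  <span class="rank">Private</span>,
  <span class="regiment">Example Regiment</span>,
  number <span class="service-number">12345</span>
</div>
"""

parser = SoldierParser()
parser.feed(page)
print(parser.soldiers)
```

A real scraper would also need to fetch pages, cope with messy markup, and map these fields onto whatever Google Base type was agreed, but the markup itself asks very little of the person publishing the page.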