New blog and CSPD online

[posted by Gavin Robinson, 9:35 am, 23 April 2008]

Mercurius Politicus linked to Gilbert Mabbott, a new blog about print culture in the English Civil Wars and Interregnum. From this blog I discovered that the Calendar of State Papers Domestic is starting to appear on Google Books. There’s a James I volume available with full access. I’m hoping that the rest of the series, particularly the Charles I volumes, will follow soon. There’s no reason why they shouldn’t, as they’re all in the public domain. Since the original documents were under Crown Copyright and the calendars were published by HMSO in the 19th century, the copyright must have expired by now. Despite that, British History Online are trying to charge money for access to digital versions of the calendars for the reigns of James I and Charles I. I always thought that was a bad decision. If all of the volumes end up being freely available on Google, it’s going to look even more stupid.

Places

[posted by Gavin Robinson, 12:01 pm, 28 January 2008]

Following on from adding an interactive index of people to my digital edition of Sandall’s history of 5th Lincs, I’ve now added a similar feature for place names. It works in exactly the same way as the person index, but it also has a map view. Again this uses the Exhibit API, which makes it very easy to mash up data with Google Maps without even having to know anything about the Google Maps API. The map view is a bit slower than the normal view, especially if the list isn’t filtered, but that’s an inherent limitation of using maps.

One of the many cool things about the map is that it strikingly illustrates the Allied advances in the last months of the First World War. If you go into the map view and click “The Beginning of the Great Advance” on the list of chapters, you’ll see the battalion holding the line in Flanders, then moving behind the lines for rest near Amiens, then moving up to the front line at Saint-Quentin. Then click on each of the following chapters in turn and watch the markers surge forward as 46th Division breaks through the Hindenburg Line and pushes towards Belgium.

Adding the place index was mostly similar to adding the person index: I added a unique id to each <placeName> tag using a Python script, pulled out the place names into an SQLite database, identified/disambiguated them and added a regularized name, then used another Python script to pull the regularized names out of the database and put them into the key attributes in the XML file. Identifying the places was easier than identifying people, and took a couple of days, although there are a few that I couldn’t find. As with people, I added some code to the XSLT to generate a JSON file of all the places. Then, following the map view tutorial, I used the Exhibit API to pull latitude and longitude co-ordinates from Google Maps and put them into another JSON file. This turned out to be a bit unreliable, as about 10 per cent of the places had their co-ordinates missing. It seems to be random, as running the script again with the same set of data produced a similar error rate but with different places. I had to take the missing places from the output file, put them into another input file and run the script over them again, which produced a similar 10 per cent error rate, but the remaining few co-ordinates could be put in manually. Once I had a JSON file with all the correct geocodes it was easy to copy code from the tutorial to add a map view to the Exhibit page. In a few cases it turned out that Google had given me the wrong co-ordinates. Mostly this was because there are two or more places with the same name and it had picked the wrong one. I thought I’d put in enough information from my manual searches to disambiguate them, but it seems that the results of a Google Maps search can be a bit unpredictable, and don’t necessarily give you the full address of a place.
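For anyone curious about the tagging and database steps described in this post, here is a minimal Python sketch of how they might look. It is not the script used for the edition. The sample XML, the `pl1`-style ids, the plain `id` attribute name, the table layout and the regularized names are all assumptions made up for illustration, and a real TEI file would also need namespace handling.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample standing in for the real XML file (no TEI namespace here).
SAMPLE = """<text>
  <p>The battalion rested near <placeName>Amiens</placeName>, moved up to
  <placeName>St. Quentin</placeName>, then returned to <placeName>Amiens</placeName>.</p>
</text>"""

# Hypothetical result of the manual identification step.
REGULARIZED = {
    "Amiens": "Amiens, Somme, France",
    "St. Quentin": "Saint-Quentin, Aisne, France",
}

def add_ids(root):
    """Give every placeName element a unique id attribute (pl1, pl2, ...)."""
    for n, el in enumerate(root.iter("placeName"), start=1):
        el.set("id", f"pl{n}")

def extract_to_db(root, conn):
    """Copy each tagged place name into an SQLite table, keyed by its id."""
    conn.execute(
        "CREATE TABLE place (id TEXT PRIMARY KEY, name TEXT, regularized TEXT)"
    )
    rows = [(el.get("id"), el.text) for el in root.iter("placeName")]
    conn.executemany("INSERT INTO place (id, name) VALUES (?, ?)", rows)

def regularize(conn, lookup):
    """Stand-in for the manual step: store a regularized name for each spelling."""
    for name, regular in lookup.items():
        conn.execute(
            "UPDATE place SET regularized = ? WHERE name = ?", (regular, name)
        )

def write_keys(root, conn):
    """Put the regularized names back into the XML as key attributes."""
    keys = dict(
        conn.execute(
            "SELECT id, regularized FROM place WHERE regularized IS NOT NULL"
        )
    )
    for el in root.iter("placeName"):
        key = keys.get(el.get("id"))
        if key:
            el.set("key", key)

root = ET.fromstring(SAMPLE)
conn = sqlite3.connect(":memory:")
add_ids(root)
extract_to_db(root, conn)
regularize(conn, REGULARIZED)
write_keys(root, conn)
print(ET.tostring(root, encoding="unicode"))
```

Places that could not be identified simply keep a NULL regularized name in the table and get no key attribute, which mirrors the few places left unidentified.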

I’ve now done most of what I planned to do in this phase. There are still some features that could be added, especially a feedback mechanism, but I’ll be giving this project a rest soon so I can do some English Civil War work.

Sandall Update

[posted by Gavin Robinson, 7:32 pm, 9 January 2008]

I’ve now uploaded a new version of Sandall’s history of 5th Lincs with each chapter on a separate page. I thought splitting the pages and getting the internal links to work would be difficult, but it turned out to be easier than I expected, although it involves some quite complicated XPath expressions. I’ve also uploaded the new XSLT to show how I did it. This could probably be better in some ways but for now I’m just pleased that it works. While doing this I decided to change the n attribute of the chapter divs from a number to a slug that could be used to make a Google-friendly permalink.
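To show what a slug of this kind might look like, here is a small Python sketch that turns a chapter title into a lowercase, hyphen-separated string. The `slugify` function and its rules are assumptions for illustration only. The post does not say how the actual slugs were produced.

```python
import re
import unicodedata

def slugify(title):
    """Turn a chapter title into a lowercase, hyphen-separated slug."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower())
    # Trim hyphens left over from leading or trailing punctuation.
    return slug.strip("-")

print(slugify("The Beginning of the Great Advance"))
```

A slug like this can serve both as the value of the n attribute and as the file name of the chapter page, so internal links and permalinks stay readable.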

Now I’m waiting for Google to re-index the site so that the custom search actually works. Meanwhile I’ve started tagging people, places, dates and abbreviations. More on that when I’ve finished. I’m also increasingly confident that the photos are in the public domain (see these guidelines, which make things a bit clearer, if they’re right), so they’ll probably be added soon.

What I really, really want

[posted by Gavin Robinson, 7:05 pm, 3 January 2008]

I’ve been playing with Google Custom Search and although it’s good it would be much better if it could recognize metadata in microformats, RDF, or any other formats that metadata might be found in. And if it could also scrape data off web pages using regular expressions (sort of like Feed43 but better). And if you could create custom search fields and define how they map to Google Base fields, Freebase fields, metadata tags, and scraped data.

And I want the moon on a stick.

Google Base and Great War Soldiers

[posted by Gavin Robinson, 1:08 pm, 27 December 2007]

I’ve just been looking into Google Base, which lets you upload structured data in XML format and make it searchable on Google (although so far Base pages don’t seem to show up in the standard web search). The data is described using item-types and attributes, and although Google provides recommended types and attributes you can also make up your own, for just about any purpose you want. This kind of semantic markup gives the potential for much more specific and accurate search results than a normal web search.

Now I’m wondering if this could be a possible solution to a big problem that I’ve been thinking about for a while: pulling together a list of all the British soldiers who served in the First World War and everything that’s known about them. This would be in the region of several million names. Many of the details are already available online in various places but they’re not linked together. The CWGC has a more or less complete database of personnel who meet their criteria of having died as a result of the war, although new names are discovered every so often (their own search engine can only search by name, not by regiment or service number). Surviving service records (only about 30 to 40% survived the Blitz!) are being put online by Ancestry, although it’s subscription only and the indexing and transcription are reputed to be really terrible. The UK National Archives has made the medal index cards available online (I’ve seen several transcription errors in the index but it’s apparently not as bad as Ancestry). This collection contains nearly 5.5 million records and should mention every soldier who qualified for a campaign medal by serving overseas (although there are unsubstantiated rumours on the Great War Forum that some cards were lost in transit). The medal cards also include men who were awarded a Silver War Badge for being discharged as unfit for service, even if they hadn’t served overseas. Officers are more problematic because if they survived the war they had to apply for their campaign medals and there are many known examples of officers who didn’t make a claim and so have no medal card. Commissions and gallantry medals are shown in the London Gazette, which is available online, but its search engine is notoriously difficult to use. Then there are various personal websites of people who are researching their families or a particular unit. 
And there’s the Great War Forum, which contains a huge number of posts on individual soldiers, often pulling together information from many different sources, from the most well known online databases to obscure local newspapers and family collections.

In theory something like Google Base could help to pull all this stuff together and make it easier to find information on specific people. For example, you could create an item type for soldiers and give it attributes like name, rank, regiment, battalion, service number etc. First of all a lot of thought and consultation would need to go into defining the item and attributes to make it as useful as possible to as many people as possible. This is definitely something to think about for the future.
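As a rough illustration of what such an item type might contain, here is a Python sketch of a soldier record serialized to simple XML. The element names, the `type` attribute and the sample values are all invented for the example. This is not the Google Base schema or API, just one possible shape for the data.

```python
from dataclasses import dataclass, asdict
import xml.etree.ElementTree as ET

@dataclass
class Soldier:
    """Hypothetical item type using the attributes suggested above."""
    name: str
    rank: str
    regiment: str
    battalion: str
    service_number: str

def to_xml(soldier):
    """Serialize one record as an <item> element with one child per attribute."""
    item = ET.Element("item", {"type": "soldier"})
    for field, value in asdict(soldier).items():
        ET.SubElement(item, field).text = value
    return ET.tostring(item, encoding="unicode")

# Invented example record, not a real soldier.
example = Soldier(
    name="John Smith",
    rank="Private",
    regiment="Lincolnshire Regiment",
    battalion="1/5th",
    service_number="1234",
)
print(to_xml(example))
```

Keeping each attribute in its own field is what would let a search be restricted to, say, one regiment or one service number rather than matching on the name alone.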

However, there are some limitations which mean it isn’t going to happen soon. The biggest problem I can see is that you have to manually upload the records to your account, and edit them whenever they change. You can use the API to automate this but I think it would be much better if you could just embed Google Base metadata in a webpage and let Google’s spiders pull it out automatically. Another thing is that there doesn’t seem to be any scope for collaboration. Once you’ve uploaded your data no-one else can edit it. This is quite disappointing because sharing is a big part of Google Docs. In my experience many expert Great War researchers do not have advanced IT skills and so we need things to be as simple as possible, and easy ways of helping less IT literate people by being able to edit their stuff directly. The Your Archives wiki has shown that this can work really well: it doesn’t matter if people haven’t formatted their pages properly or don’t know how to insert a link. As long as you put up some relevant information, someone else can sort it out.

But these are changes that Google could make in the future, so it’s something to watch out for. There must be lots of other ways that historians could use Google Base. It’s already good enough for smaller data sets which have already been compiled by one person, so I might be able to put up some of my English Civil War data.