Identifying Places

[posted by Gavin Robinson, 10:17 am, 8 April 2008]

Never mind the scary theory, here’s some empiricism. And computer programming. The piece I’m working on is an analysis of lists of horses donated to the parliamentarian army in the First Civil War. There are some figures derived from these lists in my forthcoming article in War In History and in the seminar paper that I posted in November, but I’m trying to write an article which examines them in much more detail. This article will be related to debates over allegiance and the causes of the war, which is why I’ve been trying to explore the historiography and think about theoretical issues, but the substance of it will be fairly straightforward empirical stuff with lots of numbers. That’s not to say that this kind of analysis is easy. If it were, someone else might have done it all years ago. John Tincey was the first person to try it, but he only did the smallest of the three account books, which is a fraction of the size of the other two. Following his lead I decided to do all of them.

In 1999 I spent about 2 weeks in the PRO typing these lists into an Access database. I’m still using that transcript as the basis of my work now, although I’ve converted it to XML to make it more flexible and checked a selection of the entries against digital photos of the manuscript. I’ve been using the Python classes that I developed for representing uncertainty to calculate totals of horses and values. Some pages are damaged, meaning that exact totals can’t be calculated – this is something that was difficult to deal with in Access but the combination of XML and Python has enough flexibility to cope with it. Getting totals for days and months is fairly easy, but I also want to group by the social status of the donors and the counties that they came from. Before I can group by counties I need to identify the place names given in the manuscript: although some entries specify a county in the address, many more give a place name without one.
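The uncertainty classes themselves aren't shown in this post, so the following is only one possible shape for the idea: a value known to lie between a lower and an upper bound, which can be added to exact numbers to give a total that is itself a range. The class name Uncertain and the sample entries are hypothetical, not taken from the real code or the manuscript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uncertain:
    """A count known only to lie between low and high (inclusive)."""
    low: int
    high: int

    def __add__(self, other):
        # Treat a plain integer as an exact value with equal bounds.
        if isinstance(other, int):
            other = Uncertain(other, other)
        return Uncertain(self.low + other.low, self.high + other.high)

    # Addition is symmetric, so int + Uncertain can reuse the same logic.
    __radd__ = __add__


# Hypothetical page: two legible entries and one damaged entry
# that could be anything from 1 to 3 horses.
entries = [2, 1, Uncertain(1, 3)]
total = sum(entries, Uncertain(0, 0))
print(total)
```

A total built this way stays honest about damaged pages: the result is a range rather than a single figure that looks more exact than it is.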

Back in the last century when I was doing my PhD life was hard. We didn’t have Google Maps back then. To identify places I just looked in an Ordnance Survey atlas. This was a lot of hard work. Even if the spellings of place names exactly matched the modern equivalents, looking up hundreds of names in the index would have been very tedious. But of course 17th century spelling varied wildly so it took a lot of lateral thinking to track them down. Once the obvious ones were identified some patterns emerged which made it a bit easier as entries were often likely to be clustered together, but this still meant that I had to resort to scouring the map looking for likely possibilities. I managed this well enough to include a breakdown of donations by county in my thesis, but now I want to do it better.

Inspired by Bill Turkel’s work with compression algorithms I decided to look into ways of automatically comparing addresses from the manuscript with a list of place names. My first attempt uses difflib, a standard Python library designed to compare sequences such as strings. I have no idea how it’s implemented but it provides a class called SequenceMatcher whose ratio() method compares two strings and returns a measure of their similarity to each other: 1 is an exact match and 0 means they have nothing in common at all. The library reference says that, as a rule of thumb, anything above 0.6 can be considered a close match, although this is obviously subjective. Writing a Python script to loop through lists of words and compare them with each other is fairly trivial, but first I had to get the lists.
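A minimal sketch of a single comparison, using two of the manuscript spellings mentioned in this post. The helper name similarity is made up for illustration, and lower-casing both strings first is an assumption rather than something the original script is known to do.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    # Lower-casing is an assumption, so that capitalisation alone
    # does not count as a difference between spellings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(similarity("Stretton Audley", "Stratton Audley"))
print(similarity("Agmondisham", "Amersham"))
```

The first argument to SequenceMatcher is an optional "junk" function; passing None means no characters are ignored.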

I downloaded a list of parishes from the Parish Locator website as a CSV file. I don’t know exactly how complete or accurate this list is, but it should be adequate for my purposes. In order to test things on a small scale I took addresses from a short list compiled by the Buckinghamshire commissaries at Aylesbury which is treated separately from the main list in the manuscript. Most places in this list are likely to be in Buckinghamshire. I put the addresses into a database table and deleted duplicates, leaving 55 unique strings (some of these are likely to be different spellings of the same place, but for programming purposes they’re different information). Then I copied the Buckinghamshire parishes from the CSV file and put them into another table.
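The preparation step could be sketched roughly as follows, using in-memory lists instead of database tables. The column names "parish" and "county" and the three sample rows are assumptions made for illustration; the real Parish Locator file may be laid out quite differently.

```python
import csv
import io

# Stand-in for the downloaded file. The column names are hypothetical.
SAMPLE_CSV = """parish,county
Amersham,Buckinghamshire
Great Missenden,Buckinghamshire
Stratton Audley,Oxfordshire
"""


def parishes_in_county(csv_file, county):
    """Return the parish names listed for one county."""
    reader = csv.DictReader(csv_file)
    return [row["parish"] for row in reader if row["county"] == county]


def unique_addresses(addresses):
    """Drop exact duplicates, keeping first-seen order.

    Variant spellings of the same place stay separate on purpose.
    """
    return list(dict.fromkeys(a.strip() for a in addresses))


bucks = parishes_in_county(io.StringIO(SAMPLE_CSV), "Buckinghamshire")
addresses = unique_addresses(["Agmondisham", "Throppe", "Agmondisham"])
print(bucks)
print(addresses)
```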

As a control, I manually compared the two lists and recorded my best guess as to the identity of each place. All but 8 could be identified as Buckinghamshire parishes fairly easily, although a further two were ambiguous as they could have been either of two similarly named places (Great or Little Missenden, Weston Turville or Weston Underwood). For the purposes of calculating county totals this wouldn’t be a problem as long as all of the possibilities were in the same county. Of the 8 which were not obviously Bucks parishes, 3 could easily be identified (using Google Maps) as settlements in Bucks which are/were not parishes in their own right. This is potentially the biggest problem with this method as I don’t have a full list of settlements in 17th century England. Another 2 places, given in the MS as “Throppe” and “Stretton Audley”, are highly likely to be Thrupp and Stratton Audley in Oxfordshire (although there’s also a chance that Throppe could be Castle Thorpe, Bucks). The place given as “Lannden” might be Lavendon, Bucks, or Launton, Oxon. Finally I have absolutely no idea where “Polycott” or “Ramsfee” might be. It could be that I’ve mistranscribed these (and my photos of the relevant pages are blurred enough to be ambiguous), which is another potential problem, but one which the string comparison function might help to overcome.

The Python program pulled the lists of names out of both database tables and compared every possible combination. If the ratio was greater than 0.5 the result (address from MS, matching parish name, and ratio) was written back to a third database table. The results that ended up in this table were very encouraging. The highest scoring match for each name agreed with my best guess in every case where I’d selected a Bucks parish. It easily dealt with cases which I thought might be tricky, such as “Agmondisham” for Amersham. As expected, it couldn’t do much with cases where the most likely answer was not in the list of Bucks parishes, but this is entirely a case of Garbage In Garbage Out. Even here there were no misleadingly high ratios.
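The comparison loop described above might look something like this. It is a sketch rather than the original script: results are kept in a list instead of being written back to a database table, the function names are invented, lower-casing is an assumption, and the parish list is a tiny illustrative sample.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.5  # the cut-off used for the Bucks test


def compare_all(addresses, parishes, threshold=THRESHOLD):
    """Return (address, parish, ratio) for every pair above the threshold."""
    results = []
    for address in addresses:
        for parish in parishes:
            ratio = SequenceMatcher(None, address.lower(), parish.lower()).ratio()
            if ratio > threshold:
                results.append((address, parish, ratio))
    return results


def best_matches(results):
    """Pick the highest scoring parish for each address that has any match."""
    best = {}
    for address, parish, ratio in results:
        if address not in best or ratio > best[address][1]:
            best[address] = (parish, ratio)
    return best


addresses = ["Agmondisham", "Stretton Audley"]
parishes = ["Amersham", "Aylesbury", "Great Missenden"]
results = compare_all(addresses, parishes)
best = best_matches(results)
for address, (parish, ratio) in best.items():
    print(address, parish, round(ratio, 2))
```

With this sample, "Stretton Audley" produces no row at all because its likely parish is not in the list, which is the Garbage In Garbage Out case.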

The next step is to try scaling it up to compare the whole lists of addresses with the list of parishes for the entire country. Although the Bucks test would most likely have correctly identified the Oxon parishes if it had the complete parish list to test against, there is also a possibility that there will be more false positives when there are more parishes to choose from. There would be some similar issues if I could get a complete list of settlements: less chance of the algorithm being completely lost, but more chance of getting too many possible results. When working on a larger scale I might have to raise the threshold for matches. I made it 0.5 for the test to see what happened, but the results suggest that 0.55 might be better in future. Where the algorithm can’t help at all is where two or more places have exactly the same name. I was always expecting to have to make some arbitrary decisions. Although the algorithm worked better than I was expecting there are still going to be many cases where I have to decide. Even in these cases the program should save a lot of time by quickly putting together a list of reasonable possibilities to choose from.
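For the final manual decisions, a shortlist of candidates per address could be produced with difflib.get_close_matches, which returns the best few matches at or above a cut-off, best first. This is a sketch under assumptions: the address "Missenden" is a made-up example of an ambiguous entry rather than a spelling from the manuscript, the function name candidates is invented, and the 0.55 cut-off is the value floated above. Note that this function is case-sensitive.

```python
from difflib import get_close_matches

CUTOFF = 0.55  # the raised threshold suggested by the test results


def candidates(address, parishes, limit=5, cutoff=CUTOFF):
    """Return up to `limit` parish names similar to the address, best first."""
    return get_close_matches(address, parishes, n=limit, cutoff=cutoff)


parishes = ["Great Missenden", "Little Missenden", "Amersham"]

# Hypothetical ambiguous address: both Missendens come back,
# so the final choice is still left to the historian.
print(candidates("Missenden", parishes))
```

Places with exactly the same name would appear here with identical scores, so the shortlist narrows the choice but can't make it.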
