Identifying places 2
Last week I posted about experiments with Python to automatically identify places mentioned in lists of horses donated to parliament’s armies in the English Civil War. The initial results, from using Python’s difflib module to compare a selection of places with a list of Buckinghamshire parishes, were very encouraging. Since then I’ve scaled it up and also tried some different approaches. The results are less clear cut when comparing bigger lists, but I’ve been able to write a program which should save me a lot of time compared to the manual methods that I used during my PhD.
After the Buckinghamshire test, the next step was to try a bigger selection of places, but still limited to one county. I pulled out all the place names which are specified in the manuscript as being in Essex. After deleting duplicates this gave 240 unique strings. Comparing these with the list of Essex parishes gave a correct top answer in 181 cases (75%). Comparing with a list of parishes for the whole of England (over 10,000 of them) gave 121 correct top answers. That 50% success rate against the whole country is more than I was expecting at this stage. Although it was usually the most obvious matches which came out top, it would still save a lot of work, as the program did all the comparisons within a few minutes. The biggest problem is the huge number of false positives returned, which can make it difficult to find the most likely answer if it doesn’t have the highest ratio. A threshold of 0.55 gave a total of 28,077 results, an average of over 100 per place name. Raising the threshold could cut this down, but some correct answers were slightly below 0.55.
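A minimal sketch of this kind of thresholded comparison, assuming difflib's SequenceMatcher and case-insensitive matching. The function name rank_matches and the short parish list are hypothetical, made up for illustration, and this is not the exact script I ran.

```python
from difflib import SequenceMatcher


def rank_matches(place, parishes, threshold=0.55):
    """Return (parish, ratio) pairs at or above the threshold, best first."""
    place_lower = place.lower()  # assumption: comparisons ignore case
    scored = []
    for parish in parishes:
        ratio = SequenceMatcher(None, place_lower, parish.lower()).ratio()
        if ratio >= threshold:
            scored.append((parish, ratio))
    # Highest ratio first, so the most likely answer heads the list.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, not the real parish list.
parishes = ["Dedham", "Debden", "Great Maplestead", "Ipswich"]
for parish, ratio in rank_matches("Deddom", parishes):
    print(f"{parish}: {ratio:.2f}")
```

Every candidate above the threshold is kept rather than just the top one, which is why the number of results grows so quickly when the parish list covers the whole country.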
After this I looked into different algorithms and decided to try combining difflib with metaphones. A metaphone is a phonetic representation of a string in which vowels are stripped out (unless they begin the word) and consonants are simplified. The phonetic rules are based on English so it won’t work in other languages, and it seems to be a very precise Received Pronunciation kind of English, which doesn’t necessarily reflect how English is spoken now or was spoken in the 17th century. I used PHP to convert place and parish names into metaphones because it has a built-in metaphone function, then put them into a SQLite database and wrote a Python script to pull them out and compare them using difflib.
To start with, small-scale tests were very encouraging. Metaphone tends to iron out minor variations in spelling, increasing the chances of getting an exact match. For example, Dedham might be spelt Deddom, Deddome, Dedhame, etc., but these all reduce to the metaphone TTM. Correct answers tended to have a higher ratio with metaphone than with difflib alone. This method can cope with some very tricky cases. I remember once at a conference Peter Edwards was discussing the records of a horse fair and noted that one buyer or seller was recorded as coming from Asbedelesey, but he had no idea where that was. Then there was a eureka moment as someone in the audience said “Ashby de la Zouch”. The connection becomes much more obvious when the strings are converted to metaphones: Ashby de la Zouch is AXBTLSX and Asbedelesey is ASBTLS. Comparing the metaphones with difflib gives a ratio of 0.77 instead of the 0.43 you get from comparing the original strings. Similarly, the difflib ratio for “Ipswich” and “Ipsage” is only 0.46, but converting them to metaphones gets a ratio of 0.67.
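The Ashby de la Zouch figures can be checked with a few lines of Python. This is a sketch that assumes difflib's SequenceMatcher with the arguments in the order shown. The metaphone codes are copied from the paragraph above rather than computed here, since Python's standard library has no metaphone function.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings according to difflib, between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


# Original spellings compared directly.
raw = ratio("Ashby de la Zouch", "Asbedelesey")

# Metaphone codes as quoted in the text (generated there with PHP).
phonetic = ratio("AXBTLSX", "ASBTLS")

print(f"original strings: {raw:.2f}")
print(f"metaphones:       {phonetic:.2f}")
```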
The problem with this approach is that although it allows a higher threshold, the number of results gets even higher. Comparing the list of Essex places with all of the 10,000-plus parishes returned 41,858 results even with a threshold of 0.65. If you can limit the search to a single county, comparisons are likely to be very accurate, but if you don’t know the county and have to compare with the whole country there are too many possibilities.
One quick way of eliminating false positives is to compare the first letter of the metaphones. If these don’t match then the places are highly unlikely to be the same. Although there will probably be a few cases where correct answers don’t have matching first letters, it’s worth losing these to make the results more manageable. Trying this with the Essex results reduced the total number of matches from 41,858 to 17,616. One reason why correct answers might not have matching first letters is inconsistent use of qualifying words such as great, little, north, south, east, west, upper, lower, etc. Sometimes these might be omitted in the manuscript or given in different forms, e.g. “Maplestead Magna” instead of “Great Maplestead”. Therefore I added another column to the database containing metaphones calculated after stripping out these stop words. That allows comparisons with and without qualifying words, which should give the best of both worlds.
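Both filters are simple to express in Python. The sketch below is illustrative only: the function names are hypothetical, and the stop word list is the one given above plus “magna” and “parva”, which are my assumed Latin equivalents rather than a definitive list.

```python
# Qualifying words from the text, plus assumed Latin forms (magna, parva).
STOP_WORDS = {
    "great", "little", "north", "south", "east", "west",
    "upper", "lower", "magna", "parva",
}


def strip_qualifiers(name):
    """Lower-case a place name and drop any qualifying words."""
    words = [w for w in name.lower().split() if w not in STOP_WORDS]
    # If the whole name was a qualifier, keep it rather than return nothing.
    return " ".join(words) if words else name.lower()


def same_first_letter(code_a, code_b):
    """True when two non-empty metaphone codes start with the same letter."""
    return bool(code_a) and bool(code_b) and code_a[0] == code_b[0]


print(strip_qualifiers("Great Maplestead"))
print(strip_qualifiers("Maplestead Magna"))
print(same_first_letter("AXBTLSX", "ASBTLS"))
```

The metaphone would then be calculated from the stripped name, so that both forms of Maplestead end up with the same code.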
Finally I put all of this together into a function which I can call from the command line. This allows me to search for one place at a time and specify the threshold and whether to limit the search to a specific county. The results are fed back into database tables: one with stop words and one without. During the tests it became obvious that I’d need this flexibility because there was no chance of finding a perfect one-size-fits-all approach. The basic problem is just that there are too many places in England with names that are too similar. Choosing between them is always going to be an arbitrary decision, but with this Python program I can make a more informed decision more quickly.
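To give an idea of the shape of such a command-line tool, here is a rough sketch using argparse. It is not my actual program: the argument names, the default threshold and the two-row parish table are assumptions for illustration, and it prints results rather than writing them back to database tables. The two metaphone codes are the ones quoted earlier.

```python
import argparse
from difflib import SequenceMatcher

# Hypothetical stand-in for the SQLite table: (parish, county, metaphone).
PARISHES = [
    ("Dedham", "Essex", "TTM"),
    ("Ashby de la Zouch", "Leicestershire", "AXBTLSX"),
]


def search(code, threshold, county=None, rows=PARISHES):
    """Return (parish, county, ratio) rows at or above the threshold."""
    results = []
    for parish, parish_county, parish_code in rows:
        # Optionally restrict the search to a single county.
        if county and parish_county.lower() != county.lower():
            continue
        ratio = SequenceMatcher(None, code, parish_code).ratio()
        if ratio >= threshold:
            results.append((parish, parish_county, ratio))
    return sorted(results, key=lambda row: row[2], reverse=True)


def build_parser():
    parser = argparse.ArgumentParser(description="Match a place to parishes")
    parser.add_argument("code", help="metaphone of the place to look up")
    parser.add_argument("--threshold", type=float, default=0.65)
    parser.add_argument("--county", default=None)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    for parish, parish_county, ratio in search(
        args.code, args.threshold, args.county
    ):
        print(f"{parish} ({parish_county}): {ratio:.2f}")


# Example call with explicit arguments instead of reading sys.argv.
main(["ASBTLS", "--threshold", "0.65"])
```

Keeping the threshold and county as options is what gives the flexibility described above: a tight search within one county first, then a looser search across the whole country if nothing turns up.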
