<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Investigations of a Dog &#187; python</title>
	<atom:link href="http://www.investigations.4-lom.com/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.investigations.4-lom.com</link>
	<description>Failing better at understanding the past</description>
	<lastBuildDate>Sun, 05 Feb 2012 09:18:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Text-mining tips</title>
		<link>http://www.investigations.4-lom.com/2011/06/12/text-mining-tips/</link>
		<comments>http://www.investigations.4-lom.com/2011/06/12/text-mining-tips/#comments</comments>
		<pubDate>Sun, 12 Jun 2011 10:27:03 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[textmining]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/?p=912</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Text-mining+tips&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2011-06-12&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2011/06/12/text-mining-tips/&amp;rft.language=English"></span>
These are some insights from the text-mining that I&#8217;ve been doing this week: Stop and think about stop words One of the first rules of text-mining should be: always make your own list of stop words. Nothing absolutely and objectively is or isn&#8217;t a stop word. Which words are and aren&#8217;t meaningful depends on your [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Text-mining+tips&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2011-06-12&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2011/06/12/text-mining-tips/&amp;rft.language=English"></span>
<p>These are some insights from the text-mining that I&#8217;ve been doing this week:</p>
<h3>Stop and think about stop words</h3>
<p>One of the first rules of text-mining should be: always make your own list of stop words. Nothing absolutely and objectively is or isn&#8217;t a stop word. Which words are and aren&#8217;t meaningful depends on your research questions. For example, pronouns are often included in lists of stop words, but I&#8217;m very interested in gender so I want to know the frequencies of gendered words like &#8216;he&#8217; and &#8216;she&#8217;. If you use someone else&#8217;s list without thinking about it you&#8217;ll probably inherit various biases and assumptions. The kind of text you&#8217;re working with also makes a difference. In the proceedings of parliament words like &#8216;ordered&#8217;, &#8216;resolved&#8217; and &#8216;committee&#8217; occur too regularly to be much use to most people. If you don&#8217;t define your stop words until after you&#8217;ve calculated frequencies for every word you can get a better idea of which words are getting in the way and which ones are interesting.</p>
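<p>As a rough sketch of that last point, the frequencies can be calculated for every word first, and the stop list chosen only after looking at them. The sample text and the stop list here are made up for illustration; note that the gendered pronouns are deliberately kept.</p>

```python
from collections import Counter


def word_frequencies(text):
    """Count every word in the text, with no stop words removed yet."""
    return Counter(text.lower().split())


# Hypothetical sample text standing in for a real source
sample = "ordered that the committee be ordered to report what he and she resolved"
frequencies = word_frequencies(sample)

# Look at the most frequent words first, then decide which ones to stop
for word, count in frequencies.most_common(5):
    print(word, count)

# A hand-made stop list for this text: 'he' and 'she' are deliberately left out of it
stop_words = {"the", "that", "be", "to", "and", "what",
              "ordered", "committee", "resolved"}
interesting = {w: c for w, c in frequencies.items() if w not in stop_words}
print(interesting)
```

<p>Because the stop list is applied after counting, it can be changed and reapplied without reading the text again.</p>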
<h3>BeautifulSoup is not always the answer</h3>
<p>The Python library BeautifulSoup is really useful for extracting data from HTML pages, but maybe I got into the habit of using it too much. This week I was trying to work out how to get some data from pages that didn&#8217;t have a very good semantic structure. Doing it with BeautifulSoup looked like it would be really complicated, but then I realised that in this case regular expressions would be much easier.</p>
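<p>A minimal sketch of the regular expression approach, using a made-up fragment of HTML in which the only structure is a pair of bold labels. The markup, the labels and the values are all assumptions for illustration, not the pages I was actually working with.</p>

```python
import re

# Hypothetical fragment: no classes or ids to hook on to, just labels in bold
html = (
    "<div><b>Date</b> 12 June 1643 <b>Place</b> Westminster</div>"
    "<div><b>Date</b> 13 June 1643 <b>Place</b> Oxford</div>"
)

# Capture whatever sits between the two labels, and between the
# second label and the end of the div (non-greedy, so each match stays short)
pattern = re.compile(r"<b>Date</b>\s*(.*?)\s*<b>Place</b>\s*(.*?)\s*</div>")

records = pattern.findall(html)
for date, place in records:
    print(date, "|", place)
```

<p>This only works while the labels are reliably present and in the same order; where the markup is more consistent a proper parser is still the safer choice.</p>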
<h3>Have sets</h3>
<p>Python includes a collection type called a set, which behaves much like a mathematical set: it is unordered and holds each item only once. That makes it incredibly useful for text-mining scripts. Turning a list into a set automatically gets rid of duplicates. For example, suppose you&#8217;ve split some text into a list of separate words.</p>
<p><code>&gt;&gt;&gt; wordlist = 'it was the best of times it was the worst of times'.split()</code></p>
<p><code>&gt;&gt;&gt; wordlist</code></p>
<p><code>['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']</code></p>
<p><code>&gt;&gt;&gt; wordset = set(wordlist)</code></p>
<p><code>&gt;&gt;&gt; wordset</code></p>
<p><code>set(['of', 'it', 'times', 'worst', 'the', 'was', 'best'])</code></p>
<p>Now we have a set of unique words which we can iterate through using a for loop, counting the occurrences of each word in the list:</p>
<pre>for word in wordset:
    wordcount = wordlist.count(word)</pre>
<p>Then we can do whatever we want with wordcount (print it to the screen, add it to a tuple or a dictionary, write it to a file).</p>
<p>You can also do mathematical operations on sets, which can be really useful for removing stop words.</p>
<p>Suppose we have a set of stopwords:</p>
<p><code>&gt;&gt;&gt; stopwordset = set(['of', 'it', 'the'])</code></p>
<p>We can deduct that from the set of words before we iterate through it:</p>
<p><code>&gt;&gt;&gt; wordset = wordset - stopwordset</code></p>
<p><code>&gt;&gt;&gt; wordset</code></p>
<p><code>set(['was', 'worst', 'best', 'times'])</code></p>
<p>Now the stop words in wordlist are completely ignored, and we don&#8217;t even have to do an if test at every iteration.</p>
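<p>Putting the pieces of this section together gives a short script like the following sketch. It is written for Python 3, where sets can be written and are displayed with braces rather than as <code>set([...])</code>; storing the counts in a dictionary is just one of the options mentioned above.</p>

```python
wordlist = "it was the best of times it was the worst of times".split()
stopwordset = {"of", "it", "the"}

# Set difference removes the stop words before the loop starts,
# so no membership test is needed inside the loop
wordset = set(wordlist) - stopwordset

counts = {}
for word in sorted(wordset):
    # Count occurrences in the original list, which still has the duplicates
    counts[word] = wordlist.count(word)

print(counts)
```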
<h3>A dictionary is a bit like a database</h3>
<p>Python dictionaries can be thought of as very simple databases. Obviously they can&#8217;t do everything that a database can do, but you don&#8217;t have to worry about connections or cursors either. When counting words across multiple files it&#8217;s easy to keep a running total of each word by updating a dictionary at every iteration. If the word is already in the dictionary, add to the existing count; if it isn&#8217;t, add a new key/value pair.</p>
<p>This is how I do it:</p>
<p><code>&gt;&gt;&gt;wordcount = dict()</code></p>
<p>(Then iterate through each file, open and read it etc.)</p>
<pre>for word in wordset:
    if word in wordcount:
        # Seen in an earlier file: add the count for this file to the running total
        wordcount[word] += wordlist.count(word)
    else:
        # First appearance: create a new key/value pair
        wordcount[word] = wordlist.count(word)</pre>
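<p>The standard library also has <code>collections.Counter</code>, a dictionary subclass that keeps running totals without the if test. This is an alternative sketch rather than what I actually used; the function name and the two short texts are made up, and the texts stand in for the contents of separate files.</p>

```python
from collections import Counter


def count_words(texts, stop_words=frozenset()):
    """Keep a running total of word counts across several texts."""
    total = Counter()
    for text in texts:
        words = [w for w in text.lower().split() if w not in stop_words]
        # update() adds to existing counts and creates new keys as needed
        total.update(words)
    return total


# Hypothetical stand-ins for the contents of two files
texts = ["it was the best of times", "it was the worst of times"]
totals = count_words(texts, stop_words={"of", "it", "the"})
print(totals.most_common())
```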
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2011/06/12/text-mining-tips/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multiple Indemnity</title>
		<link>http://www.investigations.4-lom.com/2010/07/20/multiple-indemnity/</link>
		<comments>http://www.investigations.4-lom.com/2010/07/20/multiple-indemnity/#comments</comments>
		<pubDate>Tue, 20 Jul 2010 10:04:33 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[english civil war]]></category>
		<category><![CDATA[flickr]]></category>
		<category><![CDATA[pro]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sp24]]></category>
		<category><![CDATA[wikis]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/?p=809</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Multiple+Indemnity&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2010-07-20&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2010/07/20/multiple-indemnity/&amp;rft.language=English"></span>
As part of the research for my book (saying that still feels a bit weird, but I’m sure I’ll get used to it) I’m going through indemnity cases in class SP 24 in the UK National Archives (aka the PRO). The Indemnity Committee was set up by parliament in 1647 to protect soldiers and officials [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Multiple+Indemnity&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2010-07-20&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2010/07/20/multiple-indemnity/&amp;rft.language=English"></span>
<p>As part of the research for my book (saying that still feels a bit weird, but I’m sure I’ll get used to it) I’m going through indemnity cases in class SP 24 in the UK National Archives (aka the PRO). The Indemnity Committee was set up by parliament in 1647 to protect soldiers and officials from prosecution for actions that they had carried out under the authority of parliament, such as requisitioning things for the army or arresting royalists. It also dealt with disputes over sequestered rents and debts, and helped to enforce parliament’s order that apprentices who joined the army should be allowed to count military service towards their term of apprenticeship. If someone was prosecuted in court for acts which were covered by the Indemnity Ordinance (and many were despite the Ordinance banning people from bringing cases of this kind) the defendant could send a petition to the Indemnity Committee asking for protection.</p>
<p>In SP 24 there are 58 boxes of petitions and other papers relating to cases, such as depositions and lists of expenses. Unlike some classes these are quite well sorted: papers relating to each case are grouped together and sorted in roughly alphabetical order of the plaintiff’s name (although confusingly the plaintiff in an indemnity case is the defendant in the corresponding criminal prosecution). I’m particularly interested in cases relating to horse requisitioning. According to Ian Gentles, about 30% of the military cases involve horses, although from what I’ve seen so far military cases seem to be a minority as many cases are disputes between civilians over payment of rents and debts due to sequestered estates. It usually takes me less than an hour to skim through a box, look at the first petition in each case to see if it’s about horses, and photograph the relevant cases. Sometimes I get cases that look interesting for other reasons, but I try not to wander too far off topic too often.</p>
<p>Since I’m photographing these papers for my research, and since the National Archives allow document images to be uploaded to Flickr, that’s just what I’m doing. I’m also putting transcripts or summaries of the documents, along with links to the images, on the Your Archives wiki. You can see what I’ve done so far, and follow my progress in future, via a <a href="http://www.flickr.com/photos/wenham5thlincs/collections/72157623254203073/">Flickr collection</a> and <a href="http://yourarchives.nationalarchives.gov.uk/index.php?title=Category:Indemnity_Cases">Your Archives category</a>.</p>
<p>So far I’ve uploaded cases from the first 2 boxes. I have another 16 boxes ready to be uploaded, but I’m working on some Python scripts to automate the process. The trial run on the first two boxes proved that doing it all manually is quite labour intensive. First I copied the image files from my camera and sorted them into directories for each box. The directory structure is based on the archival reference, so there’s a directory called “SP 24” with sub-directories called “30”, “31” etc. Then I went into each of these directories and made sub-directories for each case, so it looks like this:</p>
<ul>
<li>SP 24
<ul>
<li>30
<ul>
<li>1 Abeary vs Windebanke</li>
<li>1 Adams vs Haughton</li>
<li>2 Alford vs King</li>
<li>etc</li>
</ul>
</li>
<li>31</li>
</ul>
</li>
</ul>
<p>And the path to a particular case would be:</p>
<p>SP 24/30/2 Alford vs King</p>
<p>Which looks quite similar to the archival reference.</p>
<p>The numbers at the start of the case name are the part number (each box usually contains three folders called part 1, part 2 and part 3 but I decided not to make directories for these). Up to here it has to be done manually as arranging cases into directories involves looking at the documents to see where a new case begins and to check the names. But from here a lot of it can be automated.</p>
<p>Each directory containing one case needs to have its own photoset on Flickr. I used Postr to upload one case at a time and then used Desktop Flickr Organizer to create a set and add photos to it (I got both of these applications from the Ubuntu repository – if you’re on Windows then&#8230; stop using Windows!). Then I used the Organizr on the Flickr website to drag each set into the “SP 24 Indemnity Cases” collection. Once the Flickr photos and sets were in place I went to the web page for each set, manually created a Zotero item for the case, and attached a link to the page. Finally I created a Your Archives page for each case and attached a link to it in Zotero. This includes a template that I made for indemnity cases which gives some basic information in a standardized form and includes a link to the relevant Flickr set. Doing all this manually for each case is quite tedious and takes a long time, so I’m working on some Python scripts to automate the process. What I want the scripts to do is:</p>
<ol>
<li>Upload photos from multiple directories</li>
<li>Create a separate photoset for each directory, with a name based on the directory name and path</li>
<li>Get the ID of each set and write the IDs and names to a CSV file</li>
<li>(At this point I’ll manually edit the CSV file to add data that will be needed for Your Archives and Zotero and which can only be got by looking at the document images, eg full names of plaintiffs and defendants, date of the petition, summary of the case, categories/tags)</li>
<li>Use the data from the CSV file to construct a wiki page with the correct template and upload to Your Archives through the MediaWiki API</li>
<li>Export an XML file which can be imported into Zotero</li>
</ol>
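<p>The local half of steps 2 and 3 can be sketched without touching Flickr at all: walk the box and case directories, build a set name from each path, and write a CSV file with a blank column for the set ID. The function names and column names are made up for illustration, and the upload itself is left out.</p>

```python
import csv
from pathlib import Path


def case_directories(root):
    """Yield (box, case) directory pairs below a root such as 'SP 24'."""
    root = Path(root)
    for box in sorted(p for p in root.iterdir() if p.is_dir()):
        for case in sorted(p for p in box.iterdir() if p.is_dir()):
            yield box, case


def set_name(root, case):
    """Build a photoset name from the path, e.g. 'SP 24/30/2 Alford vs King'."""
    root = Path(root)
    return "/".join((root.name,) + case.relative_to(root).parts)


def write_case_list(root, csv_path):
    """Write one row per case; set_id stays blank until the set exists on Flickr."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["set_id", "set_name", "photo_count"])
        for box, case in case_directories(root):
            photos = [p for p in case.iterdir() if p.is_file()]
            writer.writerow(["", set_name(root, case), len(photos)])
```

<p>Calling <code>write_case_list("SP 24", "cases.csv")</code> from the directory above “SP 24” would give one row per case, ready for the manual editing in step 4.</p>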
<p>So far I’ve written a Flickr upload script which does the first three steps and more or less works. Rather than working directly with the Flickr API I’m using the <a href="http://stuvel.eu/projects/flickrapi">Python Flickr API</a> library, which makes things very easy. It provides a flickr class with methods to handle API calls and authentication. Before using it you have to go to the <a href="http://www.flickr.com/services/apps/create/">App Garden</a> and request an API key, but that doesn’t take long to do. App pages can be kept private, which is what I’m doing in this case as I don’t really have the time or skills to make my scripts fit for public consumption. The next step is to add error handling as the script only works as long as nothing goes wrong. In the real world, there are lots of things that could go wrong. The library throws an exception if it gets an error response from the API. Until I add some exception handling this means that the script just stops on an error. The script will need to keep track of what has and hasn’t been done (photos uploaded, sets created, photos added to sets) so that I can run it again if anything was left undone, and so that it doesn’t try to do the same thing again if it’s already been done. One annoying thing about Flickr’s public API is that it provides no way to create a collection or add sets to a collection. I assumed I’d be able to automate that part of the process but it looks like I’ll still have to do it manually.</p>
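<p>The kind of bookkeeping described above might look something like this sketch. It is not the finished script: the state file format and the function names are assumptions, and <code>action</code> stands for whatever actually does the work (uploading a photo, creating a set), which is passed in rather than shown.</p>

```python
import json
from pathlib import Path


def load_done(state_path):
    """Read the set of tasks already completed, or an empty set on the first run."""
    path = Path(state_path)
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_done(state_path, done):
    Path(state_path).write_text(json.dumps(sorted(done)), encoding="utf-8")


def run_tasks(tasks, action, state_path):
    """Run action(task) for each task not yet done; return the tasks that failed."""
    done = load_done(state_path)
    failed = []
    for task in tasks:
        if task in done:
            continue  # finished on an earlier run, so do not repeat it
        try:
            action(task)
        except Exception as error:
            # Assumption: a real script would catch the library's own exception class
            failed.append((task, str(error)))
            continue
        done.add(task)
        save_done(state_path, done)  # save after every success so a crash loses nothing
    return failed
```

<p>Running the script a second time with the same state file retries only the tasks that failed or were never reached.</p>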
<p>For step 5 I’ll be using the <a href="http://meta.wikimedia.org/wiki/Using_the_python_wikipediabot">Pywikipediabot</a> library. I’ve already done some simple tests on a local MediaWiki installation and it seems quite easy to create a page. Once I’ve finished the script and thoroughly tested it I can ask for a bot account on Your Archives. Step 6 will involve learning a bit more about Zotero RDF. The easiest way to find out how to generate the right code is to export some similar existing items and look at the results.</p>
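<p>Before anything is uploaded, the page text for step 5 has to be built from a row of the CSV file. A rough sketch, in which the template name, the field names and the example row are all hypothetical (the real template on Your Archives may look quite different), and which makes no calls to the bot library:</p>

```python
def build_wikitext(row, template="Indemnity case"):
    """Turn one CSV row (a dict) into wikitext that calls a template."""
    lines = ["{{" + template]
    for field in ("plaintiff", "defendant", "date", "flickr_set"):
        # Missing fields are left empty rather than guessed
        lines.append("| {} = {}".format(field, row.get(field, "")))
    lines.append("}}")
    lines.append("")
    lines.append(row.get("summary", ""))
    return "\n".join(lines)


# Hypothetical row; in practice it would come from csv.DictReader
example_row = {
    "plaintiff": "Alford",
    "defendant": "King",
    "summary": "Summary of the case goes here.",
}
print(build_wikitext(example_row))
```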
<p>So just because I’m writing a monograph it doesn’t mean I’ve abandoned digital history. I’ll still be using lots of digital tricks in the background, but they won’t necessarily be obvious in the text of the book. New technology is certainly making my research quicker and cheaper than it used to be. The stuff that I’ve written about above isn’t exactly revolutionary: it saves labour but it doesn’t offer new insights that couldn’t have been found before. But later in the project I’m planning to do some text mining which I hope will show me things that I couldn’t otherwise have found. I’ll also be revisiting <a href="http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/">phonetic algorithms for place name identification</a>. And if I can’t think of anything else to blog about, there are likely to be some interesting stories in the indemnity cases.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2010/07/20/multiple-indemnity/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Programming Historian</title>
		<link>http://www.investigations.4-lom.com/2008/05/05/the-programming-historian/</link>
		<comments>http://www.investigations.4-lom.com/2008/05/05/the-programming-historian/#comments</comments>
		<pubDate>Mon, 05 May 2008 14:16:35 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[ebooks]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/?p=215</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=The+Programming+Historian&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-05-05&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/05/05/the-programming-historian/&amp;rft.language=English"></span>
Yesterday Bill Turkel announced that The Programming Historian is now available. This is a book, but not as we know it. It&#8217;s published in the form of a website and is completely free to access. As the name suggests, it&#8217;s an introduction to computer programming aimed specifically at historians. The tutorials will get you doing [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=The+Programming+Historian&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-05-05&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/05/05/the-programming-historian/&amp;rft.language=English"></span>
<p>Yesterday <a href="http://digitalhistoryhacks.blogspot.com/2008/05/programming-historian-is-now-available.html">Bill Turkel</a> announced that <a href="http://niche.uwo.ca/programming-historian/">The Programming Historian</a> is now available. This is a book, but not as we know it. It&#8217;s published in the form of a website and is completely free to access. As the name suggests, it&#8217;s an introduction to computer programming aimed specifically at historians. The tutorials will get you doing useful things as soon as possible, even if you have no previous experience of programming. If you do know programming it&#8217;s also worth a look. I found lots of useful tips in it.</p>
<p>By enabling more historians to make better use of digital technology the book is helping to change the way that we do history. And it&#8217;s also helping to change the way that we present our research, because it&#8217;s a concrete example of the advantages of open access publishing on the web. This means a whole lot more than not having to pay to read it. Although the book has been published, it&#8217;s still a work in progress. New chapters will be added in future, and existing ones can be improved in response to feedback from readers. Any typos, factual errors or unclear sentences can all be corrected very easily. Comments from reviewers are displayed on accompanying discussion pages so you can see how the text developed and what people thought of it. The book can keep growing to meet the needs of digital historians: there doesn&#8217;t ever have to be a point when it&#8217;s finally finished like there is with a printed book.</p>
<p>Go and read it. Now.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/05/05/the-programming-historian/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Identifying places 2</title>
		<link>http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/</link>
		<comments>http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/#comments</comments>
		<pubDate>Tue, 15 Apr 2008 09:26:00 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[metaphones]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Identifying+places+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-04-15&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/&amp;rft.language=English"></span>
Last week I posted about experiments with Python to automatically identify places mentioned in lists of horses donated to parliament&#8217;s armies in the English Civil War. The initial results were very encouraging. Using the difflib algorithm to compare a selection of places with a list of Buckinghamshire parishes gave very encouraging results. Since then I&#8217;ve [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Identifying+places+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-04-15&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/&amp;rft.language=English"></span>
<p><a href="http://www.investigations.4-lom.com/2008/04/08/identifying-places/">Last week</a> I posted about experiments with Python to automatically identify places mentioned in lists of horses donated to parliament&#8217;s armies in the English Civil War. The initial results, using the <code>difflib</code> algorithm to compare a selection of places with a list of Buckinghamshire parishes, were very encouraging. Since then I&#8217;ve scaled it up and also tried some different approaches. The results are less clear cut when comparing bigger lists, but I&#8217;ve been able to write a program which should save me a lot of time compared to the manual methods that I used during my PhD.</p>
<p><span id="more-204"></span>After the Buckinghamshire test, the next step was to try a bigger selection of places but still limited to one county. I pulled out all the place names which are specified in the manuscript as being in Essex. After deleting duplicates this gave 240 unique strings. Comparing these with the list of Essex parishes gave a correct top answer in 181 cases. Comparing with a list of parishes for the whole of England (which is over 10,000 parishes) gave 121 correct top answers. A 50% success rate is more than I was expecting at this stage. Although it was usually the most obvious matches which came out top it would still save a lot of work as the program did all the comparisons within a few minutes. The biggest problem is the huge number of false positives returned, which can make it difficult to find the most likely answer if it doesn&#8217;t have the highest ratio. A threshold of 0.55 gave a total of 28,077 results &#8211; an average of over 100 per place name. Raising the threshold could cut this down but some correct answers were slightly below 0.55.</p>
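<p>The comparison itself can be sketched in a few lines with <code>difflib.SequenceMatcher</code>. This is a simplified illustration rather than my actual program: the function names are made up, the short parish list stands in for the real lists, and lower-casing both strings is an assumption.</p>

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity between two names, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(place, parishes, threshold=0.55):
    """Return (ratio, parish) pairs at or above the threshold, best first."""
    scored = [(ratio(place, parish), parish) for parish in parishes]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# A tiny stand-in for the list of parishes
parishes = ["Dedham", "Debden", "Great Maplestead", "Ipswich"]
matches = candidates("Deddom", parishes)
for score, parish in matches:
    print(round(score, 2), parish)
```

<p>With thousands of parishes the same loop produces the flood of results described above, because every name that clears the threshold is kept.</p>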
<p>After this I looked into different algorithms and decided to try combining difflib with metaphones. A metaphone is a phonetic representation of a string in which the vowels are stripped out (except at the start of a word) and consonants are simplified. The phonetic rules are based on English so it won&#8217;t work in other languages, and it seems to be a very precise Received Pronunciation kind of English, which doesn&#8217;t necessarily reflect how English is spoken now or was spoken in the 17th century. I used PHP to convert place and parish names into metaphones because it has a built-in metaphone function, then put them into a SQLite database and wrote a Python script to pull them out and compare them using difflib.</p>
<p>To start with, small-scale tests were very encouraging. Metaphone tends to iron out minor variations in spelling, increasing the chances of getting an exact match. For example, Dedham might be spelt Deddom, Deddome, Dedhame etc but these all reduce to the metaphone TTM. Correct answers tended to have a higher ratio with metaphone than with difflib alone. This method can cope with some very tricky cases. I remember once at a conference Peter Edwards was discussing the records of a horse fair and noted that one buyer or seller was recorded as coming from Asbedelesey but he had no idea where that was. Then there was a eureka moment as someone in the audience said &#8220;Ashby de la Zouch&#8221;. The connection becomes much more obvious when the strings are converted to metaphones: Ashby de la Zouch is AXBTLSX and Asbedelesey is ASBTLS. Comparing the metaphones with difflib gives a ratio of 0.77 instead of the 0.43 you get from comparing the original strings. The difflib ratio for &#8220;Ipswich&#8221; and &#8220;Ipsage&#8221; is only 0.46, but converting them to metaphones gets a ratio of 0.67.</p>
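<p>The Ashby de la Zouch figures can be reproduced with a few lines of Python. The metaphone codes are typed in exactly as quoted above, since they came from PHP and Python has no metaphone function in its standard library; the dictionary and the helper name are just for illustration.</p>

```python
from difflib import SequenceMatcher


def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()


# Metaphone codes as quoted in the text (generated there with PHP)
metaphones = {"Ashby de la Zouch": "AXBTLSX", "Asbedelesey": "ASBTLS"}

raw = ratio("Ashby de la Zouch", "Asbedelesey")
phonetic = ratio(metaphones["Ashby de la Zouch"], metaphones["Asbedelesey"])
ipswich_raw = ratio("Ipswich", "Ipsage")

print(round(raw, 2), round(phonetic, 2), round(ipswich_raw, 2))  # 0.43 0.77 0.46
```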
<p>The problem with this approach is that although it allows a higher threshold the number of results gets even higher.  Comparing the list of Essex places with all 10,000 parishes returned 41,858 results even with a threshold of 0.65. If you can limit the search to a single county comparisons are likely to be very accurate but if you don&#8217;t know the county and have to compare with the whole country there are too many possibilities.</p>
<p>One quick way of eliminating false positives is to compare the first letter of the metaphones. If these don&#8217;t match then the places are highly unlikely to be the same. Although there will probably be a few cases where correct answers don&#8217;t have matching first letters it&#8217;s worth losing these to make the results more manageable. Trying this with the Essex results reduced the total number of matches from 41,858 to 17,616.  One reason why correct answers might not have matching first letters is inconsistent use of qualifying words such as great, little, north, south, east, west, upper, lower, etc. Sometimes these might be omitted in the manuscript or given in different forms eg &#8220;Maplestead Magna&#8221; instead of &#8220;Great Maplestead&#8221;. Therefore I added another column to the database containing metaphones calculated after stripping out these stop words. That allows comparisons with and without qualifying words, which should give the best of both worlds.</p>
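<p>Both refinements are simple to express in Python. In this sketch the list of qualifying words is only a starting point: &#8220;magna&#8221; and &#8220;parva&#8221; are my assumptions about the Latin forms worth including, and the function names are made up for illustration.</p>

```python
# Qualifying words to ignore; extend to suit the sources being compared
QUALIFIERS = {"great", "little", "north", "south", "east", "west",
              "upper", "lower", "magna", "parva"}


def strip_qualifiers(name):
    """Remove qualifying words so that variant forms of a name line up."""
    words = [w for w in name.lower().split() if w not in QUALIFIERS]
    return " ".join(words)


def same_first_letter(code_a, code_b):
    """Cheap filter: keep a pair only if both metaphones start with the same letter."""
    return bool(code_a) and bool(code_b) and code_a[0] == code_b[0]


print(strip_qualifiers("Great Maplestead"), "|", strip_qualifiers("Maplestead Magna"))
print(same_first_letter("AXBTLSX", "ASBTLS"))
```

<p>The stripped names would then be converted to metaphones in the same way as the full names, giving the second column described above.</p>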
<p>Finally I put all of this together into a function which I can call from the command line. This allows me to search for one place at a time and specify the threshold and whether to limit the search to a specific county. The results are fed back into database tables: one with stop words and one without. During the tests it became obvious that I&#8217;d need this flexibility because there was no chance of finding a perfect one size fits all approach. The basic problem is just that there are too many places in England with too similar names. Choosing between them is always going to be an arbitrary decision, but with this Python program I can make a more informed decision more quickly.</p>
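<p>A command line interface along these lines can be sketched with <code>argparse</code>. The option names and the default threshold are assumptions for illustration, not necessarily the ones in my program, and the search itself is left out.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse the place name and the optional search settings."""
    parser = argparse.ArgumentParser(
        description="Find likely parishes for a place name")
    parser.add_argument("place", help="place name as spelt in the manuscript")
    parser.add_argument("--threshold", type=float, default=0.65,
                        help="minimum ratio for a match to be reported")
    parser.add_argument("--county", default=None,
                        help="limit the search to one county")
    return parser.parse_args(argv)


# Passing a list makes the function easy to try out without a real command line
args = parse_args(["Deddom", "--county", "Essex"])
print(args.place, args.threshold, args.county)
```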
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/04/15/identifying-places-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Identifying Places</title>
		<link>http://www.investigations.4-lom.com/2008/04/08/identifying-places/</link>
		<comments>http://www.investigations.4-lom.com/2008/04/08/identifying-places/#comments</comments>
		<pubDate>Tue, 08 Apr 2008 10:17:58 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/04/08/identifying-places/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Identifying+Places&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-04-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/04/08/identifying-places/&amp;rft.language=English"></span>
Never mind the scary theory, here’s some empiricism. And computer programming. The piece I’m working on is an analysis of lists of horses donated to the parliamentarian army in the First Civil War. There are some figures derived from these lists in my forthcoming article in War In History and in the seminar paper that [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Identifying+Places&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-04-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/04/08/identifying-places/&amp;rft.language=English"></span>
<p>Never mind the scary theory, here’s some empiricism. And computer programming. The piece I’m working on is an analysis of lists of horses donated to the parliamentarian army in the First Civil War. There are some figures derived from these lists in my forthcoming article in <em>War In History</em> and in the <a href="http://www.investigations.4-lom.com/2007/11/15/the-great-supply-chain-of-being/">seminar paper</a> that I posted in November, but I’m trying to write an article which examines them in much more detail. This article will be related to debates over allegiance and the causes of the war, which is why I’ve been trying to explore the historiography and think about theoretical issues, but the substance of it will be fairly straightforward empirical stuff with lots of numbers. That’s not to say that this kind of analysis is easy. If it were, someone else might have done it all years ago. John Tincey was the first person to try it, but he only did the smallest of the three account books, which is a fraction of the size of the other two. Following his lead I decided to do all of them.</p>
<p>In 1999 I spent about 2 weeks in the PRO typing these lists into an Access database. I’m still using that transcript as the basis of my work now, although I’ve converted it to XML to make it more flexible and checked a selection of the entries against digital photos of the manuscript. I’ve been using the Python classes that I developed for representing uncertainty to calculate totals of horses and values. Some pages are damaged, meaning that exact totals can’t be calculated – this is something that was difficult to deal with in Access but the combination of XML and Python has enough flexibility to cope with it. Getting totals for days and months is fairly easy, but I also want to group by the social status of the donors and the counties that they came from. Before I can group by counties I need to identify the place names given in the manuscript, because although some entries specify a county in the address, many more give a place name without one.</p>
<p><span id="more-201"></span>Back in the last century when I was doing my PhD life was hard. We didn’t have Google Maps back then. To identify places I just looked in an Ordnance Survey atlas. This was a lot of hard work. Even if the spellings of place names exactly matched the modern equivalents, looking up hundreds of names in the index would have been very tedious. But of course 17th century spelling varied wildly so it took a lot of lateral thinking to track them down. Once the obvious ones were identified some patterns emerged which made it a bit easier as entries were often likely to be clustered together, but this still meant that I had to resort to scouring the map looking for likely possibilities. I managed this well enough to include a breakdown of donations by county in my thesis, but now I want to do it better.</p>
<p>Inspired by <a href="http://digitalhistoryhacks.blogspot.com/2007/06/clustering-with-compression.html">Bill Turkel</a>’s work with compression algorithms I decided to look into ways of automatically comparing addresses from the manuscript with a list of place names. My first attempt uses <code>difflib</code>, a module in the Python standard library designed to compare sequences such as strings. I have no idea how it’s implemented but it provides a class called <code>SequenceMatcher</code> which has a method, <code>ratio()</code>, that compares two strings and returns a measure of their similarity to each other: 1 is an exact match and 0 means they have nothing in common at all. The library reference says that, as a rule of thumb, anything above 0.6 can be considered a close match, although this is obviously subjective. Writing a Python script to loop through lists of words and compare them with each other is fairly trivial, but first I had to get the lists.</p>
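<p>A minimal sketch of that comparison, written for current Python 3. The <code>similarity</code> helper and the decision to lower-case both strings are assumptions added for illustration, not part of the original script.</p>

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return difflib's similarity ratio for two strings (0.0 to 1.0)."""
    # The first argument is an optional "junk" filter; None disables it.
    # Lower-casing is an added assumption so that case does not affect the score.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two manuscript spellings from this post against their likely modern forms.
print(similarity("Stretton Audley", "Stratton Audley"))
print(similarity("Agmondisham", "Amersham"))
```

<p>The ratio is twice the number of matching characters divided by the combined length of the two strings, so a single changed letter in a long name costs very little.</p>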
<p>I downloaded a list of parishes from the <a href="http://www.parloc.pwp.blueyonder.co.uk/parlocdl.html">Parish Locator</a> website as a CSV file. I don’t know exactly how complete or accurate this list is, but it should be adequate for my purposes. In order to test things on a small scale I took addresses from a short list compiled by the Buckinghamshire commissaries at Aylesbury which is treated separately from the main list in the manuscript. Most places in this list are likely to be in Buckinghamshire. I put the addresses into a database table and deleted duplicates, leaving 55 unique strings (some of these are likely to be different spellings of the same place, but for programming purposes they’re different information). Then I copied the Buckinghamshire parishes from the CSV file and put them into another table.</p>
<p>As a control, I manually compared the two lists and recorded my best guess as to the identity of each place. All but 8 could be identified as Buckinghamshire parishes fairly easily, although a further two were ambiguous as they could have been either of two similarly named places (Great or Little Missenden, Weston Turville or Weston Underwood). For the purposes of calculating county totals this wouldn’t be a problem as long as all of the possibilities were in the same county. Of the 8 which were not obviously Bucks parishes, 3 could easily be identified (using Google Maps) as settlements in Bucks which are/were not parishes in their own right. This is potentially the biggest problem with this method as I don’t have a full list of settlements in 17th century England. Another 2 places, given in the MS as “Throppe” and “Stretton Audley”, are highly likely to be Thrupp and Stratton Audley in Oxfordshire (although there’s also a chance that Throppe could be Castle Thorpe, Bucks). The place given as “Lannden” might be Lavendon, Bucks, or Launton, Oxon. Finally I have absolutely no idea where “Polycott” or “Ramsfee” might be. It could be that I’ve mistranscribed these (and my photos of the relevant pages are blurred enough to be ambiguous), which is another potential problem, but one which the string comparison function might help to overcome.</p>
<p>The Python program pulled the lists of names out of both database tables and compared every possible combination. If the ratio was greater than 0.5 the result (address from MS, matching parish name, and ratio) was written back to a third database table. The results that ended up in this table were very encouraging. The highest scoring match for each name agreed with my best guess in every case where I’d selected a Bucks parish. It easily dealt with cases which I thought might be tricky, such as “Agmondisham” for Amersham. As expected, it couldn’t do much with cases where the most likely answer was not in the list of Bucks parishes, but this is entirely a case of Garbage In Garbage Out. Even here there were no misleadingly high ratios.</p>
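<p>The all-against-all comparison could be sketched roughly as follows. This version works on plain lists rather than database tables, and the function name, the lower-casing and the tiny sample lists are assumptions for illustration only (the sample mixes in Oxfordshire parishes, which the real test list did not contain).</p>

```python
from difflib import SequenceMatcher

THRESHOLD = 0.5  # the cut-off used in the test run described above


def match_places(addresses, parishes, threshold=THRESHOLD):
    """Compare every address with every parish and keep pairs above threshold.

    Returns (address, parish, ratio) tuples grouped by address, with the
    highest-scoring parish first within each group.
    """
    results = []
    for address in addresses:
        for parish in parishes:
            ratio = SequenceMatcher(None, address.lower(), parish.lower()).ratio()
            if ratio > threshold:
                results.append((address, parish, ratio))
    # Sort by address, then by descending ratio so the top candidate comes first.
    results.sort(key=lambda row: (row[0], -row[2]))
    return results


# Illustrative sample only: three spellings from the post and a few parish names.
addresses = ["Agmondisham", "Stretton Audley", "Throppe"]
parishes = ["Amersham", "Aylesbury", "Stratton Audley", "Thrupp"]

for address, parish, ratio in match_places(addresses, parishes):
    print(f"{address:16} {parish:16} {ratio:.2f}")
```

<p>Every address is compared with every parish, so the work grows with the product of the two list lengths. That is negligible for 55 addresses and one county but worth bearing in mind for a national parish list.</p>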
<p>The next step is to try scaling it up to compare the whole lists of addresses with the list of parishes for the entire country. Although the Bucks test would most likely have correctly identified the Oxon parishes if it had the complete parish list to test against, there is also a possibility that there will be more false positives when there are more parishes to choose from. There would be some similar issues if I could get a complete list of settlements: less chance of the algorithm being completely lost, but more chance of getting too many possible results. When working on a larger scale I might have to raise the threshold for matches. I made it 0.5 for the test to see what happened, but the results suggest that 0.55 might be better in future. Where the algorithm can’t help at all is where two or more places have exactly the same name. I was always expecting to have to make some arbitrary decisions. Although the algorithm worked better than I was expecting there are still going to be many cases where I have to decide. Even in these cases the program should save a lot of time by quickly putting together a list of reasonable possibilities to choose from.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/04/08/identifying-places/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Top Tips for the Python Programmer</title>
		<link>http://www.investigations.4-lom.com/2008/03/03/top-tips-for-the-python-programmer/</link>
		<comments>http://www.investigations.4-lom.com/2008/03/03/top-tips-for-the-python-programmer/#comments</comments>
		<pubDate>Mon, 03 Mar 2008 12:45:10 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/03/03/top-tips-for-the-python-programmer/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Top+Tips+for+the+Python+Programmer&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-03-03&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/03/03/top-tips-for-the-python-programmer/&amp;rft.language=English"></span>
Last week I learnt about using exceptions, which turned out to be the solution to a problem that I&#8217;ve mentioned before: if you try to do anything with a variable that hasn&#8217;t been initialized, Python throws an exception. In many ways this is good, because trying to do things with non-existent variables can otherwise be [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Top+Tips+for+the+Python+Programmer&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-03-03&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/03/03/top-tips-for-the-python-programmer/&amp;rft.language=English"></span>
<p>Last week I learnt about using exceptions, which turned out to be the solution to a problem that I&#8217;ve mentioned before: if you try to do anything with a variable that hasn&#8217;t been initialized, Python throws an exception. In many ways this is good, because trying to do things with non-existent variables can otherwise be a source of hard to find bugs. However, I found it quite annoying when I got exceptions just for checking for a variable and only trying to do things with it inside an if block if it exists. Even the seemingly innocuous statement &#8220;if x&#8221; will bring a program to a halt if x doesn&#8217;t exist.</p>
<p>The way around this is to handle the exception when it occurs, so that the program keeps running but you know that the variable doesn&#8217;t exist. For example:</p>
<pre>
try:
    x
except NameError:
    # x doesn't exist
    pass
else:
    # x does exist
    pass</pre>
<p>This code tries to reference x. If x doesn&#8217;t exist, an exception occurs and the code after except is executed (but only if the type of exception thrown matches what&#8217;s specified between except and the colon). If x does exist the code after else is executed instead. This is a bit long winded compared to &#8220;if x&#8221; but it&#8217;s better than nothing. Also be aware that different kinds of variables throw different kinds of exceptions when they don&#8217;t exist.</p>
<ul>
<li>NameError &#8211; any single instance of a built in type or custom object, or a sequence where the sequence itself doesn&#8217;t exist, eg x</li>
<li>IndexError &#8211; numerical index of a sequence where the sequence exists but the specified element doesn&#8217;t, eg x[5]</li>
<li>KeyError &#8211; key of a mapping (such as a dictionary) where the mapping exists but an element with that key doesn&#8217;t, eg x['y']</li>
<li>AttributeError &#8211; attribute of an object where the object exists but the specified attribute doesn&#8217;t, eg x.y (often occurs with objects representing XML elements as it can be difficult to predict what child elements they contain)</li>
</ul>
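<p>The four cases in the list can be reproduced with a short sketch. The <code>describe</code> helper, the <code>Element</code> class and the sample data are hypothetical names invented for this illustration.</p>

```python
class Element:
    """Hypothetical stand-in for an object whose attributes vary."""
    title = "Identifying Places"


def describe(lookup):
    """Run a zero-argument lookup and report which exception, if any, it raised."""
    try:
        lookup()
    except (NameError, IndexError, KeyError, AttributeError) as error:
        return type(error).__name__
    else:
        return "ok"


items = [1, 2, 3]
counties = {"Bucks": 1}
element = Element()

print(describe(lambda: undefined_name))     # name was never assigned
print(describe(lambda: items[5]))           # list exists, index 5 does not
print(describe(lambda: counties["Oxon"]))   # dict exists, key does not
print(describe(lambda: element.author))     # object exists, attribute does not
print(describe(lambda: items[0]))           # nothing goes wrong
```

<p>For the dictionary and attribute cases there are also built-in shortcuts that return a default instead of raising: <code>counties.get("Oxon")</code> and <code>getattr(element, "author", None)</code>.</p>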
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/03/03/top-tips-for-the-python-programmer/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Representing uncertainty in Python</title>
		<link>http://www.investigations.4-lom.com/2008/02/28/representing-uncertainty-in-python/</link>
		<comments>http://www.investigations.4-lom.com/2008/02/28/representing-uncertainty-in-python/#comments</comments>
		<pubDate>Thu, 28 Feb 2008 11:50:21 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/02/28/representing-uncertainty-in-python/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Representing+uncertainty+in+Python&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-02-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/02/28/representing-uncertainty-in-python/&amp;rft.language=English"></span>
In December I wrote some Python code to do calculations with pre-decimal British currency. As well as dealing with the awkwardness of pounds, shillings, and pence, I needed to allow for situations where a damaged or illegible manuscript made the values uncertain. To start with I wrote a class called MetaOldMoney which could store exact [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Representing+uncertainty+in+Python&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-02-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/02/28/representing-uncertainty-in-python/&amp;rft.language=English"></span>
<p>In <a href="http://www.investigations.4-lom.com/2007/12/24/new-old-money/">December</a> I wrote some Python code to do calculations with pre-decimal British currency. As well as dealing with the awkwardness of pounds, shillings, and pence, I needed to allow for situations where a damaged or illegible manuscript made the values uncertain. To start with I wrote a class called MetaOldMoney which could store exact amounts of money or ranges of values.</p>
<p>Now I&#8217;ve written some new code which can easily deal with uncertain values of anything. There are three classes: one to represent an exact value, one to represent a range where the minimum and maximum values are known, and one to represent a minimum value with no maximum. Instances of all three classes contain a tuple of two values. For an exact value they&#8217;re both the same, for a range they contain the lower and upper bounds, and for a minimum the second value is set to None. The addition operator is redefined so that any combination of these objects can be added together, returning an object of the correct type, eg Exact + Range = Range etc. The best thing is that the values contained can be of absolutely any type. Taking full advantage of the Python approach to typing, the classes I&#8217;ve defined don&#8217;t even need to know what they contain. The addition will just work as long as the contained objects can be added together.</p>
<p>These classes are so flexible that there are lots of different ways I could use them. I could put my OldMoney objects inside them, or I could define a new money object which contains individual uncertain values for pound, shillings and pence. I could even nest the objects inside each other to allow for situations where the maximum value in a range is also a range.</p>
<p>Code below:<span id="more-184"></span></p>
<pre>
class MetaExact:
    """An exactly known value, stored as the tuple (value, value)."""

    def __init__(self, value):
        self.value = value, value

    def __add__(self, other):
        # Test for None explicitly: an upper bound of 0 is falsy but valid.
        if other.value[1] is not None:
            if other.value[0] == other.value[1]:
                return MetaExact(self.value[0] + other.value[0])
            else:
                return MetaRange(self.value[0] + other.value[0],
                                 self.value[0] + other.value[1])
        else:
            return MetaMin(self.value[0] + other.value[0])


class MetaRange:
    """A value between two known bounds, stored as (lower, upper)."""

    def __init__(self, lower, upper):
        self.value = lower, upper

    def __add__(self, other):
        if other.value[1] is not None:
            return MetaRange(self.value[0] + other.value[0],
                             self.value[1] + other.value[1])
        else:
            return MetaMin(self.value[0] + other.value[0])


class MetaMin:
    """A minimum with no known maximum, stored as (value, None)."""

    def __init__(self, value):
        self.value = value, None

    def __add__(self, other):
        return MetaMin(self.value[0] + other.value[0])


def sumMeta(values):
    # value is a (lower, upper) tuple, so subtract its first element from
    # itself to get a zero of whatever type the objects contain.
    first = values[0].value[0]
    total = MetaExact(first - first)
    for x in values:
        total += x
    return total</pre>
<p>The sumMeta function works like the built-in sum function but with Meta objects. Again it doesn&#8217;t need to know what kind of objects are contained in the value tuples of the Meta objects. The total variable is initialized with a MetaExact object containing objects of the correct type but with 0 value by subtracting the first value in the sequence from itself.</p>
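<p>One possible variation, sketched here as an assumption rather than as the code this post uses: start the total from the first object instead of building a zero, so that the contained values only need to support addition and not subtraction. <code>Exact</code> is a cut-down stand-in for MetaExact so that the example runs on its own, and <code>sum_meta</code> is a hypothetical name.</p>

```python
from fractions import Fraction
from functools import reduce


class Exact:
    """Cut-down stand-in for MetaExact: an exact value stored as a pair."""

    def __init__(self, value):
        self.value = value, value

    def __add__(self, other):
        return Exact(self.value[0] + other.value[0])


def sum_meta(values):
    """Add a non-empty sequence of Meta-style objects, starting from the first.

    No zero value is constructed, so the contents need only support +.
    """
    return reduce(lambda total, item: total + item, values)


# The same function works unchanged for ints, Fractions, or even strings.
print(sum_meta([Exact(1), Exact(2), Exact(3)]).value)
print(sum_meta([Exact(Fraction(1, 2)), Exact(Fraction(1, 4))]).value)
print(sum_meta([Exact("ab"), Exact("cd")]).value)
```

<p>Like sumMeta, this fails on an empty sequence, because there is no first element to start from.</p>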
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/02/28/representing-uncertainty-in-python/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Places</title>
		<link>http://www.investigations.4-lom.com/2008/01/28/places/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/28/places/#comments</comments>
		<pubDate>Mon, 28 Jan 2008 12:01:46 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/28/places/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Places&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/28/places/&amp;rft.language=English"></span>
Following on from adding an interactive index of people to my digital edition of Sandall&#8217;s history of 5th Lincs, I&#8217;ve now added a similar feature for place names. It works in exactly the same way as the person index, but it also has a map view. Again this uses the Exhibit API, which makes it [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Places&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/28/places/&amp;rft.language=English"></span>
<p>Following on from adding an interactive index of people to my digital edition of Sandall&#8217;s history of 5th Lincs, I&#8217;ve now added a similar feature for <a href="http://www.4-lom.com/sandall/place-index.html">place names</a>. It works in exactly the same way as the person index, but it also has a map view. Again this uses the <a href="http://simile.mit.edu/exhibit/">Exhibit API</a>, which makes it very easy to mash up data with Google Maps without even having to know anything about the Google Maps API. The map view is a bit slower than the normal view, especially if the list isn&#8217;t filtered, but that&#8217;s an inherent limitation of using maps.</p>
<p>One of the many cool things about the map is that it strikingly illustrates the allied advances in the last months of the First World War. If you go into the map view and click &#8220;The Beginning of the Great Advance&#8221; on the list of chapters, you&#8217;ll see the battalion holding the line in Flanders, then moving behind the lines for rest near Amiens, then moving up to the front line at Saint-Quentin. Then click on each of the following chapters in turn and watch the markers surge forward as 46th Division breaks through the Hindenburg Line and pushes towards Belgium.</p>
<p>Adding the place index was mostly similar to adding the person index: I added a unique id to each &lt;placeName&gt; tag using a Python script, pulled out the place names into an SQLite database, identified/disambiguated them and added a regularized name, then used another Python script to pull the regularized names out of the database and put them into the key attributes in the XML file. Identifying the places was easier than identifying people, and took a couple of days, although there are a few that I couldn&#8217;t find. As with people I added some code to the XSLT to generate a JSON file of all the places. Then following the <a href="http://simile.mit.edu/wiki/Exhibit/2.0/Map_View_Tutorial">map view tutorial</a> I used the Exhibit API to pull latitude and longitude co-ordinates from Google Maps and put them into another JSON file. This turned out to be a bit unreliable as about 10 per cent of the places had their co-ordinates missing. It seems to be random, as running the script again with the same set of data produced a similar error rate but with different places. I had to take the missing places from the output file, put them into another input file and run the script over them again, which produced a similar 10 per cent error rate, but the remaining few co-ordinates could be put in manually. Once I had a JSON file with all the correct geocodes it was easy to copy code from the tutorial to add a map view to the Exhibit page. In a few cases it turned out that Google had given me the wrong co-ordinates. Mostly this was because there are two or more places with the same name and it had picked the wrong one. I thought I&#8217;d put in enough information from my manual searches to disambiguate them but it seems that the results of a Google Map search can be a bit unpredictable, and don&#8217;t necessarily give you the full address of a place.</p>
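<p>The first step, giving each &lt;placeName&gt; a unique id, might look something like the sketch below. It uses the standard library <code>xml.etree.ElementTree</code> module; the id prefix, the function name and the sample paragraph are assumptions, and the real script may well have worked differently.</p>

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
PLACE_TAG = f"{{{TEI_NS}}}placeName"


def add_place_ids(root, prefix="place"):
    """Give every placeName element that lacks an xml:id a numbered one.

    Returns the number of ids added. Assumes no existing id already uses
    the same prefix, otherwise duplicates could result.
    """
    added = 0
    for element in root.iter(PLACE_TAG):
        if XML_ID not in element.attrib:
            added += 1
            element.set(XML_ID, f"{prefix}{added}")
    return added


# Build a tiny sample paragraph in code rather than reading the real edition.
paragraph = ET.Element(f"{{{TEI_NS}}}p")
paragraph.text = "From "
first = ET.SubElement(paragraph, PLACE_TAG)
first.text, first.tail = "Amiens", " to "
second = ET.SubElement(paragraph, PLACE_TAG)
second.text, second.tail = "Saint-Quentin", "."

ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
print(add_place_ids(paragraph), "ids added")
print(ET.tostring(paragraph, encoding="unicode"))
```

<p>Because elements that already have an id are skipped, running the function a second time over the same tree adds nothing.</p>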
<p>I&#8217;ve now done most of what I planned to do in this phase. There are still some features that could be added, especially a feedback mechanism, but I&#8217;ll be giving this project a rest soon so I can do some English Civil War work.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/28/places/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Marking Up Names: Part 2</title>
		<link>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/#comments</comments>
		<pubDate>Sat, 19 Jan 2008 15:01:34 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/&amp;rft.language=English"></span>
My digital edition of Sandall&#8217;s History of 1/5th Lincolnshire Regiment now has a new index of people. In my last post I described how names were marked up in the text. This post is about how I linked them together. I&#8217;d already used a Python script to generate a unique id for every &#60;persName&#62; and [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/&amp;rft.language=English"></span>
<p>My digital edition of Sandall&#8217;s History of 1/5th Lincolnshire Regiment now has a new <a href="http://www.4-lom.com/sandall/people-index.html">index of people</a>. In my last post I described how names were marked up in the text. This post is about how I linked them together.</p>
<p><span id="more-170"></span>I&#8217;d already used a Python script to generate a unique id for every &lt;persName&gt; and &lt;rs&gt; tag, and to build a key using components of the name. Next I wrote another Python script to pull all the names out of the XML document and put them into a SQLite database so that I could standardize the key values. The SQLite clients I have are good for running queries but not so good for manually editing data, so I exported the table to CSV and opened it in Excel. That was possibly a mistake: when I came to feed the data back into the XML I ran into some nasty character encoding problems, but it was partly my own fault for letting some unnecessary non-ASCII characters creep in through copying and pasting text from web pages. Apart from that, editing the data in a spreadsheet was easy and convenient.</p>
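<p>A rough sketch of the extract-and-export step, using only the standard library. The table layout, function names and sample names are assumptions, and the TEI namespace is left out to keep the example short.</p>

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def load_names(root, connection):
    """Copy the id, key and text of every persName element into a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS names (id TEXT PRIMARY KEY, key TEXT, text TEXT)"
    )
    rows = [
        (element.get(XML_ID), element.get("key"), "".join(element.itertext()))
        for element in root.iter("persName")
    ]
    connection.executemany("INSERT INTO names VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


def export_csv(connection, stream):
    """Write the names table to an open text stream as CSV."""
    writer = csv.writer(stream)
    writer.writerow(["id", "key", "text"])
    writer.writerows(connection.execute("SELECT id, key, text FROM names ORDER BY id"))


# Illustrative sample: two name elements with ids but no standardized key yet.
root = ET.Element("p")
for number, name in enumerate(["Lt-Col Sandall", "Capt Leadbeater"], start=1):
    ET.SubElement(root, "persName", {XML_ID: f"pers{number}", "key": ""}).text = name

connection = sqlite3.connect(":memory:")
load_names(root, connection)
buffer = io.StringIO()
export_csv(connection, buffer)
print(buffer.getvalue())
```

<p>When writing to a real file, opening it with <code>open(path, "w", newline="", encoding="utf-8")</code> makes the encoding explicit, which is one way to reduce the risk of the character encoding problems described above. Whether the spreadsheet preserves that encoding when it saves is a separate question.</p>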
<p>I decided to standardize the keys in the following form: surname, full forenames (or initials where names not known), service number (Other Ranks only; post-1917 6 digit number preferred as these were unique within regiment), rank (officers only; highest rank known to have been held during the First World War), years of birth and death (if known). Officers commissioned from the ranks can have a rank and number. Here are a few examples:</p>
<p>Sandall, Thomas Edward, Lt-Col (d 1930)<br />
Stuart-Wortley, Edward James Montagu, Maj-Gen (1857-1934)<br />
Pickard, Herbert, 240010 (d 1917)<br />
Leadbeater, Conrad, 240252 Capt<br />
Howard, William, 241683</p>
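<p>The key format can be expressed as a small function. This is a sketch inferred from the five examples above, not the code used for the edition: <code>build_key</code> and its parameter names are hypothetical, and the case of a known birth year with an unknown death year is not handled because none of the examples show it.</p>

```python
def build_key(surname, forenames, number=None, rank=None, born=None, died=None):
    """Assemble a standardized name key from whichever components are known."""
    key = f"{surname}, {forenames}"
    # Service number and rank share one comma-separated slot, number first.
    service = " ".join(part for part in (number, rank) if part)
    if service:
        key += f", {service}"
    # Dates go in brackets: both years, or just a death year prefixed with "d".
    if born and died:
        key += f" ({born}-{died})"
    elif died:
        key += f" (d {died})"
    return key


print(build_key("Sandall", "Thomas Edward", rank="Lt-Col", died=1930))
print(build_key("Leadbeater", "Conrad", number="240252", rank="Capt"))
print(build_key("Howard", "William", number="241683"))
```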
<p>In the first pass I just wanted to disambiguate names as quickly as possible. In most cases this could be done from internal evidence, particularly the original index. There were some cases where I needed to check online sources, but even so the process only took a few hours and was completed within one day. There were a few names which couldn&#8217;t be disambiguated with any certainty so I&#8217;ve had to ask the 5th Lincs experts for help.</p>
<p>In the second pass I tried to make the keys as full and detailed as possible by searching online sources for full names, service numbers, and dates of death. This was much more labour intensive and took several days. It isn&#8217;t helped by the fact that both Sandall&#8217;s book and the online sources contain many errors and inconsistencies, especially with names and service numbers. This led to lots of lateral thinking, deduction, trial and error, and ultimately frustration. There are still some cases where I couldn&#8217;t positively identify an individual or couldn&#8217;t find any trace of any possible candidates in the available records. Most of these are junior officers who are only mentioned briefly. All this research wasn&#8217;t strictly necessary at this stage, but I&#8217;m looking ahead to the possibility of using a wiki for biography pages, which will need page names in the above form.</p>
<p>Once the keys were consistent and as detailed as possible I re-imported the CSV into the database and wrote yet another Python script to loop through every name element in the XML, pull out the database record with matching id, and write the new key value from the database to the @key attribute of the XML element. This went smoothly once I&#8217;d got rid of all the non-ASCII characters! Then I added some code to the XSLT to generate the index of people by pulling out all the name elements, grouping them by key value, and creating links to each occurrence of that name. This is still looking quite basic, but it works.</p>
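<p>The write-back loop and the grouping might be sketched like this. The table and column names are assumptions, the TEI namespace is omitted for brevity, and <code>build_index</code> only mirrors in Python the kind of grouping that the post does in XSLT.</p>

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def apply_keys(root, connection):
    """Copy each standardized key from the database to the matching element."""
    keys = dict(connection.execute("SELECT id, key FROM names"))
    updated = 0
    for element in root.iter("persName"):
        new_key = keys.get(element.get(XML_ID))
        if new_key is not None:
            element.set("key", new_key)
            updated += 1
    return updated


def build_index(root):
    """Group the ids of name elements by their key attribute."""
    index = defaultdict(list)
    for element in root.iter("persName"):
        index[element.get("key")].append(element.get(XML_ID))
    return dict(index)


# Illustrative sample: three occurrences, two of which are the same person.
root = ET.Element("body")
for number in (1, 2, 3):
    ET.SubElement(root, "persName", {XML_ID: f"pers{number}"})

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE names (id TEXT PRIMARY KEY, key TEXT)")
connection.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("pers1", "Sandall, Thomas Edward, Lt-Col (d 1930)"),
        ("pers2", "Howard, William, 241683"),
        ("pers3", "Sandall, Thomas Edward, Lt-Col (d 1930)"),
    ],
)
print(apply_keys(root, connection), "keys written")
print(build_index(root))
```

<p>Loading all the keys into a dictionary first avoids running one query per element, which matters little at this scale but keeps the loop simple.</p>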
<p>The next step is to make each occurrence of a name link back to the index. There&#8217;s a complication here because I need the entries in the index to have valid unique ids so that I can link to them, but I also need the full regularized forms of the names to appear in the index list. This places conflicting demands on the @key attribute, because xml:ids can&#8217;t contain spaces, brackets, or commas. TEI used to have a @reg attribute for names which could contain a regularized form, but this was removed in P5. The way you&#8217;re supposed to do it now is put the regularized form in a &lt;reg&gt; element and surround them with a &lt;choice&gt; element. I could do that and use a modified form for the @key. But I&#8217;m also thinking about putting the name data into a JSON file so I can do cool things with <a href="http://simile.mit.edu/exhibit/">Exhibit</a>. That would mean that I wouldn&#8217;t have to embed the regularized form of the name in every occurrence &#8211; just have an abbreviated form that can be used as a key for name occurrences and as an id for the index entry. It would also allow me to store and use more data than can be accommodated in TEI XML. For example, during the research/disambiguating phase I collected a lot of links to medal cards and CWGC database entries. If these were in a JSON file I could easily link to them from the index page without much extra work. Now I need to work out exactly what I want to do with this and how to structure the data.</p>
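<p>One way to get a valid id out of a regularized name is sketched below, together with a JSON record that keeps the full form. All of this is an assumption about a design that is still undecided: <code>key_to_id</code>, the <code>p-</code> prefix and the <code>links</code> field are hypothetical, and the <code>items</code>/<code>label</code>/<code>type</code> layout reflects a general understanding of Exhibit&#8217;s JSON format that should be checked against its documentation.</p>

```python
import json
import re


def key_to_id(key, prefix="p-"):
    """Reduce a name key to characters that are safe in an xml:id."""
    # Replace each run of anything other than ASCII letters and digits
    # (spaces, commas, brackets and so on) with a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", key).strip("-")
    # An xml:id may not start with a digit, so always add a letter prefix.
    return prefix + slug


def index_entry(key, links=()):
    """Build one record: a short id for linking, the full key for display."""
    return {"id": key_to_id(key), "label": key, "type": "Person", "links": list(links)}


entries = [
    index_entry("Pickard, Herbert, 240010 (d 1917)"),
    index_entry("Stuart-Wortley, Edward James Montagu, Maj-Gen (1857-1934)"),
]
print(json.dumps({"items": entries}, indent=2))
```

<p>Two different keys could in principle collapse to the same id, and accented letters would be replaced by hyphens, so a real version would need a uniqueness check.</p>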
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Marking Up Names: Part 1</title>
		<link>http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/#comments</comments>
		<pubDate>Tue, 15 Jan 2008 15:55:00 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[ww1]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+1&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-15&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/&amp;rft.language=English"></span>
On to the next stage of digitizing Sandall&#8217;s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations. For this phase, I decided to [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+1&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-15&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/&amp;rft.language=English"></span>
<p>On to the next stage of digitizing Sandall&#8217;s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations.</p>
<p><span id="more-169"></span>For this phase, I decided to use the following tags:</p>
<p>&lt;placeName&gt; (used 729 times) marks up names of places. I decided to mark up only settlements which are likely to be found on Google Maps, so names of geographic features, trenches/fortifications, HQs etc. were ignored.</p>
<p>&lt;persName&gt; (used 789 times) marks the name of a person. Contains further tags to mark up forenames, surnames, ranks etc. @key is to contain a regularized version of the name which can be used to link records together and link to external web pages.</p>
<p>&lt;forename&gt; used to mark up forenames, one tag for each forename, including initials. @full used to denote initials or full names. In practice the vast majority of forenames in the book are given as initials.</p>
<p>&lt;surname&gt; used to mark surnames. Double-barrelled names are treated as a single surname with a single tag.</p>
<p>&lt;nameLink&gt; parts of name such as &#8220;de&#8221; or &#8220;le&#8221; which are not considered part of the surname for sorting purposes. I have followed the practice of the index in the original text to decide which parts of a surname are significant.</p>
<p>&lt;roleName&gt; used mostly for military ranks.</p>
<p>&lt;addName&gt; used for service numbers, with @type given the value &#8220;servicenumber&#8221;</p>
<p>&lt;rs&gt; (used 158 times) referring strings which refer to an individual who might be identifiable but who isn&#8217;t named in the text, eg &#8220;Battalion C.O.&#8221;, &#8220;our missing man&#8221;, &#8220;Corps Commander&#8221; etc.</p>
<p>&lt;name&gt; (used 3 times) only used to mark up the names of ships. I&#8217;m not sure if I&#8217;m actually going to do anything with these yet.</p>
<p>&lt;date&gt; (used 991 times) used for all dates which are identifiable as a single day, regardless of form in text eg &#8220;4th August, 1914&#8221;, &#8220;13th October&#8221;, &#8220;the 5th&#8221;, &#8220;next day&#8221;. @when gives full date in form yyyy-mm-dd. No other tags were used to mark dates, as for this project it isn&#8217;t necessary to mark up individual parts.</p>
<p>&lt;abbr&gt; marks abbreviations given in text. Some obvious ones (eg &#8220;a.m.&#8221;, &#8220;p.m.&#8221;) and some ranks which were only slightly abbreviated (eg &#8220;Lieut.-Colonel&#8221;, &#8220;Sergt.&#8221;) were left unmarked but most were marked up.</p>
<p>&lt;expan&gt; supplied expansion of any abbreviation marked with &lt;abbr&gt;.</p>
<p>&lt;choice&gt; surrounds &lt;abbr&gt; and &lt;expan&gt; pairs. XSLT generates an HTML &lt;abbr&gt; tag with the content of &lt;expan&gt; as the value of @title.</p>
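<p>The &lt;choice&gt; transformation is done in XSLT in this project. Purely as an illustration, the same mapping can be sketched in Python with the standard library. The function name <code>choice_to_html</code> is hypothetical, and the sketch assumes each &lt;choice&gt; holds exactly one &lt;abbr&gt; and one &lt;expan&gt; with no namespace prefix.</p>

```python
import xml.etree.ElementTree as ET


def choice_to_html(choice):
    """Build an HTML abbr element from a TEI choice element.

    Assumes the choice holds exactly one abbr and one expan child and
    that the elements carry no namespace prefix.
    """
    html_abbr = ET.Element("abbr")
    # The expansion becomes the title attribute (shown as a tooltip).
    html_abbr.set("title", choice.findtext("expan", default=""))
    # The abbreviated form stays as the visible text.
    html_abbr.text = choice.findtext("abbr", default="")
    return html_abbr


if __name__ == "__main__":
    # Build a sample choice element in code rather than parsing a file.
    choice = ET.Element("choice")
    ET.SubElement(choice, "abbr").text = "C.O."
    ET.SubElement(choice, "expan").text = "Commanding Officer"
    print(ET.tostring(choice_to_html(choice), encoding="unicode"))
```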
<p>The names and dates in the medal list in the appendix were easy to mark up fully using regular expressions as they were well organized into predictable patterns. The main text needed more manual intervention as things were much less predictable. When I say &#8220;manual&#8221; that doesn&#8217;t involve any reading or typing as oXygen can still automate things. For example, adding a tag usually only involves a couple of double clicks to select a word then select a tag from the list of tags allowed at that point. At the first pass I only added &lt;placeName&gt;, &lt;persName&gt;, &lt;abbr&gt;, and &lt;date&gt; tags. Using the Find feature in oXygen I set up a simple regular expression [A-Z] which would find every capital letter. This allowed me to cycle through, finding every proper noun and abbreviation (and a lot of false positives too, but I can&#8217;t think of a better way to do it), and mark them with the appropriate tag. This took about 4 1/2 hours. Then I used another regular expression to find any dates which hadn&#8217;t been tagged yet (many of them don&#8217;t have months given and so weren&#8217;t found by the capital letter search):</p>
<p>(?&lt;= )[0-9]{1,2}(st|nd|rd|th)</p>
<p>This also found a few unit/formation numbers. I checked each occurrence to make sure it actually was a date, but even so it only took about 40 minutes to cycle through the whole book. Next I searched for some referring strings which wouldn&#8217;t have been picked up by the previous searches (eg &#8220;man&#8221;, &#8220;officer&#8221;, &#8220;day&#8221;, &#8220;month&#8221;).</p>
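<p>The search itself was run in oXygen&#8217;s Find dialog. As a rough illustration only, the same pattern can be tried in Python. The function name and the sample sentence are invented, and the group is written as non-capturing, which does not change what matches.</p>

```python
import re

# The ordinal pattern from the post: one or two digits preceded by a
# space and followed by st, nd, rd or th.
ORDINAL = re.compile(r"(?<= )[0-9]{1,2}(?:st|nd|rd|th)")


def find_ordinals(text):
    """Return (position, match) pairs for every candidate date."""
    return [(m.start(), m.group(0)) for m in ORDINAL.finditer(text)]


if __name__ == "__main__":
    # Invented sentence; note that the division number is matched too,
    # which is why every hit still has to be checked by eye.
    sample = "On the 5th the 46th Division moved up, and on 13th October it attacked."
    for pos, hit in find_ordinals(sample):
        print(pos, hit)
```

<p>Because the pattern allows at most two digits after the space, a three-digit number such as &#8220;138th&#8221; is not matched, but two-digit unit numbers are.</p>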
<p>Adding values to the @when attributes of &lt;date&gt; tags was done automatically for the medal list but had to be done manually for the main text as the month and year often had to be worked out from the context. With a lot of copying and pasting this took about 2 hours.</p>
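<p>A minimal sketch of how the automatic step might work for fully written dates. It assumes they follow the &#8220;4th August, 1914&#8221; form, which may not be how the medal list is actually laid out. The names <code>FULL_DATE</code> and <code>when_value</code> are hypothetical, and this is not the method actually used.</p>

```python
import re
from datetime import date

MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}

# Day with ordinal suffix, month name, optional comma, four-digit year.
FULL_DATE = re.compile(r"([0-9]{1,2})(?:st|nd|rd|th) ([A-Z][a-z]+),? ([0-9]{4})")


def when_value(text):
    """Return a yyyy-mm-dd string for a full date, or None if no match."""
    m = FULL_DATE.search(text)
    if m is None or m.group(2) not in MONTHS:
        return None
    day, month, year = int(m.group(1)), MONTHS[m.group(2)], int(m.group(3))
    # date() raises ValueError for impossible dates such as 31st February.
    return date(year, month, day).isoformat()


if __name__ == "__main__":
    print(when_value("4th August, 1914"))
    print(when_value("13th October"))
```

<p>Dates without a year, like &#8220;13th October&#8221;, return None here, which matches the point above: in the main text the month and year have to be worked out from context.</p>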
<p>With abbreviations marked up it was easy to supply expansions using Find and Replace. This took about half an hour, partly because there were so many different abbreviations, and partly because of time spent researching particularly obscure ones. On reflection I could also have used the Find and Replace features of oXygen (which allow the use of regular expressions AND XPath) to mark up most of the name components, but at least I learnt from doing it manually that it took 3 1/2 hours and was very tedious. If forenames were full I set @full to &#8220;yes&#8221; manually, otherwise I left them, then used F&amp;R to add full="init" to all the ones which didn&#8217;t have @full attributes. I did a similar thing with @type="military" for &lt;roleName&gt; tags.</p>
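<p>The default-filling step was done with Find and Replace in oXygen. The same idea can be sketched in Python, assuming elements without a namespace prefix. The <code>fill_default</code> helper and the sample name are invented for illustration.</p>

```python
import xml.etree.ElementTree as ET


def fill_default(root, tag, attr, value):
    """Set attr=value on every tag element that lacks attr; return count."""
    changed = 0
    for elem in root.iter(tag):
        if attr not in elem.attrib:
            elem.set(attr, value)
            changed += 1
    return changed


if __name__ == "__main__":
    # Build a small persName in code; the name itself is invented.
    pers = ET.Element("persName")
    ET.SubElement(pers, "roleName").text = "Sergt."
    ET.SubElement(pers, "forename", {"full": "yes"}).text = "John"
    ET.SubElement(pers, "forename").text = "A."
    ET.SubElement(pers, "surname").text = "Smith"
    # Forenames already marked by hand keep their value; the rest get "init".
    fill_default(pers, "forename", "full", "init")
    fill_default(pers, "roleName", "type", "military")
    print(ET.tostring(pers, encoding="unicode"))
```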
<p>Once all the names were fully marked up I wrote a Python script to generate a unique @id for every &lt;persName&gt; and &lt;rs&gt;. It also generated @key values for every &lt;persName&gt; from the component parts, in the form [surname], [forenames] [namelink] [servicenumber] [rank]. When I get to record linkage these will need to be standardized, as often the same person is referred to in different ways.</p>
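<p>The original script is not reproduced here. The following is a minimal sketch of what such a script might look like. It assumes the name parts are direct children of &lt;persName&gt;, that there are no namespace prefixes, and that a simple running number is enough for the id. The function names and the id format are hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def _texts(pers, path):
    """Collect the stripped text of every child element matching path."""
    return [e.text.strip() for e in pers.findall(path) if e.text]


def make_key(pers):
    """Build a key in the order surname, forenames, namelink, number, rank."""
    surname = " ".join(_texts(pers, "surname"))
    rest = (
        _texts(pers, "forename")
        + _texts(pers, "nameLink")
        + _texts(pers, "addName[@type='servicenumber']")
        + _texts(pers, "roleName")
    )
    # Strip stray separators when the surname or the rest is missing.
    return (surname + ", " + " ".join(rest)).strip(", ")


def add_ids_and_keys(root):
    """Give every persName and rs a running id, and every persName a key."""
    count = 0
    for elem in root.iter():
        if elem.tag in ("persName", "rs"):
            count += 1
            elem.set("id", "n%04d" % count)
        if elem.tag == "persName":
            elem.set("key", make_key(elem))
    return count


if __name__ == "__main__":
    # Invented sample built in code rather than parsed from the book.
    para = ET.Element("p")
    pers = ET.SubElement(para, "persName")
    ET.SubElement(pers, "roleName").text = "Sergt."
    ET.SubElement(pers, "forename").text = "J."
    ET.SubElement(pers, "surname").text = "Smith"
    add_ids_and_keys(para)
    print(pers.get("id"), pers.get("key"))
```

<p>Referring strings get an id but no key in this sketch, because their keys cannot be derived from the tag content.</p>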
<p>Referring strings were more difficult. These needed @key values in the same form as &lt;persName&gt; but obviously they can&#8217;t be generated from the content of the tag. It took a lot of research to fill these in but I did most of them in a day. Battalion COs and Adjutants were supplied entirely from internal evidence in other parts of the book. Division Commanders were easy because 1/5th Lincs was always in 46th Division and there&#8217;s a complete list of division commanders at <a href="http://www.1914-1918.net/46div.htm">The Long Long Trail</a>. Someone on the Great War Forum kindly supplied me with a list of Brigadiers of 138th Brigade. Surprisingly Corps and Army commanders were more difficult as I couldn&#8217;t find a complete list of which Corps and Army 46th Division belonged to and who commanded them. With a lot of cross-checking between different parts of the book, along with the Oxford DNB, The Long Long Trail and Great War Forum, Regiments.org, and various Google searches I eventually pieced most of it together and now I&#8217;m only stuck for one Corps Commander. Google, DNB, and Wikipedia together supplied names of several non-military people, such as the Archbishop of York, Vicar of Grimsby, and President of Portugal. In the end there were only a few staff officers and COs of other battalions which I had to ask for help with on the Great War Forum. I still haven&#8217;t got everyone, but where a name is unknown I&#8217;ve given them unique strings like &#8220;unknown wounded NCO 1918-09-25&#8221;.</p>
<p>The next step is to pull out all the names and standardize the key strings so that they match for all instances of the same person, then feed them back into the XML document.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

