The Programming Historian

[posted by Gavin Robinson, 2:16 pm, 5 May 2008]

Yesterday Bill Turkel announced that The Programming Historian is now available. This is a book, but not as we know it. It’s published in the form of a website and is completely free to access. As the name suggests, it’s an introduction to computer programming aimed specifically at historians. The tutorials will get you doing useful things as soon as possible, even if you have no previous experience of programming. If you do know programming it’s also worth a look. I found lots of useful tips in it.

By enabling more historians to make better use of digital technology the book is helping to change the way that we do history. And it’s also helping to change the way that we present our research, because it’s a concrete example of the advantages of open access publishing on the web. This means a whole lot more than not having to pay to read it. Although the book has been published, it’s still a work in progress. New chapters will be added in future, and existing ones can be improved in response to feedback from readers. Any typos, factual errors or unclear sentences can all be corrected very easily. Comments from reviewers are displayed on accompanying discussion pages so you can see how the text developed and what people thought of it. The book can keep growing to meet the needs of digital historians: there doesn’t ever have to be a point when it’s finally finished like there is with a printed book.

Go and read it. Now.

Identifying places 2

[posted by Gavin Robinson, 9:26 am, 15 April 2008]

Last week I posted about experiments with Python to automatically identify places mentioned in lists of horses donated to parliament’s armies in the English Civil War. The initial results, from using Python’s difflib module to compare a selection of places with a list of Buckinghamshire parishes, were very encouraging. Since then I’ve scaled it up and also tried some different approaches. The results are less clear cut when comparing bigger lists, but I’ve been able to write a program which should save me a lot of time compared to the manual methods that I used during my PhD.
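As a rough illustration of this kind of comparison (not the program described in the post, which sits behind the link below), here is a minimal sketch using difflib.get_close_matches from the standard library. The parish list, the variant spellings, the best_matches helper and the 0.6 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical sample data: a handful of parish names in modern spelling.
parishes = ["Aylesbury", "Amersham", "Chesham", "Wendover", "Great Missenden"]


def best_matches(name, candidates, n=3, cutoff=0.6):
    """Return up to n candidates that resemble name, most similar first."""
    # Compare in lower case, but return the candidates in their original form.
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[h] for h in hits]


# Invented variant spellings of the kind a manuscript might contain.
for spelling in ["Alesbury", "Wendouer", "Chessham"]:
    print(spelling, "->", best_matches(spelling, parishes))
```

Raising the cutoff gives fewer false matches but misses more variant spellings, which is probably why results get less clear cut as the lists grow.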

(more…)

Identifying Places

[posted by Gavin Robinson, 10:17 am, 8 April 2008]

Never mind the scary theory, here’s some empiricism. And computer programming. The piece I’m working on is an analysis of lists of horses donated to the parliamentarian army in the First Civil War. There are some figures derived from these lists in my forthcoming article in War In History and in the seminar paper that I posted in November, but I’m trying to write an article which examines them in much more detail. This article will be related to debates over allegiance and the causes of the war, which is why I’ve been trying to explore the historiography and think about theoretical issues, but the substance of it will be fairly straightforward empirical stuff with lots of numbers. That’s not to say that this kind of analysis is easy. If it was, someone else might have done it all years ago. John Tincey was the first person to try it, but he only did the smallest of the three account books, which is a fraction of the size of the other two. Following his lead I decided to do all of them.

In 1999 I spent about 2 weeks in the PRO typing these lists into an Access database. I’m still using that transcript as the basis of my work now, although I’ve converted it to XML to make it more flexible and checked a selection of the entries against digital photos of the manuscript. I’ve been using the Python classes that I developed for representing uncertainty to calculate totals of horses and values. Some pages are damaged, meaning that exact totals can’t be calculated – this is something that was difficult to deal with in Access but the combination of XML and Python has enough flexibility to cope with it. Getting totals for days and months is fairly easy, but I also want to group by the social status of the donors and the counties that they came from. Before I can group by counties I need to identify the place names given in the manuscript, because although some entries specify a county in the address, many more give a place name without a county.
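To give a flavour of why monthly totals are the easy part, here is a small sketch using the standard library’s xml.etree.ElementTree. The element names, attribute names and sample entries are invented for illustration; the real transcript’s markup isn’t shown in this post.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: every element and attribute name here is an assumption.
SAMPLE = """
<entries>
  <entry date="1642-06-10" place="Aylesbury" county="Buckinghamshire" horses="2"/>
  <entry date="1642-06-11" place="Wendover" horses="1"/>
  <entry date="1642-07-02" place="Amersham" horses="3"/>
</entries>
"""

root = ET.fromstring(SAMPLE)

by_month = Counter()   # total horses per YYYY-MM
no_county = []         # places that still need identifying

for entry in root.iter("entry"):
    month = entry.get("date")[:7]
    by_month[month] += int(entry.get("horses"))
    if entry.get("county") is None:
        no_county.append(entry.get("place"))

print(dict(by_month))
print(no_county)
```

The entries collected in no_county are the ones that need a place name matched to a county before they can be grouped.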

(more…)

Top Tips for the Python Programmer

[posted by Gavin Robinson, 12:45 pm, 3 March 2008]

Last week I learnt about using exceptions, which turned out to be the solution to a problem that I’ve mentioned before: if you try to do anything with a variable that hasn’t been initialized, Python raises an exception. In many ways this is good, because trying to do things with non-existent variables can otherwise be a source of hard-to-find bugs. However, I found it quite annoying to get exceptions when all I wanted was to check whether a variable exists and only use it inside an if block if it does. Even the seemingly innocuous statement “if x” will bring a program to a halt if x doesn’t exist.

The way around this is to handle the exception when it occurs, so that the program keeps running but you know that the variable doesn’t exist. For example:

try:
    x
except NameError:
    # x doesn't exist
    pass
else:
    # x does exist
    pass

This code tries to reference x. If x doesn’t exist, an exception is raised and the code after except is executed (but only if the type of exception raised matches what’s specified between except and the colon). If x does exist the code after else is executed instead. This is a bit long winded compared to “if x” but it’s better than nothing. Also be aware that different kinds of reference raise different kinds of exception when the thing they refer to doesn’t exist.

  • NameError – any single instance of a built-in type or custom object, or a sequence where the sequence itself doesn’t exist, eg x
  • IndexError – numerical index of a sequence where the sequence exists but the specified element doesn’t, eg x[5]
  • KeyError – key of a mapping (such as a dictionary) where the mapping exists but an element with that key doesn’t, eg x['y']
  • AttributeError – attribute of an object where the object exists but the specified attribute doesn’t, eg x.y (often occurs with objects representing XML elements as it can be difficult to predict what child elements they contain)
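The four cases in the list can be tried out with a short script. This is only a sketch: the Element class, the probe helper and the sample data are made up for the demonstration.

```python
class Element:
    """Hypothetical stand-in for an object such as a parsed XML element."""
    pass


def probe(label, action):
    """Call action and report which of the four exceptions, if any, it raises."""
    try:
        action()
    except (NameError, IndexError, KeyError, AttributeError) as err:
        return f"{label}: {type(err).__name__}"
    return f"{label}: ok"


seq = [1, 2, 3]
mapping = {"a": 1}
obj = Element()

results = [
    probe("missing name", lambda: undefined_name),  # name was never assigned
    probe("missing index", lambda: seq[5]),         # list exists, element doesn't
    probe("missing key", lambda: mapping["y"]),     # dict exists, key doesn't
    probe("missing attribute", lambda: obj.y),      # object exists, attribute doesn't
]

for line in results:
    print(line)
```

For the last two cases Python also offers ways to avoid the exception altogether: mapping.get("y") returns None for a missing key, and getattr(obj, "y", None) does the same for a missing attribute.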

Representing uncertainty in Python

[posted by Gavin Robinson, 11:50 am, 28 February 2008]

In December I wrote some Python code to do calculations with pre-decimal British currency. As well as dealing with the awkwardness of pounds, shillings, and pence, I needed to allow for situations where a damaged or illegible manuscript made the values uncertain. To start with I wrote a class called MetaOldMoney which could store exact amounts of money or ranges of values.
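The awkwardness comes from the mixed bases: 12 pence to the shilling and 20 shillings to the pound. A minimal sketch of the arithmetic (this is not the MetaOldMoney class, and the function names are invented for the example) converts everything to pence, adds, and converts back.

```python
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20


def to_pence(pounds, shillings, pence):
    """Convert a (pounds, shillings, pence) amount to a total in pence."""
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence


def from_pence(total):
    """Convert a total in pence back to (pounds, shillings, pence)."""
    shillings, pence = divmod(total, PENCE_PER_SHILLING)
    pounds, shillings = divmod(shillings, SHILLINGS_PER_POUND)
    return pounds, shillings, pence


def add_money(a, b):
    """Add two (pounds, shillings, pence) tuples."""
    return from_pence(to_pence(*a) + to_pence(*b))


# £1 15s 8d plus 6s 6d
print(add_money((1, 15, 8), (0, 6, 6)))  # (2, 2, 2), ie £2 2s 2d
```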

Now I’ve written some new code which can easily deal with uncertain values of anything. There are three classes: one to represent an exact value, one to represent a range where the minimum and maximum values are known, and one to represent a minimum value with no maximum. Instances of all three classes contain a tuple of two values. For an exact value they’re both the same, for a range they contain the lower and upper amounts, and for a minimum the second value is set to None. The addition operator is redefined so that any combination of these objects can be added together, returning an object of the correct type, eg Exact + Range = Range. The best thing is that the values contained can be of absolutely any type. Taking full advantage of the Python approach to typing, the classes I’ve defined don’t even need to know what they contain. The addition will just work as long as the contained objects can be added together.
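The design described above could be sketched roughly as follows. This is a reconstruction from the description only, not the code from the post: the class names Exact, Range and Minimum, the shared Uncertain base class and the values attribute are all assumptions.

```python
class Uncertain:
    """Holds a (low, high) tuple; high is None when there is no known maximum."""

    def __init__(self, low, high):
        self.values = (low, high)

    def __add__(self, other):
        if not isinstance(other, Uncertain):
            return NotImplemented
        low = self.values[0] + other.values[0]
        # Anything added to an open-ended minimum is still open-ended.
        if self.values[1] is None or other.values[1] is None:
            return Minimum(low)
        # Two exact values give an exact result; anything else is a range.
        if isinstance(self, Exact) and isinstance(other, Exact):
            return Exact(low)
        return Range(low, self.values[1] + other.values[1])

    def __repr__(self):
        return f"{type(self).__name__}{self.values}"


class Exact(Uncertain):
    def __init__(self, value):
        super().__init__(value, value)


class Range(Uncertain):
    def __init__(self, low, high):
        super().__init__(low, high)


class Minimum(Uncertain):
    def __init__(self, low):
        super().__init__(low, None)


print(Exact(3) + Range(1, 4))    # Range(4, 7)
print(Range(1, 2) + Minimum(5))  # Minimum(6, None)
```

Because the classes only ever use + on whatever they contain, the same code works for integers, floats, or any custom object that defines addition.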

These classes are so flexible that there are lots of different ways I could use them. I could put my OldMoney objects inside them, or I could define a new money object which contains individual uncertain values for pounds, shillings and pence. I could even nest the objects inside each other to allow for situations where the maximum value in a range is also a range.

Code below: (more…)

Places

[posted by Gavin Robinson, 12:01 pm, 28 January 2008]

Following on from adding an interactive index of people to my digital edition of Sandall’s history of 5th Lincs, I’ve now added a similar feature for place names. It works in exactly the same way as the person index, but it also has a map view. Again this uses the Exhibit API, which makes it very easy to mash up data with Google Maps without even having to know anything about the Google Maps API. The map view is a bit slower than the normal view, especially if the list isn’t filtered, but that’s an inherent limitation of using maps.

One of the many cool things about the map is that it strikingly illustrates the allied advances in the last months of the First World War. If you go into the map view and click “The Beginning of the Great Advance” on the list of chapters, you’ll see the battalion holding the line in Flanders, then moving behind the lines for rest near Amiens, then moving up to the front line at Saint-Quentin. Then click on each of the following chapters in turn and watch the markers surge forward as 46th Division breaks through the Hindenburg Line and pushes towards Belgium.

Adding the place index was mostly similar to adding the person index: I added a unique id to each <placeName> tag using a Python script, pulled out the place names into an SQLite database, identified/disambiguated them and added a regularized name, then used another Python script to pull the regularized names out of the database and put them into the key attributes in the XML file. Identifying the places was easier than identifying people, and took a couple of days, although there are a few that I couldn’t find. As with people I added some code to the XSLT to generate a JSON file of all the places. Then following the map view tutorial I used the Exhibit API to pull latitude and longitude co-ordinates from Google Maps and put them into another JSON file. This turned out to be a bit unreliable as about 10 per cent of the places had their co-ordinates missing. It seems to be random, as running the script again with the same set of data produced a similar error rate but with different places. I had to take the missing places from the output file, put them into another input file and run the script over them again, which produced a similar 10 per cent error rate, but the remaining few co-ordinates could be put in manually. Once I had a JSON file with all the correct geocodes it was easy to copy code from the tutorial to add a map view to the Exhibit page. In a few cases it turned out that Google had given me the wrong co-ordinates. Mostly this was because there are two or more places with the same name and it had picked the wrong one. I thought I’d put in enough information from my manual searches to disambiguate them but it seems that the results of a Google Maps search can be a bit unpredictable, and don’t necessarily give you the full address of a place.
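The first part of that workflow (ids, database, key attributes) might look something like this in outline. It is a sketch under assumptions, not the scripts used for the edition: the sample text, the table layout, the id format and the regularized name are all invented, and it uses an in-memory SQLite database rather than a file.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical fragment of marked-up text.
SAMPLE = (
    "<text><p>From <placeName>Amiens</placeName> to "
    "<placeName>Saint-Quentin</placeName>.</p></text>"
)

root = ET.fromstring(SAMPLE)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE places (id TEXT PRIMARY KEY, name TEXT, regular_name TEXT)")

# Step 1: give every placeName a unique id and copy its text into the database.
for number, element in enumerate(root.iter("placeName"), start=1):
    place_id = f"place{number}"
    element.set("id", place_id)
    db.execute("INSERT INTO places (id, name) VALUES (?, ?)", (place_id, element.text))

# Step 2: identification is done by hand; here one regularized name is filled in.
db.execute(
    "UPDATE places SET regular_name = ? WHERE id = ?",
    ("Saint-Quentin, Aisne", "place2"),
)

# Step 3: write the regularized names back into the key attributes.
query = "SELECT id, regular_name FROM places WHERE regular_name IS NOT NULL"
for place_id, regular_name in db.execute(query):
    root.find(f".//placeName[@id='{place_id}']").set("key", regular_name)

print(ET.tostring(root, encoding="unicode"))
```

Keeping the manual identification in a database means the two scripts on either side of it can be rerun at any time without losing that work.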

I’ve now done most of what I planned to do in this phase. There are still some features that could be added, especially a feedback mechanism, but I’ll be giving this project a rest soon so I can do some English Civil War work.

Marking Up Names: Part 2

[posted by Gavin Robinson, 3:01 pm, 19 January 2008]

My digital edition of Sandall’s History of 1/5th Lincolnshire Regiment now has a new index of people. In my last post I described how names were marked up in the text. This post is about how I linked them together.

(more…)

Marking Up Names: Part 1

[posted by Gavin Robinson, 3:55 pm, 15 January 2008]

On to the next stage of digitizing Sandall’s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations.

(more…)

More progress with Sandall

[posted by Gavin Robinson, 3:26 pm, 5 January 2008]

My project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War has made very good progress this week. I’ve now uploaded a new HTML version. This features links to page images and a working index: if you click on a page number in the index it takes you to the corresponding part of the text. The whole book is still on one page as I haven’t worked out how to split it yet but it’s an improvement over the previous interim version. Below are more details of what I’ve done and how I’ve done it.

(more…)

New Old Money

[posted by Gavin Robinson, 5:02 pm, 24 December 2007]

In my last post I shared my first attempt at writing Python code to do calculations with pre-decimal currency. With a lot of help from Ben Brumfield I’ve rewritten it so that it now does a lot more with less code. The classes and functions have been completely rearranged, everything is easier to read, and there is more scope for dealing with uncertainty. This is yet another example of the benefits of blogging. Without Ben’s input I’d still be using some pretty mediocre code, but by posting my first attempt on the blog and brainstorming with readers I’ve made a vast improvement in only a couple of days. More details below.

(more…)
