Text-mining tips

[posted by Gavin Robinson, 10:27 am, 12 June 2011]

These are some insights from the text-mining that I’ve been doing this week:

Stop and think about stop words

One of the first rules of text-mining should be: always make your own list of stop words. Nothing absolutely and objectively is or isn’t a stop word. Which words are and aren’t meaningful depends on your research questions. For example, pronouns are often included in lists of stop words, but I’m very interested in gender so I want to know the frequencies of gendered words like ‘he’ and ‘she’. If you use someone else’s list without thinking about it you’ll probably inherit various biases and assumptions. The kind of text you’re working with also makes a difference. In the proceedings of parliament words like ‘ordered’, ‘resolved’ and ‘committee’ occur too regularly to be much use to most people. If you don’t define your stop words until after you’ve calculated frequencies for every word you can get a better idea of which words are getting in the way and which ones are interesting.
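As a minimal sketch of that last point: count every word first, look at the most frequent ones, and only then decide which to treat as stop words. The sample text and the choice of stop words below are made up for illustration; `collections.Counter` is part of Python's standard library.

```python
from collections import Counter

# Made-up sample text standing in for a real source
text = ("ordered that he be sent and resolved that she be heard "
        "and ordered that the committee meet")

# Count every word before deciding what counts as a stop word
frequencies = Counter(text.split())

# Inspect the most frequent words to see which ones get in the way
for word, count in frequencies.most_common(5):
    print(word, count)

# Only now choose the stop words (an illustrative choice that keeps 'he' and 'she')
stop_words = {"that", "be", "and", "the", "ordered", "resolved", "committee"}
interesting = {w: c for w, c in frequencies.items() if w not in stop_words}
```

Because the stop words are chosen after looking at the counts, gendered pronouns can stay in while formulaic words drop out.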

BeautifulSoup is not always the answer

The Python library BeautifulSoup is really useful for extracting data from HTML pages, but maybe I got into the habit of using it too much. This week I was trying to work out how to get some data from pages that didn’t have a very good semantic structure. Doing it with BeautifulSoup looked like it would be really complicated, but then I realised that in this case regular expressions would be much easier.
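To illustrate the kind of case where a regular expression is enough, here is a small sketch. The HTML fragment and field names are invented for the example and are not from the pages described above; only the standard `re` module is used.

```python
import re

# Invented fragment: labels in <b> tags, values as loose text, no useful structure
html = "<b>Name:</b> Jane Doe<br><b>Date:</b> 1 May 1647<br><b>Ref:</b> ABC 1/2<br>"

# Capture each label and the text that follows it, up to the next <br>
pattern = re.compile(r"<b>(\w+):</b>\s*(.*?)<br>")

# findall returns (label, value) pairs, which dict() turns into a lookup table
fields = dict(pattern.findall(html))
print(fields)
```

This only works when the markup is very regular; for nested or inconsistent HTML a proper parser is usually the safer choice.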

Have sets

Python includes a built-in collection type called a set, which behaves like a mathematical set (unlike a list it is unordered and holds each item only once) and is incredibly useful for text-mining scripts. Turning a list into a set automatically gets rid of duplicates. For example, suppose you’ve split some text into a list of separate words.

>>>wordlist = 'it was the best of times it was the worst of times'.split()

>>>wordlist

['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']

>>>wordset = set(wordlist)

>>>wordset

set(['of', 'it', 'times', 'worst', 'the', 'was', 'best'])

Now we have a set of unique words which we can iterate through using a for loop, counting the occurrences of each word in the list:

for word in wordset:
    wordcount = wordlist.count(word)

Inside the loop we can then do whatever we want with wordcount (print it to the screen, add it to a list or a dictionary, write it to a file) before it is overwritten by the count for the next word.

You can also do mathematical operations on sets, which can be really useful for removing stop words.

Suppose we have a set of stopwords:

>>>stopwordset = set(['of', 'it', 'the'])

We can subtract that from the set of words before we iterate through it:

>>>wordset = wordset - stopwordset

>>>wordset

set(['was', 'worst', 'best', 'times'])

Now the stop words in wordlist are completely ignored, and we don’t even have to do an if test at every iteration.
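The session above is from Python 2. As a sketch, the same steps in Python 3 might look like this; sets have no fixed order, so `sorted()` is used to get predictable output.

```python
wordlist = 'it was the best of times it was the worst of times'.split()
stopwordset = {'of', 'it', 'the'}

# Remove duplicates, then subtract the stop words
wordset = set(wordlist) - stopwordset

# Count each remaining word in the original list
counts = {word: wordlist.count(word) for word in wordset}

for word in sorted(counts):
    print(word, counts[word])
```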

A dictionary is a bit like a database

Python dictionaries can be thought of as very simple databases. Obviously they can’t do everything that a database can do, but you don’t have to worry about connections or cursors either. When counting words across multiple files it’s easy to keep a running total of each word by updating a dictionary at every iteration. If the word is already in the dictionary, add to the existing count; if it isn’t, add a new key/value pair.

This is how I do it:

>>>wordcount = dict()

(Then iterate through each file, open and read it etc.)

for word in wordset:
    if word in wordcount:
        # Seen in an earlier file: add this file's count to the running total
        wordcount[word] = wordcount[word] + wordlist.count(word)
    else:
        # First appearance: add a new key/value pair
        wordcount[word] = wordlist.count(word)
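An alternative sketch of the same running total uses `collections.Counter` from the standard library, which handles the "new word or existing word" check itself. The two strings below stand in for the contents of separate files and are made up for illustration.

```python
from collections import Counter

# Stand-ins for the text read from each file
texts = [
    'it was the best of times',
    'it was the worst of times',
]
stopwordset = {'of', 'it', 'the'}

wordcount = Counter()
for text in texts:
    # Drop stop words, then add this text's counts to the running total
    words = [w for w in text.split() if w not in stopwordset]
    wordcount.update(words)

print(wordcount.most_common())
```

A `Counter` is a dictionary subclass, so everything that works on the plain dictionary version still works here.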

Digital images: how do you manage?

[posted by Gavin Robinson, 3:09 pm, 23 October 2010]

Back in July I posted about a Python script I was working on to help with organizing photos of archival documents. I didn’t think it would be all that interesting to many other people, but a comment from Chris Williams made me realize that there’s potentially quite a lot of demand for something like this. Digital photography in archives doesn’t seem to be much of a sexy buzz topic among digital historians, but it’s something that lots of researchers do even if they’re not into digital history (although Melissa Terras’s latest book seems to cover it). As far as I know there aren’t any tools specifically designed to help with organizing large numbers of document images. The Python script I’m working on is just a stopgap thing which is mostly specific to what I’m doing and how I work, and is never likely to be very user friendly. Maybe what we need is a Firefox extension that plugs into Zotero, or maybe image management features in Zotero itself. Some features that might be useful:

  • Browse a directory of images in Firefox (I used to use MozImage for this, as I was reminded when I found this old post)
  • Mark a page image as being the first or last in a document (this is the really crucial thing, and I’m not aware of any image browsers that can currently do it)
  • Create sub-directories for documents and move images into them based on first and last markers
  • Create Zotero items for marked documents, maybe with some fields pre-filled in a standard form which can be applied to all documents in a directory. For example, if I’m working through box SP 24/30 from the National Archives, set Repository to “TNA” and Loc in Archive to “SP 24/30”.
  • Upload images to Flickr and create sets for them, maybe based on associated Zotero items; attach Flickr links to relevant Zotero items

I’m not in a position to do this myself right now, but I need to learn how to make Firefox extensions sooner or later. Apart from image management stuff, I also need a word count extension (I usually draft most of my writing in a private wiki instead of a word processor; having Firefox count the words for me is much easier than pasting into Open Office just to see how much I’ve written). The one I used to use isn’t compatible with Firefox 3.6 and the author hasn’t updated it for a long time. Counting words can’t be that hard can it? Or maybe it is.

So, does anyone have any thoughts on image management? If you take lots of photos in the archives, how do you deal with them once you get them home? Is there any software I don’t know about which would do what I need? What features would make your life easier?

Multiple Indemnity

[posted by Gavin Robinson, 10:04 am, 20 July 2010]

As part of the research for my book (saying that still feels a bit weird, but I’m sure I’ll get used to it) I’m going through indemnity cases in class SP 24 in the UK National Archives (aka the PRO). The Indemnity Committee was set up by parliament in 1647 to protect soldiers and officials from prosecution for actions that they had carried out under the authority of parliament, such as requisitioning things for the army or arresting royalists. It also dealt with disputes over sequestered rents and debts, and helped to enforce parliament’s order that apprentices who joined the army should be allowed to count military service towards their term of apprenticeship. If someone was prosecuted in court for acts which were covered by the Indemnity Ordinance (and many were despite the Ordinance banning people from bringing cases of this kind) the defendant could send a petition to the Indemnity Committee asking for protection. In SP 24 there are 58 boxes of petitions and other papers relating to cases, such as depositions and lists of expenses. Unlike some classes these are quite well sorted: papers relating to each case are grouped together and sorted in roughly alphabetical order of the plaintiff’s name (although confusingly the plaintiff in an indemnity case is the defendant in the corresponding criminal prosecution). I’m particularly interested in cases relating to horse requisitioning. According to Ian Gentles, about 30% of the military cases involve horses, although from what I’ve seen so far military cases seem to be a minority as many cases are disputes between civilians over payment of rents and debts due to sequestered estates. It usually takes me less than an hour to skim through a box, look at the first petition in each case to see if it’s about horses, and photograph the relevant cases. Sometimes I get cases that look interesting for other reasons, but I try not to wander too far off topic too often. 
Since I’m photographing these papers for my research, and since the National Archives allow document images to be uploaded to Flickr, that’s just what I’m doing. I’m also putting transcripts or summaries of the documents, along with links to the images, on the Your Archives wiki. You can see what I’ve done so far, and follow my progress in future, via a Flickr collection and Your Archives category.

So far I’ve uploaded cases from the first 2 boxes. I have another 16 boxes ready to be uploaded, but I’m working on some Python scripts to automate the process. The trial run on the first two boxes proved that doing it all manually is quite labour intensive. First I copied the image files from my camera and sorted them into directories for each box. The directory structure is based on the archival reference, so there’s a directory called “SP 24” with sub-directories called “30”, “31” etc. Then I went into each of these directories and made sub-directories for each case, so it looks like this:

  • SP 24
    • 30
      • 1 Abeary vs Windebanke
      • 1 Adams vs Haughton
      • 2 Alford vs King
      • etc
    • 31

And the path to a particular case would be:

SP 24/30/2 Alford vs King

Which looks quite similar to the archival reference.

The numbers at the start of the case name are the part number (each box usually contains three folders called part 1, part 2 and part 3 but I decided not to make directories for these). Up to here it has to be done manually as arranging cases into directories involves looking at the documents to see where a new case begins and to check the names. But from here a lot of it can be automated.
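The automated part could start by walking this directory layout and deriving a name for each case from its path. The sketch below assumes the class/box/case structure shown above; the function name `case_directories` is hypothetical and this is not the script described in the post.

```python
import os

def case_directories(root):
    """Yield (path, set_name) for each case directory under root.

    Assumes the layout root/<box>/<case>, e.g. 'SP 24/30/2 Alford vs King'.
    """
    class_name = os.path.basename(os.path.normpath(root))
    for box in sorted(os.listdir(root)):
        box_path = os.path.join(root, box)
        if not os.path.isdir(box_path):
            continue
        for case in sorted(os.listdir(box_path)):
            case_path = os.path.join(box_path, case)
            if os.path.isdir(case_path):
                # Build a name that mirrors the archival reference
                yield case_path, "{}/{}/{}".format(class_name, box, case)
```

Calling `list(case_directories("SP 24"))` would give one pair per case, with names that look like the archival reference.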

Each directory containing one case needs to have its own photoset on Flickr. I used Postr to upload one case at a time and then used Desktop Flickr Organizer to create a set and add photos to it (I got both of these applications from the Ubuntu repository – if you’re on Windows then… stop using Windows!). Then I used the Organizr on the Flickr website to drag each set into the “SP 24 Indemnity Cases” collection. Once the Flickr photos and sets were in place I went to the web page for each set, manually created a Zotero item for the case, and attached a link to the page. Finally I created a Your Archives page for each case and attached a link to it in Zotero. This includes a template that I made for indemnity cases which gives some basic information in a standardized form and includes a link to the relevant Flickr set. Doing all this manually for each case is quite tedious and takes a long time, so I’m working on some Python scripts to automate the process. What I want the scripts to do is:

  1. Upload photos from multiple directories
  2. Create a separate photoset for each directory, with a name based on the directory name and path
  3. Get the ID of each set and write the IDs and names to a CSV file
  4. (At this point I’ll manually edit the CSV file to add data that will be needed for Your Archives and Zotero and which can only be got by looking at the document images, eg full names of plaintiffs and defendants, date of the petition, summary of the case, categories/tags)
  5. Use the data from the CSV file to construct a wiki page with the correct template and upload to Your Archives through the MediaWiki API
  6. Export an XML file which can be imported into Zotero
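Steps 3 and 5 can be sketched with the standard `csv` module. Everything specific here is an assumption: the set IDs are made up, the column names are illustrative, and the template name "Indemnity case" with its parameters is hypothetical rather than the real Your Archives template. An in-memory buffer stands in for the CSV file.

```python
import csv
import io

# Made-up set IDs paired with case names, standing in for what Flickr would return
photosets = [
    {"set_id": "1001", "name": "SP 24/30/1 Adams vs Haughton"},
    {"set_id": "1002", "name": "SP 24/30/2 Alford vs King"},
]

# Step 3: write IDs and names to CSV, leaving a blank column to fill in by hand
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["set_id", "name", "summary"])
writer.writeheader()
for photoset in photosets:
    writer.writerow({"summary": "", **photoset})

def wiki_page(row):
    # Hypothetical template name and parameters; doubled braces are literal braces
    return ("{{{{Indemnity case|reference={name}"
            "|flickr_set={set_id}|summary={summary}}}}}").format(**row)

# Step 5: read the (manually edited) rows back and build wiki text for each
buffer.seek(0)
pages = [wiki_page(row) for row in csv.DictReader(buffer)]
```

Uploading the resulting text to the wiki is a separate step and is not shown.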

So far I’ve written a Flickr upload script which does the first three steps and more or less works. Rather than working directly with the Flickr API I’m using the Python Flickr API library, which makes things very easy. It provides a flickr class with methods to handle API calls and authentication. Before using it you have to go to the App Garden and request an API key, but that doesn’t take long to do. App pages can be kept private, which is what I’m doing in this case as I don’t really have the time or skills to make my scripts fit for public consumption. The next step is to add error handling as the script only works as long as nothing goes wrong. In the real world, there are lots of things that could go wrong. The library throws an exception if it gets an error response from the API. Until I add some exception handling this means that the script just stops on an error. The script will need to keep track of what has and hasn’t been done (photos uploaded, sets created, photos added to sets) so that I can run it again if anything was left undone, and so that it doesn’t try to do the same thing again if it’s already been done. One annoying thing about Flickr’s public API is that it provides no way to create a collection or add sets to a collection. I assumed I’d be able to automate that part of the process but it looks like I’ll still have to do it manually.
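One possible shape for the error handling and progress tracking described above is sketched here. It does not call Flickr at all: `action` is any callable standing in for an upload or API call, and `run_steps` and the JSON progress file are hypothetical names chosen for the example.

```python
import json

def run_steps(items, action, done_path):
    """Apply action to each item, recording successes so a re-run skips them."""
    try:
        with open(done_path) as f:
            done = set(json.load(f))
    except FileNotFoundError:
        done = set()  # first run: nothing has been done yet

    failed = []
    for item in items:
        if item in done:
            continue  # already handled on an earlier run
        try:
            action(item)
        except Exception as error:
            # Note the failure and carry on instead of stopping the script
            failed.append((item, str(error)))
        else:
            done.add(item)

    # Save progress so the next run can pick up where this one left off
    with open(done_path, "w") as f:
        json.dump(sorted(done), f)
    return failed
```

Running it a second time with the same progress file retries only the items that failed, which is the behaviour the paragraph above asks for.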

For step 5 I’ll be using the Pywikipediabot library. I’ve already done some simple tests on a local MediaWiki installation and it seems quite easy to create a page. Once I’ve finished the script and thoroughly tested it I can ask for a bot account on Your Archives. Step 6 will involve learning a bit more about Zotero RDF. The easiest way to find out how to generate the right code is to export some similar existing items and look at the results.

So just because I’m writing a monograph it doesn’t mean I’ve abandoned digital history. I’ll still be using lots of digital tricks in the background, but they won’t necessarily be obvious in the text of the book. New technology is certainly making my research quicker and cheaper than it used to be. The stuff that I’ve written about above isn’t exactly revolutionary: it saves labour but it doesn’t offer new insights that couldn’t have been found before. But later in the project I’m planning to do some text mining which I hope will show me things that I couldn’t otherwise have found. I’ll also be revisiting phonetic algorithms for place name identification. And if I can’t think of anything else to blog about, there are likely to be some interesting stories in the indemnity cases.

How To Make A Bookmarklet

[posted by Gavin Robinson, 10:55 am, 13 February 2010]

Knowing how to program can save you from tedious repetitive tasks, such as inserting templates into a wiki page. Recently I’ve been spending more time editing the UK National Archives wiki Your Archives. I created a category for women’s wills, and while I was adding pages to it, I found that a lot of them didn’t have the correct template. Wills that were proved in the Prerogative Court of Canterbury are held by the National Archives and can be downloaded from their DocumentsOnline service. Transcripts of these wills can be posted on Your Archives, and we have a template for them which automatically creates a link back to DocumentsOnline based on an ID code, and formats some key data (testator’s name, dates, catalogue reference) in a standard form. Most of the data which goes into the template can be found in the DocumentsOnline index. We used to copy and paste each value manually, which was not the best use of a human’s time. Faced with the prospect of doing this an awful lot, I decided to write a program to do it automatically. First I threw together a Python script, which was alright for me but no use for people who don’t have Python and BeautifulSoup (and I also wrote it in such a way that it relied on Linux with xclip installed). So then I decided to rewrite it in JavaScript, so that other people could use it in their browsers. You can find the finished version and documentation on the PCC Will Bookmarklet page. Below is a walk through of how I did it.

(more…)

Converted to Ubuntu

[posted by Gavin Robinson, 2:47 pm, 23 January 2010]

Last year computer programming was out, but now it’s back in. For me anyway. Having finished my data entry job in October I’ve got more spare computer time, which means I can be more active in digital history again. Some things are different now. Zotero has groups and syncing. The Programming Historian has moved since the last time I looked at it. I can finish my digital edition of Sandall’s history of the 5th Lincolns because Major Teall’s epilogue came out of copyright in the UK at the start of this year. But the biggest change is that I’ve switched my operating system from Windows to Linux. When I built my new desktop PC (codenamed Zen) I installed Ubuntu, and I love it. My laptop (codenamed Orac) still has Windows Vista, but I don’t use it much.

Changing to a completely different operating system might sound like a big step but it was actually really easy. This is partly because most of the applications I use are cross platform. I use Firefox more than any other application (and possibly more than all other applications put together). Don’t think that I spend all my time idly browsing the web: Firefox is vital for my historical research and writing. I use Zotero to store, sort, and access all the bibliographic data plus associated notes and PDFs for my research projects. These can all be synced between my PCs via the Zotero server and my own WebDAV server. My works in progress are now drafted on a private wiki which is also necessarily accessed through my web browser. This is much more powerful and flexible than writing in Word like I used to. Every page has an edit history so I can easily compare versions and revert to an earlier one. Wikilinks make it easy to fit sections together in different orders and link to supplementary information. Thanks to Google my e-mail and RSS feed reader are also on the web. When I’m not using Firefox, I still mostly use cross-platform applications. For the last few years I’ve used oXygen for XML editing and jEdit for find and replace operations, both of which are written in Java. Python can run on Linux, Windows and Macs, and although that doesn’t necessarily make individual scripts cross-platform it doesn’t really matter when I’m writing them for myself. The only Windows specific app that I’ve relied on in the last few years is MS Access. Even that was mainly because I was getting paid good money to put data into it for someone else. For my own research I’ve got some old databases from my PhD research, but all I ever need to do with them is export data into other formats.

Given all this, changing to Linux was not likely to be much of a problem, but that would be understating things. In fact it turned out to be a big advantage. Ubuntu is actually much quicker and easier to install and set up than Windows. It just works out of the box and comes with most of the things that most people need to get started. Open Office, Firefox, and even Python are all pre-installed. Once I’d added my favourite Firefox extensions and synced my Zotero library I was ready to do most of what I need to do. The only tricky things were manually installing a proprietary graphics driver and setting up DVD playback, but even this wasn’t too hard. If you don’t have a powerful new graphics card and don’t need 3D performance out of it, the pre-installed open source driver will be adequate for desktop stuff. Even setting up a network printer was completely painless.

Adding new applications is generally much easier than on Windows. Instead of buying a CD or downloading an executable file you can just access software repositories via a menu and tick boxes to select apps you want to be downloaded and installed. Because most of these apps are free in every sense of the word (like Ubuntu itself) you won’t have to pay money or agree to a licence that sells your soul to the devil. Via the repositories I could easily install Geany (a code editor which I now use for Python programming: I actually like it more than Komodo), gFTP (FTP client), the aforementioned jEdit, and the BeautifulSoup library for Python. It only took a few simple commands at the terminal to install and set up an Apache server with PHP and MySQL for local testing. oXygen had to be downloaded and installed manually as it’s a proprietary application, but the academic licence is cheap and cross-platform: I originally bought it for Windows but my licence automatically carries over to Linux. To get it working properly I had to install the proprietary Sun version of Java, but that was easy to do via the repository. There is a thing called WINE which lets you run some Windows programs in Linux, but so far I’ve only used it for listening to music with Spotify.

With everything set up to my liking, Ubuntu has made me fitter, happier and more productive. It’s faster, more secure, more stable, and less annoying than Windows. You can start using it as soon as the desktop appears on the screen instead of waiting for it to finish starting, or dealing with a patronising storm of pop-ups about how your anti-virus might be out of date or how you’ve got unused icons on your desktop. The Blue Screen of Death is now just an unpleasant memory. Linux users generally don’t have virus scanners or software firewalls because we don’t need them. The only major problem I’ve had so far is when an upgrade to a new version didn’t agree with my proprietary graphics driver and made it impossible to boot to the desktop from the hard disk. Even that was surprisingly easy to recover from, as being able to run the operating system from the LiveCD makes it very easy to rescue any files which aren’t already backed up before doing a clean reinstall (and the reinstall process is quicker and easier than for Windows).

So those are my reasons for preferring Ubuntu to Windows. If you haven’t tried Linux before you can download Ubuntu, burn it onto a CD, and then boot from the CD, which gives you an option to try it out without actually installing it on your PC. And it won’t cost you anything. Meanwhile I’ll be getting on with my research, writing and programming. And blogging about those things…

The Programming Historian

[posted by Gavin Robinson, 2:16 pm, 5 May 2008]

Yesterday Bill Turkel announced that The Programming Historian is now available. This is a book, but not as we know it. It’s published in the form of a website and is completely free to access. As the name suggests, it’s an introduction to computer programming aimed specifically at historians. The tutorials will get you doing useful things as soon as possible, even if you have no previous experience of programming. If you do know programming it’s also worth a look. I found lots of useful tips in it.

By enabling more historians to make better use of digital technology the book is helping to change the way that we do history. And it’s also helping to change the way that we present our research, because it’s a concrete example of the advantages of open access publishing on the web. This means a whole lot more than not having to pay to read it. Although the book has been published, it’s still a work in progress. New chapters will be added in future, and existing ones can be improved in response to feedback from readers. Any typos, factual errors or unclear sentences can all be corrected very easily. Comments from reviewers are displayed on accompanying discussion pages so you can see how the text developed and what people thought of it. The book can keep growing to meet the needs of digital historians: there doesn’t ever have to be a point when it’s finally finished like there is with a printed book.

Go and read it. Now.

Identifying places 2

[posted by Gavin Robinson, 9:26 am, 15 April 2008]

Last week I posted about experiments with Python to automatically identify places mentioned in lists of horses donated to parliament’s armies in the English Civil War. The initial results, using the difflib algorithm to compare a selection of places with a list of Buckinghamshire parishes, were very encouraging. Since then I’ve scaled it up and also tried some different approaches. The results are less clear cut when comparing bigger lists, but I’ve been able to write a program which should save me a lot of time compared to the manual methods that I used during my PhD.
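For readers who haven’t used it, this is roughly what matching with the standard `difflib` module looks like. It is a sketch only: the parish list is a tiny sample, the variant spellings are invented for the example, and the cutoff shown is just difflib’s default rather than a setting taken from the post.

```python
import difflib

# A handful of Buckinghamshire parishes; a real run would use a full list
parishes = ["Amersham", "Aylesbury", "Chesham", "Great Missenden", "Wendover"]

# Invented examples of variant spellings
spellings = ["Amersam", "Alesbury", "Wendouer"]

for spelling in spellings:
    # Return up to 3 candidates, best first, scoring at least 0.6
    matches = difflib.get_close_matches(spelling, parishes, n=3, cutoff=0.6)
    print(spelling, "->", matches)
```

With a much longer list of parishes more near-misses appear among the candidates, which fits the less clear-cut results mentioned above.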

(more…)

Identifying Places

[posted by Gavin Robinson, 10:17 am, 8 April 2008]

Never mind the scary theory, here’s some empiricism. And computer programming. The piece I’m working on is an analysis of lists of horses donated to the parliamentarian army in the First Civil War. There are some figures derived from these lists in my forthcoming article in War In History and in the seminar paper that I posted in November, but I’m trying to write an article which examines them in much more detail. This article will be related to debates over allegiance and the causes of the war, which is why I’ve been trying to explore the historiography and think about theoretical issues, but the substance of it will be fairly straightforward empirical stuff with lots of numbers. That’s not to say that this kind of analysis is easy. If it was, someone else might have done it all years ago. John Tincey was the first person to try it, but he only did the smallest of the three account books, which is a fraction of the size of the other two. Following his lead I decided to do all of them.

In 1999 I spent about 2 weeks in the PRO typing these lists into an Access database. I’m still using that transcript as the basis of my work now, although I’ve converted it to XML to make it more flexible and checked a selection of the entries against digital photos of the manuscript. I’ve been using the Python classes that I developed for representing uncertainty to calculate totals of horses and values. Some pages are damaged, meaning that exact totals can’t be calculated – this is something that was difficult to deal with in Access but the combination of XML and Python has enough flexibility to cope with it. Getting totals for days and months is fairly easy, but I also want to group by the social status of the donors and the counties that they came from. Before I can group by counties I need to identify place names given in the manuscript as although some entries specify a county in the address, many more give a place name without a county.

(more…)

Top Tips for the Python Programmer

[posted by Gavin Robinson, 12:45 pm, 3 March 2008]

Last week I learnt about using exceptions, which turned out to be the solution to a problem that I’ve mentioned before: if you try to do anything with a variable that hasn’t been initialized, Python throws an exception. In many ways this is good, because trying to do things with non-existent variables can otherwise be a source of hard to find bugs. However, I found it quite annoying when I got exceptions just for checking for a variable and only trying to do things with it inside an if block if it exists. Even the seemingly innocuous statement “if x” will bring a program to a halt if x doesn’t exist.

The way around this is to handle the exception when it occurs, so that the program keeps running but you know that the variable doesn’t exist. For example:

try:
    x
except NameError:
    # x doesn't exist
    pass
else:
    # x does exist
    pass

This code tries to reference x. If x doesn’t exist, an exception occurs and the code after except is executed (but only if the type of exception thrown matches what’s specified between except and the colon). If x does exist the code after else is executed instead. This is a bit long winded compared to “if x” but it’s better than nothing. Also be aware that different kinds of variables throw different kinds of exceptions when they don’t exist.

  • NameError – any single instance of a built in type or custom object, or a sequence where the sequence itself doesn’t exist, eg x
  • IndexError – numerical index of a sequence where the sequence exists but the specified element doesn’t, eg x[5]
  • KeyError – index of a map where the map exists but an element with that index doesn’t, eg x['y']
  • AttributeError – attribute of an object where the object exists but the specified attribute doesn’t eg x.y (often occurs with objects representing XML elements as it can be difficult to predict what child elements they contain)
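The four cases in the list can be checked with a short script. This is a sketch: `describe` is a helper invented for the example, and the lambdas simply delay each lookup until it is inside the try block.

```python
def describe(func):
    """Call func and report which 'missing' exception it raises, if any."""
    try:
        func()
    except (NameError, IndexError, KeyError, AttributeError) as error:
        return type(error).__name__
    else:
        return "exists"

items = [1, 2, 3]
lookup = {"a": 1}

results = [
    describe(lambda: undefined_name),  # name never assigned
    describe(lambda: items[5]),        # list exists, element doesn't
    describe(lambda: lookup["y"]),     # dictionary exists, key doesn't
    describe(lambda: items.y),         # object exists, attribute doesn't
    describe(lambda: items[0]),        # everything exists
]
print(results)
```

Catching only the specific exception types means that unrelated errors still stop the program, which keeps real bugs visible.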

Representing uncertainty in Python

[posted by Gavin Robinson, 11:50 am, 28 February 2008]

In December I wrote some Python code to do calculations with pre-decimal British currency. As well as dealing with the awkwardness of pounds, shillings, and pence, I needed to allow for situations where a damaged or illegible manuscript made the values uncertain. To start with I wrote a class called MetaOldMoney which could store exact amounts of money or ranges of values.
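The awkwardness comes from the mixed bases: 12 pence to the shilling and 20 shillings to the pound. A minimal sketch of the arithmetic, which is not the MetaOldMoney class itself and uses hypothetical function names, converts everything to pence, does the sum, and converts back.

```python
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20

def to_pence(pounds, shillings, pence):
    """Convert an amount in pounds, shillings and pence to pence alone."""
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence

def from_pence(total):
    """Convert a number of pence back to (pounds, shillings, pence)."""
    shillings, pence = divmod(total, PENCE_PER_SHILLING)
    pounds, shillings = divmod(shillings, SHILLINGS_PER_POUND)
    return pounds, shillings, pence

# £1 15s 8d plus £2 7s 6d
total = from_pence(to_pence(1, 15, 8) + to_pence(2, 7, 6))
print(total)  # (4, 3, 2), i.e. £4 3s 2d
```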

Now I’ve written some new code which can easily deal with uncertain values of anything. There are three classes: one to represent an exact value, one to represent a range where the minimum and maximum values are known, and one to represent a minimum value with no maximum. Instances of all three objects contain a tuple of two values. For an exact value they’re both the same, for a range they contain the upper and lower amount, and for a minimum the second value is set to None. The addition operator is redefined so that any combination of these objects can be added together, returning an object of the correct type eg Exact + Range = Range etc. The best thing is that the values contained can be of absolutely any type. Taking full advantage of the Python approach to typing, the classes I’ve defined don’t even need to know what they contain. The addition will just work as long as the contained objects can be added together.

These classes are so flexible that there are lots of different ways I could use them. I could put my OldMoney objects inside them, or I could define a new money object which contains individual uncertain values for pound, shillings and pence. I could even nest the objects inside each other to allow for situations where the maximum value in a range is also a range.
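The description above can be turned into a rough sketch, reconstructed only from the prose and not from the code behind the link. The names `Exact` and `Range` follow the example in the text; `Minimum` and the `Uncertain` base class are assumptions made for illustration.

```python
class Uncertain:
    """Holds a (low, high) tuple; high is None when there is no known maximum."""
    def __init__(self, low, high):
        self.values = (low, high)

    def __add__(self, other):
        low = self.values[0] + other.values[0]
        # Any open-ended operand makes the result open-ended too
        if self.values[1] is None or other.values[1] is None:
            return Minimum(low)
        # Two exact values give an exact value
        if isinstance(self, Exact) and isinstance(other, Exact):
            return Exact(low)
        # Otherwise the result is a range with summed bounds
        return Range(low, self.values[1] + other.values[1])

class Exact(Uncertain):
    def __init__(self, value):
        super().__init__(value, value)

class Range(Uncertain):
    pass

class Minimum(Uncertain):
    def __init__(self, low):
        super().__init__(low, None)

total = Exact(2) + Range(1, 3)
print(type(total).__name__, total.values)
```

Nothing in the classes checks what the contained values are, so the same code should work for any objects that support `+`.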

Code below: (more…)
