Zotero, XML, Python, and SP28
Since my last post I’ve been doing some more experiments to see how Zotero can be used for cataloguing previously uncatalogued administrative records from the English Civil War. I’ve now put some more of my ideas into practice in demo form and they seem to work. Linking images to Zotero items and adding metadata went very smoothly. The idea of adding extra data by putting XML tags in notes also works, although this is just a stopgap until the Zotero developers implement custom fields. Once you have data in Zotero it’s very easy to export it as XML and do whatever you want with it. More details below, but it gets a bit technical and even includes some sample code (formatting code in WordPress is hard, and it’ll probably screw up the layout for some people). If you’re not A. Nerd and you’re not doing the shopping for your mum you might want to stop reading now.
First of all creating the items. I had a sample of 50 images of pay warrants from SP28/1A to use for this test run. This wasn’t quite a fair test as I knew in advance that they were all exactly the same kind of document: warrants authorized by the Earl of Essex in August 1642. In practice you don’t always get unbroken runs as there are sometimes other kinds of documents mixed in with the main series of warrants. So I pointed MozImage at the directory where the images were and went through each one, creating a new item, linking to the image file, and adding the metadata. The template hack that I mentioned in the last post worked really well here, saving a lot of time and effort. I realized that I was barking up the wrong tree with my hope that I could keep multiple templates in the same RDF file and select one by clicking on the folder. It makes much more sense to keep one template in each RDF file and attach a snapshot of it to the item that you created it from. That way you can see exactly which fields are prefilled. Otherwise if you have a lot of these templates it would be difficult to remember exactly what each one did. In this case I prefilled the author, repository, and location fields, and also set up notes with the XML tags that I used as pseudo-custom fields. These are to record the amounts of money due and paid, the date of payment (often different from the date that the warrant was authorized), and the names of the people who were paid.
It took about two and a half hours to create items, link images, and fill in all the data for these 50 documents. That’s an average of about three minutes per document, which isn’t too bad, and might improve with practice and/or more timesaving shortcuts (if I can think of any). It only took 10 minutes to photograph these 50 documents (about 12 seconds each), which is faster than my more recent experiment with an account book (see here), I think partly because I didn’t need to use weights for this volume. So in a total of about 3 minutes 12 seconds per document I have a digital image plus the most important data in machine readable form. That compares quite favourably with taking notes by hand and then copying them onto file cards, which is how I dealt with these same documents 10 years ago.
Once the data was entered I exported the collection to a MODS XML file. When Zotero exports notes to XML any angle brackets in the note are converted to entity references, but it takes hardly any time to run a Find and Replace (I use jEdit for this) to turn them back. This is a sample record from the file, including the custom XML fields in the notes:
<mods>
  <titleInfo>
    <title>Warrant</title>
  </titleInfo>
  <typeOfResource>text</typeOfResource>
  <genre authority="local">manuscript</genre>
  <genre>Warrant</genre>
  <name type="personal">
    <namePart type="family">Essex</namePart>
    <namePart type="given">Earl of</namePart>
    <role>
      <roleTerm type="code" authority="marcrelator">aut</roleTerm>
    </role>
  </name>
  <part>
    <extent unit="pages">
      <start>48</start>
      <end>48</end>
    </extent>
  </part>
  <originInfo>
    <dateCreated>1642-08-11</dateCreated>
  </originInfo>
  <location>
    <physicalLocation>SP 28/1A</physicalLocation>
  </location>
  <note type="content"><amountdue>
    <pounds>69</pounds>
    <shillings>0</shillings>
    <pence>0</pence>
  </amountdue>
  </note>
  <note type="content"><amountrecd>
    <pounds>69</pounds>
    <shillings>0</shillings>
    <pence>0</pence>
  </amountrecd>
  </note>
  <note type="content"><datepaid>1642-08-15</datepaid></note>
  <note type="content"><payto standard="">
    <titlebefore>Sir</titlebefore>
    <firstname>Philip</firstname>
    <surname>Stapleton</surname>
    <titleafter></titleafter>
  </payto></note>
  <note type="content"><recdby standard="">
    <titlebefore></titlebefore>
    <firstname>Edward</firstname>
    <surname>Constable</surname>
    <titleafter></titleafter>
  </recdby></note>
</mods>
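The Find and Replace step that turns the entity references in notes back into angle brackets could also be scripted. The following is only a rough sketch, assuming that `&lt;` and `&gt;` are the only entities that need restoring and that they should be touched only inside `<note>` elements; the function name `restore_note_tags` and the regular expression are hypothetical, not anything provided by Zotero.

```python
import re

# Hypothetical helper: matches an opening <note ...> tag, its body, and the closing tag.
NOTE_RE = re.compile(r'(<note\b[^>]*>)(.*?)(</note>)', re.DOTALL)

def restore_note_tags(mods_text):
    """Turn &lt; and &gt; back into angle brackets, but only inside <note> elements."""
    def fix(match):
        opening, body, closing = match.groups()
        # Assumption: only angle brackets were escaped; &amp; is left alone
        # so that the result stays well-formed XML.
        body = body.replace('&lt;', '<').replace('&gt;', '>')
        return opening + body + closing
    return NOTE_RE.sub(fix, mods_text)

# A fragment shaped like the exported file before the Find and Replace.
exported = ('<title>Warrant</title>'
            '<note type="content">&lt;datepaid&gt;1642-08-15&lt;/datepaid&gt;</note>')
restored = restore_note_tags(exported)
print(restored)
```

Restricting the replacement to notes means that an escaped angle bracket in a title or other standard field would be left as it is.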
You could now do almost anything with this data. Putting it back into Zotero and synchronizing it with the original records might be problematic, but all I want to do is analyze the data outside Zotero. I just tried writing a Python script to pull some of the fields into a SQLite database, which isn’t very exciting, but everybody’s got to start somewhere. This is the first time I’ve tried writing Python code which actually does anything. If you’re dealing with SQLite there’s another useful Firefox plugin which can help you: SQLite Manager allows you to create, edit, and view SQLite databases.
The XML DOM makes it very easy to deal with XML in Python. You just feed an XML file into the parser, and then you can access any of the elements using some simple functions. I couldn’t find a really quick and easy way to get at the text content of XML tags (I think there are some proprietary JavaScript extensions which do this better) so I wrote this function:
from xml.dom import minidom  # imports xml library

def xmltext(myelement):  # returns list of text from xml element
    mychildren = myelement.childNodes  # makes list of child nodes
    mytext = []  # list to store text data in
    for x in mychildren:  # loop through list of child nodes
        # only text nodes have a .data attribute, so check the node type first
        if x.nodeType == x.TEXT_NODE and x.data:
            mytext.append(x.data)
    return mytext  # return the list containing data of any text nodes
It takes an XML element object and returns a list of any text data in its child nodes. There are probably better ways to do this but I haven’t found them yet. It works well enough for what I want to do.
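As an illustration of how the pseudo-custom fields in the notes might be read with the same DOM calls, here is a minimal sketch that works on a fragment of the sample record. The helper names `element_text` and `field` are hypothetical, and returning an empty string for an empty or missing element is a design choice made for this example rather than anything the DOM requires.

```python
from xml.dom import minidom

# Fragment of the sample record, with the note tags already restored.
SAMPLE = """<mods>
<note type="content"><datepaid>1642-08-15</datepaid></note>
<note type="content"><payto standard="">
<titlebefore>Sir</titlebefore>
<firstname>Philip</firstname>
<surname>Stapleton</surname>
<titleafter></titleafter>
</payto></note>
</mods>"""

def element_text(element):
    """Join the text nodes directly under an element; empty elements give ''."""
    return ''.join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)

def field(record, tag):
    """Text of the first <tag> inside record, or '' if the tag is missing."""
    found = record.getElementsByTagName(tag)
    return element_text(found[0]) if found else ''

record = minidom.parseString(SAMPLE).documentElement
# Build a display name from whichever name parts are present.
parts = [field(record, tag) for tag in ('titlebefore', 'firstname', 'surname')]
payee = ' '.join(part for part in parts if part)
print(payee, field(record, 'datepaid'))
```

Returning a single string instead of a list means the caller does not have to index into the result, which avoids an error when an element such as `titleafter` is empty.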
Then I wrote this script to pull the location and page number fields out of the XML and insert them into a database:
#import xml library
from xml.dom import minidom
#imports sqlite library
import sqlite3
#imports custom function to get text from xml element
from xmltext import xmltext

#parse xml document into object
xmldoc = minidom.parse('sp281a.xml')
#pull out all records and puts into list
records = xmldoc.getElementsByTagName('mods')
#opens database connection and create cursor
conn = sqlite3.connect('wrtest.sqlite', isolation_level=None)
c = conn.cursor()
#loop through every record
for x in records:
    #get child location elements into list
    location = x.getElementsByTagName('physicalLocation')
    #call function to get text
    loc_text = xmltext(location[0])
    #repeat for the other field
    folio = x.getElementsByTagName('start')
    folio_text = xmltext(folio[0])
    mydata = loc_text[0], folio_text[0]
    #run SQL to insert data into table
    c.execute("insert into warrants values (?,?)", mydata)
conn.close()
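The script assumes that a table called warrants with two columns already exists in the database, for example one made with SQLite Manager. A minimal sketch of creating it from Python instead is given here; the column names `location` and `folio` are guesses, since the post does not say what the columns are called, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

# ':memory:' keeps this example self-contained; the script uses the file 'wrtest.sqlite'.
conn = sqlite3.connect(':memory:')
c = conn.cursor()

# Assumed schema: the column names are hypothetical.
c.execute("create table if not exists warrants (location text, folio text)")

# One row shaped like the (loc_text[0], folio_text[0]) tuple built in the script,
# using the values from the sample record.
rows = [('SP 28/1A', '48')]
c.executemany("insert into warrants values (?,?)", rows)
# The default isolation level is used here, so the insert is committed explicitly.
conn.commit()

stored = c.execute("select location, folio from warrants").fetchall()
print(stored)
conn.close()
```

Both values are stored as text here because that is what comes out of the XML; converting the folio number to an integer would be a reasonable alternative.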
And that’s it for now. Maybe not very exciting in itself, but there’s a lot of potential. There are lots more things that I need to do, including writing some code to add up pre-decimal currency.
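Adding up pre-decimal currency could look something like the sketch below, which relies on there being 12 pence to the shilling and 20 shillings to the pound. The function names are hypothetical, and the sketch assumes whole pence only, so halfpennies and farthings would need extra handling.

```python
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20
PENCE_PER_POUND = PENCE_PER_SHILLING * SHILLINGS_PER_POUND  # 240

def to_pence(pounds, shillings, pence):
    """Convert a pounds, shillings, pence amount to a total number of pence."""
    return pounds * PENCE_PER_POUND + shillings * PENCE_PER_SHILLING + pence

def from_pence(total):
    """Convert a total number of pence back to (pounds, shillings, pence)."""
    pounds, rest = divmod(total, PENCE_PER_POUND)
    shillings, pence = divmod(rest, PENCE_PER_SHILLING)
    return pounds, shillings, pence

def add_lsd(amounts):
    """Add up a list of (pounds, shillings, pence) tuples."""
    return from_pence(sum(to_pence(*amount) for amount in amounts))

# The first amount is from the sample record; the other two are made up
# purely to show pence and shillings carrying over.
total = add_lsd([(69, 0, 0), (1, 19, 11), (0, 0, 1)])
print(total)
```

Doing the arithmetic in pence and converting back only at the end avoids having to carry between columns by hand.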

Comment by Gavin Robinson — 10:59 am, 21 December 2007
Actually that function to get the text from an element doesn’t work properly if there isn’t any text, and it’s probably unnecessary anyway. At first I was just using firstChild to extract the text, but then I thought “what if there happens to be more than one child node?”. Well, with the custom XML I’m using the elements I’m interested in won’t have any children other than the text they contain, and if they do it means I’ve done something wrong. Maybe in future I’ll have to think of something more sophisticated but for now it seems like I’m trying to solve a problem that I don’t really have.
Comment by Mike Cosgrave — 7:49 pm, 23 December 2007
This is very useful stuff!
However, since Zotero uses sqlite as its database, surely you can use any sqlite front end to pull out the records you want directly into a separate database in order to perform analysis? For some research, this could be handier than exporting and re-importing the data.
I wonder if OpenOffice Base can connect to SQLite? Then you could just populate a Calc sheet with the data, and work out the pounds-shillings-pence cost of supplies, regardless of whether they were delivered in tuns, kilderkins, pecks, barrels or bushels! Maybe I’ll get to play with these over the holidays.
Comment by Gavin Robinson — 8:42 pm, 23 December 2007
I know Bill Turkel has mentioned going directly into the Zotero SQLite database, but I haven’t tried it myself. In this case I’m putting in custom XML as notes anyway, and I find XML very easy to work with, so exporting the records as XML seemed like the way to go.
I’m now increasingly thinking that even with custom fields it won’t be practical to use Zotero to store all the data I want. At least some documents are just too big and complicated and will need to be represented in a custom database or XML. For example, the account book I photographed at the PRO last month has around 3,000 entries!