Zotero, XML, Python, and SP28

Since my last post I’ve been doing some more experiments to see how Zotero can be used for cataloguing previously uncatalogued administrative records from the English Civil War. I’ve now put some more of my ideas into practice in demo form and they seem to work. Linking images to Zotero items and adding metadata went very smoothly. The idea of adding extra data by putting XML tags in notes also works, although this is just a stopgap until the Zotero developers implement custom fields. Once you have data in Zotero it’s very easy to export it as XML and do whatever you want with it. More details below, but it gets a bit technical and even includes some sample code (formatting code in WordPress is hard, and it’ll probably screw up the layout for some people). If you’re not A. Nerd and you’re not doing the shopping for your mum you might want to stop reading now.

First of all creating the items. I had a sample of 50 images of pay warrants from SP28/1A to use for this test run. This wasn’t quite a fair test as I knew in advance that they were all exactly the same kind of document: warrants authorized by the Earl of Essex in August 1642. In practice you don’t always get unbroken runs as there are sometimes other kinds of documents mixed in with the main series of warrants. So I pointed MozImage at the directory where the images were and went through each one, creating a new item, linking to the image file, and adding the metadata. The template hack that I mentioned in the last post worked really well here, saving a lot of time and effort. I realized that I was barking up the wrong tree with my hope that I could keep multiple templates in the same RDF file and select one by clicking on the folder. It makes much more sense to keep one template in each RDF file and attach a snapshot of it to the item that you created it from. That way you can see exactly which fields are prefilled. Otherwise if you have a lot of these templates it would be difficult to remember exactly what each one did. In this case I prefilled the author, repository, and location fields, and also set up notes with the XML tags that I used as pseudo-custom fields. These are to record the amounts of money due and paid, the date of payment (often different from the date that the warrant was authorized), and the names of the people who were paid.

It took about two and a half hours to create items, link images, and fill in all the data for these 50 documents. That’s an average of about three minutes per document, which isn’t too bad, and might improve with practice and/or more timesaving shortcuts (if I can think of any). It only took 10 minutes to photograph these 50 documents (about 12 seconds each), which is faster than my more recent experiment with an account book (see here), I think partly because I didn’t need to use weights for this volume. So for a total of roughly 3 minutes 12 seconds per document I have a digital image of the document plus the most important data in machine readable form. That compares quite favourably with taking notes by hand and then copying them onto file cards, which is how I dealt with these same documents 10 years ago.

Once the data was entered I exported the collection to a MODS XML file. When Zotero exports notes to XML any angle brackets in the note are converted to entity references, but it takes hardly any time to run a Find and Replace (I use jEdit for this) to turn them back. This is a sample record from the file, including the custom XML fields in the notes:
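The Find and Replace step could also be scripted. This is only a rough sketch, not what I actually did: it assumes the escaped tags in the export use just the `&lt;`, `&gt;` and `&quot;` entities, and the function name `restore_tags` is made up for illustration. It only touches text inside `<note>` elements, so entities elsewhere in the file are left alone.

```python
import re

def restore_tags(mods_text):
    """Turn escaped tags back into real ones, but only inside <note> elements."""
    def unescape_note(match):
        body = match.group(2)
        # assumption: these three entities are the only ones used for the custom tags
        body = body.replace('&lt;', '<').replace('&gt;', '>').replace('&quot;', '"')
        return match.group(1) + body + match.group(3)
    # group 1 = opening note tag, group 2 = note content, group 3 = closing tag
    return re.sub(r'(<note\b[^>]*>)(.*?)(</note>)', unescape_note,
                  mods_text, flags=re.DOTALL)

sample = '<note type="content">&lt;datepaid&gt;1642-08-15&lt;/datepaid&gt;</note>'
print(restore_tags(sample))
# prints: <note type="content"><datepaid>1642-08-15</datepaid></note>
```

If a note contained a genuine ampersand or a literal less-than sign this would need more care, so it is worth checking the result still parses as XML.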

<mods>
<titleInfo>
<title>Warrant</title>
</titleInfo>
<typeOfResource>text</typeOfResource>
<genre authority="local">manuscript</genre>
<genre>Warrant</genre>
<name type="personal">
<namePart type="family">Essex</namePart>
<namePart type="given">Earl of</namePart>
<role>
<roleTerm type="code" authority="marcrelator">aut</roleTerm>
</role>
</name>
<part>
<extent unit="pages">
<start>48</start>
<end>48</end>
</extent>
</part>
<originInfo>
<dateCreated>1642-08-11</dateCreated>
</originInfo>
<location>
<physicalLocation>SP 28/1A</physicalLocation>
</location>
<note type="content"><amountdue>
<pounds>69</pounds>
<shillings>0</shillings>
<pence>0</pence>
</amountdue>
</note>
<note type="content"><amountrecd>
<pounds>69</pounds>
<shillings>0</shillings>
<pence>0</pence>
</amountrecd>
</note>
<note type="content"><datepaid>1642-08-15</datepaid></note>
<note type="content"><payto standard="">
<titlebefore>Sir</titlebefore>
<firstname>Philip</firstname>
<surname>Stapleton</surname>
<titleafter></titleafter>
</payto></note>
<note type="content"><recdby standard="">
<titlebefore></titlebefore>
<firstname>Edward</firstname>
<surname>Constable</surname>
<titleafter></titleafter>
</recdby></note>
</mods>

You could easily do almost anything with this data now. The one exception is putting it back into Zotero and synchronizing it with the original records, which might be problematic, but all I want to do is analyze the data outside Zotero. I just tried writing a Python script to pull some of the fields into a SQLite database, which isn’t very exciting, but everybody’s got to start somewhere. This is the first time I’ve tried writing Python code which actually does anything. If you’re dealing with SQLite there’s another useful Firefox plugin which can help you: SQLite Manager allows you to create, edit, and view SQLite databases.

XML DOM makes it very easy to deal with XML in Python. You just feed an XML file into the parser, and then you can access any of the elements using some simple functions. I couldn’t find a really quick and easy way to get at the text content of XML tags (I think there are some proprietary Javascript extensions which do this better) so I wrote this function:

from xml.dom import minidom #imports xml library

def xmltext(myelement): #returns list of text from xml element
    mychildren = myelement.childNodes #makes list of child nodes
    mytext = [] #list to store text data in
    for x in mychildren: #loop through list of child nodes
        #only text nodes carry text data; element nodes have no data attribute
        if x.nodeType == x.TEXT_NODE and x.data:
            mytext.append(x.data) #append non-empty text to list
    return mytext #return the list containing data of any text nodes

It takes an XML element object and returns a list of any text data in its child nodes. There are probably better ways to do this but I haven’t found them yet. It works well enough for what I want to do.
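One possible shortcut, which I haven’t used for the real data, is the ElementTree module in the standard library: its `findtext` method returns the text of the first element matching a path. The sketch below uses a cut-down copy of the sample record, and assumes the export has no XML namespace declared on the `mods` elements (if it does, the paths would need the namespace adding).

```python
import xml.etree.ElementTree as ET

# cut-down version of the sample record, for illustration only
record = """<mods>
<part><extent unit="pages"><start>48</start><end>48</end></extent></part>
<location><physicalLocation>SP 28/1A</physicalLocation></location>
<note type="content"><datepaid>1642-08-15</datepaid></note>
</mods>"""

mods = ET.fromstring(record)  # parse the string into an element
# findtext follows the path from the element and returns the text it finds
location = mods.findtext('location/physicalLocation')
folio = mods.findtext('part/extent/start')
date_paid = mods.findtext('note/datepaid')
print(location, folio, date_paid)
# prints: SP 28/1A 48 1642-08-15
```

`findtext` returns None when nothing matches the path, so missing fields would need checking for before they go anywhere important.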

Then I wrote this script to pull the location and page number fields out of the XML and insert them into a database:

#import xml library
from xml.dom import minidom
#imports sqlite library
import sqlite3
#imports custom function to get text from xml element
from xmltext import xmltext

#parse xml document into object
xmldoc = minidom.parse('sp281a.xml')
#pull out all records and puts into list
records = xmldoc.getElementsByTagName('mods') 

#opens database connection and create cursor
conn = sqlite3.connect('wrtest.sqlite', isolation_level=None)
c = conn.cursor()

#loop through every record
for x in records:
    #get child location elements into list
    location = x.getElementsByTagName('physicalLocation')
    #call function to get text
    loc_text = xmltext(location[0])
    #repeat for the other field
    folio = x.getElementsByTagName('start')
    folio_text = xmltext(folio[0])

    mydata = loc_text[0], folio_text[0]
    #run SQL to insert data into table
    c.execute("insert into warrants values (?,?)", mydata)

conn.close()
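The script assumes that the database already contains a table called warrants with two columns (I made mine with SQLite Manager). For anyone who would rather create it from Python, something like this should do it. It’s only a sketch: the column names `location` and `folio` are my guess at sensible names, and it uses an in-memory database so it doesn’t touch any real file.

```python
import sqlite3

#in-memory database, for illustration only
conn = sqlite3.connect(':memory:')
c = conn.cursor()

#create the two-column table if it is not already there
#(column names are assumptions, not taken from the real database)
c.execute("create table if not exists warrants (location text, folio text)")

#insert one row the same way the main script does
c.execute("insert into warrants values (?,?)", ('SP 28/1A', '48'))
conn.commit()

#read it back to check that it worked
rows = c.execute("select location, folio from warrants").fetchall()
print(rows)
# prints: [('SP 28/1A', '48')]
conn.close()
```

The folio is stored as text here because that is what comes out of the XML; it could be converted to an integer first if you wanted to sort on it.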

And that’s it for now. Maybe not very exciting in itself, but there’s a lot of potential. There are lots more things that I need to do, including writing some code to add up pre-decimal currency.
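As a first stab at the currency problem, the simplest approach is probably to convert everything to pence (12 pence to the shilling, 20 shillings to the pound), add up, and convert back. This is an untested-on-real-data sketch; the function names are made up, and apart from the £69 warrant the amounts in the example are invented.

```python
def to_pence(pounds, shillings, pence):
    """Convert a pounds/shillings/pence amount into a total number of pence."""
    return (pounds * 20 + shillings) * 12 + pence

def from_pence(total):
    """Convert a number of pence back into (pounds, shillings, pence)."""
    pounds, rest = divmod(total, 240)  # 240 pence in a pound
    shillings, pence = divmod(rest, 12)  # 12 pence in a shilling
    return pounds, shillings, pence

def add_lsd(amounts):
    """Add up a list of (pounds, shillings, pence) tuples."""
    return from_pence(sum(to_pence(*a) for a in amounts))

# 69 pounds, plus two invented amounts: 1 pound 15s 8d and 4s 6d
print(add_lsd([(69, 0, 0), (1, 15, 8), (0, 4, 6)]))
# prints: (71, 0, 2)
```

Working in whole pence keeps everything as integers, so there are no rounding problems, although halfpennies and farthings would need a smaller base unit.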

Digital History, English Civil War, Military — posted by Gavin Robinson, 7:43 pm, 20 December 2007
