Bibliography Databases

[posted by Gavin Robinson, 10:31 pm, 30 October 2006]

Time to start filling up the “Information Technology” category then. Anyone who isn’t interested in SQL should probably look away now. I’ll be posting some thoughts on Zotero sooner or later, but this post is about my own attempts at making bibliographical databases. I’ve always preferred doing it myself to using off-the-shelf solutions, an approach which has both advantages and disadvantages.

When I started my PhD (in 1997) I was using file cards to keep track of my reading. Once I got a laptop (not until 1999) I designed an Access database to replace the cards. It was quite simple. The records were all stored in one table with the standard fields such as author, title, publisher, place etc. Originally it could only store one location for each work. This was no problem because I had a fixed hierarchy of where to find and consult books: Reading University library was first choice; if something wasn’t there I’d look in the IHR; if it wasn’t there I’d go to the British Library. This proved inadequate when I went home from Reading in 2000 and was mostly working at Leeds University Library, so I had to add extra fields for Leeds locations. Works were assigned to one of four categories: primary, secondary, bibliography, or manuscript list. In addition I added a rudimentary subject search and another field for counties covered. With this basic table and a form as a front end it was quite easy to add data, although data entry was all manual, and by the end of my PhD I had over 800 records. With searches, sorting, filters, and queries I could usually find the records I wanted, and I could use reports to print a nicely formatted list of books to look for. This got me through my PhD, and when I’d finished my thesis (in 2001) I just mailmerged my bibliography into Word. Designing this and other databases during my PhD gave me good experience of Access.

This year I decided I needed something better, which didn’t run on Access 97. I now realise that Access query objects are horribly bloated front ends for what is essentially just a line of SQL code. I decided to use PHP and MySQL running on localhost. I already had a local Apache server with PHP and MySQL (installed painlessly using WAMP) which I use for testing websites. The big advantage of this approach is being able to use HTML pages as a front end rather than learning any APIs. Also I wanted to improve my PHP and MySQL skills, which would be valuable even if the project itself was a failure. Although I had experience of PHP and SQL I’d never used them together before. It turned out that this wasn’t a big deal because combining them only involves learning to use a couple of new functions. The biggest disadvantage was having to program everything from scratch, which was quite time consuming, but this at least added a lot to my programming experience.

Since the database was only ever going to be for my own personal use and was being created from the ground up I had freedom to try some different ideas. It’s still a work in progress, which might be abandoned now that Zotero is out. Data entry is still all manual because I haven’t got round to learning how to automate it. I was able to import all the records from my old Access database but they needed a lot of manual tidying up because the database structures were so different. Because the new database is only for my use on my own PC I haven’t made any attempt to make it secure or efficient. It would easily kill a webserver if it was open to the public and is wide open to SQL injection attacks.
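To make the SQL injection point concrete, here is a minimal sketch. It is not the PHP code behind this database (which isn't shown in the post); it uses Python's built-in sqlite3 module as a stand-in for PHP and MySQL, and the table name, column names, and titles are all made up for illustration. It contrasts a query built by string concatenation with one that uses a placeholder.

```python
import sqlite3

# In-memory stand-in for the real database; schema and titles are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO works (title) VALUES (?)",
    [("The Civil War in Yorkshire",), ("Horses and the Army",)],
)


def search_unsafe(term):
    # Builds the SQL by string concatenation, so the input can rewrite the query.
    sql = "SELECT title FROM works WHERE title LIKE '%" + term + "%' ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


def search_safe(term):
    # The ? placeholder passes the input as data, never as SQL.
    sql = "SELECT title FROM works WHERE title LIKE ? ORDER BY id"
    return [row[0] for row in conn.execute(sql, ("%" + term + "%",))]


# A search term crafted to turn the WHERE clause into
#   title LIKE '%x' OR title LIKE '%'
# which matches every row when concatenated into the query.
malicious = "x' OR title LIKE '"

print(search_unsafe(malicious))  # every title comes back
print(search_safe(malicious))    # nothing matches the literal string
```

On a single-user localhost setup this hardly matters, but the placeholder version costs nothing extra to write. PHP offers the same idea through prepared statements.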

These are some of the features I experimented with and some of their advantages and disadvantages:

Separation of authors and works:

The Access database, seemingly in common with most library catalogues, had a field for authors’ names in the record for each work. With the new database I decided to try a separate table for authors, with a third table to link author IDs to work IDs (and a fourth table to link the IDs if the author(s) had edited rather than written the work; in case you’re wondering I made myself number 1, which is probably enough hubris to ensure that I never get anything published in print). The idea was that it would be easier to keep track of works written by the same author, and to differentiate between authors with the same or similar names. Also, having a master record allows me to keep additional notes on an author and to split the name into constituent parts. This gives much more flexibility, because I can choose whether to output the surname with initials, first forename, full forenames, or just the surname on its own (doesn’t work so well with books credited to an organisation rather than individuals, e.g. “Great Britain - War Office”).
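The three-table layout described above might look something like the following sketch. The real schema isn't given in the post, so every table name, column name, and sample record here is an assumption, the fourth (editors) table is left out for brevity, and Python's sqlite3 module stands in for PHP and MySQL. The format_name helper is likewise hypothetical and only illustrates the output choices that splitting the name makes possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    surname TEXT NOT NULL,
    forenames TEXT NOT NULL DEFAULT ''
);
CREATE TABLE works (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
-- Link table: one row per (work, author) pair.
CREATE TABLE work_authors (
    work_id INTEGER NOT NULL REFERENCES works(id),
    author_id INTEGER NOT NULL REFERENCES authors(id),
    PRIMARY KEY (work_id, author_id)
);
""")

# Made-up sample data.
conn.execute("INSERT INTO authors VALUES (1, 'Smith', 'Jane Mary')")
conn.execute("INSERT INTO works VALUES (1, 'An Example Monograph')")
conn.execute("INSERT INTO work_authors VALUES (1, 1)")


def format_name(surname, forenames, style="initials"):
    """Render a name as initials, first forename, full forenames, or surname only."""
    parts = forenames.split()
    if style == "surname" or not parts:
        # Organisations have no forenames, so the whole name lives in surname.
        return surname
    if style == "initials":
        return surname + ", " + " ".join(p[0] + "." for p in parts)
    if style == "first":
        return surname + ", " + parts[0]
    return surname + ", " + forenames  # "full"


def works_by(author_id):
    """Return the titles of all works linked to one author ID."""
    rows = conn.execute(
        "SELECT w.title FROM works w "
        "JOIN work_authors wa ON wa.work_id = w.id "
        "WHERE wa.author_id = ? ORDER BY w.title",
        (author_id,),
    )
    return [r[0] for r in rows]


print(format_name("Smith", "Jane Mary"))           # Smith, J. M.
print(format_name("Smith", "Jane Mary", "first"))  # Smith, Jane
print(works_by(1))                                 # ['An Example Monograph']
```

Putting an organisation's whole name in the surname field is just one possible workaround for the corporate-author problem; it still sorts and displays oddly next to personal names.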

The biggest problem with this approach is that it’s so different from how most OPACs are structured that fully automated data scraping would be virtually impossible. There would always be a need for some manual intervention in linking authors to works. Selecting the correct author ID is always a subjective decision, and I’ve already had some problems disambiguating authors with the same name. If the author is already in the database, the linking process only involves clicking on links to select the author from the list, but if the author isn’t already present, details have to be entered manually. It would be nice if there was a central database of all published authors with biographical details and complete lists of all works published. Since there isn’t, I think trying to organise my data like this is probably more trouble than it’s worth. There is also an efficiency issue, because retrieving the details of authors requires lots of extra queries. Because of this, searches which return hundreds of works can be noticeably slow.
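One common cause of this kind of slowdown is running a separate author query for every work in the result list. Whether that is exactly what happens here is an assumption, since the code isn't shown, but the sketch below illustrates the difference between that pattern and a single JOIN. It uses Python's sqlite3 module rather than PHP and MySQL, and the schema and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, surname TEXT);
CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE work_authors (work_id INTEGER, author_id INTEGER);
INSERT INTO authors VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO works VALUES (1, 'First Example Work'), (2, 'Second Example Work');
INSERT INTO work_authors VALUES (1, 1), (1, 2), (2, 2);
""")


def authors_per_work_slow():
    """One query for the works, then one more query per work."""
    result = {}
    works = conn.execute("SELECT id, title FROM works ORDER BY id").fetchall()
    for work_id, title in works:
        rows = conn.execute(
            "SELECT a.surname FROM authors a "
            "JOIN work_authors wa ON wa.author_id = a.id "
            "WHERE wa.work_id = ? ORDER BY a.surname",
            (work_id,),
        )
        result[title] = [r[0] for r in rows]
    return result


def authors_per_work_joined():
    """A single query; the rows are grouped by work in application code."""
    result = {}
    rows = conn.execute("""
        SELECT w.title, a.surname
        FROM works w
        LEFT JOIN work_authors wa ON wa.work_id = w.id
        LEFT JOIN authors a ON a.id = wa.author_id
        ORDER BY w.id, a.surname
    """)
    for title, surname in rows:
        result.setdefault(title, [])
        if surname is not None:  # works with no linked author still appear
            result[title].append(surname)
    return result


# Both approaches give the same answer; only the number of queries differs.
print(authors_per_work_slow() == authors_per_work_joined())
```

With hundreds of works in a result set, the first version issues hundreds of queries and the second issues one, which is usually where the time goes.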

Containers:

Journal articles, essays, and volumes of serial publications are entered in the main works table in their own right, but also linked to the containing journal, collection, or series. This means that I only have to enter the details of a journal once and those details can be shared by all the articles linked to that journal. The record for each article stores its own year, volume number, and page numbers, but pulls the journal title from the journal’s record. This gives much more consistency and creates less work, especially with long and obscure titles of local antiquarian journals. Like the separation of authors this can lead to inefficiency and presents problems for fully automated data capture. However, there are no problems with disambiguation like there are with authors and so more likelihood of the computer being able to match journals automatically (although I haven’t tried it yet).
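A rough sketch of the container idea: each work can point at another row in the same works table, and the article's citation pulls the journal title through that link. The column names, the sample journal and article, and the citation layout are all invented for illustration, and Python's sqlite3 module again stands in for PHP and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE works (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year INTEGER,
    volume TEXT,
    pages TEXT,
    -- NULL for stand-alone works; otherwise the containing journal or collection.
    container_id INTEGER REFERENCES works(id)
);
INSERT INTO works (id, title) VALUES (1, 'Example Antiquarian Journal');
INSERT INTO works (id, title, year, volume, pages, container_id)
VALUES (2, 'An Example Article', 1998, '12', '1-20', 1);
""")


def cite(work_id):
    """Build a simple citation, taking the journal title from the container row."""
    row = conn.execute("""
        SELECT w.title, w.year, w.volume, w.pages, c.title
        FROM works w
        LEFT JOIN works c ON c.id = w.container_id
        WHERE w.id = ?
    """, (work_id,)).fetchone()
    title, year, volume, pages, container = row
    if container is None:
        return title
    return f"'{title}', {container}, {volume} ({year}), pp. {pages}"


print(cite(2))

# Correcting the journal title once changes every linked article's citation.
conn.execute("UPDATE works SET title = 'Example Antiquarian Society Journal' WHERE id = 1")
print(cite(2))
```

The self-join keeps the article's own year, volume, and pages on its row while the title of the journal is stored only once.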

Multiple Locations:

Each library that I might use has an ID and a priority number stored in a master locations table. This table is linked to a work locations table, which stores shelf marks of each work in each library that it can be located in, along with the work ID to link to the records in the main works table. When a work record is displayed, it pulls all the locations of that work from the locations table and displays them in order of priority numbers. A copy of the highest priority location of each work is stored in the work record itself to allow more efficient searches. This is automatically updated whenever the work is linked to a new location, or when priorities are changed. Articles and essays can be linked to the location of their containing journal or collection, or be given their own location. There is effectively no limit on the number of locations I can store, and I can change their priorities by editing the numbers in the master locations table. So if I start using a different library as a result of getting a job at a university at the other end of the country the database can easily accommodate the changed circumstances. This flexibility and future-proofing make it a pretty good feature. However, it multiplies the number of tables and queries, so again not good for efficiency.
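The locations arrangement could be sketched roughly as follows. The table and column names, the shelf marks, and the convention that a lower number means a higher priority are all assumptions rather than details from the real database, and Python's sqlite3 module is used in place of PHP and MySQL. The sketch shows locations listed in priority order, plus the cached top location being refreshed when a priority changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT, priority INTEGER);
CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT, top_location_id INTEGER);
CREATE TABLE work_locations (work_id INTEGER, location_id INTEGER, shelf_mark TEXT);
INSERT INTO locations VALUES (1, 'Leeds University Library', 1), (2, 'British Library', 2);
INSERT INTO works VALUES (1, 'An Example Work', NULL);
INSERT INTO work_locations VALUES (1, 2, 'X.100/1'), (1, 1, 'History B-1');
""")


def locations_for(work_id):
    """All locations of a work, best (lowest priority number) first."""
    rows = conn.execute("""
        SELECT l.name, wl.shelf_mark
        FROM work_locations wl
        JOIN locations l ON l.id = wl.location_id
        WHERE wl.work_id = ?
        ORDER BY l.priority
    """, (work_id,))
    return rows.fetchall()


def refresh_top_locations():
    """Recompute the cached highest-priority location stored on each work."""
    conn.execute("""
        UPDATE works SET top_location_id = (
            SELECT wl.location_id
            FROM work_locations wl
            JOIN locations l ON l.id = wl.location_id
            WHERE wl.work_id = works.id
            ORDER BY l.priority
            LIMIT 1
        )
    """)


def set_priority(location_id, priority):
    """Change a library's priority and keep the cached copies in step."""
    conn.execute(
        "UPDATE locations SET priority = ? WHERE id = ?", (priority, location_id)
    )
    refresh_top_locations()


refresh_top_locations()
print(locations_for(1))  # Leeds first, then the British Library
set_priority(2, 0)       # moving house: the British Library is now first choice
print(locations_for(1))  # British Library first
```

The cached top_location_id is the trade-off mentioned above: searches can read it straight from the works table, at the cost of an extra update whenever a link or a priority changes.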

No distinction between primary and secondary works:

This is an absence of a feature rather than a feature, but it needs explaining. Originally I did have a field for primary or secondary, just like the old Access database. Then I decided to see if I could do without it. Since getting interested in theory I’ve been finding the distinction between primary and secondary at best blurred and at worst redundant and even misleading. The database seems to be working well enough without it, but the empirical part of my brain still knows the difference so I’ve either failed to break out of the old ideology, or succeeded in keeping my grip on reality (depending on your point of view).

Overall creating the new database has been good experience and I’ve learnt a lot from it, but I can’t help wondering if the time taken up by design, programming, and data entry could have been better spent. Even if Zotero turns out to not do everything I want, it has a lot of advantages over anything I could create myself. One of the most exciting things is the potential for collaboration, especially if it becomes a de facto standard for academics.
