Historical Information and Noise

[posted by Gavin Robinson, 8:50 pm, 21 November 2006]

Over the last 10 years or so, technology has brought huge changes to historical research and opened up new possibilities. Computers have solved some old problems, but also created some new ones. Meanwhile there has been an increasing focus on the problems of epistemology: what can we know about the past and how can we know it? The debate has mostly been about the relationship between textual sources and the reality of the past. Even if you reject theory and take a purely empirical view of what the sources can tell us, there are some potential problems with the transmission of the information that they contain.

Information theory makes a clear separation between information and meaning. This was originally because information theory was concerned with the engineering problems of transmitting messages, and from this point of view any meaning which the message might have is irrelevant. Meaning is highly relevant to historians, because sources are not much use if we don’t know what they mean. Post-structuralist theory raises fundamental questions about the meaning of language, and when taken to its most extreme logical conclusion suggests that communication might be impossible. This is a problem which needs to be solved but it isn’t yet clear how that can be done. Treating historical sources as information isn’t a solution in itself, but it might offer a different way to approach the problem, since excluding meaning also excludes the problems associated with meaning.

Leaving aside this idle speculation, the distinction between information and meaning is still an important one. If the information itself is transmitted unreliably, then we have significant problems even before we start worrying about meaning. We don’t even have to go into source criticism to find potential inaccuracy. Sources might not be as original as they claim to be, but for the purposes of this post I’m going to take archival sources at face value. I’m thinking more about what happens to the information after it has been extracted from the original document.

When I started my PhD research in 1997, I was still using the old ways. Most of my time was spent in the Public Record Office reading seventeenth-century manuscripts and making notes with a pencil and paper. Since continuous notes on A4 paper were not flexible enough for any kind of analysis, I would later copy the information onto 6×4 file cards (again by hand) and file them under relevant headings. Some sets of records went onto spreadsheets, but that was limited because I didn’t have my own computer then (I can’t imagine life without one now!). This system had many opportunities for noise to creep into the information.

First of all, there’s the question of whether I’d read the documents accurately. Seventeenth-century secretary hand isn’t easy to read, not least because some of the letters are formed in very different ways from modern handwriting. Progress was slow at first, but my reading ability soon improved with experience. While my transcripts probably got more accurate over time, it’s also likely that my increasing confidence led me to miss some relevant documents by skimming over them too quickly. Time, budget, and the huge quantity of documents I had to get through created pressure for speed at the expense of accuracy. This is a problem which all research projects have to deal with, and there is no ideal solution.

Assuming that I’d read the document correctly, had I copied the information correctly? There were several points where this could be a problem. First, in copying from the original documents to my notes in the archives; second, copying the notes onto file cards; and third, putting examples into the text during the process of writing up my thesis. If the information went onto a spreadsheet or database at a later date, this introduced an extra step and more chance of errors. Taking notes by hand was problematic because I might not be able to read my own handwriting accurately in the future. If I found my pencil notes to be completely illegible, that would deny access to the information unless I went back to the archive and looked at the document again. A more insidious threat was that I would misread my writing without noticing any ambiguity. Context didn’t always help much, because I was often dealing with numbers in sources like account books.

Once I got my hands on a laptop things changed significantly, but there were still potential issues. Being able to type my notes while looking at the original document in the archive removed some of the intermediate steps. For records which would fit easily onto a database I could enter them directly into a table. There was no chance that this information would deteriorate by being copied by hand several times, and no chance that it would turn out to be illegible. However, typing can introduce some errors more easily than writing. Keying errors can occur frequently even when you know what the document says and know what you’re trying to type. Autocorrect claims to reduce these kinds of errors, but can also introduce errors of its own, and doesn’t help at all with numerical data. The pressures of time and cost still apply, and repetitive data entry can lead to loss of concentration.

These problems are not unique to my PhD. There are some major research projects in progress which rely heavily on Access databases, with research assistants in the archives entering data directly from original sources. Team projects have the advantage of experienced leadership, thorough training, and regular assurance checks. All of these things will help to reduce errors, but they can never be eliminated completely. Wherever it’s necessary to send research assistants to look at original documents, it’s usually only practical to have one assistant look at each document once. Management assurance checks are only ever likely to be random samples, although the frequency of sampling would be expected to be highest early in the project when staff are least experienced.

However, the whole idea of taking notes in archives might be coming to an end. Some archives (most notably the Public Record Office) allow the use of cameras to photograph documents. Over the last few years the quality and storage capacity of digital cameras have improved drastically, while prices have kept falling. My current camera cost less than £200. At 3 megapixels the quality is good enough for my needs, and the 1GB memory card can store over 1,000 high quality pictures. This has completely changed the way I do research. Once I’ve identified relevant documents, I can just photograph them and work from the images when I get home. All the uncertainty of my old note-taking is gone. There’s still a chance of missing relevant documents, but this is reduced because the time spent transcribing documents is eliminated. Less time needs to be spent in archives and so research is cheaper.

Looking beyond the individual level, the implications of digital imaging are even bigger, especially combined with the growth of the internet and the increasing availability of broadband connections. It is now possible to digitise whole collections of historical records and make them available on the web. Once this has been done, the costs of working on those records are further reduced for future researchers because few people will need to go to the archives to check the originals. However, simple images of documents are only of limited value. To get the most out of the sources they need to be transcribed into digital text so that they can be searched. XML tagging adds even more value, describing the structure of the document, and classifying pieces of information (such as names of people or places), so that data can easily be extracted into databases while preserving the whole text. Transcribing and tagging the information introduces new possibilities for noise to get in, but the technology mitigates these problems to a certain extent.
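To give a rough idea of what tagging buys you, here is a minimal sketch in Python. The element names (`trial`, `persName`, `placeName`) and the sample sentence are invented for illustration and are not taken from any real project's schema. It pulls the tagged names out into lists, the sort of thing you could load into a database table, while the full text stays recoverable.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged transcription; tag names and content are made up.
sample = """<trial id="example-1">
  <p><persName type="defendant">John Smith</persName> was indicted for stealing a silver tankard in <placeName>Cheapside</placeName>.</p>
</trial>"""

root = ET.fromstring(sample)

# Classified pieces of information can be extracted as structured data.
people = [(el.get("type"), el.text) for el in root.iter("persName")]
places = [el.text for el in root.iter("placeName")]

# The whole text is still there: strip the tags and normalise whitespace.
full_text = " ".join("".join(root.itertext()).split())

print(people)
print(places)
print(full_text)
```

The point is only that the same file serves both purposes: a searchable running text and a source of structured records.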

The Old Bailey Proceedings provides a good case study of a cutting-edge digitisation project. The first phase of the project (now complete) involved imaging 60,000 pages of original trial reports, transcribing millions of words, and inserting XML tags. Working from digital images allowed the use of double-keying: every page was typed twice by different people, and discrepancies between the two versions were flagged by a computer to be checked manually. While this isn’t necessarily cheap, it’s likely to be more cost effective than sending research assistants to archives because the work can be outsourced to home workers. Cost, speed, and accuracy depend on the age and legibility of the source material. Early modern manuscripts still require palaeography skills, which makes it harder to find suitable staff and potentially increases the cost (although I imagine there are plenty of people with PhDs in early modern history who can’t get jobs!).
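The comparison step in double-keying can be sketched in a few lines of Python. This is my own simplified illustration, not the software actually used on the project: the function name `flag_discrepancies` and the sample sentences are invented, and it assumes both keyers typed the same number of words in the same order.

```python
from itertools import zip_longest

def flag_discrepancies(first_keying, second_keying):
    """Return (position, word_a, word_b) wherever the two keyings differ."""
    words_a = first_keying.split()
    words_b = second_keying.split()
    flagged = []
    # Compare word by word; a missing word on one side shows up as "".
    for position, (a, b) in enumerate(zip_longest(words_a, words_b, fillvalue="")):
        if a != b:
            flagged.append((position, a, b))
    return flagged

# Two hypothetical keyings of the same line; the second has a slip.
keying_a = "the prisoner was indicted for stealing a silver tankard"
keying_b = "the prisoner was indited for stealing a silver tankard"

discrepancies = flag_discrepancies(keying_a, keying_b)
print(discrepancies)  # [(3, 'indicted', 'indited')]
```

A real system would need to cope with dropped or inserted words, which throw a simple position-by-position comparison out of step; something like Python's `difflib` module would be a more robust starting point. Note also that if both keyers make the identical mistake, nothing is flagged at all.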

The double keying process used by HEDS claims an accuracy rate of 99.8% provided that the text in the digital images is legible. In practice, historical records are often not legible enough to be transcribed accurately. The Old Bailey Proceedings website shows that this can be overcome to a certain extent because the images of the documents are available online along with the digitised text. Wherever it’s noted that text couldn’t be transcribed satisfactorily you can click on a link to the image and try to decipher it yourself. Things are potentially more tricky in cases where the text is legible but a transcription error has made it through the double keying process, because users won’t necessarily check the image if they think there isn’t a problem with the text. However, academics who need to be certain of accuracy and who are used to travelling to archives won’t find it much trouble to click on the link to the image and make sure.

If human errors were completely random, double keying would dramatically cut the chance of an error slipping through. For example, if the probability of one keyer mistranscribing each word was 0.1, then the probability of both keyers independently mistranscribing that same word would be 0.1 × 0.1 = 0.01, and an error only escapes the comparison if both keyers make the identical mistake, which is less likely still (I think that’s right but I have a few doubts, as it’s a long time since I did any maths! I just hope this doesn’t turn into something like the notorious circle thread at The Valve). In practice it’s likely that human errors aren’t perfectly random, and that there is a greater chance of people making the same mistakes. If HEDS meets its accuracy target of 99.8% for the 70 million words in the next phase of the Old Bailey/Plebeian Lives project, you might expect 140,000 of those words to be mistranscribed. That might not be true in practice, because if the 0.2% error rate is calculated from letters rather than words, the figure doesn’t translate directly into a count of wrong words. The wrong letters won’t necessarily be evenly distributed through words: they might be concentrated in a few badly garbled words, so the number of wrong words would be less than the number of wrong letters. In any case, this error rate isn’t as bad as it sounds, because the text will be checked again by data developers during the XML tagging process, and presumably a certain percentage of their work will be checked by managers, so many of those errors will be caught.
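For anyone who wants to check the arithmetic, here it is as a short Python sketch. The 0.1 figure is just the illustrative value from the paragraph above, and the calculation assumes the two keyers' errors are independent, which (as noted) is unlikely to hold exactly. Fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Illustrative per-word error probability for a single keyer.
p_single = Fraction(1, 10)

# If the two keyers err independently, the chance that both
# get the same word wrong is the product of the two probabilities.
p_both = p_single * p_single

# Expected number of wrong words at a 99.8% per-word accuracy rate.
total_words = 70_000_000
accuracy = Fraction(998, 1000)
expected_wrong = total_words * (1 - accuracy)

print(p_both)          # 1/100
print(expected_wrong)  # 140000
```

The second figure only holds if the 99.8% is measured per word; a rate measured per letter would need a different calculation.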

Ultimately human error is unavoidable. Best practice is to design and implement systems which minimise errors as far as possible, although constraints of time and budget place limits on accuracy. The most basic step is to never rely on one person. It’s also important to combine people and computers so that they cover each other’s weaknesses. Technology has vastly improved the systems for transmission of historical information. The challenges faced at the cutting edge of digitisation are mostly down to the unprecedented scale of the latest projects. Ten years ago, transcribing 70 million words would have been impractical, prohibitively expensive, and perhaps unimaginable.
