Text Theories: Information
As the next stage of my Digital History Projects I’ve been doing background reading and thinking about the theory of text. This week I’ve read Schreibman, Siemens, and Unsworth A Companion To Digital Humanities (2004); Burnard, O’Brien, O’Keeffe, and Unsworth Electronic Textual Editing (2006); Susan Hockey Electronic Texts in the Humanities (2000); and C. E. Shannon ‘A Mathematical Theory of Communication’ (1948). I can’t say that I understood everything (especially Shannon’s equations and Jerome McGann’s pretentious jargon) but it’s given me a lot to think about, and things are nowhere near as simple as I first assumed.
What is text? What is a text? It turns out that there are no easy answers to these questions. While I was right to think that digitization avoids some of the epistemological problems of history, allowing readers to make their own decisions about the relationship between text and reality, digital text presents plenty of new problems which could be equally intractable. A text is not necessarily the same thing as a book or an article or a play. Things get really complicated when there are differing versions of a text, as is often the case with medieval manuscripts. Should we classify them as the same text with some differences, or different texts with some similarities? The separation of information and meaning is an important concept which can allow us to think more clearly about what we’re doing, but in practice, the separation is not necessarily easy to make. This is how Shannon introduced the idea:
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.
By Shannon’s definition, the sequence of characters contained in a book can be considered to be information. We can select this message and attempt to reproduce it exactly without having to worry about meaning. At the very least, we can fall back on structuralism, since an alphabet is a fixed arbitrary system in which the characters are identified by the differences between them. There is no fixed relationship between a character and the sound it represents (for example, characters in the Latin alphabet can be pronounced differently in English, French and German). The same character might be represented in different ways. In modern print there are different typefaces which can be used to represent the same characters. In early-modern handwriting, letter forms are often very different from modern forms, and the same character might have different forms in the same word (especially s). This fits in with Saussure’s distinction between langue and parole: the size, weight, font, and even form of a character might vary, but it can still be identified as the same character in relation to the system it comes from. [EDIT: the proper words for this are substantives and accidentals] This is not to say that parole is unimportant. Typography can have a significant effect on how text is perceived and understood, just as regional accents can signify group identities and influence how speech is understood. However, it is useful to be aware of distinctions. Susan Hockey points out that computers force us to concentrate on what we are doing and why (p. 3). They also force us to analyse everything more systematically rather than assume that anything “just is”. I’ve already had to break text down into information and meaning, with meaning further broken down into equivalents of langue and parole, and I’m still a long way from having any idea of what text “is”.
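The difference between a character and the form it takes can be made concrete with Unicode normalization. This is only a sketch: using Python's standard unicodedata module with the NFKC form is an assumption made here for illustration, not something the works above discuss. It folds a few typographic variants (the long s, the fi ligature) into plain characters, and in doing so throws away exactly the accidentals that a scholarly transcription might want to keep.

```python
import unicodedata


def fold_accidentals(text: str) -> str:
    """Replace compatibility variants (long s, ligatures) with base characters.

    NFKC is used here as a rough stand-in for "same character, different
    form". It is an illustrative choice, not an editorial standard.
    """
    return unicodedata.normalize("NFKC", text)


# "\u017f" is the long s and "\ufb01" is the single-glyph fi ligature.
early_modern = "Ble\u017f\u017fed \ufb01\u017fh"
print(fold_accidentals(early_modern))  # prints: Blessed fish
```

Whether to store the folded or the unfolded sequence is an editorial decision; the code only shows that the two can be told apart systematically.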
In theory it should be easy to transmit a sequence of characters, even a long sequence such as a book. In practice, getting an accurate electronic transcript of printed text is one of the biggest problems for digital humanities projects. Whether using OCR or human double-keying, getting acceptable accuracy is difficult and expensive, and perfection seems unattainable. This is surprising when you consider that printed characters and ASCII codes seem to meet Shannon’s definition of a discrete channel: “Generally, a discrete channel will mean a system whereby a sequence of choices from a finite set of elementary symbols can be transmitted from one point to another.” So what’s the problem?
Shannon’s model gives us five parts of the communication system:
- Information source
- Transmitter
- Channel
- Receiver
- Destination
The transmitter converts the message from the information source into a signal, and the receiver converts it back into a message which can be understood by the destination (usually a person). Shannon’s theory is mainly concerned with maintaining the integrity of a signal in the channel by calculating how much redundancy is required for a given level of noise. In terms of digitization projects, this is all about the electronic working of the computer and its peripherals. Thanks to the application of Shannon’s theory, we can usually be sure that when we press the “a” key on the keyboard, the “a” character will appear on the screen (the keyboard can be seen as the transmitter, and the screen as the receiver).
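A toy version of this model shows how redundancy protects a message from noise. Everything in the sketch is a hypothetical simplification: the function names, the 27-symbol alphabet, and the crude repetition code (repeat each character, then take a majority vote) are chosen for illustration and are far less efficient than the codes used in real hardware.

```python
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed finite set of symbols


def transmit(message: str, repeat: int) -> str:
    """Transmitter: add redundancy by repeating every character."""
    return "".join(ch * repeat for ch in message)


def channel(signal: str, noise: float, rng: random.Random) -> str:
    """Channel: with probability `noise`, replace a symbol with a random one
    (which may happen to be the same symbol)."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < noise else ch for ch in signal
    )


def receive(signal: str, repeat: int) -> str:
    """Receiver: majority vote within each block of repeated symbols."""
    blocks = (signal[i:i + repeat] for i in range(0, len(signal), repeat))
    return "".join(Counter(block).most_common(1)[0][0] for block in blocks)


def count_errors(original: str, received: str) -> int:
    """Number of positions at which the two strings differ."""
    return sum(a != b for a, b in zip(original, received))


rng = random.Random(0)
message = "the quick brown fox " * 100
plain = receive(channel(transmit(message, 1), 0.1, rng), 1)
coded = receive(channel(transmit(message, 5), 0.1, rng), 5)
print("errors without redundancy:", count_errors(message, plain))
print("errors with fivefold repetition:", count_errors(message, coded))
```

The price of the lower error count is a signal five times as long, which is the trade-off between speed and accuracy that runs through the rest of this post.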
With double keying, the real problem is what happens between the source and the transmitter. Shannon wasn’t too worried about this, implicitly assuming that the person at the source selected the message they wanted to select. Even if they didn’t, he points out that the redundancy of the English language is about 50%, meaning that roughly half of what we write is determined by the structure of the language rather than freely chosen, so a message with a fair number of missing or wrong characters will probably still be intelligible to the recipient. Academic projects demand far more than mere intelligibility, and also need to preserve mistakes from the original text, which makes things more complicated.
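Redundancy in Shannon's sense is one minus the ratio of a source's entropy to the maximum entropy its alphabet allows. The sketch below estimates it from single-character frequencies only, which is a deliberate simplification: Shannon's figure of roughly 50% takes account of statistical structure over spans of up to about eight letters, so a first-order estimate like this one will come out well below it for ordinary English. The function names and the 27-symbol alphabet in the usage note are assumptions for illustration.

```python
import math
from collections import Counter


def entropy_bits(text: str) -> float:
    """First-order entropy in bits per character, from observed frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def first_order_redundancy(text: str, alphabet_size: int) -> float:
    """1 - H / H_max, where H_max = log2(alphabet_size)."""
    return 1 - entropy_bits(text) / math.log2(alphabet_size)
```

Calling `first_order_redundancy(sample, 27)` on a long lowercase sample (26 letters plus the space) gives a lower bound on the redundancy of that sample; capturing the rest would need statistics on pairs, triples, and longer runs of characters.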
We could perhaps see the keyer as another communication system, which introduces its own noise by misreading or mistyping the characters. According to Shannon, a noisy channel can transmit a message with as few errors as we like, provided that it’s transmitted slowly enough (below the capacity of the channel) and with sufficient redundancy. This applies to typists as much as it applies to telegraph wires. Typing very slowly and carefully will reduce the number of mistakes you make. Having more people rekeying the same text will reduce the overall number of errors. In practice there will always be a probability, however small, that a mistake will be missed. It’s also likely that people will make similar mistakes in reading and typing, rather than introducing completely random errors (I don’t know if any cognitive psychologists have done any experiments on this, but it would be interesting to see if there’s any empirical evidence to back up this suspicion).
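The logic of double keying can be sketched by comparing two keyings of the same passage and flagging every place where they disagree. This is a minimal illustration using Python's standard difflib module, not a description of how any particular keying service works, and the sample strings are invented. It also shows the weakness raised above: an error that both typists make in the same place produces no disagreement and so goes unnoticed.

```python
from difflib import SequenceMatcher


def disagreements(keying_a: str, keying_b: str) -> list[tuple[str, str, str]]:
    """Return (kind, text_in_a, text_in_b) for every span where keyings differ."""
    matcher = SequenceMatcher(None, keying_a, keying_b, autojunk=False)
    return [
        (tag, keying_a[i1:i2], keying_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


source = "the quick brown fox"        # invented sample passage
typist_one = "the quick brown fox"    # keyed correctly
typist_two = "the qnick brown fox"    # misread u as n

# An independent error shows up as a disagreement for a human to resolve.
print(disagreements(typist_one, typist_two))  # [('replace', 'u', 'n')]

# A shared error is invisible: both keyings agree, and both are wrong.
print(disagreements(typist_two, typist_two))  # []
```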
If time and money are unlimited it should be possible to make transcription errors negligible by employing large numbers of typists and making them type very slowly and carefully. However, we all know that major digital humanities projects don’t have unlimited time and money. Getting the right balance is important, as is having realistic expectations. Are the demands of digitization projects too high for the available techniques? Are time and budget considerations pushing text keying beyond its limits and making errors inevitable?
OCR is attractive because it offers the possibility of automating text capture, bypassing the expense and unreliability of humans. However, Cohen and Rosenzweig cite studies which show that when the time and cost of proofreading and correcting OCR text are taken into account, double keying works out more cost-effective as well as more accurate. This is because computers are much worse at recognizing characters than humans are. You can scan a document at 300 dpi, and those dots will appear in the same sequence on the screen. Perfect transmission, or near enough. But when the computer tries to select a message from those dots as a “sequence of choices from a finite set of elementary symbols” things often go wrong. This is immensely frustrating, because to a human it seems like such a simple task. We can hope that advances in Artificial Intelligence will eventually lead to reliable OCR, but it’s not going to be an easy problem to solve. (The ultimate proof of the unreliability of OCR is that the online version of the Companion To Digital Humanities is full of scannos!)
As it is, OCR text needs to be proofread by at least one human. Distributed Proofreaders now use three rounds of proofing (followed by two rounds of formatting). Because of the “open source” nature of the project, which is run by unpaid volunteers, time and cost don’t need to be considered at all. A text is ready when it’s ready, and nobody has to pay for it. This makes triple proofing more feasible than in a funded project. However, it might also be the case that more proofing is required because the proofreaders themselves are an unknown quantity. As I haven’t qualified for round two yet, I don’t yet know how much time round two proofers spend correcting errors introduced by less experienced proofers in round one. Radical trust is feasible provided you get a critical mass of responsible users (which DP appears to have attained) and offers some interesting possibilities. Large numbers of unpaid volunteers doing small amounts of work very carefully might overcome some of the problems of big digitization projects, although it might also bring problems of its own.
Allowing users to supply corrections after publication could also help to increase the accuracy of transcriptions. This is even more radical than the DP model, and gives some traditionally minded people the fear. Berrie, Eggert, Tiffin, and Barwell take a very traditional view of authentication which is based on the assumption that editors can make a text perfect and that it will deteriorate if not controlled (even more depressing is their emphasis on defending copyright). In the light of everything I’ve discussed so far, I suggest that the opposite might be true: it is impossible for any individual editor or team of editors to produce a perfect text, and the more people who are involved in correcting errors, the more accurate the transcription is likely to be. Wikipedia shows that with a critical mass of committed and responsible users, even deliberate vandalism can be overcome. This is not to say that every electronic text has to be, or can be, as open as Wikipedia. The most obvious problem is getting that critical mass of users, and this will be more difficult for more esoteric projects in which fewer people are likely to take an interest. At the very least, there should be some mechanism for users to suggest corrections, even if these have to be reviewed before being implemented. For example, the Old Bailey Proceedings has a form for submitting errors.
My current projects are small enough that I can take a lot of time and care over them, but I also want to develop techniques that will scale up, otherwise the experience will be of limited value. The relatively small amount of text to be dealt with means that in absolute terms there are likely to be few errors. It’s when you scale things up to millions of words that a small probability of errors can lead to a huge number of errors.
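The arithmetic behind this is simple enough to write down. The figures in the usage note are hypothetical (an error rate of one character in a thousand is not a measured value from any project), and the second function assumes that errors are independent of one another, which the discussion of similar mistakes above suggests is optimistic.

```python
def expected_errors(characters: int, error_rate: float) -> float:
    """Expected number of wrong characters at a given per-character error rate."""
    return characters * error_rate


def chance_of_flawless_copy(characters: int, error_rate: float) -> float:
    """Probability that no character at all is wrong, assuming independence."""
    return (1 - error_rate) ** characters
```

At the assumed rate, `expected_errors(50_000, 0.001)` comes to about 50 wrong characters in a short book, while `expected_errors(50_000_000, 0.001)` comes to about 50,000 in a large corpus. `chance_of_flawless_copy(1_000, 0.001)` is already only a little over a third, so even a single page of text is more likely than not to contain an error somewhere.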
So far I’ve only considered characters as information, and haven’t got any closer to defining what text “is”. For the purposes of digitizing a book, I can avoid that question by setting out my aim as transcribing all of the characters in a particular book. Even though the definition of a book is at least slightly less problematic than the definition of a text, there’s more to a book than a sequence of characters. I’m choosing to represent one aspect of the book while discarding others, such as ink, paper, and binding. This is an arbitrary choice. Partly it’s because of the impossibility of representing the book as a complete physical object within a digital computer. Perfect information of that kind would need to go down to the level of atoms, and we would need some mechanism for reconstructing objects from the information contained in the computer. This is getting into the realms of alchemy, and clearly isn’t possible with any current technology.
But above all, any digitization project needs to look to the requirements of its intended users. From this point of view, it’s the information contained in a book, the sequence of characters making up the message, which potentially has the most value for readers. “Frequently the messages have meaning”. In the next part I’ll be going beyond information and looking at the even greater problems associated with meaning.
Bibliography
- Lou Burnard, John Unsworth, and Katherine O’Brien O’Keeffe, Electronic Textual Editing with CDROM (Modern Language Association of America, September 2006).
- Susan M. Hockey, Electronic Texts in the Humanities (Oxford University Press, November 2000).
- C. E. Shannon, ‘A Mathematical Theory of Communication’, Bell System Technical Journal, 27 (1948), pp. 379-423, 623-656.
- Ray Siemens, John Unsworth, and Susan Schreibman, Companion to Digital Humanities (Blackwell Companions to Literature and Culture) (Blackwell Publishing Professional: Oxford, December 2004).
