Text Theories: Meaning
In my previous post about theories of digital text, I used Shannon’s communication theory to divide text into information and meaning, and then talked exclusively about text as information: a sequence of characters selected from a finite set. That allowed me to concentrate on one part of the problem, while excluding the more difficult problems associated with meaning. In this post, I’ll be trying to tackle some of the problems of meaning, while still trying to avoid as many as I can. I will also continue to avoid offering concrete definitions of “text” and “a text”, mainly because I haven’t found any satisfactory definitions yet, but I won’t be able to avoid using the word “text”.
When scanning, OCR, and proofreading are complete (meaning you’ve gone as far as you can or will go with it; last time I suggested that proofing can never be truly complete), you are left with one or more plain text files which contain reasonably accurate information. There is likely to be some noise in the form of wrongly transcribed characters, but you should have selected methods which give a level of accuracy that is acceptable to your project, and you should have quality assurance checks in place to confirm that the work meets your minimum requirements. The information you have in the text file is a sequence of characters which more or less matches the sequence of characters in the book. What, if anything, should you do next?
You could just put the text file on the internet as it is and let users worry about what it means. This is the approach taken by Project Gutenberg. Their texts are all made available as plain text files, with some also available as HTML. This approach is based on the assumption that digitized texts should conform to the lowest common denominator, and that any additional markup might reduce cross-compatibility and make the files inaccessible in the future. I don’t entirely agree with this view. TEI XML is a widespread standard and looks like it will remain so for a long time. XML is not a proprietary format. As far as the file system is concerned, XML files are no different from plain text files: both can be read and edited by any text editing software on any platform. The XML Document Object Model should make it easy to update tags in the future, and if XML ever turns out to be totally useless, then you can at least use find and replace to strip it out automatically, leaving you with the original text and no markup. This is not meant to be a criticism of Project Gutenberg. They are doing valuable work in making public domain works more widely accessible, and in developing tools and procedures for collaborative work. Digitizing and proofreading text is necessary before any markup can be added. Project Gutenberg stops before the markup stage, but there’s nothing to stop other people from taking PG text files and adding advanced markup.
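To make the “strip it out” claim concrete, here is a minimal Python sketch of two ways of recovering the plain text from a marked-up fragment. The sample sentence and the names in it are invented for illustration, and the function names are my own. `persName` and `placeName` are genuine TEI elements, but nothing here is specific to TEI.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample fragment; the people and places are hypothetical.
SAMPLE = (
    "<p>Lieut. <persName>Smith</persName> joined the battalion at "
    "<placeName>Lincoln</placeName>.</p>"
)


def strip_with_parser(xml_text: str) -> str:
    """Parse the XML and join every text node, discarding all tags."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


def strip_with_regex(xml_text: str) -> str:
    """Crude find-and-replace: delete anything that looks like a tag.

    This does not decode entities such as &amp; and would be confused by
    a '>' inside an attribute value, so the parser route is safer.
    """
    return re.sub(r"<[^>]+>", "", xml_text)


if __name__ == "__main__":
    print(strip_with_parser(SAMPLE))
    print(strip_with_regex(SAMPLE))
```

On this fragment both routes give back the same character sequence that went in, which is the point: the markup is a removable layer. The two would diverge on a file containing entities or comments, so the regex version is only a last resort.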
Adding markup necessarily involves meaning to a certain extent. Even Project Gutenberg, which aims only at producing plain text editions, isn’t just transmitting the sequence of characters from printed book to ASCII codes. Some characters, such as page numbers and running heads, are omitted. This is a subjective decision about which information to include and exclude in the digital edition, based on what is most likely to be useful to readers. Therefore, there has to be some kind of judgement about what the information means.
While digital text offers more flexibility than printed text, the role of the editor is just as crucial as ever. As Edward Vanhoutte says: “The editor is always present in the organization of the material and the transcription of source documents”. Marking up the basic structure of a document according to established standards like TEI might seem unproblematic, but Susan Hockey points out that even this is an act of interpretation (p. 48). For example, replacing line break characters with paragraph tags makes an assumption about the meaning of line breaks. Hockey cites Huitfeldt’s observation that there are no objective facts about a text (p. 47). Adding TEI XML tags to a text file is imposing an arbitrary taxonomy. Then again, all language is an arbitrary taxonomy. As long as we recognise that nothing actually is what it’s called, those taxonomies can be useful. The taxonomy you choose has to be relevant to your objectives, and therefore you have to know why you are digitizing a text, who the target audience is, and what they are likely to want from it. This is crucial, because there is no perfect way of digitizing a text. It also helps if your taxonomy ties into a system which is widely used and understood, and which does not vary unpredictably. TEI is a good starting point because it’s widely used, and because it is flexible enough to accommodate many different purposes while being fixed enough to prevent too much random slippage.
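The line break example can be sketched in a few lines of Python. This is only an illustration under one stated assumption, which is itself an editorial interpretation: that a blank line marks a paragraph boundary and a single line break is just wrapping. The function name is hypothetical, and a different text would need a different rule.

```python
import re
from xml.sax.saxutils import escape


def paragraphs_from_blank_lines(plain_text: str) -> str:
    """Wrap each blank-line-separated block in <p> tags.

    Assumption (an interpretation, not a fact about the text): blank lines
    separate paragraphs, and single line breaks carry no meaning.
    """
    blocks = re.split(r"\n\s*\n", plain_text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse internal line breaks and runs of spaces into one space.
        text = " ".join(block.split())
        if text:
            # Escape &, < and > so the result stays well-formed XML.
            paragraphs.append(f"<p>{escape(text)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(paragraphs_from_blank_lines("First line\nwraps here.\n\nSecond & last."))
```

Even this small function throws information away: the original positions of the line breaks are lost unless they are recorded some other way, which is exactly the kind of decision Hockey is describing.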
In the interests of preventing slippage and maintaining cross-compatibility, it is vital to apply tags consistently. Julia Flanders points out that this is easier said than done because of the complexity and flexibility of TEI: the same feature could be marked up in several different ways. Again this is a problem of meaning: how do you interpret the meaning of a sequence of characters, and how do you fit that interpretation into your arbitrary taxonomy? Flanders emphasises the importance of documenting procedures, and of revising that documentation whenever a management decision is made about a difficult interpretation or a previously unknown feature. The Old Bailey project has taken an innovative approach to this, using a wiki to co-ordinate XML tagging. In a collaborative project, regular quality assurance checks are necessary to make sure that all team members are following the documentation consistently, and that the documentation is adequate.
While marking up the basic structure of a text (paragraphs, chapters, headings) must be recognised as an act of interpretation and arbitrary classification, it should be relatively unproblematic in practice. This is particularly true of the book I’ll be working on first: a very conventional regimental history published in Britain in the 20th century. Colonel Sandall is hardly Ezra Pound or Jack Kerouac! The structure of a normal book can be seen in structuralist terms: although it’s an arbitrary system which doesn’t necessarily have a fixed relationship with reality, it’s fixed in relation to itself. Most people understand this system, which is much less complex than the whole of a language system, and publishers help to enforce conformity in printed works. Manuscripts are more problematic as they don’t necessarily have such rigid conventions. Interpreting the structure of William Wenham’s letters will be more difficult than interpreting the structure of Sandall’s book, but at least there are some established conventions of letter writing (again we’re not dealing with a modernist stream-of-consciousness here), and field postcards have their own basic structure.
The next stage of markup involves picking out dates, and names of people, places, and organizations, and therefore more subjective interpretation of meanings. At this stage no claims will be made about who or what the names signify. It will only be necessary to decide whether or not a sequence of characters represents a proper noun. Fortunately there is an established convention in English printed books that proper nouns are distinguished by a capital letter. This might even allow some of the work of picking out names to be automated, although the potential for confusion with capital letters at the beginnings of sentences will probably make a lot of human intervention necessary. John Lavagnino points out that names are not always easy to define and delimit. In Sandall’s book, names will often be accompanied by ranks, which makes them easier to spot.
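As a rough illustration of how far automation might go, the Python sketch below tries both cues mentioned above: a rank followed by a capitalised word, and a capitalised word that is clearly in mid-sentence. The list of ranks, the sample sentences, and the function names are all my own assumptions for the sake of the example, not taken from Sandall’s book.

```python
import re

# Hypothetical, incomplete list of ranks and abbreviations.
RANKS = [
    "Private", "Pte.", "Corporal", "Cpl.", "Sergeant", "Sgt.",
    "Lieutenant", "Lieut.", "Captain", "Capt.", "Major", "Colonel", "Col.",
]

# Longest alternatives first, so "Colonel" is tried before "Col.".
_rank_pattern = "|".join(
    re.escape(rank) for rank in sorted(RANKS, key=len, reverse=True)
)
RANKED_NAME = re.compile(
    rf"\b(?P<rank>{_rank_pattern})\s+(?P<name>[A-Z][a-z]+)"
)

# A capitalised word preceded by a lowercase letter, comma or semicolon
# plus a space, so it cannot be the first word of a sentence.
MID_SENTENCE_CAPITAL = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")


def find_ranked_names(text: str) -> list[tuple[str, str]]:
    """Return (rank, surname) pairs such as ("Capt.", "Smith")."""
    return [(m.group("rank"), m.group("name")) for m in RANKED_NAME.finditer(text)]


def find_mid_sentence_capitals(text: str) -> list[str]:
    """Return capitalised words that are not at the start of a sentence."""
    return MID_SENTENCE_CAPITAL.findall(text)


if __name__ == "__main__":
    print(find_ranked_names("Capt. Smith relieved Major Jones, and Private Brown was wounded."))
    print(find_mid_sentence_capitals("The battalion left Lincoln on Monday, and Captain Smith took command."))
```

The second function also picks up “Monday” and “Captain”, and neither function copes with double-barrelled surnames, initials, or compound ranks. That is the point Lavagnino makes: a script can produce a list of candidates, but a person still has to decide what counts as a name and where it ends.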
The third stage of markup is potentially the most contentious because record linkage involves making epistemological claims about the identities of the people referred to by the names. The first question is whether the same name refers to the same person when it occurs in different places within the text. In doubtful cases the book’s index might help to disambiguate two people with the same name. A different rank doesn’t necessarily indicate a different person, because ranks can change. At this stage I will be attempting to reconstruct the author’s understanding of who is who, which means confronting the major problem of author’s intentions. This doesn’t mean that I can remain neutral or that my assumptions won’t influence the record linkage. Linkage at this level will be determined by my own subjective interpretation of what I think the author meant. I will have to assume that there is some consistent logic to what he wrote, but that can’t necessarily be proved from within the text.
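The first step of that linkage, deciding which mentions might refer to the same person, could be supported by something like the Python sketch below. It assumes the names have already been extracted as (rank, surname, page) tuples, which is a data shape I have invented for the example, along with the function names and sample mentions. It only groups candidates and flags the doubtful ones; it does not decide anything.

```python
from collections import defaultdict

# A mention is (rank, surname, page number); the shape is hypothetical.
Mention = tuple[str, str, int]


def group_mentions(mentions: list[Mention]) -> dict[str, list[tuple[str, int]]]:
    """Group mentions by surname, ignoring rank because ranks can change."""
    groups: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for rank, surname, page in mentions:
        groups[surname].append((rank, page))
    return dict(groups)


def needs_review(groups: dict[str, list[tuple[str, int]]]) -> list[str]:
    """List surnames that appear with more than one rank.

    These may be one person who was promoted or two people who share
    a name, so they are left for the editor to resolve.
    """
    return sorted(
        surname
        for surname, items in groups.items()
        if len({rank for rank, _page in items}) > 1
    )


if __name__ == "__main__":
    sample = [("Lieut.", "Smith", 12), ("Capt.", "Smith", 87), ("Pte.", "Brown", 40)]
    grouped = group_mentions(sample)
    print(grouped)
    print(needs_review(grouped))
```

Grouping by surname alone is deliberately crude. Two mentions with the same rank and surname could still be different people, so an unflagged group is not a confirmed identification, only one that the script has no reason to question.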
What about outside the text? Linking the text to other records would add value for users. If identifications can be corroborated from other sources, then my judgements might be more secure. However, this also involves making more ambitious claims about meaning. How do I know that the same sequence of characters in two different texts means the same thing? Ultimately I don’t. Record linkage is an empirical technique which can’t necessarily be justified to post-structuralists, but I don’t necessarily have to justify it to post-structuralists.
Once again the important thing is the purpose of the project and the needs and expectations of its target audience. The main value of Sandall’s book is to amateur researchers who want to know more about specific individual soldiers or officers, or about what the battalion was doing at a particular time. These people are unlikely to be impressed by agonising about meaning, intentions, and epistemology. Their methodology will most likely be traditional empiricism. This is not to say that they will be naive — they can recognise that some sources are more reliable than others and that different people have different interpretations of what happened — but ultimately what they really care about will probably be “the facts” of what really happened in the past. I don’t intend to challenge those beliefs, but conversely my project doesn’t depend on them either. While record linkage requires claims about the meaning of information and the relationship between different texts, it does not necessarily involve any claims about the relationship between text and reality.
To a certain extent I hope that I can let users make up their own minds about the meaning of the text, and that if they disagree with an editorial decision they can either ignore it or save their own personal copy which they can edit to their own specifications. TEI XML adds a layer of meaning to the text, but doesn’t change the underlying information, unlike a database where the information has to be cut up and rearranged to fit into an arbitrary taxonomy. Allen Renear: “One might say that the TEI is an agreement about how to express disagreement”. Julia Flanders reminds us that editorial responsibility should not be offloaded onto the reader. The problems of digital text make the editor more, not less, important.
I hope I’ve demonstrated that nothing about editing digital texts is simple. Over the last two weeks I’ve become aware of more problems than I imagined when I set out, but it’s been very useful to think about these issues more clearly. Even if I can’t solve every problem, I can at least avoid some of them, and minimise the impact of others. Above all, these projects are intended to be educational, and I’m certainly learning a lot from them. Now I’m nearly ready to start creating the digital texts themselves.
Bibliography
- Lou Burnard, John Unsworth, and Katherine O’Brien O’Keeffe, Electronic Textual Editing (Modern Language Association of America, September 2006).
- Susan M. Hockey, Electronic Texts in the Humanities (Oxford University Press, November 2000).
- Ray Siemens, John Unsworth, and Susan Schreibman, A Companion to Digital Humanities, Blackwell Companions to Literature and Culture (Oxford: Blackwell Publishing, December 2004).
