<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Investigations of a Dog &#187; information theory</title>
	<atom:link href="http://www.investigations.4-lom.com/tag/information-theory/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.investigations.4-lom.com</link>
	<description>Failing better at understanding the past</description>
	<lastBuildDate>Sun, 05 Feb 2012 09:18:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Information vs Meaning: A False Dichotomy?</title>
		<link>http://www.investigations.4-lom.com/2007/07/02/information-vs-meaning/</link>
		<comments>http://www.investigations.4-lom.com/2007/07/02/information-vs-meaning/#comments</comments>
		<pubDate>Mon, 02 Jul 2007 14:33:23 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[information theory]]></category>
		<category><![CDATA[meaning]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/07/02/information-vs-meaning/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Information+vs+Meaning%3A+A+False+Dichotomy%3F&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-07-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/07/02/information-vs-meaning/&amp;rft.language=English"></span>
In a few previous posts I&#8217;ve stressed the difference between information and meaning (which I picked up from Claude Shannon, the father of information theory) and some of its implications. For example, in this post I pointed out that Shannon&#8217;s separation of meaning and information is compatible with structuralist and post-structuralist theories which maintain that [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Information+vs+Meaning%3A+A+False+Dichotomy%3F&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-07-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/07/02/information-vs-meaning/&amp;rft.language=English"></span>
<p>In a few previous posts I&#8217;ve stressed the difference between information and meaning (which I picked up from Claude Shannon, the father of information theory) and some of its implications. For example, in <a href="http://www.investigations.4-lom.com/2007/06/08/science-friction/">this post</a> I pointed out that Shannon&#8217;s separation of meaning and information is compatible with structuralist and post-structuralist theories which maintain that there is no inherent meaning in the text. (I&#8217;ve also had to deal with it in the course of digitizing a book &#8211; see <a href="http://www.investigations.4-lom.com/2007/02/02/text-theories-information/">here</a>). Work on Artificial Intelligence has tended to reinforce this distinction: computers are very good at processing information but not very good at understanding meaning.</p>
<p>But last week <a href="http://digitalhistoryhacks.blogspot.com/2007/06/clustering-with-compression.html">Bill Turkel</a> wrote a post which turned my understanding of the meaning/information dichotomy on its head. This isn&#8217;t an entirely new development: it follows on from a post he wrote in March 2006, which was inspired by an article by Rudi Cilibrasi and Paul Vitányi published in 2005. There&#8217;s a lot of mathematical stuff about compression algorithms which I can&#8217;t claim to understand, but the schwerpunkt is that without understanding anything about meaning, computers can compare similarities in the information content of texts and cluster them accordingly. The result is patterns that make sense to humans who <em>can</em> understand the meaning of the text. Bill&#8217;s example used entries from the <em>Dictionary of Canadian Biography</em>, finding geographical and chronological clusters of entries.</p>
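<p>To make the idea a little more concrete, here is a minimal sketch of a compression-based distance between two texts. It is not Bill&#8217;s code, and it assumes zlib as the compressor; the function names <code>compressed_size</code> and <code>ncd</code> and the three sample strings are my own inventions for illustration. The measure is the normalized compression distance: if two texts share a lot of information, compressing them together costs little more than compressing one alone.</p>

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs, closer to 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented sample texts, used only to exercise the function.
a = b"The regiment marched from Lincoln to the coast in August 1914. " * 20
b = b"The regiment marched from Lincoln to the camp in September 1914. " * 20
c = b"Compression algorithms replace repeated byte sequences with short codes. " * 20

# The two regimental sentences should come out closer to each other
# than either is to the sentence about compression.
print(ncd(a, b), ncd(a, c))
```

<p>A clustering program would compute this distance for every pair of texts and group the closest ones. At no point does the code know what any of the words mean.</p>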
<p>Despite the attention-grabbing title of my post, the distinction between information and meaning isn&#8217;t a false one. However, these experiments show that in practice the relationship between information and meaning within the context of a particular linguistic/cultural system is not as arbitrary and unpredictable as theorizing might suggest. Does this mean that structuralism could make a comeback against post-structuralism? Or do we need to move beyond both of those things and find a new way to think about text? Whatever the implications for theory, this is an exciting development which promises to be very useful in practice.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/07/02/information-vs-meaning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Text Theories: Meaning</title>
		<link>http://www.investigations.4-lom.com/2007/02/05/text-theories-meaning/</link>
		<comments>http://www.investigations.4-lom.com/2007/02/05/text-theories-meaning/#comments</comments>
		<pubDate>Mon, 05 Feb 2007 16:49:51 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[information theory]]></category>
		<category><![CDATA[meaning]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[theory]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/02/05/text-theories-meaning/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Text+Theories%3A+Meaning&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-05&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/05/text-theories-meaning/&amp;rft.language=English"></span>
In my previous post about theories of digital text, I used Shannon&#8217;s communication theory to divide text into information and meaning, and then talked exclusively about text as information: a sequence of characters selected from a finite set. That allowed me to concentrate on one part of the problem, while excluding the more difficult problems [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Text+Theories%3A+Meaning&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-05&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/05/text-theories-meaning/&amp;rft.language=English"></span>
<p>In my previous post about <a href="http://www.investigations.4-lom.com/2007/02/02/text-theories-information/" title="Investigations of a Dog: Text Theories: Information">theories of digital text</a>, I used Shannon&#8217;s communication theory to divide text into information and meaning, and then talked exclusively about text as information: a sequence of characters selected from a finite set. That allowed me to concentrate on one part of the problem, while excluding the more difficult problems associated with meaning. In this post, I&#8217;ll be trying to tackle some of the problems of meaning, while still trying to avoid as many as I can. I will also continue to avoid offering concrete definitions of &#8220;text&#8221; and &#8220;a text&#8221;, mainly because I haven&#8217;t found any satisfactory definitions yet, but I won&#8217;t be able to avoid using the word &#8220;text&#8221;.</p>
<p><span id="more-54"></span></p>
<p>When scanning, OCR, and proofreading are complete (meaning you&#8217;ve gone as far as you can/will go with it &#8211; last time I suggested that proofing can never be truly complete) you are left with one or more plain text files which contain reasonably accurate information. There is likely to be some noise in the form of wrongly transcribed characters, but it should be expected that you have selected methods which result in a level of accuracy that is acceptable to your project, and that you have assurance checks in place to be able to determine that the work does meet your minimum requirements. The information you have in the text file is a sequence of characters which more or less matches the sequence of characters in the book. What, if anything, should you do next?</p>
<p>You could just put the text file on the internet as it is and let users worry about what it means. This is the approach taken by <a href="http://www.gutenberg.org/wiki/Main_Page" title="Project Gutenberg">Project Gutenberg</a>. Their texts are all made available as plain text files, with some also available as HTML. This approach is based on the assumption that digitized texts should conform to the lowest common denominator, and that any additional markup might reduce cross-compatibility and make the files inaccessible in the future. I don&#8217;t entirely agree with this view. TEI XML is a widespread standard and looks like it will remain so for a long time. XML files are not a proprietary format. In terms of file systems they are no different from plain text files: both can be read and edited by any text editing software on any platform. The XML Document Object Model should make it easy to update tags in the future, and if XML ever turns out to be totally useless, then you can at least use find and replace to strip it out automatically, leaving you with the original text and no markup. This is not meant as criticism of Project Gutenberg. They are doing valuable work in making public domain works more widely accessible, and in developing tools and procedures for collaborative work. Digitizing and proofreading text is necessary before any markup can be added. Project Gutenberg stops before the markup stage, but there&#8217;s nothing to stop other people from taking PG text files and adding advanced markup.</p>
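<p>As a rough illustration of how cheaply markup can be stripped back out, here is a minimal sketch that uses an XML parser from the Python standard library instead of literal find and replace. The fragment and the function name <code>strip_markup</code> are invented for the example, and it assumes the input is well-formed XML.</p>

```python
import xml.etree.ElementTree as ET

# Invented fragment; p, name and date are TEI element names, but the content is made up.
marked_up = "<p>The battalion left <name>Lincoln</name> in <date>August 1914</date>.</p>"


def strip_markup(xml_text: str) -> str:
    """Return only the character data of a well-formed XML fragment, in document order."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


print(strip_markup(marked_up))  # The battalion left Lincoln in August 1914.
```

<p>Using a parser means that entities and attributes are handled correctly, which a naive search for angle brackets would not guarantee.</p>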
<p>Adding markup necessarily involves meaning to a certain extent. Even Project Gutenberg, which aims only at producing plain text editions, isn&#8217;t just transmitting the sequence of characters from printed book to ASCII codes. Some characters, such as page numbers and running heads, are omitted. This is a subjective decision about which information to include and exclude in the digital edition, based on what is most likely to be useful to readers. Therefore, there has to be some kind of judgement about what the information <em>means</em>.</p>
<p>While digital text offers more flexibility than printed text, the role of the editor is just as crucial as ever. As <a href="http://www.tei-c.org.uk/Activities/ETE/Preview/vanhoutte.xml" title="Electronic Textual Editing">Edward Vanhoutte</a> says: &#8220;The editor is always present in the organization of the material and the transcription of source documents&#8221;. Marking up the basic structure of a document according to established standards like TEI might seem unproblematic, but Susan Hockey points out that even this is an act of interpretation (p. 48). For example, replacing line break characters with paragraph tags makes an assumption about the meaning of line breaks. Hockey cites Huitfeldt&#8217;s observation that there are no objective facts about a text (p. 47). Adding TEI XML tags to a text file is imposing an arbitrary taxonomy. Nevertheless, all language is an arbitrary taxonomy. As long as we recognise that nothing actually <em>is</em> what it&#8217;s called, those taxonomies can be useful. The taxonomy you choose has to be relevant to your objectives, and therefore you have to know why you are digitizing a text, who the target audience is, and what they are likely to want from it. This is crucial, because there is no perfect way of digitizing a text. It also helps if your taxonomy ties into a system which is widely used and understood, and which does not vary unpredictably. TEI is a good starting point because it&#8217;s widely used, and while flexible enough to accommodate many different purposes is fixed enough to prevent too much random slippage.</p>
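<p>The line break example can be made concrete with a small sketch. The rule it applies (a blank line ends a paragraph, a single line break is just wrapping) is exactly the kind of interpretive assumption Hockey is talking about; it is my assumption for this example, not a rule taken from TEI. The function name and the sample text are also invented.</p>

```python
from xml.sax.saxutils import escape


def paragraphs_from_blank_lines(plain_text: str) -> str:
    """Wrap each blank-line-separated block in a p element, joining wrapped lines with spaces."""
    blocks = [block for block in plain_text.split("\n\n") if block.strip()]
    wrapped = []
    for block in blocks:
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        # escape() protects any ampersands or angle brackets in the source text.
        wrapped.append("<p>" + escape(joined) + "</p>")
    return "\n".join(wrapped)


sample = "First line of a paragraph\nwrapped onto a second line.\n\nA new paragraph."
print(paragraphs_from_blank_lines(sample))
```

<p>A text in which single line breaks are significant, such as verse, would be misrepresented by this rule, which is why the choice has to follow from the purpose of the edition.</p>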
<p>In the interests of preventing slippage and maintaining cross-compatibility, it is vital to apply tags consistently. <a href="http://www.tei-c.org.uk/Activities/ETE/Preview/flanders.xml" title="Electronic Textual Editing">Julia Flanders</a> points out that this is easier said than done because of the complexity and flexibility of TEI: the same feature could be marked up several different ways. Again this is a problem of meaning: how do you interpret the meaning of a sequence of characters, and how do you fit that interpretation into your arbitrary taxonomy? Flanders emphasises the importance of documenting procedures and modifying documentation in the light of management decisions on difficult interpretations or previously unknown features. The <a href="http://crimpleb.group.shef.ac.uk/" title="Central Criminal Court/Plebeian Lives projects">Old Bailey project</a> has taken an innovative approach to this, using a wiki to co-ordinate XML tagging. In a collaborative project, regular assurance checks are necessary to make sure that all team members are following the documentation consistently, and that the documentation is adequate.</p>
<p>While marking up the basic structure of a text (paragraphs, chapters, headings) must be recognised as an act of interpretation and arbitrary classification, it should be relatively unproblematic in practice. This is particularly true of the book I&#8217;ll be working on first: a very conventional regimental history published in Britain in the 20th century. Colonel Sandall is hardly Ezra Pound or Jack Kerouac! The structure of a normal book can be seen in structuralist terms: although it&#8217;s an arbitrary system which doesn&#8217;t necessarily have a fixed relationship with reality, it&#8217;s fixed in relation to itself. Most people understand this system, which is much less complex than the whole of a language system, and publishers help to enforce conformity in printed works. Manuscripts are more problematic as they don&#8217;t necessarily have such rigid conventions. Interpreting the structure of William Wenham&#8217;s letters will be more difficult than interpreting the structure of Sandall&#8217;s book, but at least there are some established conventions of letter writing (again we&#8217;re not dealing with a modernist stream-of-consciousness here), and field postcards have their own basic structure.</p>
<p>The next stage of markup involves picking out dates, and names of people, places, and organizations, and therefore more subjective interpretation of meanings. At this stage no claims will be made about who or what the names signify. It will only be necessary to decide whether or not a sequence of characters represents a proper noun. Fortunately there is an established convention in English printed books that proper nouns are distinguished by a capital letter. This might even allow a certain amount of automation of picking out names, although the potential for confusion with capital letters at the beginnings of sentences will probably make a lot of human intervention necessary. <a href="http://www.tei-c.org.uk/Activities/ETE/Preview/lavagnino.xml" title="Electronic Textual Editing">John Lavagnino</a> points out that names are not always easy to define and delimit. In Sandall&#8217;s book, names will often be accompanied by ranks, which makes them easier to spot.</p>
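<p>Here is a minimal sketch of what that automation might look like, and of why it can only produce candidates for a human to check. The pattern, the function name <code>candidate_names</code>, and the sample sentence are all invented for illustration. It simply skips whatever capitalized run opens the sentence, so a real name in that position would be missed.</p>

```python
import re

# A run of one or more capitalized words, e.g. "Colonel Sandall" or "Lincoln".
CAPITALIZED_RUN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def candidate_names(sentence: str) -> list[str]:
    """Return capitalized runs, skipping any run that starts the sentence."""
    return [m.group(0) for m in CAPITALIZED_RUN.finditer(sentence) if m.start() != 0]


print(candidate_names("The battalion under Colonel Sandall left Lincoln in August."))
# ['Colonel Sandall', 'Lincoln', 'August']
```

<p>The output mixes a rank and surname, a place, and a month. Deciding which is which is already a judgement about meaning that the pattern cannot make.</p>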
<p>The third stage of markup is potentially the most contentious because record linkage involves making epistemological claims about the identities of the people referred to by the names. The first question is whether the same name refers to the same person when it occurs in different places within the text. In doubtful cases the book&#8217;s index might help to disambiguate two people with the same name. A different rank doesn&#8217;t necessarily indicate a different person, because ranks can change. At this stage I will be attempting to reconstruct the author&#8217;s understanding of who is who, which means confronting the major problem of author&#8217;s intentions. This doesn&#8217;t mean that I can remain neutral or that my assumptions won&#8217;t influence the record linkage. Linkage at this level will be determined by my own subjective interpretation of what I think the author meant. I will have to assume that there is some consistent logic to what he wrote, but that can&#8217;t necessarily be proved from within the text.</p>
<p>What about outside the text? Linking the text to other records would add value for users. If identifications can be corroborated from other sources, then my judgements might be more secure. However, this also involves making more ambitious claims about meaning. How do I know that the same sequence of characters in two different texts means the same thing? Ultimately I don&#8217;t. Record linkage is an empirical technique which can&#8217;t necessarily be justified to post-structuralists, but I don&#8217;t necessarily have to justify it to post-structuralists.</p>
<p>Once again the important thing is the purpose of the project and the needs and expectations of its target audience. The main value of Sandall&#8217;s book is to amateur researchers who want to know more about specific individual soldiers or officers, or about what the battalion was doing at a particular time. These people are unlikely to be impressed by agonising about meaning, intentions, and epistemology. Their methodology will most likely be traditional empiricism. This is not to say that they will be naive &#8211; they can recognise that some sources are more reliable than others and that different people have different interpretations of what happened &#8211; but ultimately what they really care about will probably be &#8220;the facts&#8221; of what really happened in the past. I don&#8217;t intend to challenge those beliefs, but conversely my project doesn&#8217;t depend on them either. While record linkage requires claims about the meaning of information and the relationship between different texts, it does not necessarily involve any claims about the relationship between text and reality.</p>
<p>To a certain extent I hope that I can let users make up their own minds about the meaning of the text, and that if they disagree with an editorial decision they can either ignore it or save their own personal copy which they can edit to their own specifications. TEI XML adds a layer of meaning to the text, but doesn&#8217;t change the underlying information, unlike a database where the information has to be cut up and rearranged to fit into an arbitrary taxonomy. <a href="http://nora.lis.uiuc.edu/xtf/view?docId=blackwell/9781405103213/9781405103213.xml&amp;chunk.id=ss1-3-5&amp;toc.depth=1&amp;toc.id=ss1-3-5&amp;brand=default" title="Companion to Digital Humanities">Allen Renear</a>: &#8220;One might say that the TEI is an agreement about how to express disagreement&#8221;. <a href="http://www.tei-c.org.uk/Activities/ETE/Preview/flanders.xml" title="Electronic Textual Editing">Julia Flanders</a> reminds us that editorial responsibility should not be offloaded onto the reader. The problems of digital text make the editor more, not less, important.</p>
<p>I hope I&#8217;ve demonstrated that nothing about editing digital texts is simple. Over the last two weeks I&#8217;ve become aware of more problems than I imagined when I set out, but it&#8217;s been very useful to think about these issues more clearly. Even if I can&#8217;t solve every problem, I can at least avoid some of them, and minimise the impact of others. Above all, these projects are intended to be educational, and I&#8217;m certainly learning a lot from them. Now I&#8217;m nearly ready to start creating the digital texts themselves.</p>
<h3>Bibliography</h3>
<ol>
<li>Lou Burnard, John Unsworth, and Katherine O&#8217;Brien O&#8217;Keeffe, <span style="font-style:italic;">Electronic Textual Editing with CDROM</span> (Modern Language Association of America, September 2006). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A0873529715&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Electronic%20Textual%20Editing%20with%20CDROM&amp;rft.publisher=Modern%20Language%20Association%20of%20America&amp;rft.edition=Pap%2FCdr&amp;rft.aufirst=Lou&amp;rft.aulast=Burnard&amp;rft.au=Lou%20Burnard&amp;rft.au=John%20Unsworth&amp;rft.au=Katherine%20O'Brien%20O'Keeffe&amp;rft.date=2006-09-30&amp;rft.pages=419&amp;rft.isbn=0873529715"></span></li>
<li>Susan M. Hockey, <span style="font-style:italic;">Electronic Texts in the Humanities</span> (Oxford University Press, November 2000). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A0198711948&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Electronic%20Texts%20in%20the%20Humanities%3A%20Principles%20and%20Practice&amp;rft.publisher=Oxford%20University%20Press&amp;rft.aufirst=Susan%20M.&amp;rft.aulast=Hockey&amp;rft.au=Susan%20M.%20Hockey&amp;rft.date=2000-11-23&amp;rft.pages=228&amp;rft.isbn=0198711948"></span></li>
<li>Ray Siemens, John Unsworth, and Susan Schreibman, <span style="font-style:italic;">Companion to Digital Humanities (Blackwell Companions to Literature and Culture)</span> (Blackwell Publishing Professional: Oxford, December 2004). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A1405103213&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Companion%20to%20Digital%20Humanities%20(Blackwell%20Companions%20to%20Literature%20and%20Culture)&amp;rft.place=Oxford&amp;rft.publisher=Blackwell%20Publishing%20Professional&amp;rft.edition=Hardcover&amp;rft.series=Blackwell%20Companions%20to%20Literature%20and%20Culture&amp;rft.aufirst=Ray&amp;rft.aulast=Siemens&amp;rft.au=Ray%20Siemens&amp;rft.au=John%20Unsworth&amp;rft.au=Susan%20Schreibman&amp;rft.date=2004-12-12&amp;rft.isbn=1405103213"></span></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/02/05/text-theories-meaning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Text Theories: Information</title>
		<link>http://www.investigations.4-lom.com/2007/02/02/text-theories-information/</link>
		<comments>http://www.investigations.4-lom.com/2007/02/02/text-theories-information/#comments</comments>
		<pubDate>Fri, 02 Feb 2007 17:07:38 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[information theory]]></category>
		<category><![CDATA[meaning]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[theory]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/02/02/text-theories-information/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Text+Theories%3A+Information&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/02/text-theories-information/&amp;rft.language=English"></span>
As the next stage of my Digital History Projects I&#8217;ve been doing background reading and thinking about the theory of text. This week I&#8217;ve read Schreibman, Siemens, and Unsworth A Companion To Digital Humanities (2004); Burnard, O&#8217;Brien O&#8217;Keeffe, and Unsworth Electronic Textual Editing (2006); Susan Hockey Electronic Texts in the Humanities (2000); and C. E. [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Text+Theories%3A+Information&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/02/text-theories-information/&amp;rft.language=English"></span>
<p>As the next stage of my Digital History Projects I&#8217;ve been doing background reading and thinking about the theory of text. This week I&#8217;ve read Schreibman, Siemens, and Unsworth <a href="http://www.digitalhumanities.org/companion/" title="A Companion To Digital Humanities">A Companion To Digital Humanities</a> (2004); Burnard, O&#8217;Brien O&#8217;Keeffe, and Unsworth <a href="http://www.tei-c.org.uk/Activities/ETE/Preview/index.xml" title="Electronic Textual Editing">Electronic Textual Editing</a> (2006); Susan Hockey <em>Electronic Texts in the Humanities</em> (2000); and C. E. Shannon &#8216;A Mathematical Theory of Communication&#8217; (1948). I can&#8217;t say that I understood everything (especially Shannon&#8217;s equations and Jerome McGann&#8217;s pretentious jargon) but it&#8217;s given me a lot to think about, and things are nowhere near as simple as I first assumed.</p>
<p><span id="more-52"></span></p>
<p>What is text? What is <em>a</em> text? It turns out that there are no easy answers to these questions. While I was right to think that digitization avoids some of the epistemological problems of history, allowing readers to make their own decisions about the relationship between text and reality, digital text presents plenty of new problems which could be equally intractable. A text is not necessarily the same thing as a book or an article or a play. Things get really complicated when there are differing versions of a text, as is often the case with medieval manuscripts. Should we classify them as the same text with some differences, or different texts with some similarities? The separation of information and meaning is an important concept which can allow us to think more clearly about what we&#8217;re doing, but in practice, the separation is not necessarily easy to make. This is how Shannon introduced the idea:</p>
<blockquote><p>The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.</p></blockquote>
<p>By Shannon&#8217;s definition, the sequence of characters contained in a book can be considered to be information. We can select this message and attempt to reproduce it exactly without having to worry about meaning. At the very least, we can fall back on structuralism, since an alphabet is a fixed arbitrary system in which the characters are identified by the differences between them. There is no fixed relationship between a character and the sound it represents (for example, characters in the Latin alphabet can be pronounced differently in English, French and German). The same character might be represented in different ways. In modern print there are different typefaces which can be used to represent the same characters. In early-modern handwriting letter forms are often very different from modern forms, and the same character might have different forms in the same word (especially <em>s</em>). This fits in with Saussure&#8217;s distinction between <em>langue</em> and <em>parole</em>: the size, weight, font, and even form of a character might vary, but it can still be identified as the same character in relation to the system it comes from. [EDIT: the proper words for this are substantives and accidentals] This is not to say that <em>parole</em> is unimportant. Typography can have a significant effect on how text is perceived and understood, just as regional accents can signify group identities and influence how speech is understood. However, it is useful to be aware of distinctions. Susan Hockey points out that computers force us to concentrate on what we are doing and why (p. 3). They also force us to analyse everything more systematically rather than assume that anything &#8220;just is&#8221;. I&#8217;ve already had to break text down into information and meaning, with meaning further broken down into equivalents of <em>langue</em> and <em>parole</em>, and I&#8217;m still a long way from having any idea of what text &#8220;is&#8221;.</p>
<p>In theory it should be easy to transmit a sequence of characters, even a long sequence such as a book. In practice, getting an accurate electronic transcript of printed text is one of the biggest problems for digital humanities projects. Whether using OCR or human double-keying, getting acceptable accuracy is difficult and expensive, and perfection seems unattainable. This is surprising when you consider that printed characters and ASCII codes seem to meet Shannon&#8217;s definition of a discrete channel: &#8220;Generally, a discrete channel will mean a system whereby a sequence of choices from a finite set of elementary symbols can be transmitted from one point to another.&#8221; So what&#8217;s the problem?</p>
<p>Shannon&#8217;s model gives us five parts of the communication system:</p>
<ol>
<li>Information source</li>
<li>Transmitter</li>
<li>Channel</li>
<li>Receiver</li>
<li>Destination</li>
</ol>
<p>The transmitter converts the message from the information source into a signal, and the receiver converts it back into a message which can be understood by the destination (usually a person). Shannon&#8217;s theory is mainly concerned with maintaining the integrity of a signal in the channel by calculating how much redundancy is required for a given level of noise. In terms of digitization projects, this is all about the electronic working of the computer and its peripherals. Thanks to the application of Shannon&#8217;s theory, we can usually be sure that when we press the &#8220;a&#8221; key on the keyboard, the &#8220;a&#8221; character will appear on the screen (the keyboard can be seen as the transmitter, and the screen as the receiver).</p>
<p>With double keying, the real problem is what happens between the source and the transmitter. Shannon wasn&#8217;t too worried about this, implicitly assuming that the person at the source selected the message they wanted to select. Even if they didn&#8217;t, he points out that the redundancy of the English language is about 50%, meaning that roughly half of what we write is determined by the structure of the language rather than freely chosen, so a message with a fair number of wrong characters will probably still be intelligible to the recipient. Academic projects demand far more than mere intelligibility, and also need to preserve mistakes from the original text, which makes things more complicated.</p>
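<p>Redundancy can be estimated, very crudely, from character frequencies alone. The sketch below is my own illustration and not Shannon&#8217;s calculation: his figure of about 50% takes account of statistical structure over runs of up to about eight letters, whereas this function (the name <code>first_order_redundancy</code> is invented) looks only at how often each single character occurs, so it will usually give a much lower number for real English.</p>

```python
import math
from collections import Counter


def first_order_redundancy(text: str) -> float:
    """Return 1 - H/H_max, where H is the single-character entropy in bits
    and H_max is log2 of the number of distinct characters seen."""
    if not text:
        raise ValueError("text must not be empty")
    counts = Counter(text)
    if len(counts) == 1:
        return 1.0  # a single repeated symbol carries no information at all
    total = len(text)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(len(counts))
    return 1 - entropy / max_entropy


print(first_order_redundancy("abab"))  # equally likely symbols: no redundancy
print(first_order_redundancy("aaab"))  # skewed frequencies: some redundancy
```

<p>The more predictable the next character, the higher the redundancy, and the more room there is for a reader to recover from a wrong character.</p>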
<p>We could perhaps see the keyer as another communication system, which introduces its own noise by misreading or mistyping the characters. According to Shannon, a noisy channel can transmit a message with an arbitrarily small probability of error, provided that it&#8217;s transmitted slowly enough (below the capacity of the channel) and with sufficient redundancy. This applies to typists as well as it applies to telegraph wires. Typing very slowly and carefully will reduce the number of mistakes you make. Having more people rekeying the same text will reduce the overall number of errors. In practice there will always be a probability, however small, that a mistake will be missed. It&#8217;s also likely that people will make similar mistakes in reading and typing, rather than introducing completely random errors (I don&#8217;t know if any cognitive psychologists have done any experiments on this, but it would be interesting to see if there&#8217;s any empirical evidence to back up this suspicion).</p>
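<p>To make the redundancy argument concrete, here is a minimal sketch of how several independent keyings of the same passage might be combined by a character-by-character majority vote. The function name <code>majority_vote</code> and the sample strings are my own inventions for illustration, and the sketch assumes that the transcripts are already aligned to the same length and that the keyers&#8217; errors are independent, which (as suggested above) may well not be true of real typists.</p>

```python
from collections import Counter


def majority_vote(transcripts):
    """Merge equal-length transcripts by taking the commonest character
    at each position. Returns the merged text and the positions where
    the keyers disagreed (hypothetical helper, for illustration only)."""
    lengths = {len(t) for t in transcripts}
    if len(lengths) != 1:
        raise ValueError("transcripts must be aligned to the same length")
    merged = []
    disputed = []
    for position, chars in enumerate(zip(*transcripts)):
        winner, votes = Counter(chars).most_common(1)[0]
        merged.append(winner)
        if votes != len(chars):
            # At least one keyer typed something different here.
            disputed.append(position)
    return "".join(merged), disputed


keyings = [
    "the quick brown fox",
    "the quick brawn fox",  # one keyer misreads "o" as "a"
    "tbe quick brown fox",  # another misreads "h" as "b"
]
merged, disputed = majority_vote(keyings)
print(merged)    # the quick brown fox
print(disputed)  # [1, 12]
```

<p>The vote only fails where a majority of keyers make the same mistake at the same position, which is exactly why correlated errors matter more than random ones. The list of disputed positions could also be passed to a human for review rather than being resolved automatically.</p>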
<p>If time and money are unlimited it should be possible to make transcription errors negligible by employing large numbers of typists and making them type very slowly and carefully. However, we all know that major digital humanities projects don&#8217;t have unlimited time and money. Getting the right balance is important, as is having realistic expectations. Are the demands of digitization projects too high for the available techniques? Are time and budget considerations pushing text keying beyond its limits and making errors inevitable?</p>
<p>OCR is attractive because it offers the possibility of automating text capture, bypassing the expense and unreliability of humans. However, <a href="http://chnm.gmu.edu/digitalhistory/digitizing/4.php" title="Digital History">Cohen and Rosenzweig</a> cite studies which show that when the time and cost of proofreading and correcting OCR text are taken into account, double keying works out more cost-effective as well as more accurate. This is because computers are much worse at recognizing characters than humans are. You can scan a document at 300 dpi, and those dots will appear in the same sequence on the screen. Perfect transmission, or near enough. But when the computer tries to select a message from those dots as a &#8220;sequence of choices from a finite set of elementary symbols&#8221; things often go wrong. This is immensely frustrating, because to a human it seems like such a simple task. We can hope that advances in Artificial Intelligence will eventually lead to reliable OCR, but it&#8217;s not going to be an easy problem to solve. (The ultimate proof of the unreliability of OCR is that the online version of the <a href="http://www.digitalhumanities.org/companion/" title="A Companion To Digital Humanities">Companion To Digital Humanities</a> is full of scannos!)</p>
<p>As it is, OCR text needs to be proofread by at least one human. <a href="http://www.pgdp.net/c/" title="Distributed Proofreaders">Distributed Proofreaders</a> now use three rounds of proofing (followed by two rounds of formatting). Because of the &#8220;open source&#8221; nature of the project, which is run by unpaid volunteers, time and cost don&#8217;t need to be considered at all. A text is ready when it&#8217;s ready, and nobody has to pay for it. This makes triple proofing more feasible than in a funded project. However, it might also be the case that more proofing is required because the proofreaders themselves are an unknown quantity. As I haven&#8217;t qualified for round two yet, I don&#8217;t yet know how much time round two proofers spend correcting errors introduced by less experienced proofers in round one. Radical trust is feasible provided you get a critical mass of responsible users (which DP appears to have attained) and offers some interesting possibilities. Large numbers of unpaid volunteers doing small amounts of work very carefully might overcome some of the problems of big digitization projects, although it might also bring problems of its own.</p>
<p>Allowing users to supply corrections after publication could also help to increase the accuracy of transcriptions. This is even more radical than the DP model, and gives some traditionally minded people the fear. <a href="http://www.tei-c.org.uk/Activities/ETE/Preview/eggert.xml" title="Electronic Textual Editing">Berrie, Eggert, Tiffin, and Barwell</a> take a very traditional view of authentication which is based on the assumption that editors can make a text perfect and that it will deteriorate if not controlled (even more depressing is their emphasis on defending copyright). In the light of everything I&#8217;ve discussed so far, I suggest that the opposite might be true: that it is impossible for any individual editor or team of editors to produce a perfect text, and that the more people who are involved in correcting errors, the more accurate the transcription is likely to be. Wikipedia shows that with a critical mass of committed and responsible users, even deliberate vandalism can be overcome. This is not to say that every electronic text has to be, or can be, as open as Wikipedia. The most obvious problem is getting that critical mass of users, and this will be more difficult for more esoteric projects in which fewer people are likely to take an interest. At the very least, there should be some mechanism for users to suggest corrections, even if these have to be reviewed before being implemented. For example, the Old Bailey Proceedings has a form for submitting errors.</p>
<p>My current projects are small enough that I can take a lot of time and care over them, but I also want to develop techniques that will scale up, otherwise the experience will be of limited value. The relatively small amount of text to be dealt with means that in absolute terms there are likely to be few errors. It&#8217;s when you scale things up to millions of words that a small probability of errors can lead to a huge number of errors.</p>
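<p>The arithmetic behind that last point is simple enough to sketch. The figures below are purely illustrative assumptions on my part (an error rate of 100 wrong characters per million, which is 99.99% accuracy, and two made-up text sizes), not measurements from any real project, and <code>expected_errors</code> is a hypothetical helper.</p>

```python
def expected_errors(total_characters, errors_per_million):
    """Expected number of wrong characters, assuming every character
    has the same independent chance of being wrong (an assumption)."""
    return total_characters * errors_per_million / 1_000_000


# Illustrative sizes only: a short pamphlet and a large corpus.
pamphlet = expected_errors(50_000, 100)
corpus = expected_errors(50_000_000, 100)
print(pamphlet)  # 5.0
print(corpus)    # 5000.0
```

<p>The accuracy is the same in both cases; only the scale changes. Five errors in a pamphlet can plausibly be found by a careful proofreader, but five thousand scattered across a large corpus cannot.</p>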
<p>So far I&#8217;ve only considered characters as information, and haven&#8217;t got any closer to defining what text &#8220;is&#8221;. For the purposes of digitizing a book, I can avoid that question by setting out my aim as transcribing all of the characters in a particular book. Even though the definition of a book is at least slightly less problematic than the definition of a text, there&#8217;s more to a book than a sequence of characters. I&#8217;m choosing to represent one aspect of the book while discarding others, such as ink, paper, and binding. This is an arbitrary choice. Partly it&#8217;s because of the impossibility of representing the book as a complete physical object within a digital computer. Perfect information of that kind would need to go down to the level of atoms, and we would need some mechanism for reconstructing objects from the information contained in the computer. This is getting into the realms of alchemy, and clearly isn&#8217;t possible with any current technology.</p>
<p>But above all, any digitization project needs to look to the requirements of its intended users. From this point of view, it&#8217;s the information contained in a book, the sequence of characters making up the message, which potentially has the most value for readers. &#8220;Frequently the messages have meaning&#8221;. In the next part I&#8217;ll be going beyond information and looking at the even greater problems associated with meaning.</p>
<h3>Bibliography</h3>
<ol>
<li>Lou Burnard, John Unsworth, and Katherine O&#8217;Brien O&#8217;Keeffe, <span style="font-style:italic;">Electronic Textual Editing with CDROM</span> (Modern Language Association of America, September 2006). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A0873529715&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Electronic%20Textual%20Editing%20with%20CDROM&amp;rft.publisher=Modern%20Language%20Association%20of%20America&amp;rft.edition=Pap%2FCdr&amp;rft.aufirst=Lou&amp;rft.aulast=Burnard&amp;rft.au=Lou%20Burnard&amp;rft.au=John%20Unsworth&amp;rft.au=Katherine%20O'Brien%20O'Keeffe&amp;rft.date=2006-09-30&amp;rft.pages=419&amp;rft.isbn=0873529715"></span></li>
<li>Susan M. Hockey, <span style="font-style:italic;">Electronic Texts in the Humanities</span> (Oxford University Press, November 2000). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A0198711948&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Electronic%20Texts%20in%20the%20Humanities%3A%20Principles%20and%20Practice&amp;rft.publisher=Oxford%20University%20Press&amp;rft.aufirst=Susan%20M.&amp;rft.aulast=Hockey&amp;rft.au=Susan%20M.%20Hockey&amp;rft.date=2000-11-23&amp;rft.pages=228&amp;rft.isbn=0198711948"></span></li>
<li>C E Shannon, &#8216;A mathematical theory of communication&#8217;, <span style="font-style:italic;">Bell System Technical Journal</span>, 27 (1948), pp. 379-423, 623-656. <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=A%20mathematical%20theory%20of%20communication&amp;rft.jtitle=Bell%20System%20Technical%20Journal&amp;rft.volume=27&amp;rft.aufirst=C%20E&amp;rft.aulast=Shannon&amp;rft.au=C%20E%20Shannon&amp;rft.date=1948&amp;rft.pages=379-423%2C%20623-656"></span></li>
<li>Ray Siemens, John Unsworth, and Susan Schreibman, <span style="font-style:italic;">Companion to Digital Humanities (Blackwell Companions to Literature and Culture)</span> (Blackwell Publishing Professional: Oxford, December 2004). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A1405103213&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Companion%20to%20Digital%20Humanities%20(Blackwell%20Companions%20to%20Literature%20and%20Culture)&amp;rft.place=Oxford&amp;rft.publisher=Blackwell%20Publishing%20Professional&amp;rft.edition=Hardcover&amp;rft.series=Blackwell%20Companions%20to%20Literature%20and%20Culture&amp;rft.aufirst=Ray&amp;rft.aulast=Siemens&amp;rft.au=Ray%20Siemens&amp;rft.au=John%20Unsworth&amp;rft.au=Susan%20Schreibman&amp;rft.date=2004-12-12&amp;rft.isbn=1405103213"></span></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/02/02/text-theories-information/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ego me mihi meme</title>
		<link>http://www.investigations.4-lom.com/2007/01/23/memes/</link>
		<comments>http://www.investigations.4-lom.com/2007/01/23/memes/#comments</comments>
		<pubDate>Tue, 23 Jan 2007 20:14:44 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[cultural history]]></category>
		<category><![CDATA[gender]]></category>
		<category><![CDATA[information theory]]></category>
		<category><![CDATA[memetics]]></category>
		<category><![CDATA[theory]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/01/23/memes/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Ego+me+mihi+meme&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-01-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/01/23/memes/&amp;rft.language=English"></span>
Oh no! Bill Turkel has tagged me for a meme! Is this the end of civilisation as we know it? When I started this weblog I was determined to stick to substantial original content. There would be no room for memes or other self-indulgent timewasting — I already have a LiveJournal for that. However, Bill [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Ego+me+mihi+meme&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-01-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/01/23/memes/&amp;rft.language=English"></span>
<p>Oh no! <a href="http://digitalhistoryhacks.blogspot.com/2007/01/5-things-about-memes-and-blogosphere.html" title="Digital History Hacks">Bill Turkel</a> has tagged me for a meme! Is this the end of civilisation as we know it? When I started this weblog I was determined to stick to substantial original content. There would be no room for memes or other self-indulgent timewasting — I already have a LiveJournal for that. However, Bill managed to turn this particular meme into some interesting analysis of memetics and the blogosphere. That&#8217;s inspired me to move even further away from the original meme and post some random thoughts about memes. I won&#8217;t be tagging anyone at the end, because I hope to demonstrate that history bloggers don&#8217;t need to tag each other.</p>
<p><span id="more-47"></span></p>
<p>I first encountered the concept of the meme in 2004 after I started my LiveJournal. Back then I didn&#8217;t know anything about the theory behind it or that the term had been invented by Richard Dawkins. I didn&#8217;t even know how to pronounce the word and came up with my own folk etymology that it must be derived from the word &#8220;me&#8221; because the whole thing was an outlet for egotism and self-indulgence (this is almost as good as my folk etymology for &#8220;emo bands&#8221;, which I thought referred to their singers&#8217; whiny nasal voices sounding like Emo Philips). As far as I could see it was a piece of harmless fun, but potentially addictive enough to turn into a serious waste of time. (At this point it&#8217;s worth noting that I&#8217;m falling back on a familiar plot device for this blog: the autobiographical journey of discovery, beginning with &#8220;look how ignorant I used to be&#8221; and teleologically progressing to &#8220;but I got better&#8221;.)</p>
<p>But I got better. Some of my friends know about science and write about it in their LiveJournals, so eventually I realised what memes were really about. The meme could potentially be a useful idea to explain how human culture spreads and changes. The Marxist model in which economic base determines cultural superstructure is now realised to be completely inadequate (the passive voice there: I&#8217;m not just standing alone and saying <em>I</em> think it&#8217;s completely inadequate, but it would be tempting fate to assert that <em>everyone</em> thinks it&#8217;s inadequate). More subtle Marxist thinkers like Althusser and Gramsci moved away from simple economic determinism but still assumed that ideological hegemony served the interests of the elite.</p>
<p>Feminist and queer approaches to history and literature focus on gender ideology. In some ways gender fits the Marxist model, being widely assumed to be natural even though the feminist distinction between sex and gender exposes how <em>un</em>natural gender can be. In other ways, gender ideology brings out the limitations of a model of cultural change based on social and economic class. Although gender ideology can&#8217;t be directly linked to an economic base, it has changed over time, and historians need to account for those changes somehow (mainly because explaining change is part of historians&#8217; claim to importance, and therefore explanations have to be produced somehow or other; I&#8217;m having increasing doubts about whether we, the historians, can ever explain why anything happened, since even finding out <em>what</em> happened gives us more than enough methodological problems).</p>
<p>Dror Wahrman has suggested that England experienced a &#8220;gender panic&#8221; in the 1780s caused by the American Revolution (&#8216;<a href="http://findarticles.com/p/articles/mi_m2279/is_n159/ai_21029552" title="Dror Wahrman: Percy's Prologue">Percy&#8217;s Prologue</a>&#8217;, Past and Present, 159, 1998). The dates match up well enough, but we (as in everyone, I hope) all know that correlation doesn&#8217;t prove causation. Americans rejecting British government can easily be seen as a blow to patriarchy which might have had knock-on effects for British women. I&#8217;m actually really impressed by Wahrman&#8217;s article, but I&#8217;m just not entirely convinced by the conclusion. We (as in absolutely everyone in the world) don&#8217;t know enough about how culture works to be able to draw this kind of conclusion. We can&#8217;t even arrive at an adequate definition of what culture is. John Tosh was also sceptical of Wahrman&#8217;s conclusions, suggesting that cultural change might be independent of other factors (in Tim Hitchcock and Michele Cohen eds. <em>English Masculinities, 1660-1800</em>, 1999, ISBN: 0582319226). We might even have to consider the possibility that changes in economies, societies, and politics are driven by culture. After all, if cultural assumptions say that economy and society should be a certain way, and if those assumptions are so hegemonic that nobody ever thinks of questioning them, how can economy and society actually change? Doesn&#8217;t the idea of change have to come first?</p>
<p>The problem with this line of thinking is that it makes it much harder to explain cultural change. This is where Richard Dawkins comes in. These days he&#8217;s widely perceived as the angry red-faced militant atheist who can&#8217;t tolerate anyone who thinks differently from him, and as the leader of a pack of extreme reductionists who see absolutely everything as serving an evolutionary purpose (someone will probably comment that Dawkins isn&#8217;t a militant because he doesn&#8217;t use weapons or physical violence, and I&#8217;ll probably reply condescendingly that reductionists don&#8217;t understand metaphors). But it wasn&#8217;t always like this. In fact Dawkins recognised that biological evolution can&#8217;t account for everything in human history. There are some aspects of culture which don&#8217;t have any obvious connection with natural selection, or which even work against it. His proposed solution was the meme. Although this model was based on the transmission of DNA it can&#8217;t be dismissed as simple biological determinism. The meme is roughly analogous to the gene, but it&#8217;s really a cultural solution to a cultural problem.</p>
<p>When Dawkins came up with the idea of memes it was just a hypothesis, and one which wasn&#8217;t particularly important for the main arguments in <em>The Selfish Gene</em>. Dawkins himself doesn&#8217;t take his hypothesis very seriously (Bill Benzon said this somewhere on <a href="http://www.thevalve.org/" title="The Valve">The Valve</a> but I can&#8217;t find it now). Other people have enthusiastically taken it up and run with it. Does this prove that the hypothesis is true (memetics is a successful meme itself), or is it just a self-fulfilling prophecy?</p>
<p>One of my LiveJournal friends, who is a scientist, posted a succinct and eloquent <a href="http://innerbrat.livejournal.com/334427.html" title="Innerbrat on memetics">summary of meme theory</a>. We had an informed debate in the comments, and although it petered out just as it was getting more interesting, it at least showed me where I stand on the issue. I don&#8217;t think the meme lives up to the hype because it doesn&#8217;t really help to explain anything about culture. As far as I can see, the meme is just an arbitrary unit of information (in the strict sense of Shannon&#8217;s Information Theory, in which information and meaning are separate). Focusing on the transmission of a piece of information might be interesting up to a point but doesn&#8217;t necessarily tell us anything about how or why it spreads or whether its propagation has any significance.</p>
<p>Ultimately memetics doesn&#8217;t help us to avoid or solve the problem of meaning. What is meaning? How is it made? Can it be fixed? How far and how fast can it slip? These are fundamental questions which can&#8217;t be answered yet, and might never be answered. Structuralism suggests that language can be fixed in relation to itself, and that meaning derives from the differences between words, which still leaves us with the problem of how language relates to reality. Post-structuralism suggests that the meanings of words can slip rapidly and unpredictably. Few people who have given the matter any thought are naive enough to believe that words have direct and unproblematic relationships with objects. That mental concepts come somewhere in between isn&#8217;t a particularly controversial statement, but there&#8217;s plenty of room for controversy about how far those mental concepts influence perception and communication. I&#8217;m hoping that cognitive science will give us some more definite answers (although it shouldn&#8217;t be assumed that those answers will necessarily be reductionist or realist) but right now the jury is still out.</p>
<p>If memes rely on meaning then they are as problematic as anything else which relies on meaning, but if we exclude meaning from memetics then it doesn&#8217;t seem to be much use. We still need to work out some basic things about language, culture, and the human brain (to a certain extent that &#8220;we&#8221; is really &#8220;I&#8221;, because I know far too little about the current state of cognitive science, but there is almost certainly more work to be done). Until then, explaining human culture is likely to remain beyond the scope of memetics.</p>
<p>On the other hand, the model might be usefully applied to the propagation of information independently of human language and culture. Computer viruses are self-replicating information, but they can only replicate if some of that information has meaning within the context of a computer&#8217;s operating system. This fits perfectly into the structuralist paradigm while avoiding some of the difficult questions raised by post-structuralism. A computer language is an arbitrary system which is fixed in relation to itself but which does not have a fixed relationship with reality. Unlike human language, computer languages don&#8217;t normally slip. We can clearly see a synchronic moment between changes in a language specification (compare that to human language, where a complete picture of a language system in a synchronic moment is unattainable in practice). Non-standard implementations could be classed as slippage, but they could just as well be classed as different language systems in their own synchronic moments. I&#8217;m not sure whether this gets us anywhere since memes aren&#8217;t necessary to explain computer languages or operating systems.</p>
<p>While I reject the meme as a tool for explaining human culture there clearly are ideas circulating in the blogosphere through copying from one blog to another. In this context, there <em>are</em> memes, but when they&#8217;re explicitly called memes they&#8217;re often pretty much what I first thought: fun but pointless. However, since I graduated from LiveJournal to WordPress last October and joined the history blogosphere I&#8217;ve been able to take part in a much more interesting exchange of ideas than just telling the world which dysfunctional Care Bear I am (you can probably guess that it was Nihilist Bear anyway). Until today this blog has officially been a meme free zone, but I&#8217;ve written several posts that were inspired by reading other people&#8217;s blogs. In fact around a quarter of my posts so far have been responding to something on another history blog. For example, my post on <a href="http://www.investigations.4-lom.com/2006/10/19/narratives-global-war/" title="Investigations of a Dog: Grand Narratives of Global War">Grand Narratives of Global War</a> was originally going to be posted as a comment on <a href="http://airminded.org/" title="Airminded">Airminded</a> until it got so long and complicated that it had to be a post in its own right. In the light of this, I don&#8217;t think there&#8217;s any need for history bloggers to tag each other with memes, because we&#8217;re already interacting in a more interesting and productive way.</p>
<h3>Bibliography</h3>
<ol>
<li>Tim Hitchcock and Michele Cohen (eds.), <span style="font-style: italic">English masculinities, 1660-1800</span> (Addison Wesley: London, 1999). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A0582319226&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=English%20masculinities%2C%201660-1800&amp;rft.place=London&amp;rft.publisher=Addison%20Wesley&amp;rft.aufirst=Tim&amp;rft.aulast=Hitchcock&amp;rft.au=Tim%20Hitchcock&amp;rft.au=Michele%20Cohen&amp;rft.date=1999&amp;rft.isbn=0582319226"></span></li>
<li>Dror Wahrman, &#8216;Percy&#8217;s prologue&#8217;, <span style="font-style: italic">Past and Present</span>, 159 (1998), pp. 113-60. <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.atitle=Percy's%20prologue%3A%20from%20gender%20play%20to%20gender%20panic%20in%20eighteenth-century%20England&amp;rft.jtitle=Past%20and%20Present&amp;rft.volume=159&amp;rft.aufirst=Dror&amp;rft.aulast=Wahrman&amp;rft.au=Dror%20Wahrman&amp;rft.date=1998&amp;rft.pages=113-60&amp;rft.issn=00312746"></span></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/01/23/memes/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Historical Information and Noise</title>
		<link>http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/</link>
		<comments>http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/#comments</comments>
		<pubDate>Tue, 21 Nov 2006 19:50:51 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[information theory]]></category>
		<category><![CDATA[theory]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Historical+Information+and+Noise&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2006-11-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/&amp;rft.language=English"></span>
Over the last 10 years or so, technology has brought huge changes to historical research and opened up new possibilities. Computers have solved some old problems, but also created some new ones. Meanwhile there has been an increasing focus on the problems of epistemology: what can we know about the past and how can we [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Historical+Information+and+Noise&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2006-11-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/&amp;rft.language=English"></span>
<p>Over the last 10 years or so, technology has brought huge changes to historical research and opened up new possibilities. Computers have solved some old problems, but also created some new ones. Meanwhile there has been an increasing focus on the problems of epistemology: what can we know about the past and how can we know it? The debate has mostly been about the relationship between textual sources and the reality of the past. Even if you reject theory and take a purely empirical view of what the sources can tell us, there are some potential problems with the transmission of the information that they contain.</p>
<p><span id="more-23"></span></p>
<p>Information theory makes a clear separation between information and meaning. This was originally because information theory was concerned with the engineering problems of transmitting messages, and from this point of view any meaning which the message might have is irrelevant. Meaning is highly relevant to historians, because sources are not much use if we don&#8217;t know what they mean. Post-structuralist theory raises fundamental questions about the meaning of language, and when taken to its most extreme logical conclusion suggests that communication might be impossible. This is a problem which needs to be solved but it isn&#8217;t yet clear how that can be done. Treating historical sources as information isn&#8217;t a solution in itself, but it might offer a different way to approach the problem, since excluding meaning also excludes the problems associated with meaning.</p>
<p>Leaving aside this idle speculation, the distinction between information and meaning is still an important one. If the information itself is transmitted unreliably, then we have significant problems even before we start worrying about meaning. We don&#8217;t even have to go into source criticism to find potential inaccuracy. Sources might not be as original as they claim to be, but for the purposes of this post I&#8217;m going to take archival sources at face value. I&#8217;m thinking more about what happens to the information after it has been extracted from the original document.</p>
<p>When I started my PhD research in 1997, I was still using the old ways. Most of my time was spent in the Public Record Office reading seventeenth-century manuscripts and making notes with a pencil and paper. Since continuous notes on A4 paper were not flexible enough for any kind of analysis, I would later copy the information onto 6&#215;4 file cards (again by hand) and file them under relevant headings. Some sets of records went onto spreadsheets, but that was limited because I didn&#8217;t have my own computer then (I can&#8217;t imagine life without one now!). This system had many opportunities for noise to creep into the information.</p>
<p>First of all, there&#8217;s the question of whether I&#8217;d read the documents accurately. Seventeenth-century secretary hand isn&#8217;t easy to read, not least because some of the letters are formed in very different ways from modern handwriting. Progress was slow at first, but my reading ability soon improved with experience. While my transcripts probably got more accurate over time, it&#8217;s also likely that my increasing confidence led me to miss some relevant documents by skimming over them too quickly. Time, budget, and the huge quantity of documents I had to get through created pressure for speed at the expense of accuracy. This is a problem which all research projects have to deal with, and there is no ideal solution.</p>
<p>Assuming that I&#8217;d read the document correctly, had I copied the information correctly? There were several points where this could be a problem. First, in copying from the original documents to my notes in the archives; second, copying the notes onto file cards; and third, putting examples into the text during the process of writing up my thesis. If the information went onto a spreadsheet or database at a later date, this introduced an extra step and more chance of errors. Taking notes by hand was problematic because I might not be able to read my own handwriting accurately in the future. If I found my pencil notes to be completely illegible, that would deny access to the information unless I went back to the archive and looked at the document again. A more insidious threat was that I would misread my writing without noticing any ambiguity. Context didn&#8217;t always help much, because I was often dealing with numbers in sources like account books.</p>
<p>Once I got my hands on a laptop things changed significantly, but there were still potential issues. Being able to type my notes while looking at the original document in the archive removed some of the intermediate steps. For records which would fit easily onto a database I could enter them directly into a table. There was no chance that this information would deteriorate by being copied by hand several times, and no chance that it would turn out to be illegible. However, typing can introduce some errors more easily than writing. Keying errors can occur frequently even when you know what the document says and know what you&#8217;re trying to type. Autocorrect claims to reduce these kinds of errors, but can also introduce errors of its own, and doesn&#8217;t help at all with numerical data. The pressures of time and cost still apply, and repetitive data entry can lead to loss of concentration.</p>
<p>These problems are not unique to my PhD. There are some major research projects in progress which rely heavily on Access databases, with research assistants in the archives entering data directly from original sources. Team projects have the advantage of experienced leadership, thorough training, and regular assurance checks. All of these things will help to reduce errors, but they can never be eliminated completely. Wherever it&#8217;s necessary to send research assistants to look at original documents, it&#8217;s usually only practical to have one assistant look at each document once. Management assurance checks are only ever likely to be random samples, although the frequency of sampling would be expected to be highest early in the project when staff are least experienced.</p>
<p>However, the whole idea of taking notes in archives might be coming to an end. Some archives (most notably the Public Record Office) allow the use of cameras to photograph documents. Over the last few years the quality and storage capacity of digital cameras have improved drastically, while prices have kept falling. My current camera cost less than £200. At 3 megapixels the quality is good enough for my needs, and the 1GB memory card can store over 1,000 high-quality pictures. This has completely changed the way I do research. Once I&#8217;ve identified relevant documents, I can just photograph them and work from the images when I get home. All the uncertainty of my old note-taking is gone. There&#8217;s still a chance of missing relevant documents, but this is reduced because the time spent transcribing documents is eliminated. Less time needs to be spent in archives and so research is cheaper.</p>
<p>Looking beyond the individual level, the implications of digital imaging are even bigger, especially combined with the growth of the internet and the increasing availability of broadband connections. It is now possible to digitise whole collections of historical records and make them available on the web. Once this has been done, the cost of working on those records is further reduced for future researchers because few people will need to go to the archives to check the originals. However, simple images of documents are only of limited value. To get the most out of the sources they need to be transcribed into digital text so that they can be searched. XML tagging adds even more value, describing the structure of the document, and classifying pieces of information (such as names of people or places), so that data can easily be extracted into databases while preserving the whole text. Transcribing and tagging the information introduces new possibilities for noise to get in, but the technology mitigates these problems to a certain extent.</p>
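<p>As a rough illustration of how tagged text can feed a database while the full transcript survives, here is a minimal Python sketch. The element names (<code>trial</code>, <code>person</code>, <code>place</code>), the sample sentence, and the function name are all made up for the example and are not taken from any real project&#8217;s schema.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged transcript: the element names and the sentence
# are invented for illustration only.
SAMPLE = (
    "<trial id='t1'>"
    "<person>John Smith</person> was indicted for stealing a watch in "
    "<place>Cheapside</place> from <person>Mary Jones</person>."
    "</trial>"
)


def extract_records(xml_text):
    """Return the untagged full text plus one row per tagged item."""
    root = ET.fromstring(xml_text)
    # itertext() walks every text fragment in document order,
    # so joining them rebuilds the whole transcript without the tags.
    full_text = "".join(root.itertext())
    # Each tagged element becomes a (document id, category, value) row
    # that could be loaded straight into a database table.
    rows = [
        (root.get("id"), element.tag, element.text)
        for element in root.iter()
        if element is not root
    ]
    return full_text, rows


full_text, rows = extract_records(SAMPLE)
print(full_text)
for row in rows:
    print(row)
```

<p>The point of the design is that nothing is thrown away: the rows are a view onto the text, not a replacement for it.</p>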
<p>The <a href="http://www.oldbaileyonline.org/about/" title="Old Bailey online">Old Bailey Proceedings</a> provides a good case study of a cutting-edge digitisation project. The first phase of the project (now complete) involved imaging 60,000 pages of original trial reports, transcribing millions of words, and inserting XML tags. Working from digital images allowed the use of double-keying: every page was typed twice by different people, and discrepancies between the two versions were flagged by a computer to be checked manually. While this isn&#8217;t necessarily cheap, it&#8217;s likely to be more cost-effective than sending research assistants to archives because the work can be outsourced to home workers. Cost, speed, and accuracy depend on the age and legibility of the source material. Early modern manuscripts still require palaeography skills, which makes it harder to find suitable staff and potentially increases the cost (although I imagine there are plenty of people with PhDs in early modern history who can&#8217;t get jobs!).</p>
<p>The double keying process used by <a href="http://heds.herts.ac.uk/" title="Higher Education Digitisation Service">HEDS</a> claims an accuracy rate of 99.8% provided that the text in the digital images is legible. In practice, historical records are often not legible enough to be transcribed accurately. The <a href="http://www.oldbaileyonline.org/about/" title="Old Bailey online">Old Bailey Proceedings</a> website shows that this can be overcome to a certain extent because the images of the documents are available online along with the digitised text. Wherever it&#8217;s noted that text couldn&#8217;t be transcribed satisfactorily you can click on a link to the image and try to decipher it yourself. Things are potentially more tricky in cases where the text is legible but a transcription error has made it through the double keying process, because users won&#8217;t necessarily check the image if they think there isn&#8217;t a problem with the text. However, academics who need to be certain of accuracy and who are used to travelling to archives won&#8217;t find it much trouble to click on the link to the image and make sure.</p>
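<p>The comparison step in double keying can be sketched in a few lines of Python. This is only an assumption about how such a check might work, not a description of the software HEDS or the Old Bailey project actually used; the function name and the sample sentences are invented.</p>

```python
import difflib


def flag_discrepancies(first_keying, second_keying):
    """Compare two transcripts word by word and list where they disagree.

    Returns (word position in the first transcript, words from the first
    transcript, words from the second transcript) for each disagreement.
    """
    words_a = first_keying.split()
    words_b = second_keying.split()
    # SequenceMatcher aligns the two word lists, so a dropped or extra
    # word does not throw every later word out of step.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            flagged.append(
                (a_start, words_a[a_start:a_end], words_b[b_start:b_end])
            )
    return flagged


# Invented example: the second keyer has mistyped one word.
first = "the prisoner was indicted for stealing a silver watch"
second = "the prisoner was indited for stealing a silver watch"
for position, a_words, b_words in flag_discrepancies(first, second):
    print(position, a_words, b_words)
```

<p>Note what the sketch cannot do: if both keyers make the same mistake, the two versions agree and nothing is flagged, which is exactly the kind of error that gets through.</p>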
<p>If human errors were completely random and independent, double keying would cut the chance of an error slipping through dramatically. For example, if the probability of one keyer mistranscribing each word was 0.1, then the probability of both keyers mistranscribing that same word would be 0.1 &#215; 0.1 = 0.01 (I think that&#8217;s right but I have a few doubts, as it&#8217;s a long time since I did any maths! I just hope this doesn&#8217;t turn into something like the notorious circle thread at The Valve). In practice it&#8217;s likely that human errors aren&#8217;t perfectly random, and that there is a greater chance of people making the same mistakes, which a comparison of the two versions can&#8217;t catch. If HEDS meets its accuracy target of 99.8% for the 70 million words in the next phase of the Old Bailey/Plebeian Lives project, you might expect 0.2% of those words, or 140,000, to be mistranscribed. That might not be true in practice, because if the 0.2% error rate is calculated from letters rather than words, the wrong letters won&#8217;t necessarily be evenly distributed through words: several might be concentrated in a single wrong word, so the number of wrong words would not simply match the number of wrong letters. In any case, this error rate isn&#8217;t as bad as it sounds, because the text will be checked again by data developers during the XML tagging process, and presumably a certain percentage of their work will be checked by managers, so many of those errors will be caught.</p>
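<p>The arithmetic above is easy to check in Python. The sketch below assumes that the two keyers&#8217; errors are independent and that the 99.8% accuracy figure is counted per word; both are simplifications, and the function names are invented for the example.</p>

```python
def chance_both_wrong(p_error):
    """Probability that two independent keyers both get the same word wrong."""
    return p_error * p_error


def expected_wrong_words(total_words, accuracy):
    """Expected number of wrong words if accuracy is measured per word."""
    return round(total_words * (1 - accuracy))


print(round(chance_both_wrong(0.1), 4))          # 0.01
print(expected_wrong_words(70_000_000, 0.998))   # 140000
```

<p>The 0.01 figure is an upper limit on errors that go unnoticed under these assumptions, because two wrong readings only escape the comparison when they happen to be identical.</p>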
<p>Ultimately human error is unavoidable. Best practice is to design and implement systems which minimise errors as far as possible, although constraints of time and budget place limits on accuracy. The most basic step is to never rely on one person. It&#8217;s also important to combine people and computers so that they cover each other&#8217;s weaknesses. Technology has vastly improved the systems for transmission of historical information. The challenges faced at the cutting edge of digitisation are mostly down to the unprecedented scale of the latest projects. Ten years ago, transcribing 70 million words would have been impractical, prohibitively expensive, and perhaps unimaginable.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

