<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Investigations of a Dog &#187; ocr</title>
	<atom:link href="http://www.investigations.4-lom.com/tag/ocr/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.investigations.4-lom.com</link>
	<description>Failing better at understanding the past</description>
	<lastBuildDate>Sun, 05 Feb 2012 09:18:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Proofreading</title>
		<link>http://www.investigations.4-lom.com/2007/02/23/proofreading/</link>
		<comments>http://www.investigations.4-lom.com/2007/02/23/proofreading/#comments</comments>
		<pubDate>Fri, 23 Feb 2007 12:01:05 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[sandall]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/02/23/proofreading/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Proofreading&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/23/proofreading/&amp;rft.language=English"></span>
In my last project update I described how I used FineReader to OCR the text of Sandall&#8217;s History of 5th Lincolnshire Regiment. Since then I&#8217;ve manually proofread the text and inserted some basic XML markup. Proofing and basic tagging have given me a more detailed understanding of the text and the features in it, and [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Proofreading&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/23/proofreading/&amp;rft.language=English"></span>
<p>In my last project update I described how I used FineReader to OCR the text of Sandall&#8217;s History of 5th Lincolnshire Regiment. Since then I&#8217;ve manually proofread the text and inserted some basic XML markup. Proofing and basic tagging have given me a more detailed understanding of the text and the features in it, and I&#8217;ve been noting potential issues as I go. I&#8217;ll post more about how I&#8217;m using XML later, but this post is a more detailed description of the process of proofreading.</p>
<p><span id="more-60"></span></p>
<p>I used FineReader for proofreading, as its interface makes it easy to compare digitized text with the original image. I had previously stepped through the uncertain characters and unrecognised words, so wasn&#8217;t expecting major problems, but I still wanted to iron out any remaining scannos and impose consistent style before I exported the text. In the post on OCR I said I was exporting copies of the text at each stage for comparison purposes. I&#8217;ve now decided to abandon this because I was trying to do two contradictory things at once: create a useable text (the main aim of the project) and produce a perfect rendition of the sequence of characters in the original text in order to assess the accuracy of FineReader&#8217;s OCR. I now realise that it isn&#8217;t possible or desirable to do both at once. For example, where the text uses I for 1 the two objectives can&#8217;t be reconciled. If FineReader renders it as &#8216;I&#8217; that&#8217;s a correct rendition but in order to create a useable text I need to change it to &#8216;1&#8217;. If I wanted to quantify the accuracy of the OCR I&#8217;d have to do it separately from creating a useable text, but I don&#8217;t think the effort would be justified. I&#8217;ve already got enough of a feel for FineReader to know that it&#8217;s good enough for what I want to do but that some manual proofing will always be necessary to create a releasable text.</p>
<p>I decided not to carefully compare the text and image line by line as this would take too long, and I wasn&#8217;t expecting enough scannos to make it worthwhile. Instead I just read through the digitized text, slowly and carefully, but in Distributed Proofreaders terms it was more like smooth reading than proofing. Since I had never read the whole book before it was unfamiliar enough to make this a viable approach. However, I made a point of double checking every number and every proper noun which I wasn&#8217;t familiar with. This slowed down the reading, and rapidly switching my attention from the text in the lower pane to the image in the upper pane and back again gave me eye strain! If I was scaling the project up, this is the kind of work which I could <strike>palm off onto someone else</strike> get volunteers to help with as it doesn&#8217;t require a great deal of technical skill (apart from knowledge of spelling and punctuation) and can be split into small chunks, as at <a href="http://www.pgdp.net/c/" title="Project Gutenberg Distributed Proofreaders">PGDP</a>. I also checked every end of line hyphen to make sure it was correctly marked as soft or hard. FineReader had already got a lot of these right, but there were some cases where it was obviously wrong, or which were ambiguous. These ambiguities could only be resolved by looking for occurrences of the same word elsewhere in the text. As I went I built up a hyphenation list, showing whether certain words should have a hard hyphen or not. For example, headquarters didn&#8217;t but out-buildings did. I was aiming for consistent usage according to the original text, regardless of what might now be considered &#8220;correct&#8221;.</p>
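<p>The check against usage elsewhere in the text could in principle be automated. What follows is only a minimal sketch in Python, not the method used for this project: the function names and the sample sentence are invented for illustration, and it assumes the proofed text is available as a plain string.</p>

```python
import re
from collections import Counter

def hyphen_evidence(text, left, right):
    """Count how often a word broken as left-/right at a line end appears
    elsewhere in the text, either joined or with a hyphen."""
    # Treat runs of letters joined by hyphens as single words.
    words = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    )
    joined = words[(left + right).lower()]
    hyphenated = words[(left + "-" + right).lower()]
    return joined, hyphenated

def classify(text, left, right):
    """Suggest whether an end-of-line hyphen is soft, hard, or undecidable."""
    joined, hyphenated = hyphen_evidence(text, left, right)
    if joined > hyphenated:
        return "soft"
    if hyphenated > joined:
        return "hard"
    return "ambiguous"

# Invented sample text, only for demonstration.
sample = ("The headquarters moved. Battalion headquarters was shelled. "
          "The out-buildings were burnt.")
print(classify(sample, "head", "quarters"))
print(classify(sample, "out", "buildings"))
print(classify(sample, "dug", "outs"))
```

<p>Anything reported as ambiguous would still need a human decision, which is where a hand-built hyphenation list remains useful.</p>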
<p>Before I started reading I spent about 15 minutes manually removing running heads from the tops of the pages. I could have done it slightly quicker using Find and Replace, but that wouldn&#8217;t have saved a great deal of time because the book has over 20 chapters which are all quite short. In future I&#8217;d recommend adjusting the text capture boxes to cut out all extraneous matter. This includes junk from around the edges of the pages as well as headers and footers.</p>
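<p>For a longer book, removing running heads from exported page text might be scripted. This is a rough sketch under stated assumptions: the pattern for a running head (a line in capitals with an optional page number on either side) and the sample page are hypothetical and would need adjusting to the actual book.</p>

```python
import re

# Hypothetical pattern: optional page number, a line in capitals,
# optional page number. This is an assumption, not taken from the book.
RUNNING_HEAD = re.compile(r"^\s*(\d+\s+)?[A-Z][A-Z .,'-]+(\s+\d+)?\s*$")

def strip_running_head(page_text):
    """Drop the first line of a page if it looks like a running head."""
    lines = page_text.split("\n")
    if lines and RUNNING_HEAD.match(lines[0]):
        lines = lines[1:]
    return "\n".join(lines)

# Invented sample pages, only for demonstration.
with_head = "42 THE FIFTH LINCOLNSHIRE REGIMENT\nThe Battalion moved up on the 13th."
without_head = "The Battalion moved up on the 13th.\nIt was relieved a week later."

print(strip_running_head(with_head))
print(strip_running_head(without_head))
```

<p>Only the first line of each page is examined, so a short line in capitals in the body of the text is left alone.</p>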
<p>It took about 8 hours to read through the whole text, including front and back matter. I found very few scannos. The most usual one was where the original text rendered &#8216;11th&#8217; as &#8216;IIth&#8217; and FineReader thought it was &#8216;nth&#8217;. Many of these had been flagged as uncertain but a few had slipped through. There were a few obviously wrong letters but generally everything made sense in context. Some stealth scannos could have slipped through, but I&#8217;m not prepared to put in the effort to find them. The perfect is the enemy of the good. Double checking names and numbers which were not self-evidently wrong turned out to be more trouble than it was worth as errors here were negligible. Most of the changes I made during proofing were actually formatting. As well as checking hyphens I had to make sure that abbreviations conformed to my style guidelines. In future I must make sure that FineReader isn&#8217;t automatically inserting spaces after full points as I spent a lot of time correcting this, especially with all the corps and rank abbreviations in a book like this. Obvious typographic errors in the original text were marked with [sic], but I didn&#8217;t find many.</p>
<p>Once proofing was finished I exported each batch to a utf-8 text file. Although FineReader marks soft hyphens as ¬ to distinguish them from hard hyphens, it converts them all to &#8211; when exporting them to text, which is slightly annoying. Because of this I ran Find and Replace over each batch to replace every ¬ with the entity reference &amp;shy;. For my purposes I could leave it like this, but if I was creating individual page files for other people to proof through the DP interface I&#8217;d want to convert them back to ¬ after the export, which is easily done. FineReader has an option to insert a page break character at the end of every page when exporting a batch to a single file. This would be more useful if you could choose the character to insert, but as it is the extra step needed to replace the page break character with the TEI &lt;pb/&gt; element isn&#8217;t too much trouble.</p>
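<p>The two replacements described above are simple enough to script. The following is a minimal sketch in Python rather than the Find and Replace steps actually used. It assumes the ¬ marker is still present in the text being processed and that the page break character is a form feed, which is a guess, since the post doesn&#8217;t say which character FineReader inserts.</p>

```python
SOFT_HYPHEN_MARK = "\u00ac"  # the ¬ character used to mark soft hyphens
PAGE_BREAK = "\f"            # assumption: page breaks exported as form feeds

def convert_markers(text):
    """Replace soft hyphen markers with an entity reference and
    page break characters with an empty TEI pb element."""
    text = text.replace(SOFT_HYPHEN_MARK, "&shy;")
    text = text.replace(PAGE_BREAK, "<pb/>")
    return text

# Invented sample, only for demonstration.
sample = "the head\u00ac\nquarters moved\fnext page"
print(convert_markers(sample))
```

<p>Going the other way, for page files to be proofed through the DP interface, is the same replace call with the arguments swapped.</p>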
<p>With each batch exported to a single file I just had to tidy them up a bit and then I could start marking them up with XML tags. More about that in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/02/23/proofreading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital History Projects: OCR</title>
		<link>http://www.investigations.4-lom.com/2007/02/07/digital-history-projects-ocr/</link>
		<comments>http://www.investigations.4-lom.com/2007/02/07/digital-history-projects-ocr/#comments</comments>
		<pubDate>Wed, 07 Feb 2007 20:11:30 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[sandall]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/02/07/digital-history-projects-ocr/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Digital+History+Projects%3A+OCR&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-07&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/07/digital-history-projects-ocr/&amp;rft.language=English"></span>
Now that I&#8217;ve got all the theoretical agonising out of the way, I can actually do something about digitizing the text. This week I&#8217;m carrying out OCR and proofreading on the text of Sandall&#8217;s History of 5th Battalion the Lincolnshire Regiment. As soon as I got to work I encountered issues that I hadn&#8217;t thought [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Digital+History+Projects%3A+OCR&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-02-07&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/02/07/digital-history-projects-ocr/&amp;rft.language=English"></span>
<p>Now that I&#8217;ve got all the theoretical agonising out of the way, I can actually do something about digitizing the text. This week I&#8217;m carrying out OCR and proofreading on the text of Sandall&#8217;s <em>History of 5th Battalion the Lincolnshire Regiment</em>. As soon as I got to work I encountered issues that I hadn&#8217;t thought of, and found that subjective decisions had to be made even earlier than I&#8217;d anticipated. This just shows that the only way to learn how to do something is to do it.</p>
<p><span id="more-55"></span></p>
<p>Yesterday I started by preparing the image files for OCR. I found it convenient to scan two facing pages at once, but for OCR and proofing it&#8217;s more convenient to have a single page per image, and to remove all the plates. This isn&#8217;t too difficult to achieve with batch processing. I started with Microsoft Office Picture Manager. I used it to work out where to crop the images, but couldn&#8217;t use it to do the cropping because its batch processing features seem to break if you select too many images! No problem, because Irfanview is better anyway. I discovered that a few of the page images towards the end of the book had come out slightly larger than the rest, which meant they had to be done in a separate batch. I also had to run separate batches for rectos and versos as I couldn&#8217;t find a way to split one file into two.</p>
<p>Once I had a separate image for each page, I rearranged them, deleted the plates, and used Irfanview to do a batch rename, so that the file name of each image matches the page number. The first attempt had to be redone because I forgot to delete the blank pages on the reverse of the plates, resulting in the page numbers getting out of sequence! Prefatory material was kept in a separate directory as it&#8217;s numbered with roman numerals and outside the main sequence of page numbers. I also separated Teall&#8217;s epilogue (which I still don&#8217;t know whether I&#8217;m going to publish) and the end matter (medal lists and index which are formatted differently from the main text). Preparing the images took about an hour, but would be quicker if I did it again with the benefit of this experience.</p>
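<p>The renaming step, including the pitfall with plates and their blank reverses, can be sketched in a few lines of Python. This is an illustration only, not what Irfanview does internally: the file names, the three-digit numbering, and the function name are all assumptions. It only builds a plan of old and new names, so nothing is renamed until the plan has been checked.</p>

```python
def plan_renames(filenames, first_page=1, skip=()):
    """Map scan files, given in reading order, to names based on printed
    page numbers. Files in `skip` (plates and their blank reverses) are
    left out so they don't push the numbering out of sequence."""
    plan = {}
    page = first_page
    for name in filenames:
        if name in skip:
            continue
        ext = name.rsplit(".", 1)[-1]
        plan[name] = f"{page:03d}.{ext}"
        page += 1
    return plan

# Hypothetical file names, only for demonstration.
scans = ["scan_0001.tif", "scan_0002.tif", "scan_0003.tif",
         "scan_0004.tif", "scan_0005.tif"]
plates = {"scan_0003.tif", "scan_0004.tif"}  # a plate and its blank reverse

for old, new in plan_renames(scans, skip=plates).items():
    print(old, "->", new)
```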
<p>With all the images ready, I installed the 15 day trial version of ABBYY FineReader Pro 8.0. My first impression is that it&#8217;s pretty good, and if I do any more similar projects I&#8217;ll probably buy a license for this rather than trying to find a free alternative. The Distributed Proofreaders site suggests that you usually get what you pay for with OCR software. I spent some time finding my way around the software, reading the help and tutorials, and trying a few test pages. Once I was ready to go, I created a new batch for each of the four sections: prefatory (6 pages), main (196 pages), teall (10 pages), and end (17 pages). The main batch of 196 pages took about 3 minutes to import the images, and 20 minutes to be read by the OCR. Mostly the structure of the text was detected and converted into boxes fairly accurately. A few pages had more complicated layout (this was more true in the appendix with table of medals awarded) where I had to redraw the boxes manually. Once OCR reading was complete I saved all the pages as text files for future comparison with corrected text, so that I can see roughly how accurate uncorrected OCR is.</p>
<p>The next step was an automated spell check. FineReader flags all uncertain characters and non-dictionary words, then steps through them and asks for user input. I found the spellcheck window too small, but it doesn&#8217;t have to be used as you can position it so that you can see the text in the main panes. There are three main panes: an image of the whole page, showing boxes; a close-up of part of the page image; and the digitized text. These can all be resized and zoomed to suit the user. As I stepped through the checks I was able to spot various problems with the OCR and had to make decisions about how to correct things. I wrote style guidelines as I went, adding each new feature or issue as I encountered it. I already had a good idea of how to handle things (ideally keep the sequence of characters the same as in the book) but no plan survives contact with the enemy. Even at the stage of trying to match a sequence of characters to a finite set of characters there are subjective decisions to be made. This is a summary of the guidelines I came up with:</p>
<p>Delete running heads and page numbers. This is how <abbr title="Project Gutenberg Distributed Proofreaders">PGDP</abbr> does it. I don&#8217;t necessarily agree with everything they do, but in this case I don&#8217;t see the headers and footers being any use to the target audience and they are more likely to get in the way. Pages will be marked with the TEI pb element anyway. Anyone who wants to see the original layout can look at the images. So far I haven&#8217;t been deleting heads or numbers in the spell-check phase. I&#8217;m hoping it can be done automatically (using tools from PGDP) but if it can&#8217;t it will have to be done during manual proofing.</p>
<p>Keep line breaks. They will eventually be marked up with the TEI lb element, provided that it can be done automatically. If it can&#8217;t then I won&#8217;t bother. The main exception is the index, where I&#8217;ve decided to keep each entry on one line. This will probably be marked up as a list or table, so line breaks will be irrelevant and possibly counter-productive.</p>
<p>Keep soft hyphens. FineReader marks possible soft hyphens as ¬ and often flags them as uncertain characters. It should be easy to replace the character with an entity reference. Sometimes it might not be clear whether a hyphen is soft or hard (eg Head-quarters). In these cases I can look for the same word elsewhere in the text. If the word only occurs once, then I can at least make an arbitrary decision without worrying about consistency.</p>
<p>Dashes will be replaced by entity references, except in the index where a hyphen stands for &#8220;ditto&#8221;, in which case it will be replaced by [do] to indicate how tables or lists should be structured. During the spell-check I&#8217;m leaving them as they are.</p>
<p>Accented characters aren&#8217;t being picked up by FineReader as it doesn&#8217;t seem to have foreign dictionaries installed. This is something I can fix for future projects. As it is, most accented characters are being flagged as uncertain, so I can add them manually using keyboard shortcuts or character map. I&#8217;m still undecided about whether to leave them as Unicode characters or convert them to entity references (I&#8217;ve read conflicting advice on this) but conversion should be trivial using Find and Replace.</p>
<p>Mistakes in the text are to be preserved. I&#8217;ve marked them with [sic]. This wouldn&#8217;t be appropriate for an edited text which already has its own [sic]s but I can get away with it here. The important principle is to flag the mistakes in a distinctive way which can be separated from the original text.</p>
<p>Roman numerals are mostly to be transcribed as they are, but there&#8217;s one exception. Sandall consistently uses I instead of 1. This means that my ideal of preserving the original sequence of characters is completely untenable. It would be very inconvenient for me and for users to preserve this idiosyncrasy. Since the emphasis is on accessible and useful empirical data, it had to go. Fortunately most occurrences are flagged as uncertain characters since FineReader is commendably cautious about l and 1. The necessary correction was obvious where the roman I was mixed with Arabic numerals (eg I9I7). Where I is on its own, it could often be decided from context (eg 1st, 1/5th, but I Corps). So far I haven&#8217;t encountered a truly intractable case, but there&#8217;s at least the possibility of checking against usage elsewhere in the text.</p>
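<p>The clear-cut cases of I standing for 1 lend themselves to pattern matching. The sketch below is an illustration in Python, not the procedure used here (the corrections were made by hand while stepping through FineReader&#8217;s checks). The three patterns are assumptions drawn from the examples above, and anything they don&#8217;t match, such as a lone I before a word, is deliberately left for a human to decide from context.</p>

```python
import re

# A token made only of digits and capital I, with at least one digit: I9I7
MIXED = re.compile(r"\b(?=[0-9I]*[0-9])[0-9I]+\b")
# A run of capital I followed by an ordinal suffix: Ist, IIth
ORDINAL = re.compile(r"\bI+(?=(?:st|th)\b)")
# A single capital I before a slash and a digit: I/5th
SLASH = re.compile(r"\bI(?=/[0-9])")

def fix_ones(text):
    """Replace capital I with 1 in the unambiguous numeric contexts only."""
    text = MIXED.sub(lambda m: m.group(0).replace("I", "1"), text)
    text = ORDINAL.sub(lambda m: "1" * len(m.group(0)), text)
    text = SLASH.sub("1", text)
    return text

# Invented sample sentence, only for demonstration.
print(fix_ones("In I9I7 the I/5th joined I Corps on the IIth."))
```

<p>The pronoun I and genuine roman numerals such as I Corps pass through unchanged because none of the patterns fires on a capital I followed by a space.</p>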
<p>Abbreviations are another headache. Again I wanted to preserve the original sequence of characters but am going to have to compromise over hyphens and points in abbreviations. I&#8217;ve decided to keep points but lose some hyphens.</p>
<p>Where the abbreviation is a series of single initial letters they can be separated by points but with no spaces or hyphens between them. So Q.M.S. not Q.-M.-S. or QMS or Q. M. S. Sandall sometimes has hyphens but sometimes doesn&#8217;t. It seems to depend on the rank in question (Q.-M.-S. but R.S.M.). I&#8217;ve decided to remove hyphens where they occur.</p>
<p>Where the abbreviation has one or more groups of letters from the same word, all points and hyphens are to be kept. So Lt.-Col., L.-Cpl., Maj.-Gen.</p>
<p>Some abbreviations don&#8217;t have points after every part, or use a slash instead of a hyphen (eg L/Cpl). In these cases they&#8217;re to be kept as they are. There are other abbreviations which have a space but no point or hyphen (A Coy, B Coy). These are also kept as they are (there is room for confusion here: a company or A company? I prefer to leave it as it is).</p>
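<p>The hyphen rule for abbreviations can be expressed as a single regular expression. This is a minimal sketch in Python under the assumption that an initial is exactly one capital letter followed by a point. It only removes hyphens between such initials and does not touch spaces, because closing up spaces would also change personal initials like T. E. Sandall.</p>

```python
import re

# A hyphen that sits between two single-letter initials, each with a point.
# The lookbehind stops it firing inside a longer group such as "Lt."
INITIAL_HYPHEN = re.compile(r"(?<![A-Za-z])([A-Z]\.)-(?=[A-Z]\.)")

def normalise_initials(text):
    """Q.-M.-S. becomes Q.M.S. while Lt.-Col. and L.-Cpl. are kept."""
    return INITIAL_HYPHEN.sub(r"\1", text)

# Invented sample, only for demonstration.
sample = "Q.-M.-S. Smith, R.S.M. Jones, Lt.-Col. Sandall, L.-Cpl. Brown, L/Cpl White"
print(normalise_initials(sample))
```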
<p>This sounds more confident than it actually is. I&#8217;m still compromising between preservation of original text and ease of use. Having tried some experiments with Google, it doesn&#8217;t seem to worry about points in abbreviations. A search for &#8220;ramc&#8221; seems to return the same results as a search for &#8220;r.a.m.c.&#8221;, but &#8220;r.-a.-m.-c.&#8221; confuses it. The Find function in Firefox isn&#8217;t so forgiving and will consider an abbreviation with points to be a different string from the same letters without points. If I want the best of both worlds, I could try marking the points up in XML and using AJAX to enable users to turn them on and off, but that might well be more trouble than it&#8217;s worth. Most of these abbreviations are ranks which will be marked up as part of personal names, or corps which will be marked up as organisation names, both with regularised forms in the attributes (I&#8217;m going to have to draw up a list of regularised ranks, corps, and formations). I haven&#8217;t been formatting abbreviations during the spell-check as my aim there is just to resolve issues flagged by FineReader and get the text closer to the original. I&#8217;ll be checking abbreviations later during manual proofreading, or using Find to track them down.</p>
<p>Fractions have caused problems for FineReader and usually come out as junk, but as they&#8217;re flagged as uncertain it&#8217;s easy to spot them. At this stage I&#8217;ve rendered them in PGDP style (1-1/2). During markup they&#8217;ll be marked up as numbers.</p>
<p>Another unexpected problem was the use of a large brace (ie } but bigger) to combine two rows of tabular data. My solution so far is to flag them with }} at the end of each row. Later they&#8217;ll be marked up as tables, with reference to the images to make sure the right hand column spans the correct rows.</p>
<p>Small capitals are another thing which I didn&#8217;t consider worth keeping. In the medal table in the appendix all names and ranks are in small caps. FineReader renders them as small caps in Rich Text but converts them to mixed letters when it exports to plain text. I think this is satisfactory and I can&#8217;t see much point marking them up as small caps.</p>
<p>Columns are to be marked up with the TEI element cb.</p>
<p>There was also some junk to be removed caused by shadows around the edges of the page images, but this wasn&#8217;t a major problem. It could be eliminated by cropping out the edges of the images.</p>
<p>Stepping through the automated checks for the main batch (196 pages) took 2 hours 50 minutes (1.1 pages per minute). In contrast, the end matter (only 17 pages) took 1 hour 30 minutes (0.2 pages per minute). This was mainly because of the smaller type, more complicated structure of the appendices and index, and more numbers (every occurrence of &#8220;1&#8221; was flagged as uncertain). At the end of this stage I saved another plain text copy of the pages in a different directory, again for comparison purposes. Once I get some software that can automatically calculate the differences between the two sets of files I&#8217;ll have a clearer idea of the accuracy of the basic OCR and the improvements that can be made by automated checks.</p>
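<p>One way to calculate the differences between an uncorrected page and its corrected counterpart is Python&#8217;s standard difflib module. This is a sketch of one possible approach, not a tool that was used in this project. The function names and sample strings are invented, and the similarity ratio is only a rough proxy for OCR accuracy.</p>

```python
from difflib import SequenceMatcher

def similarity(raw, corrected):
    """Return a 0..1 similarity ratio between two versions of a page."""
    return SequenceMatcher(None, raw, corrected, autojunk=False).ratio()

def changes(raw, corrected):
    """List every non-matching stretch as (operation, before, after)."""
    matcher = SequenceMatcher(None, raw, corrected, autojunk=False)
    return [
        (tag, raw[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Invented sample strings, only for demonstration.
raw_page = "on the nth of May"
corrected_page = "on the 11th of May"

print(round(similarity(raw_page, corrected_page), 3))
print(changes(raw_page, corrected_page))
```

<p>Run over each pair of files in the two directories, the list of changes would also show which kinds of correction were most common.</p>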
<p>My impression so far is that FineReader is reasonably accurate, and that it consistently flags uncertain letters. I spotted a few unflagged scannos as I was stepping through and might find more during manual proofing, but I&#8217;m generally pleased with what the software can do. So far I&#8217;ve only been using standard settings and haven&#8217;t explored the training features. If I was only making a digital text for my own personal use rather than for publication I think I&#8217;d be happy enough to run FineReader over it and rely on the automated checks without any manual proofing. It wouldn&#8217;t be perfectly accurate or consistent, but it would give me a useable text with many advantages over a printed book.</p>
<p>The next stage is manual proofreading to try to pick up anything that FineReader missed and to enforce consistent style where my style guidelines demand changes from the original text. Before I do that I&#8217;ll be thinking more carefully about my style choices and about any other potential issues that need to be resolved. I&#8217;ll also be looking at TEI in more detail to make sure I don&#8217;t do anything at the proofing stage which makes markup more difficult. I was hoping to break the process down into independent stages, but I&#8217;m starting to realise things are more interrelated than abstract models allow.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/02/07/digital-history-projects-ocr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital History Projects: Planning</title>
		<link>http://www.investigations.4-lom.com/2007/01/10/digital-history-planning/</link>
		<comments>http://www.investigations.4-lom.com/2007/01/10/digital-history-planning/#comments</comments>
		<pubDate>Wed, 10 Jan 2007 19:30:02 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[copyright]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[theory]]></category>
		<category><![CDATA[ww1]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/01/10/digital-history-planning/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Digital+History+Projects%3A+Planning&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-01-10&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/01/10/digital-history-planning/&amp;rft.language=English"></span>
In my New Year post, I mentioned that I&#8217;m thinking about carrying out a couple of digital history projects in connection with my First World War research. These projects are very small and should be relatively easy to carry out on my own, but there will almost certainly be challenges. Overcoming these will give me [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Digital+History+Projects%3A+Planning&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-01-10&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/01/10/digital-history-planning/&amp;rft.language=English"></span>
<p>In my New Year post, I mentioned that I&#8217;m thinking about carrying out a couple of digital history projects in connection with my First World War research. These projects are very small and should be relatively easy to carry out on my own, but there will almost certainly be challenges. Overcoming these will give me more experience of carrying out a digital history project (this is starting to sound like a job application again!), and produce useful resources. After that, I can move on to consider some more advanced issues, such as collaborating with other people, and dealing with seventeenth-century manuscripts. To make the experience even more useful, I&#8217;m trying to blog it as I go. This post is an outline of my plans so far. Now that I&#8217;ve published my plans I&#8217;ll have to carry them out!</p>
<p><span id="more-43"></span></p>
<h3>Background:</h3>
<p>Before starting any digital history project, you should read <a href="http://chnm.gmu.edu/digitalhistory/" title="Digital History">Digital History</a> by Dan Cohen and Roy Rosenzweig which is available online free of charge. It&#8217;s an easy to understand introduction which will tell you about different approaches to digital history, costs, benefits, potential problems, and generally help you to think clearly about what you want to achieve and how you can achieve it. The authors ask aspiring digital historians to get the right balance between caution and risk-taking. While a certain amount of planning is always necessary, there is also a risk that academic historians will spend too much time thinking rather than doing.</p>
<p>Having read the book and thought about the potential problems to be overcome, I think I&#8217;m in a good position to put my IT experience to use. I&#8217;ve been running my band&#8217;s website since April 2000, so I&#8217;m a veteran HTML coder. There have been many changes in web technology in that time, and I&#8217;ve always kept up with new developments while being careful not to jump on the latest bandwagon too soon. I have a good knowledge of (X)HTML, CSS, PHP, MySQL, image editing, Apache server management, and web accessibility issues.</p>
<p>I&#8217;ve studied the <a href="http://www.oldbaileyonline.org" title="Old Bailey Proceedings">Old Bailey Proceedings</a>, a large collection of digitized source material marked up with XML, got acquainted with the <a href="http://www.tei-c.org/" title="Text Encoding Initiative">Text Encoding Initiative</a> standards, and have experimented with XML myself. Recently I volunteered my services to <a href="http://www.pgdp.net/c/" title="Distributed Proofreaders">Distributed Proofreaders</a>, and have already proofread over 100 pages of text. I&#8217;ll be continuing with this, and once I&#8217;ve done 300 pages I can get experience of the next stages of the process. I&#8217;ll also be reading <a href="http://www.digitalhumanities.org/companion/" title="A Companion to Digital Humanities">A Companion to Digital Humanities</a>, another free online book.</p>
<p>I&#8217;m breaking my own rule that you should never rely on just one person when creating digital texts, but this is an experiment to see how much I can do on my own. I also want to avoid the added complications of managing a collaborative project. If things go well, I can start to look at ways of working as a team on unofficial and unfunded projects.</p>
<h3>Project Outlines:</h3>
<p>Project Sandall: This project will produce a digital edition of T. E. Sandall, <em>A History of 5th Battalion Lincolnshire Regiment</em> (Oxford, Blackwell, 1922). The text will be marked up with TEI compliant XML and published on the web. It will include a hyperlinked index of people, places, and organizations.</p>
<p>Project Wenham: This project will digitize and publish the correspondence of my great-grandfather, William A. Wenham, relating to his experiences as a prisoner of war during the First World War. The collection consists of a number of letters and postcards sent to his family from prison camps in Germany, some with photographs. The text will be transcribed and marked up with TEI compliant XML, and published on the web, along with background information written by me. There will be an index of people, and possibly places. Another optional extra will be selections of relevant documents from other sources, such as battalion war diaries.</p>
<h3>Theory:</h3>
<p>One of the great things about digitization projects is that you can avoid some of the more difficult theoretical controversies. The most relevant theory is information theory, and Shannon pointed out that the meaning of a message is irrelevant to the problem of transmitting it accurately. The aim of text digitization and basic markup is to represent the information in the original document as accurately as possible, minimising the noise which can be introduced as part of the process (see my post on <a href="http://www.investigations.4-lom.com/2006/11/21/historical-information-noise/" title="Investigations of a Dog: Historical Information and Noise">Historical Information and Noise</a>). Once the text is published, individual users can decide what it means to them and how far, if at all, it relates to the reality of the past.</p>
<p>However, record linkage brings problems of meaning and epistemology into play. This is likely to be worse in early modern manuscripts, which are noisier to begin with. For example, the same name might be spelt several different ways. Dealing with 20th century sources is easier because of standardized spelling, but linkage still involves taking an epistemological position. One advantage of TEI XML is that it has a built-in mechanism for representing uncertainty and quantifying epistemic probabilities using the &lt;certainty&gt; element. Furthermore, XML preserves the original text. Editorial decisions made during the tagging process have little or no impact on the integrity of the text itself. If the XML code is made freely available to users under the GPL, they can download the source files and edit them according to their own needs. Therefore, users who don&#8217;t agree with an editorial decision can easily change it in their own local copy of the text.</p>
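<p>To make the &lt;certainty&gt; mechanism more concrete, here is a minimal Python sketch that builds a linked name together with a certainty annotation. The attribute names (target, locus, degree) follow my reading of the TEI P5 guidelines and should be checked against whichever version of TEI is actually used. The ID, the index reference, the name, and the 0.8 figure are all invented for the example.</p>

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace, needed to write an xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def name_with_certainty(text, ref, degree):
    """Build a paragraph holding one linked name and a certainty note.

    text   -- the name exactly as printed (never altered)
    ref    -- hypothetical ID of the index entry the name is linked to
    degree -- editor's estimated probability (0 to 1) that the link is right
    """
    if degree > 1.0 or 0.0 > degree:
        raise ValueError("degree must be a probability between 0 and 1")

    p = ET.Element("p")
    name = ET.SubElement(p, "persName")
    name.set(XML_ID, "n1")
    name.set("ref", "#" + ref)
    name.text = text

    # The certainty element points back at the name and records how
    # confident the editor is in the value assigned to it.
    cert = ET.SubElement(p, "certainty")
    cert.set("target", "#n1")
    cert.set("locus", "value")
    cert.set("degree", f"{degree:.1f}")
    return ET.tostring(p, encoding="unicode")


# Invented example: a name as printed, linked to a made-up index entry.
print(name_with_certainty("Pte. Smith", "smith-j", 0.8))
```

<p>The printed text stays untouched inside the name element. A user who disagrees with the link only has to change or delete the ref and certainty values in their own copy.</p>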
<h3>Copyright:</h3>
<p>I believe that T. E. Sandall&#8217;s work is out of copyright in most of the world, since the book was published in 1922 and the author died in 1931, but I will be looking for more definite proof before I proceed with the project. The copyright status of the epilogue written by G. H. Teall is unknown, but it can be omitted if I can&#8217;t determine that it&#8217;s definitely out of copyright. The plates are more problematic and will probably be omitted.</p>
<p>William Wenham&#8217;s letters are more straightforward. There is no possibility of anyone outside the family having a claim on the copyright and no possibility of anyone being able to make money out of their publication. The status of the photographs taken at Cottbus is unknown. They might be out of copyright or they might be orphan works which are still under copyright. In this case I&#8217;m prepared to risk publishing them, but will remove them if a genuine copyright holder objects to their publication.</p>
<h3>Image Capture:</h3>
<p>I have a cheap A4 flatbed scanner which gives adequate quality. The main limitation is that it&#8217;s slow, but the quantity of images to be captured is relatively small. A digital camera can capture images faster, but the text quality is lower than with a scanner and would most likely lead to more OCR errors. Sandall will be scanned as 300dpi monochrome JPEGs. There is no need to create a high quality archival copy of the book for preservation purposes. The main aim of the project is to make the text more accessible and add value in the form of XML markup and indexing. The family papers will be scanned as full colour 300dpi TIFFs which will be kept as archival copies. These masters can be used to create 100dpi JPEGs for web publication. I intend to make images of the documents freely available in addition to the digital text.</p>
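<p>The resolutions above translate into pixel dimensions and file sizes with some simple arithmetic. The sketch below assumes a full A4 page (210 by 297 mm), which is the scanner bed rather than the size of any particular letter or postcard, so the figures are upper bounds. The function names are my own.</p>

```python
MM_PER_INCH = 25.4


def pixels(width_mm, height_mm, dpi):
    """Return the (width, height) in pixels of a scan at the given dpi."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def uncompressed_bytes(width_px, height_px, bits_per_pixel=24):
    """Raw size of an image before any compression is applied."""
    return width_px * height_px * bits_per_pixel // 8


A4 = (210, 297)  # millimetres

master = pixels(*A4, dpi=300)  # full colour archival TIFF
web = pixels(*A4, dpi=100)     # derivative JPEG for web publication

print("master:", master, round(uncompressed_bytes(*master) / 2**20, 1), "MiB")
print("web:   ", web)
```

<p>A full A4 master comes out at 2480 by 3508 pixels, or about 25 MiB uncompressed in 24-bit colour, while the 100dpi web version is 827 by 1169 pixels.</p>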
<h3>Digitizing Text:</h3>
<p>The scanned images of Sandall will be converted to digital text using OCR software. I have two free OCR packages: IRIS (which came free with a scanner) and Microsoft Office Document Imaging, but both have limited features and would be inadequate for this project. ABBYY FineReader Pro has more advanced features and is recommended by Distributed Proofreaders. For this project I can use the free trial version, and if I want to do more projects in the future I can buy a licence for under £100. PC Pro magazine considers OmniPage to be superior, but it costs around £400! Since no OCR is perfect, it will be necessary to proofread and correct the entire text. FineReader asks for user input when it encounters a doubtful character, and its interface should allow manual proofreading alongside that.</p>
<p>There is no possibility of using OCR for handwritten text, so the letters will have to be transcribed. Although we have the originals to work from, there might be some need to work from digital images. Apart from saving wear and tear on the documents, digital images are more flexible. Difficult text can be enlarged on screen, and contrast can be adjusted to bring out faded text. However, there is the added problem of viewing an image and typing at the same time, which might require specialised software. I&#8217;ve found that Zotero notes can be very useful for transcribing text from images or PDFs and might be adequate to start with. I could also use HTML/PHP/MySQL to cobble together something like the Distributed Proofreaders interface for my own use. Their front end is a simple web-based interface, and although I don&#8217;t know how their back end works, a local version just for my own use could be much simpler.</p>
<h3>XML Markup:</h3>
<p>XML can be daunting precisely because it&#8217;s so powerful and flexible: where do you start and how far do you go? Using TEI will give me a basic framework which covers the needs of the project and guarantees a certain amount of cross-compatibility. However, there are many different ways to implement TEI compliant XML and many subjective decisions to be made. At its most basic TEI markup need not be much more complex than HTML, but at the other extreme there are provisions for extremely detailed and complex markup, such as stylistic analysis of the structure of sentences. I&#8217;ve decided to split the markup into three phases:</p>
<p>First: mark up the basic structure of the text, such as paragraphs, headings, and page numbers. It might be possible to do some of this automatically, but manual checking and correction will be necessary.</p>
<p>Second: tag names of people, places, and organizations. For the printed text of Sandall it might be possible to automate some of this using regular expressions and Find and Replace functions. Proper nouns can be expected to have capital letters, and ranks and common referents (he, officer, man, etc.) should be easy to detect. Again, manual checking and correction will be necessary. At this stage names will only be tagged to indicate that they are names. No assumptions about identity will be made until the third phase.</p>
<p>Third: link/index records by assigning ID numbers and/or regularized forms to the name tag attributes. At this point some subjective decisions might be necessary, and certainty will have to be evaluated and assigned.</p>
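<p>The second phase could be partly automated along these lines. This is only a sketch in Python, assuming that personal names in the printed text usually follow a rank abbreviation. The list of ranks is a guess that would have to be compiled from the text itself, and the sample sentence is invented rather than quoted from Sandall.</p>

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list of rank abbreviations; the real list would be
# compiled from the text itself during proofreading.
RANKS = ["Lieut.-Col.", "Lieut.", "Capt.", "Major", "Col.", "Sergt.", "Pte."]

# A rank, optional initials, then a capitalised surname.
NAME = re.compile(
    r"\b(?:" + "|".join(re.escape(rank) for rank in RANKS) + r")"
    r"\s+(?:[A-Z]\.\s?)*[A-Z][a-z]+"
)


def tag_names(text):
    """Wrap each candidate name in a bare name element for later checking."""
    p = ET.Element("p")
    last = None  # most recently added name element
    pos = 0      # end of the previous match
    for m in NAME.finditer(text):
        gap = text[pos:m.start()]
        if last is None:
            p.text = gap
        else:
            last.tail = gap
        last = ET.SubElement(p, "name")
        last.text = m.group(0)
        pos = m.end()
    # Whatever follows the final match (or the whole text if none matched).
    if last is None:
        p.text = text[pos:]
    else:
        last.tail = text[pos:]
    return ET.tostring(p, encoding="unicode")


# Invented sentence, not a quotation from Sandall.
sample = "On the 13th Capt. A. B. Jones and Major Smith went forward."
print(tag_names(sample))
```

<p>The output only marks candidates as names, in line with the second phase. Anything the pattern misses or tags wrongly would still have to be caught by manual checking, and no IDs or certainty values are assigned until the third phase.</p>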
<p>XML tagging can be done in any text editor, but a dedicated XML editor might be useful. I have previously used a free version of Altova XMLSpy for experimenting with XML, but some more advanced features might be necessary so I&#8217;ve downloaded a trial version of oXygen. If it proves useful I can buy an academic licence for $48.</p>
<h3>Web Publishing:</h3>
<p>The documents and supporting material will be published in HTML or XHTML. All authoring will be done in XML, and XSLT will be used to transform the source code into other formats. For this reason there is no immediate need to decide whether to use HTML or XHTML, and in practice there is very little difference between XHTML and good HTML (anyone who has problems with the transition to XHTML has written bad HTML to start with!). The XML source files will also be made available so that users can download and manipulate them. Ideally the site would be published in XML only, with XSLT style sheets transforming the code to (X)HTML on the client side, but this would exclude older browsers which can&#8217;t handle XML.</p>
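<p>The kind of transformation involved can be sketched in a few lines. The example below uses Python&#8217;s ElementTree as a stand-in for an XSLT stylesheet, which would express the same rules as templates. The mapping from source elements to HTML elements is hypothetical, and the sample content is invented.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to HTML element names.
# An XSLT stylesheet would express the same rules as templates.
TAG_MAP = {"div": "div", "head": "h3", "p": "p", "name": "span"}


def to_html(src):
    """Recursively copy a source element into an HTML element tree."""
    out = ET.Element(TAG_MAP.get(src.tag, "span"))
    if src.tag == "name":
        # Keep the semantic distinction as a class so CSS can style it.
        out.set("class", "name")
    out.text = src.text
    out.tail = src.tail
    for child in src:
        out.append(to_html(child))
    return out


# Build a tiny source document through the API (in practice it would be
# parsed from a file with ET.parse). The content is invented.
div = ET.Element("div")
ET.SubElement(div, "head").text = "Chapter I"
p = ET.SubElement(div, "p")
p.text = "Orders were received by "
name = ET.SubElement(p, "name")
name.text = "Major Smith"
name.tail = "."

print(ET.tostring(to_html(div), encoding="unicode"))
```

<p>Because the source tree is never modified, the same XML files can be offered for download and transformed into other formats later with a different mapping.</p>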
<p>There will be little need for a database back-end as the sites will be small and the content will not be changed or added to frequently. A search engine which recognises XML fields (as at the OBP) would be useful, but as names will be linked to an index page, Google might be adequate for full text searches.</p>
<h3>Publicity:</h3>
<p>This is not much of an issue as the content will only appeal to a niche audience and the projects do not have to justify themselves to funding bodies. The main aim is to develop and demonstrate my digital history skills, so if I get it right it will be a useful demonstration of how to do it. The main avenues for promotion will be my blog, other people&#8217;s blogs, and the <a href="http://1914-1918.invisionzone.com/forums/" title="Great War forum">Great War forum</a>.</p>
<h3>Bibliography</h3>
<ol>
<li>Daniel J. Cohen and Roy Rosenzweig, <span style="font-style:italic;">Digital History: A Guide to Gathering, Preserving, And Presenting the Past on the Web</span> (University of Pennsylvania Press, November 2005). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Digital%20History%3A%20A%20Guide%20to%20Gathering%2C%20Preserving%2C%20And%20Presenting%20the%20Past%20on%20the%20Web&amp;rft.publisher=University%20of%20Pennsylvania%20Press&amp;rft.aufirst=Daniel%20J.&amp;rft.aulast=Cohen&amp;rft.au=Daniel%20J.%20Cohen&amp;rft.au=Roy%20Rosenzweig&amp;rft.date=2005-11-30&amp;rft.pages=316"></span></li>
<li>Thomas Edward Sandall, <span style="font-style:italic;">A History of the 5th Batt. the Lincolnshire Regiment</span> (Basil Blackwell: Oxford, 1922), pp. vi, 221. <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=A%20History%20of%20the%205th%20Batt.%20the%20Lincolnshire%20Regiment.%20by%20Colonel%20T.%20E.%20Sandall%2C%20Etc&amp;rft.place=pp.%20vi.%20221.%20Basil%20Blackwell%3A%20Oxford%2C%201922&amp;rft.aufirst=Thomas%20Edward&amp;rft.aulast=Sandall&amp;rft.au=Thomas%20Edward%20Sandall&amp;rft.pages=8"></span></li>
<li>Ray Siemens, John Unsworth, and Susan Schreibman, <span style="font-style:italic;">Companion to Digital Humanities (Blackwell Companions to Literature and Culture)</span> (Blackwell Publishing Professional: Oxford, December 2004). <span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_id=urn%3Aisbn%3A1405103213&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=book&amp;rft.btitle=Companion%20to%20Digital%20Humanities%20(Blackwell%20Companions%20to%20Literature%20and%20Culture)&amp;rft.place=Oxford&amp;rft.publisher=Blackwell%20Publishing%20Professional&amp;rft.edition=Hardcover&amp;rft.series=Blackwell%20Companions%20to%20Literature%20and%20Culture&amp;rft.aufirst=Ray&amp;rft.aulast=Siemens&amp;rft.au=Ray%20Siemens&amp;rft.au=John%20Unsworth&amp;rft.au=Susan%20Schreibman&amp;rft.date=2004-12-12&amp;rft.isbn=1405103213"></span></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/01/10/digital-history-planning/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

