Proofreading

[posted by Gavin Robinson, 12:01 pm, 23 February 2007]

In my last project update I described how I used FineReader to OCR the text of Sandall’s History of 5th Lincolnshire Regiment. Since then I’ve manually proofread the text and inserted some basic XML markup. Proofing and basic tagging have given me a more detailed understanding of the text and the features in it, and I’ve been noting potential issues as I go. I’ll post more about how I’m using XML later, but this post is a more detailed description of the process of proofreading.

I used FineReader for proofreading, as its interface makes it easy to compare digitized text with the original image. I had previously stepped through the uncertain characters and unrecognised words, so I wasn’t expecting major problems, but I still wanted to iron out any remaining scannos and impose a consistent style before I exported the text. In the post on OCR I said I was exporting copies of the text at each stage for comparison purposes. I’ve now decided to abandon this because I was trying to do two contradictory things at once: create a useable text (the main aim of the project) and produce a perfect rendition of the sequence of characters in the original text in order to assess the accuracy of FineReader’s OCR. I now realise that it isn’t possible or desirable to do both at once. For example, where the text uses ‘I’ for ‘1’ the two objectives can’t be reconciled. If FineReader renders it as ‘I’ that’s a correct rendition, but in order to create a useable text I need to change it to ‘1’. If I wanted to quantify the accuracy of the OCR I’d have to do it separately from creating a useable text, but I don’t think the effort would be justified. I’ve already got enough of a feel for FineReader to know that it’s good enough for what I want to do but that some manual proofing will always be necessary to create a releasable text.

I decided not to carefully compare the text and image line by line as this would take too long, and I wasn’t expecting enough scannos to make it worthwhile. Instead I just read through the digitized text, slowly and carefully, but in Distributed Proofreaders terms it was more like smooth reading than proofing. Since I had never read the whole book before it was unfamiliar enough to make this a viable approach. However, I made a point of double checking every number and every proper noun which I wasn’t familiar with. This slowed down the reading, and rapidly switching my attention from the text in the lower pane to the image in the upper pane and back again gave me eye strain! If I was scaling the project up, this is the kind of work which I could get volunteers to help with, as it doesn’t require a great deal of technical skill (apart from knowledge of spelling and punctuation) and can be split into small chunks, as at PGDP. I also checked every end of line hyphen to make sure it was correctly marked as soft or hard. FineReader had already got a lot of these right, but there were some cases where it was obviously wrong, or which were ambiguous. These ambiguities could only be resolved by looking for occurrences of the same word elsewhere in the text. As I went I built up a hyphenation list, showing whether certain words should have a hard hyphen or not. For example, ‘headquarters’ didn’t but ‘out-buildings’ did. I was aiming for consistent usage according to the original text, regardless of what might now be considered “correct”.
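The hyphenation check described above could in principle be partly automated. The following is only a rough sketch in Python, not what I actually did: it assumes the text is available as a plain string with end-of-line hyphens still written as an ordinary hyphen followed by a newline, and the function name classify_line_end_hyphens is made up for illustration. It looks for the same word elsewhere in the text, in the middle of a line, and reports whether it appears there with or without a hyphen.

```python
import re

def classify_line_end_hyphens(text):
    """Classify each end-of-line hyphen as 'hard', 'soft' or 'ambiguous'.

    Hypothetical helper: the verdict rests only on how the same word is
    written elsewhere in the text, away from a line end.
    """
    results = {}
    # A candidate is a word fragment, a hyphen at the very end of a line,
    # and the continuation at the start of the next line.
    for m in re.finditer(r"\b([A-Za-z]+)-\n([A-Za-z]+)\b", text):
        first, second = m.group(1), m.group(2)
        joined = re.search(
            r"\b" + re.escape(first + second) + r"\b", text, re.IGNORECASE
        )
        hyphenated = re.search(
            r"\b" + re.escape(first) + "-" + re.escape(second) + r"\b",
            text,
            re.IGNORECASE,
        )
        if hyphenated and not joined:
            verdict = "hard"       # only ever seen with a hyphen
        elif joined and not hyphenated:
            verdict = "soft"       # only ever seen as one word
        else:
            verdict = "ambiguous"  # seen both ways, or nowhere else
        results[(first + "-" + second).lower()] = verdict
    return results
```

Anything reported as ambiguous would still need a human decision, so this would only shorten the hyphenation list work, not replace it.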

Before I started reading I spent about 15 minutes manually removing running heads from the tops of the pages. I could have done it slightly quicker using Find and Replace, but that wouldn’t have saved a great deal of time because the book has over 20 chapters which are all quite short. In future I’d recommend adjusting the text capture boxes to cut out all extraneous matter. This includes junk from around the edges of the pages as well as headers and footers.
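If the running heads had survived into an exported text file, they could be stripped with a short script instead. This is a hedged sketch only: it assumes each page is available as a separate string, that the running head is the first non-blank line of the page, and that the set of possible heads is known in advance. The function name strip_running_head and the example head used in the usage lines are invented for illustration and are not taken from the book.

```python
import re

def strip_running_head(page, known_heads):
    """Remove the first non-blank line of a page if it is a known running head.

    known_heads is a set of upper-case strings (hypothetical values).
    Page numbers, spaces and full points around the head are ignored
    when comparing.
    """
    lines = page.split("\n")
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # skip blank lines at the top of the page
        core = re.sub(r"^[\d\s.]+|[\d\s.]+$", "", line).upper()
        if core in known_heads:
            del lines[i]
        break  # only the first non-blank line is ever considered
    return "\n".join(lines)

# Usage with a made-up running head:
heads = {"HISTORY OF THE BATTALION"}
page = "12 HISTORY OF THE BATTALION\nThe battalion marched at dawn."
cleaned = strip_running_head(page, heads)
```

Comparing against a fixed list of heads, rather than deleting the first line of every page, avoids losing body text on pages such as chapter openings that have no running head.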

It took about 8 hours to read through the whole text, including front and back matter. I found very few scannos. The most common one was where the original text rendered ‘11th’ as ‘IIth’ and FineReader thought it was ‘nth’. Many of these had been flagged as uncertain but a few had slipped through. There were a few obviously wrong letters but generally everything made sense in context. Some stealth scannos could have slipped through, but I’m not prepared to put in the effort to find them. The perfect is the enemy of the good. Double checking names and numbers which were not self-evidently wrong turned out to be more trouble than it was worth, as errors here were negligible. Most of the changes I made during proofing were actually formatting. As well as checking hyphens I had to make sure that abbreviations conformed to my style guidelines. In future I must make sure that FineReader isn’t automatically inserting spaces after full points, as I spent a lot of time correcting this, especially with all the corps and rank abbreviations in a book like this. Obvious typographic errors in the original text were marked with [sic], but I didn’t find many.
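Two of the problems mentioned above are regular enough that a script could at least flag them for a human to check. The sketch below is an illustration under assumptions, not part of my actual workflow: it assumes the text has been exported to a plain string, that every standalone ‘nth’ is suspect, and that abbreviations are meant to be closed up (so ‘C. O.’ is suspect but ‘N.C.O.’ is not). The function name flag_suspects is invented. It only reports candidates and changes nothing.

```python
import re

# A standalone 'nth', the misreading of 'IIth' described above.
NTH = re.compile(r"\bnth\b")
# Capital initials separated by full point plus space, e.g. 'C. O.'.
SPACED_ABBREV = re.compile(r"\b(?:[A-Z]\. )+[A-Z]\.")

def flag_suspects(text):
    """Return (line number, column, kind, matched text) for each suspect."""
    flags = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in NTH.finditer(line):
            flags.append((lineno, m.start(), "nth", m.group()))
        for m in SPACED_ABBREV.finditer(line):
            flags.append((lineno, m.start(), "spaced abbreviation", m.group()))
    return flags
```

The second pattern will also catch genuine initials in personal names, which is why this flags rather than fixes.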

Once proofing was finished I exported each batch to a UTF-8 text file. Although FineReader marks soft hyphens as ¬ to distinguish them from hard hyphens, it converts them all to - when exporting them to text, which is slightly annoying. Because of this I ran Find and Replace over each batch before exporting, to replace every ¬ with the entity reference &shy;. For my purposes I could leave it like this, but if I was creating individual page files for other people to proof through the DP interface I’d want to convert them back to ¬ after the export, which is easily done. FineReader has an option to insert a page break character at the end of every page when exporting a batch to a single file. This would be more useful if you could choose the character to insert, but as it is the extra step needed to replace the page break character with the TEI <pb/> element isn’t too much trouble.
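The post-export steps are simple string replacements, so they could be scripted along these lines. This is a minimal sketch under one important assumption: that the page break character FineReader inserts is a form feed (U+000C), which I haven’t confirmed here, so it is kept as a parameter. The function names are made up for illustration.

```python
PAGE_BREAK = "\f"  # assumption: the exported page break is a form feed

def mark_page_breaks(text, page_break=PAGE_BREAK):
    """Replace each page break character with the TEI <pb/> element."""
    return text.replace(page_break, "<pb/>")

def shy_to_not_sign(text):
    """Convert &shy; entity references back to the not sign (U+00AC)."""
    return text.replace("&shy;", "\u00ac")

def split_pages(text, page_break=PAGE_BREAK):
    """Split a single exported batch into one string per page."""
    return text.split(page_break)

# Usage on a tiny made-up sample of two pages:
sample = "first page head&shy;\nquarters\fsecond page"
tei_ready = mark_page_breaks(sample)
dp_pages = [shy_to_not_sign(p) for p in split_pages(sample)]
```

Keeping the two conversions as separate functions means the same exported file could feed either the XML markup stage or a set of individual page files for DP-style proofing.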

With each batch exported to a single file I just had to tidy them up a bit and then I could start marking them up with XML tags. More about that in the next post.
