Digital History Projects: OCR
Now that I’ve got all the theoretical agonising out of the way, I can actually do something about digitizing the text. This week I’m carrying out OCR and proofreading on the text of Sandall’s History of 5th Battalion the Lincolnshire Regiment. As soon as I got to work I encountered issues that I hadn’t thought of, and found that subjective decisions had to be made even earlier than I’d anticipated. This just shows that the only way to learn how to do something is to do it.
Yesterday I started by preparing the image files for OCR. I found it convenient to scan two facing pages at once, but for OCR and proofing it’s more convenient to have a single page per image, and to remove all the plates. This isn’t too difficult to achieve with batch processing. I started with Microsoft Office Picture Manager. I used it to work out where to crop the images, but couldn’t use it to do the cropping because its batch processing features seem to break if you select too many images! No problem, because Irfanview is better anyway. I discovered that a few of the page images towards the end of the book had come out slightly larger than the rest, which meant they had to be done in a separate batch. I also had to run separate batches for rectos and versos as I couldn’t find a way to split one file into two.
Once I had a separate image for each page, I rearranged them, deleted the plates, and used Irfanview to do a batch rename, so that the file name of each image matches the page number. The first attempt had to be redone because I forgot to delete the blank pages on the reverse of the plates, which put the page numbers out of sequence! Prefatory material was kept in a separate directory as it’s numbered with roman numerals and outside the main sequence of page numbers. I also separated Teall’s epilogue (which I still don’t know whether I’m going to publish) and the end matter (medal lists and index, which are formatted differently from the main text). Preparing the images took about an hour, but would be quicker if I did it again with the benefit of this experience.
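For anyone who would rather script the rename than use Irfanview, here is a minimal sketch of the same idea. It only builds a rename plan from a list of file names that are already in reading order, with plates and blank pages removed. The function names, the .tif extension and the three-digit zero padding are my own assumptions for illustration, not what Irfanview does.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def rename_plan(names, first_page=1, roman=False, ext=".tif"):
    """Pair each file name (already in reading order) with a page-number name.

    Hypothetical helper: nothing is renamed here, so the plan can be
    checked by eye before any files are touched.
    """
    plan = []
    for offset, name in enumerate(names):
        page = first_page + offset
        label = to_roman(page) if roman else f"{page:03d}"
        plan.append((name, label + ext))
    return plan


# Main text numbered in arabic numerals, prefatory pages in roman numerals.
print(rename_plan(["scan_a.tif", "scan_b.tif"]))
print(rename_plan(["pre_a.tif", "pre_b.tif"], roman=True))
```

Checking the plan before applying it would have caught my mistake with the blank versos, since the last file name would not have matched the last page number in the book.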
With all the images ready, I installed the 15-day trial version of ABBYY FineReader Pro 8.0. My first impression is that it’s pretty good, and if I do any more similar projects I’ll probably buy a license for this rather than trying to find a free alternative. The Distributed Proofreaders site suggests that you usually get what you pay for with OCR software. I spent some time finding my way around the software, reading the help and tutorials, and trying a few test pages. Once I was ready to go, I created a new batch for each of the four sections: prefatory (6 pages), main (196 pages), teall (10 pages), and end (17 pages). For the main batch of 196 pages, importing the images took about 3 minutes and the OCR reading took about 20 minutes. Mostly the structure of the text was detected and converted into boxes fairly accurately. A few pages had a more complicated layout (especially in the appendix, with its table of medals awarded) and there I had to redraw the boxes manually. Once OCR reading was complete I saved all the pages as text files for future comparison with the corrected text, so that I can see roughly how accurate uncorrected OCR is.
The next step was an automated spell check. FineReader flags all uncertain characters and non-dictionary words, then steps through them and asks for user input. I found the spellcheck window too small, but it doesn’t have to be used as you can position it so that you can see the text in the main panes. There are three main panes: an image of the whole page, showing boxes; a close-up of part of the page image; and the digitized text. These can all be resized and zoomed to suit the user. As I stepped through the checks I was able to spot various problems with the OCR and had to make decisions about how to correct things. I wrote style guidelines as I went, adding each new feature or issue as I encountered it. I already had a good idea of how to handle things — ideally keep the sequence of characters the same as in the book — but no plan survives contact with the enemy. Even at the stage of trying to match a sequence of characters to a finite set of characters there are subjective decisions to be made. This is a summary of the guidelines I came up with:
Delete running heads and page numbers. This is how PGDP does it. I don’t necessarily agree with everything they do, but in this case I don’t see the headers and footers being any use to the target audience and they are more likely to get in the way. Pages will be marked with the TEI pb element anyway. Anyone who wants to see the original layout can look at the images. So far I haven’t been deleting heads or numbers in the spell-check phase. I’m hoping it can be done automatically (using tools from PGDP) but if it can’t it will have to be done during manual proofing.
Keep line breaks. They will eventually be marked up with TEI lb element, provided that it can be done automatically. If it can’t then I won’t bother. The main exception is the index, where I’ve decided to keep each entry on one line. This will probably be marked up as a list or table, so line breaks will be irrelevant and possibly counter-productive.
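To get a feel for whether the pb and lb markup could be added automatically, here is a rough sketch that takes one page of proofed plain text and puts a pb element at the top and an lb element at the start of each line. The function name and the choice to leave blank lines untouched are my own assumptions; it is not a PGDP tool and it doesn’t attempt to strip running heads.

```python
from xml.sax.saxutils import escape, quoteattr


def page_to_tei(page_number, text):
    """Mark one page of plain text with TEI pb and lb milestones.

    Hypothetical helper: <pb n="..."/> opens the page and <lb/> is put
    at the start of every non-blank line. Blank lines are kept as they are.
    """
    out = [f"<pb n={quoteattr(str(page_number))}/>"]
    for line in text.splitlines():
        if line.strip():
            # escape() protects &, < and > so the result stays well-formed.
            out.append("<lb/>" + escape(line.rstrip()))
        else:
            out.append("")
    return "\n".join(out)


print(page_to_tei(12, "A & B Companies moved\nup to the line."))
```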
Keep soft hyphens. FineReader marks possible soft hyphens as ¬ and often flags them as uncertain characters. It should be easy to replace the character with an entity reference. Sometimes it might not be clear whether a hyphen is soft or hard (eg Head-quarters). In these cases I can look for the same word elsewhere in the text. If the word only occurs once, then I can at least make an arbitrary decision without worrying about consistency.
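Both halves of this guideline are easy to sketch in code. The first function swaps the ¬ marker for an entity reference (I have assumed &shy; here, which would need to be declared for XML). The second counts how often a word appears hyphenated or joined in the middle of a line elsewhere in the text, which is the evidence I’d use to decide a doubtful case. Both function names are hypothetical.

```python
import re

SOFT_HYPHEN_MARK = "\u00ac"  # the ¬ character FineReader uses for possible soft hyphens
ENTITY = "&shy;"             # assumed entity reference, not confirmed by any style guide


def mark_soft_hyphens(text: str) -> str:
    """Replace every ¬ marker with the entity reference."""
    return text.replace(SOFT_HYPHEN_MARK, ENTITY)


def hyphen_evidence(text: str, left: str, right: str):
    """Count mid-line 'left-right' and 'leftright' spellings, ignoring case.

    A word broken across a line end has a newline after the hyphen,
    so it is not counted either way.
    """
    hyphenated = re.findall(
        rf"\b{re.escape(left)}-{re.escape(right)}\b", text, re.IGNORECASE)
    joined = re.findall(
        rf"\b{re.escape(left + right)}\b", text, re.IGNORECASE)
    return len(hyphenated), len(joined)


sample = "Battalion Head¬\nquarters moved. Head-quarters was shelled."
print(hyphen_evidence(sample, "head", "quarters"))
print(mark_soft_hyphens(sample))
```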
Dashes will be replaced by entity references, except in the index where a hyphen stands for “ditto”, in which case it will be replaced by [do] to indicate how tables or lists should be structured. During the spell-check I’m leaving them as they are.
Accented characters aren’t being picked up by FineReader as it doesn’t seem to have foreign dictionaries installed. This is something I can fix for future projects. As it is, most accented characters are being flagged as uncertain, so I can add them manually using keyboard shortcuts or character map. I’m still undecided about whether to leave them as Unicode characters or convert them to entity references (I’ve read conflicting advice on this) but conversion should be trivial using Find and Replace.
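If I do decide to convert, the change can be made in either direction without Find and Replace. The sketch below uses numeric character references rather than named entities, which is my own assumption about what the final markup will want; the function names are hypothetical.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_char_refs(text: str) -> str:
    """Turn character references back into Unicode characters."""
    return html.unescape(text)


print(to_char_refs("Béthune"))
print(from_char_refs("B&#233;thune"))
```

Because the conversion is reversible, the decision can be put off until markup without losing anything.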
Mistakes in the text are to be preserved. I’ve marked them with [sic]. This wouldn’t be appropriate for an edited text which already has its own [sic]s but I can get away with it here. The important principle is to flag the mistakes in a distinctive way which can be separated from the original text.
Roman numerals are mostly to be transcribed as they are, but there’s one exception. Sandall consistently uses I instead of 1. This means that my ideal of preserving the original sequence of characters is completely untenable. It would be very inconvenient for me and for users to preserve this idiosyncrasy. Since the emphasis is on accessible and useful empirical data, it had to go. Fortunately most occurrences are flagged as uncertain characters since FineReader is commendably cautious about l and 1. The necessary correction was obvious where the roman I was mixed with Arabic numerals (eg I9I7). Where I is on its own, it could often be decided from context (eg 1st, 1/5th, but I Corps). So far I haven’t encountered a truly intractable case, but there’s at least the possibility of checking against usage elsewhere in the text.
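The obvious cases could be handled by a few regular expressions, leaving only the ambiguous ones for manual checking. This is a sketch under my own assumptions about which patterns are safe: a capital I mixed with digits, a lone Ist, and an I directly before a slash and a digit. A standalone I, as in I Corps, is left alone. The function name is hypothetical.

```python
import re

# A run of capital I and digits containing at least one digit,
# with an optional ordinal suffix: I9I7, 2Ist.
MIXED_RUN = re.compile(r"\b[I\d]*\d[I\d]*(?:st|nd|rd|th)?\b")


def fix_ones(text: str) -> str:
    """Replace capital I with 1 where the context makes the digit certain."""
    text = MIXED_RUN.sub(lambda m: m.group().replace("I", "1"), text)
    text = re.sub(r"\bIst\b", "1st", text)   # Ist -> 1st
    text = re.sub(r"\bI(?=/\d)", "1", text)  # I/5th -> 1/5th
    return text


print(fix_ones("In I9I7 the Ist and I/5th joined I Corps on the 2Ist."))
```

Forms with no digit at all (IIth for 11th, for example) are not touched, so they would still need to be caught during proofing.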
Abbreviations are another headache. Again I wanted to preserve the original sequence of characters but am going to have to compromise over hyphens and points in abbreviations. I’ve decided to keep points but lose some hyphens.
Where the abbreviation is a series of single initial letters they can be separated by points but with no spaces or hyphens between them. So Q.M.S. not Q.-M.-S. or QMS or Q. M. S. Sandall sometimes has hyphens but sometimes doesn’t. It seems to depend on the rank in question (Q.-M.-S. but R.S.M.). I’ve decided to remove hyphens where they occur.
Where the abbreviation has one or more groups of letters from the same word, all points and hyphens are to be kept. So Lt.-Col., L.-Cpl., Maj.-Gen.
Some abbreviations don’t have points after every part, or use a slash instead of a hyphen (eg L/Cpl). In these cases they’re to be kept as they are. There are other abbreviations which have a space but no point or hyphen (A Coy, B Coy). These are also kept as they are (there is room for confusion here: a company or A company? I prefer to leave it as it is).
This sounds more confident than it actually is. I’m still compromising between preservation of original text and ease of use. Having tried some experiments with Google, it doesn’t seem to worry about points in abbreviations. A search for “ramc” seems to return the same results as a search for “r.a.m.c.”, but “r.-a.-m.-c.” confuses it. The Find function in Firefox isn’t so forgiving and will consider an abbreviation with points to be a different string from the same letters without points. If I want the best of both worlds, I could try marking the points up in XML and using AJAX to enable users to turn them on and off, but that might well be more trouble than it’s worth. Most of these abbreviations are ranks which will be marked up as part of personal names, or corps which will be marked up as organisation names, both with regularised forms in the attributes (I’m going to have to draw up a list of regularised ranks, corps, and formations). I haven’t been formatting abbreviations during the spell-check as my aim there is just to resolve issues flagged by FineReader and get the text closer to the original. I’ll be checking abbreviations later during manual proofreading, or using Find to track them down.
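The hyphen rule for abbreviations is mechanical enough to sketch. The first function removes a hyphen only when it sits between two pointed single initials, so Q.-M.-S. loses its hyphens while Lt.-Col., L.-Cpl. and L/Cpl are untouched. The second builds a point-free, lower-case key of the kind a search might compare against. Both names are hypothetical, and the key is only one possible regularised form.

```python
import re

# A hyphen preceded by a single capital and a point, and followed by
# another single capital and a point: the hyphens in Q.-M.-S.
INITIAL_HYPHEN = re.compile(r"(?<=\b[A-Z]\.)-(?=[A-Z]\.)")


def drop_initial_hyphens(text: str) -> str:
    """Remove hyphens between single pointed initials, keeping all others."""
    return INITIAL_HYPHEN.sub("", text)


def search_key(abbreviation: str) -> str:
    """Strip points, hyphens and spaces and lower-case the rest."""
    return re.sub(r"[.\-\s]", "", abbreviation).lower()


print(drop_initial_hyphens("Q.-M.-S. Smith, Lt.-Col. Jones, L.-Cpl. Brown"))
print(search_key("R.A.M.C."), search_key("RAMC"))
```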
Fractions have caused problems for FineReader and usually come out as junk, but as they’re flagged as uncertain it’s easy to spot them. At this stage I’ve rendered them in PGDP style (1-1/2). During markup they’ll be marked up as numbers.
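As a sketch of how the PGDP-style fractions might later be turned into markup, the function below wraps each mixed number in a TEI num element with a decimal value attribute. The function name, the decimal form of the value, and the restriction to the whole-hyphen-fraction pattern are all my own assumptions.

```python
import re
from fractions import Fraction

# PGDP-style mixed number: whole part, hyphen, numerator, slash, denominator.
MIXED_NUMBER = re.compile(r"\b(\d+)-(\d+)/(\d+)\b")


def tag_fractions(text: str) -> str:
    """Wrap mixed numbers such as 1-1/2 in <num value="..."> elements."""
    def wrap(match):
        whole, numerator, denominator = (int(g) for g in match.groups())
        if denominator == 0:
            return match.group(0)  # leave OCR junk for manual checking
        value = float(whole + Fraction(numerator, denominator))
        return f'<num value="{value}">{match.group(0)}</num>'
    return MIXED_NUMBER.sub(wrap, text)


print(tag_fractions("marched 1-1/2 miles"))
```

Unit designations such as 1/5th don’t fit the pattern, so they are left alone.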
Another unexpected problem was the use of a large brace (ie } but bigger) to combine two rows of tabular data. My solution so far is to flag them with }} at the end of each row. Later they’ll be marked up as tables, with reference to the images to make sure the right hand column spans the correct rows.
Small capitals are another thing which I didn’t consider worth keeping. In the medal table in the appendix all names and ranks are in small caps. FineReader renders them as small caps in Rich Text but converts them to mixed letters when it exports to plain text. I think this is satisfactory and I can’t see much point marking them up as small caps.
Columns are to be marked up with the TEI element cb.
There was also some junk to be removed caused by shadows around the edges of the page images, but this wasn’t a major problem. It could be eliminated by cropping out the edges of the images.
Stepping through the automated checks for the main batch (196 pages) took 2 hours 50 minutes (about 1.15 pages per minute). In contrast, the end matter (only 17 pages) took 1 hour 30 minutes (about 0.19 pages per minute). This was mainly because of the smaller type, the more complicated structure of the appendices and index, and the greater number of numerals (every occurrence of “1” was flagged as uncertain). At the end of this stage I saved another plain text copy of the pages in a different directory, again for comparison purposes. Once I get some software that can automatically calculate the differences between the two sets of files I’ll have a clearer idea of the accuracy of the basic OCR and the improvements that can be made by automated checks.
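A rough version of that comparison doesn’t need special software. The sketch below uses the difflib module from Python’s standard library to score how similar each raw OCR file is to its corrected counterpart, and also reproduces the pages-per-minute arithmetic above. The function names, the .txt pattern and the UTF-8 encoding are my own assumptions, and the similarity ratio is only a stand-in for a proper character accuracy figure.

```python
import difflib
from pathlib import Path


def similarity(raw: str, corrected: str) -> float:
    """Return a 0 to 1 similarity ratio between two versions of a page."""
    matcher = difflib.SequenceMatcher(None, raw, corrected, autojunk=False)
    return matcher.ratio()


def compare_dirs(raw_dir, corrected_dir, pattern="*.txt"):
    """Score every file in raw_dir against the same-named file in corrected_dir."""
    scores = {}
    for raw_path in sorted(Path(raw_dir).glob(pattern)):
        corrected_path = Path(corrected_dir) / raw_path.name
        if corrected_path.exists():
            scores[raw_path.name] = similarity(
                raw_path.read_text(encoding="utf-8"),
                corrected_path.read_text(encoding="utf-8"))
    return scores


def pages_per_minute(pages: int, hours: int, minutes: int) -> float:
    """Proofing rate for a batch."""
    return pages / (hours * 60 + minutes)


print(round(pages_per_minute(196, 2, 50), 2))  # main batch
print(round(pages_per_minute(17, 1, 30), 2))   # end matter
```

Pages with the lowest scores would be the ones where the automated checks changed the most, which is a reasonable first guide to where the raw OCR struggled.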
My impression so far is that FineReader is reasonably accurate, and that it consistently flags uncertain letters. I spotted a few unflagged scannos as I was stepping through and might find more during manual proofing, but I’m generally pleased with what the software can do. So far I’ve only been using standard settings and haven’t explored the training features. If I was only making a digital text for my own personal use rather than for publication I think I’d be happy enough to run FineReader over it and rely on the automated checks without any manual proofing. It wouldn’t be perfectly accurate or consistent, but it would give me a useable text with many advantages over a printed book.
The next stage is manual proofreading to try to pick up anything that FineReader missed and to enforce consistent style where my style guidelines demand changes from the original text. Before I do that I’ll be thinking more carefully about my style choices and about any other potential issues that need to be resolved. I’ll also be looking at TEI in more detail to make sure I don’t do anything at the proofing stage which makes markup more difficult. I was hoping to break the process down into independent stages, but I’m starting to realise things are more interrelated than abstract models allow.
