Digital History Projects: Planning

In my New Year post, I mentioned that I’m thinking about carrying out a couple of digital history projects in connection with my First World War research. These projects are very small and should be relatively easy to carry out on my own, but there will almost certainly be challenges. Overcoming these will give me more experience of carrying out a digital history project (this is starting to sound like a job application again!), and produce useful resources. After that, I can move on to consider some more advanced issues, such as collaborating with other people, and dealing with seventeenth-century manuscripts. To make the experience even more useful, I’m trying to blog it as I go. This post is an outline of my plans so far. Now that I’ve published my plans I’ll have to carry them out!

Background:

Before starting any digital history project, you should read Digital History by Dan Cohen and Roy Rosenzweig, which is available online free of charge. It’s an easy-to-understand introduction that covers different approaches to digital history, costs, benefits, and potential problems, and it will generally help you to think clearly about what you want to achieve and how you can achieve it. The authors ask aspiring digital historians to get the right balance between caution and risk-taking. While a certain amount of planning is always necessary, there is also a risk that academic historians will spend too much time thinking rather than doing.

Having read the book and thought about the potential problems to be overcome, I think I’m in a good position to put my IT experience to use. I’ve been running my band’s website since April 2000, so I’m a veteran HTML coder. There have been many changes in web technology in that time, and I’ve always kept up with new developments while being careful not to jump on the latest bandwagon too soon. I have a good knowledge of (X)HTML, CSS, PHP, MySQL, image editing, Apache server management, and web accessibility issues.

I’ve studied the Old Bailey Proceedings, a large collection of digitized source material marked up with XML, got acquainted with the Text Encoding Initiative standards, and have experimented with XML myself. Recently I volunteered my services to Distributed Proofreaders, and have already proofread over 100 pages of text. I’ll be continuing with this, and once I’ve done 300 pages I can get experience of the next stages of the process. I’ll also be reading A Companion to Digital Humanities, another free online book.

I’m breaking my own rule that you should never rely on just one person when creating digital texts, but this is an experiment to see how much I can do on my own. I also want to avoid the added complications of managing a collaborative project. If things go well, I can start to look at ways of working as a team on unofficial and unfunded projects.

Project Outlines:

Project Sandall: This project will produce a digital edition of T. E. Sandall, A History of 5th Battalion Lincolnshire Regiment (Oxford, Blackwell, 1922). The text will be marked up with TEI compliant XML and published on the web. It will include a hyperlinked index of people, places, and organizations.

Project Wenham: This project will digitize and publish the correspondence of my great-grandfather, William A. Wenham, relating to his experiences as a prisoner of war during the First World War. The collection consists of a number of letters and postcards sent to his family from prison camps in Germany, some with photographs. The text will be transcribed and marked up with TEI compliant XML, and published on the web, along with background information written by me. There will be an index of people, and possibly places. Another optional extra will be selections of relevant documents from other sources, such as battalion war diaries.

Theory:

One of the great things about digitization projects is that you can avoid some of the more difficult theoretical controversies. The most relevant theory is information theory, and Shannon pointed out that meaning is irrelevant to information. The aim of text digitization and basic markup is to represent the information in the original document as accurately as possible, minimising the noise which can be introduced as part of the process (see my post on Historical Information and Noise). Once the text is published, individual users can decide what it means to them and how far, if at all, it relates to the reality of the past.

However, record linkage brings problems of meaning and epistemology into play. This is likely to be worse in early modern manuscripts, which are noisier to begin with. For example, the same name might be spelt several different ways. Dealing with twentieth-century sources is easier because of standardized spelling, but linkage still involves taking an epistemological position. One advantage of TEI XML is that it has a built-in mechanism for representing uncertainty and quantifying epistemic probabilities using the <certainty> element. Furthermore, XML preserves the original text. Editorial decisions made during the tagging process have little or no impact on the integrity of the text itself. If the XML code is made freely available to users under the GPL, they can download the source files and edit them according to their own needs. Therefore, users who don’t agree with an editorial decision can easily change it in their own local copy of the text.
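As a rough illustration of how an uncertain identification might be recorded, here is a minimal Python sketch that builds a tiny TEI-style fragment with the standard library. Everything specific in it is an assumption made up for the example: the sentence, the name “Smith”, the identifiers, and the degree of 0.7. The attribute names (target, locus, degree, ref) follow my reading of the TEI guidelines and should be checked against whichever version of TEI is actually used.

```python
import xml.etree.ElementTree as ET

# Namespace used for xml:id; ElementTree already knows the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def tag_uncertain_name(before, name_text, after, name_id, person_ref, degree):
    """Build <p> containing a <name> plus a <certainty> note pointing at it.

    All argument values are hypothetical; nothing here comes from the letters.
    """
    p = ET.Element("p")
    p.text = before
    name = ET.SubElement(p, "name", {"type": "person", "ref": person_ref})
    name.set(f"{{{XML_NS}}}id", name_id)
    name.text = name_text
    name.tail = after
    # The certainty element records how confident the editor is in the link.
    ET.SubElement(
        p,
        "certainty",
        {"target": f"#{name_id}", "locus": "value", "degree": str(degree)},
    )
    return p


p = tag_uncertain_name(
    "The letter mentions ", "Smith", " of B Company.", "n1", "#person_smith_01", 0.7
)
print(ET.tostring(p, encoding="unicode"))
```

The original wording (“Smith”) is left untouched inside the tag, so a reader who disagrees with the link only has to change or delete the ref and certainty values in their own copy.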

Copyright:

I believe that T. E. Sandall’s work is out of copyright in most of the world, since the book was published in 1922 and the author died in 1931, but I will be looking for more definite proof before I proceed with the project. The copyright status of the epilogue written by G. H. Teall is unknown, but it can be omitted if I can’t determine that it’s definitely out of copyright. The plates are more problematic and will probably be omitted.

William Wenham’s letters are more straightforward. There is no possibility of anyone outside the family having a claim on the copyright and no possibility of anyone being able to make money out of their publication. The status of the photographs taken at Cottbus is unknown. They might be out of copyright or they might be orphan works which are still under copyright. In this case I’m prepared to risk publishing them, but will remove them if a genuine copyright holder objects to their publication.

Image Capture:

I have a cheap A4 size flatbed scanner which gives adequate quality. The main limitation is that it’s slow, but the quantity of images to be captured is relatively small. A digital camera can capture images faster, but the quality of text is lower than scanning and would most likely lead to more OCR errors. Sandall will be scanned as 300dpi monochrome JPEGs. There is no need to create a high quality archival copy for preservation purposes. The main aim of the project is to make the text more accessible and add value in the form of XML markup and indexing. The family papers will be scanned as full colour 300dpi TIFFs which will be kept as archival copies. These masters can be used to create 100dpi JPEGs for web publication. It is intended to make images of the documents freely available in addition to the digital text.
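To get a feel for what these resolutions mean in practice, the following Python sketch works out pixel dimensions and uncompressed sizes. It assumes a full A4 page (210 x 297 mm, the most my scanner can take) and 24-bit colour; the real letters and postcards will mostly be smaller, and compressed files will be smaller still.

```python
MM_PER_INCH = 25.4


def pixels(length_mm, dpi):
    """Number of pixels needed to cover a length at a given resolution."""
    return round(length_mm / MM_PER_INCH * dpi)


def scan_dimensions(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page.

    bytes_per_pixel=3 assumes 24-bit colour; use 1 for 8-bit greyscale.
    """
    w = pixels(width_mm, dpi)
    h = pixels(height_mm, dpi)
    return w, h, w * h * bytes_per_pixel


master = scan_dimensions(210, 297, 300)  # archival TIFF at 300dpi
web = scan_dimensions(210, 297, 100)     # web JPEG at 100dpi
print(master)
print(web)
```

Under these assumptions a full A4 colour master is 2480 by 3508 pixels, roughly 26 MB before compression, while the 100dpi web copy has about one ninth as many pixels.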

Digitizing Text:

The scanned images of Sandall will be converted to digital text using OCR software. I have two free OCR packages: IRIS (which came free with a scanner) and Microsoft Office Document Imaging, but both have limited features and would be inadequate for this project. ABBYY FineReader Pro has more advanced features and is recommended by Distributed Proofreaders. For this project I can use the free trial version, and if I want to do more projects in the future I can buy a licence for under £100. PC Pro magazine considers OmniPage to be superior, but it costs around £400! Since no OCR is perfect, it will be necessary to proofread and correct the entire text. FineReader asks for user input when it encounters a doubtful character, and its interface should allow manual proofreading alongside that.

There is no possibility of using OCR for handwritten text, so the letters will have to be transcribed. Although we have the originals to work from, there might be some need to work from digital images. Apart from saving wear and tear on the documents, digital images are more flexible. Difficult text can be enlarged on screen, and contrast can be adjusted to bring out faded text. However, there is the added problem of viewing an image and typing at the same time, which might require specialised software. I’ve found that Zotero notes can be very useful for transcribing text from images or PDFs and might be adequate to start with. I could also use HTML/PHP/MySQL to cobble together something like the Distributed Proofreaders interface for my own use. The front end is a simple web based interface, and although I don’t know how their back end works, a local version just for my own use could be much simpler.

XML Markup:

XML can be daunting precisely because it’s so powerful and flexible: where do you start and how far do you go? Using TEI will give me a basic framework which covers the needs of the project and guarantees a certain amount of cross-compatibility. However, there are many different ways to implement TEI compliant XML and many subjective decisions to be made. At its most basic, TEI markup need not be much more complex than HTML, but at the other extreme there are provisions for extremely detailed and complex markup, such as stylistic analysis of the structure of sentences. I’ve decided to split the markup into three phases:

First: mark up the basic structure of the text, such as paragraphs, headings, and page numbers. It might be possible to do some of this automatically, but manual checking and correction will be necessary.

Second: tag names of people, places, and organizations. For the printed text of Sandall it might be possible to automate some of this using regular expressions and Find and Replace functions. Proper nouns can be expected to have capital letters, and ranks and common referents (he, officer, man, etc.) should be easy to detect. Again, manual checking and correction will be necessary. At this stage names will only be tagged to indicate that they are names. No assumptions about identity will be made until the third phase.

Third: link/index records by assigning ID numbers and/or regularized forms to the name tag attributes. At this point some subjective decisions might be necessary, and certainty will have to be evaluated and assigned.
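The three phases could be sketched in Python roughly as follows. This is only a toy version under assumptions of my own: the “[Page 1]” marker convention, the short list of rank abbreviations, the sample sentences, and the authority list with its “#p001” identifier are all invented for illustration, and the element and attribute names (div, pb, p, name, ref) would need checking against the TEI guidelines. It also assumes a page break never falls in the middle of a paragraph, which real text will not respect.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical conventions, not taken from the projects themselves.
PAGE_MARKER = re.compile(r"^\[Page (\d+)\]$")
RANKS = ["Pte", "Cpl", "Sgt", "Lieut", "Capt", "Major", "Col"]
NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(RANKS) + r")\.? (?:[A-Z]\. ?)*[A-Z][a-z]+"
)
AUTHORITY = {"Capt. J. Smith": "#p001", "Capt. Smith": "#p001"}


def add_structure(raw):
    """Phase 1: blank-line-separated blocks become <p>, page markers <pb/>."""
    div = ET.Element("div")
    for block in re.split(r"\n\s*\n", raw.strip()):
        block = " ".join(block.split())  # rejoin lines broken by the OCR layout
        match = PAGE_MARKER.match(block)
        if match:
            ET.SubElement(div, "pb", {"n": match.group(1)})
        elif block:
            ET.SubElement(div, "p").text = block
    return div


def tag_names(paragraph):
    """Phase 2: wrap rank-plus-surname matches in <name>, nothing more."""
    text = paragraph.text or ""
    last_end, previous = 0, None
    for match in NAME_PATTERN.finditer(text):
        before = text[last_end:match.start()]
        if previous is None:
            paragraph.text = before
        else:
            previous.tail = before
        previous = ET.SubElement(paragraph, "name")
        previous.text = match.group(0)
        last_end = match.end()
    if previous is not None:
        previous.tail = text[last_end:]


def link_names(root, authority):
    """Phase 3: add a ref to names in the authority list; return the rest."""
    unresolved = []
    for name in root.iter("name"):
        if name.text in authority:
            name.set("ref", authority[name.text])
        else:
            unresolved.append(name.text)
    return unresolved


raw = "[Page 1]\n\nCapt. J. Smith and Pte Jones\nwent up the line.\n\nCapt. Smith returned."
div = add_structure(raw)
for p in list(div.iter("p")):
    tag_names(p)
unresolved = link_names(div, AUTHORITY)
print(ET.tostring(div, encoding="unicode"))
print("Needs manual review:", unresolved)
```

Keeping the phases as separate functions mirrors the plan above: the second step only says “this is a name”, and the decision that two different spellings refer to the same person is confined to the lookup table used in the third. Anything the table doesn’t cover is returned for manual checking rather than guessed at.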

XML tagging can be done in any text editor, but a dedicated XML editor might be useful. I have previously used a free version of Altova XMLSpy for experimenting with XML, but some more advanced features might be necessary, so I’ve downloaded a trial version of oXygen. If it proves useful I can buy an academic licence for $48.

Web Publishing:

The documents and supporting material will be published in HTML or XHTML. All authoring will be done in XML, and XSLT will be used to transform the source code into other formats. For this reason there is no immediate need to decide whether to use HTML or XHTML, and in practice there is very little difference between XHTML and good HTML (anyone who has problems with the transition to XHTML has written bad HTML to start with!). The XML source files will also be made available so that users can download and manipulate them. Ideally the site would be published in XML only, with XSLT style sheets transforming the code to (X)HTML on the client side, but this would exclude older browsers which can’t handle XML.
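To show the kind of mapping a style sheet would perform, here is a small Python stand-in. It is not XSLT: it uses the standard library’s ElementTree to walk a TEI-style fragment and build the equivalent (X)HTML, which is the same idea expressed in a language I can test easily. The element names, the sample sentence, the “#p001” identifier and the index.html file name are all hypothetical.

```python
import xml.etree.ElementTree as ET


def tei_to_html(tei_elem):
    """Recursively map a few TEI-style elements to HTML equivalents."""
    if tei_elem.tag == "name" and tei_elem.get("ref"):
        # Linked names become hyperlinks into a (hypothetical) index page.
        html = ET.Element("a", {"href": "index.html" + tei_elem.get("ref")})
    elif tei_elem.tag in ("p", "div"):
        html = ET.Element(tei_elem.tag)
    else:
        # Anything unrecognised is kept, labelled with its original tag name.
        html = ET.Element("span", {"class": tei_elem.tag})
    html.text = tei_elem.text
    for child in tei_elem:
        new_child = tei_to_html(child)
        new_child.tail = child.tail  # keep the text that follows the element
        html.append(new_child)
    return html


source = ET.fromstring(
    '<div><p>Met <name ref="#p001">Capt. Smith</name> today.</p></div>'
)
html = tei_to_html(source)
print(ET.tostring(html, encoding="unicode"))
```

Because the XML source is never modified, the same files could later be run through a different transformation (plain text, another flavour of HTML) without re-editing the text.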

There will be little need for a database back-end as the sites will be small and the content will not be changed or added to frequently. A search engine which recognises XML fields (as at the OBP) would be useful, but as names will be linked to an index page, Google might be adequate for full text searches.

Publicity:

This is not much of an issue as the content will only appeal to a niche audience and the projects do not have to justify themselves to funding bodies. The main aim is to develop and demonstrate my digital history skills, so if I get it right it will be a useful demonstration of how to do it. The main avenues for promotion will be my blog, other people’s blogs, and the Great War forum.

Bibliography

  1. Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, And Presenting the Past on the Web (University of Pennsylvania Press, November 2005).
  2. Thomas Edward Sandall, A History of the 5th Batt. the Lincolnshire Regiment (Basil Blackwell: Oxford, 1922).
  3. Ray Siemens, John Unsworth, and Susan Schreibman, Companion to Digital Humanities (Blackwell Companions to Literature and Culture) (Blackwell Publishing Professional: Oxford, December 2004).

Digital History, History, Sandall 5th Lincs, World War 1, World War I On Web 2.0 — posted by Gavin Robinson, 7:30 pm, 10 January 2007
