XML Tagging: Phase 1
Having proofread and corrected the digital text captured from Sandall’s history of 1/5th Lincolnshire (corrected to an adequate standard anyway — I can’t claim that it’s perfect), I was ready to start inserting XML tags. The first phase of markup involves the use of TEI XML tags to describe the basic structure of the text. There was nothing too difficult here, and a lot of it could be done automatically rather than reading through the text and manually inserting tags at every feature. Before I started I had to decide which tags to use and where to use them, then make sure I applied them consistently. This post gives more details of the tags I used, what I used them for, and how I got them into the text with minimal effort.
I decided to use jEdit to manipulate the text files after I exported them because it’s easy to use but has powerful Find and Replace features (better than oXygen in many ways). As well as replacing page break symbols with the TEI <pb> tag, I used it to remove superfluous spaces and line breaks. I was also able to insert a lot of the basic TEI XML tags automatically using jEdit. In order to do this smoothly I had to devise a scheme to represent the structure of the text with line breaks and do some manual checking and adjustment to make sure it was consistently applied: 3 blank lines before a chapter heading, 2 blank lines after a chapter heading, and 1 blank line between paragraphs. FineReader was already set up to export text with 1 blank line separating paragraphs so I just had to adjust the chapter headings. This could be done during proofing by someone with no knowledge of XML. With the document set up like this all I had to do was use Find and Replace to insert <div>, <head>, and <p> tags. Since FineReader was set to preserve original line breaks, it was also trivial to insert TEI <lb/> elements at the start of each line. This all worked very well for the main text, only taking a few minutes. In future I could use a macro to automate a lot of the process, making it quicker and more reliable.
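The blank-line scheme above lends itself to scripting. What follows is only a rough sketch in Python, not the jEdit Find and Replace steps actually used. The function name markup_structure and the type="chapter" attribute value are assumptions made for illustration. It also assumes the text has already been cleaned so that the 3/2/1 blank-line pattern is applied consistently.

```python
import re
from xml.sax.saxutils import escape


def markup_structure(text: str) -> str:
    """Wrap chapters, headings and paragraphs using the blank-line scheme.

    Assumed input convention: 3 blank lines before a chapter heading,
    2 blank lines after it, 1 blank line between paragraphs.
    """
    text = text.strip("\n")
    out = []
    # Three blank lines means four consecutive newline characters.
    for chapter in re.split(r"\n{4}", text):
        # Two blank lines (three newlines) separate heading from body.
        heading, _, body = chapter.partition("\n\n\n")
        out.append('<div type="chapter">')  # attribute value is an assumption
        out.append(f"<head>{escape(heading.strip())}</head>")
        # One blank line (two newlines) separates paragraphs.
        for para in re.split(r"\n{2}", body.strip("\n")):
            if not para.strip():
                continue
            lines = [escape(line) for line in para.split("\n")]
            # <lb/> goes at the start of every line except the first
            # line of the paragraph.
            out.append("<p>" + "\n<lb/>".join(lines) + "</p>")
        out.append("</div>")
    return "\n".join(out)


sample = (
    "CHAPTER I\n\n\nFirst line\nsecond line\n\nNext para"
    "\n\n\n\nCHAPTER II\n\n\nOnly para"
)
result = markup_structure(sample)
print(result)
```

The sample string is invented purely to exercise the function. Any stray extra blank line in real input would be misread as a heading or chapter boundary, which is why the manual checking described above still matters.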
The structure of the Appendices was more complicated, but turned out to be easier than I thought. I didn’t need to use any tables as lists were sufficient to represent the structure adequately. With each entry separated by a line break, it was easy to find and replace line breaks with the opening and closing <item> tags. Every entry in the appendices and index was marked up as a single item. There was no need to separate the entries into items and labels (and this would also have involved more work). The medal lists consist of a man’s name and a date of award, which will be marked up as names (and their constituent parts) and dates in phase 2. I only used list markup for data which was formatted as a list or table in the original text. Inline lists in the main text were left as they are. They could be marked up as lists at a later date, but I can’t see much point to it. The main aim of the project is to index people, places, and dates. Anything which doesn’t add to that and doesn’t represent an important aspect of the structure or appearance of the original text (note that “important” is very subjective!) isn’t worth doing. I have added some tags which go beyond these aims, but in future I probably wouldn’t bother.
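Turning line-separated entries into list items is a simple transformation. As a hedged illustration in Python (the function name lines_to_list and the placeholder entries are made up, and this is not the jEdit procedure itself), it might look like this:

```python
from xml.sax.saxutils import escape


def lines_to_list(block: str) -> str:
    """Wrap each non-empty line of a block as an <item> inside one <list>."""
    entries = [line.strip() for line in block.splitlines() if line.strip()]
    items = "\n".join(f"<item>{escape(entry)}</item>" for entry in entries)
    return f"<list>\n{items}\n</list>"


# Placeholder entries, not taken from the appendices.
appendix_block = "First entry 01/02/17\n\nSecond entry & note 03/04/18\n"
list_xml = lines_to_list(appendix_block)
print(list_xml)
```

Each entry stays a single item with no label, matching the decision described above. Characters such as "&" are escaped so that the result remains well-formed XML.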
The next stage was to put the text into a valid TEI document. For this I moved onto oXygen, which looks immensely powerful but also quite complicated, so it might take me a while to get used to it. oXygen comes with templates that make it easy to create a TEI document in a couple of clicks without requiring any knowledge of the DTD (although it might need some manual adjustments, depending on which optional tag sets you need). With an empty TEI document with all the mandatory tags for headers etc already present, I just had to paste the main text into the body, and paste in the front and back matter. I manually added some more tags to describe quotes and distinct language. This was easy enough as I just had to search for double quote marks. This part of the project has gone surprisingly smoothly so far. I now have a TEI XML file with the chapters, headings, and paragraphs of the main text all marked up.
I’ve also added a minimal TEI header with basic publication and source information. The release version will have a more detailed header, but I might put a pre-release version on the web before I start phase 2. It would be nice to have some tangible proof of progress, and it would also be useful for other people to have access to a digital version of the text even before I add more advanced features. I need to look into XSLT in more detail so I can get the transform exactly how I want it. oXygen comes with a standard TEI style sheet which is useful for quick transforms to HTML or PDF, but I need to make some adjustments. At this stage, changing the parameters in oXygen might be enough, but after phase 3 I’ll probably need to add some code of my own to handle the links between names.
Tag List:
These are the tags I’ve used in phase 1. Numbers refer to the relevant section of the TEI guidelines.
6.1 Paragraphs
<p> used to mark up paragraphs. This was unproblematic since paragraphs in the source text are clearly marked with indents. I had to do some manual checking for paragraphs which start at the top of a page as these hadn’t been marked by FineReader.
6.3 Highlighting and Quotation
All the tags in this section were used to interpret text surrounded by double quotes. I decided not to represent italic text. It is rarely used in the source text: only the names of ships (which will be marked as names anyway; the rend attribute could be used to represent the italics here, but isn’t absolutely necessary) and a few Latin words, mainly “via”, which are well known enough to be classed as part of English.
<foreign> used for quoted words which are apparently quoted because they are foreign; italicised foreign words have been ignored. I wasn’t expecting to use this much, and in fact there was only one case where it was even remotely justified: “wagons” to refer to French transport, probably double quoted because it was a foreign spelling (Sandall usually prefers “waggons”). I added the attribute lang="fr" to indicate that this was a French spelling. This led to having to declare language entities and reference a writing system declaration, which was quite complicated as the guidelines don’t explain it very clearly, but I worked it out with a bit of trial and error.
<distinct> used for words which are double quoted apparently because they are army slang or technical terms, e.g. “Whizz-bangs”, “moppers up”. <distinct> offers more scope to describe the language than <term>. In all cases I added the attributes time="1914-1918" and social="british army". This tag was used 26 times.
<term> and <gloss> used only where a word is double quoted and the author also offers a definition. All of these words were also marked as distinct, but not all distinct words have glosses. For example:
From 11.30 a.m. to 4.30 p.m., a desultory bombardment with H.E. of various calibres, together with <distinct time="1914-1918" social="british army">Whizz-bangs</distinct>, and <term id="term4"><distinct time="1914-1918" social="british army">Woolly-bears</distinct></term> (<gloss target="term4">heavy shrapnel</gloss>) was directed <lb/>on our trenches;
This pair of tags was only used 4 times.
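Because each <gloss> points at its <term> through a target attribute, it is worth checking that every target matches an existing id. The following is a small Python sketch of such a check, assuming the elements carry no XML namespace (as in the example above) and using a hypothetical function name, unresolved_glosses:

```python
import xml.etree.ElementTree as ET


def unresolved_glosses(xml_text: str) -> list:
    """Return the target values of <gloss> elements with no matching <term id>."""
    root = ET.fromstring(xml_text)
    # Collect every id declared on a <term> element.
    term_ids = {t.get("id") for t in root.iter("term") if t.get("id")}
    # Report any gloss whose target is missing or points at no known term.
    return [g.get("target") for g in root.iter("gloss")
            if g.get("target") not in term_ids]


# Illustrative fragment; "term9" is deliberately left dangling.
fragment = (
    '<p><term id="term4"><distinct>Woolly-bears</distinct></term> '
    '(<gloss target="term4">heavy shrapnel</gloss>) '
    '<gloss target="term9">unmatched</gloss></p>'
)
dangling = unresolved_glosses(fragment)
print(dangling)
```

With only a handful of term and gloss pairs a check by eye is just as practical, but a script like this scales if more glosses are added later.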
<q> for direct speech or short inline quotations from documents
<quote> for whole documents, or substantial parts of documents. These are usually memos from Brigade, Division, or Corps level (and also a note thrown over the parapet by the Germans complaining that the battalion is throwing too many grenades at them and that it isn’t fair!). Paragraphs within the quote are marked with <p>. The names of writers/senders will be marked up as names in phase 2, but I am not likely to make any attempt to indicate in machine readable form that they are the sender.
<soCalled> this is the last resort, where the author has used double quotes which don’t fit into any other tag, or where it is reasonably obvious that some sarcasm or distancing was intended (intentional fallacy!). For example, when Sandall writes about a month of rest away from the trenches which was taken up by very strenuous training it seems reasonable to conclude that his rendition of “rest” should be marked as <soCalled>. This tag was only used 6 times.
On reflection the tags in this section are not relevant to the core aims of the project. Using them was a useful part of the learning experience, but I think that in future it would be acceptable to leave double quotes in the text without making any suggestions about what they mean. A linguistic analysis of British Army culture in the early 20th century would be a very different project and would need to consider words which are not highlighted by the author in any way.
6.4 Names, numbers, dates, abbreviations and addresses
Names, addresses, and dates are mostly to be left until phase 2, but a few name and address tags were used in the front matter. I haven’t yet decided whether to mark abbreviations which are not parts of names, ranks, or units.
6.5 Editorial changes
<corr> only used where there is an obvious typographical error, in order to maintain consistent spelling within the text. For example, the index contains an entry for “Fouquevillers”, but it occurs between “Foch” and “Fonsomme”, and the pages referenced only mention “Fonquevillers”, so this is clearly a typo. It’s important to note that the supplied correction is the form used in the rest of the text, NOT the modern French spelling “Foncquevillers”. Only 3 corrections of this type were made.
<sic> used where there is an obvious typographical error, but where the correct form cannot be determined from the rest of the text. But how can this happen? The dates of awards in the medal table are usually given in the form dd/mm/yy but in two cases the month is given as 14! I might be able to find the correct dates with a bit more digging, but at this stage I can’t confidently correct them so they’re both left as <sic>.
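Impossible dates of this kind can be found mechanically. Here is a minimal Python sketch, assuming dates appear as dd/mm/yy with one or two digits for the day and month. The function name suspect_dates and the sample dates are invented for illustration:

```python
import re

# Day / month / two-digit year, e.g. 3/6/17 or 03/06/17.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b")


def suspect_dates(text: str) -> list:
    """Return date strings whose day or month falls outside the possible range."""
    suspects = []
    for match in DATE_PATTERN.finditer(text):
        day, month = int(match.group(1)), int(match.group(2))
        if not (1 <= day <= 31 and 1 <= month <= 12):
            suspects.append(match.group(0))
    return suspects


# Made-up sample, not real entries from the medal table.
flagged = suspect_dates("3/6/17 then 1/14/18 then 32/1/16")
print(flagged)
```

This only flags candidates for <sic>. It cannot say what the correct date should be, and it does not catch impossible combinations such as 31/02/17.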
<gap> not used so far, but might be used to indicate that Teall’s epilogue has been left out (so far I haven’t imported this part of the text into the XML document).
6.7 Lists
<list> used to mark a list which is formatted as a list or table in the original text, e.g. medal awards and battalion establishment in the appendices, and also the index. Inline lists within paragraphs have not been marked.
<item> used to mark each entry in a list. No labels used.
6.8 Notes and indexes
Not used so far. There are no footnotes or end notes in the source text. I am not intending to index any terms or subjects, only names and dates which will have their own indexing system. I have also preserved the original index. If I decide to add hyperlinks to it, they will point to numbered page break elements.
6.9 Reference systems
<pb> page breaks; added automatically at the start of every page using find and replace. Will be numbered when I work out a quick and easy way of generating the main sequence of page numbers.
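One possible way to generate the page numbers is to walk through the page breaks in order and give each an n attribute. The Python sketch below is only that: the function name number_page_breaks and the first parameter are assumptions, and it treats the whole input as a single unbroken arabic sequence, which would not suit front matter with roman numerals or unnumbered plates.

```python
import re
from itertools import count


def number_page_breaks(text: str, first: int = 1) -> str:
    """Replace each unnumbered page break with <pb n="..."/> in sequence."""
    counter = count(first)
    # Matches <pb>, <pb/> and <pb /> but not page breaks that already
    # carry attributes.
    return re.sub(r"<pb\s*/?>",
                  lambda m: f'<pb n="{next(counter)}"/>',
                  text)


numbered = number_page_breaks("<pb/>first page<pb/>second page<pb>", first=5)
print(numbered)
```

Running the function separately on the front matter, main text, and back matter, each with its own starting number, would be one way to handle a book with more than one page sequence.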
<lb> line breaks; also added automatically at the start of every line, except where the line begins a paragraph. No need for any numbering here. Soft hyphens used where a line break occurs in the middle of a word.
<cb> column breaks; only used in index which is laid out in two columns.
Page breaks are important for referring to the original text. The other two tags are not strictly necessary but were easy to add and give a more accurate representation of the source text.
6.10 Bibliographic references
Only used in the TEI header so far. The main text refers to some publications, such as The Times and the London Gazette. These will probably be marked up during phase 2.
7 Divisions
<text> so far the book is marked as a single text. If I add Teall’s epilogue it will become a group of two texts, but the front and back matter will apply to both.
<front> title page, preface, contents, and list of illustrations. I’ve included the original table of contents but it isn’t strictly necessary as the TEI XSLT transform can generate contents automatically.
<back> appendices and index. The original index will still be useful for locating terms and subjects which are not covered by my indexing of people, places, and organizations.
<div> I’ve used unnumbered divs for the chapters and appendices and added attributes to indicate preface, contents, chapter, appendix and index. I was only expecting one level of divisions, but the medal list has a section for foreign awards which I decided was best represented as a division within the appendix. Chapter numbers were added manually as there aren’t too many of them.
<head> used for titles of divisions and lists.
