Marking Up Names: Part 1
On to the next stage of digitizing Sandall’s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations.
For this phase, I decided to use the following tags:
<placeName> (used 729 times) marks up names of places. I decided to mark up only settlements which are likely to be found on Google Maps, so names of geographic features, trenches/fortifications, HQs etc. were ignored.
<persName> (used 789 times) marks the name of a person. Contains further tags to mark up forenames, surnames, ranks etc. @key is to contain a regularized version of the name which can be used to link records together and link to external web pages.
<forename> used to mark up forenames, one tag for each forename, including initials. @full used to denote initials or full names. In practice the vast majority of forenames in the book are given as initials.
<surname> used to mark surnames. Double-barrelled names are treated as a single surname with a single tag.
<nameLink> parts of name such as “de” or “le” which are not considered part of the surname for sorting purposes. I have followed the practice of the index in the original text to decide which parts of a surname are significant.
<roleName> used mostly for military ranks.
<addName> used for service numbers, with @type given the value “servicenumber”
<rs> (used 158 times) referring strings which refer to an individual who might be identifiable but who isn’t named in the text, eg “Battalion C.O.”, “our missing man”, “Corps Commander” etc.
<name> (used 3 times) only used to mark up the names of ships. I’m not sure if I’m actually going to do anything with these yet.
<date> (used 991 times) used for all dates which are identifiable as a single day, regardless of form in text eg “4th August, 1914”, “13th October”, “the 5th”, “next day”. @when gives full date in form yyyy-mm-dd. No other tags were used to mark dates as for this project it isn’t necessary to mark up individual parts.
<abbr> marks abbreviations given in text. Some obvious ones (eg “a.m.”, “p.m.”) and some ranks which were only slightly abbreviated (eg “Lieut.-Colonel”, “Sergt.”) were left unmarked but most were marked up.
<expan> supplied expansion of any abbreviation marked with <abbr>
<choice> surrounds <abbr> and <expan> pairs. XSLT generates HTML <abbr> tag with the content of <expan> as the value of @title
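As a rough illustration of what that last transformation does, here is a minimal Python sketch rather than the actual XSLT. It assumes the elements carry no XML namespace (a real TEI file would normally use the TEI namespace), and the function name choice_to_html_abbr is made up for this example.

```python
import xml.etree.ElementTree as ET

def choice_to_html_abbr(choice):
    """Build an HTML <abbr> element from a <choice> holding <abbr> and <expan>."""
    abbr = choice.find("abbr")
    expan = choice.find("expan")
    # The expansion becomes the title attribute; the abbreviation stays visible.
    html = ET.Element("abbr", {"title": "".join(expan.itertext())})
    html.text = "".join(abbr.itertext())
    return html

# Hypothetical sample, no namespace declared (assumption).
sample = "<choice><abbr>C.O.</abbr><expan>Commanding Officer</expan></choice>"
html_abbr = choice_to_html_abbr(ET.fromstring(sample))
print(ET.tostring(html_abbr, encoding="unicode"))
# → <abbr title="Commanding Officer">C.O.</abbr>
```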
The names and dates in the medal list in the appendix were easy to mark up fully using regular expressions as they were well organized into predictable patterns. The main text needed more manual intervention as things were much less predictable. When I say “manual” that doesn’t involve any reading or typing as oXygen can still automate things. For example, adding a tag usually only involves a couple of double clicks to select a word then select a tag from the list of tags allowed at that point.
At the first pass I only added <placeName>, <persName>, <abbr>, and <date> tags. Using the Find feature in oXygen I set up a simple regular expression [A-Z] which would find every capital letter. This allowed me to cycle through, finding every proper noun and abbreviation (and a lot of false positives too, but I can’t think of a better way to do it), and mark them with the appropriate tag. This took about 4 1/2 hours. Then I used another regular expression to find any dates which hadn’t been tagged yet (many of them don’t have months given and so weren’t found by the capital letter search):
(?<= )[0-9]{1,2}(st|nd|rd|th)
This also found a few unit/formation numbers. I checked each occurrence to make sure it actually was a date, but even so it only took about 40 minutes to cycle through the whole book. Next I searched for some referring strings which wouldn’t have been picked up by the previous searches (eg “man”, “officer”, “day”, “month”).
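For anyone who wants to try the date search outside oXygen, here is a small Python sketch using the same regular expression. The sample sentence and the function name find_day_ordinals are invented for illustration. Note that the lookbehind for a space means an ordinal directly after a tag’s closing “>” is skipped, and that three-digit numbers such as “138th” don’t match at all.

```python
import re

# Same pattern as above: one or two digits plus an ordinal suffix,
# but only when directly preceded by a space.
DAY_ORDINAL = re.compile(r"(?<= )[0-9]{1,2}(st|nd|rd|th)")

def find_day_ordinals(text):
    """Return every candidate day number found in text, in order."""
    return [m.group(0) for m in DAY_ORDINAL.finditer(text)]

# Hypothetical sample text (not a quotation from the book).
sample = ("On <date>4th August, 1914</date> we mobilized. On the 5th the "
          "46th Division moved, and 138th Brigade followed on 13th October.")
print(find_day_ordinals(sample))
# → ['5th', '46th', '13th']
```

The “46th” in the output is exactly the kind of unit number that still has to be weeded out by eye.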
Adding values to the @when attributes of <date> tags was done automatically for the medal list but had to be done manually for the main text as the month and year often had to be worked out from the context. With a lot of copying and pasting this took about 2 hours.
With abbreviations marked up it was easy to supply expansions using Find and Replace. This took about half an hour, partly because there were so many different abbreviations, and partly because of time spent researching particularly obscure ones. On reflection I could also have used the Find and Replace feature of oXygen (which allows the use of regular expressions AND XPath) to mark up most of the name components but at least I learnt from doing it manually that it took 3 1/2 hours and was very tedious. If forenames were full I set @full to “yes” manually, otherwise I left them, then used F&R to add full="init" to all the ones which didn’t have @full attributes. I did a similar thing with type="military" for <roleName> tags.
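The same default-filling step could also be scripted. This is only a sketch of the idea in Python, not what I actually ran: it assumes un-namespaced element names, and the sample name is made up.

```python
import xml.etree.ElementTree as ET

def fill_default_attributes(root):
    """Add full="init" and type="military" wherever they are missing."""
    for forename in root.iter("forename"):
        if forename.get("full") is None:      # leave hand-set values alone
            forename.set("full", "init")
    for role in root.iter("roleName"):
        if role.get("type") is None:
            role.set("type", "military")
    return root

# Hypothetical sample: one initial and one full forename already marked "yes".
sample_xml = (
    "<p><persName><roleName>Capt.</roleName> <forename>J.</forename> "
    "<forename full='yes'>William</forename> <surname>Smith</surname>"
    "</persName></p>"
)
root = fill_default_attributes(ET.fromstring(sample_xml))
print([f.get("full") for f in root.iter("forename")])
# → ['init', 'yes']
```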
Once all the names were fully marked up I wrote a Python script to generate a unique @id for every <persName> and <rs>. It also generated @key values for every <persName> from the component parts, in the form [surname], [forenames] [namelink] [servicenumber] [rank]. When I get to record linkage these will need to be standardized, as often the same person is referred to in different ways.
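A stripped-down sketch of that kind of script is below. It is not the script I used: the id format (“n” plus a running number), the function names, and the sample data are all assumptions for illustration, and it again ignores XML namespaces.

```python
import xml.etree.ElementTree as ET

def text_of(elements):
    """Join the text content of several elements with single spaces."""
    return " ".join("".join(e.itertext()).strip() for e in elements)

def build_key(pers_name):
    """Key in the form: surname, forenames namelink servicenumber rank."""
    surname = text_of(pers_name.findall("surname"))
    rest = [
        text_of(pers_name.findall("forename")),
        text_of(pers_name.findall("nameLink")),
        text_of(pers_name.findall("addName[@type='servicenumber']")),
        text_of(pers_name.findall("roleName")),
    ]
    tail = " ".join(part for part in rest if part)  # skip missing components
    return f"{surname}, {tail}" if tail else surname

def add_ids_and_keys(root):
    """Give every <persName> and <rs> an id; give every <persName> a key."""
    counter = 0
    for el in root.iter():
        if el.tag in ("persName", "rs"):
            counter += 1
            el.set("id", f"n{counter:04d}")  # id format is an assumption
            if el.tag == "persName":
                el.set("key", build_key(el))
    return root

# Hypothetical sample data.
sample_xml = (
    "<p><persName><roleName>Pte.</roleName> <forename>J.</forename> "
    "<forename>W.</forename> <surname>Smith</surname> "
    "<addName type='servicenumber'>1234</addName></persName> and "
    "<rs>our missing man</rs></p>"
)
root = add_ids_and_keys(ET.fromstring(sample_xml))
print(root.find("persName").get("key"))
# → Smith, J. W. 1234 Pte.
```

Keys for <rs> are deliberately left untouched here, since they can’t be built from the tag content.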
Referring strings were more difficult. These needed @key values in the same form as <persName> but obviously they can’t be generated from the content of the tag. It took a lot of research to fill these in but I did most of them in a day. Battalion COs and Adjutants were supplied entirely from internal evidence in other parts of the book. Division Commanders were easy because 1/5th Lincs was always in 46th Division and there’s a complete list of division commanders at The Long Long Trail. Someone on the Great War Forum kindly supplied me with a list of Brigadiers of 138th Brigade. Surprisingly Corps and Army commanders were more difficult as I couldn’t find a complete list of which Corps and Army 46th Division belonged to and who commanded them. With a lot of cross-checking between different parts of the book, along with the Oxford DNB, The Long Long Trail and Great War Forum, Regiments.org, and various Google searches I eventually pieced most of it together and now I’m only stuck for one Corps Commander. Google, DNB, and Wikipedia together supplied names of several non-military people, such as the Archbishop of York, Vicar of Grimsby, and President of Portugal. In the end there were only a few staff officers and COs of other battalions which I had to ask for help with on the Great War Forum. I still haven’t got everyone, but where a name is unknown I’ve given them unique strings like “unknown wounded NCO 1918-09-25”.
The next step is to pull out all the names and standardize the key strings so that they match for all instances of the same person, then feed them back into the XML document.
