<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Investigations of a Dog &#187; tei</title>
	<atom:link href="http://www.investigations.4-lom.com/tag/tei/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.investigations.4-lom.com</link>
	<description>Failing better at understanding the past</description>
	<lastBuildDate>Sun, 05 Feb 2012 09:18:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Digital Express</title>
		<link>http://www.investigations.4-lom.com/2008/02/08/digital-express/</link>
		<comments>http://www.investigations.4-lom.com/2008/02/08/digital-express/#comments</comments>
		<pubDate>Fri, 08 Feb 2008 20:00:56 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[copyright]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/02/08/digital-express/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Digital+Express&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-02-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/02/08/digital-express/&amp;rft.language=English"></span>
Having decided to leave my 5th Lincolnshire First World War project for a while, I got an offer I couldn&#8217;t refuse: someone from the Great War Forum sent me a transcript of the battalion&#8217;s medal citations from the regimental archive so that I could publish them on my site and link them in to the [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Digital+Express&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-02-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/02/08/digital-express/&amp;rft.language=English"></span>
<p>Having decided to leave my 5th Lincolnshire First World War project for a while, I got an offer I couldn&#8217;t refuse: someone from the Great War Forum sent me a transcript of the battalion&#8217;s medal citations from the regimental archive so that I could publish them on my site and link them in to the index of people that I&#8217;d created for the book. The document contains information that can&#8217;t be found elsewhere, as although awards of the Military Medal were listed in the London Gazette, full citations were not normally published. There are also three awards not mentioned in Sandall&#8217;s list, and citations for 10 people who were recommended for awards but turned down.</p>
<p>I received the list as a Word file with no semantic markup on Wednesday morning, started working on it on Thursday morning, and <a href="http://www.4-lom.com/citations/">published it on the web</a> this afternoon. It looks very basic but it&#8217;s not bad for two days, and it&#8217;s all linked in to the <a href="http://www.4-lom.com/sandall/people-index.html">index of people</a> for Sandall&#8217;s book. First of all I copied the text into jEdit and used Find and Replace to insert some basic TEI XML markup. Then I pasted it into a new TEI document in oXygen. With the automatic validation it was easy to track down and correct errors in the markup, so by lunch time I had a completely valid TEI file. In the afternoon I spent about 3 or 4 hours on linking records by inserting key attributes into &lt;persName&gt; tags. In most cases I already had the keys that I used for linking names in Sandall, but sometimes I had to change them in the light of new evidence from the citations, such as full names of people who I previously only knew by their initials. This also allowed me to clear up some ambiguities. This morning I finished the linkage by creating new keys for the 13 people not mentioned by Sandall, then got started on writing some XSLT. That was easy as I could copy or adapt a lot of the code from the style sheet for Sandall. As well as generating the HTML version of the citations, this XSLT generates an extra JSON file which is imported into the Sandall index of people to allow linking the citations. Again this only required some minor adjustments to the Exhibit page. After some testing and corrections I had a live site up this afternoon.</p>
<p>This demonstrates the potential value of the techniques I&#8217;ve been using for marking up texts, but it also raises some problems for digital history. I decided to trust a transcript from a random person off the internet. I have no way of knowing how accurate the transcript is, or even if the source document really exists! It could be Hugh Trevor-Roper and the &#8220;Hitler Diaries&#8221; all over again. Therefore I&#8217;m going to think more carefully before putting myself in this situation again. There&#8217;s also a possibility that I&#8217;ve miscalculated the copyright situation. Based on internal evidence and comparison with other documents my best guess is that the list was created by the army and is therefore under Crown Copyright (and being unpublished and available for inspection in a public record repository should come under waiver of Crown Copyright), but without seeing the original it&#8217;s hard to be sure. I might be wrong, and even if I&#8217;m right the holders of the manuscript might not agree. So technology makes some things easier, but there are other problems that it can&#8217;t solve.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/02/08/digital-express/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sandall: The End of the Beginning</title>
		<link>http://www.investigations.4-lom.com/2008/02/01/sandall-the-end-of-the-beginning/</link>
		<comments>http://www.investigations.4-lom.com/2008/02/01/sandall-the-end-of-the-beginning/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 16:14:30 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[microformats]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[wikis]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/02/01/sandall-the-end-of-the-beginning/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Sandall%3A+The+End+of+the+Beginning&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-02-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/02/01/sandall-the-end-of-the-beginning/&amp;rft.language=English"></span>
Having made good progress with my project to digitize Sandall&#8217;s History of 5th Lincolnshire Regiment in the last month I&#8217;m going to leave it for a while. This month I haven&#8217;t read any books or articles, haven&#8217;t written anything other than blog posts and computer code, and have only occasionally thought about historiography and theory. [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Sandall%3A+The+End+of+the+Beginning&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-02-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/02/01/sandall-the-end-of-the-beginning/&amp;rft.language=English"></span>
<p>Having made good progress with my project to digitize <a href="http://www.4-lom.com/sandall/">Sandall&#8217;s History of 5th Lincolnshire Regiment</a> in the last month I&#8217;m going to leave it for a while. This month I haven&#8217;t read any books or articles, haven&#8217;t written anything other than blog posts and computer code, and have only occasionally thought about historiography and theory. I kind of like it like that but I have other things to get on with now.</p>
<p>I&#8217;ve made some small changes since the last post. Dates now have tool tips, so if you hover over them you can see the full date. The <a href="http://www.4-lom.com/sandall/place-index.html">place name index</a> is a bit more user-friendly. I&#8217;ve replaced the hash values with query strings in the incoming links so that the Exhibit page filters the list down to the place passed in the query instead of displaying a box with the details. This means that you just have to click on &#8220;Map&#8221; to go straight to map view with only that place displayed. Once you&#8217;re there you can easily take the filter off again to see all the other places. The map view is also zoomed out further by default so that you can see Britain and Egypt. That means that you have to zoom in a long way to get to France and Flanders but I think it&#8217;s less confusing than not being able to see Grimsby or Alexandria unless you zoom out.</p>
<p>So the site is now in a satisfactory condition with lots of cool features, and now that I&#8217;ve worked out how to do everything I could probably get another book to the same stage within a few weeks. But there are still lots of features that could, and probably should, be added. See below for more details.<span id="more-177"></span></p>
<h3>Text Snippets</h3>
<p>Ben Brumfield pointed out that the lists of references to a person or place in the index would be much better if they included a snippet of surrounding text from the target page (like Google does) to provide some context. I definitely agree with this, but I haven&#8217;t worked out how to do it yet. Ideally I&#8217;d like to do it in the XSLT so that I don&#8217;t have to introduce an extra step in the process, but I haven&#8217;t yet looked into whether that&#8217;s possible, or whether it&#8217;s unnecessarily difficult compared to running a Python script over the HTML output files.</p>
<h3>Feedback Mechanism</h3>
<p>This is one of the most important features, and something that I&#8217;ve wanted for a long time. The experience of digitizing a text has just reinforced my view that it&#8217;s futile to try to eliminate all scanning errors before publication. The perfect is the enemy of the good. In this case the output from FineReader was good enough. I spotted a few more errors during markup, but not many. I really think the way to go is to use the OCRd text with whatever automated checks you have time for and not worry about whether the text is perfect. The chances are that end users will spot the scannos in the course of reading. If you give them an easy to use feedback mechanism that lets them report errors with the minimum trouble it&#8217;s good for everyone.</p>
<p>The feedback mechanism that I originally had in mind was like a cross between Distributed Proofreaders and a wiki. The correction page would present users with a page image and a text box containing the page text, which they could edit and submit.  Once approved by a moderator the changes would go live. Now I&#8217;m not sure that this is exactly what I need. Although the text almost certainly contains scannos that I&#8217;ve missed, there might not be enough to justify developing such a system. The major technical challenge would be feeding the changes back into the TEI XML file so that the website and the original file stay in sync. If I could accept branching between the XML and HTML versions it wouldn&#8217;t be too difficult to adapt a wiki to do something like this. However, this kind of feedback is only good for simple errors like scannos. I don&#8217;t think it could help people to point out possible errors in record linkage and name identification, where it&#8217;s easier to tell than to show what&#8217;s wrong (unless the users know TEI quite well).</p>
<p>I was also thinking about using CommentPress as a feedback mechanism. I was really excited about this when I first heard about it but it&#8217;s given me so much trouble that I&#8217;ve lost all confidence in it. Is it a good idea badly implemented? Or is the idea fundamentally flawed? Maybe it&#8217;ll get better in time, but right now it&#8217;s not for me. I have some ideas for a WordPress hack of my own, but I&#8217;m not sure if I&#8217;ll ever have time to try it.</p>
<h3>Wiki</h3>
<p>Whatever I use for the feedback mechanism for the text itself, I want a wiki for biography pages. This would allow for a lot more information than is currently displayed by Exhibit on the index page. On reflection it would have been useful to have a wiki earlier in the process, especially when I was researching and disambiguating people, as I could have put biographical details straight into it as soon as I found them.</p>
<h3>Valid XHTML</h3>
<p>The main obstacle to this is the problem I found with &lt;a&gt; elements inside &lt;ul&gt; elements but not in &lt;li&gt; elements. This is a case where the HTML specification conflicts with the semantics of the documents I&#8217;m trying to represent, and something&#8217;s got to give. If I decide that validity is more important then I should be able to adjust the XSLT code to move the &lt;a&gt; inside the preceding &lt;li&gt;. I&#8217;ve already had to do a similar thing where a page break occurs inside a name because one &lt;a&gt; inside another doesn&#8217;t work.</p>
<h3>Icons</h3>
<p>It would look nicer, and take up less space, if the &#8220;view page image&#8221; links were icons instead of text.</p>
<h3>Dates</h3>
<p>All the dates in the text are already marked up in the XML but I haven&#8217;t done very much with them yet. As I&#8217;m already using Exhibit it might be possible to make a timeline.</p>
<h3>Metadata</h3>
<p>Although I&#8217;ve made it easy for humans to find people and places mentioned in the text, you&#8217;d have to be quite clever with scrapers and regular expressions to get them automatically (or you could just download the XML source). I&#8217;d like to follow best practice for providing metadata but I&#8217;m not sure what that is yet (is anyone?). A good start might be to embed some microformats in the HTML pages. And as it&#8217;s a book I should really put in some COinS data so that Zotero can grab it.</p>
<h3>Less Mark-up</h3>
<p>I&#8217;ve already taken out some of the XML tags that I originally put in but which weren&#8217;t really any use. I might take that even further.  Considering the number of blatant mistakes I found when marking up and linking people, places, and dates, it seems increasingly futile to try to correct Sandall even for internal consistency. Therefore &lt;sic&gt; and &lt;corr&gt; tags are likely to disappear in future as my editorial policy changes to preserving the original text regardless of how wrong it is. Correct identifications are supplied in name attributes anyway, so there&#8217;s no need to correct the content of these tags.</p>
<h3>And More Mark-up</h3>
<p>So far I&#8217;ve marked up dates, people, and places. Right now places are limited to named settlements that can be seen on Google Maps, but this could all change in future. There are many other things mentioned, such as hills, woods, farms, and fortifications, which could potentially be geotagged and displayed on the map. This would need a lot of extra research as many of these things can&#8217;t be found using Google alone. I could also mark up organizations such as regiments and battalions, and provide an index similar to the person and place indexes.</p>
<h3>Better Name Identification</h3>
<p>Talking of extra research, there is still work to be done on identifying and disambiguating people and places. Some of the personal names in the index have question marks after them because they might be the same as a similarly named person in another entry, or they might not. There are also several people and places which don&#8217;t appear in the index yet as I haven&#8217;t been able to identify them at all. With people these are referring strings rather than actual names (eg &#8220;a wounded man&#8221;, &#8220;his platoon officer&#8221;). If I can identify them at all it will only be through checking the battalion war diary. The places are all named, but don&#8217;t appear on Google Maps. In many cases I can work out roughly where they should be from the context, but not with enough certainty to display them on the map. Finally there are some very obscure people who are named and can be disambiguated from anyone else in the text but who I can&#8217;t positively identify in any other source. Most of them are junior officers who are only mentioned briefly (and often only by surname) as joining the battalion on a certain day but aren&#8217;t mentioned again because they never did anything noteworthy like winning a medal or getting killed. I would at least like to know their forenames. Getting to the bottom of this is going to take a lot of digging in officers&#8217; correspondence files at Kew (if they ever finish the building work&#8230;).</p>
<h3>More Precise Schema</h3>
<p>Now that I&#8217;m further on with markup I have a better idea of which tags I will and won&#8217;t need in future. I should remove all unused tags from the schema as this would make it smaller.  It would also make mark-up more efficient as oXygen displays a list of every tag that is allowed at a certain point, which you can double click on to insert that tag around the current selection. This is obviously easier if the list isn&#8217;t full of superfluous tags that you have no intention of using.</p>
<h3>And then&#8230;</h3>
<p>I also need to think about how to digitize my great-grandad&#8217;s letters, as this was always going to be part of this project. It won&#8217;t take very long as there isn&#8217;t a large amount of material, and I can re-use a lot of the techniques and code that I&#8217;ve developed for Sandall&#8217;s book. The main differences are manual typing instead of OCR, and deciding how to represent the structure of the postcards.</p>
<p>Then what? I think I should write a detailed step by step guide to digitization using these techniques to help anyone else who wants to follow in my footsteps. It would also be good to digitize another book in the same way and see how quickly it can be done. I&#8217;ve probably spent more time working out how to do things than I have actually doing them. The history of 1/5th Leicestershire Regiment, which was in the same brigade as 1/5th Lincs, is available on Project Gutenberg and would make a nice companion to Sandall. The techniques that I&#8217;ve used would also transfer to a battalion war diary quite easily. The main difference is that it&#8217;s manuscript, so no OCR, but very long. Therefore it might be useful to introduce an element of collaboration to the text capture.</p>
<p>But now I need to get back to the English Civil War for a while. Even there I should be able to make good use of some of the things I&#8217;ve learnt doing this project.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/02/01/sandall-the-end-of-the-beginning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Places</title>
		<link>http://www.investigations.4-lom.com/2008/01/28/places/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/28/places/#comments</comments>
		<pubDate>Mon, 28 Jan 2008 12:01:46 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/28/places/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Places&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/28/places/&amp;rft.language=English"></span>
Following on from adding an interactive index of people to my digital edition of Sandall&#8217;s history of 5th Lincs, I&#8217;ve now added a similar feature for place names. It works in exactly the same way as the person index, but it also has a map view. Again this uses the Exhibit API, which makes it [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Places&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/28/places/&amp;rft.language=English"></span>
<p>Following on from adding an interactive index of people to my digital edition of Sandall&#8217;s history of 5th Lincs, I&#8217;ve now added a similar feature for <a href="http://www.4-lom.com/sandall/place-index.html">place names</a>. It works in exactly the same way as the person index, but it also has a map view. Again this uses the <a href="http://simile.mit.edu/exhibit/">Exhibit API</a>, which makes it very easy to mash up data with Google Maps without even having to know anything about the Google Maps API. The map view is a bit slower than the normal view, especially if the list isn&#8217;t filtered, but that&#8217;s an inherent limitation of using maps.</p>
<p>One of the many cool things about the map is that it strikingly illustrates the allied advances in the last months of the First World War. If you go into the map view and click &#8220;The Beginning of the Great Advance&#8221; on the list of chapters, you&#8217;ll see the battalion holding the line in Flanders, then moving behind the lines for rest near Amiens, then moving up to the front line at Saint-Quentin. Then click on each of the following chapters in turn and watch the markers surge forward as 46th Division breaks through the Hindenburg Line and pushes towards Belgium.</p>
<p>Adding the place index was mostly similar to adding the person index: I added a unique id to each &lt;placeName&gt; tag using a Python script, pulled out the place names into an SQLite database, identified/disambiguated them and added a regularized name, then used another Python script to pull the regularized names out of the database and put them into the key attributes in the XML file. Identifying the places was easier than identifying people, and took a couple of days, although there are a few that I couldn&#8217;t find. As with people I added some code to the XSLT to generate a JSON file of all the places. Then following the <a href="http://simile.mit.edu/wiki/Exhibit/2.0/Map_View_Tutorial">map view tutorial</a> I used the Exhibit API to pull latitude and longitude co-ordinates from Google Maps and put them into another JSON file. This turned out to be a bit unreliable as about 10 per cent of the places had their co-ordinates missing. It seems to be random, as running the script again with the same set of data produced a similar error rate but with different places. I had to take the missing places from the output file, put them into another input file and run the script over them again, which produced a similar 10 per cent error rate, but the remaining few co-ordinates could be put in manually. Once I had a JSON file with all the correct geocodes it was easy to copy code from the tutorial to add a map view to the Exhibit page. In a few cases it turned out that Google had given me the wrong co-ordinates. Mostly this was because there are two or more places with the same name and it had picked the wrong one. I thought I&#8217;d put in enough information from my manual searches to disambiguate them but it seems that the results of a Google Map search can be a bit unpredictable, and don&#8217;t necessarily give you the full address of a place.</p>
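<p>The re-run step described above (take whatever came back without co-ordinates and feed it through again) could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function <code>geocode_with_retries</code>, the <code>lookup</code> callback, the fake lookup and the rough co-ordinates are all hypothetical stand-ins for whatever geocoding call is really made.</p>
<pre>
```python
def geocode_with_retries(places, lookup, max_passes=5):
    """Call lookup(place) for each place, re-running only the failures.

    lookup is a hypothetical stand-in for the real geocoding call. It should
    return a (lat, lng) tuple, or None when the co-ordinates come back missing.
    """
    found = {}
    pending = list(places)
    for _ in range(max_passes):
        if not pending:
            break
        still_missing = []
        for place in pending:
            coords = lookup(place)
            if coords is None:
                still_missing.append(place)
            else:
                found[place] = coords
        pending = still_missing
    # Anything left in pending has to be looked up by hand.
    return found, pending


# Fake lookup for demonstration: rough, illustrative co-ordinates only.
FAKE_COORDS = {"Grimsby": (53.57, -0.08), "Alexandria": (31.20, 29.92)}
attempts = {}


def flaky_lookup(place):
    """Fail on the first request for Alexandria, and always for unknown places."""
    attempts[place] = attempts.get(place, 0) + 1
    if place == "Alexandria" and attempts[place] == 1:
        return None
    return FAKE_COORDS.get(place)


found, missing = geocode_with_retries(
    ["Grimsby", "Alexandria", "Nowhere Farm"], flaky_lookup
)
```
</pre>
<p>Here <code>found</code> ends up holding the places that eventually resolved, and <code>missing</code> holds the ones that still need co-ordinates putting in manually.</p>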
<p>I&#8217;ve now done most of what I planned to do in this phase. There are still some features that could be added, especially a feedback mechanism, but I&#8217;ll be giving this project a rest soon so I can do some English Civil War work.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/28/places/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Marking Up Names: Part 3</title>
		<link>http://www.investigations.4-lom.com/2008/01/23/marking-up-names-part-3/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/23/marking-up-names-part-3/#comments</comments>
		<pubDate>Wed, 23 Jan 2008 12:52:41 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/23/marking-up-names-part-3/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+3&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/23/marking-up-names-part-3/&amp;rft.language=English"></span>
My digital edition of Sandall&#8217;s History of 5th Lincolnshire Regiment now has a new improved index of people. This uses the Exhibit API to make an interactive list which can be filtered, sorted, and searched. Exhibit provides features that would normally need a database driven back-end but it&#8217;s all done on the client side using [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+3&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/23/marking-up-names-part-3/&amp;rft.language=English"></span>
<p>My digital edition of Sandall&#8217;s History of 5th Lincolnshire Regiment now has a new improved <a href="http://www.4-lom.com/sandall/people-index.html">index of people</a>. This uses the <a href="http://simile.mit.edu/exhibit/">Exhibit</a> API to make an interactive list which can be filtered, sorted, and searched. Exhibit provides features that would normally need a database-driven back-end but it&#8217;s all done on the client side using JavaScript. The two disadvantages of this are that it doesn&#8217;t scale up very far, and that it isn&#8217;t very Google friendly. In this case there&#8217;s no problem because there are only ever going to be 350 records in the list, and there is no unique content on this page &#8211; it&#8217;s just an index to point users to other pages, which are Google friendly.</p>
<p>I&#8217;ve also made every occurrence of a name in the text into a link which points to the index. My worries about illegal characters in id attributes turned out to be unfounded. With Exhibit I can use the standardized names from the TEI @key attribute as hashes to make permalinks to individual records. Clicking on the link takes you to the index and displays a dialog box with all of that person&#8217;s details, including links back to every mention in the text. The dialog box is also displayed by clicking on a person&#8217;s name on the index page. I just need to work out a way to display it without having to reload the page.</p>
<p>Exhibit is really easy to use and makes it possible to add some fairly advanced features with surprisingly little effort. It took some searching, copying examples, trial and error, and asking on the mailing list before I worked out how to do everything, but as the project is documented by a wiki I&#8217;ve been able to update it whenever I find out how to do something that isn&#8217;t already explained there. The JSON data file for my index page is generated automatically by XSLT which loops through every &lt;persName&gt; and &lt;rs&gt; tag in the TEI document, and pulls out extra details (date of death, links to medal cards and CWGC) from another XML file.</p>
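<p>The real data file is generated by XSLT, but the shape of the loop may be easier to see in Python. The sketch below is a swapped-in illustration rather than the actual style sheet: it groups every keyed name tag by its key and collects the ids of the mentions. The top-level <code>items</code> list with <code>label</code> and <code>type</code> fields follows Exhibit&#8217;s JSON layout as I understand it; the <code>mentions</code> field, the ids and the names are invented for the example, and the extra details from the second XML file are left out.</p>
<pre>
```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
NAME_TAGS = ("{%s}persName" % TEI_NS, "{%s}rs" % TEI_NS)


def build_items(root):
    """Group every keyed persName/rs by its key, collecting mention ids."""
    items = {}
    for elem in root.iter():
        key = elem.get("key")
        if elem.tag in NAME_TAGS and key:
            item = items.setdefault(
                key, {"label": key, "type": "Person", "mentions": []}
            )
            item["mentions"].append(elem.get(XML_ID))
    return {"items": list(items.values())}


# Tiny stand-in document with invented names, built through the API.
body = ET.Element("{%s}body" % TEI_NS)
for tag, ident, key in (
    ("persName", "n1", "Smith, John"),
    ("rs", "n2", "Smith, John"),
    ("persName", "n3", "Jones, A."),
):
    ET.SubElement(body, "{%s}%s" % (TEI_NS, tag), {XML_ID: ident, "key": key})

exhibit_json = json.dumps(build_items(body), indent=2)
```
</pre>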
<p>Now that person names are more or less fully implemented, it&#8217;s time to move on to place names. These should be easier to disambiguate, and with Exhibit I can do some even cooler things with them, such as generating a Google map.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/23/marking-up-names-part-3/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Marking Up Names: Part 2</title>
		<link>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/#comments</comments>
		<pubDate>Sat, 19 Jan 2008 15:01:34 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/&amp;rft.language=English"></span>
My digital edition of Sandall&#8217;s History of 1/5th Lincolnshire Regiment now has a new index of people. In my last post I described how names were marked up in the text. This post is about how I linked them together. I&#8217;d already used a Python script to generate a unique id for every &#60;persName&#62; and [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/&amp;rft.language=English"></span>
<p>My digital edition of Sandall&#8217;s History of 1/5th Lincolnshire Regiment now has a new <a href="http://www.4-lom.com/sandall/people-index.html">index of people</a>. In my last post I described how names were marked up in the text. This post is about how I linked them together.</p>
<p><span id="more-170"></span>I&#8217;d already used a Python script to generate a unique id for every &lt;persName&gt; and &lt;rs&gt; tag, and to build a key using components of the name. Next I wrote another Python script to pull all the names out of the XML document and put them into a SQLite database so that I could standardize the key values. The SQLite clients I have are good for running queries but not so good for manually editing data, so I exported the table to CSV and opened it in Excel. That was possibly a mistake: when I came to feed the data back into the XML I ran into some nasty character encoding problems, but it was partly my own fault for letting some unnecessary non-ASCII characters creep in through copying and pasting text from web pages. Apart from that, editing the data in a spreadsheet was easy and convenient.</p>
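<p>The extraction step described above might look roughly like the following. This is only a sketch under assumptions: the sample XML, the table layout, and the helper names <code>text_of</code> and <code>load_names</code> are invented for illustration and are not taken from the actual script.</p>

```python
import sqlite3
from xml.dom import minidom

# Invented sample; the real TEI document is far larger.
SAMPLE = (
    '<p><persName xml:id="n1" key="Sandall, T. E. Lt-Col">'
    '<roleName>Lt-Col</roleName> <forename>T.</forename> '
    '<forename>E.</forename> <surname>Sandall</surname></persName> '
    'wrote the history; <rs xml:id="n2" key="Sandall, T. E. Lt-Col">'
    'the Battalion C.O.</rs> is a referring string.</p>'
)

def text_of(node):
    """Concatenate all the text found beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return ''.join(parts)

def load_names(xml_string, conn):
    """Copy the id, tag, key and text of every persName and rs into a table."""
    doc = minidom.parseString(xml_string.encode('utf-8'))
    conn.execute('CREATE TABLE names '
                 '(id TEXT PRIMARY KEY, tag TEXT, key TEXT, content TEXT)')
    for tag in ('persName', 'rs'):
        for el in doc.getElementsByTagName(tag):
            conn.execute('INSERT INTO names VALUES (?, ?, ?, ?)',
                         (el.getAttribute('xml:id'), tag,
                          el.getAttribute('key'), text_of(el)))
    conn.commit()

conn = sqlite3.connect(':memory:')  # an on-disk file would be used in practice
load_names(SAMPLE, conn)
rows = conn.execute('SELECT id, tag, key, content FROM names ORDER BY id').fetchall()
```

<p>Keeping the id alongside each key is what makes the later round trip possible: the keys can be edited freely elsewhere as long as the id column is left alone.</p>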
<p>I decided to standardize the keys in the following form: surname, full forenames (or initials where the full names are not known), service number (Other Ranks only; the post-1917 six-digit number is preferred, as these were unique within the regiment), rank (officers only; the highest rank known to have been held during the First World War), and years of birth and death (if known). Officers commissioned from the ranks can have both a rank and a number. Here are a few examples:</p>
<p>Sandall, Thomas Edward, Lt-Col (d 1930)<br />
Stuart-Wortley, Edward James Montagu, Maj-Gen (1857-1934)<br />
Pickard, Herbert, 240010 (d 1917)<br />
Leadbeater, Conrad, 240252 Capt<br />
Howard, William, 241683</p>
<p>In the first pass I just wanted to disambiguate names as quickly as possible. In most cases this could be done from internal evidence, particularly the original index. There were some cases where I needed to check online sources, but even so the process only took a few hours and was completed within one day. There were a few names which couldn&#8217;t be disambiguated with any certainty so I&#8217;ve had to ask the 5th Lincs experts for help.</p>
<p>In the second pass I tried to make the keys as full and detailed as possible by searching online sources for full names, service numbers, and dates of death. This was much more labour-intensive and took several days. It wasn&#8217;t helped by the fact that both Sandall&#8217;s book and the online sources contain many errors and inconsistencies, especially with names and service numbers. This led to lots of lateral thinking, deduction, trial and error, and ultimately frustration. There are still some cases where I couldn&#8217;t positively identify an individual or couldn&#8217;t find any trace of any possible candidates in the available records. Most of these are junior officers who are only mentioned briefly. All this research wasn&#8217;t strictly necessary at this stage, but I&#8217;m looking ahead to the possibility of using a wiki for biography pages, which will need page names in the above form.</p>
<p>Once the keys were consistent and as detailed as possible I re-imported the CSV into the database and wrote yet another Python script to loop through every name element in the XML, pull out the database record with matching id, and write the new key value from the database to the @key attribute of the XML element. This went smoothly once I&#8217;d got rid of all the non-ASCII characters! Then I added some code to the XSLT to generate the index of people by pulling out all the name elements, grouping them by key value, and creating links to each occurrence of that name. This is still looking quite basic, but it works.</p>
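<p>As a rough illustration of the write-back and grouping steps, here is a sketch that assumes a plain dictionary in place of the database table and does the grouping in Python rather than XSLT. The sample XML and the names <code>apply_keys</code> and <code>group_by_key</code> are hypothetical.</p>

```python
from collections import defaultdict
from xml.dom import minidom

# Invented sample with un-standardized keys.
SAMPLE = (
    '<p><persName xml:id="n1" key="Sandall, T. E.">Col. Sandall</persName> '
    'and <rs xml:id="n2" key="">the Battalion C.O.</rs></p>'
)

# Hypothetical id -> standardized key mapping, standing in for the database.
STANDARD_KEYS = {
    'n1': 'Sandall, Thomas Edward, Lt-Col (d 1930)',
    'n2': 'Sandall, Thomas Edward, Lt-Col (d 1930)',
}

def apply_keys(doc, keys_by_id):
    """Overwrite @key on every name element whose id has a standardized key."""
    for tag in ('persName', 'rs'):
        for el in doc.getElementsByTagName(tag):
            new_key = keys_by_id.get(el.getAttribute('xml:id'))
            if new_key is not None:
                el.setAttribute('key', new_key)

def group_by_key(doc):
    """Map each key to the ids of all elements that carry it."""
    index = defaultdict(list)
    for tag in ('persName', 'rs'):
        for el in doc.getElementsByTagName(tag):
            index[el.getAttribute('key')].append(el.getAttribute('xml:id'))
    return dict(index)

doc = minidom.parseString(SAMPLE.encode('utf-8'))
apply_keys(doc, STANDARD_KEYS)
index = group_by_key(doc)
```

<p>Each entry in <code>index</code> corresponds to one line of the index of people, with one link per id in its list.</p>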
<p>The next step is to make each occurrence of a name link back to the index. There&#8217;s a complication here because I need the entries in the index to have valid unique ids so that I can link to them, but I also need the full regularized forms of the names to appear in the index list. This places conflicting demands on the @key attribute, because xml:ids can&#8217;t contain spaces, brackets, or commas. TEI used to have a @reg attribute for names which could contain a regularized form, but this was removed in P5. The way you&#8217;re supposed to do it now is put the regularized form in a &lt;reg&gt; element and surround them with a &lt;choice&gt; element. I could do that and use a modified form for the @key. But I&#8217;m also thinking about putting the name data into a JSON file so I can do cool things with <a href="http://simile.mit.edu/exhibit/">Exhibit</a>. That would mean that I wouldn&#8217;t have to embed the regularized form of the name in every occurrence &#8211; just have an abbreviated form that can be used as a key for name occurrences and as an id for the index entry. It would also allow me to store and use more data than can be accommodated in TEI XML. For example, during the research/disambiguating phase I collected a lot of links to medal cards and CWGC database entries. If these were in a JSON file I could easily link to them from the index page without much extra work. Now I need to work out exactly what I want to do with this and how to structure the data.</p>
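<p>One possible shape for that data, sketched under assumptions: Exhibit data files are, as far as I understand them, a JSON object with an <code>items</code> list whose entries carry a <code>label</code>. The <code>person-</code> prefix, the <code>medalCard</code> and <code>cwgc</code> property names, and the example.org URL are all invented here.</p>

```python
import json
import re

def key_to_id(key):
    """Collapse a regularized name into a string that is safe to use as an xml:id."""
    slug = re.sub(r'[^A-Za-z0-9]+', '-', key).strip('-')
    # The prefix guarantees the id starts with a letter, as xml:id requires.
    return 'person-' + slug

def build_exhibit_data(people):
    """Build a dict shaped like an Exhibit data file: an object with an 'items' list."""
    items = []
    for person in people:
        item = {'id': key_to_id(person['key']),
                'label': person['key'],
                'type': 'Person'}
        # 'medalCard' and 'cwgc' are made-up property names for this sketch.
        for prop in ('medalCard', 'cwgc'):
            if person.get(prop):
                item[prop] = person[prop]
        items.append(item)
    return {'items': items}

people = [
    {'key': 'Pickard, Herbert, 240010 (d 1917)',
     'cwgc': 'http://example.org/cwgc/placeholder'},  # placeholder URL
    {'key': 'Howard, William, 241683'},
]
data = build_exhibit_data(people)
exhibit_json = json.dumps(data, indent=2, sort_keys=True)
```

<p>Two different keys could in principle collapse to the same id once the punctuation is stripped, so a real version would need to check for collisions.</p>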
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Marking Up Names: Part 1</title>
		<link>http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/#comments</comments>
		<pubDate>Tue, 15 Jan 2008 15:55:00 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[ww1]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+1&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-15&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/&amp;rft.language=English"></span>
On to the next stage of digitizing Sandall&#8217;s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations. For this phase, I decided to [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+1&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-15&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/&amp;rft.language=English"></span>
<p>On to the next stage of digitizing Sandall&#8217;s History of 5th Lincolnshire Regiment. Having marked up the structure of the text and written XSLT to split the book into several HTML pages with working internal links, I could move on to Phase 2: marking up names, dates, and abbreviations.</p>
<p><span id="more-169"></span>For this phase, I decided to use the following tags:</p>
<p>&lt;placeName&gt; (used 729 times) marks up names of places. I decided to only mark up settlements which are likely to be found on Google maps. Therefore names of geographic features, trenches/fortifications, HQs etc were ignored.</p>
<p>&lt;persName&gt; (used 789 times) marks the name of a person. Contains further tags to mark up forenames, surnames, ranks etc. @key is to contain a regularized version of the name which can be used to link records together and link to external web pages.</p>
<p>&lt;forename&gt; used to mark up forenames, one tag for each forename, including initials. @full used to denote initials or full names. In practice the vast majority of forenames in the book are given as initials.</p>
<p>&lt;surname&gt; used to mark surnames. Double-barrelled names are treated as a single surname with a single tag.</p>
<p>&lt;nameLink&gt; parts of name such as &#8220;de&#8221; or &#8220;le&#8221; which are not considered part of the surname for sorting purposes. I have followed the practice of the index in the original text to decide which parts of a surname are significant.</p>
<p>&lt;roleName&gt; used mostly for military ranks.</p>
<p>&lt;addName&gt; used for service numbers, with @type given the value &#8220;servicenumber&#8221;</p>
<p>&lt;rs&gt; (used 158 times) referring strings which refer to an individual who might be identifiable but who isn&#8217;t named in the text, eg &#8220;Battalion C.O.&#8221;, &#8220;our missing man&#8221;, &#8220;Corps Commander&#8221; etc.</p>
<p>&lt;name&gt; (used 3 times) only used to mark up the names of ships. I&#8217;m not sure if I&#8217;m actually going to do anything with these yet.</p>
<p>&lt;date&gt; (used 991 times) used for all dates which are identifiable as a single day, regardless of form in text eg &#8220;4th August, 1914&#8221;, &#8220;13th October&#8221;, &#8220;the 5th&#8221;, &#8220;next day&#8221;. @when gives full date in form yyyy-mm-dd. No other tags were used to mark dates as for this project it isn&#8217;t necessary to mark up individual parts.</p>
<p>&lt;abbr&gt; marks abbreviations given in text. Some obvious ones (eg &#8220;a.m.&#8221;, &#8220;p.m.&#8221;) and some ranks which were only slightly abbreviated (eg &#8220;Lieut.-Colonel&#8221;, &#8220;Sergt.&#8221;) were left unmarked but most were marked up</p>
<p>&lt;expan&gt; supplied expansion of any abbreviation marked with &lt;abbr&gt;</p>
<p>&lt;choice&gt; surrounds &lt;abbr&gt; and &lt;expan&gt; pairs. XSLT generates HTML &lt;abbr&gt; tag with the content of &lt;expan&gt; as the value of @title</p>
<p>The names and dates in the medal list in the appendix were easy to mark up fully using regular expressions as they were well organized into predictable patterns. The main text needed more manual intervention as things were much less predictable. When I say &#8220;manual&#8221; that doesn&#8217;t involve any reading or typing as oXygen can still automate things. For example, adding a tag usually only involves a couple of double clicks to select a word then select a tag from the list of tags allowed at that point. At the first pass I only added &lt;placeName&gt;, &lt;persName&gt;, &lt;abbr&gt;, and &lt;date&gt; tags. Using the Find feature in oXygen I set up a simple regular expression [A-Z] which would find every capital letter. This allowed me to cycle through, finding every proper noun and abbreviation (and a lot of false positives too, but I can&#8217;t think of a better way to do it), and mark them with the appropriate tag. This took about 4 1/2 hours. Then I used another regular expression to find any dates which hadn&#8217;t been tagged yet (many of them don&#8217;t have months given and so weren&#8217;t found by the capital letter search):</p>
<p>(?&lt;= )[0-9]{1,2}(st|nd|rd|th)</p>
<p>This also found a few unit/formation numbers. I checked each occurrence to make sure it actually was a date, but even so it only took about 40 minutes to cycle through the whole book. Next I searched for some referring strings which wouldn&#8217;t have been picked up by the previous searches (eg &#8220;man&#8221;, &#8220;officer&#8221;, &#8220;day&#8221;, &#8220;month&#8221;).</p>
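<p>The behaviour of that expression can be checked outside oXygen. Python&#8217;s <code>re</code> module accepts the same lookbehind syntax, so a small sketch (with an invented sample sentence) shows both the hits and the kind of false positive mentioned above:</p>

```python
import re

# One or two digits plus an ordinal suffix, but only when preceded by a space.
ORDINAL = re.compile(r'(?<= )[0-9]{1,2}(st|nd|rd|th)')

# Invented sentence, not a quotation from the book.
sample = ('On the 5th the battalion rejoined 46th Division, '
          'and on 13th October it attacked.')

# group(0) is the whole match; findall() would return only the suffix group.
matches = [m.group(0) for m in ORDINAL.finditer(sample)]
```

<p>Here <code>matches</code> contains the two dates and also &#8220;46th&#8221;, which is a formation number rather than a date, so every hit still has to be checked by eye.</p>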
<p>Adding values to the @when attributes of &lt;date&gt; tags was done automatically for the medal list but had to be done manually for the main text as the month and year often had to be worked out from the context. With a lot of copying and pasting this took about 2 hours.</p>
<p>With abbreviations marked up it was easy to supply expansions using Find and Replace. This took about half an hour, partly because there were so many different abbreviations, and partly because of time spent researching particularly obscure ones. On reflection I could also have used the Find and Replace features of oXygen (which allow the use of regular expressions AND XPath) to mark up most of the name components but at least I learnt from doing it manually that it took 3 1/2 hours and was very tedious. If forenames were full I set the @full to &#8220;yes&#8221; manually, otherwise I left them, then used F&amp;R to add full="init" to all the ones which didn&#8217;t have @full attributes. I did a similar thing with @type="military" for &lt;roleName&gt; tags.</p>
<p>Once all the names were fully marked up I wrote a Python script to generate a unique @id for every &lt;persName&gt; and &lt;rs&gt;. It also generated @key values for every &lt;persName&gt; from the component parts, in the form [surname], [forenames] [namelink] [servicenumber] [rank]. When I get to record linkage these will need to be standardized, as often the same person is referred to in different ways.</p>
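<p>A minimal sketch of that id and key generation, assuming ids of the form <code>n1</code>, <code>n2</code> and so on (the real id scheme isn&#8217;t stated) and using an invented sample. The helper names are hypothetical.</p>

```python
from xml.dom import minidom

# Invented sample: one named person and one referring string.
SAMPLE = (
    '<p><persName><roleName type="military">Capt.</roleName> '
    '<forename full="init">C.</forename> <surname>Leadbeater</surname>'
    '</persName> reported to <rs>the Battalion C.O.</rs></p>'
)

def text_of(node):
    """Concatenate all the text found beneath a DOM node."""
    return ''.join(child.data if child.nodeType == child.TEXT_NODE
                   else text_of(child) for child in node.childNodes)

def component(el, tag, type_=None):
    """Join the text of every <tag> inside el, optionally filtered by @type."""
    return ' '.join(text_of(c).strip() for c in el.getElementsByTagName(tag)
                    if type_ is None or c.getAttribute('type') == type_)

def make_key(pers):
    """[surname], [forenames] [namelink] [servicenumber] [rank]"""
    rest = [component(pers, 'forename'), component(pers, 'nameLink'),
            component(pers, 'addName', 'servicenumber'),
            component(pers, 'roleName')]
    rest = ' '.join(part for part in rest if part)
    surname = component(pers, 'surname')
    return surname + ', ' + rest if rest else surname

def walk(node):
    """Yield element nodes in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        for found in walk(child):
            yield found

def add_ids_and_keys(doc):
    counter = 0
    for el in walk(doc.documentElement):
        if el.tagName in ('persName', 'rs'):
            counter += 1
            el.setAttribute('xml:id', 'n%d' % counter)  # assumed id scheme
            if el.tagName == 'persName':
                el.setAttribute('key', make_key(el))

doc = minidom.parseString(SAMPLE.encode('utf-8'))
add_ids_and_keys(doc)
pers = doc.getElementsByTagName('persName')[0]
rs = doc.getElementsByTagName('rs')[0]
```

<p>The referring string gets an id but no key, because its key can&#8217;t be derived from its content.</p>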
<p>Referring strings were more difficult. These needed @key values in the same form as &lt;persName&gt; but obviously they can&#8217;t be generated from the content of the tag. It took a lot of research to fill these in but I did most of them in a day. Battalion COs and Adjutants were supplied entirely from internal evidence in other parts of the book. Division Commanders were easy because 1/5th Lincs was always in 46th Division and there&#8217;s a complete list of division commanders at <a href="http://www.1914-1918.net/46div.htm">The Long Long Trail</a>. Someone on the Great War Forum kindly supplied me with a list of Brigadiers of 138th Brigade. Surprisingly Corps and Army commanders were more difficult as I couldn&#8217;t find a complete list of which Corps and Army 46th Division belonged to and who commanded them. With a lot of cross-checking between different parts of the book, along with the Oxford DNB, The Long Long Trail and Great War Forum, Regiments.org, and various Google searches I eventually pieced most of it together and now I&#8217;m only stuck for one Corps Commander. Google, DNB, and Wikipedia together supplied names of several non-military people, such as the Archbishop of York, Vicar of Grimsby, and President of Portugal. In the end there were only a few staff officers and COs of other battalions which I had to ask for help with on the Great War Forum. I still haven&#8217;t got everyone, but where a name is unknown I&#8217;ve given them unique strings like &#8220;unknown wounded NCO 1918-09-25&#8221;.</p>
<p>The next step is to pull out all the names and standardize the key strings so that they match for all instances of the same person, then feed them back into the XML document.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/15/marking-up-names-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sandall Update</title>
		<link>http://www.investigations.4-lom.com/2008/01/09/sandall-update/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/09/sandall-update/#comments</comments>
		<pubDate>Wed, 09 Jan 2008 19:32:41 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[ww1]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/09/sandall-update/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Sandall+Update&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-09&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/09/sandall-update/&amp;rft.language=English"></span>
I&#8217;ve now uploaded a new version of Sandall&#8217;s history of 5th Lincs with each chapter on a separate page. I thought splitting the pages and getting the internal links to work would be difficult but it turned out to be easier than I thought, although it involves some quite complicated XPath expressions. I&#8217;ve also uploaded [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Sandall+Update&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-09&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/09/sandall-update/&amp;rft.language=English"></span>
<p>I&#8217;ve now uploaded a <a href="http://www.4-lom.com/sandall/">new version</a> of Sandall&#8217;s history of 5th Lincs with each chapter on a separate page. I thought splitting the pages and getting the internal links to work would be difficult but it turned out to be easier than I thought, although it involves some quite complicated XPath expressions. I&#8217;ve also uploaded the <a href="http://www.4-lom.com/sandallsource/splitsandall.xsl">new XSLT</a> to show how I did it. This could probably be better in some ways but for now I&#8217;m just pleased that it works. While doing this I decided to change the n attribute of the chapter divs from a number to a slug that could be used to make a Google friendly permalink.</p>
<p>Now I&#8217;m waiting for Google to re-index the site so that the custom search actually works. Meanwhile I&#8217;ve started tagging people, places, dates and abbreviations. More on that when I&#8217;ve finished. I&#8217;m also increasingly confident that the photos are in the public domain (see <a href="http://www.lr.mdx.ac.uk/copyright/">these guidelines</a>, which make things a bit clearer, if they&#8217;re right), so they&#8217;ll probably be added soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/09/sandall-update/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>More progress with Sandall</title>
		<link>http://www.investigations.4-lom.com/2008/01/05/more-progress-with-sandall/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/05/more-progress-with-sandall/#comments</comments>
		<pubDate>Sat, 05 Jan 2008 15:26:33 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/05/more-progress-with-sandall/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=More+progress+with+Sandall&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-05&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/05/more-progress-with-sandall/&amp;rft.language=English"></span>
My project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War has made very good progress this week. I&#8217;ve now uploaded a new HTML version. This features links to page images and a working index: if you click on a page number in the index it takes you [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=More+progress+with+Sandall&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-05&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/05/more-progress-with-sandall/&amp;rft.language=English"></span>
<p>My project to digitize T. E. Sandall’s history of the 1/5th Lincolnshire regiment in the First World War has made very good progress this week. I&#8217;ve now uploaded a <a href="http://www.4-lom.com/sandall/">new HTML version</a>. This features links to page images and a working index: if you click on a page number in the index it takes you to the corresponding part of the text. The whole book is still on one page as I haven&#8217;t worked out how to split it yet but it&#8217;s an improvement over the previous interim version. Below are more details of what I&#8217;ve done and how I&#8217;ve done it.</p>
<p><span id="more-166"></span>The first part of getting the index to work was adding &lt;ref&gt; tags to the page numbers. This was easy to do using a regular expression in jEdit.</p>
<p>Find: (?&lt;=, )[0-9]+(?=,|&lt;)</p>
<p>Replace with: &lt;ref target="p$0"&gt;$0&lt;/ref&gt;</p>
<p>This was based on the assumption that every page reference in the index would be preceded by a comma and a space, and followed by a comma or a closing tag (thankfully index terms were usually followed by commas, which made things much easier). In practice a few page numbers got missed because of exceptions to this rule. Where the whole index term was in double quotes, the closing quote came after the comma. That&#8217;s easy to allow for in future. In one case the comma was missing in the original text, and in another it had been mis-scanned as a full stop. There were a few other inexplicable cases where the regex should have matched but apparently didn&#8217;t. I&#8217;m not sure what was going on there but it didn&#8217;t happen very often.</p>
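<p>The same find and replace can be reproduced in Python to see which cases it catches. This is a sketch with invented index entries; jEdit&#8217;s <code>$0</code> becomes <code>m.group(0)</code> here, and <code>link_pages</code> is a hypothetical helper name.</p>

```python
import re

# A run of digits preceded by ", " and followed by a comma or an opening "<".
PAGE_REF = re.compile(r'(?<=, )[0-9]+(?=,|<)')

def link_pages(index_line):
    """Wrap each matched page number in a <ref> pointing at that page."""
    return PAGE_REF.sub(
        lambda m: '<ref target="p%s">%s</ref>' % (m.group(0), m.group(0)),
        index_line)

# Invented entries, not taken from the real index.
plain = link_pages('<item>Gommecourt, 45, 46, 52</item>')
quoted = link_pages('<item>"Pip," 12</item>')
```

<p>The first entry has all three page numbers wrapped, while the quoted entry comes back unchanged because its page number follows a quote mark and a space rather than a comma and a space, which is the exception described above.</p>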
<p>The regex generated &lt;ref&gt; tags with target attributes pointing to the relevant page numbers. Next I had to give the &lt;pb&gt; elements a corresponding id so that the refs would have something to point to. There was a bit of a dilemma about whether &lt;pb&gt; tags at the start of a chapter should go inside or outside the chapter &lt;div1&gt;. In the end I avoided it by adding the id attribute to the &lt;div1&gt; instead and getting rid of the &lt;pb&gt;. As well as ids for internal linking I added facs attributes to allow linking to page images. This was all done automatically with a Python script which parses an XML document, pulls out all the &lt;pb&gt; elements, loops through them adding consecutively numbered id and facs attributes to each one that doesn&#8217;t already have them (I did the front matter manually because it&#8217;s very short and has its own sequence of roman numerals rather than page numbers in the main sequence). Then the attributes of the first &lt;pb&gt; in each chapter are copied to the &lt;div1&gt; and the &lt;pb&gt; is deleted. Finally the XML is written back to a file.</p>
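<pre>
from xml.dom import minidom
import codecs

book = minidom.parse('sandall2.xml')

# Number every page break that has neither an id nor a facsimile reference yet
# (the front matter was done by hand, so those are skipped).
i = 1
for pb in book.getElementsByTagName('pb'):
    if not pb.hasAttribute('xml:id') and not pb.hasAttribute('facs'):
        pb.setAttribute('xml:id', 'p' + str(i))
        pb.setAttribute('facs', str(i).zfill(3) + '.png')
        i += 1

# Move the attributes of each chapter's first page break onto the chapter div.
for chapter in book.getElementsByTagName('div1'):
    divbreaks = chapter.getElementsByTagName('pb')
    if not divbreaks:
        continue  # no page break in this chapter, so nothing to copy
    first = divbreaks[0]
    chapter.setAttribute('xml:id', first.getAttribute('xml:id'))
    chapter.setAttribute('facs', first.getAttribute('facs'))
    # The pb may not be a direct child of the div, so remove it via its own parent.
    first.parentNode.removeChild(first)

# codecs.open() encodes the output as UTF-8 as it is written.
bookfile = codecs.open('sandall3.xml', 'w', 'utf-8')
book.writexml(bookfile)
bookfile.close()</pre>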
<p>There were two things that caught me out here. First, in Python 2 a file opened with the built-in open() function falls back to ASCII when it is given unicode strings, so writing any non-ASCII character fails. If you want to write unicode XML to a file you need to use codecs.open() and specify utf-8 in the parameters. Second, I forgot that I&#8217;d omitted a chapter, so the pages in the back matter went out of sync, but that was easy to fix manually as there aren&#8217;t very many pages there.</p>
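<p>To see the encoding point in isolation, here is a small round trip, sketched with a temporary file and an invented one-line document containing a curly apostrophe:</p>

```python
import codecs
import os
import tempfile
from xml.dom import minidom

# A tiny document with one non-ASCII character (a right single quotation mark).
doc = minidom.parseString(u'<p>Sandall\u2019s History</p>'.encode('utf-8'))

# Temporary location used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'sample.xml')

# codecs.open() returns a writer that encodes text as UTF-8 on the way out.
out = codecs.open(path, 'w', 'utf-8')
try:
    doc.writexml(out)
finally:
    out.close()

# Read the file back, decoding it as UTF-8 again.
reader = codecs.open(path, 'r', 'utf-8')
try:
    round_trip = reader.read()
finally:
    reader.close()
```

<p>The apostrophe survives the round trip in <code>round_trip</code>. Note that writexml() does not add an encoding to the XML declaration unless it is passed an <code>encoding</code> argument.</p>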
<p>Once I&#8217;d remembered that Major Teall&#8217;s epilogue was missing I added a &lt;gap&gt; element to represent it and indicate why it&#8217;s missing.</p>
<p>Next I wrote some XSL to transform the XML document into HTML, and some CSS to style the HTML. It&#8217;s very easy to get HTML that works but getting completely valid HTML is going to be very difficult and might not be worth bothering with. One particular problem is that I&#8217;ve transformed TEI &lt;pb&gt; elements into HTML &lt;a&gt; elements which link to corresponding page images. These can occur anywhere in the text, including in the middle of a list. HTML doesn&#8217;t like &lt;a&gt; being inside &lt;ul&gt; but not inside &lt;li&gt;. Moving the &lt;a&gt; inside the next &lt;li&gt; would be inconvenient and not true to the structure of the original document. Putting the &lt;a&gt; inside its own &lt;li&gt; would be even worse and would screw up the layout. So I might just have to live with slightly invalid HTML.</p>
<p>I also needed to clean up the HTML output file afterwards. Removing superfluous white space reduced the file size by about 25%! As this is from my own XSLT I can&#8217;t blame anyone else this time. I think part of the problem is the way that oXygen prettifies XML by indenting with spaces and inserting line breaks instead of wrapping lines.</p>
<p>The next step is to work out how to get the XSLT to split the output into separate files for each chapter, and how to keep the internal links working. Then I can start marking up people, places, dates, and abbreviations. As well as the HTML version, I&#8217;ve uploaded the <a href="http://www.4-lom.com/sandall/sandall0.2.xml">XML file</a>, <a href="http://www.4-lom.com/sandall/sandall.xsl">XSLT style sheet</a>, <a href="http://www.4-lom.com/sandall/style1.css">CSS style sheet</a>, and <a href="http://www.4-lom.com/tei_sandall.rnc">XML schema</a> in case anyone is interested. I&#8217;ve now deleted some unused elements from the schema.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/05/more-progress-with-sandall/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TEI Update</title>
		<link>http://www.investigations.4-lom.com/2008/01/02/tei-update/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/02/tei-update/#comments</comments>
		<pubDate>Wed, 02 Jan 2008 11:29:18 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[copyright]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/02/tei-update/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=TEI+Update&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/02/tei-update/&amp;rft.language=English"></span>
Nearly a year ago I started a project to digitize T. E. Sandall&#8217;s history of the 1/5th Lincolnshire regiment in the First World War, and in summer I published an interim version. I would&#8217;ve finished it a long time ago if I didn&#8217;t have anything else to do, but original work in peer reviewed printed [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=TEI+Update&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/02/tei-update/&amp;rft.language=English"></span>
<p>Nearly a year ago I started a project to digitize T. E. Sandall&#8217;s history of the 1/5th Lincolnshire regiment in the First World War, and in summer I published an <a href="http://www.4-lom.com/sandall/">interim version</a>. I would&#8217;ve finished it a long time ago if I didn&#8217;t have anything else to do, but original work in peer reviewed printed journals has to come first because it&#8217;ll look better on my CV. Now I&#8217;ve got time to do some more work on it, and having had a break from it I can reassess what I&#8217;m trying to do. Below is an update on what&#8217;s new.</p>
<p><span id="more-163"></span>First of all I discovered from searching the Times Online archive that Major Teall died in 1939, so I&#8217;ll be able to publish his epilogue to the book when it comes into the public domain in 2010. I&#8217;m still not sure about the photos. I&#8217;ve fought my way through the 1988 Copyright Act, which seems to say that copyright in any photos published before 1988 or taken before 1957 exists until it would have expired under the 1956 Copyright Act, which is 50 years after publication for published works, or 50 years after the death of the author for unpublished works. If I&#8217;ve understood correctly that would put the photos from this book (published in 1922) in the public domain. I&#8217;m still going to leave them out for now because it&#8217;s an extra complication, and because they didn&#8217;t scan very well.</p>
<p>Since I last did some work on this project, the new version of the TEI guidelines (<a href="http://www.tei-c.org/Guidelines/P5/">P5</a>) has been released. Not everyone will need to upgrade from P4 to P5, but I want to use some of the new features. There are now elements and attributes designed for linking to page images, which will be very useful. The upgrade was quite easy as most existing elements haven&#8217;t changed much. P5 is based on a schema rather than a DTD. <a href="http://www.oxygenxml.com/">oXygen</a> 9.1 (which I&#8217;ve just bought a licence for &#8211; only $48 for the academic version, and well worth it) comes with TEI P5 schemas, style sheets, and templates pre-loaded. Because I need to use the modules for names and dates, and digital facsimiles, I need to generate my own schema, but this is easy using <a href="http://tei.oucs.ox.ac.uk/Roma/">Roma</a>, a user-friendly online tool.</p>
<p>As well as making the move to P5, I decided to make some other changes to the XML before moving on to the next stage. I removed the original line breaks, which were marked up with &lt;lb&gt; elements, and soft hyphens, which were represented with entity references. This means I&#8217;ve lost something from the original text which won&#8217;t be easy to get back, but I don&#8217;t really have any use for these line breaks and they were causing more trouble than they were worth. Rather than define a new element to represent soft hyphens just in case someone somewhere might want to see the text with original line breaks, I decided to get rid of the problem. Anyone who does want to see the original layout will be able to look at the page images anyway. Removing these elements was easy using Find and Replace with regular expressions in jEdit.</p>
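<p>The find-and-replace step described above can be sketched in a few lines of Python. This is only an illustration, not the jEdit expressions actually used: it assumes that line breaks appear as empty &lt;lb/&gt; elements, that soft hyphens are written as the &amp;shy; entity reference immediately before the break, and the function name is hypothetical.</p>

```python
import re

SOFT_HYPHEN = re.escape("&shy;")  # assumed spelling of the entity reference
LINE_BREAK = r"<lb\s*/>"          # assumed form of the line-break element


def remove_line_breaks(xml_text: str) -> str:
    """Drop line-break elements, rejoining words split by a soft hyphen."""
    # A soft hyphen followed by a break marks a word split across two lines:
    # delete both, plus surrounding whitespace, so the two halves rejoin.
    joined = re.sub(SOFT_HYPHEN + r"\s*" + LINE_BREAK + r"\s*", "", xml_text)
    # Any remaining break separated two whole words: collapse it to one space.
    return re.sub(r"\s*" + LINE_BREAK + r"\s*", " ", joined)


sample = "the 1/5th Lincoln&shy;<lb/>\nshire Regiment <lb/>\nmoved up"
print(remove_line_breaks(sample))
```

<p>Running the hyphenated case first matters: if the plain breaks were replaced with spaces first, the split words would end up with a stray entity and a space in the middle.</p>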
<p>I also decided to get rid of the various elements which were used to represent double quotes. These were &lt;term&gt;, &lt;distinct&gt;, &lt;foreign&gt;, and &lt;soCalled&gt;. I already knew that I wasn&#8217;t going to be using these elements for anything. There are so many possibilities in TEI that it&#8217;s easy to get carried away and start marking up features just because you can. Now I&#8217;ve decided to only use markup which is required by the TEI guidelines, or which has a specific purpose for the resource I&#8217;m trying to create.</p>
<p>My <a href="http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/">first attempt</a> at transforming the text to HTML using the <a href="http://www.tei-c.org/Tools/Stylesheets/">TEI style sheets</a> wasn&#8217;t very promising. I thought then that I just needed to play around with the parameters to make things better, but now, having played around with the parameters quite a lot, I don&#8217;t think it&#8217;s going to be much use. I was always going to have to write some custom XSL to deal with people, places, and dates, but it looks like the TEI XSL won&#8217;t even give me a basic HTML version split into separate files for each chapter. Although it supposedly does this, it looks like it&#8217;s based on the assumption that internal links will use query strings, whereas I just want plain old relative links to plain old HTML files at this stage. It also puts a lot of junk into the HTML whatever parameters you set. For example, if you want an id for each paragraph it doesn&#8217;t add an id attribute to the &lt;p&gt; tag, it inserts an empty anchor at the start of every paragraph! And every &lt;a&gt; tag has an XML namespace declaration, even if you&#8217;ve asked for HTML 4! So I&#8217;ll have to get better at XSL and write my own style sheet from the ground up.</p>
<p>The next step is to generate id and facs attributes based on page numbers for every &lt;pb&gt; element, and put in &lt;ref&gt; tags in the index to point to these page numbers. I also need to decide whether to keep the original table of contents or generate one automatically using XSLT. After that I can either write some XSLT and publish another interim version, or move straight on to marking up people, places, and dates.</p>
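<p>As a rough sketch of that next step, the attributes could be generated with a short Python script rather than by hand. Everything specific here is an assumption for illustration: that each &lt;pb&gt; already carries its page number in an n attribute, that the id is stored as xml:id, and that page images follow a naming pattern like images/page12.png. The function name is hypothetical.</p>

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"               # TEI P5 namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialises as xml:id


def number_page_breaks(root, image_pattern="images/page{n}.png"):
    """Give every <pb n="..."> an xml:id and a facs attribute derived from n."""
    for pb in root.iter(f"{{{TEI_NS}}}pb"):
        n = pb.get("n")
        if n is None:
            continue  # leave page breaks without a number untouched
        # XML ids cannot start with a digit, hence the "page" prefix.
        pb.set(XML_ID, f"page{n}")
        # Hypothetical image naming scheme; adjust to the real scan filenames.
        pb.set("facs", image_pattern.format(n=n))
    return root


sample = f'<text xmlns="{TEI_NS}"><body><p>One<pb n="12"/>two</p></body></text>'
root = number_page_breaks(ET.fromstring(sample))
pb = root.find(f".//{{{TEI_NS}}}pb")
print(pb.get(XML_ID), pb.get("facs"))
```

<p>With ids of this form, a &lt;ref&gt; in the index could point at a page with a target such as #page12.</p>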
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/02/tei-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unexpected Progress</title>
		<link>http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/</link>
		<comments>http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/#comments</comments>
		<pubDate>Wed, 11 Jul 2007 17:35:56 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[copyright]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Unexpected+Progress&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-07-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/&amp;rft.language=English"></span>
It&#8217;s been a long time since I wrote anything about my First World War digitization projects, but I now have some progress to report: today I published an interim version of Sandall&#8217;s History of 5th Lincolnshire Regiment. It&#8217;s still a work in progress, and there&#8217;s a lot more to be done, but you can see [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Unexpected+Progress&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2007-07-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/&amp;rft.language=English"></span>
<p>It&#8217;s been a long time since I wrote anything about my First World War digitization projects, but I now have some progress to report: today I published an interim version of Sandall&#8217;s History of 5th Lincolnshire Regiment. It&#8217;s still a work in progress, and there&#8217;s a lot more to be done, but you can see it <a href="http://www.4-lom.com/sandall/">here</a>. It&#8217;s just a plain HTML version (and not strictly valid HTML), and the whole text is on one page (at least it makes it easy to search the whole text with your browser&#8217;s Find feature!). There&#8217;s no name linkage yet, no page images online, and no mechanism for submitting corrections. However, even in this form it should be useful to people who are researching the battalion and can&#8217;t get hold of the original book. More details on what I&#8217;ve done and how I&#8217;ve done it below.</p>
<p><span id="more-99"></span>That I haven&#8217;t posted about this project since February might suggest that it&#8217;s been slow and difficult, but actually I was just busy with other things (writing articles, applying for jobs, organising the Military History Carnival). Until then, things were going surprisingly smoothly and quickly. I was hoping that I&#8217;d be able to get everything finished before my Oxygen free trial finished, but because I was doing other things it&#8217;s now expired, and although a licence doesn&#8217;t cost much, I need the money for other things (especially work on peer reviewed things which might get published in a &#8220;proper&#8221; journal and look better on my CV than a self-published digital edition). Fortunately the last thing I did on the project in February was try an XSLT transform to make a test HTML version of the text. Initially this just showed me that I needed to play around with the transform parameters to get what I wanted, but I&#8217;ve now decided that it&#8217;s good enough to form the basis of an interim edition. (In theory I could use the free version of Saxon to do more transforms, but it&#8217;s a scary command line application!)</p>
<p>The HTML file that came out of the transform was over 900K but I cleaned it up using Find/Replace and regular expressions in jEdit, getting it down to 385K. That might still take a while to download if you&#8217;re on dial-up but it&#8217;s not too bad for a whole book. Every paragraph element had been given a name attribute, which isn&#8217;t necessary for this version as only chapters are linked to from the contents, so I stripped them all out. There were also some xmlns attributes which didn&#8217;t appear to be serving any purpose and must have added a huge amount to the file size, and a huge amount of superfluous white space.</p>
<p>As well as getting the file size down, I needed to make some other adjustments. Some of these were down to the transform settings that I&#8217;d used in Oxygen for the test run, but others showed up some possible limitations of TEI/XML. The master XML document preserves the original line breaks and hyphenation. I also left these in during the test transform but in practice an HTML version without line breaks is more useful so I used jEdit to take out all the &lt;br&gt; elements. Here I encountered a potential problem. Following the advice of the TEI guidelines I used the &amp;shy; entity to represent soft hyphens. In practice this isn&#8217;t very helpful. If the HTML version is to keep the original line breaks then the soft hyphens need to be converted to hard hyphens so that they display properly, but if the original line breaks aren&#8217;t kept then the hyphens need to be removed along with the breaks. I&#8217;m not sure if XSLT can actually do this. My understanding is that it only deals with XML tags, but I could be wrong there. If it can&#8217;t do anything with entity references then there would be a need for some extra finding and replacing, but there might be anyway depending on how good I can get the output of the XSLT. Maybe it would be better to have an XML element which represents a soft hyphen, or add an attribute to the &lt;lb&gt; elements to indicate that they&#8217;re preceded by a soft hyphen. If the line breaks and hyphens could be dealt with properly by XSLT then it would make sense to not have a space before any &lt;lb&gt; elements in the source XML.</p>
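<p>On the question of whether a transform can see soft hyphens at all: an XML parser normally expands a declared entity reference such as &amp;shy; into the character it stands for (U+00AD) before any stylesheet runs, so the soft hyphen should arrive as ordinary text that can be tested for or stripped. The Python sketch below illustrates the parsing step only, not XSLT itself. It assumes the entity is declared in the document&#8217;s DTD (here a tiny internal subset stands in for the real declarations), and the flatten function is a hypothetical helper.</p>

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares shy; it is an assumed stand-in for the
# entity declarations that the real master file would get from its DTD.
doc = '<!DOCTYPE p [<!ENTITY shy "&#173;">]><p>regi&shy;<lb/>ment marched</p>'

p = ET.fromstring(doc)
# By now the parser has replaced the entity reference with the character U+00AD.
print(repr(p.text))


def flatten(elem):
    """Return elem's text with <lb/> dropped and soft-hyphenated words rejoined."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "lb":
            if parts[-1].endswith("\u00ad"):
                # Soft hyphen before the break: remove it so the word rejoins.
                parts[-1] = parts[-1][:-1]
            else:
                # Ordinary break between two words: keep a single space.
                parts[-1] = parts[-1].rstrip() + " "
            parts.append((child.tail or "").lstrip())
        else:
            parts.append("".join(child.itertext()) + (child.tail or ""))
    return "".join(parts)


print(flatten(p))
```

<p>The same idea should carry over to a stylesheet, where the XPath translate() function can remove a given character from a string.</p>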
<p>I found some erroneous paragraph breaks in the middle of sentences. These turned out to be mistakes in the master XML file. I&#8217;m not sure how they came about, but they&#8217;re easy to track down and fix with a regexp (just find any &lt;p&gt; element immediately preceded by a comma or a letter). The XSLT had used &lt;em&gt; tags to highlight words and phrases marked up with &lt;term&gt; or &lt;distinct&gt; but I changed these to double quotes to match the text. In future (if I ever digitize another book) I&#8217;m intending to leave in the original quotes rather than marking up their meaning.</p>
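<p>A minimal sketch of that kind of regexp check, in Python rather than jEdit. The pattern is an assumption about what a bad break looks like in the file: a paragraph that ends in a comma or a letter, with no closing punctuation, followed directly by the start of the next paragraph. The function names are hypothetical.</p>

```python
import re

# Assumed shape of a bad break: comma or letter, then </p>, then <p>.
BAD_BREAK = re.compile(r"([A-Za-z,])\s*</p>\s*<p>\s*")


def find_bad_breaks(xml_text):
    """Return the character offsets of suspicious paragraph breaks."""
    return [m.start() for m in BAD_BREAK.finditer(xml_text)]


def join_bad_breaks(xml_text):
    """Merge the two halves of each suspicious break into one paragraph."""
    return BAD_BREAK.sub(r"\1 ", xml_text)


sample = "<p>The battalion moved,</p>\n<p>by night.</p>\n<p>Next day it rested.</p>"
print(join_bad_breaks(sample))
```

<p>Listing the offsets first and checking each one by eye is the safer route, since a heading or a line of verse could match the same pattern.</p>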
<p>It was quite ironic to discover that the XSLT had automatically inserted a copyright statement at the end of the text! (Copyright is more than just a law: it&#8217;s an insidious ideology) I corrected it manually to make it clear that the work is in the public domain. Other manual adjustments included moving the contents to before the preface (more convenient for readers) and adding some CSS to limit the width of the text divs.</p>
<p>So now there&#8217;s a basic version of the text online in a form which should be of some use to at least some people. It&#8217;s certainly an improvement on taking requests on the Great War Forum and posting the excerpts that people ask for. There&#8217;s still a lot to be done but it&#8217;s good to be making some tangible progress again.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2007/07/11/unexpected-progress/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>

