<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Investigations of a Dog &#187; sql</title>
	<atom:link href="http://www.investigations.4-lom.com/tag/sql/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.investigations.4-lom.com</link>
	<description>Failing better at understanding the past</description>
	<lastBuildDate>Sun, 05 Feb 2012 09:18:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Marking Up Names: Part 2</title>
		<link>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/</link>
		<comments>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/#comments</comments>
		<pubDate>Sat, 19 Jan 2008 15:01:34 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[5th lincs]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[exhibit api]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sandall]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/&amp;rft.language=English"></span>
My digital edition of Sandall&#8217;s History of 1/5th Lincolnshire Regiment now has a new index of people. In my last post I described how names were marked up in the text. This post is about how I linked them together. I&#8217;d already used a Python script to generate a unique id for every &#60;persName&#62; and [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Marking+Up+Names%3A+Part+2&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2008-01-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/&amp;rft.language=English"></span>
<p>My digital edition of Sandall&#8217;s History of 1/5th Lincolnshire Regiment now has a new <a href="http://www.4-lom.com/sandall/people-index.html">index of people</a>. In my last post I described how names were marked up in the text. This post is about how I linked them together.</p>
<p><span id="more-170"></span>I&#8217;d already used a Python script to generate a unique id for every &lt;persName&gt; and &lt;rs&gt; tag, and to build a key using components of the name. Next I wrote another Python script to pull all the names out of the XML document and put them into a SQLite database so that I could standardize the key values. The SQLite clients I have are good for running queries but not so good for manually editing data, so I exported the table to CSV and opened it in Excel. That was possibly a mistake: when I came to feed the data back into the XML I ran into some nasty character encoding problems, but it was partly my own fault for letting some unnecessary non-ASCII characters creep in through copying and pasting text from web pages. Apart from that, editing the data in a spreadsheet was easy and convenient.</p>
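<p>A minimal sketch of that extraction step, not the original script. It assumes each name element already carries an xml:id and a draft @key, builds a tiny sample tree in code instead of reading the real TEI file (so there is no TEI namespace to deal with), and uses table and column names invented for illustration.</p>

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Tiny stand-in document built in code; the real edition is a TEI file.
body = ET.Element("body")
para = ET.SubElement(body, "p")
for tag, ident, key, text in [
    ("persName", "n1", "Sandall T E", "Col. Sandall"),
    ("rs", "n2", "Sandall T E", "the Commanding Officer"),
    ("persName", "n3", "Pickard H", "Pte. Pickard"),
]:
    el = ET.SubElement(para, tag, {XML_ID: ident, "key": key})
    el.text = text

# Pull every persName and rs element into one SQLite table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id TEXT PRIMARY KEY, tag TEXT, key TEXT, text TEXT)")
for el in body.iter():
    if el.tag in ("persName", "rs"):
        conn.execute(
            "INSERT INTO names VALUES (?, ?, ?, ?)",
            (el.get(XML_ID), el.tag, el.get("key"), el.text),
        )
conn.commit()

# Export the table as CSV text so the keys can be edited in a spreadsheet.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "tag", "key", "text"])
writer.writerows(conn.execute("SELECT id, tag, key, text FROM names ORDER BY id"))
csv_text = out.getvalue()

# Flag any keys containing non-ASCII characters before the round trip.
non_ascii_ids = [
    row[0] for row in conn.execute("SELECT id, key FROM names") if not row[1].isascii()
]
```

<p>When the CSV is written to disk for Excel, opening the file with <code>encoding="utf-8-sig"</code> adds a byte order mark, which may help Excel detect UTF-8. The <code>non_ascii_ids</code> list shows which keys are worth checking by hand.</p>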
<p>I decided to standardize the keys in the following form: surname, full forenames (or initials where the names are not known), service number (Other Ranks only; post-1917 six-digit numbers preferred, as these were unique within the regiment), rank (officers only; highest rank known to have been held during the First World War), years of birth and death (if known). Officers commissioned from the ranks can have both a rank and a number. Here are a few examples:</p>
<p>Sandall, Thomas Edward, Lt-Col (d 1930)<br />
Stuart-Wortley, Edward James Montagu, Maj-Gen (1857-1934)<br />
Pickard, Herbert, 240010 (d 1917)<br />
Leadbeater, Conrad, 240252 Capt<br />
Howard, William, 241683</p>
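<p>The key format above can be sketched as a small function. This is an illustration reconstructed from the five examples, not the script actually used; the function name, its parameters, and the birth-only form <code>(b YYYY)</code> are assumptions.</p>

```python
def build_key(surname, forenames, number=None, rank=None, born=None, died=None):
    """Assemble a standardized key: surname, forenames, number and/or rank, dates."""
    parts = [surname, forenames]
    # Service number (Other Ranks) and rank (officers) share one slot;
    # an officer commissioned from the ranks has both.
    service = " ".join(str(x) for x in (number, rank) if x)
    if service:
        parts.append(service)
    key = ", ".join(parts)
    if born and died:
        key += " (%s-%s)" % (born, died)
    elif died:
        key += " (d %s)" % died
    elif born:
        # Assumed form: the examples include no birth-only case.
        key += " (b %s)" % born
    return key


examples = [
    build_key("Sandall", "Thomas Edward", rank="Lt-Col", died=1930),
    build_key("Leadbeater", "Conrad", number=240252, rank="Capt"),
    build_key("Howard", "William", number=241683),
]
```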
<p>In the first pass I just wanted to disambiguate names as quickly as possible. In most cases this could be done from internal evidence, particularly the original index. There were some cases where I needed to check online sources, but even so the process only took a few hours and was completed within one day. There were a few names which couldn&#8217;t be disambiguated with any certainty so I&#8217;ve had to ask the 5th Lincs experts for help.</p>
<p>In the second pass I tried to make the keys as full and detailed as possible by searching online sources for full names, service numbers, and dates of death. This was much more labour intensive and took several days. It wasn&#8217;t helped by the fact that both Sandall&#8217;s book and the online sources contain many errors and inconsistencies, especially with names and service numbers. This led to lots of lateral thinking, deduction, trial and error, and ultimately frustration. There are still some cases where I couldn&#8217;t positively identify an individual or couldn&#8217;t find any trace of any possible candidates in the available records. Most of these are junior officers who are only mentioned briefly. All this research wasn&#8217;t strictly necessary at this stage, but I&#8217;m looking ahead to the possibility of using a wiki for biography pages, which will need page names in the above form.</p>
<p>Once the keys were consistent and as detailed as possible I re-imported the CSV into the database and wrote yet another Python script to loop through every name element in the XML, pull out the database record with matching id, and write the new key value from the database to the @key attribute of the XML element. This went smoothly once I&#8217;d got rid of all the non-ASCII characters! Then I added some code to the XSLT to generate the index of people by pulling out all the name elements, grouping them by key value, and creating links to each occurrence of that name. This is still looking quite basic, but it works.</p>
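<p>A hedged sketch of the write-back and grouping steps. The sample tree, the table layout, and the <code>draft</code> placeholder keys are assumptions, and the grouping is done here in Python purely for illustration, whereas the post does it in XSLT.</p>

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Stand-in for the database after the cleaned CSV has been re-imported.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id TEXT PRIMARY KEY, key TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("n1", "Sandall, Thomas Edward, Lt-Col (d 1930)"),
        ("n2", "Sandall, Thomas Edward, Lt-Col (d 1930)"),
        ("n3", "Pickard, Herbert, 240010 (d 1917)"),
    ],
)

# Stand-in document: three name elements with placeholder keys.
body = ET.Element("body")
for tag, ident in [("persName", "n1"), ("rs", "n2"), ("persName", "n3")]:
    ET.SubElement(body, tag, {XML_ID: ident, "key": "draft"})

# Write the standardized key from the database onto each matching element.
for el in body.iter():
    if el.tag not in ("persName", "rs"):
        continue
    row = conn.execute(
        "SELECT key FROM names WHERE id = ?", (el.get(XML_ID),)
    ).fetchone()
    if row is not None:
        el.set("key", row[0])

# Group occurrences by key, which is the shape an index of people needs.
index = defaultdict(list)
for el in body.iter():
    if el.tag in ("persName", "rs"):
        index[el.get("key")].append(el.get(XML_ID))
index_entries = sorted(index.items())
```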
<p>The next step is to make each occurrence of a name link back to the index. There&#8217;s a complication here because I need the entries in the index to have valid unique ids so that I can link to them, but I also need the full regularized forms of the names to appear in the index list. This places conflicting demands on the @key attribute, because xml:ids can&#8217;t contain spaces, brackets, or commas. TEI used to have a @reg attribute for names which could contain a regularized form, but this was removed in P5. The way you&#8217;re supposed to do it now is to put the regularized form in a &lt;reg&gt; element and wrap it, together with the original form, in a &lt;choice&gt; element. I could do that and use a modified form for the @key. But I&#8217;m also thinking about putting the name data into a JSON file so I can do cool things with <a href="http://simile.mit.edu/exhibit/">Exhibit</a>. That would mean that I wouldn&#8217;t have to embed the regularized form of the name in every occurrence &#8211; just have an abbreviated form that can be used as a key for name occurrences and as an id for the index entry. It would also allow me to store and use more data than can be accommodated in TEI XML. For example, during the research/disambiguating phase I collected a lot of links to medal cards and CWGC database entries. If these were in a JSON file I could easily link to them from the index page without much extra work. Now I need to work out exactly what I want to do with this and how to structure the data.</p>
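<p>One possible shape for that abbreviated form and the JSON file, offered as a sketch rather than a decision. The <code>key_to_id</code> function, the <code>p-</code> prefix, the <code>occurrences</code> property, and the occurrence ids are all hypothetical; the <code>items</code> list with <code>label</code> and <code>type</code> properties follows Exhibit&#8217;s JSON layout as commonly described and should be checked against its documentation.</p>

```python
import json
import re


def key_to_id(key):
    """Reduce a regularized name to a string usable as an xml:id."""
    # Collapse every run of characters other than ASCII letters and digits.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", key).strip("-")
    # An xml:id must start with a letter or underscore, so add a fixed prefix.
    return "p-" + slug


# Hypothetical sample data: full keys plus the ids of their occurrences.
people = [
    {"key": "Pickard, Herbert, 240010 (d 1917)", "occurrences": ["n3"]},
    {"key": "Howard, William, 241683", "occurrences": ["n7", "n9"]},
]

items = [
    {
        "id": key_to_id(person["key"]),
        "label": person["key"],
        "type": "Person",
        "occurrences": person["occurrences"],
    }
    for person in people
]
exhibit_json = json.dumps({"items": items}, indent=2)
```

<p>Because accented letters and punctuation are all collapsed to hyphens, two different names could in principle produce the same id, so a real version would need a uniqueness check.</p>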
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2008/01/19/marking-up-names-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bibliography Databases</title>
		<link>http://www.investigations.4-lom.com/2006/10/30/bibliography-databases/</link>
		<comments>http://www.investigations.4-lom.com/2006/10/30/bibliography-databases/#comments</comments>
		<pubDate>Mon, 30 Oct 2006 21:31:01 +0000</pubDate>
		<dc:creator>Gavin Robinson</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[bibliographies]]></category>
		<category><![CDATA[digital history]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[zotero]]></category>

		<guid isPermaLink="false">http://www.investigations.4-lom.com/2006/10/30/bibliography-databases/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Bibliography+Databases&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2006-10-30&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2006/10/30/bibliography-databases/&amp;rft.language=English"></span>
Time to start filling up the &#8220;Information Technology&#8221; category then. Anyone who isn&#8217;t interested in SQL should probably look away now. I&#8217;ll be posting some thoughts on Zotero sooner or later, but this post is about my own attempts at making bibliographical databases. I&#8217;ve always preferred doing it myself to using off the shelf solutions, [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Bibliography+Databases&amp;rft.aulast=Robinson&amp;rft.aufirst=Gavin&amp;rft.subject=History&amp;rft.source=Investigations+of+a+Dog&amp;rft.date=2006-10-30&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.investigations.4-lom.com/2006/10/30/bibliography-databases/&amp;rft.language=English"></span>
<p>Time to start filling up the &#8220;Information Technology&#8221; category then. Anyone who isn&#8217;t interested in SQL should probably look away now. I&#8217;ll be posting some thoughts on <a title="Zotero" href="http://www.zotero.org/">Zotero</a> sooner or later, but this post is about my own attempts at making bibliographical databases. I&#8217;ve always preferred doing it myself to using off the shelf solutions, which can have advantages and disadvantages.</p>
<p><span id="more-14"></span></p>
<p>When I started my PhD (in 1997) I was using file cards to keep track of my reading. Once I got a laptop (not until 1999) I designed an Access database to replace the cards. It was quite simple. The records were all stored in one table with the standard fields such as author, title, publisher, place etc. Originally it could only store one location for each work. This was no problem because I had a fixed hierarchy of where to find and consult books: Reading University library was first choice; if something wasn&#8217;t there I&#8217;d look in the IHR; if it wasn&#8217;t there I&#8217;d go to the British Library. This proved inadequate when I went home from Reading in 2000 and was mostly working at Leeds University Library, so I had to add extra fields for Leeds locations. Works were assigned to one of four categories: primary, secondary, bibliography, or manuscript list. In addition I added a rudimentary subject search and another field for counties covered. With this basic table and a form as a front end it was quite easy to add data, although data entry was all manual, and by the end of my PhD I had over 800 records. With searches, sorting, filters, and queries I could usually find the records I wanted, and I could use reports to print a nicely formatted list of books to look for. This got me through my PhD, and when I&#8217;d finished my thesis (in 2001) I just mailmerged my bibliography into Word. Designing this and other databases during my PhD gave me good experience of Access.</p>
<p>This year I decided I needed something better, which didn&#8217;t run on Access 97. I now realise that Access query objects are horribly bloated front ends for what is essentially just a line of SQL code. I decided to use PHP and MySQL running on localhost. I already had a local Apache server with PHP and MySQL (installed painlessly using <a title="WAMP server" href="http://www.en.wampserver.com/">WAMP</a>) which I use for testing websites. The big advantage of this approach is being able to use HTML pages as a front end rather than learning any APIs. Also I wanted to improve my PHP and MySQL skills, which would be valuable even if the project itself was a failure. Although I had experience of PHP and SQL I&#8217;d never used them together before. It turned out that this wasn&#8217;t a big deal because combining them only involves learning to use a couple of new functions. The biggest disadvantage was having to program everything from scratch, which was quite time consuming, but this at least added a lot to my programming experience.</p>
<p>Since the database was only ever going to be for my own personal use and was being created from the ground up I had freedom to try some different ideas. It&#8217;s still a work in progress, which might be abandoned now that Zotero is out. Data entry is still all manual because I haven&#8217;t got round to learning how to automate it. I was able to import all the records from my old Access database but they needed a lot of manual tidying up because the database structures were so different. Because the new database is only for my use on my own PC I haven&#8217;t made any attempt to make it secure or efficient. It would easily kill a webserver if it was open to the public and is wide open to SQL injection attacks.</p>
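<p>To make the SQL injection point concrete, here is a small sketch in Python with SQLite rather than the PHP and MySQL the database actually uses. The table, titles, and function names are invented; it only contrasts pasting a search term into the SQL text with passing it as a parameter.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO works (title) VALUES (?)",
    [("Hypothetical Title One",), ("Hypothetical Title Two",)],
)


def search_unsafe(term):
    # Vulnerable: the search term becomes part of the SQL statement itself.
    sql = "SELECT title FROM works WHERE title = '%s'" % term
    return [row[0] for row in conn.execute(sql)]


def search_safe(term):
    # Parameterized: the driver passes the term as data, never as SQL.
    query = "SELECT title FROM works WHERE title = ?"
    return [row[0] for row in conn.execute(query, (term,))]


# A crafted term that turns the unsafe WHERE clause into an always-true test.
attack = "x' OR '1'='1"
unsafe_hits = search_unsafe(attack)
safe_hits = search_safe(attack)
```

<p>With the crafted term, the unsafe version returns every row while the parameterized version returns none.</p>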
<p>These are some of the features I experimented with and some of their advantages and disadvantages:</p>
<h3>Separation of authors and works:</h3>
<p>The Access database, seemingly in common with most library catalogues, had a field for authors&#8217; names in the record for each work. With the new database I decided to try a separate table for authors, with a third table to link author IDs to work IDs (and a fourth table to link the IDs if the author(s) had edited rather than written the work; in case you&#8217;re wondering I made myself number 1, which is probably enough hubris to ensure that I never get anything published in print). The idea was that it would be easier to keep track of works written by the same author, and to differentiate between authors with the same or similar names. Also, having a master record allows me to keep additional notes on an author and to split the name into constituent parts. This gives much more flexibility, because I can choose whether to output the surname with initials, first forename, full forenames, or just the surname on its own (doesn&#8217;t work so well with books credited to an organisation rather than individuals eg &#8220;Great Britain &#8211; War Office&#8221;).</p>
<p>The biggest problem with this approach is that it&#8217;s so different from how most OPACs are structured that fully automated data scraping would be virtually impossible. There would always be a need for some manual intervention in linking authors to works. Selecting the correct author ID is always a subjective decision, and I&#8217;ve already had some problems disambiguating authors with the same name. If the author is already in the database, the linking process only involves clicking on links to select the author from the list, but if the author isn&#8217;t already present, details have to be entered manually. It would be nice if there was a central database of all published authors with biographical details and complete lists of all works published. Since there isn&#8217;t, I think trying to organise my data like this is probably more trouble than it&#8217;s worth. There is also an efficiency issue, because retrieving the details of authors requires lots of extra queries. Because of this, searches which return hundreds of works can be noticeably slow.</p>
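<p>The four-table layout described in this section might look something like the sketch below. It uses Python with SQLite instead of PHP and MySQL, and the table names, column names, sample rows, and <code>format_name</code> function are assumptions. The single JOIN at the end is one possible way to fetch works and their authors together, not a description of how the real database queries them.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, surname TEXT, forenames TEXT, notes TEXT);
CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE work_authors (work_id INTEGER, author_id INTEGER);
CREATE TABLE work_editors (work_id INTEGER, author_id INTEGER);
INSERT INTO authors VALUES (1, 'Robinson', 'Gavin', NULL);
INSERT INTO authors VALUES (2, 'Example', 'Ann Mary', NULL);
INSERT INTO works VALUES (1, 'A Hypothetical Article');
INSERT INTO work_authors VALUES (1, 1);
INSERT INTO work_authors VALUES (1, 2);
""")


def format_name(surname, forenames, style):
    """Render a name as surname only, with initials, first forename, or in full."""
    names = forenames.split()
    # Organisations have no forenames, so fall back to the bare name.
    if style == "surname" or not names:
        return surname
    if style == "initials":
        return surname + ", " + " ".join(n[0] + "." for n in names)
    if style == "first":
        return surname + ", " + names[0]
    return surname + ", " + forenames


# One JOIN returns every work with its authors, avoiding a query per author.
rows = conn.execute(
    """
    SELECT w.title, a.surname, a.forenames
    FROM works w
    JOIN work_authors wa ON wa.work_id = w.id
    JOIN authors a ON a.id = wa.author_id
    ORDER BY w.id, a.surname
    """
).fetchall()
credits = [(title, format_name(s, f, "initials")) for title, s, f in rows]
```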
<h3>Containers:</h3>
<p>Journal articles, essays, and volumes of serial publications are entered in the main works table in their own right, but also linked to the containing journal, collection, or series. This means that I only have to enter the details of a journal once and those details can be shared by all the articles linked to that journal. The record for each article stores its own year, volume number, and page numbers, but pulls the journal title from the journal&#8217;s record. This gives much more consistency and creates less work, especially with long and obscure titles of local antiquarian journals. Like the separation of authors this can lead to inefficiency and presents problems for fully automated data capture. However, there are no problems with disambiguation like there are with authors and so more likelihood of the computer being able to match journals automatically (although I haven&#8217;t tried it yet).</p>
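<p>A sketch of the container idea, again in Python with SQLite and with an invented schema and invented titles: articles and journals live in the same works table, and an article points at its journal through a <code>container_id</code> column (a hypothetical name).</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE works (
    id INTEGER PRIMARY KEY,
    title TEXT,
    container_id INTEGER REFERENCES works(id),
    year INTEGER,
    volume TEXT,
    pages TEXT
);
INSERT INTO works VALUES (1, 'Hypothetical Antiquarian Journal', NULL, NULL, NULL, NULL);
INSERT INTO works VALUES (2, 'First Example Article', 1, 1998, '12', '1-20');
INSERT INTO works VALUES (3, 'Second Example Article', 1, 2001, '15', '33-48');
""")


def cites():
    # Each article supplies its own year, volume, and pages,
    # but the journal title comes from the container record.
    return conn.execute(
        """
        SELECT a.title, j.title, a.volume, a.year, a.pages
        FROM works a
        JOIN works j ON j.id = a.container_id
        ORDER BY a.year
        """
    ).fetchall()


before = cites()
# Correcting the journal title once changes it for every linked article.
conn.execute(
    "UPDATE works SET title = ? WHERE id = 1",
    ("Hypothetical Antiquarian Journal, New Series",),
)
after = cites()
```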
<h3>Multiple Locations:</h3>
<p>Each library that I might use has an ID and a priority number stored in a master locations table. This table is linked to a work locations table, which stores shelf marks of each work in each library that it can be located in, along with the work ID to link to the records in the main works table. When a work record is displayed, it pulls all the locations of that work from the locations table and displays them in order of priority numbers. A copy of the highest priority location of each work is stored in the work record itself to allow more efficient searches. This is automatically updated whenever the work is linked to a new location, or when priorities are changed. Articles and essays can be linked to the location of their containing journal or collection, or be given their own location. There is effectively no limit on the number of locations I can store, and I can change their priorities by editing the numbers in the master locations table. So if I start using a different library as a result of getting a job at a university at the other end of the country the database can easily accommodate the changed circumstances. This flexibility and future-proofing make it a pretty good feature. However, it multiplies the number of tables and queries, so again not good for efficiency.</p>
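<p>The locations scheme could be sketched as follows. The schema, shelf marks, and function names are hypothetical, the code is Python with SQLite rather than PHP and MySQL, and it assumes that a lower priority number means a more preferred library, which the description does not actually specify.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT, priority INTEGER);
CREATE TABLE work_locations (work_id INTEGER, location_id INTEGER, shelf_mark TEXT);
CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT, best_location TEXT);
INSERT INTO locations VALUES (1, 'Leeds University Library', 1);
INSERT INTO locations VALUES (2, 'British Library', 2);
INSERT INTO works VALUES (1, 'A Hypothetical Book', NULL);
INSERT INTO work_locations VALUES (1, 2, 'HYPO.1');
INSERT INTO work_locations VALUES (1, 1, 'HYPO 2');
""")


def locations_for(work_id):
    # All known locations of a work, most preferred library first.
    return conn.execute(
        """
        SELECT l.name, wl.shelf_mark
        FROM work_locations wl
        JOIN locations l ON l.id = wl.location_id
        WHERE wl.work_id = ?
        ORDER BY l.priority
        """,
        (work_id,),
    ).fetchall()


def refresh_best_location(work_id):
    # Cache the top-priority location on the work record for cheaper searches.
    found = locations_for(work_id)
    best = ("%s: %s" % found[0]) if found else None
    conn.execute("UPDATE works SET best_location = ? WHERE id = ?", (best, work_id))
    return best


before = refresh_best_location(1)
# Moving to a different library only means editing the master priorities.
conn.execute("UPDATE locations SET priority = 3 WHERE id = 1")
after = refresh_best_location(1)
```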
<h3>No distinction between primary and secondary works:</h3>
<p>This is an absence of a feature rather than a feature, but it needs explaining. Originally I did have a field for primary or secondary, just like the old Access database. Then I decided to see if I could do without it. Since getting interested in theory I&#8217;ve been finding the distinction between primary and secondary at best blurred and at worst redundant and even misleading. The database seems to be working well enough without it, but the empirical part of my brain still knows the difference so I&#8217;ve either failed to break out of the old ideology, or succeeded in keeping my grip on reality (depending on your point of view).</p>
<p>Overall, creating the new database has been good experience and I&#8217;ve learnt a lot from it, but I can&#8217;t help wondering if the time taken up by design, programming, and data entry could have been better spent. Even if Zotero turns out not to do everything I want, it has a lot of advantages over anything I could create myself. One of the most exciting things is the potential for collaboration, especially if it becomes a de facto standard for academics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investigations.4-lom.com/2006/10/30/bibliography-databases/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

