How To Make A Bookmarklet
Knowing how to program can save you from tedious repetitive tasks, such as inserting templates into a wiki page. Recently I’ve been spending more time editing the UK National Archives wiki Your Archives. I created a category for women’s wills, and while I was adding pages to it, I found that a lot of them didn’t have the correct template.
Wills that were proved in the Prerogative Court of Canterbury are held by the National Archives and can be downloaded from their DocumentsOnline service. Transcripts of these wills can be posted on Your Archives, and we have a template for them which automatically creates a link back to DocumentsOnline based on an ID code, and formats some key data (testator’s name, dates, catalogue reference) in a standard form. Most of the data which goes into the template can be found in the DocumentsOnline index. We used to copy and paste each value manually, which was not the best use of a human’s time.
Faced with the prospect of doing this an awful lot, I decided to write a program to do it automatically. First I threw together a Python script, which was alright for me but no use for people who don’t have Python and BeautifulSoup (and I also wrote it in such a way that it relied on Linux with xclip installed). So then I decided to rewrite it in JavaScript, so that other people could use it in their browsers. You can find the finished version and documentation on the PCC Will Bookmarklet page. Below is a walkthrough of how I did it.
I’ve tried to make this guide fairly accessible, but I haven’t explained everything. You should be able to follow it if you’re familiar with The Programming Historian. Adam Crymble’s How To Write A Zotero Translator also has some useful tips on scraping with JavaScript, but be aware that it uses XPath, which won’t work in Internet Explorer.
Browsers can run JavaScript programs in several ways. The code might be embedded in a web page, or part of a Firefox extension. You can also run JavaScript from the address bar by typing javascript: followed by some code, and you can save code in the address of a bookmark in the same format. Then when you click on the bookmark, the code runs on the current page. This makes it easy to write and run a script which scrapes data from a page and does something useful with it. Bookmarks which run a script are usually known as bookmarklets.
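To make the javascript: format concrete, here is a minimal sketch of how a snippet could be turned into a bookmark address. The helper name toBookmarkletUrl is my own invention for illustration, and the anonymous function wrapper is a common convention rather than something the bookmarklet in this guide relies on.

```javascript
// Hypothetical helper: turns a snippet of JavaScript into a javascript: URL.
// Wrapping the code in an anonymous function keeps its variables out of the
// page's global scope, and the function returns nothing, so the browser has
// no value to display in place of the current page.
function toBookmarkletUrl(source) {
  return 'javascript:(function(){' + source + '})();';
}

// Example: a bookmark address that shows the title of the current page.
var url = toBookmarkletUrl('alert(document.title);');
console.log(url);
```

Pasting the printed address into the location field of a bookmark should be enough to try it out.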
To write and test a bookmarklet it’s a good idea to install the Firefox extension Firebug. This does two things that are really useful. First, it helps to analyse the underlying code of a web page and find the elements containing the data you’re interested in. Second, it has a command line which can run and debug JavaScript. Making a bookmarklet without Firebug is possible, but much more difficult than it needs to be.
As an example I’m going to refer to the will of Sarah Rawlinson, who was the widow of saddler Nathaniel Rawlinson. This will has a transcript on Your Archives and an index entry on DocumentsOnline. The source code of the Your Archives wiki page includes this template:
{{PCCWill
|Piece=PROB 11/317
|Testator=Sarah Rawlinson, Widow of London
|Signed=1 September 1665
|Proved=30 September 1665
|Code=825891}}
The signed date has to be read by a human from the text of the will (you can see from the text of the wiki page that it’s not in a very machine readable form!). All other data can be scraped from the DocumentsOnline page. By viewing the source code of the page we can find which HTML elements contain the facts that we’re interested in. The data is arranged in a table, which is the fifth table element on the page. The table has two columns. The first contains th elements with the field name, and the second contains td elements with the actual data. Knowing this, we can use the DOM (a representation of the page’s HTML elements as JavaScript objects) to pull out what we want and put it into a variable:
var a=document.getElementsByTagName('table')[4].getElementsByTagName('td');
The variable a now contains a sequence of objects representing all of the td elements in the fifth table element. We can use this later to get the values for the testator’s name, the date the will was proved, and the catalogue reference.
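The one-liner above throws an error if the page has fewer than five tables. A more defensive version might look something like the sketch below. The function name getCellTexts is hypothetical, and it takes the document as a parameter only so that it can be tried outside a browser.

```javascript
// Hypothetical helper: returns the innerHTML of every td cell in the table
// at position tableIndex, or an empty array if the page has no such table.
function getCellTexts(doc, tableIndex) {
  var tables = doc.getElementsByTagName('table');
  if (tableIndex >= tables.length) {
    return [];
  }
  var cells = tables[tableIndex].getElementsByTagName('td');
  var texts = [];
  for (var i = 0; i < cells.length; i++) {
    texts.push(cells[i].innerHTML);
  }
  return texts;
}

// In a browser this would be called as: getCellTexts(document, 4)
```

The real bookmarklet skips the check to save characters, which is a reasonable trade when the page layout is known in advance.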
The Code parameter in the template is a unique id number for documents on DocumentsOnline. It can be found in the query string in the URL of the page, after Edoc_Id=. The easiest way to get it is to run a regular expression over the URL to match “Edoc_Id=” followed by a series of digits and then select the digits as a group.
var b=window.location.href.match(/Edoc_Id=([0-9]+)/)[1];
The variable b now contains the code number. The template will use this to construct a link from Your Archives to DocumentsOnline.
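If the bookmarklet is clicked on a page whose address has no Edoc_Id, match() returns null and the [1] causes an error. As a rough sketch, the same regular expression could be wrapped like this. The name getDocId and the example addresses are made up for illustration.

```javascript
// Hypothetical helper: pulls the digits after "Edoc_Id=" out of a URL,
// or returns null if the URL does not contain one.
function getDocId(url) {
  var m = url.match(/Edoc_Id=([0-9]+)/);
  return m ? m[1] : null;
}

// The addresses below are invented; only the Edoc_Id part matters.
console.log(getDocId('http://example.org/details.asp?Edoc_Id=825891&x=1')); // 825891
console.log(getDocId('http://example.org/'));                               // null
```

In a browser the argument would be window.location.href.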
Since the date the will was signed isn’t shown on the DocumentsOnline page, users have to enter it manually. The basic version of the bookmarklet just ignores this value, leaving users to enter it in the template in the wiki edit box. This is generally faster and less annoying if you know how to use wiki templates. For people who are less confident with entering parameters directly into a template, I made an extra version of the script, which prompts the user to enter the signed date. Fortunately JavaScript has a simple function which displays a prompt box and returns the entered value.
var c=prompt('Please enter date signed (leave blank if unknown):','');
Now the variable c contains the date the will was signed (or nothing if nothing was entered).
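One thing to watch is that prompt() returns null if the user presses Cancel, and null turns into the text "null" when it is joined to other strings. A small clean-up step along these lines would avoid that. The name cleanSigned is my own and is not part of the finished bookmarklet.

```javascript
// Hypothetical helper: tidies whatever prompt() returned.
// null (Cancel) and undefined become an empty string, and any spaces
// at the start or end of the entered text are removed.
function cleanSigned(value) {
  if (value === null || value === undefined) {
    return '';
  }
  return String(value).replace(/^\s+|\s+$/g, '');
}

// In a browser:
// var c = cleanSigned(prompt('Please enter date signed (leave blank if unknown):', ''));
```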
From April 1653 to June 1660 the Prerogative Court of Canterbury and all other ecclesiastical probate courts were replaced by the secular Court for Probate of Wills. Your Archives has a separate template for this court called CPWWill. It has all the same parameters as PCCWill but displays different text. CPW wills are held in the same class as PCC wills and are also available on DocumentsOnline in the same way. The DocumentsOnline index treats them all as PCC wills, but we can easily tell the difference by checking the date that a will was proved. JavaScript has a date object which can turn a string into a date and do calculations on it. Luckily the text from DocumentsOnline is in a format which can be automatically converted, so we just need to pull it out and pass it to the constructor of a new date object. Then we can compare it with other dates.
var pr=new Date(a[1].innerHTML);
The variable pr is the date the will was proved. To get it we go back to variable a, which contains all of the data cells from the fifth table, step to the second cell in the sequence, and use innerHTML to pull out its contents. The text is then converted into a date object which we can compare with other dates.
var bf=new Date('1653/4/7');
var af=new Date('1660/7/3');
The variables bf (before) and af (after) are the dates we need to compare with. The CPW was created by an ordinance of parliament passed on 8 April 1653, so it can’t have proved any wills before this date. After the Restoration, the PCC began sitting again on 3 July 1660, so the CPW can’t have proved any wills on or after this date. Therefore if the date of probate is between these dates it must be a CPW will, and if it isn’t it must be a PCC will.
if (pr>bf && pr<af){var t='CPW';}
else {var t='PCC';}
The variable t now contains a string for the type of will (PCC or CPW) which will be used to generate the correct template name. The if test compares the dates and assigns the correct value to t.
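Passing free text to the Date constructor works here, but how browsers parse anything other than the standard ISO format is left up to each implementation. As an alternative sketch, the date could be parsed by hand before comparing. This assumes the text always looks like "30 September 1665", as in the template example above. The names MONTHS, parseProved and willType are hypothetical, and falling back to PCC when the date can't be read is my own choice.

```javascript
var MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
              'august', 'september', 'october', 'november', 'december'];

// Parses text such as "30 September 1665" into a Date at midnight UTC,
// or returns null if the text is not in that form.
function parseProved(text) {
  var m = /^\s*(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\s*$/.exec(text);
  if (!m) {
    return null;
  }
  var month = MONTHS.indexOf(m[2].toLowerCase());
  if (month === -1) {
    return null;
  }
  return new Date(Date.UTC(Number(m[3]), month, Number(m[1])));
}

// Returns 'CPW' for wills proved from 8 April 1653 to 2 July 1660,
// and 'PCC' for everything else (including dates that cannot be parsed).
function willType(provedText) {
  var pr = parseProved(provedText);
  var bf = new Date(Date.UTC(1653, 3, 7)); // 7 April 1653 (months count from 0)
  var af = new Date(Date.UTC(1660, 6, 3)); // 3 July 1660
  return (pr !== null && pr > bf && pr < af) ? 'CPW' : 'PCC';
}

console.log(willType('30 September 1665')); // PCC
console.log(willType('12 May 1655'));       // CPW
```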
Now we have all the data we need to create the template. We just have to put it all together. I did this in one long complicated statement:
var x='<textarea rows="15" cols="40" onFocus="this.select()">{{' + t + 'Will\n|Piece=' + a[2].innerHTML.match(/>(.+)<img/i)[1] + '\n|Testator=' + a[0].innerHTML.slice(8).replace('  ',' ').replace(' ,', ',') + '\n|Signed=' + c + '\n|Proved=' + a[1].innerHTML+'\n|Code='+b+'}}</textarea>';
The overall effect is to end up with a variable called x containing a string with all of the template code in it, which can be written to a window and pasted into the wiki. We use the + operator to concatenate several literal strings and string variables.
The string starts with the HTML textarea element. This will help to display the text to the user and allow them to copy it.
onFocus="this.select()" will make the text in the text area select itself when clicked, which should speed up copying and pasting (except in Safari where it immediately deselects itself for some reason!).
The default content of the textarea is the wiki template with all the data inserted into it. By adding the variable t we get the correct template: it will either start with {{PCCWill or {{CPWWill.
\n adds new lines, which will make the template code more readable.
Next we add the Piece parameter and its value. This is the catalogue reference from the third row in the table. This cell is in a[2], that is the third element in the sequence stored in variable a. We can’t just use innerHTML to get the text because the cell also contains an image. Therefore we have to use a regular expression to find the text which comes between the opening td tag and the img tag:
a[2].innerHTML.match(/>(.+)<img/i)[1]
The i after the regular expression makes it case insensitive, which is necessary because Internet Explorer returns tag names in upper case when you read innerHTML, even if the source was originally in lower case (as XHTML should be).
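Here is a small sketch of the same idea as a function which can be tried on its own. The name extractPiece and the sample markup are invented for illustration, as I'm not reproducing the exact HTML of the DocumentsOnline cell. I've also used [^<]+ rather than .+ so that the match can't run across another tag, which is a slight change from the expression above.

```javascript
// Hypothetical helper: returns the text that sits between the end of a tag
// and the following img tag, or null if the pattern is not found.
function extractPiece(cellHtml) {
  var m = cellHtml.match(/>([^<]+)<img/i);
  return m ? m[1].replace(/^\s+|\s+$/g, '') : null;
}

// Invented sample markup, in lower case and in upper case tag names.
console.log(extractPiece('<a href="#">PROB 11/317<img src="icon.gif"></a>')); // PROB 11/317
console.log(extractPiece('<A href="#">PROB 11/317<IMG src="icon.gif"></A>')); // PROB 11/317
```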
The testator’s name is in the first row of the table. Again we can’t just use innerHTML to get the text because the value in the cell always starts with “Will of” and we don’t want that. But since the pattern is so predictable, we can just slice off the first 8 characters of the string, which will leave us with the name and other details. For some reason this cell often contains superfluous spaces too, but these are easily removed with the replace() function, replacing two spaces with one, then replacing a space followed by a comma with just a comma:
a[0].innerHTML.slice(8).replace('  ',' ').replace(' ,', ',')
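Note that replace() with a plain string only changes the first match it finds, so a name with several stray double spaces would only be partly cleaned. A sketch using regular expressions with the g flag would catch them all. The name cleanTestator and the sample text are my own.

```javascript
// Hypothetical helper: strips the leading "Will of", collapses runs of
// whitespace into single spaces, and removes spaces before commas.
function cleanTestator(cellText) {
  return cellText
    .replace(/^\s*Will of\s+/, '')
    .replace(/\s{2,}/g, ' ')
    .replace(/ ,/g, ',')
    .replace(/^\s+|\s+$/g, '');
}

// Invented sample with two sets of double spaces and a space before the comma.
console.log(cleanTestator('Will of Sarah  Rawlinson , Widow  of London'));
// Sarah Rawlinson, Widow of London
```

This is longer than the original expression, so whether it is worth the extra characters depends on how messy the source data turns out to be.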
Compared to the previous values, getting the date proved is easy. In this case we can just use innerHTML to get all of the text out of the cell, which is in a[1] because it’s the second row of the table:
a[1].innerHTML
The Code parameter is even easier because we already have exactly what we need stored in variable b.
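If length weren't a concern, the long statement could be broken up so that each parameter sits on its own line of code. The sketch below shows one way of doing that. The function buildTemplate and its field names are hypothetical, and it only produces the template text, not the surrounding textarea.

```javascript
// Hypothetical helper: assembles the wiki template from already cleaned values.
// f.type is 'PCC' or 'CPW'; the other fields are plain strings.
function buildTemplate(f) {
  return [
    '{{' + f.type + 'Will',
    '|Piece=' + f.piece,
    '|Testator=' + f.testator,
    '|Signed=' + f.signed,
    '|Proved=' + f.proved,
    '|Code=' + f.code + '}}'
  ].join('\n');
}

// Values taken from the Sarah Rawlinson example earlier in this guide.
var text = buildTemplate({
  type: 'PCC',
  piece: 'PROB 11/317',
  testator: 'Sarah Rawlinson, Widow of London',
  signed: '1 September 1665',
  proved: '30 September 1665',
  code: '825891'
});
console.log(text);
```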
Now we just need to output the result so that the user can do something with it. First we open a new pop-up window:
var w=window.open('', 'DO', 'height=400,width=500,resizable=1,scrollbars=1');
Now we can access this window through the variable w. So we can write the contents of variable x (which is the string containing the textarea element and the template text) to the document in window w:
w.document.write(x);
w.document.close();
Closing the document isn’t essential but it makes things look neater as otherwise it would give the user the false impression that the window hadn’t finished loading.
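window.open() returns null if a pop-up blocker stops the window from opening, in which case w.document would cause an error. A rough sketch of a guard is shown below. The function name writeToWindow and the wording of the alert are my own, and the finished bookmarklet doesn't include this check.

```javascript
// Hypothetical helper: writes html into a window returned by window.open().
// Returns false (and writes nothing) if the window could not be opened.
function writeToWindow(w, html) {
  if (!w) {
    return false;
  }
  w.document.write(html);
  w.document.close();
  return true;
}

// In a browser:
// var w = window.open('', 'DO', 'height=400,width=500,resizable=1,scrollbars=1');
// if (!writeToWindow(w, x)) { alert('Please allow pop-ups for this site.'); }
```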
Finally the script needs to be converted into bookmarklet form. This involves putting it all onto one line, taking out any unnecessary spaces, and encoding certain special characters which aren’t allowed in URLs. Browsers aren’t always very strict and will let you get away with a lot of these characters if you put the script directly into the address bar or paste it into the properties of an existing bookmark. But if you want to embed it in a web page so that users can drag it onto their toolbar, you will need to encode some characters which would be misinterpreted. In the finished script I substituted > with %3E and < with %3C. Strictly speaking, & should be replaced with %26 and spaces should be replaced with %20, among other things, but everything seems to work as it is. Too much encoding can unnecessarily increase the length of the script. When designing your own bookmarklets it’s important to remember that browsers limit the length of a URL. Internet Explorer, for example, allows 2,083 characters, so it is safest to keep the whole bookmarklet under about 2,000 characters. This is why I’ve used short and not very meaningful variable names.
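I did the encoding by hand, but it could be automated along these lines. This is only a sketch: the names encodeBookmarklet and isShortEnough are hypothetical, the list of characters is the handful mentioned above plus % and double quotes, and the 2,000 character figure is a rule of thumb rather than a hard limit.

```javascript
// Hypothetical helper: encodes the characters most likely to cause trouble
// in an href attribute and adds the javascript: prefix.
// % is encoded first so that the % signs added afterwards are left alone.
function encodeBookmarklet(source) {
  var encoded = source
    .replace(/%/g, '%25')
    .replace(/</g, '%3C')
    .replace(/>/g, '%3E')
    .replace(/&/g, '%26')
    .replace(/"/g, '%22')
    .replace(/ /g, '%20');
  return 'javascript:' + encoded;
}

// Hypothetical helper: checks the finished URL against a length limit.
function isShortEnough(url, limit) {
  return url.length <= limit;
}

var encodedUrl = encodeBookmarklet('if(a<b && c>d){}');
console.log(encodedUrl);                      // javascript:if(a%3Cb%20%26%26%20c%3Ed){}
console.log(isShortEnough(encodedUrl, 2000)); // true
```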
Once the code is finished it can be put into the href attribute of an <a> tag on any HTML page. Users can then drag and drop the link or right click on it to save it in their browsers. Before releasing it to the public you should test it on different browsers to make sure that it works as expected. Internet Explorer is usually the one that causes the most problems.
So to finish, here is the PCC Will bookmarklet.
