On this page

  1. Introduction
  2. De-continuations-ifying programming.reddit.com via DOM tree walking
    1. Analyzing the HTML
    2. Algorithm
    3. JavaScript code to accomplish the task - rev 1
    4. JavaScript code to accomplish the task - rev 2
    5. JavaScript code to accomplish the task - rev 3
    6. Hiding instead of removing
    7. Converting our script into a Greasemonkey user script
  3. De-continuations-ifying programming.reddit.com with XPath
    1. Enough with the DOM tree walking already
    2. What does an XPath expression look like?
    3. Predicates
    4. An XPath expression that matches continuations-related posts - rev1 - matching <tr>'s containing anchors within td's
    5. An XPath expression that matches continuations-related posts - rev2 - matching <tr>'s corresponding to post links
    6. An XPath expression that matches continuations-related posts - rev3 - matching <tr>'s corresponding to posts where the link text includes the string "continuation"
    7. An XPath expression that matches continuations-related posts - rev4 - case-folding the link text (XPath 1.0 vs. XPath 2.0)
    8. An XPath expression that matches continuations-related posts - rev5 - adding the preceding and follow-on <tr>'s to the node-list using a compound location path and the position() function
  4. Incorporating the XPath expression into a GM user script
    1. A De-continuations-ifying script using XPath - document.evaluate()
    2. Choosing a resultType
    3. An XPath-based de-continuations-ifying script
    4. The Greasemonkey user script
  5. Wrap-up
The author, Brian Donovan, is a software developer and writer who currently lives in Hong Kong with his wife and two cats.

EMAIL: i@iheart.h4x0ring.info        AIM: iheart.h4x0ring


Substituting XPath for DOM tree walking in Greasemonkey User Scripts

De-continuations-ifying programming.reddit.com with Greasemonkey


Try the scripts

The two scripts, one relying on DOM tree crawling and the other on XPath, developed throughout the course of this article to de-continuations-ify programming.reddit.com are linked later in the article but you can try them out immediately via the links below. To remove posts containing strings other than "continuation", simply change the value of strNeedle at the beginning of each script.

This article assumes a general familiarity with JavaScript, CSS, and the DOM.

Introduction

Programming.reddit.com can be an interesting place to burn through a few minutes, looking for articles, blog posts, and essays to read. All of the links are user-submitted and the only overt editorial control is the ability of registered users to train personalized filters and push links higher or lower on the front pages of Reddit prime and the various sub-reddits (like the Web 2.0 reddit and the arXiv reddit) by voting them up or down. Some Reddit users haven't found the filters particularly useful (see "The Reddit Recommendation Engine: Does It Work At All?") and the only filter I've been able to work up any enthusiasm for training myself is the Junk filter in Thunderbird, so I generally just scan programming.reddit.com every few days or check out the "new" links. Here's a screenshot of programming.reddit.com that I took a few moments after 2006.05.23, 9AM GMT+8:

screenshot of programming.reddit.com/new
Figure 1. Screenshot of programming.reddit.com with instances of the string "continuation" highlighted.

Let's imagine, for the purposes of this exercise, that we're absolutely sick to death of seeing links pertaining to a certain topic. For no other reason than the fact that there were several continuations-related links near the top of the page (4 above the fold in 1024x768 px view and another, as we'll see later, further down the page) at the time that I took the above screenshot and saved a copy of the page to my hard drive for testing, we'll imagine that we have a grudge against continuations. Reddit allows logged-in users to hide particular articles by clicking on a "Hide" link, but there's no topic-based categorization feature built into Reddit, or at least none available to the masses, so we're stuck if we want to block out links on a particular topic.

Enter Greasemonkey (GM). If you're reading this, then you're probably either already an enthusiastic user who has dozens of GM scripts installed and finds the prospect of going back to browsing the Web without Greasemonkey as unthinkable as giving up popup blocking or the Flashblock extension ... or you've never used Greasemonkey but have heard people raving about how great it is and how it has changed their experience of the Web (in which case you should go and install it immediately). As a Greasemonkey user, you will eventually, if you haven't already, run up against a site for which you can't find a ready, working user script (or at least one that embodies the behavior you'd like to see) and will end up tweaking an existing script or even writing one from scratch to handle the task at hand. The transition from Greasemonkey user to user script author is something that just sort of happens. Practically before you realize it, you're cranking out foo.user.js files.

What are Greasemonkey User Scripts?


User scripts are bits of JavaScript, DOM, and CSS code that can be applied to web sites viewed in Firefox. Greasemonkey user scripts differ from inline JavaScript in that inline JavaScript is executed as soon as it's encountered by the browser loading the page in which it resides while GM user scripts are executed when the DOMContentLoaded event is fired and no sooner. In both cases, you can attach listeners to events so that functions you've defined are executed later.
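
To make the timing difference concrete, here is a minimal sketch of how an inline script might defer its work until the page has loaded. The helper name runAfterLoad is hypothetical and not part of any library; addEventListener() is the standard DOM method. The target is passed in as a parameter only so that the helper can be exercised outside a browser.

```javascript
// Hypothetical helper: registers fnCallback to run when nodeTarget fires
// its 'load' event. In a page, nodeTarget would normally be window.
function runAfterLoad(nodeTarget, fnCallback) {
    // Third argument false = listen in the bubbling phase.
    nodeTarget.addEventListener('load', fnCallback, false);
}

// Only wire things up when a browser window is actually present.
if (typeof window !== 'undefined' && typeof window.addEventListener === 'function') {
    runAfterLoad(window, function () {
        // By now the whole document has been parsed, so elements that
        // appear after the script block can be found as well.
        var intAnchorCount = document.getElementsByTagName('a').length;
        alert('Anchors in page: ' + intAnchorCount);
    });
}
```

In a Greasemonkey user script a wrapper like this is usually unnecessary, since the script does not run before DOMContentLoaded has fired.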

The most popular, though certainly not the only, use of Greasemonkey user scripts is to modify the content of web sites - adding, editing, or removing bits of pages. Typically, some of the steps in page-modifying user scripts consist of getting handles on bits of content. Depending on the structure of the page, in particular the degree to which the people or software that generate the target site's markup use class and id attributes, this can be more or less tedious. In the best-case scenario, it's just a matter of using getElementById() and, in the worst-case scenario, you are likely to find yourself doing a lot of DOM tree crawling.

This article will demonstrate the use of XPath expressions as a way of avoiding DOM tree crawling altogether. First, we'll write a Greasemonkey user script that employs traditional DOM tree crawling (with a few missteps thrown in to illustrate common pitfalls) to de-continuations-ify programming.reddit.com. After a bit of an introduction to XPath, we'll then re-write the same script using XPath instead of DOM tree crawling.


De-continuations-ifying programming.reddit.com via DOM tree walking

Analyzing the HTML

The most direct approach would be to remove links whose descriptions contain the substring "continuation". Viewing the source of the page, we can see that each post is composed of four <tr> elements, children of a <table> element with an id attribute value of siteTable, which is a child, in turn, of a <div> element with an id attribute value of main. This <div> is top-level within the page's <body>.

We can select the text of the first continuations-related link on programming.reddit.com's homepage in the browser window (#7, 'Another Rebuttal to "Continuationations Considered Harmful" - Sapir-Whorf is not a Klingon') and choose "View Selection Source" in Firefox's right-click contextual menu. Here's what we see:

Figure 2. Screenshot of selection source for link #7 on programming.reddit.com, 'Another Rebuttal to "Continuationations Considered Harmful" - Sapir-Whorf is not a Klingon'

This snippet isn't particularly squirrely-looking but formatting the source before looking for Greasemonkey attack points is often helpful:

<tr id="site296578">
    <td colspan="1" class="oddRow spacing top"></td>
    <td colspan="4" class="oddRow spacing top"></td>
</tr>

<tr class="oddRow">
    <td class="numbercol" rowspan="3" valign="top">
         7.
    </td>
    <td rowspan="3" valign="top">
        <div id="up296578" class="arrow up" onclick="javascript:mod(296578, 1, 'd873d0f3d12c936c8732f7ce831656b3d5837dec')"></div>
        <div id="down296578" class="arrow down" onclick="javascript:mod(296578, 0, 'd873d0f3d12c936c8732f7ce831656b3d5837dec')"></div>
    </td>
    <td colspan="3" id="titlerow296578">
        <a id="title296578" class="title loggedin" href="http://www.oreillynet.com/onlamp/blog/2006/05/sapirwhorf_is_not_a_klingon.html?CMP=OTC-6YE827253101&amp;ATT=Sapir-Whorf+is+not+a+Klingon" onmousedown="return rwt(this, 296578, '6cua', '')">          Another Rebuttal to "Continuationations Considered Harmful" - Sapir-Whorf is not a Klingon</a>
        <span class="little"> (oreillynet.com)</span>
    </td>
</tr>

<tr class="oddRow">
    <td class="wide little" colspan="3" valign="top">
        <span id="score296578">31 points</span>
        posted 1 day ago by <a href="/user/filz">filz</a>
        <a href="/info/6cua/comments" class="bylink">         comment     </a>
        <a href="/info/6cua/share" class="bylink">share</a>
        <span id="save296578"><a class="bylink" href="javascript:save(296578)">save</a></span>
        <span id="save296578"><a class="bylink" href="javascript:hideSite(296578)">hide</a></span>
    </td>
</tr>

<tr>
    <td colspan="3" class="oddRow spacing"></td>
</tr>

We can view the same section of the page in Firefox's DOM Inspector (if you don't see the DOM Inspector under "Tools" in Firefox, this is the point at which you'll wish that you'd chosen "Custom Install" and ticked the "Web Developer Tools" checkbox when you were installing Firefox). In the screenshot below, we've navigated to the text node with the value '          Another Rebuttal to "Continuationations Considered Harmful" - Sapir-Whorf is not a Klingon', expanding nodes along the way, then highlighted all of the nodes associated with link #7:

Figure 3. Screenshot of DOM Inspector showing nodes associated with link #7 on programming.reddit.com, expanded to show the text node with value '          Another Rebuttal to "Continuationations Considered Harmful" - Sapir-Whorf is not a Klingon'.

How can we characterize the location of the string containing the substring "continuation" in this snippet? If we look at the structure of the HTML above and think in terms of the DOM, we see that it's a text node with value "          Another Rebuttal to "Continuationations Considered Harmful" - Sapir-Whorf is not a Klingon" that is a child of an anchor element node that is, in turn, nested within the third and final <td> element node child of the second of the four <tr> element nodes associated with the post. What's the closest element within this hierarchy that has some distinguishing characteristics?

The anchor element closest to the link text in question has an id attribute value of title296578. The substring "296578" appears in JavaScript function calls and id attribute values throughout this post's HTML and it would be reasonable to presume that it's a numeric identifier associated with this link in the database behind the reddit site. This anchor element is the only one anywhere in the snippet to have an id attribute. If we were to look at the entire page's source, we would see that the section of HTML corresponding to each post contains a different numeric string but that all of the corresponding anchor nodes have id attribute values following the form titleXXXXXX.

Why restrict ourselves to text node children of anchor elements that fit this profile? Though it may seem unlikely that other links within reddit will have id attributes with titleXXXXXX attribute values, it helps us to avoid accidentally stripping out non-post links that might contain the string "continuation". While it would appear unlikely, there's nothing stopping the Reddit site maintainers from adding some other links with the substring "continuation" somewhere in their link texts to other parts of the page. Being as specific as possible helps us to prevent, but doesn't insure against, our GM script badly mangling the page at some future date.

Algorithm

In brief, here's the series of steps that our DOM tree walking Greasemonkey script will take when it runs:

  • Collect all of the anchor elements in the page.
  • For those anchors with id attributes whose values begin with the substring "title":
    • Does the link text contain the substring "continuation" in some form?
      • If so, then remove the section of the page associated with this link.

JavaScript code to accomplish the task - rev 1

We'll convert the algorithm into JavaScript/DOM/CSS code the old-fashioned way - by saving the page to disk, opening it up in a text editor, adding a JavaScript block to the bottom of the page (saving us, for now, the trouble of wrapping it all in a function definition and using addEventListener to make sure that it's called after the page has loaded), and then building up our script with alert() statements where necessary to show us the values of important variables at various junctures until we have something that does what we want.

Below: rev 1 of the code with an alert that shows us the link text of the link we're going to be pruning. I've included a misstep to illustrate some of the pitfalls of working with "live" DOM nodes. If you would rather just see the correctly-working code, click here to skip directly to the final version of the script.

var nodeListAnchorElements = document.getElementsByTagName('a');
for (var i=0; i<nodeListAnchorElements.length; i++) {
    if (nodeListAnchorElements[i].hasAttribute('id')) {
        var strIdAttributeValue = nodeListAnchorElements[i].getAttribute('id');
        var intIndexOfTitle = strIdAttributeValue.indexOf('title');
        if (intIndexOfTitle === 0) {
            var strLcAnchorText = nodeListAnchorElements[i].childNodes[0].nodeValue.toLowerCase();
            var intIndexOfContinuation = strLcAnchorText.indexOf('continuation');
            if (intIndexOfContinuation >= 0) {
                alert('strLcAnchorText: ' + strLcAnchorText);
                var nodeGreatGreatGrandParentOfAnchorTextNode = nodeListAnchorElements[i].parentNode.parentNode.parentNode;
                var nodeGreatGrandParentOfAnchorTextNode = nodeListAnchorElements[i].parentNode.parentNode;
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode.previousSibling.previousSibling);
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode);
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling);
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling.nextSibling.nextSibling);
            }
        }
    }
}

The script above follows the general outline we started with. The first line gives us a NodeList of all of the anchor elements in the page. We iterate over the <a> elements, looking for id attributes. If a particular anchor element in the NodeList has an id attribute, we check to see whether its value begins with the substring "title". If it does, we then check to see whether its single childNode, the text node containing the link text that gets displayed in the browser, contains the substring "continuation". For those anchor elements whose link text contains "continuation", we do a bit of DOM gymnastics. We want to use removeChild() to get rid of the four <tr> element nodes associated with the post so, as the method's name would suggest, we need to get a handle on their parentNode, the <table> element with an id attribute value of siteTable.

normalize()


To protect against multiple adjacent text nodes under the anchor element node, something that we don't see in the actual page, we could have called normalize(), which merges adjacent text nodes, on the anchor element node.
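
A minimal sketch of that precaution follows. The function name getLcLinkText is hypothetical; normalize(), childNodes, and nodeValue are standard DOM members. It assumes, as on the Reddit page, that the anchor's first child is the text node holding the link text.

```javascript
// Hypothetical helper: returns the lower-cased link text of an anchor element.
function getLcLinkText(nodeAnchor) {
    // Merge any adjacent text nodes so that childNodes[0] holds the full text.
    nodeAnchor.normalize();
    var nodeFirstChild = nodeAnchor.childNodes[0];
    // Assumption: the first child, if present, is the text node we want.
    // Element nodes report a nodeValue of null, so they yield an empty string.
    if (!nodeFirstChild || nodeFirstChild.nodeValue === null) {
        return '';
    }
    return nodeFirstChild.nodeValue.toLowerCase();
}
```

In the scripts in this article, a call such as getLcLinkText(nodeListAnchorElements[i]) would stand in for the expression that reads childNodes[0].nodeValue directly.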

One way to go about this would be to use getElementById() to get our hands on it - ex. var nodeSiteTable = document.getElementById('siteTable'); and run subsequent removeChild() calls against nodeSiteTable: nodeSiteTable.removeChild(nodeWeWantToRemove). This would work, but our script would break if the reddit gnomes changed the value of that id attribute or removed it altogether. To make our script a little less brittle, we can get a handle on the <table> element node relative to our node of interest using the parentNode property. If the structure of the page changes significantly, our script will still break. It's a judgement call - a combination of a guess about the way a site's maintainers will be more likely to alter their site and a decision about what is easier to write and modify later.

Referring to Figure 3, we can see that the parentNode of the anchor element node nodeListAnchorElements[i] in our script above is going to be a <td> element node (with an id attribute value of the form titlerowXXXXXX) whose parentNode, in turn, is going to be a <tr> element node with a class attribute value of either oddRow or evenRow. The parent of this <tr> element is the <table> element that we want. From the anchor element node, then, we're hopping upwards 3 levels in the page's DOM hierarchy. The <table> is 4 levels up from the anchor's text node, hence the variable name "nodeGreatGreatGrandParentOfAnchorTextNode".

The decision of whether to use relative references to get at the neighboring <tr> is less of a judgement call since there's no easy way to get a given node's index within the childNodes NodeList of its parent (we used childNodes to get the text node child of the anchor element because we could rely on there being a single child and on that child being the one that we wanted). The nodes that we want to dispose of are children of the <table> so are 3 levels above the anchor's text node. Being consistent in our variable naming, we use "nodeGreatGrandParentOfAnchorTextNode" as the variable name of the 3-level-up parent node. To get past the text nodes that lie in between each of the element nodes, we double up on previousSibling and nextSibling property selectors.
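
Doubling up the property lookups works only as long as exactly one text node sits between each pair of rows. As an alternative, sketched here under that caveat and not used in the scripts that follow, a small helper can step over any number of non-element nodes. The name getSiblingElement is hypothetical; 1 is the DOM's nodeType value for element nodes.

```javascript
// Hypothetical helper: walks from node in the given direction
// ('previousSibling' or 'nextSibling') and returns the first element node
// it encounters, or null if there is none.
function getSiblingElement(node, strDirection) {
    var nodeCandidate = node[strDirection];
    while (nodeCandidate && nodeCandidate.nodeType !== 1) { // 1 === Node.ELEMENT_NODE
        nodeCandidate = nodeCandidate[strDirection];
    }
    return nodeCandidate || null;
}
```

With this helper, getSiblingElement(nodeGreatGrandParentOfAnchorTextNode, 'nextSibling') would stand in for nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling.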

Running this script causes an alert dialog containing the text 'strLcAnchorText: another rebuttal to "continuationations considered harmful" - sapir-whorf is not a klingon' to pop up. Below are 2 screenshots of the page taken, respectively, when the script threw up the first, and only, alert() dialog and after the script has run. The altered page that we see after clicking through the alert()-spawned dialog is not very pretty.

Figure 4. Screenshot of page while rev 1 of our proto-Greasemonkey-script is running
Figure 5. Screenshot of page after rev 1 of our proto-Greasemonkey-script has run. Note the JavaScript Console message.

Where did the "up" and "down" arrows go?


Aside from the obvious beating that the page has taken, you may notice that the "up" and "down" voting images seem to have disappeared. Their absence in this and the other screenshots taken of our script running in cached versions of the page is due to those <img> elements being written into the page at runtime and using relative paths rather than any monkey business on the part of our script. If you view source on the reddit page, you'll see:

<script language="javascript">
    var a = new Image();
    a.src ="/static/aupmod.png";
    var b = new Image();
    b.src = "/static/adownmod.png";
</script>

For a clearer perspective on how we went pear-shaped, let's open Firefox's DOM Inspector. Below is a side-by-side view of the DOM structures of the original and mangled versions of the page, corresponding to the pages shown in the first and fourth screenshots (Figures 1 and 4), respectively, with the nodes comprising two consecutive continuations-related posts highlighted and some of the nodes expanded to give us a sense of context:

Figure 6. Screenshot of views in DOM inspector of page before and after we apply rev #1 of our script with some nodes expanded for context.

We can see that only the first two <tr> nodes have been removed, the one with the id attribute value of site296578 that contains spacer <td> elements and the one with the <td> containing the link. The first post was the only one affected because Firefox stopped executing the script after it hit the error shown in the JavaScript console:

Error: nodeGreatGrandParentOfAnchorTextNode.nextSibling has no properties
Source File: file:///C:/work/articles2/brian-ora-greasemonkey-xpath-article/target-page-reddit/rev1.htm
Line: 3086

We've run up hard against a consequence of the DOM being "live", meaning that the document tree, and every reference we hold into it, reflects changes in real time. Once we removed nodeGreatGrandParentOfAnchorTextNode from the <table>, it no longer had any siblings, so its nextSibling property was null. The next removeChild() call tried to read a property of that null value (nextSibling.nextSibling), which raised the error above and halted the script.

JavaScript code to accomplish the task - rev 2

Let's try re-ordering our removeChild() calls:

var nodeListAnchorElements = document.getElementsByTagName('a');
for (var i=0; i<nodeListAnchorElements.length; i++) {
    if (nodeListAnchorElements[i].hasAttribute('id')) {
        var strIdAttributeValue = nodeListAnchorElements[i].getAttribute('id');
        var intIndexOfTitle = strIdAttributeValue.indexOf('title');
        if (intIndexOfTitle === 0) {
            var strLcAnchorText = nodeListAnchorElements[i].childNodes[0].nodeValue.toLowerCase();
            var intIndexOfContinuation = strLcAnchorText.indexOf('continuation');
            if (intIndexOfContinuation >= 0) {
                alert('strLcAnchorText: ' + strLcAnchorText);
                var nodeGreatGreatGrandParentOfAnchorTextNode = nodeListAnchorElements[i].parentNode.parentNode.parentNode;
                var nodeGreatGrandParentOfAnchorTextNode = nodeListAnchorElements[i].parentNode.parentNode;
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode.previousSibling.previousSibling);
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling.nextSibling.nextSibling);
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling);
                nodeGreatGreatGrandParentOfAnchorTextNode.removeChild(nodeGreatGrandParentOfAnchorTextNode);
            }
        }
    }
}

The results are decidedly mixed. First, the good news: no JavaScript errors, the page layout doesn't get messed up, and multiple continuations-related posts are removed. The bad news: not all of the continuations-related posts are removed. Looking at the continuations-related posts (their lower-cased titles as shown in the alert dialogs) that were/weren't removed and their locations within the original page sheds some light on our problem:

| Post # | Post link text | Remains after script has run? |
|--------|----------------|-------------------------------|
| 7 | another rebuttal to "continuationations considered harmful" - sapir-whorf is not a klingon | |
| 8 | ongoing continuations : rebuttal to "continuations considered harmful" by avi bryant[...] | X |
| 14 | continuations for user journeys in web applications considered harmful | |
| 15 | continuations and microthreads on mono | X |
| 37 | why the jvm won't support continuations | |

A pattern is evident. When one continuations-related post is followed by another, the second is not removed.

Remember: the NodeLists returned by methods like getElementsByTagName() are "live". When we remove the four <tr> element nodes associated with post #7, the anchor element nodes from post #7 vanish from the NodeList we're iterating over and the title anchor element node associated with post #8 slides up to take the place of post #7's. As a result, the anchor element node for post #8, whose link text contains the substring "continuation", is skipped when the loop counter advances to the next index. Between groups of nodes, however, the spacing is large enough (6 anchor elements between the two from posts #8 and #14 and even more, 22, between those from posts #15 and #37) that they're not affected.
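
The skipping effect can be reproduced without a browser. The sketch below is a simplified model, not DOM code: a plain array stands in for the live NodeList, with one item per post title, and splice() stands in for the list shrinking when nodes are removed. All function names are hypothetical. Iterating from the end of the list is shown only for comparison, as another common workaround.

```javascript
// Returns true if the (hypothetical) title string mentions continuations.
function hasNeedle(strTitle) {
    return strTitle.toLowerCase().indexOf('continuation') >= 0;
}

// Forward loop over a shrinking list: after a removal, the next item slides
// into index i and is never examined, because i is incremented straight away.
function removeMatchesForward(arrLive, fnMatches) {
    var arrRemoved = [];
    for (var i = 0; i < arrLive.length; i++) {
        if (fnMatches(arrLive[i])) {
            arrRemoved.push(arrLive[i]);
            arrLive.splice(i, 1);
        }
    }
    return arrRemoved;
}

// Backward loop: removals only shift items that have already been examined.
function removeMatchesBackward(arrLive, fnMatches) {
    var arrRemoved = [];
    for (var i = arrLive.length - 1; i >= 0; i--) {
        if (fnMatches(arrLive[i])) {
            arrRemoved.push(arrLive[i]);
            arrLive.splice(i, 1);
        }
    }
    return arrRemoved;
}

var arrTitles = ['continuations 7', 'continuations 8', 'lisp', 'continuations 14'];
// The forward pass leaves 'continuations 8' behind, mirroring post #8 above.
var arrRemovedForward = removeMatchesForward(arrTitles, hasNeedle);
```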

JavaScript code to accomplish the task - rev 3

We can fix this rather easily by postponing any node removals until the enclosing for loop has finished its run and then looping a 2nd time over another array, arrNodesToRemove, into which we've placed all of the nodes we want to remove. In the modified version of the script below, except for the variable declaration for arrNodesToRemove, I've inserted comments above the lines that are new or have been changed:

var arrNodesToRemove = [];
var nodeListAnchorElements = document.getElementsByTagName('a');
for (var i=0; i<nodeListAnchorElements.length; i++) {
    if (nodeListAnchorElements[i].hasAttribute('id')) {
        var strIdAttributeValue = nodeListAnchorElements[i].getAttribute('id');
        var intIndexOfTitle = strIdAttributeValue.indexOf('title');
        if (intIndexOfTitle === 0) {
            var strLcAnchorText = nodeListAnchorElements[i].childNodes[0].nodeValue.toLowerCase();
            var intIndexOfContinuation = strLcAnchorText.indexOf('continuation');
            if (intIndexOfContinuation >= 0) {
                alert('strLcAnchorText: ' + strLcAnchorText);
                var nodeGreatGrandParentOfAnchorTextNode = nodeListAnchorElements[i].parentNode.parentNode;
                //-------------------------------------------------
                arrNodesToRemove.push(nodeGreatGrandParentOfAnchorTextNode.previousSibling.previousSibling);
                arrNodesToRemove.push(nodeGreatGrandParentOfAnchorTextNode);
                arrNodesToRemove.push(nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling);
                arrNodesToRemove.push(nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling.nextSibling.nextSibling);
            }
        }
    }
}
//-------------------------------------------------
for (var i=0; i<arrNodesToRemove.length; i++) {
    arrNodesToRemove[i].parentNode.removeChild(arrNodesToRemove[i]);
}
alert('Hopefully, removed ' + arrNodesToRemove.length + ' nodes!');

This rev of the script works. All of the continuations-related posts are removed. We've succeeded because the relative references (in terms of nodes' previousSibling and nextSibling properties) are resolved at the moment that references to the nodes we want to remove are pushed onto the array arrNodesToRemove, before any node has been removed.

We're not finished yet, however. We can simplify things a bit.

Hiding instead of removing

Instead of using removeChild(), we can simply hide the <tr>'s of continuations-related posts. Because we won't be altering the structure of the DOM tree (display is a CSS property), we won't need to store references to the <tr> element nodes that we want to nuke in a special array and run a second loop over that array:

var nodeListAnchorElements = document.getElementsByTagName('a');
for (var i=0; i<nodeListAnchorElements.length; i++) {
    if (nodeListAnchorElements[i].hasAttribute('id')) {
        var strIdAttributeValue = nodeListAnchorElements[i].getAttribute('id');
        var intIndexOfTitle = strIdAttributeValue.indexOf('title');
        if (intIndexOfTitle === 0) {
            var strLcAnchorText = nodeListAnchorElements[i].childNodes[0].nodeValue.toLowerCase();
            var intIndexOfContinuation = strLcAnchorText.indexOf('continuation');
            if (intIndexOfContinuation >= 0) {
                alert('strLcAnchorText: ' + strLcAnchorText);
                var nodeGreatGrandParentOfAnchorTextNode = nodeListAnchorElements[i].parentNode.parentNode;
                //-------------------------------------------------
                nodeGreatGrandParentOfAnchorTextNode.previousSibling.previousSibling.style.display='none';
                nodeGreatGrandParentOfAnchorTextNode.style.display='none';
                nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling.style.display='none';
                nodeGreatGrandParentOfAnchorTextNode.nextSibling.nextSibling.nextSibling.nextSibling.style.display='none';
            }
        }
    }
}

That works quite well.

Converting our script into a Greasemonkey user script

Converting our script into a Greasemonkey user script is straightforward and takes just a moment. We start with the last snippet, remove the alert() statements, and add a series of specially-formatted comments that tell Greasemonkey where to apply (or not apply) the script as well as giving anyone else who may encounter the script (if, for example, we add it to one of the web-based Greasemonkey script repositories) an idea of its purpose and who originally wrote it. For a precise description of the comments format, see the "Add metadata" step of the instructions on the Writing User Scripts page at the Greasemonkey site at mozdev (now would be a good time, incidentally, to install Greasemonkey if you haven't already done so). Finally, we save it with the extension .user.js.

Here's how the comments might look for the script that we've built up in this article:

// ==UserScript==
// @name          De-continuations-ify programming.reddit.com using DOM tree walking and CSS
// @namespace     http://projects.briandonovan.info/projects/greasemonkey-user-scripts/
// @description	  Removes posts whose link text includes the substring "continuation".
// @include       http://programming.reddit.com/*
// ==/UserScript==

Click here for the GM user script. If you have Greasemonkey installed, you should see a header across the top of the page with a small cartoonish image of a monkey's face, text including "This is a Greasemonkey User Script. Click Install to start using it.", and an Install button. After you've clicked Install, you should see an alert that says "decontinusationsifyprogra.user.js installed successfully." (don't worry about the mangling of the file name - GM user scripts are stored in a folder within your Mozilla profile and the file name is reformatted a bit in the saving process).

Now, navigate to programming.reddit.com. If there are no continuations-related links posted, you can change the string from "continuation" to some other lower-cased substring (perhaps "rails"? :D ).
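
Putting the pieces of this section together, the assembled user script might look something like the sketch below. It is an illustration under stated assumptions rather than the linked script itself: the @name and @namespace values are placeholders, the function name hideMatchingPosts is hypothetical, and the search string is held in a strNeedle variable at the top so that it can be changed in one place.

```javascript
// Placeholder metadata: the @name and @namespace values are illustrative only.
// ==UserScript==
// @name          De-continuations-ify programming.reddit.com (sketch)
// @namespace     http://example.com/your-namespace/
// @description   Hides posts whose link text includes the substring in strNeedle.
// @include       http://programming.reddit.com/*
// ==/UserScript==

var strNeedle = 'continuation'; // lower-cased substring to look for

// Hypothetical wrapper: hides the four <tr> rows of every post whose title
// link text contains strLcNeedle. Returns the number of posts hidden.
function hideMatchingPosts(nodeDocument, strLcNeedle) {
    var intHidden = 0;
    var nodeListAnchorElements = nodeDocument.getElementsByTagName('a');
    for (var i = 0; i < nodeListAnchorElements.length; i++) {
        var nodeAnchor = nodeListAnchorElements[i];
        // Only post links carry an id of the form titleXXXXXX.
        if (!nodeAnchor.hasAttribute('id')) { continue; }
        if (nodeAnchor.getAttribute('id').indexOf('title') !== 0) { continue; }
        var strLcAnchorText = nodeAnchor.childNodes[0].nodeValue.toLowerCase();
        if (strLcAnchorText.indexOf(strLcNeedle) < 0) { continue; }
        // The <tr> holding the link; its neighbours are reached by skipping
        // the text node between each pair of rows.
        var nodeTitleRow = nodeAnchor.parentNode.parentNode;
        nodeTitleRow.previousSibling.previousSibling.style.display = 'none';
        nodeTitleRow.style.display = 'none';
        nodeTitleRow.nextSibling.nextSibling.style.display = 'none';
        nodeTitleRow.nextSibling.nextSibling.nextSibling.nextSibling.style.display = 'none';
        intHidden++;
    }
    return intHidden;
}

// Run against the real page only when one is present.
if (typeof document !== 'undefined') {
    hideMatchingPosts(document, strNeedle);
}
```

Wrapping the loop in a function is a design choice made for this sketch; it keeps the script's variables out of the way and makes the logic easy to try against a saved copy of the page.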


De-continuations-ifying programming.reddit.com with XPath

Before we proceed further...


There are a few things that I'd like you to do now:
  1. Install Viktor Zigo's XPather, a Firefox extension that integrates into the browser's right-click contextual menu and into the DOM Inspector, exposing Firefox's XPath implementation and allowing you to generate and evaluate XPath expressions in (X)HTML and XML documents.
  2. Download Alex Chaffee's XPath Explorer, a free-standing XPath expression generator/evaluator written in Java.
  3. Read up on XPath. John E. Simpson's XPath and Xpointer (ISBN 0-596-00291-2) does an admirable job of covering XPath 1.0. If you want something online, then the MozillaZine Knowledgebase entry on XPath in Mozilla/Firefox recommends the W3Schools' XPath tutorial.

XPath Explorer's "Matching Nodes" view will be helpful in following the steps we'll take at the beginning of this section, but we'll switch over to using XPather for the final sections where keeping track of two different XPath implementations' quirks would present too big of an albatross around our necks.

Enough with the DOM tree walking already

DOM tree walking is fairly straightforward but it can seem slow (in terms of the number of steps required to achieve a given result). Now, we'll rewrite our user script to avoid tree walking completely.

What does an XPath expression look like?

First, here's a simple explanation that won't do you a lot of good right now but should become more clear as we move along: XPath expressions are a means of getting a hold of portions of XML documents, in the form of either individual nodes or node-sets, collections of nodes related to each other in some way encoded in the XPath expression with which they were retrieved. XPath expressions are plain text (not XML) strings composed of tokenized location steps delimited by separators like /. As with mathematical expressions or Unix filesystem directory paths, you can rely on them being evaluated from left to right (except where operator precedence comes into play).

Let's begin by opening the pristine version of the saved programming.reddit.com homepage in XPath Explorer:

Figure 7. Screenshot of pristine version of programming.reddit.com opened in XPath Explorer.

The XPath expression showing in the XPath Explorer interface on startup is //, a double slash. A double slash is used in an XPath expression to locate all of the nodes with a particular name or of a particular type (//a, for example, would fetch all of the anchors in a document), regardless of their location within the DOM tree. In XPath Explorer, a double slash, by itself, returns a node-set containing all of the nodes in the document. We can see the nodes in the returned node-set by clicking on XPath Explorer's "Matching Nodes" tab:

Figure 8. Beginning of the long list of nodes within programming.reddit.com that comprise the node-list returned by the XPath expression //.

Firefox's XPath implementation (with XPather installed, you can evaluate an XPath expression by opening the DOM Inspector, entering an XPath expression in the XPath field, and pressing "Eval") doesn't interpret a lone double slash the same way, however:

Figure 9. In Firefox's XPath implementation, as exposed by the XPather extension, the XPath expression // matches no nodes.

Adding a star after the double slash does the trick and is actually the proper way to match all of the element nodes in a document:

Figure 10. The XPath expression //* is the proper way to match all of the element nodes in a document.

XPath versions and compatibility


The takeaway here is that different XPath implementations are not necessarily 100% compatible (or even 100% compliant with the standards they're implementing). A separate, bigger issue is that of XPath versions. XPath 1.0 is currently supported in Firefox via TransforMiiX, an XSLT 1.0 implementation. XPath Explorer gets its XPath parsing capabilities from Jaxen 1.0 (the dev build). We've been using XPath Explorer here because of its ability, in addition to displaying just the matching nodes, to show the matched nodes in the document context (in the "All Nodes" view, expand the nodes, starting with html, within which you'd like to see matched nodes).

The top field showing in XPath Explorer in the light blue region below the panel containing the "URL" and "XPath" fields is "Expanded". Here, we can see another, more verbose, way of stating the expression in the "XPath" field. If we pasted //* back into the "Expression" field from XPather, the "Expanded" field would show: /descendant-or-self::node()/child::*. In that expression, there are two location steps, descendant-or-self::node() and child::*.

The single slash at the beginning of the location path means that we're starting from the document root and that this is an absolute rather than a relative location path. The first location step begins with the descendant-or-self:: axis, one of thirteen available in XPath. Axes specify the direction (up, down, forward, backward, etc. in the DOM tree) of a location step. The double colon (::) is a delimiter that separates the names of axes from the names of particular nodes or sets of nodes. This is followed by the node() node test, which matches any sort of node. Node tests come in two flavors: name tests, which match nodes based on their element names and/or namespaces, and kind tests, which can be used to restrict results to nodes of particular types (while node() matches all nodes, for example, comment() matches only comments, text() matches only text nodes, etc.). The second location step, child::*, begins after the / separator and consists of the child:: axis and the asterisk (wildcard), which matches any element name. Taken altogether, this expression is composed of two location steps and matches all of the element nodes in the document, from the root element (html in this case) on down.

The two XPath expressions, //* and /descendant-or-self::node()/child::*, are equivalent.

Location steps can be composed of 3 parts: an axis, a node test, and zero or more predicates. The general form of a location step can be written as:

axis::nodetest[predicate]

Predicates

Now, let's tackle predicates. If we navigate down to the first of the continuations-related posts in our locally-saved, pristine version of programming.reddit.com, and double-click on the node corresponding to the title link of the first continuations-related post, we get a new set of XPath expressions:

Figure 11. Double-clicking on the anchor corresponding to the link to the first continuations-related post gives us an XPath expression with a predicate.

The names of the nodes that match the expression (just the one <a> element node in this case) are shown in blue. For every case, XPath Explorer is going to give us two XPath expressions, one terse and one verbose. Let's take the wordier "Expanded" expression first:

/descendant-or-self::node()/child::a[(attribute::id = "title296578")]

Starting from the document root, we have 2 location steps, descendant-or-self::node() and child::a[(attribute::id = "title296578")]. The first is already familiar (see above). The second begins with the child axis, which locates nodes directly descended from the context nodes (the nodes selected in the previous location step). In this case, that's potentially any node in the document below the root html node. The node test (of the name variety), a, narrows it down to anchor element nodes. The bit within the square brackets is what's known as a predicate and further constrains the node-list to anchor elements with id attribute values equal to "title296578".

This XPath expression can be (and has been - just look back up at XPath Explorer's "XPath" field) written more succinctly as:

//a[@id='title296578']

Just as // is shorthand for /descendant-or-self::node()/, @ is shorthand for the attribute:: axis.

Predicates represent boolean tests. If the condition(s) set forth within a predicate evaluate as true, then the nodes represented by the axis::nodetest portion of the expression will be included in the node-set returned. If not, then they won't be included.
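If it helps to think of predicates in JavaScript terms, they behave a lot like the callback passed to an array's filter() method. The sketch below is only an analogy: the anchors are plain objects standing in for real <a> element nodes, and every id value other than title296578 is made up for illustration.

```javascript
// Hypothetical stand-ins for anchor element nodes: { name, attributes }.
var anchors = [
    { name: 'a', attributes: { id: 'title296578', href: '#' } },
    { name: 'a', attributes: { href: '#' } },
    { name: 'a', attributes: { id: 'title296001' } }
];

// Analogue of //a[@id='title296578']:
// keep only the nodes for which the boolean test is true.
var byId = anchors.filter(function (node) {
    return node.attributes.id === 'title296578';
});

// Analogue of //a[@id]:
// a bare location path inside a predicate acts as an existence test.
var withId = anchors.filter(function (node) {
    return 'id' in node.attributes;
});
```

Here byId ends up holding one object and withId two, in the same way that the node-set returned by an XPath expression holds only the nodes that passed the predicate.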

This set of XPath expressions has been helpful insofar as it gave us a taste of how predicates work. Now, let's start de-continuations-ifying programming.reddit.com.

An XPath expression that matches continuations-related posts - rev1 - matching <tr>'s containing anchors within td's

We have to craft an XPath expression that will match <tr> element nodes bearing the <td>'s holding the anchor elements that include the substring "continuation" in their one and only, to use DOM terminology, text node childNodes. We also want to get a handle on the one previous sibling and two follow-on sibling <tr> element nodes that comprise the rest of each post - but we'll set those aside for now. As a first step, let's go after all of the <tr> nodes that contain <td> nodes which, in turn, contain <a> nodes:

//tr[td/a]

The td/a goes into the predicate (and is treated as a test for existence) because we want the <tr>, not the <a> itself. In the screenshot below, you can see all the nodes in the first continuations-related post that are in the returned node-set highlighted in blue:

Figure 12. Results for the XPath expression //tr[td/a], expanded to show the hierarchy of nodes beneath the two <tr> element nodes in the first continuations-related post that are matched.

For every post's set of four <tr> element nodes, two are matched (2nd and 3rd <tr>'s). The expression also matches some other <tr>'s outside of the "siteTable".

You may be wondering why the contents of the "String Value" field, which presumably represent the string value for the node-list returned by our XPath expression, are as follows:

  briandon (1) |prefs|submit|help|blog|logout

The string value (that's string-value in XPath parlance) of an element node corresponds roughly to that node's textContent in DOM-land (nodeValue is null for element nodes) and consists of the concatenated contents of all of the text nodes contained by the node in question as well as those within its descendants. So why do we see just the above text and not all the values of the text node children of all of the other <tr> element nodes containing <a> element nodes within <td> element nodes?

The answer is simple: When an XPath expression matches a set of nodes (a node-set) rather than a single node, the string-value is the string-value of the first node in the node set. In the screenshot below, you can see that XPath Explorer's value for the "String Value" of the node-list returned by the expression //tr[td/a] corresponds to the concatenated values of the text nodes (underlined in green) within the first matching <tr> element:

Figure 13. "All Nodes" view of results for the XPath expression //tr[td/a] with the child nodes of the first matching <tr> in the node-list expanded to show text node children (underlined in green).
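To make the string-value rule concrete, here is a minimal sketch in plain JavaScript. It models nodes as simple objects rather than real DOM nodes (the text, tag, and children property names and both function names are invented for illustration), and it assumes that the node-set is already in document order.

```javascript
// Hypothetical node model: text nodes are { text: '...' },
// element nodes are { tag: 'tr', children: [...] }.

// String-value of a single node: its own text, or the concatenation
// of the text of every descendant text node, in document order.
function stringValue(node) {
    if (typeof node.text === 'string') {
        return node.text;
    }
    var out = '';
    for (var i = 0; i < node.children.length; i++) {
        out += stringValue(node.children[i]);
    }
    return out;
}

// String-value of a node-set: the string-value of its first node only
// (an empty node-set yields the empty string).
function nodeSetStringValue(nodeSet) {
    return nodeSet.length > 0 ? stringValue(nodeSet[0]) : '';
}

var firstRow = { tag: 'tr', children: [
    { tag: 'td', children: [
        { tag: 'a', children: [{ text: 'prefs' }] },
        { text: '|' },
        { tag: 'a', children: [{ text: 'submit' }] }
    ] }
] };
var secondRow = { tag: 'tr', children: [
    { tag: 'td', children: [
        { tag: 'a', children: [{ text: 'a post title' }] }
    ] }
] };

var value = nodeSetStringValue([firstRow, secondRow]);
```

Only firstRow contributes to value; the text inside secondRow is ignored, which is the same effect we saw in the "String Value" field.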

An XPath expression that matches continuations-related posts - rev2 - matching <tr>'s corresponding to post links

Our ultimate goal is to remove or hide the <tr> nodes corresponding to continuations-related posts, so we need to refine our expression. Let's narrow it down to just the <tr> element nodes of post links. To do this, we're going to (1.) incorporate a call to one of the XPath string functions, starts-with(), into our expression and (2.) nest our predicates:

//tr[td/a[starts-with(@id, 'title')]]

XPath has functions?


XPath includes a number of operators and functions (string, numeric, boolean, and node-set). There are many more in XPath 2.0 than in XPath 1.0.

The inner predicate matches nodes selected by the outer layers of the expression for which the function call starts-with(@id, 'title') returns true - nodes with id attributes that begin with the substring "title". Taken altogether, then, we're matching <tr> element nodes that contain <td> element nodes which, in turn, contain anchor element nodes that have id attribute values that start with "title". Applied to the pristine, saved version of programming.reddit.com, here's what this expression gives us:

Figure 14. "All Nodes" view of results for the XPath expression //tr[td/a[starts-with(@id, 'title')]].

The string-value of the node-set returned by this expression is the concatenation of all of the text nodes within the first matching <tr> element node and its children, "1.Sapir-Whorf is not a Klingon ( programming language vocabulary and thought limitations ) (oreillynet.com)". I've scrolled down through the "All Nodes" view in XPath Explorer to show that the first continuations-related post link's <tr> was also selected.

An XPath expression that matches continuations-related posts - rev3 - matching <tr>'s corresponding to posts where the link text includes the string "continuation"

We've now got a node-list of all of the <tr>'s containing post links. To accomplish the next step, winnowing down the <tr>'s in the node-list returned by our expression to just those containing the substring "continuation", we'll be using another XPath string function, contains(). Just like starts-with(), contains() accepts two arguments, the haystack and needle. We're also going to turn our outer predicate into a compound predicate by using and to chain together two conditions. Here's our XPath expression with the second half of the predicate modified to use contains():

//tr[td/a[starts-with(@id, 'title')] and contains(td/a, 'continuation')]

From XPath Explorer's "Matching Nodes" view, we can see that the node-list returned by our expression only contains a single node:

Figure 15. Left: "All Nodes" view of results for the XPath expression //tr[td/a[starts-with(@id, 'title')] and contains(td/a, 'continuation')]. Right: "Matching Nodes" view.

The lone matching node corresponds to the <tr> element node containing the link for post #37, titled "Why the JVM won't support continuations". Why weren't the other continuations-related posts' <tr>'s matched? XPath's string functions are case-sensitive and, while our needle above is "continuation" with a lowercase "c", all of the occurrences of the term except for the one in post #37 begin with a capital "C".

An XPath expression that matches continuations-related posts - rev4 - case-folding the link text (XPath 1.0 vs. XPath 2.0)

We can case-fold the link text easily enough, either to all-caps or to all-lowercase characters, but the route that we'll take depends on the version of XPath that we're working with. XPath 1.0 became a W3C Recommendation in late 1999 and XPath 2.0 is, at the time of writing, a candidate recommendation. The build date on our copy of XPath Explorer is 20030304, so it could have implemented XPath 1.0 or a working draft of XPath 2.0. Let's try both ways and see if one or both work.

For XPath 1.0 implementations, there are no to-upper/lower-case functions, but there is translate(), another string function. It takes three arguments, the string to be translated, a string of characters that you want to be replaced, and a corresponding string of characters that you want used for the replacement. To convert an arbitrary string to lower-case (the example below would give us "some string"), we'd use:

translate('SoMe StRiNg', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')

Fine points of using translate()


In a translate() call, each character in the second string argument is replaced by the character at the same position in the third. If the third string is shorter than the second, then any character listed in the second string with no counterpart in the third is removed from the string being translated. If the third string is longer than the second, the extra characters are simply ignored. For case-folding, then, make sure that you don't omit characters from either the second or the third string.
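These rules are easier to see in action. The function below is a rough JavaScript emulation of XPath 1.0's translate(), written from the description in the XPath 1.0 Recommendation; the name xpathTranslate is invented for this sketch, and it is not how Firefox or Jaxen implement the function internally.

```javascript
// Emulates XPath 1.0 translate(source, from, to) for ordinary characters.
function xpathTranslate(source, from, to) {
    var out = '';
    for (var i = 0; i < source.length; i++) {
        var ch = source.charAt(i);
        var idx = from.indexOf(ch);      // the first occurrence in "from" wins
        if (idx === -1) {
            out += ch;                   // not listed in "from": left untouched
        } else if (idx < to.length) {
            out += to.charAt(idx);       // replaced by its counterpart in "to"
        }
        // Otherwise the character is listed in "from" but has no
        // counterpart in "to", so it is dropped from the output.
    }
    return out;
}

var UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
var LOWER = 'abcdefghijklmnopqrstuvwxyz';

// Full alphabets on both sides: plain case-folding.
var folded = xpathTranslate('SoMe StRiNg', UPPER, LOWER);

// Third argument cut off after "m": N through Z are now deleted, not folded.
var truncated = xpathTranslate('SoMe StRiNg', UPPER, 'abcdefghijklm');
```

With the full alphabets, folded is "some string". With the truncated third argument, the capitals S, R, and N vanish altogether and truncated comes out as "ome tig", which is why a typo in either alphabet string can quietly break a case-insensitive match.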

Let's incorporate a call to translate() in our XPath expression and run it against the original saved version of programming.reddit.com:

//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]

We're getting the link-containing <tr> element nodes for all five continuations-related posts now:

Figure 16. Left: "All Nodes" view of results for the XPath expression //tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]. Right: "Matching Nodes" view. XPath 1.0-style case-folding works in XPath Explorer.

XPath 2.0 introduced, among numerous other new functions, lower-case() and upper-case(). Let's rewrite our expression using lower-case() in place of translate():

//tr[td/a[starts-with(@id, 'title')] and contains(lower-case(td/a), 'continuation')]

Though XPath Explorer relies on Jaxen 1.0 (dev build) for XPath parsing and the Jaxen developers say that they're not yet supporting XPath 2.0, it seems that lower-case() is supported in XPath Explorer:

Figure 17. "Matching Nodes" view of results for the XPath expression //tr[td/a[starts-with(@id, 'title')] and contains(lower-case(td/a), 'continuation')]. XPath 2.0-style case-folding works in XPath Explorer.

Before going further, let's try both expressions in Firefox (using XPather) by opening up the unaltered version of the saved programming.reddit.com page in the DOM Inspector, pasting an expression into the XPath field that appears once you've installed XPather and restarted Firefox, and clicking the "Eval" button.

For //tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]:

Figure 18. XPath expression //tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')] evaluated against the saved, pristine version of programming.reddit.com in Firefox using XPather.

XPath 1.0 vs. XPath 2.0


XPath 2.0 has an increased number of datatypes, lots of new operators, many new functions, and some new syntax (including stuff like for, if, some, and every expressions). John E. Simpson's XPath and Xpointer (ISBN 0-596-00291-2) does an admirable job of covering XPath 1.0. For your XPath 2.0 fix, I'd recommend Michael Kay's XPath 2.0 Programmer's Reference (ISBN 0764569104). In my opinion, Simpson's book provides, by far, the best introduction to XPath that you'll find in printed form and I would advise reading it before Kay's book.

For //tr[td/a[starts-with(@id, 'title')] and contains(lower-case(td/a), 'continuation')]:

Figure 19. XPath expression //tr[td/a[starts-with(@id, 'title')] and contains(lower-case(td/a), 'continuation')] evaluated in Firefox using XPather.

It seems that lower-case() hasn't been implemented in Firefox. A Bugzilla search turns up an enhancement request from just over a year ago, "Add support for XPath 2.0 Functions and Operators", that was "auto-resolved" due to lack of activity. A search of the Google Groups archive of netscape.public.mozilla.layout.xslt on the phrase "XPath 2.0" doesn't yield much either. All of this makes sense in light of Firefox's reliance on TransforMiiX.

From this point on, we'll evaluate our XPath expressions in Firefox only.

An XPath expression that matches continuations-related posts - rev5 - adding the preceding and follow-on <tr>'s to the node-list using a compound location path and the position() function

Adding the <tr> element node that occurs before and the two that occur after every node matched by the expression we've constructed up to this point is quite straightforward. Ultimately, since there are five continuations-related posts on our saved version of the programming.reddit.com homepage and each post is comprised of the contents of four <tr> element nodes, we should wind up with 5 * 4 = 20 <tr> element nodes selected. We'll be making use of compound location paths.

A compound location path is a way of including nodes that match different location paths in a single node-list by chaining the different location paths together using "pipe" or "union" characters, |. The key point to remember about using compound location paths is that each successive location path is evaluated independently of the previous one.

We talked about axes earlier and have, so far, used the following:

  • descendant-or-self::, represented by the shortcut //
  • child::, implicit
  • attribute::, represented by the shortcut @

Now, we're going to use two others, preceding-sibling:: and following-sibling::. These axes select, respectively, all of the nodes that come before and those that follow the node(s) selected in the previous location step and are at the same level within the document hierarchy (share the same parent node). Here's our first stab at the compound location path using the following-sibling:: axis (split onto two lines for the sake of readability) :

//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')] | 
//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]/following-sibling::node()

Given our definition of the following-sibling axis, the result is entirely predictable (hundreds of nodes selected):

Figure 20. Result of combining the node-list returned by the XPath expression //tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')] with the node-list of all of those nodes' following-siblings via a compound location path.

Of course, we don't want all of the following-siblings of the <tr> element nodes matching the first location path - just the next two <tr> element nodes after each of them. To get at them, we'll start by adding a predicate to the second location path in the expression above and using the position() function with explicit position numbers (a la [position()=1]). The position values start from 1, not 0. Keeping in mind the empty text nodes in between the element nodes that we're after, we'll be going for position values of 2 and 4:

//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')] | 
//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]/following-sibling::node()[position()=2 or position()=4]

Figure 21. Result of restricting the second location path to the second and fourth following-siblings of the nodes selected by the first location path.

This expression returns a node-list containing 15 nodes, the nodes containing each post's link and the 2 following <tr>'s. Now, we just need to add another location path to snare the preceding siblings. Again, we're skipping over an empty text node:

//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')] | 
//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]/following-sibling::node()[position()=2 or position()=4] | 
//tr[td/a[starts-with(@id, 'title')] and contains(translate(td/a, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'continuation')]/preceding-sibling::node()[position()=2]

Great. We've got 20 nodes:

Figure 22. Result of adding a third location path to catch the preceding <tr> element node in each post's set of <tr> element nodes.
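If the position numbers feel arbitrary, it may help to lay one post's siblings out in an array. The sketch below is an illustration only: the labels are hypothetical stand-ins for one post's four <tr> element nodes, and '#text' marks the whitespace-only text nodes that sit between them. It also shows why the preceding-sibling:: step uses 2 as well: that axis counts backward from the context node, so position 1 is the nearest sibling before it.

```javascript
// One post's siblings under their shared parent, in document order.
var siblings = ['tr#1', '#text', 'tr#2', '#text', 'tr#3', '#text', 'tr#4'];

// following-sibling::node()[position()=n]: count forward from the context node.
function followingSibling(nodes, contextIndex, n) {
    return nodes[contextIndex + n];
}

// preceding-sibling::node()[position()=n]: a reverse axis, so position 1
// is the sibling immediately before the context node.
function precedingSibling(nodes, contextIndex, n) {
    return nodes[contextIndex - n];
}

// The link-bearing row is the second <tr> of the four.
var context = siblings.indexOf('tr#2');

var picked = [
    precedingSibling(siblings, context, 2),
    siblings[context],
    followingSibling(siblings, context, 2),
    followingSibling(siblings, context, 4)
];
```

Positions 1 and 3 on the following-sibling side, and position 1 on the preceding-sibling side, would each land on a '#text' entry, which is why we skip them.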

Now, let's look at incorporating our XPath expression into a Greasemonkey userscript.


Incorporating the XPath expression into a GM user script

A De-continuations-ifying script using XPath - document.evaluate()

Before we proceed further ...


Now would be a good time to read "Introduction to using XPath in JavaScript" at the Mozilla Developer Center.

The heart of our script is going to be the document.evaluate() call. It takes the form:


var xpathResult = document.evaluate(xpathExpression, contextNode, namespaceResolver, resultType, result);

Plugging in the suggested values from the Mozilla Dev Center article (document for the contextNode, null for the namespaceResolver since we're dealing with an XHTML page, XPathResult.ANY_TYPE for the resultType, and null for the result) gives us:

var xpathResult = document.evaluate(
    '//tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")] | //tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")]/following-sibling::node()[position()=2 or position()=4] | //tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")]/preceding-sibling::node()[position()=2]', 
    document, 
    null, 
    XPathResult.ANY_TYPE,
    null
);

The single quotes within the XPath expression have been swapped out for double quotes so that the entire expression could be wrapped in single quotes. In XPath, strings can be either single or double quoted so long as they match and are nested properly.

Choosing a resultType

If you've read the Mozilla Dev Center tutorial, then you know that node-sets from evaluated XPath expressions can be returned in three forms: Iterator, Snapshot, or First Nodes. As you'd guess, First Nodes results, corresponding to XPathResult.FIRST_ORDERED_NODE_TYPE, consist simply of the first node in the node-set that's returned by a given expression. Since we want to remove or hide all of the nodes that are going to be returned, that's obviously not what we need.

Iterator and Snapshot results each come in two types, ordered and unordered. Whether the nodes are returned in the order in which they appear in the document or not makes little difference in our case, but we need to know which of the two we're dealing with because the methods used to access the nodes within Iterators and Snapshots are different - iterateNext() for the former and snapshotItem(), which takes an integer index value, for the latter. If we used the value XPathResult.ANY_TYPE and let Firefox's XPath implementation choose the result type for us, we'd have to inspect the result's resultType property before we knew how to work our way through the results.

We could ask for either an Iterator or a Snapshot. The only reason for choosing one over the other, in our case, might be that Iterator type results could be invalidated by changes to the DOM of the page after the document.evaluate() call and before or during the process of iterating through the results so we'd need to handle the exceptions that could be raised. The Reddit site doesn't seem to be doing a lot of DOM monkeying on its own after the page has loaded, so this doesn't seem to be a major concern, but we'll still go with an ordered Snapshot.
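The difference between the two access styles can be sketched as a small helper that copies the nodes of either kind of result into a plain array. The function name collectNodes is invented for this example, and the numeric constants are the values that the DOM Level 3 XPath specification assigns to the corresponding XPathResult members; they are repeated here only so that the sketch doesn't depend on a browser.

```javascript
// Numeric values of the relevant XPathResult constants.
var UNORDERED_NODE_ITERATOR_TYPE = 4;
var ORDERED_NODE_ITERATOR_TYPE = 5;
var UNORDERED_NODE_SNAPSHOT_TYPE = 6;
var ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Returns the nodes of a node-set result as a plain array, whichever
// of the four node-set result types it turns out to be.
function collectNodes(result) {
    var nodes = [];
    var type = result.resultType;
    if (type === UNORDERED_NODE_ITERATOR_TYPE || type === ORDERED_NODE_ITERATOR_TYPE) {
        // Iterators hand back one node per call and null when exhausted.
        var node;
        while ((node = result.iterateNext())) {
            nodes.push(node);
        }
    } else if (type === UNORDERED_NODE_SNAPSHOT_TYPE || type === ORDERED_NODE_SNAPSHOT_TYPE) {
        // Snapshots are indexed, like an array with a snapshotLength.
        for (var i = 0; i < result.snapshotLength; i++) {
            nodes.push(result.snapshotItem(i));
        }
    } else {
        throw new Error('Not an iterator or snapshot result: ' + type);
    }
    return nodes;
}
```

In a user script you would pass it the object returned by document.evaluate(). Copying an Iterator's nodes into an array before hiding or removing anything is one way to avoid invalidating the iterator with your own DOM changes, though it does nothing about changes the page makes before the loop runs.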

An XPath-based de-continuations-ifying script

Here's a first stab at a script to hide the nodes comprising continuations-related Reddit posts:

// Evaluate the compound expression and ask for an ordered snapshot of the matching nodes.
var xpathResult = document.evaluate('//tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")] | //tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")]/following-sibling::node()[position()=2 or position()=4] | //tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")]/preceding-sibling::node()[position()=2]', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
// Debugging aid: report how many nodes matched (20 on our saved page).
alert('xpathResult.snapshotLength: ' + xpathResult.snapshotLength);
for (var i=0; i<xpathResult.snapshotLength; i++) {
    var nodeToHide = xpathResult.snapshotItem(i);
    alert('i: ' + i);
    // Hide the node rather than removing it from the document.
    nodeToHide.style.display='none';
}

The Greasemonkey user script

It works. We can remove the alert() calls, add in the Greasemonkey metadata, and save it as xpath-de-continuations-ify-programming.reddit.com.user.js.
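The finished file isn't reproduced in this section, so what follows is only a sketch of how it might look. The @name and @namespace values are placeholders, the @description is adapted from the metadata of the earlier DOM-walking script (the @include is taken from it unchanged), and the function name hideMatches and the splitting of the expression into pieces are choices made for this example rather than anything Greasemonkey requires.

```javascript
// NOTE: the @name and @namespace values below are placeholders.
// ==UserScript==
// @name          De-continuations-ify programming.reddit.com (XPath)
// @namespace     http://example.com/userscripts
// @description   Hides posts whose link text includes the substring "continuation".
// @include       http://programming.reddit.com/*
// ==/UserScript==

// The <tr> holding a post link whose lower-cased text contains the needle.
var ROW = '//tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, ' +
    '"ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "continuation")]';

// That row, the two rows after it, and the one row before it.
var EXPRESSION = ROW + ' | ' +
    ROW + '/following-sibling::node()[position()=2 or position()=4] | ' +
    ROW + '/preceding-sibling::node()[position()=2]';

// 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
var ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hides every node matched by the expression and returns how many were hidden.
function hideMatches(doc, expression) {
    var result = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0; i < result.snapshotLength; i++) {
        result.snapshotItem(i).style.display = 'none';
    }
    return result.snapshotLength;
}

// Only run automatically when a real page is present.
if (typeof document !== 'undefined') {
    hideMatches(document, EXPRESSION);
}
```

Passing the document in as a parameter, instead of reaching for the global inside the function, makes it possible to exercise hideMatches() against a stand-in object without loading Reddit at all.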


Wrap-up

Compare the de-continuations-ifying script written around the document.evaluate() call to the one that relied on DOM tree walking. The complexity has all been moved into the XPath expression and, in a way, comparing the two scripts is analogous to comparing two functions that perform some operations on strings where one function uses the programming language's native string handling functions and the other relies on regular expressions.

The XPath version is significantly shorter and, when site owners remix their html as they are wont to do, you would only have to rewrite your XPath expression. On the other hand, you will have had to absorb enough XPath 1.0 to get rolling. If you plan on distributing your user scripts, then there's also the question of whether you feel that the level of familiarity with XPath within the community of people who write and modify Greasemonkey user scripts is high enough that others will be able to modify your user scripts as easily as many can rework DOM tree-walking-based scripts.

This last point is worth considering. Web sites change and even subtle changes not visible in a browser without viewing source can break user scripts. Our script, for example, would still break (or at least exhibit unexpected behavior) if Reddit changed its post format to use more or fewer <tr>'s, changed the position of the post-link-bearing <tr> within the set of <tr>'s associated with each post, changed the id attribute value format for post links, or abandoned the use of tables altogether.

Changing the needle


As I was tidying up this article for posting, I decided to replace the "needle" value (we use "continuation" throughout) with a variable. Continuations-related posts aren't the flavor of the month at Reddit anymore (though they'll probably burst back onto the main page of the programming subreddit at some point), so I wanted to make it easier for readers to try substituting different values for strNeedle at the top of the script. I just de-haskell-ified and de-java-ified programming.reddit.com a moment ago and both scripts still work.
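For the XPath version, one way to do that substitution is to assemble the expression from the needle with a small function. This is a sketch rather than the code in the downloadable scripts: only the strNeedle name comes from the text above, while buildExpression and its handling of quotes and letter case are assumptions made for this example.

```javascript
var UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
var LOWER = 'abcdefghijklmnopqrstuvwxyz';

// Builds the three-path compound expression for an arbitrary needle.
function buildExpression(strNeedle) {
    // XPath 1.0 string literals have no escape mechanism, so refuse
    // needles containing the quote character used to delimit them.
    if (strNeedle.indexOf('"') !== -1) {
        throw new Error('Needle must not contain a double quote');
    }
    // The link text is lower-cased by translate(), so lower-case the needle too.
    var needle = strNeedle.toLowerCase();
    var row = '//tr[td/a[starts-with(@id, "title")] and contains(translate(td/a, "' +
        UPPER + '", "' + LOWER + '"), "' + needle + '")]';
    return [
        row,
        row + '/following-sibling::node()[position()=2 or position()=4]',
        row + '/preceding-sibling::node()[position()=2]'
    ].join(' | ');
}

var haskellExpression = buildExpression('Haskell');
```

The result can be handed straight to document.evaluate() in place of the hard-coded string. Bear in mind that translate() here only folds the letters A through Z, so a needle containing accented capitals would need those characters added to both alphabet strings.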

Changes

  • Last modified: $Date: 2007-05-08 17:24:00 +0800 (Tue, 08 May 2007) $
  • 2007-05-08: Replaced backslash with slash. Thanks to Dave Land for pointing out my error.
  • 2007-05-03: significant reformatting
  • 2007-03-06: minor reformatting
  • 2006-10-27: originally posted