xref: Development Notes | WebReference

xref: Development Notes

By Dan Ragle





Both the static and dynamic versions of xref share the same base code delivered to the browser. The static version is built on this core code, and the dynamic version includes it in the init_js.tmpl file (which in turn is delivered to the browser, taking into account the parameters passed to xref.js). This core code, which the browser executes client-side, provides the bulk of the script's functionality and can be summarized as:
  1. Scan and aggregate the text of the page
  2. Compare the aggregated text of the page to the provided term list, noting all matching terms
  3. Re-scan the page, this time replacing the actual found terms with clickable links

Why scan the page twice? Remember that our terms list may include many, many terms; perhaps a thousand or more. By first scanning and aggregating the text of the page into a single text string, we can perform a single match against that string for each term in our list, and then work only with those terms that actually matched for the remainder of the script processing. Without this aggregation, we would have to perform a match for all of our terms against each and every text node of the page (and on any given page there could be dozens of text nodes or more). The scanning two-step described above thus avoids applying regular expression matches for thousands of terms dozens of times.
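The idea behind the first pass can be sketched in a few lines. This is a minimal illustration rather than the actual xref code: the function name findMatchingTerms, the escape pattern, and the newline separator (standing in for the WR_BREAK markers) are all assumptions made for the example.

```javascript
// Hypothetical helper: returns only the terms that occur somewhere in the page text.
function findMatchingTerms(textChunks, terms) {
  // Assumed escape pattern for regular expression metacharacters.
  var metaRE = /([.*+?^${}()|\[\]\\\/])/g;
  // Join with a newline so a multi-word term cannot match across two nodes.
  var pageText = textChunks.join('\n');
  var matched = [];
  for (var i = 0; i < terms.length; i++) {
    var termRE = new RegExp('\\b(' + terms[i].replace(metaRE, '\\$1') + ')\\b', 'i');
    // One test per term against the whole page, not one per text node.
    if (termRE.test(pageText)) matched.push(terms[i]);
  }
  return matched;
}

var matchedTerms = findMatchingTerms(
  ['An apple a day', 'keeps the doctor away.'],
  ['Apple', 'banana', 'doctor', 'day keeps']
);
```

Only the terms in matchedTerms would need to be considered on the second pass, which is where the savings come from.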

The Core Function: fScanText

Providing the heavy lifting of the script (specifically, providing the functionality for both scanning passes) is fScanText:

fValidParent:function(eElement) {
  if (eElement && eElement.nodeType && 
      (eElement.nodeType == 1) && 
      (eElement.hasChildNodes())) return true;
  else return false;
},
fValidNode:function(eElement, nodeClassAllowed) {
  if (nodeClassAllowed && eElement && 
      eElement.nodeType && eElement.parentNode && 
      eElement.parentNode.nodeName &&
     (eElement.nodeType == 3) &&
     this.rAllowedElements.test(eElement.parentNode.nodeName)) return true;
  else return false;
},
fScanText:function(eElement, nodeClassAllowed, scanType) {
  if (this.fValidNode(eElement, nodeClassAllowed)) {
    if (scanType == 'scan') {
      if (!this.sCheckSplit) 
         this.sCheckSplit = (eElement.splitText) ? 1 : 2;
      this.aText[this.aText.length] = eElement.nodeValue || '';
    }
    else {
      termLoop :
      for (var i = 0; i < this.aLookupData.length; i++) {
        var theObj = this.aLookupData[i];
        if (window.location.href == theObj.u) continue;
        var theTerm = theObj.t;
        var oTermRE = new RegExp('\\b(' + 
                      theTerm.replace(this.rMetaRE, "\\$1") + 
                      ')\\b', 'i');
        while (oTermRE.test(eElement.nodeValue)) {
          var oExactRE = new RegExp('^' + 
                         theTerm.replace(this.rMetaRE, "\\$1") + 
                         '$', 'i');
          if (oExactRE.test(eElement.nodeValue)) {
            var theAnchor = document.createElement('A');
            theAnchor.href = theObj.u;
            theAnchor.className = 'autoLink';
            theAnchor.target = '_new';
            if (theObj.d) theAnchor.title=theObj.d;
            theAnchor.appendChild(document.createTextNode(eElement.nodeValue));
            eElement.parentNode.replaceChild(theAnchor, eElement);
            break termLoop;
          }
          else {
            oTermRE.lastIndex = 0;
            var theStart = eElement.nodeValue.search(oTermRE);
            var theEnd = theStart + theTerm.length;
            if (theEnd != eElement.nodeValue.length) 
               eElement.splitText(theEnd);
            if (theStart != 0) eElement.splitText(theStart);
            oTermRE.lastIndex = 0;
          }
        }
      }
    }
  }
  else if (this.fValidParent(eElement)) {
    if (scanType == 'scan') this.aText[this.aText.length] = 'WR_BREAK';
    if ((!this.rIgnoreRE.test(eElement.className)) && 
         (!this.rExcludedElements.test(eElement.nodeName))) {
      if (!nodeClassAllowed) 
         nodeClassAllowed = (this.rAllowRE.test(eElement.className)) ? 1 : 0;
      for (var i = 0; i < eElement.childNodes.length; i++) {
        this.fScanText(eElement.childNodes[i], nodeClassAllowed, scanType);
      }
    }
    if (scanType == 'scan') this.aText[this.aText.length] = 'WR_BREAK';
  }
},

(Note that your version of fScanText may be slightly different from what's displayed here, depending on the options you've passed to xref.js.)

In a nutshell, fScanText starts with a top-level document object (which is passed into the function and would usually be document.body). It walks through the children of this object, recursively walking through their children--and so on--until all of the elements of the page have been examined. Each text node found is either aggregated for scanning (on the first pass through the function) or examined for potential link insertion (on the second pass). The helper functions fValidParent and fValidNode determine whether the node currently being examined is a valid container or a valid text node, respectively.
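The recursive walk can be illustrated on its own, stripped of the class and parent-element checks that fScanText performs. The sketch below is not the xref code: collectText and fakeBody are hypothetical names, and the tree is built from plain objects so the example runs outside a browser. In a page you would pass document.body instead.

```javascript
var ELEMENT_NODE = 1, TEXT_NODE = 3;

// Hypothetical simplified walker: gathers the value of every text node, depth first.
function collectText(node, out) {
  if (!node) return out;
  if (node.nodeType === TEXT_NODE) {
    out.push(node.nodeValue || '');
  } else if (node.nodeType === ELEMENT_NODE && node.childNodes) {
    for (var i = 0; i < node.childNodes.length; i++) {
      collectText(node.childNodes[i], out);
    }
  }
  // Any other node type (a comment, for example) is skipped.
  return out;
}

// Plain-object stand-in for a small DOM tree (an assumption for illustration).
var fakeBody = {
  nodeType: 1, nodeName: 'BODY', childNodes: [
    { nodeType: 1, nodeName: 'P', childNodes: [
      { nodeType: 3, nodeValue: 'An apple a day ' },
      { nodeType: 1, nodeName: 'B', childNodes: [
        { nodeType: 3, nodeValue: 'keeps the doctor away.' }
      ] }
    ] },
    { nodeType: 8, nodeValue: 'a comment' }
  ]
};

var chunks = collectText(fakeBody, []);
```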

Of particular interest is the code that actually inserts the links into the page when a match is found, which leverages the handy DOM methods splitText and replaceChild. If you haven't encountered them before, splitText divides a text node into two nodes at a specified offset in the original node, and replaceChild--as its name implies--replaces a specified child node with another. We use splitText to chop the text nodes down until they contain only matching terms. For example, suppose we're looking for the term Apple and the particular text node we're examining contains this:

Node1: An apple a day keeps the doctor away.

Then we'll first split the text node into two text nodes; the first containing everything up to and including the targeted term, and the second containing everything else:

Node1: An apple
Node2:  a day keeps the doctor away.

Next, we split the node again--this time separating everything prior to the term from the term itself:

Node1: An 
Node2: apple
Node3:  a day keeps the doctor away.

Having isolated the term, we again check the resulting node to see if it contains the key term we're looking for. In this case, the resulting node is An , which doesn't match our term, so we continue with the next term in the list. This step is critical: it ensures that we check each term against each node, including the new nodes we've created. Note especially the use of the lastIndex property of our regular expression object:

            oTermRE.lastIndex = 0;

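Resetting lastIndex to 0 ensures that the next test in the while loop (while (oTermRE.test(eElement.nodeValue))) starts from the beginning of the node's text. Strictly speaking, lastIndex only affects regular expressions created with the g (or y) flag, so for the 'i'-only expression shown here the reset is a safeguard rather than a requirement. The recheck itself is necessary, since our text node splitting above may have produced a text node with an immediate exact term match.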
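One detail worth knowing here is that test() only consults lastIndex for regular expressions created with the g (or y) flag. The short sketch below, which is separate from the xref code and uses made-up variable names, shows the difference. For an expression built with only the 'i' flag, as in the listing above, the reset is harmless either way.

```javascript
var sampleText = 'apple';

// Without the g flag, test() always starts at index 0 and ignores lastIndex.
var plainRE = /apple/i;
plainRE.lastIndex = 3;
var plainResult = plainRE.test(sampleText);      // true

// With the g flag, test() resumes from lastIndex.
var globalRE = /apple/gi;
var firstTest = globalRE.test(sampleText);       // true; lastIndex is now 5
var withoutReset = globalRE.test(sampleText);    // false; the search began at index 5

globalRE.test(sampleText);                       // matches again; lastIndex is 5 once more
globalRE.lastIndex = 0;                          // explicit reset
var withReset = globalRE.test(sampleText);       // true
```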

Once we've found an isolated text node with an exactly matching term (as would be the case with the new Node2 in the example above), we create a new link and replace the matched node with it using replaceChild and standard DOM techniques.
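The offsets used for the two splits can be checked without a DOM. The sketch below applies the same arithmetic as the listing (search for the start, add the term length for the end) to a plain string. The function name isolateTerm and the escape pattern are assumptions for illustration; in the browser, the two boundaries would be passed to splitText rather than slice.

```javascript
// Hypothetical helper: returns [before, term, after], or null when the term is absent.
function isolateTerm(text, term) {
  var metaRE = /([.*+?^${}()|\[\]\\\/])/g;   // assumed escape pattern
  var termRE = new RegExp('\\b(' + term.replace(metaRE, '\\$1') + ')\\b', 'i');
  var start = text.search(termRE);
  if (start === -1) return null;
  var end = start + term.length;
  // In the DOM version: splitText(end) first, then splitText(start).
  return [text.slice(0, start), text.slice(start, end), text.slice(end)];
}

var parts = isolateTerm('An apple a day keeps the doctor away.', 'Apple');
// parts is ['An ', 'apple', ' a day keeps the doctor away.']
```

Splitting at the end offset first matters in the DOM version: after that split the original node is shorter, but the start offset still refers to the same position within it.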

Note also that, as part of the fValidNode function, each text node is checked to be an immediate child of one of a specific list of elements: namely, divs, spans, paragraphs (p), italics (i or em), or bold text (b or strong). Additionally, the script won't insert links within already existing links on a page. These checks are an attempt to insert links only in logical areas of the page, i.e., areas where the user would expect text links to appear. As a result, the script may appear to be broken, because it may ignore terms that you believe should be automatically linked. If this is a problem for your particular implementation of xref, then you'll need to adjust this line of code in init_js.tmpl to include the additional tags that you would like considered:

this.rAllowedElements=/^(DIV|SPAN|P|STRONG|B|EM|I)$/i;

Remember that this check is made against the immediate parent of a text node; it's not made against the elements of the page as a whole. In other words, if your entire page is enclosed within a table, xref can still find the terms within it (provided your table has divs, spans, paragraphs, etc. within it).
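As an illustration of such an adjustment, the sketch below extends the pattern so that text sitting directly inside list items and table cells would also qualify. The choice of LI and TD is only an example, and isAllowedParent is a hypothetical wrapper added so the pattern can be tried in isolation.

```javascript
// Example only: the original list plus LI and TD.
var rAllowedElements = /^(DIV|SPAN|P|STRONG|B|EM|I|LI|TD)$/i;

// Hypothetical wrapper for trying the pattern against parent node names.
function isAllowedParent(nodeName) {
  return rAllowedElements.test(nodeName);
}

var liAllowed = isAllowedParent('LI');        // true with the extended pattern
var tableAllowed = isAllowedParent('TABLE');  // false: only the listed names match
var anchorAllowed = isAllowedParent('A');     // false: existing links stay excluded
```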

Dynamically Inserting an External JavaScript

To conserve space with the initial JavaScript delivery to the browser, the dynamic version of xref breaks the two scan passes described above into two separate JavaScript calls: The first is included with the initial JavaScript provided, and the second is made following a second call to xref.js. By performing the scans in this way, the initial JavaScript need only contain the terms themselves--without their associated descriptions--allowing us to save bytes with the initial JavaScript download. Bearing in mind that we want to offer this script to our affiliates without requiring them to install the script on their own servers, how can we then recall the script, passing to it the matched terms, so we can retrieve the second portion of the script (which will contain only the matched terms, but with their descriptions)?

The answer is to build the necessary script URL (including the parameters we need to tell xref.js what we want to do) and then insert a new script element into the DOM dynamically, like this:

var actionURL = 'http://example.com/cgi-bin/xref.js?action=select&term_list=' + 
                encodeURIComponent('1,3,5') +   // the terms we need
                '&prefix=' + encodeURIComponent("WR_");
var scriptEl = document.createElement('script');
scriptEl.type = 'text/javascript';
scriptEl.src = actionURL;
document.getElementsByTagName("head")[0].appendChild(scriptEl);

[Update: earlier versions of the script--prior to .53--used the simpler document.body.appendChild(scriptEl) to add the new script to the page, but this implementation proved to be problematic in Internet Explorer due to a documented bug of the browser (see http://support.microsoft.com/kb/927917/en-us for further details of the bug). Inserting the script in the <head> avoids that bug and is generally considered to be good form. -- Dan]

This is a simplified version of the actual code in xref, but it shows enough to convey the technique, which works well in all later-version browsers (including at least FF 2/3, IE 6/7, Saf 3, Opera 9). Note that we must apply encodeURIComponent to all of the unknown values that we pass to the script (in the above example they're just literal values, but in the actual script we don't know what they are) so the information will be passed to xref.js properly. Once xref.js receives the above call, it will examine the parameters, see that a second pass is requested (action=select), and return the terms and descriptions along with the necessary fScanText command to execute the second scan and insert the links.
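When the values are not literals, the URL construction might be wrapped in a small function like the one below. This is a sketch, not the code xref uses: buildSelectURL is a hypothetical name, and the matched terms are assumed to be identified by numeric IDs, as the '1,3,5' in the example suggests. The parameter names are the ones shown above.

```javascript
// Hypothetical helper: builds the second-pass URL from a list of matched term IDs.
function buildSelectURL(baseURL, matchedIds, prefix) {
  return baseURL +
         '?action=select&term_list=' + encodeURIComponent(matchedIds.join(',')) +
         '&prefix=' + encodeURIComponent(prefix);
}

var selectURL = buildSelectURL('http://example.com/cgi-bin/xref.js', [1, 3, 5], 'WR_');
// selectURL is
// 'http://example.com/cgi-bin/xref.js?action=select&term_list=1%2C3%2C5&prefix=WR_'
```

Note that encodeURIComponent escapes the commas as %2C, so the back-end script is assumed to decode the parameter before splitting it.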

Unfortunately, it's exactly this type of technique that makes cross-site scripting security vulnerabilities possible in unsuspecting Web pages. To prevent this type of abuse in xref, we convert all angle brackets passed into the script into literal HTML entities. If a malicious individual tries to call xref.js with their own script inserted, xref will just display the code literally and harmlessly within its own script as &lt;script .... etc. blocks.
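The conversion itself is simple. In xref it happens on the server, inside xref.js, so the JavaScript below is only an illustration of the transformation; escapeAngleBrackets is a hypothetical name and not part of the script.

```javascript
// Hypothetical illustration: replace every angle bracket with its HTML entity.
function escapeAngleBrackets(value) {
  return String(value).replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

var escaped = escapeAngleBrackets('<script>alert(1)</script>');
// escaped is '&lt;script&gt;alert(1)&lt;/script&gt;'
```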

Conclusion

While I believe xref does what it does pretty well, there's certainly room for improvement. Its overall design works well for lists of terms up to, say, 1,000 entries; but beyond that it may become sluggish, primarily because of the amount of data that would need to be passed to the browser. One possible remedy is to pass the text of the page to the back-end script for evaluation instead (as opposed to passing the terms to scan for to the browser); but there you run into potential problems with the length of data that can be included within a single URL. Nonetheless, the advantage of this type of approach (limiting the initial JavaScript sent to the browser to about 4k or less) may be worth the effort.

If there's enough call to pursue this then I can do so in a future release; or, if you have something else on your mind--either about future directions of the script, or about its current functionality--don't hesitate to let me know (you can use the comment form below to do precisely that). It's my hope that the script can evolve with your input in mind. In the meantime, I hope you enjoy using xref on your own sites and networks!

