xref [con't]
xref: Developer Notes
Both the static and dynamic versions of xref share the same base code delivered to the browser. The static version is based on this core code, and the dynamic version includes it in the init_js.tmpl file
(which in turn is delivered to the browser, taking into account the parameters
passed to xref.js). This core code, which is executed client-side
by the browser itself, provides the bulk of the script's functionality and can
be summarized as:
- Scan and aggregate the text of the page
- Compare the aggregated text of the page to the provided term list, noting all matching terms
- Re-scan the page, this time replacing the actual found terms with clickable links
Why scan the page twice? Remember that our terms list may include many, many terms; perhaps a thousand or more. By first scanning and aggregating the text of the page into a single text string, we can perform a single match against that string for each term in our list, and then work only with those terms that actually matched for the remainder of the script processing. Without this aggregation, we would have to perform a match for all of our terms against each and every text node of the page (and on any given page there could be dozens of text nodes or more). The two-step scan described above thus avoids applying regular expression matches for thousands of terms dozens of times.
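The filtering step between the two passes can be sketched roughly as follows. This is only an illustration, not the code xref ships: the helper name findMatchingTerms and the sample data are made up for this example, and the escaping pattern is an assumption.

```javascript
// Hypothetical helper (not part of xref): return only the terms that occur
// somewhere in the aggregated page text, so that the second pass only has
// to deal with terms that are known to be present.
function findMatchingTerms(aggregatedText, terms) {
  var matched = [];
  for (var i = 0; i < terms.length; i++) {
    // Escape regex metacharacters so the term is matched literally.
    var escaped = terms[i].replace(/([.*+?^${}()|\[\]\\])/g, '\\$1');
    var termRE = new RegExp('\\b' + escaped + '\\b', 'i');
    if (termRE.test(aggregatedText)) matched.push(terms[i]);
  }
  return matched;
}

// Text nodes gathered in the first pass, joined into one string.
var pageText = ['An apple a day', 'keeps the doctor away.'].join(' WR_BREAK ');
var matches = findMatchingTerms(pageText, ['Apple', 'Banana', 'doctor']);
// matches is ['Apple', 'doctor']
```

Each term is tested once against one string, no matter how many text nodes the page contains.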
The Core Function: fScanText
Providing the heavy lifting of the script (specifically, providing the functionality
for both scanning passes) is fScanText:
fValidParent:function(eElement) {
    if (eElement && eElement.nodeType &&
        (eElement.nodeType == 1) &&
        (eElement.hasChildNodes())) return true;
    else return false;
},
fValidNode:function(eElement, nodeClassAllowed) {
    if (nodeClassAllowed && eElement &&
        eElement.nodeType && eElement.parentNode &&
        eElement.parentNode.nodeName &&
        (eElement.nodeType == 3) &&
        this.rAllowedElements.test(eElement.parentNode.nodeName)) return true;
    else return false;
},
fScanText:function(eElement, nodeClassAllowed, scanType) {
    if (this.fValidNode(eElement, nodeClassAllowed)) {
        if (scanType == 'scan') {
            if (!this.sCheckSplit)
                this.sCheckSplit = (eElement.splitText) ? 1 : 2;
            this.aText[this.aText.length] = eElement.nodeValue || '';
        }
        else {
            termLoop :
            for (var i = 0; i < this.aLookupData.length; i++) {
                var theObj = this.aLookupData[i];
                if (window.location.href == theObj.u) continue;
                var theTerm = theObj.t;
                var oTermRE = new RegExp('\\b(' +
                    theTerm.replace(this.rMetaRE, "\\$1") +
                    ')\\b', 'i');
                while (oTermRE.test(eElement.nodeValue)) {
                    var oExactRE = new RegExp('^' +
                        theTerm.replace(this.rMetaRE, "\\$1") +
                        '$', 'i');
                    if (oExactRE.test(eElement.nodeValue)) {
                        var theAnchor = document.createElement('A');
                        theAnchor.href = theObj.u;
                        theAnchor.className = 'autoLink';
                        theAnchor.target = '_new';
                        if (theObj.d) theAnchor.title=theObj.d;
                        theAnchor.appendChild(document.createTextNode(eElement.nodeValue));
                        eElement.parentNode.replaceChild(theAnchor, eElement);
                        break termLoop;
                    }
                    else {
                        oTermRE.lastIndex = 0;
                        var theStart = eElement.nodeValue.search(oTermRE);
                        var theEnd = theStart + theTerm.length;
                        if (theEnd != eElement.nodeValue.length)
                            eElement.splitText(theEnd);
                        if (theStart != 0) eElement.splitText(theStart);
                        oTermRE.lastIndex = 0;
                    }
                }
            }
        }
    }
    else if (this.fValidParent(eElement)) {
        if (scanType == 'scan') this.aText[this.aText.length] = 'WR_BREAK';
        if ((!this.rIgnoreRE.test(eElement.className)) &&
            (!this.rExcludedElements.test(eElement.nodeName))) {
            if (!nodeClassAllowed)
                nodeClassAllowed = (this.rAllowRE.test(eElement.className)) ? 1 : 0;
            for (var i = 0; i < eElement.childNodes.length; i++) {
                this.fScanText(eElement.childNodes[i], nodeClassAllowed, scanType);
            }
        }
        if (scanType == 'scan') this.aText[this.aText.length] = 'WR_BREAK';
    }
},
(Note that your version of fScanText may differ slightly from
what's displayed here, depending on the options you've passed to xref.js.)
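The listing relies on this.rMetaRE to escape regular expression metacharacters in each term, but its definition isn't shown here. As a rough sketch of why that escaping matters, the following assumes a plausible pattern for it; both the pattern and the helper name buildTermRE are assumptions made for illustration.

```javascript
// Assumed stand-in for xref's rMetaRE (the real definition is not shown):
// captures any regex metacharacter so it can be prefixed with a backslash.
var rMetaRE = /([.*+?^${}()|\[\]\\])/g;

// Hypothetical helper that mirrors how the listing builds oTermRE.
function buildTermRE(term) {
  return new RegExp('\\b(' + term.replace(rMetaRE, '\\$1') + ')\\b', 'i');
}

var escapedRE = buildTermRE('Node.js');          // the dot is matched literally
var unescapedRE = new RegExp('\\b(Node.js)\\b', 'i'); // the dot matches any character

var hitsRealTerm = escapedRE.test('I use node.js daily');  // true
var escapedFalseHit = escapedRE.test('Nodexjs');           // false
var unescapedFalseHit = unescapedRE.test('Nodexjs');       // true (unwanted)
```

Without the escaping step, a term containing a dot, parenthesis, or similar character could either match the wrong text or produce an invalid regular expression.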
In a nutshell, fScanText starts with a top-level document object (which
is passed into the function and would usually be document.body). It walks
through the children of this object, recursively walking through their children--and so
on--until all of the elements of the page have been examined. Each text node found is either aggregated for scanning (on the first pass through
the function) or examined for potential link insertion (on the second pass). The
helper functions fValidParent and fValidNode determine whether
the node currently being examined is a valid container or a valid text node,
respectively.
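The first-pass walk can be approximated with plain objects standing in for DOM nodes, which makes the recursion easy to try outside a browser. This is a simplified sketch: the helpers el, txt, and collectText are invented for this example, and the class-based allow/ignore checks from the real function are left out.

```javascript
// Plain objects standing in for DOM nodes (nodeType 1 = element, 3 = text).
function el(name, children) {
  return { nodeType: 1, nodeName: name, childNodes: children };
}
function txt(value) {
  return { nodeType: 3, nodeValue: value };
}

// Same element list as rAllowedElements in the article.
var ALLOWED = /^(DIV|SPAN|P|STRONG|B|EM|I)$/i;

// Hypothetical, simplified version of the 'scan' pass: text nodes whose
// parent is an allowed element are collected; every container contributes
// a WR_BREAK marker before and after its children.
function collectText(node, parentName, out) {
  if (node.nodeType === 3) {
    if (ALLOWED.test(parentName)) out.push(node.nodeValue || '');
  } else if (node.nodeType === 1 && node.childNodes.length > 0) {
    out.push('WR_BREAK');
    for (var i = 0; i < node.childNodes.length; i++) {
      collectText(node.childNodes[i], node.nodeName, out);
    }
    out.push('WR_BREAK');
  }
  return out;
}

var tree = el('BODY', [
  el('P', [txt('An apple a day '), el('B', [txt('keeps')]), txt(' the doctor away.')]),
  el('TD', [txt('skipped: TD is not an allowed parent')])
]);
var collected = collectText(tree, '', []);
var textOnly = collected.filter(function (s) { return s !== 'WR_BREAK'; });
// textOnly is ['An apple a day ', 'keeps', ' the doctor away.']
```

The WR_BREAK markers keep text from neighboring containers from running together in the aggregated string, so a term cannot accidentally match across an element boundary.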
Of particular interest is the code that actually inserts the links into the
page when a match is found, which leverages the handy DOM methods splitText
and replaceChild. If you haven't encountered them before, splitText
allows us to divide a text node into two nodes, specifying the point in the original
node where we want the split to occur, and replaceChild--as its name
implies--allows us to replace a specified child node with another. We use splitText
to chop the text nodes down until they contain only matching terms. For example, if we're
looking for the term Apple and the particular text node we're examining
contains this:
Node1: An apple a day keeps the doctor away.
Then we'll first split the text node into two text nodes: the first containing everything up to and including the targeted term, and the second containing everything else:
Node1: An apple
Node2: a day keeps the doctor away.
Next, we split the node again--this time separating everything prior to the term from the term itself:
Node1: An
Node2: apple
Node3: a day keeps the doctor away.
Having isolated the term, we again check the node we're working with to see if it
still contains the key term we're looking for. In this case, that node is now
An , which doesn't match our term, so we continue
with the next term in the list. This step is critical to ensure that we check
each term against each node--even the new nodes we've created. Note especially the
use of the lastIndex property of our regular expression object:
oTermRE.lastIndex = 0;
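Resetting lastIndex to 0 ensures that the next test in the while loop
(while (oTermRE.test(eElement.nodeValue))) starts from the beginning of the node's text,
which matters because the results of our text node splitting above may
have produced a text node with an immediate exact term match. (Strictly speaking, lastIndex only affects regular expressions created with the g flag, so for the non-global expression shown here the reset acts as a harmless safeguard.)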
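To see what lastIndex does in practice, here is a small standalone sketch (not taken from xref) comparing a regular expression created with the g flag against one without it. The variable names are made up for the example.

```javascript
// With the g flag, test() remembers where the last match ended.
var globalRE = /apple/gi;
var foundFirst = globalRE.test('An apple');      // true
var indexAfterMatch = globalRE.lastIndex;        // 8, just past "apple"

// The next test starts searching at index 8, beyond the end of 'apple'.
var foundWithoutReset = globalRE.test('apple');  // false

// Advance lastIndex again, then reset it before re-testing.
globalRE.test('An apple');
globalRE.lastIndex = 0;
var foundAfterReset = globalRE.test('apple');    // true

// Without the g flag, test() always starts at the beginning of the string
// and leaves lastIndex untouched.
var plainRE = /apple/i;
plainRE.test('An apple');
var plainIndex = plainRE.lastIndex;              // 0
```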
Once we've found an isolated text node with an exactly matching term (as
would be the case with the new Node2 in the example above), we create a new link
and replace the matched node with it using replaceChild and standard DOM
techniques.
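The index arithmetic behind the two splits can be tried out on plain strings, without a DOM. The sketch below is a simplification for illustration only: isolateTerm is a made-up name, it returns an array of strings where the real code creates sibling text nodes with splitText, and it uses a case-insensitive indexOf in place of the word-boundary regular expression.

```javascript
// Hypothetical string-based imitation of the two splitText calls.
function isolateTerm(text, term) {
  var start = text.toLowerCase().indexOf(term.toLowerCase());
  if (start === -1) return [text];
  var end = start + term.length;
  var pieces = [text];
  // First split: cut off everything after the term.
  if (end !== text.length) {
    pieces = [text.slice(0, end), text.slice(end)];
  }
  // Second split: separate the text before the term from the term itself.
  if (start !== 0) {
    pieces.splice(0, 1, pieces[0].slice(0, start), pieces[0].slice(start));
  }
  return pieces;
}

var parts = isolateTerm('An apple a day keeps the doctor away.', 'Apple');
// parts is ['An ', 'apple', ' a day keeps the doctor away.']
```

Splitting at the end of the term first means the start offset is still valid for the second split, since nothing before it has moved.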
Note also that, as part of the fValidNode function, each text node
is checked to confirm that it is an immediate child of one of a specific list of elements: namely, divs,
spans, paragraphs (p), italics (i or em), or bold text (b or strong). Additionally,
the script won't insert links within
already existing links on a page. These checks are
an attempt to insert links only in logical areas of the
page, i.e., areas where the user would expect text links to
appear. This may cause the script to appear to be
broken, because it may ignore terms that you believe should
be automatically linked. If this is a problem for your
particular implementation of xref, then you'll need to
adjust this line of code in init_js.tmpl to include the
additional tags that you would like considered:
this.rAllowedElements=/^(DIV|SPAN|P|STRONG|B|EM|I)$/i;
Remember that this check is made against the immediate parent of a text node; it's not made against the elements of the page as a whole. In other words, if your entire page is enclosed within a table, xref can still find the terms within it (provided your table has divs, spans, paragraphs, etc. within it).
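As an illustration of that kind of adjustment, the sketch below adds table cells and list items to the pattern. The choice of TD and LI is only an example; which tags make sense depends on your own pages.

```javascript
// The pattern as shipped, and an extended variant (TD and LI are example
// additions, not a recommendation from xref itself).
var rAllowedElements = /^(DIV|SPAN|P|STRONG|B|EM|I)$/i;
var rAllowedExtended = /^(DIV|SPAN|P|STRONG|B|EM|I|TD|LI)$/i;

var tdBefore = rAllowedElements.test('TD');   // false: text directly in a cell is skipped
var tdAfter = rAllowedExtended.test('TD');    // true: text directly in a cell is scanned
var anchorAfter = rAllowedExtended.test('A'); // false: existing links are still left alone
```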
Dynamically Inserting an External JavaScript
To conserve space in the initial JavaScript delivery to the browser, the
dynamic version of xref breaks the two scan passes described
above into two separate JavaScript calls: the first is included with the initial JavaScript
provided, and the second is made following a second call to xref.js.
By performing the scans in this way, the initial JavaScript need only contain the
terms we're looking for, without their associated descriptions, allowing
us to save bytes in the initial JavaScript download. Bearing in mind that
we want to offer this script to our affiliates without requiring them
to install the script on their own servers, how can we then call the
script again, passing it the matched terms, so we can retrieve the second
portion of the script (which will contain only the matched terms, but with
their descriptions)?
The answer is to build the necessary script URL (including the parameters we
need to tell xref.js what we want to do) and then insert a new
script element into the DOM dynamically, like this:
var actionURL = 'http://example.com/cgi-bin/xref.js?action=select&term_list=' +
encodeURIComponent('1,3,5') + // the terms we need
'&prefix=' + encodeURIComponent("WR_");
var scriptEl = document.createElement('script');
scriptEl.type = 'text/javascript';
scriptEl.src = actionURL;
document.getElementsByTagName("head")[0].appendChild(scriptEl);
[Update: earlier versions of the script--prior to .53--used the simpler document.body.appendChild(scriptEl) to add the new script to the
page, but this implementation proved to be problematic in Internet Explorer
due to a documented bug of the browser (see http://support.microsoft.com/kb/927917/en-us for further details of the bug). Inserting the script in the <head> avoids that bug and is generally considered to be good form. -- Dan]
This is a simplified version of the actual code in xref,
but it shows enough to convey the technique, which works well in
all recent browsers (including at least FF 2/3, IE 6/7, Saf 3,
Opera 9). Note that we must apply encodeURIComponent to all of the
unknown variables that we pass to the script (in our above example they're
just literal values, but in the actual script we don't know what they
are) so the information will be passed to xref.js properly.
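To make the effect of the encoding concrete, here is a small sketch of the URL-building step on its own. The function name buildActionURL is invented for this example; the parameter names come from the snippet above.

```javascript
// Hypothetical helper: assemble the second-pass URL, encoding every value
// whose content is not known in advance.
function buildActionURL(baseURL, termIds, prefix) {
  return baseURL + '?action=select' +
    '&term_list=' + encodeURIComponent(termIds.join(',')) +
    '&prefix=' + encodeURIComponent(prefix);
}

var url = buildActionURL('http://example.com/cgi-bin/xref.js', [1, 3, 5], 'WR_');
// http://example.com/cgi-bin/xref.js?action=select&term_list=1%2C3%2C5&prefix=WR_

// A value containing URL syntax characters cannot add parameters of its own.
var tricky = buildActionURL('http://example.com/cgi-bin/xref.js', [2], 'a&b=c');
// ...&term_list=2&prefix=a%26b%3Dc
```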
Once xref.js receives the above call, it will examine the
parameters, see that a second pass is requested (action=select), and
return the terms and descriptions along with the necessary fScanText
command to execute the second scan and insert the links.
Unfortunately, it's exactly this type of technique that makes cross-site
scripting security vulnerabilities possible in unsuspecting Web pages. To
prevent this type of abuse in xref, we convert all
angle brackets passed into the script into literal HTML entities, so if a
malicious individual tries to call xref.js with their own script
inserted, xref will just display the code literally and
harmlessly within its own script as &lt;script ... and so on.
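The conversion itself is simple. In xref it happens on the server side, and that code isn't shown here, so the following is only a JavaScript illustration of the idea with a made-up function name.

```javascript
// Hypothetical illustration of the angle-bracket conversion described above.
function escapeAngleBrackets(value) {
  return String(value).replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

var escaped = escapeAngleBrackets('<script>alert(1)</script>');
// '&lt;script&gt;alert(1)&lt;/script&gt;'
```

Once the brackets are entities, the injected text can no longer open or close a tag in the page that receives it.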
Conclusion
While I believe xref does what it does pretty well, there's certainly room for improvement. Its overall design works well for lists of terms up to, say, 1,000 entries; but beyond that it may become sluggish, primarily because of the amount of data that would need to be passed to the browser. One possible remedy for this is to pass the text of the page to the back-end script for evaluation instead (as opposed to passing the terms to scan for to the browser); but there you run into potential problems with the length of data that can be included within a single URL. Nonetheless, the advantage this type of approach provides (limiting the initial JavaScript sent to the browser to about 4k or less) may be worth the effort.
If there's enough call to pursue this then I can do so in a future release; or, if you have something else on your mind--either about future directions of the script, or about its current functionality--don't hesitate to let me know (you can use the comment form below to do precisely that). It's my hope that the script can evolve with your input in mind. In the meantime, I hope you enjoy using xref on your own sites and networks!




