Saturday, February 5, 2011

HtmlAgilityPack and XPath Peculiarity

While parsing several HTML pages, I noticed that HtmlAgilityPack does not evaluate an XPath expression starting with "//" relative to the node it is called on. The following code illustrates this:

var html = @"
<div class=""header"">
    <p>header <span>paragraph 1-1</span></p>
    <p>header <span>paragraph 1-2</span></p>
</div>
<div class=""content"">
    <p>content <span>paragraph 2-1</span></p>
    <p>content <span>paragraph 2-2</span></p>
</div>
    ";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var node = doc.DocumentNode.SelectSingleNode("div[1]/p[1]");

Console.WriteLine("\r\n1st <p> in 1st <div>:");
Console.WriteLine(node.OuterHtml);

Console.WriteLine("\r\nCount of <span> (//):");
Console.WriteLine(node.SelectNodes("//span").Count);

Console.WriteLine("\r\nCount of <span> (.//):");
Console.WriteLine(node.SelectNodes(".//span").Count);

It produces the output:

1st <p> in 1st <div>:
<p>header <span>paragraph 1-1</span></p>

Count of <span> (//):
4

Count of <span> (.//):
1

w3schools.com says that "//" selects nodes "from the current node". So does this mean that HtmlAgilityPack gets it wrong?

When I learned XPath from w3schools.com, I never doubted that description. But the W3C specification shows that HtmlAgilityPack's behavior is correct:

//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
.//para selects the para element descendants of the context node
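
The same behavior can be checked without HtmlAgilityPack. Below is a minimal sketch that uses the XPath support built into System.Xml. It assumes the markup from the post is wrapped in a hypothetical <root> element so that it is well-formed XML; the class and method names are made up for illustration only.

```csharp
using System;
using System.Xml;

public static class XPathContextDemo
{
    // Markup from the post, wrapped in a hypothetical <root> element (an assumption)
    // so that XmlDocument accepts it as well-formed XML.
    private const string Xml = @"
<root>
  <div class=""header"">
    <p>header <span>paragraph 1-1</span></p>
    <p>header <span>paragraph 1-2</span></p>
  </div>
  <div class=""content"">
    <p>content <span>paragraph 2-1</span></p>
    <p>content <span>paragraph 2-2</span></p>
  </div>
</root>";

    // Takes the first <p> of the first <div> as the context node,
    // evaluates the given XPath expression against it and returns the match count.
    public static int CountFromFirstParagraph(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode p = doc.SelectSingleNode("/root/div[1]/p[1]");
        return p.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // "//" starts at the document root, whatever the context node is.
        Console.WriteLine(CountFromFirstParagraph("//span"));           // 4
        // ".//" starts at the context node.
        Console.WriteLine(CountFromFirstParagraph(".//span"));          // 1
        // The explicit descendant axis is relative to the context node as well.
        Console.WriteLine(CountFromFirstParagraph("descendant::span")); // 1
    }
}
```

The counts match the HtmlAgilityPack output above, so the leading "//" behaves the same way here. To stay inside the current node, start the expression with "." or use an explicit axis such as descendant::.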

All I want to say is that you should be careful with the information you get, even if it comes from a popular site with a good reputation (as w3schools has). "Trust no one", as Horde says :).
