Get XPath for all the nodes - HTML

Is there a library that can give me the XPath for all the nodes in an HTML page?

Is there any library that can give me the XPath for all the nodes in an HTML page?
Yes, if this HTML page is a well-formed XML document.
Depending on what you understand by "node"...
//*
selects all the elements in the document.
/descendant-or-self::node()
selects all elements, text nodes, processing instructions, comment nodes, and the root node /.
//text()
selects all text nodes in the document.
//comment()
selects all comment nodes in the document.
//processing-instruction()
selects all processing instructions in the document.
//@*
selects all attribute nodes in the document.
//namespace::*
selects all namespace nodes in the document.
Finally, you can combine any of the above expressions using the union (|) operator.
Thus, I believe that the following expression really selects "all the nodes" of any XML document:
/descendant-or-self::node() | //@* | //namespace::*

In case this is helpful for someone else: if you're using Python/lxml, you'll first need to build a tree, and then query that tree with the XPath expressions that Dimitre lists above.
To get the tree:
from lxml import html, etree

# Deliberately broken HTML; lxml's HTML parser will repair it.
your_webpage_string = "<html><head><title>test<body><h1>page title</h3>"
bad_html = html.fromstring(your_webpage_string)
# Re-serialize the repaired tree and re-parse it as XML.
good_html = etree.tostring(bad_html, pretty_print=True).strip()
your_tree = etree.fromstring(good_html)
all_xpaths = your_tree.xpath('//*')
On the last line, replace '//*' with whatever XPath expression you want. all_xpaths is now a list that looks like this:
[<Element html at 0x7ff740b24b90>,
<Element head at 0x7ff740b24d88>,
<Element title at 0x7ff740b24dd0>,
<Element body at 0x7ff740b24e18>,
<Element h1 at 0x7ff740b24e60>]
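If what you actually need is an XPath expression for each node, rather than the Element objects themselves, lxml can generate one per element via getroottree().getpath(); a small follow-on sketch, continuing from your_tree above:

# Continuing from the snippet above. getpath() returns a positional
# XPath such as /html/body/h1 for each element matched by //*.
root_tree = your_tree.getroottree()
for element in all_xpaths:
    print(root_tree.getpath(element), element.tag)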

Related

Why is my HTML document being scrambled when using FSharp.Data HTML parser?

When trying to manipulate some HTML with the FSharp.Data library I am getting confusing results.
Here is the code:
let manipulateHtml (htmlDoc:HtmlDocument) =
    htmlDoc.Html().Descendants()
    |> filterFromHtml stuffToRemove
    |> HtmlDocument.New
When I print the resulting Html document it is not in the correct order - it seems to reconstruct the document starting from a random node. How does HtmlDocument.New(seq) reconstruct the html document and is there a way to reconstruct the document in the correct format - e.g. its original order?
That's because the Descendants() method returns all of the nodes recursively. The returned sequence therefore contains every node at every depth, each one still carrying its own subtree.
For example, when the doc is:
<html>
  <tag1>
    <tag2>
      this is the text
    </tag2>
  </tag1>
</html>
Then Descendants() will return a sequence of nodes like this:
<tag1>
  <tag2>
    this is the text
  </tag2>
</tag1>
<tag2>
  this is the text
</tag2>
this is the text
But the HtmlDocument.New method rebuilds the document from that flat sequence, so you end up with a document like the one above, with <tag2> repeated twice and the text repeated three times.
So in order to solve your problem, you need to traverse the tree of htmlDoc.Html(), decide which nodes to retain, and at the same time construct a new tree using the HtmlNode.New***() and HtmlDocument.New***() methods.
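The concrete fix has to go through FSharp.Data's HtmlNode/HtmlDocument constructors, but the underlying idea (walk the tree, drop the unwanted nodes, and keep the remaining structure intact instead of flattening it) is language-agnostic. Here is a rough sketch of that idea in Python/lxml, with is_unwanted standing in for whatever stuffToRemove filters on:

from lxml import html, etree

def filter_tree(root, is_unwanted):
    # Iterate over a snapshot so removals don't disturb the traversal.
    for el in list(root.iter()):
        if el is not root and is_unwanted(el):
            el.getparent().remove(el)   # detaches the whole subtree
    return root

doc = html.fromstring("<html><body><p>keep me</p><table><tr><td>drop me</td></tr></table></body></html>")
filter_tree(doc, lambda el: el.tag == "table")
print(etree.tostring(doc, pretty_print=True).decode())   # document order is preserved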

Select all deepest nodes with XPath 1.0 containing text, ignoring markup

I want to extract elements from an HTML page that contain a given text, ignoring markup. For example, I want to extract the node containing the text "Run, Sarah, run!" from https://en.wiktionary.org/wiki/run. I know about the node test text() and the function string(). I tried them both:
As you see, if I use string() it returns too many nodes (the result includes the nodes that contain the node I need), and if I use text() it returns nothing (because of the <b> tag).
How do I find required nodes?
UPD: I want all deepest nodes. That means if the Wiktionary page contained this sentence twice, I would want to select two nodes.
Also, I don't know the node type.
//*[contains(string(.), "Run, Sarah, run!")] returns all elements (from the html node down to the last descendant) that contain that string.
//*[contains(text(), "Run, Sarah, run!")] returns nothing, because "Run, Sarah, run!" is composed of several text nodes rather than a single text node.
You can use the expression below to match the italic node with the required text:
'//i[normalize-space()="Run, Sarah, run!"]'
If you don't want to specify the node name, you can try
'//*[normalize-space()="Run, Sarah, run!" and not(./*[normalize-space()="Run, Sarah, run!"])]'
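To see the difference between the two approaches concretely, here is a quick check with lxml as the XPath 1.0 engine (the HTML below is a cut-down stand-in for the Wiktionary markup, not the real page):

from lxml import html

# The sentence is split across an <i> element and nested <b> elements,
# which is why a plain text() test finds nothing.
snippet = html.fromstring(
    '<html><body><p><i><b>Run</b>, Sarah, <b>run</b>!</i></p></body></html>'
)

# contains(string(.), ...) matches every ancestor as well (html, body, p, i):
print(snippet.xpath('//*[contains(string(.), "Run, Sarah, run!")]'))

# The "deepest node" expression matches only the <i> element:
print(snippet.xpath(
    '//*[normalize-space()="Run, Sarah, run!"'
    ' and not(./*[normalize-space()="Run, Sarah, run!"])]'
))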

How to parse HTML/XML tags according to NOT conditions in [r]

Dearest StackOverflow homies,
I'm playing with HTML that was output by EverNote and need to parse the following:
Note Title
Note anchor (hyperlink identities of the notes themselves)
Note Creation Date
Note Content, and
Intra-notebook hyperlinks (the links within the content of a note to another note's anchor)
According to examples by Duncan Temple Lang, author of the [r] XML package, and a SO answer by @jdharrison, I have been able to parse the Note Title, Note anchor, and Note Creation Dates with relative ease. For those who may be interested, the commands to do so are
require("XML")
rawHTML <- paste(readLines("EverNotebook.html"), collapse="\n") #Yes... this is noob code
doc = htmlTreeParse(rawHTML,useInternalNodes=T)
#Get Note Titles
html.titles<-xpathApply(doc, "//h1", xmlValue)
#Get Note Title Anchors
html.tAnchors<-xpathApply(doc, "//a[@name]", xmlGetAttr, "name")
#Get Note Creation Date
html.Dates<-xpathApply(doc, "//table[@bgcolor]/tr/td/i", xmlValue)
Here's a fiddle of an example HTML EverNote export.
I'm stuck on parsing 1. Note Contents and 2. Intra-notebook hyperlinks.
Taking a closer look at the code, it is apparent that the solution for the first part is to return every upper-most* div that does NOT include a table with attribute bgcolor="#D4DDE5". How is this accomplished?
Duncan says that it is possible to use XPath to parse XML according to NOT conditions:
"It allows us to express things such as "find me all nodes named a" or "find me all nodes named a that have no attribute named b" or "nodes a that >have an attribute b equal to 'bob'" or "find me all nodes a which have c as >an ancestor node"
However he does not go on to describe how the XML package can parse exclusions... so I'm stuck there.
Addressing the second part, consider the format of anchors to other notes in the same notebook:
<a href="#13178">
The goal with these is to procure their number and yet this is difficult because they are solely distinguished from www links by the # prefix. Information on how to parse for these particular anchors via partial matching of their value (in this case #) is sparse - maybe even requiring grep(). How can one use the XML package to parse for these special hrefs? I describe both problems here since it's possible a solution to the first part may aid the second... but perhaps I'm wrong. Any advice?
UPDATE 1
By upper-most div I intend to say outer-most div. The contents of every note in an EverNote HTML export are within the DOM's outer-most divs. Thus the interest is to return every outer-most div that does NOT include a table with attribute bgcolor="#D4DDE5".
"....to return every upper-most div that does NOT include a table with attribute bgcolor="#D4DDE5." How is this accomplished?"
One possible way, ignoring 'upper-most' as I don't know exactly how you would define it:
//div[not(table[@bgcolor='#D4DDE5'])]
The above XPath reads: select all <div> elements that do not have a child <table> element whose bgcolor attribute equals #D4DDE5.
I'm not sure what you mean by "parse" in the 2nd part of the question. If you simply want to get all of those links having a special href, you can partially match the href attribute using starts-with() or contains():
//a[starts-with(@href, '#')]
//a[contains(@href, '#')]
UPDATE:
Taking the "outer-most" div into consideration:
//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]
Side note: I don't know exactly how XPath not() is defined, but if it works like negation in general (and this worked, as confirmed by the OP in the comment below), you can apply one of De Morgan's laws:
"not (A or B)" is the same as "(not A) and (not B)".
so that the updated XPath can be slightly simplified to:
//div[not(table[@bgcolor='#D4DDE5'] or ancestor::div)]
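These are plain XPath 1.0 expressions, so they can be sanity-checked in any engine before being passed to xpathApply; here is a minimal check with lxml against toy markup (not the real EverNote export):

from lxml import html

# Toy stand-in: the first outer div carries the "header" table,
# the second outer div does not, and there is one intra-notebook anchor.
doc = html.fromstring("""
<html><body>
<div><table bgcolor="#D4DDE5"><tr><td>creation date</td></tr></table><div>note body</div></div>
<div>another note body</div>
<a href="#13178">intra-notebook link</a>
<a href="http://www.example.com">www link</a>
</body></html>
""")

# Outer-most divs that do NOT contain the bgcolor table:
print(doc.xpath("//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]"))

# Anchors whose href starts with '#':
print([a.get("href") for a in doc.xpath("//a[starts-with(@href, '#')]")])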

Nokogiri: How do you get the parent node of a DOM element when all you have is the string index of the element you want in the dom?

Here is what I have:
DOM stored as text
I have the string index of the area I want to get the parent node of; the index may or may not be the beginning of a tag (it will never be part of a tag, as it is a user selection)
I also have the htmltext at the index (obviously)
This is as far as I've gotten:
doc = Nokogiri::HTML(content.body)
I know Nokogiri can do XPath things, but I don't know if XPath can do standard text searches. The selection text could span multiple nodes, and I think that breaks XPath searching.
I'm using Ruby 1.8.7 and Rails 2.3.8.
There is no correlation between an index into a particular serialization of the XML document and an element. The closest you could do:
Recursively, at each level of the DOM, serialize the element and see if its length (added to what you have so far) has reached your index.
Unfortunately this is not guaranteed to work, since:
Many different (non-canonical) serializations are possible that describe the same XML document (e.g. foo="You said, &quot;Hi!&quot;" vs. foo='You said, "Hi!"').
Depending on whether you consider blank whitespace nodes as significant, two different XML documents might be treated the same (e.g. <foo><bar> vs. <foo>\n\t<bar>)
In HTML, additional non-significant whitespace might be stripped (e.g. <p>a    b</p> vs. <p>a b</p>).
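To make the "serialize and measure" idea concrete, here is a rough sketch in Python/lxml rather than Nokogiri; the helper name and approach are mine, not an existing API, and the same caveats listed above apply (the offset has to be measured against the re-serialized text, not the original string):

from lxml import etree, html

def deepest_node_at_index(doc_text, target_index):
    root = html.fromstring(doc_text)
    # Offsets are taken against this re-serialization, which will rarely
    # match the original text byte-for-byte (see the caveats above).
    serialized = etree.tostring(root).decode()
    best = None
    for el in root.iter():
        chunk = etree.tostring(el, with_tail=False).decode()
        start = serialized.find(chunk)          # fragile if a subtree repeats
        if start != -1 and start <= target_index < start + len(chunk):
            best = el   # document order is pre-order, so later hits are deeper
    return best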

Querying html using Yahoo YQL

While trying to parse html using Yahoo Query Language and xpath functionality provided by YQL, I ran into problems of not being able to extract “text()” or attribute values.
For example:
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a'
gives a list of anchors as xml
<results>
<a class="question-hyperlink" href="/questions/661184/filling-the-text-area-with-the-text-when-a-button-is-clicked" title="In ASP.net, I need the code to fill the text area (in the form) when a button is clicked. Can you help me through by showing a simple .aspx code containing the script tag? ">Filling the text area with the text when a button is clicked</a>...
</results>
Now when I try to extract the node value using
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a/text()'
I get the results concatenated rather than as a node list
e.g.
<results>Xcode: attaching to a remote process for debuggingWhy is b
…… </results>
How do I separate it into node lists, and how do I select attribute values?
A query like this
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a[@href]'
gave me the same results for querying div/h3/a
YQL requires the xpath expression to evaluate to an itemPath rather than to node text. But once you have an itemPath, you can project various values from the tree.
In other words, an itemPath should point to a node in the resulting HTML rather than to text content or attributes. YQL returns all matching nodes and their children when you select * from the data.
Example:
select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns all the <a> elements matching the XPath. Now, to project out the text content, use
select content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
"content" returns the text content held within the node.
For projecting out attributes, you can specify them relative to the xpath expression. In this case you need the href, which is relative to a:
select href from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns:
<results>
<a href="/questions/663973/putting-a-background-pictures-with-leds"/>
<a href="/questions/663013/advantages-and-disadvantages-of-popular-high-level-languages"/>
....
</results>
If you needed both the attribute 'href' and the textContent, then you can execute the following YQL query:
select href, content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
returns:
<results> double pointer const issue issue... </results>
Hope that helps. Let me know if you have more questions on YQL.
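If you ever need the same href + content projection outside YQL, the equivalent with a local XPath engine is straightforward; a sketch with Python/lxml (stackoverflow.com's markup may well no longer match //div/h3/a, so treat the expression as an assumption carried over from the question):

import urllib.request
from lxml import html

# Local equivalent of:
#   select href, content from html where url="http://stackoverflow.com"
#   and xpath='//div/h3/a'
page = urllib.request.urlopen("http://stackoverflow.com").read()
doc = html.fromstring(page)
for a in doc.xpath("//div/h3/a"):
    print(a.get("href"), a.text_content().strip())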