Querying html using Yahoo YQL - html

While trying to parse HTML using Yahoo Query Language (YQL) and the XPath functionality it provides, I ran into the problem of not being able to extract "text()" or attribute values.
For example,
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a'
gives a list of anchors as xml
<results>
<a class="question-hyperlink" href="/questions/661184/filling-the-text-area-with-the-text-when-a-button-is-clicked" title="In ASP.net, I need the code to fill the text area (in the form) when a button is clicked. Can you help me through by showing a simple .aspx code containing the script tag? ">Filling the text area with the text when a button is clicked</a>...
</results>
Now when I try to extract the node value using
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a/text()'
I get results concatenated rather than a node list
e.g.
<results>Xcode: attaching to a remote process for debuggingWhy is b
…… </results>
How do I separate it into a node list, and how do I select attribute values?
A query like this
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a[@href]'
gave me the same results as querying //div/h3/a

YQL requires the XPath expression to evaluate to an item path (a set of nodes) rather than to node text. Once you have an item path, you can project various values from the tree.
In other words, the item path should point to nodes in the resulting HTML rather than to text content or attributes. YQL returns all matching nodes and their children when you select * from the data.
For example:
select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns all the a's matching the xpath. Now to project the text content you can project it out using
select content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
"content" returns the text content held within the node.
To project out attributes, name them relative to the node selected by the XPath expression. In this case you need href, which is an attribute of a:
select href from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns
<results>
<a href="/questions/663973/putting-a-background-pictures-with-leds"/>
<a href="/questions/663013/advantages-and-disadvantages-of-popular-high-level-languages"/>
....
</results>
If you needed both the attribute 'href' and the textContent, then you can execute the following YQL query:
select href, content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
returns:
<results> double pointer const issue issue... </results>
Hope that helps. Let me know if you have more questions on YQL.
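The same idea (select element nodes first, then project text and attributes from each node) can be illustrated outside YQL. The following is a minimal sketch in Python's standard library, not YQL itself; the results document and its two anchors are made-up sample data standing in for what the query would return.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the <results> document returned by
# selecting //div/h3/a; the hrefs and titles are invented sample data.
results_xml = """
<results>
  <a class="question-hyperlink" href="/questions/1/first">First question</a>
  <a class="question-hyperlink" href="/questions/2/second">Second question</a>
</results>
"""

root = ET.fromstring(results_xml)

# Step 1: select the element nodes (the "item path").
anchors = root.findall("./a")

# Step 2: project the values you need from each node,
# similar in spirit to "select href, content from html ...".
projected = [(a.get("href"), a.text) for a in anchors]

for href, content in projected:
    print(href, content)
```

Because each anchor stays a separate node until the projection step, the text values come back as a list instead of one concatenated string.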

Related

How do I get rid of the tags in XPath

I have a bunch of html files with tons of data in it and I want to extract the important parts of it.
The files are all very similar; I've to search for a <tr> which contains a certain keyword. The third column of this table row always contains the name of the "block" I'm searching for (it's a few table rows).
//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]
with this XPath query I get the names (maybe one, maybe more)
The problem is, how do I get rid of the tags around the data?
Right now my output is something like this:
<span class="log_entry_text">Name1</span><span class="log_entry_text">Name2</span><span class="log_entry_text">Name3</span>
I want to have something like that: Name1 Name2 Name3
So I can use it for extracting these blocks more easily.
With string() i can only extract the first element (result would be: Name1)
Thanks for helping me!
Just wrap your XPath in the data() function (available in XPath 2.0 and XQuery) to retrieve the text, like data(//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]).
Your XPath expression asks to retrieve span elements and that's what it has returned. If you're seeing tags with angle brackets in the output, that's because of the way the XPath result is being processed and rendered by the receiving application.
If you're in XPath 2.0+ or XQuery 1.0+ you can combine the several span elements into a single string using
string-join(//path/span, ' ')
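If the XPath processor at hand only supports XPath 1.0, the joining can be done in the host language instead. Here is a minimal Python sketch of that approach; the fragment is a made-up wrapper around the three spans shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the output shown in the question.
fragment = """
<div>
  <span class="log_entry_text">Name1</span>
  <span class="log_entry_text">Name2</span>
  <span class="log_entry_text">Name3</span>
</div>
"""

root = ET.fromstring(fragment)

# Collect the text of every span, skipping spans with no text,
# then join with a space, mirroring string-join(//path/span, ' ').
names = [span.text for span in root.iter("span") if span.text]
joined = " ".join(names)
print(joined)  # Name1 Name2 Name3
```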

How to write an XPath query for text within <script> using PhantomJS

I am trying to scrape some specific content that sits within the <script> section of a page (at the bottom of the page before the end of the tag). It is my understanding that this can't be done with regular XPath, so I will be using PhantomJS Cloud via the SEOTools for Excel plugin.
Please see code below:
<script> window.__INITIAL_STATE__ = {"questions":{"list":{},"status":{}},"sites":{"list":{"SEOTest":{"joined":"2016-04-17T22:00:31.000Z","threshold":[],"abn":"8724483318952",
I want to be able to scrape the text after the "abn" field, so the XPath would return "8724483318952". Does anybody know how this can be done with XPath?
To retrieve the desired target string value of "8724483318952" you can use the following XPath-1.0 expression:
substring-before(substring-after(script,'abn":'),',')
It gets the desired string from the <script> tag and its output is
"8724483318952"
The signature of XPathOnUrl is, according to this link:
=XPathOnUrl(
string url,
string xpath,
string attribute,
string xmlHttpSettings,
string mode
) : vector
So the whole expression could look like this:
=XPathOnUrl(A2,"substring-before(substring-after(//ul[@class='headshot']/script,'abn"":'),',')")
I'm not sure that this expression really works (note that a double quote inside an Excel string literal has to be doubled), but it should give you a fairly precise idea of how to handle XPath expressions in general.
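To check what the nested substring-after/substring-before calls produce, the same logic can be reproduced in plain Python. This is only a sketch of the string handling, not of SEOTools or PhantomJS; the two helper functions are hypothetical names, and the script text is abbreviated from the question.

```python
def substring_after(text, marker):
    """Mimic XPath substring-after(): text following the first marker, or ''.

    Assumes a non-empty marker.
    """
    _, found, rest = text.partition(marker)
    return rest if found else ""


def substring_before(text, marker):
    """Mimic XPath substring-before(): text preceding the first marker, or ''.

    Assumes a non-empty marker.
    """
    head, found, _ = text.partition(marker)
    return head if found else ""


# Abbreviated copy of the <script> content from the question.
script_text = (
    'window.__INITIAL_STATE__ = {"sites":{"list":{"SEOTest":'
    '{"threshold":[],"abn":"8724483318952",'
)

# Same nesting as substring-before(substring-after(script,'abn":'),',')
abn = substring_before(substring_after(script_text, 'abn":'), ",")
print(abn)             # still wrapped in double quotes
print(abn.strip('"'))  # bare digits
```

As in the XPath version, the extracted value keeps its surrounding double quotes, so a final trimming step is needed if only the digits are wanted.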

How to access various parts of a link with XPath

I'm fairly new to XPath and wanted to see how granular you can get when accessing various HTML components.
I'm currently using this XPath
//*[@id=\"resultsDiv\"]/p[1]/a
to access the HTML (abbreviated) below:
<p style="margin:0;border-width:0px;">Bill%20Jones</p>
The XPath returns this: Bill%20Jones
But what I'm trying to get is simply the PersonID = 140476.
Question: Is it possible to write an XPath that results in 140476, or do I need to take what was returned and use a regular expression or another method to access the PersonID?
If this XPath,
//*[@id=\"resultsDiv\"]/p[1]/a
selects this a element,
Bill%20Jones
then this XPath,
substring-after(//*[@id='resultsDiv']/p[1]/a/@href, 'PersonID=')
will return 140476 alone, as requested.
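If the href carries further parameters after PersonID, substring-after alone would return them too. The sketch below shows both behaviours in Python. The href value is invented for illustration, since the real attribute is not shown in the question.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical href: the real attribute value is not shown in the question.
href = "details.aspx?PersonID=140476&name=Bill%20Jones"

# Equivalent of substring-after(..., 'PersonID='): everything after the marker,
# including any parameters that follow.
after_marker = href.partition("PersonID=")[2]

# Parsing the query string isolates the value even when other parameters follow.
person_id = parse_qs(urlsplit(href).query)["PersonID"][0]

print(after_marker)
print(person_id)
```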

Select attribute content XPath

I have an XPath
//*[@class]
I would like to make an XPath to select the content inside this attribute.
<li class="tab-off" id="navList0">
So in this case I would like to extract the text "tab-off", is this possible with XPath?
Your original //*[@class] XPath query returns all elements which have a class attribute. What you want is //*[@class]/@class to retrieve the attribute itself.
In case you just want the value and not the attribute name, try string(//*[@class]/@class) instead.
If you are specifically grabbing the data from an <li> tag, you can do this:
//li[@class]
and loop through the result set to find the element whose class attribute is "tab-off". Or use
//li[@class='tab-off']
if you're in a position to hard-code the value.
I assume you have already put your file through an XML parser like a DOMParser. This will make it much easier to extract any other values you may need on a specific tag.
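As a rough illustration of the two approaches above, here is a sketch using Python's standard-library ElementTree. Its limited XPath support can filter on attributes but cannot return attribute nodes, so the value is read with get(). The list markup is a made-up extension of the single <li> from the question.

```python
import xml.etree.ElementTree as ET

# Sample markup: the first <li> is from the question, the others are invented.
doc = ET.fromstring("""
<ul>
  <li class="tab-off" id="navList0">One</li>
  <li class="tab-on" id="navList1">Two</li>
  <li id="navList2">Three</li>
</ul>
""")

# .//*[@class] selects every element that carries a class attribute.
with_class = doc.findall(".//*[@class]")

# Read the attribute value from each selected element.
class_values = [el.get("class") for el in with_class]

# .//li[@class='tab-off'] filters on the attribute value directly.
tab_off_ids = [el.get("id") for el in doc.findall(".//li[@class='tab-off']")]

print(class_values)
print(tab_off_ids)
```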

get XPATH for all the nodes

Is there a library that can give me the XPATH for all the nodes in an HTML page?
Yes, if this HTML page is a well-formed XML document.
Depending on what you understand by "node"...
//*
selects all the elements in the document.
/descendant-or-self::node()
selects all elements, text nodes, processing instructions, comment nodes, and the root node /.
//text()
selects all text nodes in the document.
//comment()
selects all comment nodes in the document.
//processing-instruction()
selects all processing instructions in the document.
//@*
selects all attribute nodes in the document.
//namespace::*
selects all namespace nodes in the document.
Finally, you can combine any of the above expressions using the union (|) operator.
Thus, I believe that the following expression really selects "all the nodes" of any XML document:
/descendant-or-self::node() | //@* | //namespace::*
In case this is helpful for someone else, if you're using python/lxml, you'll first need to have a tree, and then query that tree with the XPATH paths that Dimitre lists above.
To get the tree:
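from lxml import html, etree

your_webpage_string = "<html><head><title>test<body><h1>page title</h3>"
# Parse the (possibly malformed) HTML with lxml's lenient HTML parser
bad_html = html.fromstring(your_webpage_string)
# Serialize the repaired tree so it can be re-read as well-formed XML
good_html = etree.tostring(bad_html, pretty_print=True).strip()
# Re-parse the cleaned markup and query it with XPath
your_tree = etree.fromstring(good_html)
all_xpaths = your_tree.xpath('//*')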
On the last line, replace '//*' with whatever xpath you want. all_xpaths is now a list which looks like this:
[<Element html at 0x7ff740b24b90>,
<Element head at 0x7ff740b24d88>,
<Element title at 0x7ff740b24dd0>,
<Element body at 0x7ff740b24e18>,
<Element h1 at 0x7ff740b24e60>]
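The list above contains element objects, not path strings. If what you need is an XPath string for each element, one option is to walk the tree and build positional paths yourself. Below is a minimal sketch using only the standard library; element_paths is a hypothetical helper name, the sample document is made up, and the input is assumed to be well-formed. lxml also provides a getpath() method on its element tree that serves a similar purpose.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(root):
    """Yield an absolute, positional XPath for every element under root."""

    def walk(node, path):
        yield path
        # Count siblings per tag so an index is added only where it is needed.
        totals = Counter(child.tag for child in node)
        seen = Counter()
        for child in node:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += "[%d]" % seen[child.tag]
            yield from walk(child, path + "/" + step)

    yield from walk(root, "/" + root.tag)


# Made-up, well-formed sample document.
doc = ET.fromstring(
    "<html><head><title>test</title></head>"
    "<body><h1>a</h1><p>b</p><p>c</p></body></html>"
)

paths = list(element_paths(doc))
for p in paths:
    print(p)
```

This covers element nodes only; text, comment, and attribute nodes would need extra steps such as text() or @name appended to the parent's path.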