Parsing inconsistent HTML with XPath


I'm trying to gather information from a web page that has inconsistent HTML, for example:
<ul><li>Item #1</li></ul><ul><li>sub Item #1</li></ul>
That part is fine; I use the XPath expression
//div[@id="content"]/ul/li/text()
and it does the job, except that it doesn't gather the text of sub Item #1.
The HTML also varies, and sometimes looks like this:
<dl><dd><ul><li>Item #1</li></ul></dd></dl><dl><dd><ul><li>sub Item #1</li></ul></dd></dl>
I'm trying to gather both Item #1 and sub Item #1, but with this inconsistent HTML I can't find a single XPath expression that works in every case. Could you help me with this?
There will always be a list, and Item #1 and sub Item #1 will always be inside a <ul><li>.

You could try the descendant shorthand (//) to select ul/li/text() no matter how deeply it is nested within a consistent ancestor. For example, assuming that the ancestor of ul/li is always a div whose id equals "content":
//div[@id="content"]//ul/li/text()
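As a rough illustration of the answer above, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. That module supports only a subset of XPath (element paths, //, and simple predicates, but no text() step), so the sketch selects the li elements and reads their text. The wrapping body element, the placement of the lists inside a div with id "content", and the helper name extract_items are assumptions made for the example; real-world HTML that is not well-formed XML would need a proper HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Both layouts from the question, wrapped in an assumed <div id="content">.
FLAT = (
    '<body><div id="content">'
    "<ul><li>Item #1</li></ul><ul><li>sub Item #1</li></ul>"
    "</div></body>"
)
NESTED = (
    '<body><div id="content">'
    "<dl><dd><ul><li>Item #1</li></ul></dd></dl>"
    "<dl><dd><ul><li>sub Item #1</li></ul></dd></dl>"
    "</div></body>"
)


def extract_items(markup):
    """Return the text of every ul/li found at any depth under div#content."""
    root = ET.fromstring(markup)
    # ElementTree has no text() step, so select the li elements and read .text.
    return [li.text for li in root.findall(".//div[@id='content']//ul/li")]


print(extract_items(FLAT))
print(extract_items(NESTED))
```

The same path matches both layouts because // does not care how many elements sit between the div and the ul.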

Related

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are simplified from a Pixiv user page. Each /div/div[2]/a element has a title string, so these are the artwork titles.
I want to use a single expression to fetch all of the above a elements in a WebExtension called PageProbe. I've tried a bunch of approaches, but none of them returns the result I want.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.

According to the information you gave here, you can simply use this XPath, which omits the index on li so that it matches every li:
/div/ul/li/div/div[2]/a
However, I'm quite sure there is a better locator based on other attributes, such as class names.
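To illustrate why dropping the index on li selects the link under every list item, here is a small Python sketch with xml.etree.ElementTree. The markup and the title values are invented for the example (the real Pixiv page is not shown in the question), and artwork_titles is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mirrors the /div/ul/li[n]/div/div[2]/a structure.
PAGE = """
<div>
  <ul>
    <li><div><div><a title="thumbnail"/></div><div><a title="First artwork"/></div></div></li>
    <li><div><div><a title="thumbnail"/></div><div><a title="Second artwork"/></div></div></li>
    <li><div><div><a title="thumbnail"/></div><div><a title="Third artwork"/></div></div></li>
  </ul>
</div>
"""


def artwork_titles(markup):
    """Collect the title of the a element in the second inner div of every li."""
    root = ET.fromstring(markup)
    # root is the outer <div>, so the relative path starts at ./ul.
    # li carries no index, so every li is visited; div[2] keeps only the second inner div.
    return [a.get("title") for a in root.findall("./ul/li/div/div[2]/a")]


print(artwork_titles(PAGE))
```

The a elements in the first inner div are skipped because div[2] restricts the match to the second div child.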

How to read multiple subtitles using XPath?

I want to read the top stories links from the CNN news site using XPath in Selenium. My XPath is shown below.
text = ieDriver.findElement(By.xpath("//*[@id=\"intl_homepage1-zone-1\"]/div[2]/div/div[2]/ul/article[1]/div/div[2]/h3/a/span[1]/strong")).getText();
It reads only one subheading, but I want to read all the top stories headings. How can I do that?
I know that if I change the index to article[2], article[3], ... article[i] it will read the others. Is there any way to read them all using a single XPath?

A single XPath for fetching the top stories links is:
//div[contains(@data-analytics,'Top stories_list-hierarchical')]//h3[@class='cd__headline']/a/span[1]
You first have to store the elements in a list. For that, use "findElements" rather than "findElement", because findElement returns only one element.
Then you can iterate over the list and print the text of each WebElement.

You first need to fetch all the WebElements into a list using findElements; then you can iterate over it and read each link:
// Collect every article element under the list.
List<WebElement> article_List = ieDriver.findElements(By.xpath("//*[@id=\"intl_homepage1-zone-1\"]/div[2]/div/div[2]/ul/article"));
for (WebElement article : article_List)
{
// The leading dot makes the XPath relative to the current article.
String text = article.findElement(By.xpath("./div/div[2]/h3/a/span[1]/strong")).getText();
}
In this way you can iterate over the whole list of headings.
P.S. Prefer relative XPaths over absolute ones.
Hope this answer was helpful.
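The two-step pattern described above (select all container elements, then evaluate a relative path inside each one) can be sketched without a browser. The following is a Python stand-in using xml.etree.ElementTree rather than Selenium; the markup is a simplified, hypothetical version of the page structure, and headlines is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the list of articles on the page.
ZONE = """
<ul>
  <article><div><h3><a><span><strong>Headline one</strong></span></a></h3></div></article>
  <article><div><h3><a><span><strong>Headline two</strong></span></a></h3></div></article>
  <article><div><p>An article without a headline</p></div></article>
</ul>
"""


def headlines(markup):
    """Return the headline text of every article that has one."""
    root = ET.fromstring(markup)
    texts = []
    # Step 1: select all containers (the role findElements plays in Selenium).
    for article in root.findall("./article"):
        # Step 2: evaluate a path relative to the current container.
        strong = article.find("./div/h3/a/span[1]/strong")
        if strong is not None:  # skip articles that lack the expected structure
            texts.append(strong.text)
    return texts


print(headlines(ZONE))
```

Checking for a missing match inside the loop keeps one malformed article from stopping the whole iteration.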

Using XPath to get the first child for every child of a node

I'm trying to parse some HTML with the following structure, how can I extract the first <a> element of every <li> element using xpath?
<ul>
<li>
<a>
<span>
<a>
</li>
<li>
<a>
<span>
<a>
</li>
...
</ul>

"@Mathias: You are correct, I apologize. //li/a[1] did not work because it wasn't a direct child (there is an article tag in between, which I omitted for simplicity)."
Then let me post this as a solution with some more explanation.
If, as you have described, //li/a[1] does not return anything while (//li//a)[1] does, then the HTML sample you show is not representative of your actual document. In that case, a is a descendant of li, but not a direct child of it.
A correct XPath expression in this case is
//li//a[1]
but only use it if the level of nesting varies, i.e. if there could be other elements nested between li and a:
<li>
<article>
<other>
<a/>
If the nesting is consistent, but it is not always the article element that sits between li and a, then use
//li/*/a[1]
which avoids //, a step that is generally more expensive to evaluate than /.
Finally, if you know that the a elements you are interested in are always grandchildren of li elements and if it is always the article element in between them, use
//li/article/a[1]
"When I correct the expression to be //li/article/a[1], I get the first a for the first li."
//li/article/a[1] returns several results if there are several li elements that each have an article child containing a elements. If it returns only a single result, then either
you invoke this XPath expression in a context where only a single result is expected, e.g. if you use an XPath library in a programming language, or
the structure of your input document is even more intricate.
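The difference between these expressions can be checked on a small document. Below is a minimal Python sketch with xml.etree.ElementTree, whose XPath subset covers the child step, //, * and positional predicates. The sample markup (with closed tags and an article element between li and a) and the helper name texts are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Assumed sample: an article element sits between each li and its a children.
DOC = """
<ul>
  <li><article><a>first A</a><span/><a>second A</a></article></li>
  <li><article><a>first B</a><span/><a>second B</a></article></li>
</ul>
"""
root = ET.fromstring(DOC)


def texts(path):
    """Return the text of every element matched by the given path."""
    return [a.text for a in root.findall(path)]


print(texts(".//li/a[1]"))          # a is not a direct child of li
print(texts(".//li//a[1]"))         # first a under each parent, at any depth
print(texts(".//li/*/a[1]"))        # exactly one element of any name in between
print(texts(".//li/article/a[1]"))  # the element in between must be article
# find() returns only the first match, comparable to (//li//a)[1].
print(root.find(".//li//a").text)
```

Note that a[1] is evaluated per parent, which is why the last three paths return one a for every li, while find() yields a single element for the whole document.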

I think that the XPath to accomplish that would be .//ul/li/a[position()=1] .
Explanation:
The reason I spell it all out as .//ul/li/a is that, if there is an error when you use the XPath, your stack trace will reveal exactly what the locator pointed at, which is less vague. But you can obviously shorten it if you don't care: .//a .
Using the position clause, you can do =1 or >1, or whatever. I would choose [position()=1] over [1] because XPath positions are 1-based rather than 0-based, which might confuse others looking at your locator. I mean position=0, by logic, means null, right?
I start my locator with a . because I sometimes like to chain my locators together. You don't really need to start with the dot, but since I use // in this case, it's effectively the same as starting without a dot, with the additional ability to be chained.
Answer tested on http://the-internet.herokuapp.com/

Create nested list and not code block in markdown

I am trying to create a nested list after an equation in a markdown document in MarkdownPad but instead I am getting a code block. I am unsure how to escape it in order to get nested list (2nd order instead):
Here is the code:
1st order list:
2nd order list:
Some other text here which should be followed by a 2nd order nested list:
- 4 spaces followed by a "-" gives a code block instead of a second order list

Short version: you can't.
Since you have inserted a new paragraph (Some other text here which should be followed by a 2nd order nested list:), you have closed the list block. You can't jump straight to a sub-list[^1] without first having an enclosing list[^2].
If, however, the "some other text" is supposed to be an aside regarding the first 2nd order item (so that the following 2nd order item is actually the second 2nd order item of the list), then you can achieve it by not breaking the outer 1st order list:
- 1st item
    - 2nd item
      other text
    - also 2nd item
[^1]: i.e. a nested list.
[^2]: This may not be true for all markdown engines, but is the case for the engine used by MarkdownPad. As a side point, the base markdown spec doesn't define a syntax for nested lists.

Excel VBA: get content from online HTML table

Can anybody please show me a piece of VBA code that will get the text "hello" from this example online HTML table? The first node will be found by its ID (id="something").
...
<table id="something">
<tr>
<td><TABLE><TR><TD></TD></TR><TR><TD></TD></TR></TABLE></td><td></td>
</tr>
<tr>
<td></td><td></td><td>hello</td>
</tr>
...
I think it will be something like child->sibling->child->sibling->sibling->child, but I don't know the exact way.
EDIT
The tags added in the update are in CAPITALS. So if I use getElementById("something").getElementsByTagName("tr"), does the collection contain only two tr tags, or four (including the tags that are deeper children)?

If you did search for an answer, you might want to broaden your scope next time. There are plenty of questions and answers that deal with DOM stuff and VBA.
Use getElementById on HTMLElement instead of HTMLDocument
While the question (and answers) aren't exactly what you want, it will show you how to create something you can work with.
You'll need to use a mixture of getElementById() and getElementsByTagName() to retrieve your desired "hello".
eg: Document.getElementById("something").getElementsByTagName("tr")(1).getElementsByTagName("td")(2).innerText
Get the element "something"
Inside "something" get all "tr" tags (specifically the one at index 1)
Inside the returned tr tag get all "td" tags (specifically the one at index 2)
Get the innerText of the previous result
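These collections are 0-based, so the first item is item(0). Note that getElementsByTagName matches all descendants, not only direct children, so if the first cell contains a nested table as in the updated question, its rows are counted too and the "hello" row is at index 3 rather than 1.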
Update
document.getElementById() will return a single IHTMLElement (which will include all of its children), or nothing/null if it does not exist.
document.getElementsByTagName() will return a collection of IHTMLElement (again, each element will include all of its children), or an empty collection if none exist.
document.getElementsByTagName("tr") will return all tr elements inside the "document" element.
document.getElementsByTagName("tr")(0) will return the first (single) IHTMLElement from the collection. (Note the index at the end.)
There is no "sibling" feature (that I could find) of the InternetExplorer object in VBA, so you'd have to do it manually using the child index.
Using the DOM functions is the clean way to do it. It's much clearer than a chain like "Element.Children(0).children(1).children(2)", where you have no idea what each index means without looking it up manually.
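The question's edit asks whether a tag-name lookup also counts rows of a nested table. As a language-neutral illustration (a Python analog using xml.etree.ElementTree, not the VBA/IE object model), the sketch below contrasts an all-descendants search with a direct-children search. The markup is the question's table with tag names lowercased and the table closed, which are assumptions made so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# The question's table, lowercased and closed so that it is well-formed XML.
TABLE = """
<table id="something">
  <tr>
    <td><table><tr><td></td></tr><tr><td></td></tr></table></td><td></td>
  </tr>
  <tr>
    <td></td><td></td><td>hello</td>
  </tr>
</table>
"""
table = ET.fromstring(TABLE)

# iter() walks all descendants in document order, comparable to getElementsByTagName.
all_rows = list(table.iter("tr"))
# findall() with a bare tag name matches direct children only.
direct_rows = table.findall("tr")

print(len(all_rows), len(direct_rows))
print(all_rows[3].findall("td")[2].text)
print(direct_rows[1].findall("td")[2].text)
```

With a descendant search the nested rows occupy indexes 1 and 2, so the row containing "hello" moves to index 3; with a direct-children search it stays at index 1.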

I looked all over for the answer to this question, too. I finally found the solution by talking to a coworker, and it turned out to be recording a macro.
I know, you all think you are above this, but it is actually the best way. See the full post here: http://automatic-office.com/?p=344
In short, you want to record the macro, go to Data --> From Web, navigate to your website, and select the table you want.
I have used the "get element by id" type solutions above in the past, and they are great for a few elements, but if you want a whole table and you aren't super experienced, just record a macro.
Don't tell your friends, and then reformat it to look like your own work so no one knows you used the macro tool ;)
