Using xpath - How to select column data related to table header - html

<table border="1">
<tbody>
<tr>
<th>ID</th>
<th>Product</th>
<th>Color</th>
<th>Model</th>
</tr>
<tr>
<td>22</td>
<td>Car</td>
<td>blue</td>
<td>
<ul>
</ul>
</td>
</tr>
</tbody>
</table>
Above is a snippet of a highly nested HTML document. To get to the table level I have used the following XPath:
//th[contains(text(), "ref_code")]/following-sibling::td[contains(text(), "197")]/ancestor::table[2]
How can I then extend the same XPath to select specific table headers and the corresponding column data, like so:
ID |Product |Color
22 |Car |Blue
Any help will be appreciated

From your comments on the answers given here, I assume that you get the above table from an existing XPath, which is:
//th[contains(text(), "ref_code")]/following-sibling::td[contains(text(), "197")]/ancestor::table[2]
Now you want to extend this XPath so that you get the td values for a given column, e.g. Color. The XPath below should give you the td values for every column up to and including Color:
//td[position()<=(count(//tr/th[.='Color']/preceding-sibling::*)+1) ]
Assuming your first XPath works correctly, append the above XPath to it like this:
//th[contains(text(), "ref_code")]/following-sibling::td[contains(text(), "197")]/ancestor::table[2]//td[position()<=(count(//tr/th[.='Color']/preceding-sibling::*)+1) ]
Output:
<td>22</td>
<td>Car</td>
<td>blue</td>
If you want just the Color, use this XPath:
//td[(count(//tr/th[.='Color']/preceding-sibling::*)+1) ]
If you want just the Product, use this XPath:
//td[(count(//tr/th[.='Product']/preceding-sibling::*)+1) ]
If you want just the ID, use this XPath:
//td[(count(//tr/th[.='ID']/preceding-sibling::*)+1) ]
Note that the XPath changes only at th[.='XXX'], where XXX is the selected column header.
But if you want the output in the form of a table, you need to use XSLT, because you are trying to get a transformed view of your HTML, not just selected elements.

We search for table data (//table//td) by the position of the column's header: //table//th[text()='Color']
The expression [count(element/preceding-sibling::*) +1] is how to find an element's index.
So the result is:
//table//td[count(//table//th[text()='Color']/preceding-sibling::*) +1]
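The "count the preceding siblings, then add one" idea is not specific to XPath. As a rough sketch (not part of the answers above), here is the same lookup in Python with the standard-library xml.etree.ElementTree. ElementTree supports positional predicates such as td[3] but not count() or preceding-sibling, so the index is computed in Python instead. The function names and the inlined copy of the table are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the table from the question (assumed to be well-formed XML).
HTML = """<table border="1">
<tbody>
<tr><th>ID</th><th>Product</th><th>Color</th><th>Model</th></tr>
<tr><td>22</td><td>Car</td><td>blue</td><td><ul></ul></td></tr>
</tbody>
</table>"""


def column_index(table, header):
    """Return the 1-based position of the <th> whose text equals `header`."""
    headers = [th.text for th in table.findall(".//tr/th")]
    # Same idea as count(th/preceding-sibling::*) + 1 in the XPath answers.
    return headers.index(header) + 1


def column_cells(table, header):
    """Return the text of every <td> in the column under `header`."""
    idx = column_index(table, header)
    return [td.text for td in table.findall(f".//tr/td[{idx}]")]


def columns_up_to(table, header):
    """Return, per data row, the cells up to and including `header`."""
    idx = column_index(table, header)
    rows = []
    for tr in table.findall(".//tr"):
        cells = tr.findall("td")
        if cells:  # skip the header row, which has only <th> children
            rows.append([td.text for td in cells[:idx]])
    return rows


table = ET.fromstring(HTML)
color_cells = column_cells(table, "Color")
first_three = columns_up_to(table, "Color")
```

Here color_cells holds the single Color value, and first_three holds the ID, Product and Color values of each data row. Unlike the absolute //tr/th in the XPath versions, this sketch only looks inside the table element it is given.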

Related

Output is not displaying in HTML format after transforming the xml result using xslt when the attribute name contains special character

My xml has just 1 value as name =RDXXX-LOWER_DECK, value=10 mm. When this is transformed using xslt I get output correctly as below:
<table>
<tr valign="top">
<td width="200">RDXXX-LOWER_DECK</td>
<td width="200">10.000000000000 mm</td>
</tr>
</table>
But when I replace RDXXX-LOWER_DECK as RDXXX||LOWER_DECK (hyphen is replaced with double pipe) I don't get the output. Empty value is printed and name is printed as "Attribute" .
<table>
<tr valign="top">
<td width="200">Attribute</td>
<td width="200"></td>
</tr>
</table>
Kindly let me know how to retain || in the output.
You wrote: "My xml has just 1 value as name =RDXXX-LOWER_DECK, value=10 mm. But when I replace RDXXX-LOWER_DECK as RDXXX||LOWER_DECK (hyphen is replaced with double pipe)..."
If by that you mean that you have an XML like this:
<RDXXX-LOWER_DECK>10mm</RDXXX-LOWER_DECK>
and you changed it to look like this:
<RDXXX||LOWER_DECK>10mm</RDXXX||LOWER_DECK>
then you no longer have a well-formed XML document. The | character is not allowed in an element name.
You also wrote: "... I don't get the output. Empty value is printed and name is printed as "Attribute"."
That is strange, because you should have been getting an error.
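A quick way to check the well-formedness claim yourself is to feed both variants to an XML parser. This is a minimal sketch using Python's standard-library xml.etree.ElementTree; the helper name is_well_formed and the attribute-based third variant (with a made-up element name Attribute) are assumptions for illustration, not something taken from your XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts `xml_text`, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A hyphen is allowed inside an element name.
hyphen_ok = is_well_formed("<RDXXX-LOWER_DECK>10mm</RDXXX-LOWER_DECK>")

# The | character is not allowed inside an element name.
pipe_ok = is_well_formed("<RDXXX||LOWER_DECK>10mm</RDXXX||LOWER_DECK>")

# Hypothetical alternative: keep the name as attribute text, where | is fine.
attr_ok = is_well_formed('<Attribute name="RDXXX||LOWER_DECK" value="10 mm"/>')
```

If the name is carried as text or as an attribute value rather than as an element name, the double pipe does not affect well-formedness.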

Xpath selector, select an element depending of innertext of its childs

I have this html :
<tbody>
<tr id="1">
<td>foo faa</td>
<td>faa fii</td>
<td>foo faa</td>
<td>faa fuu</td>
</tr>
<tr id="2">
<td>foo fuu</td>
<td>fyy fuu</td>
<td>foo foo</td>
<td>fuu fii</td>
</tr>
<tr id="3">
<td>fuu faa</td>
<td>fii fuu</td>
<td>fuu fuu</td>
<td>fyy fee</td>
</tr>
<tr id="4">
<td>foo foo</td>
<td>fee faa</td>
<td>fee fyy</td>
<td>foo fuu</td>
</tr>
</tbody>
The td elements in my example contain two words, but in my real case they may contain more words, and the tr elements may contain more than 4 td children.
I want to select tr element(s) depending on the inner text of their children, and I want to be able to search for multiple values.
For example:
if I search "fuu" and "foo" and "fii", the expected result of the xpath must be the elements tr with id 1 and 2.
if I search "fuu" and "fii", the expected result of the xpath must be the elements tr with id 1 and 2 and 3.
if I search only "fee", the expected result of the xpath must be the element tr with id 3 and 4.
I tried this :
//tr[*[contains(text(), 'fuu')] and *[contains(text(), 'foo')] and *[contains(text(), 'fii')]]
It works as expected (http://xpather.com/Tdg5OGr2). But maybe there is a more generic/proper solution; any ideas?
If I want to search for, say, ten words, the XPath will become really big x)
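One way to keep a long word list manageable is to generate the expression from the list instead of writing it by hand. The sketch below is only an illustration under assumptions: build_xpath, matching_row_ids and make_tbody are hypothetical helper names, and the matching function re-implements the same "every word appears in some child" rule in plain Python, because the standard-library ElementTree cannot evaluate contains().

```python
import xml.etree.ElementTree as ET

# The rows from the question, keyed by tr id.
ROWS = {
    "1": ["foo faa", "faa fii", "foo faa", "faa fuu"],
    "2": ["foo fuu", "fyy fuu", "foo foo", "fuu fii"],
    "3": ["fuu faa", "fii fuu", "fuu fuu", "fyy fee"],
    "4": ["foo foo", "fee faa", "fee fyy", "foo fuu"],
}


def make_tbody(rows):
    """Build a <tbody> element tree from the row data above."""
    tbody = ET.Element("tbody")
    for row_id, cells in rows.items():
        tr = ET.SubElement(tbody, "tr", id=row_id)
        for cell in cells:
            ET.SubElement(tr, "td").text = cell
    return tbody


def build_xpath(words):
    """Generate the XPath from the question for any number of words.

    Assumes the words contain no single quotes.
    """
    conditions = [f"*[contains(text(), '{w}')]" for w in words]
    return "//tr[" + " and ".join(conditions) + "]"


def matching_row_ids(tbody, words):
    """Return ids of rows where every word occurs in at least one child."""
    ids = []
    for tr in tbody.iter("tr"):
        texts = [(child.text or "") for child in tr]
        if all(any(w in t for t in texts) for w in words):
            ids.append(tr.get("id"))
    return ids


tbody = make_tbody(ROWS)
xpath = build_xpath(["fuu", "foo", "fii"])
ids = matching_row_ids(tbody, ["fuu", "foo", "fii"])
```

The generated string can be passed to whatever XPath engine you already use; the expression is just as long, but you no longer have to maintain it by hand.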

Unable to select multiple values using xpath

Here is my HTML code:
<table id="laptop_detail" class="table">
<tr>
<td style="padding-left:18px" class="ha">Touchscreen</td>
<td class="val"><span class="no_icon">No</span></td>
</tr>
<tr>
<td style="padding-left:18px" class="ha">Water Dispenser</td>
<td class="val"><span class="no_icon">No</span></td>
</tr>
<tr>
<td style="padding-left:18px" class="ha">Colour / Material</td>
<td class="val">Grey</td>
</tr>
</table>
Here is my xpath:
$x('//*[@id="laptop_detail"]//tr/td[contains(. ,"Touchscreen")]/following-sibling::td[1]/span/text() and //*[@id="laptop_detail"]//tr/td[contains(. ,"Water Dispenser")]/following-sibling::td[1]/span/text() and //*[@id="laptop_detail"]//tr/td[contains(. ,"Colour")]/following-sibling::td[1]/text()')
But my XPath returns "true" instead of the "No, No, Grey" I need. I know there is something wrong with my XPath but I am unable to understand it.
EDIT: Okay, I had a little success. I was able to get "No, No" using this XPath:
$x('//*[@id="laptop_detail"]//tr/td[contains(. ,"Touchscreen") or contains(. ,"Water")]/following-sibling::td[1]/span/text()')
but I am unable to get "Grey", as that value is not inside a span tag.
Here is a fix to your solution (I've added the | operator, which returns the union of two node-sets, whereas "and" combines them into a single boolean):
//*[@id="laptop_detail"]/tr/td[contains(. ,"Touchscreen") or contains(. ,"Water")]/following-sibling::td[1]/span/text() | //*[@id="laptop_detail"]/tr/td[contains(. ,"Colour / Material")]/following-sibling::td[1]/text()
You can use a slightly simpler syntax (which may run faster) if it is acceptable for your logic. Note that the leading /table only matches if the table is the document root; use //table otherwise.
/table[@id="laptop_detail"]/tr/td[@class='val']/span/text() | /table[@id="laptop_detail"]/tr/td[@class='val']/text()
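If you end up post-processing the table outside XPath, the span-versus-plain-text difference can be sidestepped by reading all the text inside the value cell. The following is a minimal sketch in Python with the standard-library ElementTree, assuming the table is available as well-formed markup; values_for is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the table from the question.
HTML = """<table id="laptop_detail" class="table">
<tr>
<td style="padding-left:18px" class="ha">Touchscreen</td>
<td class="val"><span class="no_icon">No</span></td>
</tr>
<tr>
<td style="padding-left:18px" class="ha">Water Dispenser</td>
<td class="val"><span class="no_icon">No</span></td>
</tr>
<tr>
<td style="padding-left:18px" class="ha">Colour / Material</td>
<td class="val">Grey</td>
</tr>
</table>"""


def values_for(table, labels):
    """Return the value-cell text for every row whose label cell matches."""
    result = []
    for tr in table.findall(".//tr"):
        cells = tr.findall("td")
        if len(cells) < 2:
            continue
        label = "".join(cells[0].itertext())
        if any(wanted in label for wanted in labels):
            # itertext() collects text whether or not it is wrapped in a span.
            result.append("".join(cells[1].itertext()).strip())
    return result


table = ET.fromstring(HTML)
values = values_for(table, ["Touchscreen", "Water Dispenser", "Colour"])
```

With the three labels from the question, values contains "No", "No" and "Grey" in document order.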

How to get line from table with Jsoup

I have table without any class or id (there are more tables on the page) with this structure:
<table cellpadding="2" cellspacing="2" width="100%">
...
<tr>
<td class="cell_c">...</td>
<td class="cell_c">...</td>
<td class="cell_c">...</td>
<td class="cell">SOME_ID</td>
<td class="cell_c">...</td>
</tr>
...
</table>
I want to get only one row, which contains <td class="cell">SOME_ID</td> and SOME_ID is an argument.
UPD.
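Currently I am doing it this way:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.bank.gov.ua/control/uk/curmetal/detail/currency?period=daily").get();
Elements rows = doc.select("table tr");
// Match any row whose text mentions one of the wanted currency codes.
Pattern p = Pattern.compile("^.*(USD|EUR|RUB).*$");
for (Element trow : rows) {
    Matcher m = p.matcher(trow.text());
    if (m.find()) {
        System.out.println(m.group());
    }
}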
But why do I need Jsoup if most of the work is done by a regexp? Just to download the HTML?
If you have a generic HTML structure that always is the same, and you want a specific element which has no unique ID or identifier attribute that you can use, you can use the css selector syntax in Jsoup to specify where in the DOM-tree the element you are after is located.
Consider this HTML source:
<html>
<head></head>
<body>
<table cellpadding="2" cellspacing="2" width="100%">
<tbody>
<tr>
<td class="cell">I don't want this one...</td>
<td class="cell">Neither do I want this one...</td>
<td class="cell">Still not the right one..</td>
<td class="cell">BINGO!</td>
<td class="cell">Nothing further...</td>
</tr> ...
</tbody>
</table>
</body>
</html>
We want to select and parse the text from the fourth <td> element.
We specify that we want to select the <td> element that has the sibling index 3 in the DOM tree, by using td:eq(3). In the same way, we can select all <td> elements before index 3 by using td:lt(3). As you've probably figured out, eq stands for "equal" and lt for "less than".
Without using first() you will get an Elements object, but we only want the first one so we specify that. We could use get(0) instead too.
So, the following code
Element e = doc.select("td:eq(3)").first();
System.out.println("Did I find it? " + e.text());
will output
Did I find it? BINGO!
Some good reading in the Jsoup cookbook!
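For comparison only, the same two lookups (a cell by its index, and a row by the content of one of its cells) can be sketched without Jsoup. This is a hedged Python analogue using the standard-library ElementTree on the sample markup above, not Jsoup code; cell_at and rows_with_cell are hypothetical helper names. ElementTree's td[4] counts only td children, which coincides with the zero-based sibling index 3 here because every child of the row is a td.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample table from this answer.
HTML = """<table cellpadding="2" cellspacing="2" width="100%">
<tbody>
<tr>
<td class="cell">I don't want this one...</td>
<td class="cell">Neither do I want this one...</td>
<td class="cell">Still not the right one..</td>
<td class="cell">BINGO!</td>
<td class="cell">Nothing further...</td>
</tr>
</tbody>
</table>"""


def cell_at(table, index):
    """Text of the first <td> at zero-based `index` within its row."""
    found = table.find(f".//tr/td[{index + 1}]")  # XPath positions are 1-based
    return found.text if found is not None else None


def rows_with_cell(table, wanted, css_class="cell"):
    """Rows containing a <td> with the given class and exact text."""
    rows = []
    for tr in table.iter("tr"):
        for td in tr.findall("td"):
            if td.get("class") == css_class and (td.text or "").strip() == wanted:
                rows.append(tr)
                break
    return rows


table = ET.fromstring(HTML)
fourth = cell_at(table, 3)
matches = rows_with_cell(table, "BINGO!")
```

The second helper is closer to what the question asks for: it picks the row by the value of a cell (the SOME_ID argument) rather than by a fixed position.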

Read td values using prototype

I would like to read the values of HTML td using prototype. For example, say you have a table as follows
<table id="myTable">
<tr>
<td>apple</td>
<td>orange</td>
</tr>
<tr>
<td>car</td>
<td>bus</td>
</tr>
</table>
I would like to read only the values: apple, orange, car and bus. I am unable to find a way to do it. Any help would be greatly appreciated.
Thanks,
J
This should work:
var values = $$('#myTable td').collect(function(element) {
// stripTags(), if you're only interested in the actual content
return element.innerHTML.stripTags();
});
Alternatively, the following returns an array of strings (the raw innerHTML of each cell):
$$('#myTable td').pluck('innerHTML');
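Outside the browser, the same "collect the text of every td" step can be sketched in a few lines. This is only an illustrative Python analogue using the standard-library ElementTree on the table from the question, not Prototype code; joining itertext() plays roughly the role of stripTags() by dropping any nested markup.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the table from the question.
HTML = """<table id="myTable">
<tr>
<td>apple</td>
<td>orange</td>
</tr>
<tr>
<td>car</td>
<td>bus</td>
</tr>
</table>"""

table = ET.fromstring(HTML)

# One string per cell, in document order, with nested tags removed.
values = ["".join(td.itertext()).strip() for td in table.iter("td")]
```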