HTML parsing with XPath: flattened hierarchical data - html

My target HTML is a flattened table of elements with 2 levels of data defined by class attribute:
<tr>
<td class="type">Type 1</td>
</tr>
<tr>
<td class="name">name1</td>
<td class="year">1970</td>
<td class="rank">1</td>
</tr>
<tr>
<td class="name">name2</td>
<td class="year">1982</td>
<td class="rank">3</td>
</tr>
The goal is to parse out lists of the name, year and rank elements, which I accomplish with these XPath expressions:
//td[@class = 'name']/text()
//td[@class = 'year']/text()
//td[@class = 'rank']/text()
Each element belongs to the immediately preceding type row:
<tr>
<td class="type">Type 1</td>
</tr>
I would like to have "Type 1" assigned to each element parsed above. It could be a separate list of the same length. Of course, my target HTML contains many such elements within the same 2-level hierarchy: type - element (name, year, rank).

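The following rather clumsy XPath concatenates the closest preceding td with class 'type' to the name td matched above. Note that as a plain XPath 1.0 expression, concat() uses only the first node of each node-set, so on its own it returns a single string for the first name.
concat(//td[@class = 'name']/preceding::td[@class='type'][1]/text(), '-',
       //td[@class = 'name']/text())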
This probably makes more sense when shown in the following xsl
<xsl:for-each select="//td[@class='name']">
  <Name>
    <xsl:value-of select="concat(preceding::td[@class='type'][1]/text(),
                                 '-', ./text())" />
  </Name>
</xsl:for-each>
Applied to the following xml
<xml>
<tr>
<td class="type">Type 1</td>
</tr>
<tr>
<td class="name">name1</td>
<td class="year">1970</td>
<td class="rank">1</td>
</tr>
<tr>
<td class="name">name2</td>
<td class="year">1982</td>
<td class="rank">3</td>
</tr>
<tr>
<td class="type">Type 2</td>
</tr>
<tr>
<td class="name">name3</td>
<td class="year">1971</td>
<td class="rank">2</td>
</tr>
<tr>
<td class="name">name4</td>
<td class="year">1983</td>
<td class="rank">4</td>
</tr>
</xml>
With the result
<Name>Type 1-name1</Name>
<Name>Type 1-name2</Name>
<Name>Type 2-name3</Name>
<Name>Type 2-name4</Name>

Solution 1
First, find the td elements of interest. For example, the name tds with the following pseudo-code:
name_tds = doc.evalXPath("//td[@class = 'name']")
Then you can find the corresponding type td using a name td as context node like this:
type_td = name_td.evalXPath("../preceding-sibling::tr[td[@class = 'type']][1]/td")
Solution 2
Simply iterate all the tds and remember the last type you found. Pseudo-code:
foreach (td in doc.evalXPath("//td")) {
    class = td.getAttribute("class");
    if (class == "type") {
        type = td.textContent();
    }
    else if (class == "name") {
        name = td.textContent();
        println("type: " + type + ", name: " + name);
    }
    // Same for year and rank.
}
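As a rough illustration of Solution 2, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It assumes the markup is well-formed enough to parse as XML and is wrapped in a single root element; the names SAMPLE and rows_with_type are made up for this example and are not part of any library.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in one root element (an assumption)
# so that it parses as well-formed XML.
SAMPLE = """
<table>
  <tr><td class="type">Type 1</td></tr>
  <tr><td class="name">name1</td><td class="year">1970</td><td class="rank">1</td></tr>
  <tr><td class="name">name2</td><td class="year">1982</td><td class="rank">3</td></tr>
  <tr><td class="type">Type 2</td></tr>
  <tr><td class="name">name3</td><td class="year">1971</td><td class="rank">2</td></tr>
  <tr><td class="name">name4</td><td class="year">1983</td><td class="rank">4</td></tr>
</table>
"""


def rows_with_type(markup):
    """Return one dict per data row, tagged with the most recently seen type."""
    root = ET.fromstring(markup)
    current_type = None
    records = []
    for tr in root.iter("tr"):
        # Map each cell's class attribute to its trimmed text.
        cells = {td.get("class"): (td.text or "").strip() for td in tr.iter("td")}
        if "type" in cells:
            # A type row: remember it for the data rows that follow.
            current_type = cells["type"]
        elif "name" in cells:
            records.append({"type": current_type, **cells})
    return records


records = rows_with_type(SAMPLE)
for record in records:
    print(record)
```

Working row by row rather than cell by cell keeps name, year and rank of one element together, which may be more convenient than three separate lists.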

Related

Count <tr> HTML tag grouped by their @class attribute name in XPath

I need an XPath expression that counts all the <tr> rows whose class attribute starts with the string room_loop_counter, grouped by the attribute value itself.
I have the following sample HTML code to extract data from:
<tbody id="container" >
<tr class="room_loop_counter1 maintr">
<td class="legibility " rowspan="6"></td>
<td colspan="4" style="padding:0;"></td>
</tr>
<tr class="room_loop_counter1">
<td ></td>
<td class=""></td>
</tr>
<tr class="room_loop_counter1"></tr>
<tr class="room_loop_counter2 maintr divider"></tr>
<tr class="room_loop_counter2"></tr>
<tr class="room_loop_counter3 maintr divider"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
</tbody>
Given the above HTML, I would want to get 2,1,4 as the result. Each count is the number of elements in the group minus one, since I want to exclude the first <tr> (the one with maintr), which is the header.
There could be other <tr> elements between these <tr> elements, so they are not strictly one after the other, and we can't rely on following-sibling or preceding-sibling logic.
I've tried with the following XPath expression :
count(//table[@id="maxotel_rooms"]/tbody/tr[@class=distinct-values(//table[@id="maxotel_rooms"]/tbody/tr[starts-with(@class, "room_loop_counter") and not(contains(@class, "maintr"))]/@class)]/@class])
but it doesn't work in Chrome (evaluating it with $x('') in the console), since Chrome doesn't recognize the distinct-values function.
Could you suggest a possible solution? What is the best approach ?
Try this XPath, which selects one tr per distinct class that starts with the given prefix and does not contain the maintr class name:
//tbody/tr[starts-with(@class, "room_loop_counter") and not(contains(@class, "maintr"))]/following::tr[not(./@class=following::tr/@class) and not(contains(@class, "maintr"))]
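JavaScript (counting the nodes matched by an XPath; replace path with your own expression):
var path = "//body/div";
// Wrapping the path in count() makes the result a number, read via numberValue.
var uniquePathCount = window.document.evaluate('count(' + path + ')', window.document, null, XPathResult.ANY_TYPE, null);
console.log( uniquePathCount );
console.log( uniquePathCount.numberValue );
Output: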
<tr class="room_loop_counter1"/>
<tr class="room_loop_counter2"/>
<tr class="room_loop_counter3"/>
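If the grouping does not have to happen inside a single XPath 1.0 expression, the per-group counts can also be computed in ordinary code after selecting the rows. The following is only a sketch in Python with the standard library, assuming the table body parses as XML; SAMPLE is a trimmed copy of the question's markup and count_rows_per_group is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of the markup from the question (cell contents omitted).
SAMPLE = """
<tbody id="container">
  <tr class="room_loop_counter1 maintr"><td/></tr>
  <tr class="room_loop_counter1"><td/></tr>
  <tr class="room_loop_counter1"></tr>
  <tr class="room_loop_counter2 maintr divider"></tr>
  <tr class="room_loop_counter2"></tr>
  <tr class="room_loop_counter3 maintr divider"></tr>
  <tr class="room_loop_counter3"></tr>
  <tr class="room_loop_counter3"></tr>
  <tr class="room_loop_counter3"></tr>
  <tr class="room_loop_counter3"></tr>
</tbody>
"""


def count_rows_per_group(markup, prefix="room_loop_counter", header="maintr"):
    """Count non-header rows per class token that starts with `prefix`."""
    root = ET.fromstring(markup)
    counts = Counter()
    for tr in root.iter("tr"):
        tokens = tr.get("class", "").split()
        if header in tokens:
            continue  # skip the header row of each group
        for token in tokens:
            if token.startswith(prefix):
                counts[token] += 1
    return counts


counts = count_rows_per_group(SAMPLE)
print(list(counts.values()))  # [2, 1, 4]
```

Because the rows are matched by class token rather than by position, it does not matter whether unrelated rows sit between them. A group that has only a header row would not appear in the result at all.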

how to add class in table in python using Dominate library

I have created a table using the Dominate library, but now I want to change my table's class. Can someone help me do that?
doc1 = dominate.document(title='Dominate your HTML')
with doc1:
    with div():
        attr(cls='body')
        h1('Survey Report : Survey Report')
oc = dominate.document(title="whatever")
with doc1:
    tags.style(".calendar_table{width:880px;}")
    tags.style("body{font-family:Helvetica}")
    tags.style("h1{font-size:x-large}")
    tags.style("h2{font-size:large}")
    tags.style("table{border-collapse:collapse}")
    tags.style("th{font-size:small;border:1px solid gray;padding:4px;background-color:#DDD}")
    tags.style("td{font-size:small;text-align:center;border:1px solid gray;padding:4px}")
    with tags.table():
        with tags.thead():
            tags.th("Nominee", style = "color:#ffffff;background-color:#6A75F2")
            tags.th("counts", style = "color:#ffffff;background-color:#6A75F2")
        with tags.tbody():
            for i in range(0,len(nom)):
                with tags.tr(): #Row 1
                    tags.td(nom[i], style = "font-size:small;text-align:center;padding:4px")
                    if int(count_nom[i]) > 1:
                        tags.td(count_nom[i], style = "font-size:small;text-align:center;padding:4px;background-color:#F4D8D2")
                    else:
                        tags.td(count_nom[i], style = "font-size:small;text-align:center;padding:4px")
            with tags.tr(): #Row 1
                tags.td(b("Grand Total"), style = "font-size:small;text-align:center;padding:4px")
                tags.td(b(sum(count_nom)), style = "font-size:small;text-align:center;padding:4px")
with open('/root/survey/'+'survey'+'.html', 'w') as f:
    f.write(doc1.render())
With this I am able to create the table in HTML:
<div class="body">
<h1>Survey Report</h1>
</div>
<style>.calendar_table{width:880px;}</style>
<style>body{font-family:Helvetica}</style>
<style>h1{font-size:x-large}</style>
<style>h2{font-size:large}</style>
<style>table{border-collapse:collapse}</style>
<style>th{font-size:small;border:1px solid gray;padding:4px;background-color:#DDD}</style>
<style>td{font-size:small;text-align:center;border:1px solid gray;padding:4px}</style>
<table>
<thead>
<th style="color:#ffffff;background-color:#6A75F2">Nominee</th>
<th style="color:#ffffff;background-color:#6A75F2">counts</th>
</thead>
<tbody>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Deepesh Ahuja</td>
<td style="font-size:small;text-align:center;padding:4px">1</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Sabyasachi Mallick</td>
<td style="font-size:small;text-align:center;padding:4px">1</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Raju Singh</td>
<td style="font-size:small;text-align:center;padding:4px">1</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Abarna Ravi</td>
<td style="font-size:small;text-align:center;padding:4px;background-color:#F4D8D2">2</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">
<b>Grand Total</b>
</td>
<td style="font-size:small;text-align:center;padding:4px">
<b>5</b>
</td>
</tr>
</tbody>
</table><br><br><br>
Now, how can I set the table class in the Python code so that the output looks like
<table class='calendar_table'>
Can someone help me set the class of the table and other tags using the Python dominate library?
Using the example syntax from the documentation on GitHub:
from dominate.tags import *
testTable = table(border = 1)
print(testTable)
which prints:
<table border="1"></table>
However, since class is a reserved word in Python, you can't pass it as a keyword argument named class. One way is to set the attribute after creating the tag:
testTable.set_attribute('class','my_class_name')
Adding the above to the original instance of testTable results in:
<table border="1" class="my_class_name"></table>
dominate also accepts cls as a keyword alias for class (the question's code already uses attr(cls='body')), so table(cls='my_class_name') should produce the same attribute.

How to populate an array with text from html webscraping in ruby

I have used the nokogiri ruby gem to webscrape an html file for only the text under the tableData class. The html code is setup like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to populate an array within an array with only the 4 text values pertaining to each person. Which should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to write a for loop that iterates over either the HTML itself or the vt variable to produce an array of arrays. I know some push statements are involved inside the for loop, but it's the actual structure of the loop that I am having trouble putting together. If you could explain in your answer how the for loop works in this situation, it would be much appreciated.
This is the basic structure you need. Instead of a for loop with push, use map, which builds a new array from the value of the block for each element:
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
  tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]
Try parsing the snippet as XML, finding all "tr" elements via XPath, and collecting the stripped text of each "td" child (get_html_snippet stands for whatever returns your HTML string):
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
  tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]

HtmlAgilityPack reading table after specified table

I have a structure similar to this:
<table class="superclass">
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</table>
<table cellspacing="0">
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</table>
This is how I get the first table with class:
HtmlNode firstTable = document.DocumentNode.SelectSingleNode("//table[@class=\"superclass\"]");
Then I read the data. However, I don't know how to get straight to the next table and read its data too. Any ideas?
I'd rather avoid counting which table it is and then using an index to reach it.
XPath has a following-sibling axis, which selects elements that follow the current context element at the same level:
HtmlNode firstTable = document.DocumentNode.SelectSingleNode("//table[@class=\"superclass\"]");
HtmlNode nextTable = firstTable.SelectSingleNode("following-sibling::table");
If you want to access multiple nodes, consider the SelectNodes(xpath) method instead of SelectSingleNode(xpath).
Here is some sample code for reference; it may need adapting to your case.
var tables = htmlDocument.DocumentNode.SelectNodes("//table");
foreach (HtmlNode table in tables)
{
    if (table.GetAttributeValue("class", "").Contains("superclass"))
    {
        //this is the table of class="superclass"
    }
    else
    {
        //this is the other table.
    }
}

Html Agility Pack: how to scrape <tr> text?

<tr id='tr1' align='center' border=0 class='headerclass'>
Example text
<tr id='tr11' align='center' border=0 bgColor='99ccff'>
<td id='td1' class='headerclass'>Example Header 1 </td>
<td id='td2' class='headerclass'>Example Header 2 </td>
<td id='td3' class='headerclass'>Example Header 3 </td>
</tr>
<tr id='tr12' align='center"'bgColor='white'>
<td id='v1' class='colclass'>value 1</td>
<td id='v2' class='colclass'>value 2</td>
<td id='v3' class='colclass'>value 3</td>
</tr>
</tr>
Above is the HTML that I want to scrape. I want to get the Example text that sits directly inside the outer <tr></tr>. I tried to use InnerText (code shown below), but it also returns all the text inside the <td></td> elements, which is not what I want. I would like to get Example text only.
var nodes = htmlDoc.DocumentNode.SelectNodes("//tr").Where(x => x.Attributes["id"] != null && x.Attributes["id"].Value.Contains("tr1"));
foreach (var htmlNode in nodes)
{
    Console.WriteLine(htmlNode.InnerText);
}
Output:
Example text
Example Header 1
Example Header 2
Example Header 3
value 1
value 2
value 3
Thank you.
You could do something like this, taking only the first child node (the text node) of the matching tr:
var text = doc.DocumentNode.Descendants("tr")
    .First(p => p.Attributes["id"] != null &&
                p.Attributes["id"].Value.Contains("tr1")).ChildNodes[0].InnerText.Trim();
The output is:
Example text
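The same distinction between an element's own leading text and the text of all its descendants can be illustrated outside Html Agility Pack. This is only a sketch in Python's standard xml.etree.ElementTree, not the C# API; it assumes a cleaned-up, well-formed version of the markup (SAMPLE), and own_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cleaned-up, shortened version of the markup from the question (an assumption:
# the original attributes are not valid XML, so they are simplified here).
SAMPLE = """
<tr id="tr1" class="headerclass">
  Example text
  <tr id="tr11">
    <td id="td1" class="headerclass">Example Header 1 </td>
  </tr>
  <tr id="tr12">
    <td id="v1" class="colclass">value 1</td>
  </tr>
</tr>
"""


def own_text(element):
    """Return only the text that appears before the element's first child."""
    return (element.text or "").strip()


outer = ET.fromstring(SAMPLE)

# Direct text only, comparable to ChildNodes[0].InnerText.Trim() above.
print(own_text(outer))

# All descendant text with whitespace collapsed, comparable to InnerText.
print(" ".join("".join(outer.itertext()).split()))
```

In both libraries the point is the same: pick the text node that is a direct child of the element instead of asking for the concatenated text of the whole subtree.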