Html Agility Pack: how to scrape <tr> text?

<tr id='tr1' align='center' border=0 class='headerclass'>
Example text
<tr id='tr11' align='center' border=0 bgColor='99ccff'>
<td id='td1' class='headerclass'>Example Header 1 </td>
<td id='td2' class='headerclass'>Example Header 2 </td>
<td id='td3' class='headerclass'>Example Header 3 </td>
</tr>
<tr id='tr12' align='center' bgColor='white'>
<td id='v1' class='colclass'>value 1</td>
<td id='v2' class='colclass'>value 2</td>
<td id='v3' class='colclass'>value 3</td>
</tr>
</tr>
Above is the HTML I want to scrape. I want to get only "Example text", which sits directly inside the outer <tr>. I tried InnerText (code shown below), but it also returns all the text inside the <td> elements, which is not what I want.
var nodes = htmlDoc.DocumentNode.SelectNodes("//tr").Where(x => x.Attributes["id"] != null && x.Attributes["id"].Value.Contains("tr1"));
foreach (var htmlNode in nodes)
{
Console.WriteLine(htmlNode.InnerText);
}
Output:
Example text
Example Header 1
Example Header 2
Example Header 3
value 1
value 2
value 3
Thank you.

You could do something like this:
// ChildNodes[0] is the text node that precedes the first nested element.
var text = htmlDoc.DocumentNode.Descendants("tr")
    .First(p => p.Attributes["id"] != null &&
                p.Attributes["id"].Value.Contains("tr1"))
    .ChildNodes[0].InnerText.Trim();
The output is:
Example text
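The same idea, taking only the text that precedes the first child element, can be sketched outside Html Agility Pack as well. The following is a minimal Python illustration using the standard library's xml.etree.ElementTree. It assumes a cleaned-up, well-formed variant of the markup (the SAMPLE string and the own_text helper are hypothetical names, not part of the question). In ElementTree, an element's text attribute holds only the text before its first child, which plays the role of ChildNodes[0] here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the question's markup
# (attributes quoted, wrapped in a single root element).
SAMPLE = """
<table>
  <tr id="tr1" class="headerclass">
    Example text
    <tr id="tr11"><td id="td1">Example Header 1</td></tr>
    <tr id="tr12"><td id="v1">value 1</td></tr>
  </tr>
</table>
"""


def own_text(element):
    """Return the text directly inside element, before its first child."""
    return (element.text or "").strip()


root = ET.fromstring(SAMPLE.strip())
# Exact attribute match, so tr11 and tr12 are not selected.
outer_row = root.find(".//tr[@id='tr1']")
print(own_text(outer_row))  # Example text
```

Real-world HTML is often not well-formed XML, so this only illustrates the "first text node" idea, not a replacement for an HTML parser.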

Related

xpath condition select text of one node or another node

This is my test data:
<tbody>
<tr>
<td>foo 1</td>
<td>first interest</td>
<td>bar 1</td>
</tr>
<tr>
<td>foo 2</td>
<td>
<p>second interest</p>
</td>
<td>bar 2</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
</tbody>
I'd like to select the text of the second cell (td[2]) of every table row, but the text may be inside a child element (a paragraph p).
When I execute the XPath //tbody/tr[1]/td[2]/p/text() | //tbody/tr[1]/td[2]/text() the result is fine, but for the second row, //tbody/tr[2]/td[2]/p/text() | //tbody/tr[2]/td[2]/text() returns three text nodes, the first and last of which contain only whitespace. How can I modify the XPath so that it always returns only the text I'm interested in? Note: a cell can also be empty, and I don't want to get those.
Thanks.
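Try this XPath to get the text of the second cell of each row, skipping empty cells. //text() matches text nodes at any depth inside the cell, and the [normalize-space()] predicate drops nodes that contain only whitespace: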
//tbody/tr/td[2]//text()[normalize-space()]
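To see what that expression selects, here is a rough Python sketch that emulates it with the standard library. ElementTree's limited XPath has no text() or normalize-space(), so the sketch assumes itertext() as a stand-in for //text() and a strip() check as a stand-in for the predicate. The names TABLE and second_cell_texts are made up for the illustration, and unlike the XPath it also trims the returned strings.

```python
import xml.etree.ElementTree as ET

# The question's test data, lightly condensed.
TABLE = """
<tbody>
  <tr><td>foo 1</td><td>first interest</td><td>bar 1</td></tr>
  <tr><td>foo 2</td><td>
    <p>second interest</p>
  </td><td>bar 2</td></tr>
  <tr><td>
  </td><td>
  </td><td>
  </td></tr>
</tbody>
"""


def second_cell_texts(tbody):
    """Collect non-blank text from the second cell of every row."""
    texts = []
    for cell in tbody.findall("./tr/td[2]"):
        # itertext() walks the cell and all descendants, like //text().
        for chunk in cell.itertext():
            if chunk.strip():  # same filtering idea as [normalize-space()]
                texts.append(chunk.strip())
    return texts


root = ET.fromstring(TABLE.strip())
print(second_cell_texts(root))  # ['first interest', 'second interest']
```

The whitespace around the p element and the empty third row contribute nothing, which is the behaviour the question asks for.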

How to populate an array with text from html webscraping in ruby

I have used the Nokogiri Ruby gem to scrape an HTML file for only the text under the tableData class. The HTML is set up like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to build an array of arrays that holds only the four text values for each person. It should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to write a for loop that iterates over either the HTML itself or the vt variable to produce an array of arrays. I know some push statements are involved along with the for loop, but it's the actual structure of the loop that I am having trouble putting together. If you could explain in your answer how the loop works in this situation, it would be much appreciated.
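This is the basic structure you need. Instead of a for loop with push, use map: the outer map produces one array per row, and the inner map collects the text of each cell in that row.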
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]
Try parsing the snippet as XML, finding all tr elements via XPath, and collecting the stripped text of each row's td children (get_html_snippet stands for whatever returns the HTML string):
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
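Since the question asks how the loop itself is structured, here is the same nested iteration written out with explicit loops and appends. This is only a sketch in Python using the standard library rather than Nokogiri; SNIPPET and rows_as_lists are hypothetical names, and it assumes the markup is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# The table from the question (the wrapping div is omitted).
SNIPPET = """
<table class="table">
  <tbody>
    <tr>
      <td class="tableData"> Jane Doe</td>
      <td class="tableData"> 01/01/2017</td>
      <td class="tableData">01/09/2017 </td>
      <td class="tableData">Vacation</td>
    </tr>
    <tr>
      <td class="tableData">John Doe</td>
      <td class="tableData"> 01/01/2017</td>
      <td class="tableData">01/09/2017 </td>
      <td class="tableData">Vacation</td>
    </tr>
  </tbody>
</table>
"""


def rows_as_lists(table):
    """Return one list of stripped cell texts per table row."""
    rows = []
    for tr in table.iter("tr"):      # outer loop: once per row
        row = []
        for td in tr.findall("td"):  # inner loop: once per cell in that row
            row.append((td.text or "").strip())
        rows.append(row)             # push the finished row
    return rows


data = rows_as_lists(ET.fromstring(SNIPPET.strip()))
print(data)
```

The outer loop corresponds to the outer map in the answers above and the inner loop to the inner map; map simply builds and returns the list that the explicit version appends to.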

HTML Table with 5 rows and 5 columns [closed]

How do I create an HTML table with 5 rows and 5 columns?
I tried the following arrangement
<table width="100%" border="1">
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</table>
But I only get 5 columns and 3 rows. Please help.
Thanks, is there an easier way than just writing 5 lines each?
TL;DR: You only provided 3 <tr> markups, so you only got 3 rows.
In HTML, tables have the following basic structure [1]:
table
|- row_1
| |- data_1
| `- data_2
|- row_2
| |- ...
...
[1] Other elements exist, such as th and thead, but for clarity's sake I'm ignoring them. More info: MDN, Table element.
You first declare the table with the <table> element, and then the rows with <tr> (table row).
Inside each row, you declare the data cells with <td> (table data).
Here's a snippet (CSS first, then HTML) that creates a table with 5 columns and 5 rows:
table {
border-collapse: collapse;
}
table td {
border: 1px solid black;
}
<table>
<tr>
<td>row_1/col_1</td>
<td>col_2</td>
<td>col_3</td>
<td>col_4</td>
<td>col_5</td>
</tr>
<tr>
<td>row_2</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>row_3</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>row_4</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>row_5</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</table>
<tr> is short for table row. You only have 3, so you'll only get 3 rows and not 5. Add 2 more rows.
Edit: added some JavaScript that builds the rows from an array, so each row does not have to be written by hand.
let table = document.getElementById("table");
//fake data
const contentList = [
//row1
[{title: "title"},{title: "title"},{title: "title"}],
//row2
[{title: "title"},{title: "title"},{title: "title"}],
//row3
[{title: "title"},{title: "title"},{title: "title"}]
];
//function to generate a row.
const rowTemplate = (data, rowNumber) =>{
let rowString = "<tr>";
data.forEach((td, index)=>{
rowString += `<td>${td.title}, row: ${rowNumber}, column: ${index}</td>`;
});
rowString += "</tr>"
return rowString;
};
//function to generate all the rows
const generateRows = (data, elementToPopulate) => {
let htmlString = "";
data.forEach((row,index)=>{
htmlString += rowTemplate(row, index);
});
elementToPopulate.innerHTML = htmlString;
}
//call method
generateRows(contentList, table);
<table width="100%" border="1" id="table">
</table>
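For the follow-up about avoiding five hand-written rows, the markup can also be generated as a string before it reaches the page. The sketch below is one hedged way to do that in Python; build_table and its rows and cols parameters are invented for this example and are not part of either answer.

```python
def build_table(rows=5, cols=5):
    """Return HTML for a table with the given number of rows and columns."""
    lines = ['<table width="100%" border="1">']
    for r in range(1, rows + 1):
        # One <td> per column, labelled so the layout is easy to check.
        cells = "".join(f"<td>row {r}, col {c}</td>" for c in range(1, cols + 1))
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


html = build_table()
print(html)
```

Calling build_table(3, 5) would reproduce the 3-row table from the question, which makes the missing two rows easy to see.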

How do I write my html table to excel file using epplus

What I'm trying to do here is write a simple HTML table to an .xlsx (Excel) file using EPPlus. The code I've got so far is:
Controller:
public void saveToExcel(string tbl)
{
using (ExcelPackage p = new ExcelPackage())
{
p.Workbook.Worksheets.Add("demo");
ExcelWorksheet ws = p.Workbook.Worksheets[1];
ws.Name = "Demo";
Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
Response.AddHeader("content-disposition", "attachment; filename=ExcelDemo.xlsx");
Response.BinaryWrite(p.GetAsByteArray());
}
}
This creates an empty workbook. All I want to do right now is write this table that I have in my
View:
<Table id="tbl" name="tbl">
<tr>
<td>
Title 1
</td>
<td >
Title 1
</td>
<td>
Title 1
</td>
</tr>
<tr>
<td >
Row 1
</td>
<td>
Row 1
</td>
<td>
Row 1
</td>
</tr>
<tr>
<td >
Row 2
</td>
<td>
Row 2
</td>
<td>
Row 2
</td>
</tr>
<tr>
<td >
Row 2
</td>
<td>
Row 2
</td>
<td>
Row 2
</td>
</tr>
</table>
@Html.ActionLink("saveToExcel", "saveToExcel")
to the workbook. But I just don't know how or where to start.
I'd be thankful for any pointers in the right direction.
My guess:
First of all, you have to convert your HTML table to a .NET DataTable.
How to do that is described in the linked answer ("Convert Table").
Next, use this code (assuming the DataTable you created is called data):
Dim attachment As String = "attachment; filename=MyExcelPage.xlsx"
Dim epackage As ExcelPackage = New ExcelPackage
Dim excel As ExcelWorksheet = epackage.Workbook.Worksheets.Add("ExcelTabName")
excel.Cells("A1").LoadFromDataTable(data, True)
HttpContext.Current.Response.Clear()
HttpContext.Current.Response.ClearHeaders()
HttpContext.Current.Response.ClearContent()
HttpContext.Current.Response.AddHeader("content-disposition", attachment)
HttpContext.Current.Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
HttpContext.Current.Response.BinaryWrite(epackage.GetAsByteArray())
HttpContext.Current.Response.End()
epackage.Dispose()
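The link for the first step (turning the HTML table into tabular data) is not preserved here, so the following is only a rough illustration of what that step involves, written in Python with the standard library rather than .NET. The TableRows class, table_to_rows function, and VIEW sample are all hypothetical names. It does not use or imitate any EPPlus API; the resulting list of rows is the kind of structure that would then be loaded into a worksheet.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text chunks while inside a cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self.rows:  # ignore cells that appear outside any row
                self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Shortened version of the view's table from the question.
VIEW = """
<Table id="tbl" name="tbl">
<tr><td> Title 1 </td><td > Title 1 </td></tr>
<tr><td > Row 1 </td><td> Row 1 </td></tr>
</table>
"""
print(table_to_rows(VIEW))  # [['Title 1', 'Title 1'], ['Row 1', 'Row 1']]
```

In the .NET version, each inner list would become a DataRow, with the first row typically used for the column names.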

HTML parsing with XPath: flattened hierarchical data

My target HTML is a flattened table of elements with 2 levels of data defined by class attribute:
<tr>
<td class="type">Type 1</td>
</tr>
<tr>
<td class="name">name1</td>
<td class="year">1970</td>
<td class="rank">1</td>
</tr>
<tr>
<td class="name">name2</td>
<td class="year">1982</td>
<td class="rank">3</td>
</tr>
The goal is to parse out lists of name, year, and rank elements, which I accomplish with these XPath expressions:
//td[@class = 'name']/text()
//td[@class = 'year']/text()
//td[@class = 'rank']/text()
Each element belongs to the immediately preceding
<tr>
<td class="type">Type 1</td>
</tr>
I would like to have "Type 1" assigned to each element parsed above. It could be a separate list of the same length. Of course, my target HTML contains many such elements within the same two-level hierarchy: type, then element (name, year, rank).
The following rather clumsy XPath concatenates the closest preceding type td with the name td matched above.
concat(//td[@class = 'name']/preceding::td[@class='type'][1]/text(), '-',
       //td[@class = 'name']/text())
This probably makes more sense when shown in the following XSL:
<xsl:for-each select="//td[@class='name']">
<Name>
<xsl:value-of select="concat(preceding::td[@class='type'][1]/text(),
'-', ./text())" />
</Name>
</xsl:for-each>
Applied to the following xml
<xml>
<tr>
<td class="type">Type 1</td>
</tr>
<tr>
<td class="name">name1</td>
<td class="year">1970</td>
<td class="rank">1</td>
</tr>
<tr>
<td class="name">name2</td>
<td class="year">1982</td>
<td class="rank">3</td>
</tr>
<tr>
<td class="type">Type 2</td>
</tr>
<tr>
<td class="name">name3</td>
<td class="year">1971</td>
<td class="rank">2</td>
</tr>
<tr>
<td class="name">name4</td>
<td class="year">1983</td>
<td class="rank">4</td>
</tr>
</xml>
With the result
<Name>Type 1-name1</Name>
<Name>Type 1-name2</Name>
<Name>Type 2-name3</Name>
<Name>Type 2-name4</Name>
Solution 1
First, find the td elements of interest. For example, find the name tds with the following pseudo-code:
name_tds = doc.evalXPath("//td[@class = 'name']")
Then you can find the corresponding type td by using each name td as the context node, like this:
type_td = name_td.evalXPath("../preceding-sibling::tr[td[@class = 'type']][1]/td")
Solution 2
Simply iterate all the tds and remember the last type you found. Pseudo-code:
foreach (td in doc.evalXPath("//td")) {
class = td.getAttribute("class");
if (class == "type") {
type = td.textContent();
}
else if (class == "name") {
name = td.textContent();
println("type: " + type + ", name: " + name);
}
// Same for year and rank.
}
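Solution 2 translates fairly directly into runnable code. The following is a minimal Python sketch using the standard library's ElementTree, assuming the well-formed XML sample shown earlier; DOC and names_with_types are names chosen for this example. It relies on iter() visiting elements in document order, so the most recently seen type td is always the one that applies.

```python
import xml.etree.ElementTree as ET

# The XML sample from above, condensed to one row per line.
DOC = """
<xml>
  <tr><td class="type">Type 1</td></tr>
  <tr><td class="name">name1</td><td class="year">1970</td><td class="rank">1</td></tr>
  <tr><td class="name">name2</td><td class="year">1982</td><td class="rank">3</td></tr>
  <tr><td class="type">Type 2</td></tr>
  <tr><td class="name">name3</td><td class="year">1971</td><td class="rank">2</td></tr>
  <tr><td class="name">name4</td><td class="year">1983</td><td class="rank">4</td></tr>
</xml>
"""


def names_with_types(root):
    """Pair every name cell with the most recent type cell before it."""
    pairs = []
    current_type = None  # stays None until the first type td is seen
    for td in root.iter("td"):  # document order
        css_class = td.get("class")
        if css_class == "type":
            current_type = td.text
        elif css_class == "name":
            pairs.append((current_type, td.text))
        # year and rank could be handled the same way as name.
    return pairs


pairs = names_with_types(ET.fromstring(DOC.strip()))
for type_text, name_text in pairs:
    print(f"type: {type_text}, name: {name_text}")
```

A single pass like this avoids evaluating a preceding-sibling expression once per cell, at the cost of keeping a little state between iterations.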