How to parse tables without id on HTML using HtmlAgilityPack - html

I'm having trouble getting the values out of an HTML table because it doesn't have an id. I need to get all the values in the second column and store them in an array. I am using HtmlAgilityPack, and my problem comes when selecting the nodes:
Dim doc As HtmlDocument
Dim web As New HtmlWeb()
Dim str As String
doc = Web.Load("http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#")
Dim nodes_filas As HtmlNode() = doc.DocumentNode.SelectNodes("//table[@id='']//tr").ToArray
Dim nodes_columnas As HtmlNode() = doc.DocumentNode.SelectNodes("//td").ToArray
For Each row As HtmlNode In nodes_filas
    For Each column As HtmlNode In nodes_columnas
        str = column.InnerHtml & vbCrLf
    Next
Next
This is the table:
<table cellspacing="1" cellpadding="3" width="100%" border="0">
<tr>
<td colspan="2" style="font-size:13px;color:#55711C;padding-bottom:5px;">Aporte por ración</td>
</tr>
<tr style="background-color:#EBEBEB">
<td width="125">Energía [Kcal]</td>
<td class="td_right">145,00</td>
</tr>
<tr>
<td>Proteína [g]</td>
<td class="td_right">22,20</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Hidratos carbono [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr>
<td>Fibra [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Grasa total [g]</td>
<td class="td_right">6,20</td>
</tr>
<tr>
<td>AGS [g]</td>
<td class="td_right">1,91</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGM [g]</td>
<td class="td_right">1,92</td>
</tr>
<tr>
<td>AGP [g]</td>
<td class="td_right">1,52</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGP /AGS</td>
<td class="td_right">0,79</td>
</tr>
<tr>
<td>(AGP + AGM) / AGS</td>
<td class="td_right"> 1,80</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Colesterol [mg]</td>
<td class="td_right">62,00</td>
</tr>
<tr>
<td>Alcohol [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Agua [g]</td>
<td class="td_right">71,60</td>
</tr>
</table>

Sorry, I don't have VB installed, but the C# version should be enough to give you the idea. The cells you want have the td_right class, so you can query them with either a LINQ lambda or XPath.
I prefer the lambda/LINQ version because I'm familiar with LINQ and don't need to remember XPath syntax.
Lambda:
using System.Linq;
using HtmlAgilityPack;
// Extension methods must be declared in a static class.
public static class HtmlNodeExtensions
{
    // True when the node's class attribute contains every requested class name.
    public static bool HasClass(this HtmlNode node, params string[] classValueArray)
    {
        var classValue = node.GetAttributeValue("class", "");
        var classValues = classValue.Split(' ');
        return classValueArray.All(c => classValues.Contains(c));
    }
}
// Inside a method:
var url = "http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var nodes = htmlDoc.DocumentNode.Descendants("td").Where(_ => _.HasClass("td_right")).Select(_ => _.InnerText);
XPath:
var nodes2 = htmlDoc.DocumentNode.SelectNodes("//td[@class='td_right']");
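Since the question was asked in VB.NET, here is a minimal sketch of the XPath variant in that language. The names SecondColumnDemo and GetSecondColumn are made up for illustration, and the HTML is a shortened copy of the table from the question, not the live page. SelectNodes returns Nothing when no node matches, which is why the result is checked before use.

```vb
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module SecondColumnDemo
    ' Returns the trimmed text of every <td class="td_right"> cell as an array.
    ' (Hypothetical helper; not part of HtmlAgilityPack.)
    Function GetSecondColumn(html As String) As String()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim cells = doc.DocumentNode.SelectNodes("//td[@class='td_right']")
        ' SelectNodes yields Nothing, not an empty collection, when nothing matches.
        If cells Is Nothing Then Return New String() {}
        Return cells.Select(Function(td) HtmlEntity.DeEntitize(td.InnerText).Trim()).ToArray()
    End Function

    Sub Main()
        ' Shortened sample of the table shown in the question.
        Dim html = "<table><tr><td>Proteína [g]</td><td class=""td_right"">22,20</td></tr>" &
                   "<tr><td>Fibra [g]</td><td class=""td_right""> 0,00</td></tr></table>"
        For Each value In GetSecondColumn(html)
            Console.WriteLine(value)
        Next
    End Sub
End Module
```

To run it against the real page, the string passed to GetSecondColumn could be replaced by a document loaded with HtmlWeb.Load, as in the C# code above.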

Related

Get Elements from IE Table via VBScript

I'm having some problems getting the elements I need from a web page table. The example code from the table is:
<tr>
<td colspan="11" class="anscalls">Answered Calls</td>
</tr>
<tr class="daterow">
<td>01/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>02/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="changeditem">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 2</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>03/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>04/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="changeditem">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 2</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 3</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>05/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
I'm able to get the information from the rows with the "changeditem" class, which is what I need, but I also need the information from the "daterow" row that goes along with each "changeditem" row. I'm currently using the following code:
For j = 0 To (.Document.getElementsByClassName("changeditem").Length - 1)
    MsgBox .Document.getElementsByClassName("changeditem").Item((j + 0)).InnerText & Chr(44) & _
        .Document.getElementsByClassName("changeditem").Item((j + 1)).InnerText
    j = j + 1
Next
Which outputs:
User1,#
User2,#
User1,#
User2,#
User3,#
I would need to loop through the entire table, which is much bigger than shown, and get the "daterow" class relevant to the "changeditem" classes, which I cannot figure out how to do.
What I'm aiming to get is:
02/01/2001,User 1,#
02/01/2001,User 2,#
04/01/2001,User 1,#
04/01/2001,User 2,#
04/01/2001,User 3,#
Thanks in advance.
Not a VBScript answer, but jQuery exists specifically for this kind of DOM manipulation and is very widely used, so I'll suggest it anyway. Using jQuery you could do something like the following. I am by no means fluent in jQuery and this will not output the exact desired output, but it illustrates the idea. You could do all of this with standard DOM methods; jQuery just makes it much easier.
<script>
$(function() {
    // get a reference to all changeditem rows
    var $changedItem = $('tr.changeditem');
    // loop the results
    $changedItem.each(function() {
        // contents of first td in tr
        console.log($(this).children('td').first().text());
        // the date row comes before its changeditem rows,
        // so take the closest preceding sibling tr.daterow
        var $dateRow = $(this).prevAll('tr.daterow').first();
        // contents of first td in the date row
        console.log($dateRow.children('td').first().text());
    });
});
</script>

Get content from table by webbrowser?

I have the following table.
<table class="table1">
<tbody>
<tr>
<th></th>
<th>SEQ</th>
<th>LOGIN</th>
<th>WHATSAPP</th>
<th>E-MAIL</th>
</tr>
<tr>
<td>1</td>
<td></td>
<td>name</td>
<td>99 999999999</td>
<td>xxxxxxx@hotmail.com</td>
</tr>
</tbody>
</table>
I would like to know how to get the content of each TR so that I can write it to an Access database.
So far I have only managed to get to this code:
Dim PageElement As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("table")
For Each CurElement As HtmlElement In PageElement
    If (CurElement.GetAttribute("className") = "table1") Then
        TextBox1.Text = CurElement.InnerHtml
    End If
Next
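One possible way to go from the whole table to its individual rows is sketched below, using only the WinForms HtmlElement API that the code above already relies on. The names TableReader and ReadTableRows are hypothetical, and the sketch assumes the page has finished loading (for example, that it is called from the DocumentCompleted handler). Writing the values to Access is left out.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Windows.Forms

Module TableReader
    ' Collects the cell text of every data row inside <table class="table1">.
    ' (Hypothetical helper; assumes the document is fully loaded.)
    Function ReadTableRows(document As HtmlDocument) As List(Of String())
        Dim result As New List(Of String())
        For Each table As HtmlElement In document.GetElementsByTagName("table")
            If table.GetAttribute("className") <> "table1" Then Continue For
            For Each row As HtmlElement In table.GetElementsByTagName("tr")
                Dim cells = row.GetElementsByTagName("td")
                ' The header row only has <th> cells, so it is skipped here.
                If cells.Count = 0 Then Continue For
                Dim values(cells.Count - 1) As String
                For i As Integer = 0 To cells.Count - 1
                    ' InnerText may be Nothing for an empty cell.
                    values(i) = If(cells(i).InnerText, String.Empty).Trim()
                Next
                result.Add(values)
            Next
        Next
        Return result
    End Function
End Module
```

A call such as ReadTableRows(WebBrowser1.Document) would then give one string array per row, which can be passed as parameters to an INSERT command.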

Ignoring tags in XPATH using html agility pack

I am using the following code to parse html tables from an html file into a dataset:
Public Function GetDataSet(html As String) As DataSet
    Dim ds As DataSet = New DataSet
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(html)
    Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
        .GroupBy(Function(x) x.ParentNode)
    For i As Integer = 0 To tables.Count - 1
        Dim rows = tables(i).ToList()
        ds.Tables.Add(String.Format("Table {0}", i))
        Dim headers = rows(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList()
        For Each Hr In headers
            ds.Tables(i).Columns.Add(Hr)
        Next
        For j As Integer = 1 To rows.Count - 1
            Dim row = rows(j)
            Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()
            ds.Tables(i).Rows.Add(dr)
        Next
    Next
    Return ds
End Function
and it works fine. But when another tag is placed inside the <table> tag before the <tr> tag, the table is not parsed.
Simple example:
<html>
<head><title>Test</title></head>
<body>
<div>Contents:</div>
<table>
<tr>
<th>Column1</th> <th>Column2</th>
</tr>
<tr>
<td>1</td> <td>11</td>
</tr>
<tr>
<td>2</td> <td>22</td>
</tr>
</table>
<table>
<tbody>
<tr>
<th>Column1</th> <th>Column2</th> <th>Column3</th>
</tr>
<tr>
<td>a</td> <td>aa</td> <td>aaa</td>
</tr>
<tr>
<td>b</td> <td>bb</td> <td>bbb</td>
</tr>
</tbody>
</table>
<table>
<div>
<tr>
<th>Column1</th> <th>Column2</th> <th>Column3</th>
</tr>
<tr>
<td>a</td> <td>aa</td> <td>aaa</td>
</tr>
<tr>
<td>b</td> <td>bb</td> <td>bbb</td>
</tr>
</div>
</table>
</body>
</html>
In this example only the first table is parsed.
My question is how to ignore any tag between the <table> tag and the <tr> tag in the following line of code:
Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
    .GroupBy(Function(x) x.ParentNode)
so that all the tables are parsed.
You can use // to select from all descendants instead of only direct children:
Dim rows = htmldoc.DocumentNode.SelectNodes("//table//tr")
Also, given your requirement, it is better to group the results by the nearest ancestor table rather than by the parent node, because the parent of a tr may be a tbody or thead and you want the rows grouped per table:
Dim tables = htmldoc.DocumentNode.SelectNodes("//table//tr") _
    .GroupBy(Function(x) x.Ancestors("table").First())
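As a small self-contained check of the grouping, the sketch below counts the rows found per table for one table with direct tr children and one that wraps them in a tbody. The names GroupRowsDemo and CountRowsPerTable are made up for this example, and the HTML is a cut-down version of the sample in the question.

```vb
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module GroupRowsDemo
    ' Returns the number of <tr> rows found in each table, in document order.
    ' (Hypothetical helper for illustration only.)
    Function CountRowsPerTable(html As String) As Integer()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' "//table//tr" matches rows at any depth below a table.
        Dim rows = doc.DocumentNode.SelectNodes("//table//tr")
        If rows Is Nothing Then Return New Integer() {}
        ' Group by the nearest enclosing table, not by the direct parent.
        Return rows.GroupBy(Function(tr) tr.Ancestors("table").First()) _
                   .Select(Function(g) g.Count()) _
                   .ToArray()
    End Function

    Sub Main()
        Dim html = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>" &
                   "<table><tbody><tr><th>B</th></tr></tbody></table>"
        For Each count In CountRowsPerTable(html)
            Console.WriteLine(count)
        Next
    End Sub
End Module
```

Note that with nested tables a row is assigned to its innermost table, since Ancestors walks outward from the row.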

How do I write my html table to excel file using epplus

What I'm trying to do here is write a simple HTML table to an xlsx (Excel) file using EPPlus. The code I've got so far is
controller:
public void saveToExcel(string tbl)
{
    using (ExcelPackage p = new ExcelPackage())
    {
        p.Workbook.Worksheets.Add("demo");
        ExcelWorksheet ws = p.Workbook.Worksheets[1];
        ws.Name = "Demo";
        Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
        Response.AddHeader("content-disposition", "attachment; filename=ExcelDemo.xlsx");
        Response.BinaryWrite(p.GetAsByteArray());
    }
}
This creates an empty workbook. All I want to do right now is write this table, which I have in my
View:
<Table id="tbl" name="tbl">
<tr>
<td>
Title 1
</td>
<td >
Title 1
</td>
<td>
Title 1
</td>
</tr>
<tr>
<td >
Row 1
</td>
<td>
Row 1
</td>
<td>
Row 1
</td>
</tr>
<tr>
<td >
Row 2
</td>
<td>
Row 2
</td>
<td>
Row 2
</td>
</tr>
<tr>
<td >
Row 2
</td>
<td>
Row 2
</td>
<td>
Row 2
</td>
</tr>
</table>
@Html.ActionLink("saveToExcel", "saveToExcel")
to the workbook. But I just don't know how or where to start.
Thankful for any pointers in the right direction.
My suggestion:
First, convert your HTML table to a .NET DataTable.
How to do that can be found here: Convert Table
Then use this code (assuming the DataTable you created is called 'data'):
Dim attachment As String = "attachment; filename=MyExcelPage.xlsx"
Dim epackage As ExcelPackage = New ExcelPackage
Dim excel As ExcelWorksheet = epackage.Workbook.Worksheets.Add("ExcelTabName")
excel.Cells("A1").LoadFromDataTable(data, True)
HttpContext.Current.Response.Clear()
HttpContext.Current.Response.ClearHeaders()
HttpContext.Current.Response.ClearContent()
HttpContext.Current.Response.AddHeader("content-disposition", attachment)
HttpContext.Current.Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
HttpContext.Current.Response.BinaryWrite(epackage.GetAsByteArray())
HttpContext.Current.Response.End()
epackage.Dispose()
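For the conversion step, one possible approach is sketched below with HtmlAgilityPack. This is an assumption about how the DataTable could be built, not the code behind the link. The names TableConverter, CellTexts and HtmlTableToDataTable are hypothetical, the first row is treated as the header, and how the table HTML reaches the controller is left open.

```vb
Imports System
Imports System.Data
Imports System.Linq
Imports HtmlAgilityPack

Module TableConverter
    ' Trimmed text of the <td>/<th> children of a row, in document order.
    Private Function CellTexts(row As HtmlNode) As String()
        Return row.ChildNodes _
                  .Where(Function(c) c.Name = "td" OrElse c.Name = "th") _
                  .Select(Function(c) HtmlEntity.DeEntitize(c.InnerText).Trim()) _
                  .ToArray()
    End Function

    ' Builds a DataTable from the first <table> in the HTML.
    ' The first row supplies the column names; later rows become data rows.
    Function HtmlTableToDataTable(html As String) As DataTable
        Dim data As New DataTable()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim rows = doc.DocumentNode.SelectNodes("(//table)[1]//tr")
        If rows Is Nothing Then Return data

        For Each header In CellTexts(rows(0))
            ' DataTable rejects duplicate column names, so add a numeric suffix.
            Dim name = header
            Dim n = 1
            While name = "" OrElse data.Columns.Contains(name)
                n += 1
                name = header & "_" & n.ToString()
            End While
            data.Columns.Add(name)
        Next

        For i As Integer = 1 To rows.Count - 1
            Dim values = CellTexts(rows(i))
            ' Skip rows that have more cells than there are columns.
            If values.Length > data.Columns.Count Then Continue For
            data.Rows.Add(values)
        Next
        Return data
    End Function

    Sub Main()
        Dim html = "<table><tr><td>Title 1</td><td>Title 1</td></tr>" &
                   "<tr><td>Row 1</td><td>Row 1</td></tr></table>"
        Dim data = HtmlTableToDataTable(html)
        Console.WriteLine(data.Columns.Count & " columns, " & data.Rows.Count & " rows")
    End Sub
End Module
```

The suffix handling matters for the table in the question, where all three header cells read "Title 1".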

Parsing htmlagilitypack (table without id's) vb.net

Hope to get some answers from you.
I use vb.net and htmlagilitypack to fetch data and it works, but not the way I want it to =)
I have this html page (part of):
<TABLE WITH=100% BORDER=4>
<TR>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=Lok" >Lok</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=Avg" >Avgår</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=AvgS" >Station</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=Ank" >Ankommer</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=AnkS" >Station</A></TH>
<TH>Tjänstetyp</TH>
</TR>
<TR>
<TD>R1176</TD>
<TD>Mar-20-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-20-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B1</TD>
</TR>
<TR>
<TD>R1267</TD>
<TD>Mar-20-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-20-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B2</TD>
</TR>
<TR>
<TD>R1267</TD>
<TD>Mar-20-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-20-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-21-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-21-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-21-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-21-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>B2</TD>
</TR>
<TR>
<TD>RXXXXX</TD>
<TD>Mar-21-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-21-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>B1\B2</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-25-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-25-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-25-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-25-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1254</TD>
<TD>Mar-27-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-27-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B2</TD>
</TR>
<TR>
<TD>RXXXXX</TD>
<TD>Mar-27-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-27-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B1\B2</TD>
</TR>
</TABLE>
<A><A>Senast uppdaterad: Mar-20-2013 18:16:00</A><BR>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD width="20%" bgcolor="#009900" align="left">
<IMG src="http://litmgc101.greencargo.com/bottenbild.jpg" alt="Green Cargo" width=800 height=25 border=0>
</TD>
</TR>
<TR>
</table>
What I want to do is fetch the cells containing, for example, "R1176" and the date "Mar-20-2013 13:04:00". I would prefer not to have the time "13:04:00", but I can remove that in VB.NET later if I can't skip it in the parsing phase.
To put it simply, what I want to do is the following:
Get every "R1234" value and the date that comes with it, then put them in, let's say, one textbox for the "R4321" values and another textbox for the dates.
In C# I'd do something like this:
var result =
    doc.DocumentNode.SelectNodes("//td/a[contains(@href,'Lokindivid')]")
        .Select(node => new KeyValuePair<string, DateTime>(node.InnerText, DateTime.Parse(node.SelectSingleNode("./ancestor::tr[1]/td[2]").InnerText).Date));
My VB.NET-fu resulted in the following code (a literal translation), which works with the sample HTML you provided:
Dim doc As New HtmlDocument
doc.LoadHtml(Content.Html)
Dim items = doc.DocumentNode.SelectNodes("//td/a[contains(@href,'Lokindivid')]").Select(Function(node) New KeyValuePair(Of String, DateTime)(node.InnerText, DateTime.Parse(node.SelectSingleNode("./ancestor::tr[1]/td[2]").InnerText).Date))
For Each item As KeyValuePair(Of String, Date) In items
    Console.WriteLine(item.Key)
    Console.WriteLine(item.Value)
Next