Hope to get some answers from you.
I use VB.NET and HtmlAgilityPack to fetch data, and it works, but not the way I want it to =)
I have this html page (part of):
<TABLE WITH=100% BORDER=4>
<TR>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=Lok" >Lok</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=Avg" >Avgår</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=AvgS" >Station</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=Ank" >Ankommer</A></TH>
<TH><A HREF="http:/cgi-bin/vplata.py?tgnr=4300&val=Visa+T%C3%A5gnummer&Bek=Visa&sort=AnkS" >Station</A></TH>
<TH>Tjänstetyp</TH>
</TR>
<TR>
<TD>R1176</TD>
<TD>Mar-20-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-20-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B1</TD>
</TR>
<TR>
<TD>R1267</TD>
<TD>Mar-20-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-20-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B2</TD>
</TR>
<TR>
<TD>R1267</TD>
<TD>Mar-20-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-20-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-21-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-21-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-21-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-21-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>B2</TD>
</TR>
<TR>
<TD>RXXXXX</TD>
<TD>Mar-21-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-21-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>B1\B2</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-25-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-25-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1281</TD>
<TD>Mar-25-2013 22:05:00</TD>
<TD>ET3</TD>
<TD>Mar-25-2013 22:28:00</TD>
<TD>KBÄ</TD>
<TD>D1</TD>
</TR>
<TR>
<TD>R1254</TD>
<TD>Mar-27-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-27-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B2</TD>
</TR>
<TR>
<TD>RXXXXX</TD>
<TD>Mar-27-2013 13:04:00</TD>
<TD>HBGB</TD>
<TD>Mar-27-2013 21:21:00</TD>
<TD>ET3</TD>
<TD>B1\B2</TD>
</TR>
</TABLE>
<A><A>Senast uppdaterad: Mar-20-2013 18:16:00</A><BR>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD width="20%" bgcolor="#009900" align="left">
<IMG src="http://litmgc101.greencargo.com/bottenbild.jpg" alt="Green Cargo" width=800 height=25 border=0>
</TD>
</TR>
<TR>
</table>
What I want to do is fetch the locomotive ID (for example "R1176") together with its date ("Mar-20-2013 13:04:00"). I would prefer NOT to get the time part ("13:04:00"), but I can strip that in VB.NET afterwards if it can't be skipped while parsing.
To put it simply:
Get every "R1234" value and the date that comes with it, then put the "R1234" value in, say, one textbox and the date in another.
In C# I'd do something like this:
var result =
doc.DocumentNode.SelectNodes("//td/a[contains(@href,'Lokindivid')]")
.Select(node => new KeyValuePair<string, DateTime>(node.InnerText, DateTime.Parse(node.SelectSingleNode("./ancestor::tr[1]/td[2]").InnerText).Date));
My VB.NET foo resulted in the following code (which is a literal translation) which works with the sample html you provided:
Dim doc As New HtmlDocument
doc.LoadHtml(Content.Html)
Dim items = doc.DocumentNode.SelectNodes("//td/a[contains(@href,'Lokindivid')]").Select(Function(node) New KeyValuePair(Of String, DateTime)(node.InnerText, DateTime.Parse(node.SelectSingleNode("./ancestor::tr[1]/td[2]").InnerText).Date))
For Each item As KeyValuePair(Of String, Date) In items
Console.WriteLine(item.Key)
Console.WriteLine(item.Value)
Next
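For comparison, here is a minimal Python sketch of the same idea using only the standard library. It reads plain <TD> cells, as in the sample posted in the question, rather than the td/a links the XPath above targets. The class name CellCollector and the shortened sample are made up for illustration, and parsing the "Mar" month abbreviation assumes an English (or C) locale.

```python
from datetime import datetime
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row currently being read
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR>/<TD> arrive as tr/td.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            if self._row:    # header rows only contain <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row[-1] += data.strip()


# Shortened copy of the table from the question (illustrative only).
sample = """
<TABLE>
<TR><TH>Lok</TH><TH>Avgår</TH><TH>Station</TH></TR>
<TR><TD>R1176</TD><TD>Mar-20-2013 13:04:00</TD><TD>HBGB</TD></TR>
<TR><TD>R1267</TD><TD>Mar-20-2013 22:05:00</TD><TD>ET3</TD></TR>
</TABLE>
"""

parser = CellCollector()
parser.feed(sample)

# First cell is the locomotive ID, second is the departure timestamp;
# .date() drops the time part.
pairs = [
    (row[0], datetime.strptime(row[1], "%b-%d-%Y %H:%M:%S").date())
    for row in parser.rows
]

for loco, day in pairs:
    print(loco, day)
```

Calling .date() plays the same role as the .Date property in the C# and VB.NET versions: the time is parsed and then discarded.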
Hi all, I would like to extract the value 25.8 from this HTML block using XPath.
The HTML is from a weather website, https://app.weathercloud.net/
"<div id=""gauge-rainrate""><h3>Intensidad de lluvia</h3><canvas id=""rainrate"" width=""200"" height=""200""></canvas><div class=""summary"">
<table>
<tbody><tr>
<th> mm/h</th>
<th class=""max""><i class=""icon-chevron-up icon-white""></i> Máx </th>
</tr>
<tr>
<td class=""grey"">Diaria</td>
<td><a id=""gauge-rainrate-max-day"" rel=""tooltip"" title="""" data-original-title=""22/04/2022 00:00"">0.0</a></td>
</tr>
<tr>
<td class=""grey"">Mensual</td>
<td><a id=""gauge-rainrate-max-month"" rel=""tooltip"" title=""21/04/2022 02:15"">25.8</a></td>
</tr>
<tr>
<td class=""grey"">Anual</td>
<td><a id=""gauge-rainrate-max-year"" rel=""tooltip"" title=""21/04/2022 02:15"">25.8</a></td>
</tr>
</tbody></table>
</div></div>"
I use this expression to extract it into a Google Sheets cell:
=IMPORTXML("https://app.weathercloud.net/d5044837546#current";"//a[@id='gauge-rainrate-max-month']")
apparently the code is correct but my output is always
-
I don't understand why...
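As a quick sanity check, here is a small Python sketch that runs the same XPath against a trimmed, well-formed copy of the snippet above, using only the standard library. It only shows that the expression selects the right element in the markup as quoted. It cannot tell us what IMPORTXML actually receives from the live URL, which may be different.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the block from the question, with the doubled quotes undone.
snippet = """<div id="gauge-rainrate"><h3>Intensidad de lluvia</h3>
<div class="summary"><table><tbody>
<tr><td class="grey">Diaria</td>
<td><a id="gauge-rainrate-max-day" rel="tooltip" title="">0.0</a></td></tr>
<tr><td class="grey">Mensual</td>
<td><a id="gauge-rainrate-max-month" rel="tooltip" title="21/04/2022 02:15">25.8</a></td></tr>
</tbody></table></div></div>"""

root = ET.fromstring(snippet)

# ElementTree supports the [@attribute='value'] predicate used in the formula.
link = root.find(".//a[@id='gauge-rainrate-max-month']")
value = link.text
print(value)
```

If the expression matches here but the sheet still shows "-", the difference is presumably in the HTML the spreadsheet fetches, not in the XPath itself.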
I am trying to extract some information from a table on a webpage, but the table is laid out irregularly: the headers (<th>) and their values (<td>) sit side by side within each row instead of in a separate header row, like this (my apologies for not disclosing the webpage):
<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>
At the moment, I am able to extract the table by using this script:
import pandas as pd
import requests
url = "example.com"
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
df.to_csv('myfile.csv',encoding='utf-8-sig')
And the table essentially looks like the following:
However, I am a little stuck on how to achieve this in Python. I cannot get my head around how to pull out the data. The result I want is as below:
Any help would be appreciated. Thank you so much in advance.
You can use beautifulsoup to parse the HTML. For example:
import pandas as pd
from bs4 import BeautifulSoup
txt = '''<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>'''
soup = BeautifulSoup(txt, 'html.parser')
row = {}
for h in soup.select('th:has(+td)'):
row[h.text] = h.find_next('td').get_text(strip=True)
df = pd.DataFrame([row])
print(df)
Prints:
Full name Year of birth Gender Place of birth Address
0 James Smith 1992 Male TTexas, USA Texas, USA
I want to create a table whose data is a Map<String, List<Object>>.
The table should have a single header row, and each data row should contain the following values:
Map.key
Object.item1
Object.item2
Object.item3
Since each value is a List of Object, I want one row for every Object in the List, with the Map.key repeated on each row.
So I need to iterate through the keys, like this:
<table>
<thead>
<tr>
<th>Code</th>
<th>Status</th>
<th>Flag</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<tr th:each= "result : ${myMap}">
<td th:text="${result.key}"></td>
<td><table>
<tr th:each="obj: ${result.value}">
<td th:text="${not #lists.isEmpty(obj.errorList)}?'Error':'Warning'"></td>
<td th:text="${obj.flag}==true?'YES':'NO'"></td>
<td th:text="${not #lists.isEmpty(obj.errorList)}?${obj.warningList}:${obj.errorList}"></td>
</tr>
</table></td>
</tr>
</tbody>
</table>
But this solution places a table inside a table. I want to use a single header, iterate over the lists, and place the values in the main table.
I think you're looking for a structure like this:
<table>
<thead>
<tr>
<th>Code</th>
<th>Status</th>
<th>Flag</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<th:block th:each= "result : ${myMap}">
<tr th:each="obj: ${result.value}">
<td th:text="${result.key}" />
<td th:text="${not #lists.isEmpty(obj.errorList)}?'Error':'Warning'" />
<td th:text="${obj.flag}==true?'YES':'NO'" />
<td th:text="${not #lists.isEmpty(obj.errorList)}?${obj.warningList}:${obj.errorList}" />
</tr>
</th:block>
</tbody>
</table>
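The nested th:each above amounts to flattening the map into one row per list item, with the key repeated. Here is a rough Python sketch of that logic, just to make the iteration order explicit. The Result class and the sample data are hypothetical stand-ins for the objects in the question, and the Message column is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical stand-in for the objects stored in the map's lists."""
    flag: bool
    errorList: list = field(default_factory=list)
    warningList: list = field(default_factory=list)


def flatten(my_map):
    """Return one (code, status, flag) row per object, repeating the key."""
    rows = []
    for code, results in my_map.items():   # outer th:each on the th:block
        for obj in results:                # inner th:each on the <tr>
            status = "Error" if obj.errorList else "Warning"
            flag = "YES" if obj.flag else "NO"
            rows.append((code, status, flag))
    return rows


my_map = {
    "A": [Result(True, ["bad input"]), Result(False)],
    "B": [Result(False, [], ["check value"])],
}

rows = flatten(my_map)
for row in rows:
    print(row)
```

The th:block plays the role of the outer loop: it renders no element of its own, so only the inner <tr> rows end up in the <tbody>.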
I'm having some problems getting the elements I need from a web page table. The example code from the table is:
<tr>
<td colspan="11" class="anscalls">Answered Calls</td>
</tr>
<tr class="daterow">
<td>01/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>02/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="changeditem">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 2</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>03/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>04/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="changeditem">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 2</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 3</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>05/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
I'm able to get the information between the "changeditem" class, which is what I need, but I also need the information from the "daterow" class to go along with the "changeditem" information. I'm currently using the following code:
For j = 0 To (.Document.getElementsByClassName("changeditem").Length - 1)
MsgBox .Document.getElementsByClassName("changeditem").Item((j + 0)).InnerText & Chr(44) & _
.Document.getElementsByClassName("changeditem").Item((j + 1)).InnerText
j = j + 1
Next
Which Outputs:
User1,#
User2,#
User1,#
User2,#
User3,#
I would need to loop through the entire table, which is much bigger than shown, and get the "daterow" class relevant to the "changeditem" classes, which I cannot figure out how to do.
What I'm aiming to get is:
02/01/2001,User 1,#
02/01/2001,User 2,#
04/01/2001,User 1,#
04/01/2001,User 2,#
04/01/2001,User 3,#
Thanks in advance.
Not a VBScript answer, but jQuery exists specifically for this kind of DOM manipulation and is very widely used, so I'll suggest it anyway. Using jQuery you could do something like the following. I am by no means fluent in jQuery and this will not output the exact desired output, but it illustrates the idea. You could do all of this with standard DOM methods, jQuery just makes it much easier.
<script>
$(function() {
// get a reference to all changeditem rows
var $changedItem = $('tr.changeditem');
// loop the results
$changedItem.each(function() {
// contents of first td in tr
console.log( $(this).children('td').first().text());
// the date lives in the nearest daterow *before* this row, so search
// the preceding siblings (prevAll returns the closest one first)
var $dateRow = $(this).prevAll('tr.daterow').first();
// contents of first td in tr
console.log( $dateRow.children('td').first().text());
});
});
</script>
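The underlying idea is language-independent: walk the rows in document order, remember the most recent "daterow", and attach it to every "changeditem" row that follows. Here is a minimal Python sketch of that approach using only the standard library. The class name DatedRows and the shortened sample table are made up for illustration.

```python
from html.parser import HTMLParser


class DatedRows(HTMLParser):
    """Pair each 'changeditem' row with the closest preceding 'daterow'."""

    def __init__(self):
        super().__init__()
        self.lines = []          # output lines: date,cell,cell,...
        self._date = None        # text of the most recent daterow
        self._row_class = None
        self._cells = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_class = dict(attrs).get("class")
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_td and self._cells is not None:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            if self._row_class == "daterow" and self._cells:
                self._date = self._cells[0]      # remember the current date
            elif self._row_class == "changeditem":
                self.lines.append(",".join([self._date or ""] + self._cells))
            self._cells = None
            self._in_td = False


# Shortened version of the table from the question (illustrative only).
sample = """
<table>
<tr class="daterow"><td>01/01/2001</td><td colspan="10"> </td></tr>
<tr class="item"><td>User 1</td><td>#</td></tr>
<tr class="daterow"><td>02/01/2001</td><td colspan="10"> </td></tr>
<tr class="changeditem"><td>User 1</td><td>#</td></tr>
<tr class="changeditem"><td>User 2</td><td>#</td></tr>
</table>
"""

parser = DatedRows()
parser.feed(sample)
for line in parser.lines:
    print(line)
```

The same single pass should translate to a VBScript loop over all <tr> elements that checks each row's className, instead of calling getElementsByClassName("changeditem") and losing the row order relative to the dates.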
I have a problem getting the values out of an HTML table because it doesn't have any ids. I need to get all the values in the second column and store them in an array. I am using HtmlAgilityPack, and my problem comes when selecting the nodes:
Dim doc As HtmlDocument
Dim web As New HtmlWeb()
Dim str As String
doc = Web.Load("http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#")
Dim nodes_filas As HtmlNode() = doc.DocumentNode.SelectNodes("//table[@id='']//tr").ToArray
Dim nodes_columnas As HtmlNode() = doc.DocumentNode.SelectNodes("//td").ToArray
For Each row As HtmlNode In nodes_filas
For Each column As HtmlNode In nodes_columnas
str = column.InnerHtml & vbCrLf
Next
Next
This is the table:
<table cellspacing="1" cellpadding="3" width="100%" border="0">
<tr>
<td colspan="2" style="font-size:13px;color:#55711C;padding-bottom:5px;">Aporte por ración</td>
</tr>
<tr style="background-color:#EBEBEB">
<td width="125">Energía [Kcal]</td>
<td class="td_right">145,00</td>
</tr>
<tr>
<td>Proteína [g]</td>
<td class="td_right">22,20</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Hidratos carbono [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr>
<td>Fibra [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Grasa total [g]</td>
<td class="td_right">6,20</td>
</tr>
<tr>
<td>AGS [g]</td>
<td class="td_right">1,91</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGM [g]</td>
<td class="td_right">1,92</td>
</tr>
<tr>
<td>AGP [g]</td>
<td class="td_right">1,52</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGP /AGS</td>
<td class="td_right">0,79</td>
</tr>
<tr>
<td>(AGP + AGM) / AGS</td>
<td class="td_right"> 1,80</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Colesterol [mg]</td>
<td class="td_right">62,00</td>
</tr>
<tr>
<td>Alcohol [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Agua [g]</td>
<td class="td_right">71,60</td>
</tr>
</table>
Sorry I don't have VB installed but C# version should be enough to give you an idea. You have td_right class, you can use either lambda or xpath to query it.
I like lambda/linq version more because I am familiar with linq, and I don't need to remember XPATH syntax.
Lambda:
public static bool HasClass(this HtmlNode node, params string[] classValueArray)
{
var classValue = node.GetAttributeValue("class", "");
var classValues = classValue.Split(' ');
return classValueArray.All(c => classValues.Contains(c));
}
var url = "http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var nodes = htmlDoc.DocumentNode.Descendants("td").Where(_ => _.HasClass("td_right")).Select(_ => _.InnerText);
XPATH:
var nodes2 = htmlDoc.DocumentNode.SelectNodes("//td[@class='td_right']");
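To show the class-based selection in isolation, here is a minimal Python sketch using only the standard library. It collects the text of every <td> whose class list contains td_right and then converts the decimal-comma strings to numbers. The class name RightCells and the three-row sample are made up for illustration. The numeric conversion goes beyond what the question asked for.

```python
from html.parser import HTMLParser


class RightCells(HTMLParser):
    """Collect the text of every <td> whose class list contains 'td_right'."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        # Split the class attribute so multi-class cells would match too.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "td_right" in classes:
            self._capture = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._capture = False

    def handle_data(self, data):
        if self._capture:
            self.values[-1] += data.strip()


# Three rows taken from the table in the question (illustrative only).
sample = """
<table>
<tr><td width="125">Energía [Kcal]</td><td class="td_right">145,00</td></tr>
<tr><td>Proteína [g]</td><td class="td_right">22,20</td></tr>
<tr><td>(AGP + AGM) / AGS</td><td class="td_right"> 1,80</td></tr>
</table>
"""

parser = RightCells()
parser.feed(sample)

# The page uses a decimal comma, so swap it for a dot before converting.
numbers = [float(v.replace(",", ".")) for v in parser.values]
print(parser.values)
print(numbers)
```

Splitting the class attribute mirrors the HasClass helper above, whereas the XPath form @class='td_right' only matches when the attribute is exactly that string.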