I would like to get various prices from a financial site and store them in an Excel file.
I don't have much experience, and I would like to know whether the code I wrote to get data from a web site is reasonable or could be done better.
The web site's HTML is shown below. I would like to get the last td at the end, just after 'Prezzo di chiusura', which contains 103,74.
I have these questions:
I used getElementsByTagName("td")(39); I simply counted the td elements in the page. Is there a better way to address that td?
I noticed that sometimes I get the previous price and not the one I see on the web page. Does my code read a different copy of the data than the one shown on the page, so that I only see the previous value until that copy is updated?
The HTML code is:
<div class="instruments_company_summary">
<table class="table-noborders">
<tr>
<td class="table_label"> </td>
<td>
<div class="floatdx" style="padding-bottom:10px">
<div class="floatsx">
<div class="standard-button">
Grafico </div>
</div>
<div class="floatsx">
<div class="standard-button">
Scheda </div>
</div>
<div class="floatsx">
<div class="standard-button">
Scarica book </div>
</div>
</div>
</td>
</tr>
<tr>
<td class="table_label">Isin</td>
<td>
<div class="floatsx" style="padding-top:4px;">IT0004785355</div>
</td>
</tr>
<tr>
<td class="table_label">Descrizione</td>
<td>Bpvi 7% 29dc16</td>
</tr>
<tr>
<td class="table_label">Prezzi aggiornati al</td>
<td>09-11-2015 21:28:48</td>
</tr>
</table>
<table>
<tr>
<th colspan="2">Book di negoziazione</th>
</tr>
<tr>
<td class="table_label">Var</td>
<td>0,05%</td>
</tr>
<tr>
<td class="table_label" style="border:0">Book a 5 livelli</td>
<td style="border:0; padding: 10px 0 5px">
<table>
<thead>
<tr>
<th>Q.tà Acquisto</th>
<th>Prezzo Acquisto</th>
<th>Prezzo Vendita</th>
<th>Q.tà Vendita</th>
</tr>
</thead>
<tbody>
<tr>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>
</td>
</tr>
</table>
<table>
<tr>
<th colspan="2">Dati ultimo contratto</th>
</tr>
<tr>
<td class="table_label">Prezzo</td>
<td>103,93</td>
</tr>
<tr>
<td class="table_label">Quantità</td>
<td>5.000</td>
</tr>
<tr>
<td class="table_label">Data e ora</td>
<td>09-11-2015 16:59:33</td>
</tr>
</table>
<table>
<tr>
<th colspan="2">Dati giornalieri</th>
</tr>
<tr>
<td class="table_label">Prezzo di chiusura</td>
<td>103,74</td>
</tr>
The Excel VBA code is this one:
Dim W As Worksheet: Set W = ActiveSheet
Dim Objie As Object
Dim xObj
Set Objie = CreateObject("InternetExplorer.Application")
Objie.Visible = False
Objie.Navigate "http://www.eurotlx.com/it/strumenti/dettaglio/IT0004785355"
While (Objie.Busy Or Objie.ReadyState <> 4)
DoEvents
Wend
Set xObj = Objie.Document.getElementsByTagName("td")(39)
W.Range("I3") = xObj.innerText
Set xObj = Nothing
Objie.Quit
Set Objie = Nothing
This would be better (it insulates you from changes in the number of rows in earlier tables):
Dim tbl, xObj
Set tbl = Objie.Document.getElementsByTagName("table")(4) '5th table in the HTML shown (zero-based index; the nested book table counts too)
Set xObj = tbl.getElementsByTagName("td")(1) 'second td in that table (zero-based index)
You can also search for the label of the data you want ("Prezzo di chiusura") and read the value next to it using the nextElementSibling property. This way, the value obtained does not depend on the position of the cell within the table structure that eurotlx.com delivers.
Sub Scrape()
    Dim ie As Object
    Dim TDelement As Object
    'Clear IE's cached data so a stale copy of the page is not reused
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 255"
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate "http://www.eurotlx.com/it/strumenti/dettaglio/IT0004785355"
    While (ie.Busy Or ie.ReadyState <> 4)
        DoEvents
    Wend
    'Find the label cell, then read the cell that follows it
    For Each TDelement In ie.document.getElementsByClassName("table_label")
        If InStr(TDelement.innerText, "Prezzo di chiusura") > 0 Then
            Range("I3") = TDelement.nextElementSibling.innerText
            Exit For
        End If
    Next
    ie.Quit
End Sub
Of course, the weakness of this method is that if the site administrator changes the text "Prezzo di chiusura", the code will no longer be able to find the value.
CSS selector:
You can use a CSS selector of: table:last-child .table_label ~ td
It matches td elements that follow, as siblings, an element with class table_label inside a table that is the last child of its parent; querySelector returns the first such match.
VBA:
You apply the CSS selector with the querySelector method of document.
Debug.Print Objie.Document.querySelector("table:last-child .table_label ~ td").innerText
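If the selector matches nothing, calling .innerText directly on the result raises a run-time error, so a small guard can help. The following is only a sketch: TextBySelector is a hypothetical helper name (not part of any library), and it assumes the page is rendered in a document mode where querySelector is available (IE8 standards mode or later).

```vba
' Hypothetical helper: returns the text of the first element matching
' the CSS selector, or an empty string when nothing matches.
Function TextBySelector(ByVal doc As Object, ByVal selector As String) As String
    Dim el As Object
    Set el = doc.querySelector(selector)
    If Not el Is Nothing Then
        TextBySelector = Trim$(el.innerText)
    End If
End Function
```

It could be called as W.Range("I3") = TextBySelector(Objie.Document, "table:last-child .table_label ~ td"), leaving the cell empty rather than stopping with an error when the page layout changes.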
Related
I'm having some problems getting the elements I need from a web page table. The example code from the table is:
<tr>
<td colspan="11" class="anscalls">Answered Calls</td>
</tr>
<tr class="daterow">
<td>01/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>02/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="changeditem">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 2</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>03/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>04/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="changeditem">
<td>User 1</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 2</td>
<td>#</td>
</tr>
<tr class="changeditem">
<td>User 3</td>
<td>#</td>
</tr>
<tr class="daterow">
<td>05/01/2001</td>
<td colspan="10"> </td>
</tr>
<tr class="item">
<td>User 1</td>
<td>#</td>
</tr>
I'm able to get the information from the rows with the "changeditem" class, which is what I need, but I also need the date from the preceding "daterow" row to go along with each "changeditem" row. I'm currently using the following code:
For j = 0 To (.Document.getElementsByClassName("changeditem").Length - 1)
MsgBox .Document.getElementsByClassName("changeditem").Item((j + 0)).InnerText & Chr(44) & _
.Document.getElementsByClassName("changeditem").Item((j + 1)).InnerText
j = j + 1
Next
Which Outputs:
User1,#
User2,#
User1,#
User2,#
User3,#
I would need to loop through the entire table, which is much bigger than shown, and get the "daterow" class relevant to the "changeditem" classes, which I cannot figure out how to do.
What I'm aiming to get is:
02/01/2001,User 1,#
02/01/2001,User 2,#
04/01/2001,User 1,#
04/01/2001,User 2,#
04/01/2001,User 3,#
Thanks in advance.
Not a VBScript answer, but jQuery exists specifically for this kind of DOM manipulation and is very widely used, so I'll suggest it anyway. Using jQuery you could do something like the following. I am by no means fluent in jQuery and this will not produce the exact desired output, but it illustrates the idea. You could do all of this with standard DOM methods; jQuery just makes it much easier.
<script>
$(function() {
// get a reference to all changeditem rows
var $changedItem = $('tr.changeditem');
// loop the results
$changedItem.each(function() {
// contents of first td in tr
console.log( $(this).children('td').first().text());
// the date lives in the nearest preceding tr.daterow sibling
var $dateRow = $(this).prevAll('tr.daterow').first();
// contents of first td in that row
console.log( $dateRow.children('td').first().text());
});
});
</script>
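The same idea can be sketched in VBA, the language of the question: walk every tr in document order, remember the date from the most recent "daterow" row, and emit a line for each "changeditem" row. This is a minimal sketch under stated assumptions: PrintChangedItems is a hypothetical name, doc is the already loaded document (for example .Document inside the existing With block), each row carries exactly one class name, and each "changeditem" row has at least two cells.

```vba
' Hypothetical sketch: pairs each "changeditem" row with the date of
' the nearest preceding "daterow" row.
Sub PrintChangedItems(ByVal doc As Object)
    Dim rw As Object
    Dim currentDate As String

    For Each rw In doc.getElementsByTagName("tr")
        Select Case rw.className
            Case "daterow"
                ' The first cell of a date row holds the date
                currentDate = Trim$(rw.cells(0).innerText)
            Case "changeditem"
                ' Prefix the row's first two cells with the remembered date
                Debug.Print currentDate & "," & _
                            Trim$(rw.cells(0).innerText) & "," & _
                            Trim$(rw.cells(1).innerText)
        End Select
    Next rw
End Sub
```

Because the rows are visited once in document order, no index arithmetic on the "changeditem" collection is needed, and rows with the "item" class are simply skipped.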
I am working on an Excel VBA project to scrape some specific information from a website.
What I am looking to do is extract text based on two criteria: Name and post date. For example, I have the name Kaelan and the post date of 11/16/2016. I want to extract the amount of $365.
This is the HTML code:
<div class="familyLedgerAmountCategory" id="id_4541278">
<table>
<tr>
<td class="tdCategoryRow">
<div class="cmFloatLeft divExpandToggle expanded" id="divCategoryToggle_id_4541278"></div>
<div class="cmFloatLeft" id="divCategoryLabel_id_4541278" style="width: 430px;">
Kaelan
</div><span style="margin-left: 5px;">$ 465.00</span>
</td>
</tr>
<tbody>
<tr class="trListTableBody LedgerExisting" id="CamperFamilyLedgerRowControl_14816465">
<td class="tdCamperFamilyLedgerTableColumnDescription tdBorderTop" id="tdCamperFamilyLedgerTableColumnDescription_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnDescriptionCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
<a class="aColumnDescriptionCell" id="aColumnDescriptionCell_CamperFamilyLedgerRowControl_14816465" name="aColumnDescriptionCell_CamperFamilyLedgerRowControl_14816465" target="_self" title="Click to view details">2017 Super Early Bird Teen Camp - Tuition</a>
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnPostDate tdBorderTop" id="tdCamperFamilyLedgerTableColumnPostDate_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnPostDateCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
11/16/2016
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnEffective tdBorderTop" id="tdCamperFamilyLedgerTableColumnEffective_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnEffectiveCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
11/15/2016
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnQty tdBorderTop" id="tdCamperFamilyLedgerTableColumnQty_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnQtyCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
1
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnAmount tdBorderTop" id="tdCamperFamilyLedgerTableColumnAmount_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnAmountCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
$ 365.00
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnAction tdBorderTop" id="tdCamperFamilyLedgerTableColumnAction_CamperFamilyLedgerRowControl_14816465"></td>
</tr>
</tbody>
</table>
</div>
My attempt to pull the amount is as follows:
Sub Test()
Dim ie As Object
Dim oElement As Object
Dim wsTarget As Worksheet
Dim i As Integer
Dim NewWB As Workbook
Set NewWB = ActiveWorkbook
Set wsTarget = NewWB.Sheets(1)
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate website...
Wait 6
ie.document.All.Item("txtUserName").Value = "User"
ie.document.All.Item("pswdPassword").Value = "Pass"
Wait 1
ie.document.getElementById("btnLogin").Click
Wait 5
ie.navigate website...
i = 1
For Each oElement In ie.document.getElementsByClassName("cmFloatLeft")
If oElement.innerText = "Kaelan" Then
extract1 = oElement.getElementsByClassName("divListTableBodyLabel").innerText
MsgBox extract1
Else
End If
Next
However, I get an error when running the code above. Can I find the cmFloatLeft element I am looking for and then immediately look up the divListTableBodyLabel class, even though that class is not nested directly below the cmFloatLeft element?
Sorry, I'm still pretty new to scraping web data.
Thanks
That structure is a bit difficult to scrape. You could try going "up" from the "Kaelan" node to the parent table, and then looping over that table to extract the various pieces of information. If the post structures are consistent, that could provide one approach.
Set doc = IE.document
Set els = doc.getElementsByClassName("cmFloatLeft")
i = 1
For Each oElement In els
Debug.Print oElement.innerText
If Trim(oElement.innerText) = "Kaelan" Then
Set tbl = GetParent(oElement, "table") '<< find the parent table
If Not tbl Is Nothing Then
'loop over the parent table
For Each rw In tbl.Rows
For Each cl In rw.Cells
Debug.Print cl.innerText
Next cl
Next rw
End If
End If
Next
Function to find a named parent (by tag name):
Function GetParent(el, tagParent)
Dim rv As Object
Set rv = el
Do While Not rv.parentElement Is Nothing
Set rv = rv.parentElement
If UCase(rv.tagName) = UCase(tagParent) Then
Set GetParent = rv
Exit Function
End If
Loop
Set GetParent = Nothing
End Function
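Once the parent table is found, the amount for a given post date could be picked out by class name instead of printing every cell. This is only a sketch under stated assumptions: FindAmount is a hypothetical helper, the class names are taken from the HTML in the question, and querySelector is assumed to be available on elements (IE8 standards mode or later).

```vba
' Hypothetical helper: returns the amount text from the first ledger row
' of tbl whose post-date cell contains postDate, or "" if none is found.
Function FindAmount(ByVal tbl As Object, ByVal postDate As String) As String
    Dim rw As Object, dateCell As Object, amountCell As Object
    Dim txt As String

    For Each rw In tbl.Rows
        ' Rows without a post-date cell (e.g. the category row) are skipped
        Set dateCell = rw.querySelector("td.tdCamperFamilyLedgerTableColumnPostDate")
        If Not dateCell Is Nothing Then
            If InStr(dateCell.innerText, postDate) > 0 Then
                Set amountCell = rw.querySelector("td.tdCamperFamilyLedgerTableColumnAmount")
                If Not amountCell Is Nothing Then
                    ' Strip line breaks that the nested markup may add
                    txt = Replace(Replace(amountCell.innerText, vbCr, ""), vbLf, "")
                    FindAmount = Trim$(txt)
                    Exit Function
                End If
            End If
        End If
    Next rw
End Function
```

Inside the If Not tbl Is Nothing block above, it could be called as Debug.Print FindAmount(tbl, "11/16/2016").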
I have a problem getting the values of an HTML table because it doesn't have an id. I need to get all the values in the second column and keep them in an array. I am using HtmlAgilityPack, and my problem comes when selecting nodes:
Dim doc As HtmlDocument
Dim web As New HtmlWeb()
Dim str As String
doc = Web.Load("http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#")
Dim nodes_filas As HtmlNode() = doc.DocumentNode.SelectNodes("//table[@id='']//tr").ToArray
Dim nodes_columnas As HtmlNode() = doc.DocumentNode.SelectNodes("//td").ToArray
For Each row As HtmlNode In nodes_filas
For Each column As HtmlNode In nodes_columnas
str = column.InnerHtml & vbCrLf
Next
Next
This is the table:
<table cellspacing="1" cellpadding="3" width="100%" border="0">
<tr>
<td colspan="2" style="font-size:13px;color:#55711C;padding-bottom:5px;">Aporte por ración</td>
</tr>
<tr style="background-color:#EBEBEB">
<td width="125">Energía [Kcal]</td>
<td class="td_right">145,00</td>
</tr>
<tr>
<td>Proteína [g]</td>
<td class="td_right">22,20</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Hidratos carbono [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr>
<td>Fibra [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Grasa total [g]</td>
<td class="td_right">6,20</td>
</tr>
<tr>
<td>AGS [g]</td>
<td class="td_right">1,91</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGM [g]</td>
<td class="td_right">1,92</td>
</tr>
<tr>
<td>AGP [g]</td>
<td class="td_right">1,52</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGP /AGS</td>
<td class="td_right">0,79</td>
</tr>
<tr>
<td>(AGP + AGM) / AGS</td>
<td class="td_right"> 1,80</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Colesterol [mg]</td>
<td class="td_right">62,00</td>
</tr>
<tr>
<td>Alcohol [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Agua [g]</td>
<td class="td_right">71,60</td>
</tr>
</table>
Sorry, I don't have VB installed, but the C# version should be enough to give you the idea. Since the cells have the td_right class, you can query for it with either a lambda or XPath.
I prefer the lambda/LINQ version because I am familiar with LINQ and don't need to remember XPath syntax.
Lambda:
public static bool HasClass(this HtmlNode node, params string[] classValueArray)
{
var classValue = node.GetAttributeValue("class", "");
var classValues = classValue.Split(' ');
return classValueArray.All(c => classValues.Contains(c));
}
var url = "http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var nodes = htmlDoc.DocumentNode.Descendants("td").Where(_ => _.HasClass("td_right")).Select(_ => _.InnerText);
XPATH:
var nodes2 = htmlDoc.DocumentNode.SelectNodes("//td[@class='td_right']");
I have the following table.
<table class="table1">
<tbody>
<tr>
<th></th>
<th>SEQ</th>
<th>LOGIN</th>
<th>WHATSAPP</th>
<th>E-MAIL</th>
</tr>
<tr>
<td>1</td>
<td></td>
<td>name</td>
<td>99 999999999</td>
<td>xxxxxxx@hotmail.com</td>
</tr>
</tbody>
</table>
I would like to know how to get the content of each TR so that I can write it to an Access database.
So far I have only managed to get to this code:
Dim PageElement As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("table")
For Each CurElement As HtmlElement In PageElement
If (CurElement.GetAttribute("className") = "table1") Then
TextBox1.Text = CurElement.InnerHtml
End If
Next
I am using the following code to parse html tables from an html file into a dataset:
Public Function GetDataSet(html As String) As DataSet
Dim ds As DataSet = New DataSet
Dim htmldoc As New HtmlAgilityPack.HtmlDocument
htmldoc.LoadHtml(html)
Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
.GroupBy(Function(x) x.ParentNode)
For i As Integer = 0 To tables.Count - 1
Dim rows = tables(i).ToList()
ds.Tables.Add(String.Format("Table {0}", i))
Dim headers = rows(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList()
For Each Hr In headers
ds.Tables(i).Columns.Add(Hr)
Next
For j As Integer = 1 To rows.Count - 1
Dim row = rows(j)
Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()
ds.Tables(i).Rows.Add(dr)
Next
Next
Return ds
End Function
and it works fine. But when there is a tag placed inside the <table> tag before the <tr> tag, the table is not parsed.
Simple Example:
<html>
<head><title>Test</title></head>
<body>
<div>Contents:</div>
<table>
<tr>
<th>Column1</th> <th>Column2</th>
</tr>
<tr>
<td>1</td> <td>11</td>
</tr>
<tr>
<td>2</td> <td>22</td>
</tr>
</table>
<table>
<tbody>
<tr>
<th>Column1</th> <th>Column2</th> <th>Column3</th>
</tr>
<tr>
<td>a</td> <td>aa</td> <td>aaa</td>
</tr>
<tr>
<td>b</td> <td>bb</td> <td>bbb</td>
</tr>
</tbody>
</table>
<table>
<div>
<tr>
<th>Column1</th> <th>Column2</th> <th>Column3</th>
</tr>
<tr>
<td>a</td> <td>aa</td> <td>aaa</td>
</tr>
<tr>
<td>b</td> <td>bb</td> <td>bbb</td>
</tr>
</div>
</table>
</body>
</html>
In this example only the first table is parsed.
My question is how to ignore any tag between <Table> tag and <tr> tag in the following line of code:
Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
.GroupBy(Function(x) x.ParentNode)
so that all the tables are parsed.
You can use // to select from all descendants:
Dim rows = htmldoc.DocumentNode.SelectNodes("//table//tr")
Also, based on your requirement, it seems better to group the result by the first ancestor table, because the parent of a tr may be a tbody or thead and you need to group rows by table:
Dim tables = htmldoc.DocumentNode.SelectNodes("//table//tr") _
.GroupBy(Function(x) x.Ancestors("table").First())