Can I web scrape for a particular style in VBA?

I have been looking everywhere for any possible workaround to this issue.
All of the data at my company is accessed via a web portal that produces static HTML pages. Unfortunately our department cannot be given direct access to the server, which would make my life easy, so I need to scrape this portal to find the data that I need. My navigation is fine and I am quite experienced with scraping where elements are named or given an ID; however, this page has neither.
Anyway, background out of the way.
I want to grab a table from the page that has a unique style of "empty-cells: show;":
<TABLE cellspacing=10 cellPadding=10 border="1" style="empty-cells: show;">
</TABLE>
Or failing that there is a heading in the first row which always contains the same text string. Once I have that table I can manipulate the data I need from it. Hugely sensitive data here guys, so I can't provide the full page code unfortunately.
I know that there have been many posts regarding GetElementByRegex but I cannot find a post or website that actually explains how to use it. Instead they all want me to install their add-on which isn't an option (I need to learn this to sate my thirst for knowledge).
To help I have added the full table code below removing the sensitive data:
<TABLE cellspacing=10 cellPadding=10 border="0" width=100%>
<tr>
<td>
<TABLE cellspacing=10 cellPadding=10 border="1" style="empty-cells: show;">
<TR class="row0">
<TD style="width: 25%; background-color: #A3DCF5;"><strong>TITLE:</strong></TD>
<TD>LINE1</TD>
</TR>
<TR class="row1">
<TD> </TD><td>LINE2</td>
</TR>
<TR class="row0">
<TD> </TD><td>LINE3</td>
</TR>
<TR class="row1">
<TD> </TD><td>LINE4</td>
</TR>
<TR class="row0">
<TD> </TD><td>LINE5</td>
</TR>
</TABLE>
</td>
</tr>
</TABLE>
There are many other tables, though, so a Len check will not help me to sift through the TD tags.

Dim tbls, tbl, tr, j, td, row, sht
Set tbls = IE.document.getElementsByTagName("table")
For Each tbl in tbls
'item indexes are zero-based (AFAIR)
If tbl.Rows(0).Cells(1).innerText = "LINE1" Then
'EDIT: extracting the table contents
Set sht = ActiveSheet
row = 3
For Each tr In tbl.getElementsByTagName("TR")
j = 1
For Each td In tr.getelementsbytagname("TD")
sht.Cells(row + 1, j).Value = td.innerText
j = j + 1
Next
row = row + 1
Next
Exit For 'stop looping
End If
Next
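If you would rather match on the style itself, as the question first asked, a rough sketch is to test each table's inline style text. FindTableByStyle is a made-up helper name, and this assumes the browser keeps empty-cells in style.cssText; older IE document modes may change the case or drop properties they do not recognise, so fall back to the heading-text check above if it finds nothing.

```vba
' Hypothetical helper: returns the first <table> whose inline style
' mentions styleText, or Nothing when no table matches.
Function FindTableByStyle(doc As Object, styleText As String) As Object
    Dim tbl As Object
    For Each tbl In doc.getElementsByTagName("table")
        ' vbTextCompare makes the match case-insensitive, because some
        ' IE document modes upper-case property names in cssText.
        If InStr(1, tbl.Style.cssText, styleText, vbTextCompare) > 0 Then
            Set FindTableByStyle = tbl
            Exit Function
        End If
    Next tbl
    Set FindTableByStyle = Nothing
End Function
```

Usage would look like Set tbl = FindTableByStyle(IE.document, "empty-cells"), followed by an Is Nothing check before reading any rows.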

Answer by Glitch_Doctor edited out of question:
Thank you for all the help Tim!
The below worked perfectly for me:
Dim tbls, tbl
Dim L1 As String, L2 As String, L3 As String, L4 As String, L5 As String
Set tbls = IE.Document.getElementsByTagName("table")
For Each tbl In tbls
If tbl.Rows(0).Cells(0).innerText = "Card Address:" Then
On Error Resume Next
L1 = tbl.Rows(0).Cells(1).innerText
L2 = tbl.Rows(1).Cells(1).innerText
L3 = tbl.Rows(2).Cells(1).innerText
L4 = tbl.Rows(3).Cells(1).innerText
L5 = tbl.Rows(4).Cells(1).innerText
Exit For
End If
Next
Worksheets("Sheet2").Range("A1").Value = L1
Worksheets("Sheet2").Range("A2").Value = L2
Worksheets("Sheet2").Range("A3").Value = L3
Worksheets("Sheet2").Range("A4").Value = L4
Worksheets("Sheet2").Range("A5").Value = L5
End Sub

Related

VBA: Get specific column from webpage using seleniumbasic

I am trying to get a specific column's data from a table on a webpage into my Excel file using VBA. I can open the webpage, log in and navigate to the table area, but I am unable to get specific columns from the table. I don't know how to pull only one column from a table within the web page.
I use Chrome for the automation. Below is sample HTML for reference.
<table class="Performed-Detailes-Mac">
<thead class="table-head-basic">
<tr>
<th>File</th>
<th>Name</th>
<th>Date</th>
<th>Wait 1</th>
<th>Wait 2</th>
<th>Status</th>
<th class="text-right">Machines</th>
<th class="text-right">Usage</th>
</tr>
</thead>
<tbody>
<tr class="table-row">
<td data-bind="text: id">File12</td>
<td data-bind="text: Name">JCB</td>
<td data-bind="text: Date">02/01/2022</td>
<td data-bind="text: check1">10:55 </td>
<td data-bind="text: check2">12:30</td>
<td data-bind="text: Status">Completed</td>
<td class="text-right" data-bind="text: Machines">2</td>
<td class="text-right" data-bind="text: Str">100 Percent</td>
</tr>
<tr class="table-row" data-bind="visible : $root.isEditItemsOnDetailsEnabled || $root.Items().length > 0">
<td class="text-right" data-bind="text: TotalDuration">1.75</td>
</tr>
</tbody>
</table>
For reference I have provided only one row (tr) along with the header details.
From the above HTML I would like to extract only the "Date" and "Machines" columns for all rows.
The code I tried is below. I tried a few changes in the For loop but have had no luck so far.
Sub GetTable()
Dim Dr As New Selenium.ChromeDriver
Dim hTable, hBody, hTR, hTD, tb As Object
Dim bb, tr, td As Object
Dr.Get "My Webpage Url"
Dr.Wait 2000
With Sheet1
Set hTable = Dr.FindElementsByCss(".Performed-Detailes-Mac")
For Each tb In hTable
Set hBody = tb.FindElementsByTag("tbody")
For Each bb In hBody
Set hTR = bb.FindElementsByTag("tr")
For r = 1 To hTR.Count - 2
Set hTD = hTR(r).FindElementsByTag("td")
If hTD.Count = 0 Then Set hTD = hTD(r).FindElementsByTag("th")
Lastrow = .Cells(Rows.Count, 1).End(xlUp).Row + 1
For c = 1 To hTD.Count
.Cells(Lastrow, c).Value = hTD(c - 1).Text
Next c
Next r
Next bb
Exit For
Next tb
End With
End Sub
This is my first question, so my apologies if I have done anything wrong.
Thanks Gold
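One way this might be approached, as a sketch rather than a tested solution: walk the body rows with For Each and count cells, so the code does not depend on whether the element collections are 0- or 1-based. It assumes SeleniumBasic is referenced, that the columns keep the positions shown in the sample header (Date is 3rd, Machines is 7th), and that the output should go to columns A and B of Sheet1; GetDateAndMachines is just an illustrative name.

```vba
Sub GetDateAndMachines()
    ' Assumes a reference to the Selenium Type Library (SeleniumBasic).
    Dim Dr As New Selenium.ChromeDriver
    Dim tableRow As Object, cell As Object
    Dim colIndex As Long, outRow As Long

    Dr.Get "My Webpage Url"   ' placeholder URL, as in the question
    Dr.Wait 2000

    outRow = 1
    ' The CSS selector picks only body rows of the target table.
    For Each tableRow In Dr.FindElementsByCss("table.Performed-Detailes-Mac tbody tr")
        colIndex = 0
        For Each cell In tableRow.FindElementsByTag("td")
            colIndex = colIndex + 1
            Select Case colIndex
                Case 3: Sheet1.Cells(outRow, 1).Value = cell.Text   ' Date
                Case 7: Sheet1.Cells(outRow, 2).Value = cell.Text   ' Machines
            End Select
        Next cell
        ' Advance only for full data rows; the one-cell total row is skipped.
        If colIndex >= 7 Then outRow = outRow + 1
    Next tableRow

    Dr.Quit
End Sub
```

If the column order can change, the positions could instead be looked up from the th texts in the thead first.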

How to find attribute from HTML using XPATH

I am trying to retrieve text from HTML, so first I load the HTML:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = web.Load(url);
Here the text to find is 0.802. It sits directly inside an element, and the lookup works for this HTML:
<tr>
<th>Reported-Normalized Volume Ratio</th>
<td>0.802</td>
</tr>
The search string
var t5 = doc.DocumentNode.SelectSingleNode("//th[.='Reported-Normalized Volume Ratio']/following-sibling::td").InnerText;
The value of t5 is 0.802 as expected
Now I need to find data-price-btc="3535.5500891201248734"
<tr>
<th style="width: 300px;">Reported Trading Volume</th>
<td><div data-target="price.price" data-price-btc="3535.5500891201248734">
</div></td>
</tr>
Search string
var t1 = doc.DocumentNode.SelectSingleNode("//th[.='Reported Trading Volume']/following-sibling::td/div/@data-price-btc");
t1 is empty and I don't understand why.
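The problem is that HTML Agility Pack does not support selecting attributes directly with XPath. Select the <div> node instead and read the attribute from it, for example with GetAttributeValue("data-price-btc", "").
https://html-agility-pack.net/knowledge-base/541953/selecting-attribute-values-with-html-agility-pack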

how to find location of specific <tr> each time code is run

My code below will extract a value for each hour of the day.
However, the webpage I'm scraping can change, so I want to find a way to assign the location of the <tr> to a variable so that the code knows which number it is every time. I found the current number "116" by trial and error.
I included the html structure below as well. Any suggestions?
Sub scrape()
Dim IE As Object
Set IE = CreateObject("InternetExplorer.application")
With IE
.Visible = False
.navigate "web address"
Do Until .readyState = 4
DoEvents
Loop
.document.all.item("Login1_UserName").Value = "user"
.document.all.item("Login1_Password").Value = "pw"
.document.all.item("Login1_LoginButton").Click
Do Until .readyState = 4
DoEvents
Loop
End With
Dim htmldoc As Object
Dim r
Dim c
Dim aTable As Object
Dim TDelement As Object
Set htmldoc = IE.document
Dim td As Object
For Each td In htmldoc.getElementsByTagName("td")
On Error Resume Next
If span.Children(0).id = "ctl00_PageContent_grdReport_ctl08_Label50" Then
ThisWorkbook.Sheets("sheet1").Range("j8").Offset(r, c).Value = td.Children(1).innerText
End If
On Error GoTo 0
Next td
End Sub
HTML:
<form name="aspnetForm" id="aspnetForm" action="./MinMaxReport.aspx"
method="post">
<div>
</div>
<script type="text/javascript">...</script>
<div>
</div>
<table class="header-table">...</table>
<table class="page-area">
<tbody>
<tr>
<table id="ctl00_PageContent_Table1" border="0">...</table>
<table id="ctl00_PageContent_Table2" border="0">
<tbody>
<tr>
<td>
<div id="ctl00_PageContent_grdReport_div">
<tbody>
<tr style="background-color: beige;">
<td>...</td>
<td>
<span id="ctl00_PageContent_grdReport_ctl08_Label50">Most Restrictive
Capacity Maximum</span>
</td>
<td>
<span id="ctl00_PageContent_grdReport_ctl08_Label51">159</span>
</td>
</tr>
</tbody>
</div>
</td>
</tr>
</tbody>
</table>
</table>
</tr>
</tbody>
</table>
You could loop through all TDs and check whether the first child has id "ctl00_PageContent_grdReport_ctl08_Label50", for example:
For Each td In htmldoc.getElementsByTagName("td")
On Error Resume Next
If td.Children(0).ID = "ctl00_PageContent_grdReport_ctl08_Label50" Then
ThisWorkbook.Sheets("sheet1").Range("j8").Offset(r, c).Value = td.Children(1).innerText
End If
On Error GoTo 0
Next td
Children(0) picks the first HTML element contained in your table cell. On Error Resume Next covers the situation where a td element has no child.
It is possible that you have more than one element with this id in your webpage. In that case you must identify the table or table row first. I cannot do that because I can't see your whole HTML code.
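In the HTML shown in the question, the value (159) sits in the cell after the one holding the label, so another option is to find the label by id and step to the next cell. This is only a sketch: ValueNextToLabel is a made-up name, it assumes the label id stays stable, that the span is a direct child of its td, and that missing nodes come back as Nothing in your document mode.

```vba
' Hypothetical helper: returns the text of the cell that follows the cell
' containing the element with labelId, or "" when nothing is found.
Function ValueNextToLabel(doc As Object, labelId As String) As String
    Dim labelEl As Object, cell As Object

    Set labelEl = doc.getElementById(labelId)
    If labelEl Is Nothing Then Exit Function

    ' The <span> is assumed to be a direct child of its <td>.
    Set cell = labelEl.parentNode
    If cell Is Nothing Then Exit Function

    ' Step to the next sibling that is an element (nodeType 1),
    ' skipping any whitespace text nodes in between.
    Set cell = cell.nextSibling
    Do While Not cell Is Nothing
        If cell.nodeType = 1 Then Exit Do
        Set cell = cell.nextSibling
    Loop

    If Not cell Is Nothing Then ValueNextToLabel = cell.innerText
End Function
```

Called as ValueNextToLabel(htmldoc, "ctl00_PageContent_grdReport_ctl08_Label50"), this removes the need for a hard-coded row number such as 116.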

How to use getElementsByTagName with <td> with overflow: hidden on VBA?

I am using VBA automation to pull information from a ticket system at work. I am trying to copy the values of the generated table into column "A" of sheet "Plan1", but the only information that does not come through is in the <td> elements that have the overflow: hidden CSS attribute. I don't know whether the two are related, but coincidentally those are the only data that do not appear. Can someone help me?
HTML code:
<div id="posicionamentoContent">
<table class="grid">
<thead>...</thead>
<tbody>
<tr id="937712" class="gridrow">
<td width="200px"> Leonardo Peixoto </td>
<td width="200px"> 23/12/2015 09:45 </td>
<td width="200px"> SIM </td>
<td width="200px"> Telhado da loja com pontos de vazamento.</td>
<td width="200px" align="center"></td>
<td width="200px" align="center"></td>
</tr>
...
...
...
The complete code: http://i.stack.imgur.com/4BsFo.png
I need the text of the first 4 <td> elements (Leonardo Peixoto, 23/12/2015 09:45, SIM and Telhado da loja com pontos de vazamento.), but those are exactly the texts I can't get.
Note: when I use the developer tools (F12) to inspect each element, the information I need is shown inside the <td>. But when I open the page's "source code" to check the HTML, it looks like this:
<div id="tabPosicionamento" style="padding: 5px 0 5px 0;" class="ui-tabs-hide">
<div id="posicionamentoContent"></div>
</div>
Example VBA:
Sub extractTablesData1()
'we define the essential variables
Dim IE As Object, obj As Object
Dim ticket As String
Set IE = CreateObject("InternetExplorer.Application")
ticket= InputBox("Enter the ticket code")
With IE
.Visible = False
.navigate ("https://www.example.com/details/") & ticket
While IE.ReadyState <> 4
DoEvents
Wend
ThisWorkbook.Sheets("Plan1").Range("A1:K500").ClearContents
Set data = IE.document.getElementsByClassName("thead")(0).getElementsByTagName("td")
i = 0
For Each elemCollection In data
ThisWorkbook.Sheets("Plan1").Range("A" & i + 1) = data(i).innerText
i = i + 1
Next elemCollection
End With
IE.Quit
Set IE = Nothing
....
....
End Sub
This procedure writes only <td class="info3"></td> and <td class="info4"></td> to column "A" of sheet Plan1, but I also need <td class="info1"></td> and <td class="info2"></td>.
I wasn't able to read the page code because the proxy is blocking me, but I faced a similar issue a while ago. The solution I found was to put all the data on the clipboard and paste it, then clean up the data on the sheet.
Here is the code I used to do that:
Set ieTable = ie.document.getElementById("ID")
If Not ieTable Is Nothing Then
' DataObject requires a reference to the Microsoft Forms 2.0 Object Library
Set clip = New DataObject
clip.SetText "<html>" & ieTable.outerHTML & "</html>"
clip.PutInClipboard
Sheet1.Range("A1").Select
ActiveSheet.PasteSpecial Format:="Unicode Text", link:=False, DisplayAsIcon:=False, NoHTMLFormatting:=True
End If
Since you need to isolate the 4 td cells, you can loop over them for every search.
Your sample enumerates Data but never uses the loop variable. Also, the cell assignment should be Cells(x, y).Value. Here is the working code.
Sub extractTablesData1()
'we define the essential variables
Dim IE As Object, Data As Object
Dim ticket As String
Set IE = CreateObject("InternetExplorer.Application")
With IE
.Visible = False
.navigate ("put your data url here")
While IE.ReadyState <> 4
DoEvents
Wend
Set Data = IE.document.getElementsByTagName("tr")(0).getElementsByTagName("td")
i = 1
For Each elemCollection In Data
ActiveWorkbook.Sheets(1).Cells(1, i).Value = elemCollection.innerHTML
i = i + 1
Next elemCollection
End With
IE.Quit
Set IE = Nothing
End Sub
It doesn't bring back the information I need (the last <td> elements):
<div id="posicionamentoContent">
<table class="grid">
<thead>...</thead>
<tbody>
<tr id="937712" class="gridrow">
<td width="200px"> Leonardo Peixoto </td>
<td width="200px"> 23/12/2015 09:45 </td>
<td width="200px"> SIM </td>
<td width="200px"> Telhado da loja com pontos de vazamento.</td>
<td width="200px" align="center"></td>
<td width="200px" align="center"></td>
</tr>
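Since the page source shows an empty posicionamentoContent div while the developer tools show the filled table, one possible explanation is that the rows are loaded by script after readyState reaches 4, rather than anything to do with overflow: hidden. Under that assumption, a sketch is to poll for the cells before reading them. WaitForCells is a made-up helper, and it also assumes a missing element comes back as Nothing.

```vba
' Hypothetical helper: waits until the element with containerId holds at
' least one <td>, or until timeoutSeconds have passed.
' Returns True when cells were found, False on timeout.
Function WaitForCells(doc As Object, containerId As String, _
                      timeoutSeconds As Double) As Boolean
    Dim container As Object
    Dim startTime As Double

    startTime = Timer   ' seconds since midnight; not safe across midnight
    Do
        Set container = doc.getElementById(containerId)
        If Not container Is Nothing Then
            If container.getElementsByTagName("td").Length > 0 Then
                WaitForCells = True
                Exit Function
            End If
        End If
        DoEvents        ' let the browser keep running its scripts
    Loop While Timer - startTime < timeoutSeconds
End Function
```

For example, call WaitForCells(IE.document, "posicionamentoContent", 10) after the readyState loop and read the td elements only when it returns True. If the table is filled only after a tab is clicked, that click would have to happen first.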

help w/ MS access database to <table>.... with a twist

I am trying to manage my table more easily with an Access database rather than manually entering data into a <table>.
I have laid out the basic idea but I don't know exactly how to perfect it. Also what do I do if one of my table rows (specifically the last one) doesn't have exactly 3 cells?
This is what I have laid out so far:
<table class="tablecenter" cellspacing="30">
<tbody>
<%
sql()
SQL = Select * From Database
DR = DataReader (SQL)
While Not DR.EOF
x = 1
If x < 4 Then %>
<td><img src="avatar-blank.jpg" alt="headshot"/><br /><p>dr("Name") <br />Hometown: VarAddress <br /> Class: VarClass</p></td>
<% Else
x = 0 %>
</tr>
<tr>
<td><img src="avatar-blank.jpg" alt="headshot" /><br /><p>VarName<br />Hometown: VarAddress<br />Class: VarClass</p></td>
<% End If
x = x + 1
DR.moveNext
Wend %>
It depends on why there are fewer than 3 cells. If it is just missing data, then insert an empty <td> element.
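As a rough illustration of the padding idea, the row-building logic could look like the sketch below. It is written as a VBA function (classic ASP VBScript would drop the As type declarations), BuildRows is a made-up name, the input is a plain array of names rather than a recordset, and the values are not HTML-escaped.

```vba
' Hypothetical helper: builds <tr> rows with cellsPerRow <td> cells each,
' padding the last row with empty cells so every row has the same width.
Function BuildRows(names As Variant, cellsPerRow As Long) As String
    Dim html As String
    Dim i As Long, cellCount As Long

    For i = LBound(names) To UBound(names)
        ' Open a new row before the first cell of each group.
        If cellCount Mod cellsPerRow = 0 Then html = html & "<tr>"
        html = html & "<td>" & names(i) & "</td>"
        cellCount = cellCount + 1
        ' Close the row once it is full.
        If cellCount Mod cellsPerRow = 0 Then html = html & "</tr>"
    Next i

    ' Pad an incomplete final row with empty cells, then close it.
    If cellCount Mod cellsPerRow <> 0 Then
        Do While cellCount Mod cellsPerRow <> 0
            html = html & "<td></td>"
            cellCount = cellCount + 1
        Loop
        html = html & "</tr>"
    End If

    BuildRows = html
End Function
```

With BuildRows(Array("A", "B", "C", "D"), 3), the second row holds D followed by two empty cells. The key difference from the loop in the question is that the counter is initialised once, outside the loop, instead of being reset on every record.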