How to extract the div_class table from betting website - html

I was searching continuously and I ran out of luck. I'm trying to just get the text of the games being played on https://www.sportsinteraction.com/football/nfl-betting-lines/. So at the end I want something to spit out like the below (doesn't have to be clean).
Carolina Panthers
Houston Texans
POINTSPREAD
+4.0
1.90
-4.0
1.92
MONEYLINE
2.69
1.49
OVER/UNDER
O
+47.0
1.91
U
+47.0
1.91
Code below which is obviously wrong:
Sub Pulldata2()
Dim ieObj As InternetExplorer
Dim appIE As Object
Dim htmlEle As IHTMLElement
Dim i As Integer
Dim strSheet As String
strSheet = Sheet2.Range("P2")
i = 1
Set ieObj = New InternetExplorer
ieObj.Visible = False
ieObj.navigate Sheet2.Range("P2").Value
Application.ScreenUpdating = False
Application.DisplayStatusBar = False
Application.Wait Now + TimeValue("00:00:03")
Sheet13.Activate
For Each htmlEle In ieObj.document.getElementsByClassName("game")(0)
With ActiveSheet
.Range("A1").Value
End With
i = i + 1
On Error Resume Next
Next htmlEle
End Sub

Why?
This is an interesting scrape for practice so worth the time to attempt. It is interesting because there isn't an immediately obvious way to 'block' out each event with the associated meta data e.g. date, time, match title; and the betting content switches between horizontal and vertical groupings making your selector strategy for identifying nodes particularly important.
Strategy:
The strategy I decided to employ was the following: Use css class selectors to grab a long list of nodes containing all the info required. date, time and title would be nodes whose info I would need to repeat across the two lines for each team within each gameBettingContent block.
Note: the content is static so XHR can be used to avoid the overhead of opening a browser and rendering unnecessary content. This method is a lot quicker.
Anatomy of a logical block:
Note that each parent node with class gameBettingContent contains 3 child nodes with class betTypeContent. These children correspond with PointSpread, MoneyLine and Over/Under. These will need to occupy their own columns in the output including separating out price from handicap. For each logical block, as shown above, there will be two rows where some info is repeated in the first few columns of output.
The nodeList:
So, initially we have a nodeList of all elements of interest matched on className. Within some of those nodes are children that we then need to access; to avoid complicated accessor syntax, when we get to nodes whose children (or deeper nodes) we need to access, we load the html of that node into a 'surrogate', i.e. new, HTMLDocument variable, so we can leverage querySelectorAll method of HTMLDocument again. This gives us easy to read syntax and the cost is minimal.
We loop the nodeList, test the className of each current node and use a Select Case statement to determine what to do. If the current className is date, time or title we store that node's .innerText for later use (remember we need to repeat this value across runner 1 and runner 2). If the className is gameBettingContent we load that node's outerHTML into a surrogate HTMLDocument, i.e. html2. We can then collect the runners and start populating our output array resultsTable. We add 2 to the r variable (row counter) first, so we are ready for the next logical block (i.e. match), and then populate the 2 rows for the current match using r-2 and r-1.
Using surrogate HTMLDocument variables to leverage querySelectorAll on nodes deeper in the DOM:
As we are using MSXML2.XMLHTTP with MSHTML.HTMLDocument we do not have access to nth-of-type pseudo class css selector syntax (which you do with ie.document when using SHDocVw.InternetExplorer) to differentiate between the 3 divs (columns) for bet types within gameBettingContent; whilst we could chain class selectors and adjacent sibling combinators to move from left to right across the divs (columns) e.g.
PointSpread:
.gameBettingContent #runnerNames + .betTypeContent
MoneyLine:
.gameBettingContent #runnerNames + .betTypeContent + .betTypeContent
Over/Under:
.gameBettingContent #runnerNames + .betTypeContent + .betTypeContent + .betTypeContent
I find it cleaner to simply gather those divs into a nodeList
Set contentDivs = html2.querySelectorAll(".betTypeContent")
Then load each div's (column's) outerHTML into a new HTMLDocument surrogate html3 and leverage querySelectorAll again to gather the two rows info by index from each column.
For example, PointSpread will be the first node in contentDivs, which we read across into html3:
html3.body.innerHTML = contentDivs.item(0).outerHTML
We then select handicap and price
Set pointSpreadHandicaps = html3.querySelectorAll(".handicap")
Set pointSpreadPrices = html3.querySelectorAll(".price")
And can use index to get runner 1 versus runner 2 values:
resultsTable(r - 2, 5) = pointSpreadHandicaps.item(0).innerText: resultsTable(r - 1, 5) = pointSpreadHandicaps.item(1).innerText
That's pretty much the bulk of the logic. The output array is dimensioned based on number of rows being equal to:
html.querySelectorAll("#runnerNames li").Length
i.e. how many runners there are. The number of columns is equal to the number of items we specified in the headers array (we add 1 to the UBound as the array is 0-based). The array is then written out to the sheet in one go, as are the headers.
VBA:
Option Explicit
Public Sub GetNFLMatchInfo()
Dim html As HTMLDocument, html2 As HTMLDocument
Set html = New HTMLDocument: Set html2 = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.sportsinteraction.com/football/nfl-betting-lines/", False
.send
html.body.innerHTML = .responseText
End With
Dim allNodes As Object, i As Long, resultsTable(), r As Long, headers()
Dim dateValue As String, timeValue As String, title As String, html3 As HTMLDocument
headers = Array("Date", "Time", "Title", "Team", "Pointspread handicap", "Pointspread price", "Moneyline price", "O/U Name", "O/U Handicap", "O/U Price")
Set allNodes = html.querySelectorAll(".date, .time, .title, .gameBettingContent") 'nodeList of all items of interest. gameBettingContent is a block _
that will be further subdivided by reading its html into a 'surrogate' HTMLDocument
ReDim resultsTable(1 To html.querySelectorAll("#runnerNames li").Length, 1 To UBound(headers) + 1)
r = 1: Set html3 = New HTMLDocument
For i = 0 To allNodes.Length - 1
With allNodes.item(i)
Select Case .className
Case "date"
dateValue = .innerText
Case "time"
timeValue = .innerText
Case "title"
title = Trim$(.innerText)
Case "gameBettingContent"
Dim runners As Object, contentDivs As Object, pointSpreadHandicaps As Object
Dim pointSpreadPrices As Object, moneyLinePrices As Object 'runners is already declared on the line above
Dim OuHandicaps As Object, OuPrices As Object
r = r + 2 'then fill line one at r-2, and line 2 at r-1
html2.body.innerHTML = .outerHTML
Set runners = html2.querySelectorAll("#runnerNames li")
resultsTable(r - 2, 1) = dateValue: resultsTable(r - 1, 1) = dateValue
resultsTable(r - 2, 2) = timeValue: resultsTable(r - 1, 2) = timeValue
resultsTable(r - 2, 3) = title: resultsTable(r - 1, 3) = title
resultsTable(r - 2, 4) = runners.item(0).innerText: resultsTable(r - 1, 4) = runners.item(1).innerText
Set contentDivs = html2.querySelectorAll(".betTypeContent")
html3.body.innerHTML = contentDivs.item(0).outerHTML
'populate resultsTable for two rows relating to current gameBettingContent
Set pointSpreadHandicaps = html3.querySelectorAll(".handicap")
Set pointSpreadPrices = html3.querySelectorAll(".price")
resultsTable(r - 2, 5) = pointSpreadHandicaps.item(0).innerText: resultsTable(r - 1, 5) = pointSpreadHandicaps.item(1).innerText
resultsTable(r - 2, 6) = pointSpreadPrices.item(0).innerText: resultsTable(r - 1, 6) = pointSpreadPrices.item(1).innerText
html3.body.innerHTML = contentDivs.item(1).outerHTML 'Set html3 content to next content div to right
Set moneyLinePrices = html3.querySelectorAll(".price")
resultsTable(r - 2, 7) = moneyLinePrices.item(0).innerText: resultsTable(r - 1, 7) = moneyLinePrices.item(1).innerText
html3.body.innerHTML = contentDivs.item(2).outerHTML
Set runners = html3.querySelectorAll(".name")
Set OuHandicaps = html3.querySelectorAll(".handicap")
Set OuPrices = html3.querySelectorAll(".price")
resultsTable(r - 2, 8) = runners.item(0).innerText: resultsTable(r - 1, 8) = runners.item(1).innerText
resultsTable(r - 2, 9) = OuHandicaps.item(0).innerText: resultsTable(r - 1, 9) = OuHandicaps.item(1).innerText
resultsTable(r - 2, 10) = OuPrices.item(0).innerText: resultsTable(r - 1, 10) = OuPrices.item(1).innerText
End Select
End With
Next
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(resultsTable, 1), UBound(resultsTable, 2)) = resultsTable
End With
End Sub
Additional reading:
css selectors
querySelectorAll
HTMLDocument

Related

Edit 14 day weather forecast Excel VBA to include precipitation

I found the code below which works nicely and I think I can repurpose it for my needs, but does not include the precipitation. I'm relatively new to HTML so having trouble understanding what each line of code's purpose is. I've gone to the website and looked at the elements and console but can't find "p[data-testid='wxPhrase']" or the word 'children' or 'child'.
I presumed precipitation was just another child so tried adding lines like these after editing the column headers in the first sub:
Results(r + 1, 3) = Children(r).FirstChild.innerText
Results(r + 1, 4) = Children(r).PreviousSibling.PreviousSibling.PreviousSibling.FirstChild.innerText
but it gives Run-time error '438': Object doesn't support this property or method. I would very much appreciate some help and education. Thanks in advance!
Sub MiamiWeather()
Dim Data As Variant
Data = MiamiWeatherData
Range("A1:B1").Value = Array("Date", "Temperature")
Range("A2").Resize(UBound(Data), 2).Value = Data
End Sub
Function MiamiWeatherData()
Const URL = "https://weather.com/weather/tenday/l/3881cd527264bc7c99b6b541473c0085e75aa026b6bd99658c56ad9bb55bd96e"
Dim responseText As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.send
responseText = .responseText
End With
Dim Document As HTMLDocument
Set Document = CreateObject("HTMLFILE")
Document.body.innerHTML = responseText
Dim Children As IHTMLDOMChildrenCollection
Set Children = Document.querySelectorAll("p[data-testid='wxPhrase']")
Dim Results As Variant
ReDim Results(1 To Children.Length, 1 To 2)
Dim r As Long
For r = 0 To Children.Length - 1
Results(r + 1, 1) = Children(r).PreviousSibling.PreviousSibling.FirstChild.innerText
Results(r + 1, 2) = Children(r).PreviousSibling.FirstChild.innerText
Next
MiamiWeatherData = Results
End Function
Assuming you want the percentage, you need to resize the array to hold an extra column (3 rather than 2), then add an extra selector within the loop. That selector can select by attribute = value and will need to work off Children(r).PreviousSibling. Assuming you have a Microsoft set-up that is still maintained, you can chain querySelector at this point, as shown below.
For older versions, e.g. Windows 7 or earlier, use Results(r + 1, 3) = Children(r).PreviousSibling.Children(2).Children(0).Children(1).innerText instead.
Option Explicit
Public Sub MiamiWeather()
Dim Data As Variant
Data = MiamiWeatherData
Range("A1:C1").value = Array("Date", "Temperature", "Precipitation")
Range("A2").Resize(UBound(Data), 3).value = Data
End Sub
Function MiamiWeatherData()
Const URL = "https://weather.com/weather/tenday/l/3881cd527264bc7c99b6b541473c0085e75aa026b6bd99658c56ad9bb55bd96e"
Dim responseText As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.send
responseText = .responseText
End With
Dim Document As HTMLDocument
Set Document = CreateObject("HTMLFILE")
Document.body.innerHTML = responseText
Dim Children As IHTMLDOMChildrenCollection
Set Children = Document.querySelectorAll("p[data-testid='wxPhrase']")
Dim Results As Variant
ReDim Results(1 To Children.length, 1 To 3)
Dim r As Long
For r = 0 To Children.length - 1
Results(r + 1, 1) = Children(r).PreviousSibling.PreviousSibling.FirstChild.innerText
Results(r + 1, 2) = Children(r).PreviousSibling.FirstChild.innerText
Results(r + 1, 3) = Children(r).PreviousSibling.querySelector("[data-testid=PercentageValue]").innerText
Next
MiamiWeatherData = Results
End Function

VBA: Nested JSON Object Structure for UK Bank Holidays - Run Time Errors 13, 438 and 449

I'm looking to import all England and Wales Bank Holidays from https://www.gov.uk/bank-holidays.json and add them to a pre-created MS Access recordset (called "TestTable") using the MS Access VBA module. The code below opens and converts the json to a string, and then parses it using the JsonConverter.
This is where I seem to have hit a wall - I can't seem to get the right combo of Dictionaries and Collections to tell the VBA module the structure of the json file (I have no problem with creating a record in Access). After parsing the json, I'm getting one of two errors, most likely because what I think is supposed to be a dictionary (with {} brackets) and what I think is supposed to be a collection (with [] brackets) give me errors.
Option Explicit
Sub ImportBH()
Dim Parsed As Dictionary
Dim rsT As DAO.Recordset
Dim jsonStr As String
Dim dictionaryKey, var1 As Variant
Dim initialCollection As Collection
Set rsT = CurrentDb.OpenRecordset("TestTable")
Dim httpobject As Object
Set httpobject = CreateObject("MSXML2.XMLHTTP")
httpobject.Open "GET", "https://www.gov.uk/bank-holidays.json", False
httpobject.Send
jsonStr = httpobject.responsetext
Set Parsed = ParseJson(jsonStr) 'parse json data
If I now use the line:
For Each dictionaryKey In Parsed("england-and-wales")
Then at the end of the "item" function in JsonConverter, I get a Run-time error 438: Object doesn't support this property or method.
On the other hand, if I use the line:
For Each dictionaryKey In Parsed.Keys
Then it works (using the "Keys" function in JsonConverter), and when I hover over "Parsed.Keys", it gives me "england-and-wales". However, at the first line of the following code, I get a Run-time error 13: Type mismatch.
Set initialCollection = dictionaryKey("events")
With rsT
.AddNew
![Title] = var1("title")
![Datex] = var1("date")
![Notes] = var1("notes")
.Update
End With
Next
End Sub
I've tried the solutions (and others similar) in these links.
https://github.com/VBA-tools/VBA-Web/issues/134 - I'm aware this is for exporting json and not importing, but I thought the syntax might help, as Tim Hall has replied himself. Unfortunately, The ".Data" property doesn't appear or work for me :(
VBA-Json Parse Nested Json - When trying to apply this to the UK Bank Holidays json, I get Run-time error 13 again.
https://github.com/VBA-tools/VBA-Web/issues/329 - If I try, for example:
Debug.Print Parsed(dictionaryKey)
Then after the "item" function in JsonConverter, I get a Run-time error 449: Argument not optional.
https://github.com/VBA-tools/VBA-Web/issues/260 - I can't get to the stage to create a collection to use ".Count" to make this work.
If anyone has achieved this before in VBA, or might be able to offer a hand, it would be very much appreciated!
Start with learning how to read the json structure. You can paste the json string in a json viewer. You then get a nice view of the structure. In VBA JSON the [] denote a collection you can For Each over or access by index, and the {} denotes a dictionary you can For Each the keys of, or access by specific key.
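To make the {} versus [] distinction concrete, here is a minimal sketch of drilling down to a single value. It assumes the VBA-JSON JsonConverter module is imported (as in the question) with a reference to Microsoft Scripting Runtime; FirstEventTitle is a hypothetical helper name, and the sample string is a hand-written, trimmed stand-in for the real response, not actual data from the endpoint.

```vba
Option Explicit

'Hypothetical helper: returns the title of the first event for a division.
'Assumes the VBA-JSON module (JsonConverter) is present in the project.
Public Function FirstEventTitle(ByVal jsonText As String, ByVal division As String) As String
    Dim json As Object
    Set json = JsonConverter.ParseJson(jsonText)    '{} at the root -> dictionary
    'json(division)  -> dictionary, accessed by key
    '("events")      -> collection, as the value is wrapped in []
    '(1)             -> first item of the collection (collections are 1-based)
    '("title")       -> value inside the innermost dictionary
    FirstEventTitle = json(division)("events")(1)("title")
End Function

Public Sub DemoFirstEventTitle()
    Dim sample As String
    'Hand-written sample with the same nesting as described in the answer
    sample = "{""england-and-wales"":{""events"":[" & _
             "{""title"":""Christmas Day"",""date"":""2018-12-25"",""notes"":""""}]}}"
    Debug.Print FirstEventTitle(sample, "england-and-wales")
End Sub
```

Read from left to right, each pair of brackets in the accessor corresponds to one level of nesting in the viewer: key, key, index, key.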
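If you put your json into a viewer you should be reading it something like as follows: the root is a dictionary with 3 keys, one per division (e.g. england-and-wales); each division is itself a dictionary containing an events key; and events is a collection of dictionaries, one per holiday, with keys such as title, date and notes.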
Excel version for use as template:
Accessing all items:
The following shows one way of emptying the entire json into an array (you could amend for adding to recordset?)
Option Explicit
Public Sub EmptyJsonIntoArray()
Dim json As Object, r As Long, c As Long, results(), counter As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.gov.uk/bank-holidays.json", False
.Send
Set json = JsonConverter.ParseJson(.responsetext) 'dictionary with 3 keys
End With
Dim key As Variant, innerKey As Variant, col As Collection
Dim division As String, headers(), item As Object, arr()
arr = json.keys
headers = json(arr(LBound(arr)))("events").item(1).keys 'take first innermost dictionary keys as headers for output
'oversize array as number of events can vary by division
ReDim results(1 To 1000, 1 To UBound(headers) + 2) '4 is the number of keys for each event level dictionary. +1 so can have _
division included as first column in output and +1 to move from 0 based headers array to 1 based results
r = 1 'leave first row for headers
results(1, 1) = "Division"
For c = LBound(headers) To UBound(headers)
results(1, c + 2) = headers(c) 'write out rest of headers to first row
Next
For Each key In json.keys 'england-and-wales etc. division
division = key
For Each item In json(division)("events") 'variable number of events dictionaries within collection
r = r + 1: c = 2 'create a new row for event output. Set column to 2 (as position 1 will be occupied by division
results(r, 1) = division
For Each innerKey In item.keys 'write out innermost dictionary values into row of array
results(r, c) = item(innerKey)
c = c + 1
Next
Next
Next
'transpose array so can redim preserve the number of rows (now number of columns) to only required number based on current value of r
results = Application.Transpose(results)
ReDim Preserve results(1 To UBound(headers) + 2, 1 To r)
results = Application.Transpose(results) 'transpose array back
'STOP '<== View array
End Sub
Access:
From feedback by OP: with Access there is no Application.Transpose. Instead, the array can be passed to the following function (source). However, the array that is passed must then be 0-based.
Public Function TransposeArray(myarray As Variant) As Variant
Dim X As Long
Dim Y As Long
Dim Xupper As Long
Dim Yupper As Long
Dim tempArray As Variant
Xupper = UBound(myarray, 2)
Yupper = UBound(myarray, 1)
ReDim tempArray(Xupper, Yupper)
For X = 0 To Xupper
For Y = 0 To Yupper
tempArray(X, Y) = myarray(Y, X)
Next Y
Next X
TransposeArray = tempArray
End Function
Access version as appended by OP:
In addition to TransposeArray above (edited below to work in this case), here's the full code for Access:
Option Compare Database
Option Explicit
Public Sub UpdateBankHolidays()
Dim dbs As DAO.Database
Dim tBH As Recordset
Dim i As Long, r As Long, c As Long 'declare each explicitly; "Dim i, r, c As Long" leaves i and r as Variant
Set dbs = CurrentDb
'Set recordset variable as existing table (in this case, called "z_BankHolidays")
Set tBH = dbs.OpenRecordset("z_BankHolidays")
'Download and parse json
Dim json As Object, results(), counter As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.gov.uk/bank-holidays.json", False
.Send
Set json = ParseJson(.responsetext) 'dictionary with 3 keys
End With
Dim key As Variant, innerKey As Variant, col As Collection
Dim division As String, headers(), item As Object, arr()
arr = json.Keys
headers = json(arr(LBound(arr)))("events").item(1).Keys 'take first innermost dictionary keys as headers for output
'oversize array as number of events can vary by division
ReDim results(1 To 1000, 1 To UBound(headers) + 2) '4 is the number of keys for each event level dictionary. +1 so can have _
division included as first column in output and +1 to move from 0 based headers array to 1 based results
r = 1 'leave first row for headers
results(1, 1) = "Division"
For c = LBound(headers) To UBound(headers)
results(1, c + 2) = headers(c) 'write out rest of headers to first row
Next
For Each key In json.Keys 'england-and-wales etc. division
division = key
For Each item In json(division)("events") 'variable number of events dictionaries within collection
r = r + 1: c = 2 'create a new row for event output. Set column to 2 (as position 1 will be occupied by division
results(r, 1) = division
For Each innerKey In item.Keys 'write out innermost dictionary values into row of array
results(r, c) = item(innerKey)
c = c + 1
Next
Next
Next
'transpose array so can redim preserve the number of rows (now number of columns) to only required number based on current value of r
results = TransposeArray(results)
ReDim Preserve results(0 To UBound(results), 0 To r)
results = TransposeArray(results) 'transpose array back
'Clear all existing bank holidays from recordset
dbs.Execute "DELETE * FROM " & tBH.Name & ";"
'Insert array results into tBH recordset, transforming the date into a date value using a dd/mmm/yyyy format (in the array they are currently yyyy-mm-dd)
For i = 1 To r
If results(i, 1) = "england-and-wales" Then
dbs.Execute " INSERT INTO " & tBH.Name & " " _
& "(Title,Holiday,Notes) VALUES " _
& "('" & results(i, 2) & "', " & _
"'" & DateValue(Right(results(i, 3), 2) & "/" & Format("20/" & Mid(results(i, 3), 6, 2) & "/2000", "mmm") & "/" & Left(results(i, 3), 4)) & "', " & _
"'" & results(i, 4) & "'" & _
");"
End If
Next
'Finish
MsgBox "Bank Holidays updated."
End Sub
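The date conversion inside the INSERT statement can probably be simplified. A minimal sketch, assuming the date strings always arrive as yyyy-mm-dd (as stated in the comment above the loop); IsoToDate is a hypothetical helper name and is not part of the code above:

```vba
Option Explicit

'Hypothetical helper: converts a yyyy-mm-dd string into a Date.
'Assumes the input is always exactly in that format; no validation is done.
Public Function IsoToDate(ByVal isoDate As String) As Date
    IsoToDate = DateSerial(CInt(Left$(isoDate, 4)), _
                           CInt(Mid$(isoDate, 6, 2)), _
                           CInt(Right$(isoDate, 2)))
End Function

Public Sub DemoIsoToDate()
    'Prints the date using the local short date format
    Debug.Print IsoToDate("2018-12-25")
End Sub
```

IsoToDate(results(i, 3)) could then stand in for the DateValue(...) expression in the loop. Because DateSerial works on numeric year, month and day parts, it avoids building an intermediate text date whose interpretation depends on regional settings.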
It's also worth noting that I (OP) had to change X and Y in the TransposeArray to start from 1, not 0 (even though, as noted above and in comments, subsequently redimming it must be based at 0). I.e.:
Public Function TransposeArray(myarray As Variant) As Variant
Dim X As Long
Dim Y As Long
Dim Xupper As Long
Dim Yupper As Long
Dim tempArray As Variant
Xupper = UBound(myarray, 2)
Yupper = UBound(myarray, 1)
ReDim tempArray(Xupper, Yupper)
For X = 1 To Xupper
For Y = 1 To Yupper
tempArray(X, Y) = myarray(Y, X)
Next Y
Next X
TransposeArray = tempArray
End Function
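As an alternative to hard-coding the loop start at 0 or 1, the bounds can be read from the array itself. The following is only a sketch of that idea, not the function the OP used; TransposeAnyBase is a hypothetical name, and it assumes a two-dimensional array is passed in.

```vba
Option Explicit

'Hypothetical variant: transposes a 2D array whatever its lower bounds are.
'The result keeps the source's lower bounds, with the dimensions swapped.
Public Function TransposeAnyBase(ByVal source As Variant) As Variant
    Dim r As Long, c As Long, result() As Variant
    ReDim result(LBound(source, 2) To UBound(source, 2), _
                 LBound(source, 1) To UBound(source, 1))
    For r = LBound(source, 1) To UBound(source, 1)
        For c = LBound(source, 2) To UBound(source, 2)
            result(c, r) = source(r, c)
        Next c
    Next r
    TransposeAnyBase = result
End Function

Public Sub DemoTransposeAnyBase()
    Dim arr(1 To 2, 1 To 3) As Variant, t As Variant
    arr(1, 1) = "a": arr(1, 2) = "b": arr(1, 3) = "c"
    arr(2, 1) = "d": arr(2, 2) = "e": arr(2, 3) = "f"
    t = TransposeAnyBase(arr)
    Debug.Print LBound(t, 1), UBound(t, 1), LBound(t, 2), UBound(t, 2)
    Debug.Print t(3, 2)
End Sub
```

Note that if this were used in the Access code above, the ReDim Preserve between the two transposes would need to keep the same lower bounds (1 rather than 0 for the 1-based results array), since ReDim Preserve cannot change a lower bound.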

Web Scraping Elements By Class & Tag name

I'm trying to copy data from the website mentioned below. I need the full range of sizes, price, amenities, specials and reserve. I wrote the code below but I'm unable to copy the elements properly: only three elements are copied, with duplication, and I get no result for amenities and reserve. Can anybody please look into this?
Sub text()
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
.Visible = True
.Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
While .Busy Or .readyState < 4: DoEvents: Wend
Sheets("Unit Data").Select
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features", "Specials", "Price", "Reserve")
Set list = .document.getElementsByClassName("units_table")
'.unit_size medium, .features, .Specials, .price, .Reserve
Dim rowCount As Long
rowCount = .document.querySelectorAll(".tab_container li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByClassName("unitinfo even")
r = r + 1
results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
tl;dr;
Apologies in advance (to some) for the answer length but I thought I would take this pedagogic moment to detail what is going on.
The overall approach I use is the same as in your code: Find a css selector to isolate rows (despite being in different tabs the small, medium, large are actually still all present on page):
Set listings = html.querySelectorAll(".unitinfo")
The above generates the rows. As before, we dump this into a new HTMLDocument so we can leverage querySelector/querySelectorAll methods.
Rows:
Let's take a look at the first row html we are retrieving. The subsequent sections will take this row as a case study to talk through how info is retrieved:
5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
// Refine Search
//
$(function() {
$("#a5x5-1").tooltip({
track: false,
delay: 0,
showURL: false,
left: 5,
top: 5,
bodyHandler: function () {
return " <div class=\"tooltip\"> <div class=\"tooltop\"></div> <div class=\"toolmid clearfix\"> <div class=\"toolcontent\"> <div style=\"text-align:center;width:100%\"> <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" /> </div> <div class=\"display\">5 x 5</div> <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div> </div> <div class=\"clearfix\"></div> </div> <div class=\"toolfoot\"></div> <div class=\"clearfix\"></div> </div> "}
});
});
</SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
$('.units_table tr.unitinfo').each(function(index, el) {
if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
$(this).parents('.units_table').addClass('both');
$(this).addClass('also-both');
$(this).find('.rates').addClass('rates_two_column');
}
});
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>
Each row we are going to be working with will have similar html inside of the html2 variable. If you were in doubt look at the javascript in the function shown above:
$('.units_table tr.unitinfo').each(function(index, el)
it is using the same selector (but also specifying the parent table class and element type (tr)). Basically, that function is being called for each row in the table.
Size:
For some reason the opening td tag is being dropped (I've seen this with missing parent <table> tags I think) so for size, rather than grab by class, I am looking for the start of the closing tag and extracting the string up to there. I do this by passing the return value given by InStr (where < was found in the string) - 1 to the Left$ (typed) function.
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
This returns 5x5.
Description:
The description column is populated by the function we saw above (which is applied to each row remember)
This bit - $("#a5x5-1").tooltip - tells it where to target, and then the return statement of the function provides the html which has a div, with class description, containing the text we want. As we are not using a browser, and I am on 64 bit windows, I cannot evaluate this script, but I can use split to extract the string (description) between the "description\"> and the start of the closing associated div tag:
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
This returns:
"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."
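The nested Split can be wrapped in a small function to make the two steps explicit. This is a sketch only; TextBetween is a hypothetical name, and the sample string in the demo is a shortened, hand-written stand-in for the script content.

```vba
Option Explicit

'Hypothetical helper: returns the text between the first occurrence of
'startMarker and the next occurrence of endMarker after it.
'Returns an empty string if startMarker is not found.
Public Function TextBetween(ByVal source As String, ByVal startMarker As String, _
                            ByVal endMarker As String) As String
    Dim afterStart() As String
    afterStart = Split(source, startMarker)
    If UBound(afterStart) < 1 Then Exit Function        'start marker not present
    If Len(afterStart(1)) = 0 Then Exit Function        'nothing follows the marker
    TextBetween = Split(afterStart(1), endMarker)(0)    'keep everything before the end marker
End Function

Public Sub DemoTextBetween()
    Dim script As String
    'Note the doubled quotes: the text searched contains class=\"description\">
    script = "<div class=\""description\"">Think of it like a standard closet.</div> </div>"
    Debug.Print TextBetween(script, """description\"">", "</div>")
End Sub
```

The guard clauses matter mostly for rows whose script block lacks a description; without them, indexing element (1) of the Split result would raise a subscript out of range error.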
Rate type and price:
These are straightforward and use class name to target:
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Returning (respectively)
Web Rate ,
$39.00
Amenities:
This is where things are a little trickier.
Let's re-examine the html shown above, for this row, that pertains to amenities:
<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>
We can see that parent td has a class of amenities, which has child div elements which have compound class names; the latter of which, in each case, serves as an identifier for amenity type e.g. icon_climate.
When you hover over these, on the page, tooltip info is presented:
We can trace the location of this tooltip in the html of the actual page:
As you hover over different amenities this content updates.
To cut a long story short (he says whilst half way down the page!), this content is being updated from a php file on the server. We can make a request for the file and construct a dictionary which maps the class name of each amenity, e.g. amenity_icon icon_climate, to the associated descriptions. As these are compound class names, each " " needs replacing with "." (and a leading "." adding) to give the appropriate css selector, e.g. .amenity_icon.icon_climate. You can explore the php file here.
The php file:
Let's look at just the start of the file so as to dissect the basic unit of what is a repeating pattern:
function LoadTooltips() {
$(".units_table .amenity_icon.icon_climate").tooltip({
track: false,
delay: 0,
showURL: false,
left: -126,
top: -100,
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
}
});
The function responsible for updating the tooltip is LoadTooltips. CSS class selectors are used to target each icon:
$(".units_table .amenity_icon.icon_climate").tooltip
And we have the bodyhandler specifying the return text:
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
We have three bits of useful information that appear in repeating groups. The class name selector for the element, the short description and the long description e.g.
CSS selector: .amenity_icon.icon_climate. We use this to map the php file descriptions to the class name of the amenity icon in our row.
Short description: Temperature Controlled. This is inside the h4 tag of the tooltip function return text.
Long description: Units are heated and/or cooled. See manager for details. This is inside the p tag of the tooltip function return text.
I write 2 functions, GetMatches and GetAmenitiesDescriptions, that use regex to extract all of the repeating items, for each icon, and return a dictionary which has the css selector as the key, and the short description : long description as the value.
When I gather all the icons in each row:
Set icons = html2.querySelectorAll(".amenity_icon")
I use the dictionary to return the tooltip descriptions based on the class name of the icon
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
I then join the descriptions with vbNewLine to ensure output is on different lines within output cell.
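The lookup and join steps can be sketched in isolation as follows. DescribeAmenities is a hypothetical helper (it is not one of the functions in the full listing), the dictionary is filled by hand here rather than from the php file, and the sketch assumes at least one class name is passed and that class names are separated by single spaces.

```vba
Option Explicit

'Hypothetical helper: maps each compound class name to its description via the
'dictionary and joins the results with new lines for output to a single cell.
'Class names with no dictionary entry give an empty line.
Public Function DescribeAmenities(ByVal classNames As Variant, ByVal descriptions As Object) As String
    Dim info() As String, i As Long, selector As String
    ReDim info(LBound(classNames) To UBound(classNames))
    For i = LBound(classNames) To UBound(classNames)
        '"amenity_icon icon_climate" -> ".amenity_icon.icon_climate"
        selector = "." & Replace$(classNames(i), Chr$(32), ".")
        If descriptions.Exists(selector) Then info(i) = descriptions(selector)
    Next
    DescribeAmenities = Join(info, vbNewLine)
End Function

Public Sub DemoDescribeAmenities()
    Dim dict As Object
    Set dict = CreateObject("Scripting.Dictionary")
    'Hand-entered values in the "short description : long description" shape
    dict.Add ".amenity_icon.icon_climate", "Temperature Controlled : Units are heated and/or cooled. See manager for details."
    dict.Add ".amenity_icon.icon_interior", "Interior Storage : (long description omitted in this sketch)"
    Debug.Print DescribeAmenities(Array("amenity_icon icon_climate", "amenity_icon icon_interior"), dict)
End Sub
```

Checking Exists before reading the value avoids silently adding empty keys to a Scripting.Dictionary, which is what reading a missing key would otherwise do.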
You can explore the regex here.
The regex uses | (Or) syntax so I return all matched patterns in a single list.
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
As I will want different submatches (0,1 or 2 a.k.a css class selector, short desc, long desc) I use a Select Case i mod 3, with counter variable i, to extract appropriate sub-matches.
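To illustrate the alternation and the i Mod 3 idea, here is a minimal sketch. It is not the GetMatches function referred to above; ParseTooltipParts is a hypothetical name, and the sketch assumes every icon yields its three matches in the order selector, h4, p, with each on a single line (in VBScript.RegExp the dot does not match a line feed).

```vba
Option Explicit

'Hypothetical sketch: returns selector, short description and long description
'as consecutive items of a collection, in the order they are matched.
Public Function ParseTooltipParts(ByVal source As String) As Collection
    Dim re As Object, matches As Object, i As Long
    Dim parts As New Collection
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True    'return every match, not just the first
    re.Pattern = "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>"
    Set matches = re.Execute(source)
    For i = 0 To matches.Count - 1
        'Each match fills only one of the three groups; because matches arrive
        'in repeating triples, i Mod 3 picks the group that was filled
        parts.Add matches.Item(i).SubMatches(i Mod 3)
    Next
    Set ParseTooltipParts = parts
End Function

Public Sub DemoParseTooltipParts()
    Dim src As String, part As Variant
    'Two lines modelled on the start of the php file, long description shortened
    src = "$("".units_table .amenity_icon.icon_climate"").tooltip({" & vbLf & _
          "return ""<div class=\""sidebar_tooltip\""><h4>Temperature Controlled</h4><p>Units are heated.</p></div>"""
    For Each part In ParseTooltipParts(src)
        Debug.Print part
    Next
End Sub
```

If the file ever contained an icon with a missing h4 or p tag, the triples would fall out of step, so a more defensive version would test which submatch is non-empty instead of relying on the counter.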
Specials:
We are back to class selectors. Offer2 is not populated, so you could remove it.
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
returns (respectively):
Call for Specials, empty string
Closing remarks:
So, the above walks you through one row. It is simply rinse and repeat in the loop over all rows. Data is added, for efficiency, to an array, results; which is then written to Sheet1 in one go. There are some minor improvements I can see but this is fast.
VBA:
Option Explicit
Public Sub GetInfo()
Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
Set amenitiesDescriptions = GetAmenitiesDescriptions
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
Dim headers(), results(), listings As Object, amenities As String
headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
Set listings = html.querySelectorAll(".unitinfo")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
For item = 0 To listings.Length - 1
r = r + 1
html2.body.innerHTML = listings.item(item).innerHTML
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Set icons = html2.querySelectorAll(".amenity_icon")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out
results(r, 5) = amenities
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
' view regex here: https://regex101.com/r/bII5AL/1
Set dict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
For i = LBound(arr) To UBound(arr) Step 3 'build up lookup dictionary for amenities descriptions
dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
Next
End With
Set GetAmenitiesDescriptions = dict
End Function
Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = sPattern
If .test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
Select Case i Mod 3
Case 0
arrMatches(i) = iMatch.SubMatches.item(0)
Case 1
arrMatches(i) = iMatch.SubMatches.item(1)
Case 2
arrMatches(i) = iMatch.SubMatches.item(2)
End Select
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Output:
References (VBE > Tools > References):
Microsoft HTML Object Library
Here's one way to do it:
Sub test()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim targetTable As HTMLTable
Dim tableRow As HTMLTableRow
Dim tableCell As HTMLTableCell
Dim element As HTMLDivElement
Dim sht As Worksheet
Dim amenitiesString As String
Dim i As Long
Dim j As Long
Set sht = ThisWorkbook.Worksheets("Sheet1")
With req
.Open "GET", "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955", False
.send
doc.body.innerHTML = .responseText
End With
Set targetTable = doc.getElementById("units_small_units") 'You can use units_medium_units or units_large_units to get the info from the other tabs
i = 0
For Each tableRow In targetTable.Rows
i = i + 1
j = 0
For Each tableCell In tableRow.Cells
amenitiesString = ""
j = j + 1
If tableCell.className = "amenities" And tableCell.innerText <> "Amenities" Then
For Each element In tableCell.getElementsByTagName("div")
amenitiesString = amenitiesString & element.Title & ","
Next element
sht.Cells(i, j).Value = amenitiesString
ElseIf tableCell.className <> "features" Then
sht.Cells(i, j).Value = tableCell.innerText
End If
Next tableCell
Next tableRow
End Sub
I'm using an HTTP request instead of Internet Explorer to get the HTML. Apart from that I think you can get an idea of how to access the elements you want.
Here's a screenshot of the result.
The presentation is a bit primitive but you get the idea :-P
Basically this:
listing.getElementsByClassName("amenities")(0).innerText
will return a blank, because there is no inner text in these elements. The information is produced by a script but it can also be found in the title of the div elements.
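As a rough illustration of reading the title attributes rather than the inner text, here is a small Python sketch using the standard library parser. The CELL markup is hypothetical (in particular icon_alarm and its title are invented), and unlike the VBA above it joins the titles without a trailing comma.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every div that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)


def amenity_titles(cell_html):
    """Return the div titles found in one table cell, comma separated."""
    parser = TitleCollector()
    parser.feed(cell_html)
    return ",".join(parser.titles)


# Hypothetical markup for one amenities cell; the divs have no inner text.
CELL = (
    '<td class="amenities">'
    '<div class="amenity_icon icon_climate" title="Temperature Controlled"></div>'
    '<div class="amenity_icon icon_alarm" title="Alarm"></div>'
    "</td>"
)

print(amenity_titles(CELL))
```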
References used:
Microsoft HTML Object Library and WinHTTP Services Version 5.1
Can you try the jQuery get approach, like below:
$.get( 'url', function(data) {
// Loop through elements
$(data).find("ul").find("li").each( function(){
var text = $(this).text();
} )
} );

Scrape table for nested table in local html using selenium

In this thread (Scraping table from local HTML with unicode characters), QHarr helped me to scrape a table from a local html file.
I have an html file at this Link
And I used the same code, editing it a little for the variables 'startTableNumber', 'endTableNumber' and 'numColumns'
Public Sub Test()
Dim fStream As ADODB.Stream, html As HTMLDocument
Set html = New HTMLDocument
Set fStream = New ADODB.Stream
With fStream
.Charset = "UTF-8"
.Open
.LoadFromFile "C:\Users\Future\Desktop\Sample 2.html"
html.body.innerHTML = .ReadText
.Close
End With
Dim hTables As Object, startTableNumber As Long, i As Long, r As Long, c As Long
Dim counter As Long, endTableNumber, numColumns As Long
startTableNumber = 91
endTableNumber = 509
numColumns = 14
Set hTables = html.getElementsByTagName("table")
r = 2: c = 1
For i = startTableNumber To endTableNumber Step 2
counter = counter + 1
If counter = 10 Then
c = 1: r = r + 1: counter = 1
End If
Cells(r, c) = hTables(i).innerText
c = c + 1
Next
End Sub
But I got scattered data from the table. Furthermore, I would like to find a flexible way to make the code recognize those variables without assigning them manually.
I hope to find a solution using selenium. I also hope not to receive negative rep; I have done my best to clarify the issue.
Regards
So, as I said in my comments you need to study how the data appears in the later table tags and perform a mapping to get the correct ordering. The following writes out the table. As I also mentioned, this is not robust and only the methodology may possibly be transferable to other documents.
In your case you wouldn't be reading from file but would use
Set tables = driver.FindElementsByCss("table[width='100%'] table:first-child")
You would then For Each over the web elements in the collection adjusting the syntax as required e.g. .Text instead of .innerText. There may be a few other adaptations for selenium due to its indexing of webElements but everything you need should be evident below.
VBA:
Option Explicit
Public Sub ParseInfo()
Dim html As HTMLDocument, tables As Object, ws As Worksheet, i As Long
Set ws = ThisWorkbook.Worksheets("Sheet2")
Dim fStream As ADODB.Stream
Set html = New HTMLDocument
Set fStream = New ADODB.Stream
With fStream
.Charset = "UTF-8"
.Open
.LoadFromFile "C:\Users\User\Desktop\test.html"
html.body.innerHTML = .ReadText
.Close
End With
Set tables = html.querySelectorAll("table[width='100%'] table:first-child")
Dim rowCounter: rowCounter = 2
Dim mappings(), j As Long, headers(), arr(13)
headers = Array("Notes", "Type", "Enrollment status", "Governorate of birth", "Year", "Month", "Day", "Date of Birth", "Religion", _
"Nationality", "Student Name", "National Number", "Student Code", "M")
mappings = Array(3, 8, 9, 12, 11, 10, 2, 7, 1, 6, 5, 4, 13)
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
For i = 89 To 504 Step 26
arr(0) = vbNullString
For j = 0 To 12
arr(mappings(j)) = tables.item(2 * j + i).innerText
Next
ws.Cells(rowCounter + 1, 1).Resize(1, UBound(arr) + 1) = arr
rowCounter = rowCounter + 1
Next
End Sub
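The mapping step in ParseInfo can be restated as a short Python sketch. This only mirrors the index arithmetic from the code above; the function names are made up, and the cell values would come from the matched tables.

```python
# Same ordering as the mappings array in ParseInfo: source cell j goes to column MAPPINGS[j].
MAPPINGS = [3, 8, 9, 12, 11, 10, 2, 7, 1, 6, 5, 4, 13]


def table_indexes(start):
    """Indexes of the 13 tables that make up one record: every second table from start."""
    return [start + 2 * j for j in range(len(MAPPINGS))]


def reorder(values, mappings=MAPPINGS):
    """Place each source value in its output column; column 0 stays empty."""
    row = [""] * (len(mappings) + 1)
    for source_index, target_index in enumerate(mappings):
        row[target_index] = values[source_index]
    return row


print(table_indexes(89))
print(reorder(["v%d" % j for j in range(13)]))
```

One record spans 26 tables (13 values at every second index), which is why the outer loop in the VBA steps by 26.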

Extracting a series of URL using VBA

I am just trying to run through a list of URL links, but it keeps showing run-time error '91': Object variable or With block variable not set.
The data I want to extract is from iframes. It does show some of the values, but it gets stuck in the middle of the process with the error.
Below is a sample URL that I want to extract values from: http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201
Public Sub GetInfo()
Dim IE As New InternetExplorer, u As Long
With IE
.Visible = False
For u = 2 To 100
.navigate Cells(u, 1).Value
While .Busy Or .readyState < 4: DoEvents: Wend
With .document.getElementById("bm_ann_detail_iframe").contentDocument
ThisWorkbook.Worksheets("Sheet1").Cells(u, 3) = .getElementById("main").innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 4) = .getElementsByClassName("company_name")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 5) = .getElementsByClassName("formContentData")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 6) = .getElementsByClassName("formContentData")(5).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 7) = .getElementsByClassName("formContentData")(7).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 8) = .getElementsByClassName("formContentData")(8).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 10) = .getElementsByClassName("formContentData")(10).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 11) = .getElementsByClassName("formContentData")(11).innerText
End With
Next u
End With
End Sub
tl;dr
Your error is due to the fact there are different numbers of elements for the given class name depending on the results per page. So you can't use fixed indexes. For the page you indicated, the last index for that class, via the iframe, is 9 i.e. ThisWorkbook.Worksheets("Sheet1").cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText. 10 and 11 are invalid. Below I show a way to determine the number of results and extract info from each result row.
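The failure mode can be shown with a tiny Python sketch. The list here is a hypothetical stand-in for the collection returned by getElementsByClassName, and texts_at is an invented helper; the point is only that an index must be checked against the collection length before it is used.

```python
def texts_at(items, wanted_indexes):
    """Return the item at each wanted index, or '' when the index is past the end."""
    return [items[i] if i < len(items) else "" for i in wanted_indexes]


# Hypothetical page with 10 matching elements, so the valid indexes are 0 to 9.
cells = ["cell%d" % i for i in range(10)]

print(texts_at(cells, [0, 5, 9, 10, 11]))
```

The fixed indexes 10 and 11 fall outside this collection, which corresponds to the point where the original loop stops with the error.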
General principle:
Ok... so the following works on the principle of targeting the Details of Changes table for most of the info.
Example extract:
More specifically, I target the rows that repeat the info for No, Date of Change, #Securities, Type of Transaction and Nature of Interest. These values are stored in an array of arrays (one array per row of information). Then the results arrays are stored in a collection to later be written out to the sheet. I loop each table cell in the targeted rows (td tag elements within parent tr) to populate the arrays.
I add in the Name from the table above on the page and also, because there can be more than one row of results, depending on the webpage, and because I am writing the results to a new Results sheet, I add in the URL before each result to indicate source of information.
TODO:
Refactor the code to be more modular
Potentially add in some error handling
CSS selectors:
① I select the Name element, which I refer to as title, from the Particulars of substantial Securities Holder table.
Example name element:
Inspecting the HTML for this element shows it has a class of formContentLabel, and that it is the first element with this class on the page.
Example HTML for target Name:
This means I can use a class selector, .formContentLabel, to target the element. As it is a single element I want, I use the querySelector method to apply the CSS selector.
② I target the rows of interest in the Details of Changes table with a selector combination of .ven_table tr. This is a descendant combinator: it selects elements with a tr tag that have an ancestor with class ven_table. As these are multiple elements I use the querySelectorAll method to apply the CSS selector combination.
Example of a target row:
Example results returned by CSS selector (sample):
The rows I am interested in start at index 1 and repeat every 4 rows after that, e.g. rows 5, 9 etc.
So I use a little maths in the code to return just the rows of interest:
Set currentRow = data.item(i * 4 + 1)
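The index arithmetic can be checked quickly with a Python sketch. rows_of_interest is an invented helper and the row counts are made-up examples; it lists the indexes directly instead of computing a row count first, so it is an illustration of the pattern rather than a translation of the VBA.

```python
def rows_of_interest(row_count, stride=4, offset=1):
    """Indexes offset, offset + stride, ... that exist in a collection of row_count rows."""
    return list(range(offset, row_count, stride))


print(rows_of_interest(12))
print(rows_of_interest(8))
```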
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, headers(), u As Long, resultCollection As Collection
headers = Array("URL", "Name", "No", "Date of change", "# Securities", "Type of Transaction", "Nature of Interest")
Set resultCollection = New Collection
Dim links()
links = Application.Transpose(ThisWorkbook.Worksheets("Sheet1").Range("A2:A3")) 'A100
With IE
.Visible = True
For u = LBound(links) To UBound(links)
If InStr(links(u), "http") > 0 Then
.navigate links(u)
While .Busy Or .readyState < 4: DoEvents: Wend
Application.Wait Now + TimeSerial(0, 0, 1) '<you may not always need this. Or may need to increase.
Dim data As Object, title As Object
With .document.getElementById("bm_ann_detail_iframe").contentDocument
Set title = .querySelector(".formContentData")
Set data = .querySelectorAll(".ven_table tr")
End With
Dim results(), numberOfRows As Long, i As Long, currentRow As Object, td As Object, c As Long, r As Long
numberOfRows = Round(data.Length / 4, 0)
ReDim results(1 To numberOfRows, 1 To 7)
For i = 0 To numberOfRows - 1
r = i + 1
results(r, 1) = links(u): results(r, 2) = title.innerText
Set currentRow = data.item(i * 4 + 1)
c = 3
For Each td In currentRow.getElementsByTagName("td")
results(r, c) = Replace$(td.innerText, "document.write(rownum++);", vbNullString)
c = c + 1
Next td
Next i
resultCollection.Add results
Set data = Nothing: Set title = Nothing
End If
Next u
.Quit
End With
Dim ws As Worksheet, item As Long
If Not resultCollection.Count > 0 Then Exit Sub
If Not Evaluate("ISREF('Results'!A1)") Then '<==Credit to @Rory for this test
Set ws = Worksheets.Add
ws.Name = "Results"
Else
Set ws = ThisWorkbook.Worksheets("Results")
ws.cells.Clear
End If
Dim outputRow As Long: outputRow = 2
With ws
.cells(1, 1).Resize(1, UBound(headers) + 1) = headers
For item = 1 To resultCollection.Count
Dim arr()
arr = resultCollection(item)
For i = LBound(arr, 1) To UBound(arr, 1)
.cells(outputRow, 1).Resize(1, 7) = Application.WorksheetFunction.Index(arr, i, 0)
outputRow = outputRow + 1
Next
Next
End With
End Sub
Example results using the 2 provided test URLs:
Sample URLs in sheet1:
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5928057
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201