Extracting a series of URLs using VBA - html

I am just trying to run through a list of URL links, but it keeps showing run-time error '91': object variable or With block variable not set.
The data I want to extract is inside iframes. It does show some of the values, but it gets stuck in the middle of the process with the error.
Below is a sample URL that I want to extract values from: http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201
Public Sub GetInfo()
Dim IE As New InternetExplorer
With IE
.Visible = False
For u = 2 To 100
.navigate Cells(u, 1).Value
While .Busy Or .readyState < 4: DoEvents: Wend
With .document.getElementById("bm_ann_detail_iframe").contentDocument
ThisWorkbook.Worksheets("Sheet1").Cells(u, 3) = .getElementById("main").innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 4) = .getElementsByClassName("company_name")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 5) = .getElementsByClassName("formContentData")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 6) = .getElementsByClassName("formContentData")(5).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 7) = .getElementsByClassName("formContentData")(7).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 8) = .getElementsByClassName("formContentData")(8).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 10) = .getElementsByClassName("formContentData")(10).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 11) = .getElementsByClassName("formContentData")(11).innerText
End With
Next u
End With
End Sub

tl;dr
Your error is due to the fact that there are different numbers of elements for the given class name depending on the results per page, so you can't use fixed indexes. For the page you indicated, the last index for that class, via the iframe, is 9, i.e. ThisWorkbook.Worksheets("Sheet1").Cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText. Indexes 10 and 11 are invalid. Below I show a way to determine the number of results and extract info from each result row.
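As an aside, if you did want to keep index-based access, a minimal sketch of a guard (reusing the iframe document and the u counter from your own loop; purely illustrative, not part of the solution below) is to check the collection's Length before reading an index:
With .document.getElementById("bm_ann_detail_iframe").contentDocument
    Dim nodes As Object
    Set nodes = .getElementsByClassName("formContentData")
    'only read index 11 if this page actually has that many elements
    If nodes.Length > 11 Then ThisWorkbook.Worksheets("Sheet1").Cells(u, 11) = nodes(11).innerText
End With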
General principle:
Ok... so the following works on the principle of targeting the Details of Changes table for most of the info.
Example extract:
More specifically, I target the rows that repeat the info for No, Date of Change, #Securities, Type of Transaction and Nature of Interest. These values are stored in an array of arrays (one array per row of information). Then the results arrays are stored in a collection to later be written out to the sheet. I loop each table cell in the targeted rows (td tag elements within parent tr) to populate the arrays.
I add in the Name from the table above on the page, and, because there can be more than one row of results depending on the webpage, and because I am writing the results to a new Results sheet, I also add in the URL before each result to indicate the source of the information.
TODO:
Refactor the code to be more modular
Potentially add in some error handling
CSS selectors:
① I select the Name element, which I refer to as title, from the Particulars of substantial Securities Holder table.
Example name element:
Inspecting the HTML for this element shows it has a class of formContentData, and that it is the first element with that class on the page.
Example HTML for target Name:
This means I can use a class selector, .formContentData, to target the element. As it is a single element I want, I use the querySelector method to apply the CSS selector.
② I target the rows of interest in the Details of Changes table with a selector combination of .ven_table tr. This is a descendant selector combination that selects tr elements which have an ancestor with class ven_table. As these are multiple elements, I use the querySelectorAll method to apply the CSS selector combination.
Example of a target row:
Example results returned by CSS selector (sample):
The rows I am interested in start at index 1 and repeat every 4 rows after that, e.g. rows 5, 9, etc.
So I use a little maths in the code to return just the rows of interest (i = 0 gives item 1, i = 1 gives item 5, i = 2 gives item 9, and so on):
Set currentRow = data.item(i * 4 + 1)
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, headers(), u As Long, resultCollection As Collection
headers = Array("URL", "Name", "No", "Date of change", "# Securities", "Type of Transaction", "Nature of Interest")
Set resultCollection = New Collection
Dim links()
links = Application.Transpose(ThisWorkbook.Worksheets("Sheet1").Range("A2:A3")) 'A100
With IE
.Visible = True
For u = LBound(links) To UBound(links)
If InStr(links(u), "http") > 0 Then
.navigate links(u)
While .Busy Or .readyState < 4: DoEvents: Wend
Application.Wait Now + TimeSerial(0, 0, 1) '<you may not always need this. Or may need to increase.
Dim data As Object, title As Object
With .document.getElementById("bm_ann_detail_iframe").contentDocument
Set title = .querySelector(".formContentData")
Set data = .querySelectorAll(".ven_table tr")
End With
Dim results(), numberOfRows As Long, i As Long, currentRow As Object, td As Object, c As Long, r As Long
numberOfRows = Round(data.Length / 4, 0)
ReDim results(1 To numberOfRows, 1 To 7)
For i = 0 To numberOfRows - 1
r = i + 1
results(r, 1) = links(u): results(r, 2) = title.innerText
Set currentRow = data.item(i * 4 + 1)
c = 3
For Each td In currentRow.getElementsByTagName("td")
results(r, c) = Replace$(td.innerText, "document.write(rownum++);", vbNullString)
c = c + 1
Next td
Next i
resultCollection.Add results
Set data = Nothing: Set title = Nothing
End If
Next u
.Quit
End With
Dim ws As Worksheet, item As Long
If Not resultCollection.Count > 0 Then Exit Sub
If Not Evaluate("ISREF('Results'!A1)") Then '<==Credit to #Rory for this test
Set ws = Worksheets.Add
ws.NAME = "Results"
Else
Set ws = ThisWorkbook.Worksheets("Results")
ws.cells.Clear
End If
Dim outputRow As Long: outputRow = 2
With ws
.cells(1, 1).Resize(1, UBound(headers) + 1) = headers
For item = 1 To resultCollection.Count
Dim arr()
arr = resultCollection(item)
For i = LBound(arr, 1) To UBound(arr, 1)
.cells(outputRow, 1).Resize(1, 7) = Application.WorksheetFunction.Index(arr, i, 0)
outputRow = outputRow + 1
Next
Next
End With
End Sub
Example results using the 2 provided test URLs:
Sample URLs in sheet1:
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5928057
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201


How to extract the div_class table from betting website

I have been searching continuously and I've run out of luck. I'm trying to just get the text of the games being played on https://www.sportsinteraction.com/football/nfl-betting-lines/. So at the end I want something to spit out like the below (it doesn't have to be clean).
Carolina Panthers
Houston Texans
POINTSPREAD
+4.0
1.90
-4.0
1.92
MONEYLINE
2.69
1.49
OVER/UNDER
O
+47.0
1.91
U
+47.0
1.91
Code below which is obviously wrong:
Sub Pulldata2()
Dim ieObj As InternetExplorer
Dim appIE As Object
Dim htmlEle As IHTMLElement
Dim i As Integer
Dim strSheet As String
strSheet = Sheet2.Range("P2")
i = 1
Set ieObj = New InternetExplorer
ieObj.Visible = False
ieObj.navigate Sheet2.Range("P2").Value
Application.ScreenUpdating = False
Application.DisplayStatusBar = False
Application.Wait Now + TimeValue("00:00:03")
Sheet13.Activate
For Each htmlEle In ieObj.document.getElementsByClassName("game")(0)
With ActiveSheet
.Range("A1").Value
End With
i = i + 1
On Error Resume Next
Next htmlEle
End Sub
Why?
This is an interesting scrape for practice so worth the time to attempt. It is interesting because there isn't an immediately obvious way to 'block' out each event with the associated meta data e.g. date, time, match title; and the betting content switches between horizontal and vertical groupings making your selector strategy for identifying nodes particularly important.
Strategy:
The strategy I decided to employ was the following: use css class selectors to grab a long list of nodes containing all the info required. The date, time and title would be nodes whose info I would need to repeat across the two lines for each team within each gameBettingContent block.
Note: the content is static so XHR can be used to avoid the overhead of opening a browser and rendering unnecessary content. This method is a lot quicker.
Anatomy of a logical block:
Note that each parent node with class gameBettingContent contains 3 child nodes with class betTypeContent. These children correspond with PointSpread, MoneyLine and Over/Under. These will need to occupy their own columns in the output including separating out price from handicap. For each logical block, as shown above, there will be two rows where some info is repeated in the first few columns of output.
The nodeList:
So, initially we have a nodeList of all elements of interest matched on className. Within some of those nodes are children that we then need to access; to avoid complicated accessor syntax, when we get to nodes whose children (or deeper nodes) we need to access, we load the html of that node into a 'surrogate', i.e. new, HTMLDocument variable, so we can leverage querySelectorAll method of HTMLDocument again. This gives us easy to read syntax and the cost is minimal.
We loop the nodeList, shown on left above, and test the className of each current node and use a Select Case statement to then determine what to do. If the current className is date, time or title we store that node's .innerText for later use (remember we need to repeat this value across runner 1 and runner 2). If the className is gameBettingContent we load that node's outerHTML into a surrogate HTMLDocument i.e. html2. We can then collect the runners and start populating our output array resultsTable. We add +2 to the r variable (row counter), initially, so we are ready for the next logical block (i.e. match), and then populate the 2 positions for the current match using r-2 and r-1.
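As a skeleton, the control flow just described looks roughly like this (trimmed down; the full working version is in the VBA section below):
For i = 0 To allNodes.Length - 1
    Select Case allNodes.item(i).className
        Case "date", "time", "title"
            'store the .innerText for later - these values are repeated across both runner rows
        Case "gameBettingContent"
            r = r + 2                                           'reserve two output rows for this match
            html2.body.innerHTML = allNodes.item(i).outerHTML   'surrogate document for just this block
            'fill resultsTable rows r - 2 and r - 1 from html2
    End Select
Next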
Using surrogate HTMLDocument variables to leverage querySelectorAll on nodes deeper in the DOM:
As we are using MSXML2.XMLHTTP with MSHTML.HTMLDocument we do not have access to nth-of-type pseudo class css selector syntax (which you do with ie.document when using SHDocVw.InternetExplorer) to differentiate between the 3 divs (columns) for bet types within gameBettingContent; whilst we could chain class selectors and adjacent sibling combinators to move from left to right across the divs (columns) e.g.
PointSpread:
.gameBettingContent #runnerNames + .betTypeContent
MoneyLine:
.gameBettingContent #runnerNames + .betTypeContent + .betTypeContent
Over/Under:
.gameBettingContent #runnerNames + .betTypeContent + .betTypeContent + .betTypeContent
I find it cleaner to simply gather those divs into a nodeList
Set contentDivs = html2.querySelectorAll(".betTypeContent")
Then load each div's (column's) outerHTML into a new HTMLDocument surrogate html3 and leverage querySelectorAll again to gather the two rows info by index from each column.
For example, PointsSpread will be the first node in contentDivs which we read across into html3:
html3.body.innerHTML = contentDivs.item(0).outerHTML
We then select handicap and price
Set pointSpreadHandicaps = html3.querySelectorAll(".handicap")
Set pointSpreadPrices = html3.querySelectorAll(".price")
And can use index to get runner 1 versus runner 2 values:
resultsTable(r - 2, 5) = pointSpreadHandicaps.item(0).innerText: resultsTable(r - 1, 5) = pointSpreadHandicaps.item(1).innerText
That's pretty much the bulk of the logic. The output array is dimensioned based on number of rows being equal to:
html.querySelectorAll("#runnerNames li").Length
i.e. how many runners there are. The number of columns is equal to the number of items we specified in the headers array (we add 1 to the UBound as the array is 0-based). The array is then written out to the sheet in one go, as are the headers.
VBA:
Option Explicit
Public Sub GetNFLMatchInfo()
Dim html As HTMLDocument, html2 As HTMLDocument
Set html = New HTMLDocument: Set html2 = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.sportsinteraction.com/football/nfl-betting-lines/", False
.send
html.body.innerHTML = .responseText
End With
Dim allNodes As Object, i As Long, resultsTable(), r As Long, headers()
Dim dateValue As String, timeValue As String, title As String, html3 As HTMLDocument
headers = Array("Date", "Time", "Title", "Team", "Pointspread handicap", "Pointspread price", "Moneyline price", "O/U Name", "O/U Handicap", "O/U Price")
Set allNodes = html.querySelectorAll(".date, .time, .title, .gameBettingContent") 'nodeList of all items of interest
'gameBettingContent is a block that will be further subdivided by reading its html into a 'surrogate' HTMLDocument
ReDim resultsTable(1 To html.querySelectorAll("#runnerNames li").Length, 1 To UBound(headers) + 1)
r = 1: Set html3 = New HTMLDocument
For i = 0 To allNodes.Length - 1
With allNodes.item(i)
Select Case .className
Case "date"
dateValue = .innerText
Case "time"
timeValue = .innerText
Case "title"
title = Trim$(.innerText)
Case "gameBettingContent"
Dim runners As Object, contentDivs As Object, pointSpreadHandicaps As Object
Dim pointSpreadPrices As Object, moneyLinePrices As Object
Dim OuHandicaps As Object, OuPrices As Object
r = r + 2 'then fill line one at r-2, and line 2 at r-1
html2.body.innerHTML = .outerHTML
Set runners = html2.querySelectorAll("#runnerNames li")
resultsTable(r - 2, 1) = dateValue: resultsTable(r - 1, 1) = dateValue
resultsTable(r - 2, 2) = timeValue: resultsTable(r - 1, 2) = timeValue
resultsTable(r - 2, 3) = title: resultsTable(r - 1, 3) = title
resultsTable(r - 2, 4) = runners.item(0).innerText: resultsTable(r - 1, 4) = runners.item(1).innerText
Set contentDivs = html2.querySelectorAll(".betTypeContent")
html3.body.innerHTML = contentDivs.item(0).outerHTML
'populate resultsTable for two rows relating to current gameBettingContent
Set pointSpreadHandicaps = html3.querySelectorAll(".handicap")
Set pointSpreadPrices = html3.querySelectorAll(".price")
resultsTable(r - 2, 5) = pointSpreadHandicaps.item(0).innerText: resultsTable(r - 1, 5) = pointSpreadHandicaps.item(1).innerText
resultsTable(r - 2, 6) = pointSpreadPrices.item(0).innerText: resultsTable(r - 1, 6) = pointSpreadPrices.item(1).innerText
html3.body.innerHTML = contentDivs.item(1).outerHTML 'Set html3 content to next content div to right
Set moneyLinePrices = html3.querySelectorAll(".price")
resultsTable(r - 2, 7) = moneyLinePrices.item(0).innerText: resultsTable(r - 1, 7) = moneyLinePrices.item(1).innerText
html3.body.innerHTML = contentDivs.item(2).outerHTML
Set runners = html3.querySelectorAll(".name")
Set OuHandicaps = html3.querySelectorAll(".handicap")
Set OuPrices = html3.querySelectorAll(".price")
resultsTable(r - 2, 8) = runners.item(0).innerText: resultsTable(r - 1, 8) = runners.item(1).innerText
resultsTable(r - 2, 9) = OuHandicaps.item(0).innerText: resultsTable(r - 1, 9) = OuHandicaps.item(1).innerText
resultsTable(r - 2, 10) = OuPrices.item(0).innerText: resultsTable(r - 1, 10) = OuPrices.item(1).innerText
End Select
End With
Next
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(resultsTable, 1), UBound(resultsTable, 2)) = resultsTable
End With
End Sub
Example selection from results:
Additional reading:
css selectors
querySelectorAll
HTMLDocument

How to extract data from HTML divs into Excel

I am trying to extract the details in this webpage, and they seem to be under certain "divs" with "selection-left" and "selection-right" classes. I haven't found a way to successfully pull it yet.
This is the URL - https://sports.ladbrokes.com/en-af/betting/golf/golf-all-golf/us-masters/2020-us-masters/228648232/
And here is an image of what I want to extract. I want to copy the competition name and each participant and score.
I have tried using QHarr's approach in this link - How to extract values from nested divs using VBA. But I'm getting errors along this line -
ReDim results(1 To countries.Length / 2, 1 To 4)
Here is the code I've been trying to make work
Option Explicit
Public Sub GetData()
Dim html As HTMLDocument, ws As Worksheet, countries As Object, scores As Object, results(), i As Long, r As Long
Set ws = ThisWorkbook.Worksheets("Sheet1"): Set html = New HTMLDocument: r = 1
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://sports.ladbrokes.com/en-af/betting/golf/golf-all-golf/us-masters/2020-us-masters/228648232/", False
.send
html.body.innerHTML = .responseText
End With
Set participant = html.querySelectorAll(".market-content .selection-left"): Set scores = html.querySelectorAll("..market-content .selection-right")
ReDim results(1 To countries.Length / 2, 1 To 4)
For i = 0 To participant.Length - 1 Step 2
results(r, 1) = participant.item(i).innerText: results(r, 2) = "'" & scores.item(i).innerText
r = r + 1
Next
ws.Cells(1, 1).Resize(1, 4) = Array("Competition", "Participant", "Score")
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
I will need help to make this code work
Content is dynamically added, so it will not be present in the response to your current request; hence your error, as you have a nodeList of Length 0. You could try making POST requests as the page does, but it doesn't look like a quick and easy bit of coding. I would go with browser automation, if this is a small project, so that js can run on the page and you can click the show more button. You will need a wait condition for the page to have properly loaded; I use the presence of the show more button.
Option Explicit
Public Sub GetOddsIE()
Dim d As InternetExplorer, odds As Object, names As Object, i As Long
Dim ws As Worksheet, results(), competition As String
Set d = New InternetExplorer
Set ws = ThisWorkbook.Worksheets("Sheet1")
Const URL = "https://sports.ladbrokes.com/en-af/betting/golf/golf-all-golf/us-masters/2020-us-masters/228648232/"
With d
.Visible = False
.Navigate2 URL
While .Busy Or .ReadyState <> 4: DoEvents: Wend
With .Document.getElementsByClassName("expandable-below-container-button")
Do
DoEvents
Loop While .Length = 0 'wait for element to be present
.Item(0).Click 'click on show more
End With
Set names = .Document.getElementsByClassName("selection-left-selection-name")
Set odds = .Document.getElementsByClassName("odds-convert")
competition = .Document.getElementsByClassName("league")(0).innerText
ReDim results(1 To names.Length, 1 To 3)
For i = 0 To names.Length - 1
results(i + 1, 1) = competition
results(i + 1, 2) = names.Item(i).innerText
results(i + 1, 3) = "'" & odds.Item(i).innerText
Next
.Quit
End With
ws.Cells(1, 1).Resize(1, 3) = Array("Competition", "Participant", "Score")
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Web Scraping Elements By Class & Tag name

I'm trying to copy data from the web-site mentioned below. I need the full range of sizes, Price, Amenities, Specials, and Reserve. I framed the code below, but I'm not able to copy the elements properly: only three elements are copied, with duplication, and I'm not getting results for Amenities and Reserve. Can anybody please look into this?
Sub text()
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
.Visible = True
.Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
While .Busy Or .readyState < 4: DoEvents: Wend
Sheets("Unit Data").Select
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features", "Specials", "Price", "Reserve")
Set list = .document.getElementsByClassName("units_table")
'.unit_size medium, .features, .Specials, .price, .Reserve
Dim rowCount As Long
rowCount = .document.querySelectorAll(".tab_container li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByClassName("unitinfo even")
r = r + 1
results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
tl;dr;
Apologies in advance (to some) for the answer length, but I thought I would take this pedagogic moment to detail what is going on.
The overall approach I use is the same as in your code: find a css selector to isolate the rows (despite being in different tabs, the small, medium and large units are actually all still present on the page):
Set listings = html.querySelectorAll(".unitinfo")
The above generates the rows. As before, we dump this into a new HTMLDocument so we can leverage querySelector/querySelectorAll methods.
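A minimal sketch of that dump step, using the same variable names as the full code further down (illustrative only):
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
html2.body.innerHTML = listings.item(0).innerHTML 'load just the first row into the surrogate document
Debug.Print html2.querySelectorAll(".amenity_icon").Length 'querySelectorAll now works against that row alone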
Rows:
Let's take a look at the first row html we are retrieving. The subsequent sections will take this row as a case study to talk through how info is retrieved:
5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
// Refine Search
//
$(function() {
$("#a5x5-1").tooltip({
track: false,
delay: 0,
showURL: false,
left: 5,
top: 5,
bodyHandler: function () {
return " <div class=\"tooltip\"> <div class=\"tooltop\"></div> <div class=\"toolmid clearfix\"> <div class=\"toolcontent\"> <div style=\"text-align:center;width:100%\"> <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" /> </div> <div class=\"display\">5 x 5</div> <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div> </div> <div class=\"clearfix\"></div> </div> <div class=\"toolfoot\"></div> <div class=\"clearfix\"></div> </div> "}
});
});
</SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
$('.units_table tr.unitinfo').each(function(index, el) {
if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
$(this).parents('.units_table').addClass('both');
$(this).addClass('also-both');
$(this).find('.rates').addClass('rates_two_column');
}
});
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>
Each row we are going to be working with will have similar html inside the html2 variable. If you are in doubt, look at the javascript in the function shown above:
$('.units_table tr.unitinfo').each(function(index, el)
it is using the same selector (but also specifying the parent table class and element type (tr)). Basically, that function is being called for each row in the table.
Size:
For some reason the opening td tag is being dropped (I've seen this with missing parent <table> tags, I think), so for size, rather than grab by class, I am looking for the start of the closing tag and extracting the string up to there. I do this by passing the return value given by InStr (where < was found in the string), minus 1, to the Left$ (typed) function.
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
This returns 5x5.
Description:
The description column is populated by the function we saw above (which, remember, is applied to each row).
This bit - $("#a5x5-1").tooltip - tells it where to target, and then the return statement of the function provides the html, which has a div with class description containing the text we want. As we are not using a browser, and I am on 64-bit Windows, I cannot evaluate this script, but I can use Split to extract the string (the description) between the "description\"> marker and the start of the associated closing div tag:
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
This returns:
"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."
Rate type and price:
These are straightforward and use class name to target:
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Returning (respectively)
Web Rate,
$39.00
Amenities:
This is where things are a little trickier.
Let's re-examine the html shown above, for this row, that pertains to amenities:
<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>
We can see that parent td has a class of amenities, which has child div elements which have compound class names; the latter of which, in each case, serves as an identifier for amenity type e.g. icon_climate.
When you hover over these, on the page, tooltip info is presented:
We can trace the location of this tooltip in the html of the actual page:
As you hover over different amenities this content updates.
To cut a long story short (he says, whilst halfway down the page!), this content is being updated from a php file on the server. We can make a request for that file and construct a dictionary which maps the class name of each amenity, e.g. amenity_icon icon_climate (which, being compound classes, need " " replaced with "." when converting to the appropriate css selector of .amenity_icon.icon_climate), to the associated descriptions. You can explore the php file here.
The php file:
Let's look at just the start of the file so as to dissect the basic unit of what is a repeating pattern:
function LoadTooltips() {
$(".units_table .amenity_icon.icon_climate").tooltip({
track: false,
delay: 0,
showURL: false,
left: -126,
top: -100,
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
}
});
The function responsible for updating the tooltip is LoadTooltips. CSS class selectors are used to target each icon:
$(".units_table .amenity_icon.icon_climate").tooltip
And we have the bodyHandler specifying the return text:
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
We have three bits of useful information that appear in repeating groups: the class name selector for the element, the short description, and the long description, e.g.
.amenity_icon.icon_climate : we use this to map the php file descriptions to the class name of the amenity icon in our row. CSS selector
Temperature Controlled; inside h4 tag of tooltip function return text. Short description
Units are heated and/or cooled. See manager for details.; inside p tag of tooltip function return text. Long description
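Putting those three pieces together, the dictionary entry built for the climate icon would look like the following (the key/value format mirrors how the lookup is used later in the row loop; shown here purely for illustration):
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")
'key = css selector for the icon, value = "short description: long description"
dict(".amenity_icon.icon_climate") = "Temperature Controlled: Units are heated and/or cooled. See manager for details."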
I write 2 functions, GetMatches and GetAmenitiesDescriptions, that use regex to extract all of the repeating items, for each icon, and return a dictionary which has the css selector as the key, and the short description : long description as the value.
When I gather all the icons in each row:
Set icons = html2.querySelectorAll(".amenity_icon")
I use the dictionary to return the tooltip descriptions based on the class name of the icon
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
I then join the descriptions with vbNewLine to ensure output is on different lines within output cell.
You can explore the regex here.
The regex uses | (Or) syntax so I return all matched patterns in a single list.
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
As I will want different submatches (0, 1 or 2, a.k.a. the css class selector, short desc and long desc), I use Select Case i Mod 3, with counter variable i, to extract the appropriate sub-matches.
Example of those matches for the mapping in the php file:
Specials:
We are back to class selectors. Offer2 is not populated, so you could remove it.
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
returns (respectively):
Call for Specials, empty string
Closing remarks:
So, the above walks you through one row. It is simply rinse and repeat in the loop over all rows. Data is added, for efficiency, to an array, results; which is then written to Sheet1 in one go. There are some minor improvements I can see but this is fast.
VBA:
Option Explicit
Public Sub GetInfo()
Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
Set amenitiesDescriptions = GetAmenitiesDescriptions
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
Dim headers(), results(), listings As Object, amenities As String
headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
Set listings = html.querySelectorAll(".unitinfo")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
For item = 0 To listings.Length - 1
r = r + 1
html2.body.innerHTML = listings.item(item).innerHTML
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Set icons = html2.querySelectorAll(".amenity_icon")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out
results(r, 5) = amenities
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
' view regex here: https://regex101.com/r/bII5AL/1
Set dict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
For i = LBound(arr) To UBound(arr) Step 3 'build up lookup dictionary for amenities descriptions
dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
Next
End With
Set GetAmenitiesDescriptions = dict
End Function
Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = sPattern
If .test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
Select Case i Mod 3
Case 0
arrMatches(i) = iMatch.SubMatches.item(0)
Case 1
arrMatches(i) = iMatch.SubMatches.item(1)
Case 2
arrMatches(i) = iMatch.SubMatches.item(2)
End Select
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Output:
References (VBE > Tools > References):
Microsoft HTML Object Library
Here's one way to do it:
Sub test()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim targetTable As HTMLTable
Dim tableRow As HTMLTableRow
Dim tableCell As HTMLTableCell
Dim element As HTMLDivElement
Dim sht As Worksheet
Dim amenitiesString As String
Dim i As Long
Dim j As Long
Set sht = ThisWorkbook.Worksheets("Sheet1")
With req
.Open "GET", "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955", False
.send
doc.body.innerHTML = .responseText
End With
Set targetTable = doc.getElementById("units_small_units") 'You can use units_medium_units or units_large_units to get the info from the other tabs
i = 0
For Each tableRow In targetTable.Rows
i = i + 1
j = 0
For Each tableCell In tableRow.Cells
amenitiesString = ""
j = j + 1
If tableCell.className = "amenities" And tableCell.innerText <> "Amenities" Then
For Each element In tableCell.getElementsByTagName("div")
amenitiesString = amenitiesString & element.Title & ","
Next element
sht.Cells(i, j).Value = amenitiesString
ElseIf tableCell.className <> "features" Then
sht.Cells(i, j).Value = tableCell.innerText
End If
Next tableCell
Next tableRow
End Sub
I'm using an HTTP request instead of Internet Explorer to get the HTML. Apart from that I think you can get an idea of how to access the elements you want.
Here's a screenshot of the result.
The presentation is a bit primitive but you get the idea :-P
Basically this:
listing.getElementsByClassName("amenities")(0).innerText
will return a blank, because there is no inner text in these elements. The information is produced by a script but it can also be found in the title of the div elements.
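For example, keeping the shape of the original loop, something along these lines (an untested sketch reusing the asker's listing variable and results array) would collect the titles instead:
Dim amenityDiv As Object, amenitiesString As String
amenitiesString = vbNullString
For Each amenityDiv In listing.getElementsByClassName("amenities")(0).getElementsByTagName("div")
    amenitiesString = amenitiesString & amenityDiv.Title & "," 'each div's title attribute holds the amenity name
Next
results(r, 2) = amenitiesString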
References used:
Microsoft HTML Object Library and WinHTTP Services Version 5.1
Can you try a jQuery $.get approach like the one below:
$.get( 'url', function(data) {
// Loop through elements
$(data).find("ul").find("li").each( function(){
var text = $(this).text();
} )
} );

VBA HTML Navigate Through Listings

I've got the following code that WORKS, and pulls all of the links for listings in the webpage below. I am now looking to expand this to pull the next page of results (up to n). I took a stab at doing this (second half of this code), but nothing is being displayed.
NOTE: In this sample of code, I was attempting to place the second page of links in Column B, but in an ideal world, I'd like to add the links to the bottom of the results of Page 1 (in Column A).
UPDATE: This code now moves to each page result, but it pastes the same links in Col A as B as C, etc. I am not sure how this is happening as I can watch the browser changing URLs as it goes.
Also, if you have any better ways of doing this (rather than copy/pasting this 10x to get the amount of results I am looking for), please let me know!
Option Explicit
Public Sub GetLinks()
Dim ie As New InternetExplorer, ws As Worksheet, t As Date
Dim k As Integer
Const MAX_WAIT_SEC As Long = 10
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
k = 0
Do While k < 10
.Navigate2 "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn=" & k
While .Busy Or .readyState < 4: DoEvents: Wend
Dim Links As Object, i As Long, count As Long
t = Timer
Do
On Error Resume Next
Set Links = .Document.querySelectorAll(".s-item__link[href]")
count = Links.Length
On Error GoTo 0
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While count = 0
For i = 0 To Links.Length - 1
ws.Cells(i + 1, k + 1) = Links.item(i)
Next
k = k + 1
Loop
.Quit
End With
End Sub
I would probably look to add in a test to ensure the number of pages you request is not greater than the number available. Modularize the code a little to pull out the info extraction step. Use arrays and some basic optimization (ScreenUpdating) to speed up the whole process. Also, get rid of the ie object asap.
This is with the listings results count set to 200 (which in fact gives 211 results per page with the given selector). I'm not sure if this is simply an ebay setting that is remembered or is the default.
Option Explicit
Public Sub GetInfo()
Dim ie As InternetExplorer, nodeList As Object, page As Long, totalResults As Long, ws As Worksheet
Const RESULTS_PER_PAGE = 211
Const DESIRED_PAGES = 3
Const BASE = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn="
Dim results(), url As String, maxPages As Long
ReDim results(1 To DESIRED_PAGES)
Application.ScreenUpdating = False
Set ie = New InternetExplorer
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
For page = 1 To DESIRED_PAGES
url = BASE & page
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
If page = 1 Then
totalResults = Replace$(.document.querySelector(".srp-controls__count-heading").innerText, " results", vbNullString)
maxPages = totalResults / RESULTS_PER_PAGE
End If
Set nodeList = .document.querySelectorAll("#srp-river-results .s-item__link[href]")
results(page) = GetLinks(nodeList)
Set nodeList = Nothing
If page + 1 >= maxPages Then Exit For
Next
.Quit
End With
If maxPages < DESIRED_PAGES Then ReDim Preserve results(1 To maxPages)
For page = LBound(results) To UBound(results)
If page = 1 Then
ws.Cells(1, 1).Resize(UBound(results(page), 1)) = Application.Transpose(results(page))
Else
ws.Cells(GetLastRow(ws, 1) + 1, 1).Resize(UBound(results(page), 1)) = Application.Transpose(results(page))
End If
Next
Application.ScreenUpdating = True
End Sub
Public Function GetLinks(ByVal nodeList As Object) As Variant
Dim results(), i As Long
ReDim results(1 To nodeList.Length)
For i = 0 To nodeList.Length - 1
results(i + 1) = nodeList.item(i)
Next
GetLinks = results
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
End With
End Function
Untested (and I might be missing something), but it seems like you can just specify the page you want to access with the URL query parameter _pgn.
So for example, navigating to the URL below:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn=2
means you're requesting page 2 (also, the _ipg parameter seems to dictate how many results are shown on a page, so increasing to 200 might mean you need to make fewer requests overall).
So if you create some variable pageNumber in your code and increment it inside some sort of loop (one that terminates once you've reached the last page), you should be able to get all pages -- or even any page at some arbitrary index -- without copy-pasting/repeating yourself in your code.
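For what it's worth, a rough sketch of that loop (untested; it reuses the selector and wait pattern from your own code, and the MAX_PAGES cap plus the empty-result check are just placeholder stopping conditions):
Sub GetAllPages()
    Dim ie As New InternetExplorer, links As Object, pageNumber As Long, i As Long, nextRow As Long
    Const BASE As String = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn="
    Const MAX_PAGES As Long = 10 'placeholder cap - replace with a real last-page test
    nextRow = 1
    With ie
        .Visible = True
        For pageNumber = 1 To MAX_PAGES
            .Navigate2 BASE & pageNumber
            While .Busy Or .readyState < 4: DoEvents: Wend
            Set links = .document.querySelectorAll(".s-item__link[href]")
            If links.Length = 0 Then Exit For 'assume we have gone past the last page
            For i = 0 To links.Length - 1
                ThisWorkbook.Worksheets("Sheet1").Cells(nextRow, 1) = links.item(i).href
                nextRow = nextRow + 1
            Next
        Next
        .Quit
    End With
End Sub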

Excel VBA extracting href value

I have a macro that tries to extract all the href values from a page but it only seems to get the first one. If someone could help me out that would be greatly appreciated.
The URL I used is https://www.facebook.com/marketplace/vancouver/entertainment
Screenshot of HTML
<div class="_3-98" data-testid="marketplace_home_feed">
<div>
<div>
<div class="_65db">
<a class="_1oem" href="/marketplace/item/920841554781924" data-testid="marketplace_feed_item">
<a class="_1oem" href="/marketplace/item/580124349088759" data-testid="marketplace_feed_item">
<a class="_1oem" href="/marketplace/item/1060730340772072" data-testid="marketplace_feed_item">
Sub Macro1()
marker = 0
Set objShell = CreateObject("Shell.Application")
IE_count = objShell.Windows.Count
For x = 0 To (IE_count - 1)
On Error Resume Next ' sometimes more web pages are counted than are open
my_url = objShell.Windows(x).document.Location
my_title = objShell.Windows(x).document.Title
If my_title Like "Facebook" & "*" Then 'compare to find if the desired web page is already open
Set ie = objShell.Windows(x)
marker = 1
Exit For
Else
End If
Next
Set my_data = ie.document.getElementsByClassName("_3-98")
Dim link
i = 1
For Each elem In my_data
Set link = elem.getElementsByTagName("a")(0)
i = i + 1
'copy the data to the excel sheet
ActiveSheet.Cells(i, 4).Value = link.href
Next
End Sub
You can use a CSS selector combination to get the elements. If you provide the actual HTML, rather than an image, it will be easier to test and determine the best combination. The selector is applied via the querySelectorAll method to return a nodeList of all matching elements. You traverse the .Length of the nodeList to access items by index from 0 to .Length - 1.
VBA:
Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll("._1oem[href]")
For i = 0 To aNodeList.Length - 1
ActiveSheet.Cells(i + 2, 4) = aNodeList.item(i)
Next
The css selector combination is ._1oem[href], which selects elements with a class of _1oem that have an href attribute. The "." is a class selector and the [] is an attribute selector. It is a fast and robust method.
The above assumes there are no parent form/frame/iframe tags to negotiate.
An alternative selector that matches on the two attributes, rather than the class would be:
html.querySelectorAll("[data-testid='marketplace_feed_item'][href]")
Full example:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer
With IE
.Visible = True
.navigate "https://www.facebook.com/marketplace/vancouver/entertainment"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim aNodeList As Object, i As Long
Set aNodeList = IE.document.querySelectorAll("._1oem[href]")
For i = 0 To aNodeList.Length - 1
ActiveSheet.Cells(i + 2, 4) = aNodeList.item(i)
Next
'Quit '<== Remember to quit application
End With
End Sub
You only ask for the first anchor element within each element with a _3-98 class. Iterate through the collection of anchor elements within the parent element.
...
Dim j As Long
Set my_data = ie.document.getElementsByClassName("_65db")
For Each elem In my_data
For i = 0 To elem.getElementsByTagName("a").Length - 1
j = j + 1
ActiveSheet.Cells(j, 4).Value = elem.getElementsByTagName("a")(i).href
Next i
Next elem
...