VBA HTML Navigate Through Listings - html

I've got the following code that WORKS, and pulls all of the links for listings in the webpage below. I am now looking to expand this to pull the next page of results (up to n). I took a stab at doing this (second half of this code), but nothing is being displayed.
NOTE: In this sample of code, I was attempting to place the second page of links in Column B, but in an ideal world, I'd like to add the links to the bottom of the results of Page 1 (in Column A).
UPDATE: This code now moves to each page of results, but it pastes the same links into columns A, B, C, etc. I am not sure how this is happening, as I can watch the browser changing URLs as it goes.
Also, if you have any better ways of doing this (rather than copy/pasting this 10x to get the amount of results I am looking for), please let me know!
Option Explicit
Public Sub GetLinks()
Dim ie As New InternetExplorer, ws As Worksheet, t As Date
Dim k As Integer
Const MAX_WAIT_SEC As Long = 10
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
k = 0
Do While k < 10
.Navigate2 "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn=" & k
While .Busy Or .readyState < 4: DoEvents: Wend
Dim Links As Object, i As Long, count As Long
t = Timer
Do
On Error Resume Next
Set Links = .Document.querySelectorAll(".s-item__link[href]")
count = Links.Length
On Error GoTo 0
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While count = 0
For i = 0 To Links.Length - 1
ws.Cells(i + 1, k + 1) = Links.item(i)
Next
k = k + 1
Loop
.Quit
End With
End Sub

I would add a test to ensure the number of pages you request is not greater than the number available. I would also modularize the code a little by pulling out the info extraction step, use arrays and some basic optimization (ScreenUpdating) to speed up the whole process, and get rid of the ie object as soon as possible.
This was written with the listings-per-page count set to 200 (which in fact gives 211 results per page with the given selector). I am not sure whether this is an eBay setting that is remembered or the default.
Option Explicit
Public Sub GetInfo()
Dim ie As InternetExplorer, nodeList As Object, page As Long, totalResults As Long, ws As Worksheet
Const RESULTS_PER_PAGE = 211
Const DESIRED_PAGES = 3
Const BASE = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn="
Dim results(), url As String, maxPages As Long
ReDim results(1 To DESIRED_PAGES)
Application.ScreenUpdating = False
Set ie = New InternetExplorer
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
For page = 1 To DESIRED_PAGES
url = BASE & page
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
If page = 1 Then
totalResults = Replace$(.document.querySelector(".srp-controls__count-heading").innerText, " results", vbNullString)
maxPages = -Int(-totalResults / RESULTS_PER_PAGE) 'round up so a partial last page is counted
End If
Set nodeList = .document.querySelectorAll("#srp-river-results .s-item__link[href]")
results(page) = GetLinks(nodeList)
Set nodeList = Nothing
If page >= maxPages Then Exit For 'stop once the last available page has been read
Next
.Quit
End With
If maxPages < DESIRED_PAGES Then ReDim Preserve results(1 To maxPages)
For page = LBound(results) To UBound(results)
If page = 1 Then
ws.Cells(1, 1).Resize(UBound(results(page), 1)) = Application.Transpose(results(page))
Else
ws.Cells(GetLastRow(ws, 1) + 1, 1).Resize(UBound(results(page), 1)) = Application.Transpose(results(page))
End If
Next
Application.ScreenUpdating = True
End Sub
Public Function GetLinks(ByVal nodeList As Object) As Variant
Dim results(), i As Long
ReDim results(1 To nodeList.Length)
For i = 0 To nodeList.Length - 1
results(i + 1) = nodeList.item(i)
Next
GetLinks = results
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
End With
End Function
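The page-count check described in this answer can also be isolated into small pure functions, which makes it easier to test without a browser. This is only a sketch: PagesAvailable and PagesToFetch are hypothetical helper names that are not part of the code above, and the figure of 211 results per page is the value observed in this answer, not a documented eBay constant.

```vba
Option Explicit

' Hypothetical helper: number of pages needed to show totalResults items,
' rounding up so that a partial last page is counted.
Public Function PagesAvailable(ByVal totalResults As Long, ByVal resultsPerPage As Long) As Long
    If resultsPerPage <= 0 Then Exit Function ' returns 0 for invalid input
    PagesAvailable = -Int(-totalResults / resultsPerPage)
End Function

' Hypothetical helper: never request more pages than are available.
Public Function PagesToFetch(ByVal desiredPages As Long, ByVal totalResults As Long, ByVal resultsPerPage As Long) As Long
    Dim available As Long
    available = PagesAvailable(totalResults, resultsPerPage)
    If desiredPages < available Then
        PagesToFetch = desiredPages
    Else
        PagesToFetch = available
    End If
End Function
```

For example, PagesToFetch(3, 422, 211) gives 2, because only two full pages exist for 422 results.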

Untested (and I might be missing something), but seems like you can just specify the page you want to access with URL query parameter _pgn.
So for example, navigating to the URL below:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn=2
means you're requesting page 2 (also, the _ipg parameter seems to dictate how many results are shown on a page, so increasing to 200 might mean you need to make fewer requests overall).
So if you create a variable such as pageNumber in your code and increment it inside a loop (one that terminates once you've reached the last page), you should be able to get all pages, or even any page at an arbitrary index, without copy-pasting or repeating yourself in your code.
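As a rough sketch of that idea (as untested against eBay as the suggestion itself), the loop might look like the following. BuildPageUrl and LAST_PAGE are hypothetical names introduced for illustration, and the value 3 is an arbitrary assumption; in real code the last page would have to be worked out from the results.

```vba
Option Explicit

' Hypothetical helper: append the page number to a URL that ends in "_pgn=".
Public Function BuildPageUrl(ByVal baseUrl As String, ByVal pageNumber As Long) As String
    BuildPageUrl = baseUrl & CStr(pageNumber)
End Function

Public Sub ListPageUrls()
    Const BASE As String = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn="
    Const LAST_PAGE As Long = 3 ' assumed value, for illustration only
    Dim pageNumber As Long
    For pageNumber = 1 To LAST_PAGE
        ' In the real code you would navigate to this URL and scrape the page here.
        Debug.Print BuildPageUrl(BASE, pageNumber)
    Next pageNumber
End Sub
```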

Related

How can I scrape multiple pages/links at once using VBA?

I'm currently trying to scrape info from this Reddit page. My goal is to make Excel open all the posts in new tabs, and then I want to scrape information from each of those pages, since the starting page doesn't have as much information.
I've been trying for the last few hours to figure this out, but I'm admittedly pretty confused about how to do it, just overall unsure what to do next, so any pointers would be greatly appreciated!
Here is my current code, it works decently enough but as I said, I'm not sure what I should do next to open the links it finds one by one and scrape each page for data.
The links are scraped off that first page and then added to my spreadsheet right now, but if possible I'd like to just skip that step and scrape them all at once.
Thanks! :)
Sub GetData()
Dim objIE As InternetExplorer
Dim itemEle As Object
Dim upvote As Integer, awards As Integer, animated As Integer
Dim postdate As String, upvotepercent As String, oc As String, filetype As String, linkurl As String, myhtmldata As String, visiComments As String, totalComments As String, removedComments As String
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = False
objIE.navigate (ActiveCell.Value)
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each itemEle In objIE.document.getElementsByClassName("flat-list buttons")
visiComments = itemEle.getElementsByTagName("a")(0).innerText
linkurl = itemEle.getElementsByTagName("a")(0).href
Sheets("Sheet1").Range("A" & y).Value = visiComments
Sheets("Sheet1").Range("B" & y).Value = linkurl
y = y + 1
Next
End Sub
You should be able to gather the URLs, then visit each one in a loop, write the results from each visited page to an array, and finally write the array to the sheet.
Note: Opening the posts in separate tabs would, at best, save time on the page loads, as VBA is single threaded. To do that you would need to store a reference to each tab, or open all the tabs first and then loop through the relevant open windows to do the scrape. My preference would be to stay in the same tab, to be honest.
Add the following after your existing line
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Add:
Dim nodeList As Object, i As Long, urls(), results()
Set nodeList = objIE.document.querySelectorAll(".comments") 'objIE is the browser variable from the question's code
ReDim urls(0 To nodeList.Length - 1)
ReDim results(1 To nodeList.Length, 1 To 3)
'Store all urls in an array to later loop
For i = 0 To nodeList.Length - 1
urls(i) = nodeList.item(i).href
Next
For i = LBound(urls) To UBound(urls)
objIE.Navigate2 urls(i)
While objIE.Busy Or objIE.readyState <> 4: DoEvents: Wend
'may need a pause here
results(i + 1, 1) = objIE.document.querySelector("a.title").innerText 'title
results(i + 1, 2) = objIE.document.querySelector(".number").innerText 'upvotes
results(i + 1, 3) = objIE.document.querySelector(".word").NextSibling.nodeValue '%
Next
ActiveSheet.Cells(1, 1).Resize(UBound(results, 1), UBound(results, 2)) = results

VBA Web Scraping using getElementsByClassName to names and addresses

I'm trying to extract the clinic name and corresponding address for all the clinics from the following web page: https://medimap.ca/Location/Calgary,%20AB,%20Canada
I'm having issues locating the exact area where I should be drilling down into. All the clinic names have the same class name of "_1FLG5" and the addresses are all "_1-Gov" . However, when I run through the below code nothing happens - no errors just nothing.
I'm also unsure if the reference after .getElementsByClassName is correct, as I want the inner text from the same row as where the "_1FLG5" is I referenced (0) and since I wanted the text from two rows below "_1-Gov" I referenced (2).
Option Explicit
Sub GetClinicData()
Dim objIE As InternetExplorer
Dim clinicEle As Object
Dim clinicAdd As Object
Dim clinicName As String
Dim address As String
Dim y As Integer
Dim x As Integer
Set objIE = New InternetExplorer
objIE.Visible = False
objIE.navigate "https://medimap.ca/Location/Calgary,%20AB,%20Canada"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each clinicEle In objIE.document.getElementsByClassName("_1FLG5")
clinicName = clinicEle.getElementsByClassName("_1FLG5")(0).innerText
Sheets("Sheet1").Range("A" & y).Value = clinicName
y = y + 1
Next
x = 1
For Each clinicAdd In objIE.document.getElementsByClassName("_1-Gov")
clinicAdd = clinicAdd.getElementsByClassName("_1-Gov")(2).innerText
Sheets("Sheet1").Range("B" & x).Value = clinicAdd
x = x + 1
Next
End Sub
The content is dynamically loaded, so you need a wait condition to ensure it has loaded; otherwise your collections end up with a length of 0. I use querySelectorAll to apply the class names as CSS class selectors; it returns a nodeList, which you loop over using its .Length. Ideally you should add a timeout condition to the wait loop so that it cannot run forever if the content never appears.
Option Explicit
'VBE > Tools > References: Microsoft Internet Controls
Public Sub GetData()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.Navigate2 "https://medimap.ca/Location/Calgary,%20AB,%20Canada"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim clinics As Object, addresses As Object, i As Long
With .document
Do
Set clinics = .querySelectorAll("._1FLG5")
Set addresses = .querySelectorAll("._1-Gov")
Loop While clinics.Length = 0
For i = 0 To clinics.Length - 1
With ThisWorkbook.Worksheets("Sheet1")
.Cells(i + 1, 1) = Trim$(clinics.item(i).innerText)
.Cells(i + 1, 2) = Trim$(addresses.item(i).innerText)
End With
Next
End With
.Quit
End With
End Sub
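A timeout for the wait loop could be sketched roughly as below. This is not part of the code above: ElapsedSeconds and WaitForNodes are hypothetical helper names, and the wait is based on VBA's Timer function, which returns seconds since midnight and therefore needs a small adjustment if the wait happens to span midnight.

```vba
Option Explicit

' Hypothetical helper: seconds between two Timer readings.
' Timer resets to 0 at midnight, so a negative difference is shifted by one day.
Public Function ElapsedSeconds(ByVal startTime As Single, ByVal currentTime As Single) As Single
    Dim elapsed As Single
    elapsed = currentTime - startTime
    If elapsed < 0 Then elapsed = elapsed + 86400!
    ElapsedSeconds = elapsed
End Function

' Hypothetical helper: poll doc until the selector matches something or maxWaitSec passes.
' Returns the last nodeList obtained, which has Length 0 if the wait timed out.
Public Function WaitForNodes(ByVal doc As Object, ByVal selector As String, ByVal maxWaitSec As Long) As Object
    Dim nodes As Object, t As Single
    t = Timer
    Do
        DoEvents
        Set nodes = doc.querySelectorAll(selector)
        If nodes.Length > 0 Then Exit Do
    Loop While ElapsedSeconds(t, Timer) < maxWaitSec
    Set WaitForNodes = nodes
End Function
```

With something like this in place, the Do loop could be replaced by Set clinics = WaitForNodes(.document, "._1FLG5", 10), followed by a check that clinics.Length is greater than 0 before writing to the sheet.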

I have developed a web scraping code in VBA but have trouble navigating to the data I need

The code navigates to the page right before the page I need to get to. I need to click the 'request only' button.
The code needs to do the following:
Access the Hertz website, enter the pick-up and drop-off locations, set the pick-up and drop-off dates, then click 'find a vehicle'. It then needs to select a vehicle and scrape the "One-Way fee" that appears. In its entirety, I need the code to do this for all location combinations and all available car groups. For the purposes of this question, I just want help understanding how to get to the page containing the "one-way" fee and then scrape the value into Excel. I will then try to figure out how to get it to loop.
I have tried using the class name, but with no luck. I had to shorten my code to fit, but it still seems to work.
Private Sub test1()
Dim appIE As Object
Dim ws As Worksheet
Dim wb As Workbook
Dim a As String, b As String, c As String, d As String, e As Object, l As Object
Dim PickUp As Object
Dim iL As IHTMLElement 'this declares the html object
Dim f As IHTMLElementCollection ' this declares the collection of html objects
Dim post As Object, Ret As Object, entry As Object
r = 2 ' sets the start row of where to input the One Way fee etc
Set wb = Application.Workbooks("Hertz")
Set ws = wb.Worksheets("One Way Fees")
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "https://www.Hertz.co.za"
.Visible = True
Application.Wait (Now + TimeValue("0:00:03"))
Do While appIE.Busy
DoEvents
Application.Wait (Now + TimeValue("0:00:03"))
Loop
Application.Wait (Now + TimeValue("0:00:03"))
Set g = appIE.document.getElementById("return-location")
g.Click
Application.ScreenUpdating = True
'this part sets the station in and station out cells as well as the pickup/dropoff dates
i = 2 'For i = 2 To 3
With ws
a = 1267
'.Cells(i, 8)
d = 1261
'.Cells(i, 9)
b = "15 - May - 19"
'.Cells(i, 10)
c = "25 - May - 19"
'.Cells(i, 11)
End With
For Each g In appIE.document.getElementsByClassName("return-location")
If g.className = "return-location" Then
g.Click
Exit For
End If
Next g
' finds the pickup branch in html and clicks selection
Set e = appIE.document.getElementById("pickup-depot")
For Each O In e.Options
If O.Value = a Then
O.Selected = True
Exit For
End If
Next
'sets the return branch and clicks the selection
Set e = appIE.document.getElementById("return-depot")
For Each O In e.Options
If O.Value = d Then
O.Selected = True
Exit For
End If
Next
Set post = appIE.document.getElementsByName("pickup-pate")
For Each post In appIE.document.getElementsByName("PickupDate")
post.Value = b
Next post
' sets the return date and clicks the button
Set Ret = appIE.document.getElementsByName("return-date")
For Each Ret In appIE.document.getElementsByName("return-date")
Ret.Value = c
Next Ret
'Clicking find a vehicle
For Each l In appIE.document.getElementsByTagName("input")
If l.className = "btn" Then
l.Click
Exit For
End If
Next
'This is the part where I would need to click the request button to select a vehicle. After this I would need the One Way fee.
'Next
End With
End Sub
I am unsure where the one-way fee is, but for the 'request only' button you can use the class name as a CSS selector:
.select-vehicle
VBA:
Dim requests As Object
Set requests = appIE.document.querySelectorAll(".select-vehicle") 'appIE is the browser variable from the question's code
requests.Item(1).Click '2nd in list (the index is 0-based)
The above returns a nodeList of all the request buttons, which you can index into.

Extracting a series of URL using VBA

I am just trying to run through a list of URL links, but it keeps showing run-time error '91': Object variable or With block variable not set.
The data I want to extract is inside iframes. It does return some of the values, but it gets stuck in the middle of the process with the error.
Below is a sample URL that I want to extract values from: http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201
Public Sub GetInfo()
Dim IE As New InternetExplorer, u As Long
With IE
.Visible = False
For u = 2 To 100
.navigate Cells(u, 1).Value
While .Busy Or .readyState < 4: DoEvents: Wend
With .document.getElementById("bm_ann_detail_iframe").contentDocument
ThisWorkbook.Worksheets("Sheet1").Cells(u, 3) = .getElementById("main").innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 4) = .getElementsByClassName("company_name")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 5) = .getElementsByClassName("formContentData")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 6) = .getElementsByClassName("formContentData")(5).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 7) = .getElementsByClassName("formContentData")(7).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 8) = .getElementsByClassName("formContentData")(8).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 10) = .getElementsByClassName("formContentData")(10).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 11) = .getElementsByClassName("formContentData")(11).innerText
End With
Next u
End With
End Sub
tl;dr
Your error is due to the fact that the number of elements with the given class name differs depending on the number of results on the page, so you can't use fixed indexes. For the page you indicated, the last index for that class, via the iframe, is 9, i.e. ThisWorkbook.Worksheets("Sheet1").cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText . Indexes 10 and 11 are invalid. Below I show a way to determine the number of results and extract info from each result row.
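If you did want to keep index-based access, one way to avoid error 91 is to check the collection's length before indexing. The following is a minimal sketch under that assumption; ItemTextOrEmpty is a hypothetical helper name, and it only relies on the .Length and .item members that the collections returned by getElementsByClassName and querySelectorAll expose.

```vba
Option Explicit

' Hypothetical helper: return the innerText of nodes.item(idx),
' or an empty string when the collection is missing or idx is out of range.
Public Function ItemTextOrEmpty(ByVal nodes As Object, ByVal idx As Long) As String
    If nodes Is Nothing Then Exit Function
    If idx < 0 Or idx >= nodes.Length Then Exit Function
    ItemTextOrEmpty = nodes.item(idx).innerText
End Function
```

A call such as ItemTextOrEmpty(.getElementsByClassName("formContentData"), 10) would then write an empty string for pages that have fewer elements, rather than stopping the loop.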
General principle:
Ok... so the following works on the principle of targeting the Details of Changes table for most of the info.
Example extract:
More specifically, I target the rows that repeat the info for No, Date of Change, #Securities, Type of Transaction and Nature of Interest. These values are stored in an array of arrays (one array per row of information). Then the results arrays are stored in a collection to later be written out to the sheet. I loop each table cell in the targeted rows (td tag elements within parent tr) to populate the arrays.
I add in the Name from the table above on the page. Also, because there can be more than one row of results per webpage, and because I am writing the results to a new Results sheet, I add the URL before each result to indicate the source of the information.
TODO:
Refactor the code to be more modular
Potentially add in some error handling
CSS selectors:
① I select the Name element, which I refer to as title, from the Particulars of substantial Securities Holder table.
Example name element:
Inspecting the HTML for this element shows it has a class of formContentData, and that it is the first element with this class on the page.
Example HTML for target Name:
This means I can use a class selector, .formContentData, to target the element. As I want a single element, I use the querySelector method to apply the CSS selector.
② I target the rows of interest in the Details of Changes table with the selector combination .ven_table tr. This is a descendant combinator: it selects elements with a tr tag that have an ancestor with the class ven_table. As these are multiple elements, I use the querySelectorAll method to apply the CSS selector combination.
Example of a target row:
Example results returned by CSS selector (sample):
The rows I am interested in start at index 1 and repeat every 4 rows after that, e.g. rows 5, 9, etc.
So I use a little maths in the code to return just the rows of interest:
Set currentRow = data.item(i * 4 + 1)
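To make the index arithmetic concrete, here is a small sketch of the same calculation on its own. DataRowIndex is a hypothetical name used only for illustration, and the layout (data rows at positions 1, 5, 9, ... of the nodeList) is the one described above for this particular page.

```vba
Option Explicit

' Hypothetical helper: 0-based position, within the ".ven_table tr" nodeList,
' of the i-th row of interest (i starts at 0).
Public Function DataRowIndex(ByVal i As Long) As Long
    DataRowIndex = i * 4 + 1
End Function

Public Sub PrintDataRowIndexes()
    Dim i As Long
    For i = 0 To 2
        Debug.Print DataRowIndex(i) ' 1, then 5, then 9
    Next i
End Sub
```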
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, headers(), u As Long, resultCollection As Collection
headers = Array("URL", "Name", "No", "Date of change", "# Securities", "Type of Transaction", "Nature of Interest")
Set resultCollection = New Collection
Dim links()
links = Application.Transpose(ThisWorkbook.Worksheets("Sheet1").Range("A2:A3")) 'A100
With IE
.Visible = True
For u = LBound(links) To UBound(links)
If InStr(links(u), "http") > 0 Then
.navigate links(u)
While .Busy Or .readyState < 4: DoEvents: Wend
Application.Wait Now + TimeSerial(0, 0, 1) '<you may not always need this. Or may need to increase.
Dim data As Object, title As Object
With .document.getElementById("bm_ann_detail_iframe").contentDocument
Set title = .querySelector(".formContentData")
Set data = .querySelectorAll(".ven_table tr")
End With
Dim results(), numberOfRows As Long, i As Long, currentRow As Object, td As Object, c As Long, r As Long
numberOfRows = Round(data.Length / 4, 0)
ReDim results(1 To numberOfRows, 1 To 7)
For i = 0 To numberOfRows - 1
r = i + 1
results(r, 1) = links(u): results(r, 2) = title.innerText
Set currentRow = data.item(i * 4 + 1)
c = 3
For Each td In currentRow.getElementsByTagName("td")
results(r, c) = Replace$(td.innerText, "document.write(rownum++);", vbNullString)
c = c + 1
Next td
Next i
resultCollection.Add results
Set data = Nothing: Set title = Nothing
End If
Next u
.Quit
End With
Dim ws As Worksheet, item As Long
If Not resultCollection.Count > 0 Then Exit Sub
If Not Evaluate("ISREF('Results'!A1)") Then '<==Credit to @Rory for this test
Set ws = Worksheets.Add
ws.NAME = "Results"
Else
Set ws = ThisWorkbook.Worksheets("Results")
ws.cells.Clear
End If
Dim outputRow As Long: outputRow = 2
With ws
.cells(1, 1).Resize(1, UBound(headers) + 1) = headers
For item = 1 To resultCollection.Count
Dim arr()
arr = resultCollection(item)
For i = LBound(arr, 1) To UBound(arr, 1)
.cells(outputRow, 1).Resize(1, 7) = Application.WorksheetFunction.Index(arr, i, 0)
outputRow = outputRow + 1
Next
Next
End With
End Sub
Example results using 2 provided tests URLs:
Sample URLs in sheet1:
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5928057
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201

VBA Get HTML Element Info for Changing Id

I am trying to create an Excel web scraper that logs into my company's ticket tracking system and logs certain information on the sheet (Lead assigned, Desired Date for the project, etc.). I was doing fine until I had to pull a field off the website that has a changing ID.
For example, on two pages the same field will have the IDs:
"cq_widget_CqFilteringSelect_32"
"cq_widget_CqFilteringSelect_9"
Can somebody provide guidance on how I should find the "IT Lead" value and paste it into Excel?
HTML snippet of div
Snippet of actual website
Setup in excel
Below is what I have so far
I get confused in this area:
lead = objCollection(i).Value
Sub CQscrub()
Dim i As Long
Dim objElement As Object
Dim objCollection As Object
Dim objCollection2 As Object
Dim ie As InternetExplorer
Dim html As HTMLDocument
Dim numbers() As String
Dim size As Integer
Dim row As Integer
Dim objLead As Object
Dim objLead2 As Object
Dim lead As String
Dim counter As Integer
size = WorksheetFunction.CountA(Worksheets(1).Columns(1)) - 4
ReDim numbers(size)
For row = 10 To (size + 10)
numbers(row - 10) = Cells(row, 1).Value
'Cells(row, 2) = numbers(row - 10)
Next row
Set ie = New InternetExplorer
ie.Height = 1000
ie.Width = 1000
ie.Visible = True
ie.navigate "http://clearquest/cqweb/"
Application.StatusBar = "Loading http://clearquest/cqweb"
Do While ie.Busy
Application.Wait DateAdd("s", 1, Now)
Loop
Application.StatusBar = "Searching form. Please wait..."
'Had these below as comment
Dim WRnumber1 As String
WRnumber1 = Range("A10").Value
'Range("A6").Value = WRnumber1
Dim iLastRow As Integer
Dim Rng As Range
iLastRow = Cells(Rows.Count, "a").End(xlUp).row 'last row of A
'Set objCollection = ie.document.getElementsByTagName("input") originally here
For counter = 0 To size - 1
Set objCollection = ie.document.getElementsByTagName("input")
i = 0
While i < objCollection.Length
If objCollection(i).Name = "cqFindRecordString" Then
objCollection(i).Value = numbers(counter)
End If
i = i + 1
Wend
'''''''''''''''''' Find Label ''''''''''''''''''''''''''''
Set objCollection = ie.document.getElementsByTagName("label")
i = 0
While i < objCollection.Length
If objCollection(i).innerText = "IT Lead/Assigned To" Then
lead = objCollection(i).Value
'Set objLead = objCollection(i)
End If
i = i + 1
Wend
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Range("B" & (iLastRow - (size - counter - 1))).Value = lead
Set objElement = ie.document.getElementById("cqFindRecordButton")
objElement.Click
Do While ie.Busy
Application.Wait DateAdd("s", 1, Now)
Loop
Application.Wait (Now + TimeValue("0:00:02"))
Next counter
ie.Quit
Set ie = Nothing
Set objElement = Nothing
Set objCollection = Nothing
Application.StatusBar = ""
MsgBox "Done!"
End Sub
Note: Website is internal only
Goal: Select Name under "IT Lead/Assigned To" field and paste to Excel
Thanks
Regarding the supplied code, tl;dr.
But if you are wanting the scratched out portion you supplied in your HTML snippet, the following may work (I can't test something that I don't have access to :D).
There are many different ways to grab an element; with this method you are grabbing the first instance of the class name dijitReset dijitInputField dijitInputContainer. Class names are not always unique, but given how specific this class name is, I feel fairly safe that in your case it is.
You could have used one line to Set yourObj... but for demonstration purposes I decided to break it up. 1-liner method to Set your obj:
Set yourObj = doc.getElementsByClassName("dijitReset dijitInputField dijitInputContainer")(0).getElementsByTagName("input")(1)
Code Snippet:
Sub getElementFromIE()
Dim ie As InternetExplorer
' ... your above code pulls up webpage ...
'''''''''''''''''' Find Label ''''''''''''''''''''''''''''
Dim doc As HTMLDocument, yourObj As Object
Set doc = ie.document
' I assume the class name is unique? If so, just append (0) as I did below
Set yourObj = doc.getElementsByClassName("dijitReset dijitInputField dijitInputContainer")(0)
Set yourObj = yourObj.getElementsByTagName("input")(1)
lead = yourObj.Value
End Sub
The reason for the (1) in Set yourObj = yourObj.getElementsByTagName("input")(1) is that there are two input tags inside the element with your class dijitReset.... You want the second instance of this tag, which contains your value; and, as you are probably already aware, the collection is zero-based, so the second instance has index 1.
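Another option, since only the numeric suffix of the ID changes, might be a CSS attribute selector that matches on the start of the id. This is an untested sketch (I have no access to the site): IdPrefixSelector and GetValueByIdPrefix are hypothetical helper names, and it assumes both that the first element whose id starts with cq_widget_CqFilteringSelect_ is the "IT Lead" field and that this element is an input exposing .Value. If there are several such widgets on the page, you would need querySelectorAll and some way of choosing the right one.

```vba
Option Explicit

' Hypothetical helper: build a CSS "starts with" attribute selector for an id prefix.
Public Function IdPrefixSelector(ByVal idPrefix As String) As String
    IdPrefixSelector = "[id^='" & idPrefix & "']"
End Function

' Hypothetical helper: value of the first element whose id starts with idPrefix,
' or an empty string if nothing matches. Assumes the matched element has a .Value.
Public Function GetValueByIdPrefix(ByVal doc As Object, ByVal idPrefix As String) As String
    Dim node As Object
    Set node = doc.querySelector(IdPrefixSelector(idPrefix))
    If Not node Is Nothing Then GetValueByIdPrefix = node.Value
End Function
```

Usage would be along the lines of lead = GetValueByIdPrefix(ie.document, "cq_widget_CqFilteringSelect_").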