Retrieving all Excel file links from a webpage - html

I'm trying to get all the downloadable Excel file links from the website, but having difficulty. Please help to guide me. Thanks.
Sub TYEX()
Dim internet As Object
Dim internetdata As Object
Dim div_result As Object
Dim header_links As Object
Dim link As Object
Dim URL As String
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = True
URL = "https://www.jpx.co.jp/markets/public/short-selling/index.html"
internet.Navigate URL
Do Until internet.ReadyState >= 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 5)
Set internetdata = internet.Document
Set div_result = internetdata.getElementById("readArea")
Set header_links = div_result.getElementsByTagName("td")
For Each h In header_links
Set link = h.ChildNodes.item(0)
Cells(Range("A" & Rows.Count).End(xlUp).Row + 1, 1) = link.href
Next
MsgBox "done"
End Sub

You had the idea down correctly, but here's a different approach:
Sub TYEX()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.navigate "https://www.jpx.co.jp/markets/public/short-selling/index.html"
.Visible = True
Do While .Busy Or .readyState < 4
DoEvents
Loop
Dim doc As Object, tbl As Object
Set doc = .document
Set tbl = doc.getElementsByClassName("component-normal-table")(0).Children(0)
Dim r As Long, xlsArr(), a As Object
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
End With
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With
End Sub
Public Function NextRow(ByVal ws As Worksheet, Optional ByVal col As Variant = 1) As Long
With ws
NextRow = .Cells(.Rows.Count, col).End(xlUp).Row + 1
End With
End Function
Breaking Down the Code
This will loop your html table rows. We start at 1, because 0 is actually just the table header.
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
This uses regular expressions to extract the url from the innerHTML property. You can see how this particular regex works here: Regex101
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
You will size your range to be the same size as your array that contains the links, then write the array to the worksheet. This is usually much faster than writing cells one-by-one.
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With

You can use an attribute = value CSS selector with $ operator to say the href value must end with .xls. Then use querySelectorAll to retrieve all matched results. Using CSS selectors is a very fast and generally robust method.
Dim list As Object
Set list = ie.document.querySelectorAll("[href$='.xls']")
It is much faster to use XMLHTTP as well, rather than opening IE. Note that you can then pass these links to a function to perform a binary download or to URLMon for downloading.
Option Explicit
Public Sub Links()
Dim sResponse As String, html As HTMLDocument, list As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.jpx.co.jp/markets/public/short-selling/index.html", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set list = html.querySelectorAll("[href$='.xls']")
End With
For i = 0 To list.Length - 1
Debug.Print Replace$(list.item(i), "about:", "https://www.jpx.co.jp")
Next
End Sub
Example download function (though you could re-use your existing XMLHTTP object - this is just to illustrate):
Public Function DownloadFile(ByVal downloadFolder As String, ByVal downloadURL As String) As String
Dim http As Object , tempArr As Variant
Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
http.Open "GET", downloadURL, False
http.send
On Error GoTo errhand
With CreateObject("ADODB.Stream")
.Open
.Type = 1
.write http.responseBody
tempArr = Split(downloadURL, "/")
tempArr = tempArr(UBound(tempArr))
.SaveToFile downloadFolder & tempArr, 2 '< "/" on enter of downloadFolder. 2 for overwrite which is Ok if no file modifications.
.Close
End With
DownloadFile = downloadFolder & tempArr
Exit Function
errhand:
If Err.Number <> 0 Then
Debug.Print Err.Number, Err.Description
MsgBox "Download failed"
End If
DownloadFile = vbNullString
End Function
References (VBE > Tools > References):
Microsoft HTML Object Library

Related

Webscrape a specific part of a webpage

My webscrape stopped working. The owner changed the html.
I believe it is the Set allElements = doc.getElementsByClassName("el-col el-col-8") line that needs changing.
I am trying to grab text from the webpage that includes the "52-week Range (undefined)" section. I managed to grab text from before and after but not the section I need. An example webpage is https://www.gurufocus.com/stock/gliba/summary and my code should fill my cell with "38.72 - 73.63" after I do some trimming.
I need to do it this way so I can get my head round it and change it in the future when necessary so please just focus on correcting my set line of code (assuming that is the problem!) rather than a whole new more sophisticated method as it will be beyond me. (My other set line of code does what I want it to do.)
Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim incomeStmtURLs As Variant
Dim sURL As String
Dim lastrow As Long
Dim allRowOfData As Object
Dim i As Integer
Dim allElements As IHTMLElementCollection
Dim anElement As IHTMLElement
Dim aCell As HTMLTableCell
Application.DisplayAlerts = False
Call ToggleEvents(False)
incomeStmtURLs = Range("Sheet1!h1:h2").Value
For i = 1 To UBound(incomeStmtURLs)
Set wb = CreateObject("internetExplorer.Application")
sURL = incomeStmtURLs(i, 1)
wb.navigate sURL
wb.Visible = False
While wb.Busy
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
Set doc = wb.document
On Error GoTo err_clear
Set allElements = doc.getElementsByClassName("el-col el-col-8")
While allElements.Length = 0
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
x = allElements(0).innerText
' Debug.Print x
Sheet6.Cells(i + 1, 2).Value = Trim(Replace(Mid(x, InStr(1, x, "52-Week Range (undefined)") + 25, 25), vbLf, ""))
Set allElements = doc.getElementsByClassName("fs-x-large fc-primary fw-bolder")
x = allElements(0).innerText
Sheet6.Cells(i + 1, 4).Value = Trim(Replace(Mid(x, InStr(1, x, "$") + 1, 7), vbLf, ""))
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Next i
Call ToggleEvents(True)
End Sub
Sub ToggleEvents(blnState As Boolean)
Application.DisplayAlerts = blnState
Application.EnableEvents = blnState
If blnState Then Application.CutCopyMode = False
If blnState Then Application.StatusBar = False
End Sub
The page dynamically updates content as you scroll down. You likely need to scroll that part of the page into view then use grab all the elements with classname statictics-item then take the n-2 index e.g. Without the scrolling part:
Set elems = ie.document.getElementsByClassName("statictics-item")
If elems.length > 1 Then Debug.print elems(elems.length-2).innerText
For future readers (I know OP doesn't want this):
I would avoid the whole scrolling pickle, dynamic html and browser and issue an xmlhttp request and regex out the appropriate values from the javscript objects the web page uses for updating. N.B. I would probably add in validation on regex match positions.
Public Sub test()
Dim r As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.gurufocus.com/stock/gliba/summary", False
.send
r = GetMatches(.responseText, "price52wlow:(.*?),|price52whigh:(.*?),")
If r <> "NA" Then MsgBox r
End With
End Sub
Public Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As String
Dim matches As Object
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
'If .test(inputString) Then
Set matches = .Execute(inputString)
If matches.Count = 2 Then
GetMatches = matches.Item(0).submatches(0) & "-" & matches.Item(1).submatches(1)
Else
GetMatches = "NA"
End If
End With
End Function
Regex:

Scraping a specific <p> class from Yahoo Finance (VBA, Excel)

I have been trying to extract specific information from a certain nested class from the code at the following location
https://finance.yahoo.com/quote/ASUR?p=ASUR
The class where in innertext is located is named "D(ib) Va(t)" and as far as I have seen at least this text is unique for the class name. I am using the following code to get the data.
Private Sub CommandButton1_Click()
Dim IE4 As Object
Dim strURL3 As String
Dim divs1 As Object
Dim symbol1 As String
Dim rowd As Integer
Dim divs2 As Object
'turn calculation off
Application.ScreenUpdating = False
Application.DisplayAlerts = False
Application.Calculation = xlCalculationManual
rowd = 1
'Sheets.Add After:=Sheets(ActiveWorkbook.Sheets.Count)
'Sheets(ActiveSheet.Name).Name = "Industry"
'Worksheets("Panel").Activate
'Range("B9").Select
Set IE4 = CreateObject("InternetExplorer.Application") 'Create only one IE instance
'Do Until ActiveCell.Value = "" 'Loop
'symbol1 = ActiveCell.Value
strURL3 = "https://finance.yahoo.com/quote/ASUR?p=ASUR"
IE4.Visible = True 'Flag to remove IE visibility
VBA.Shell "RunDll32.exe InetCpl.Cpl, ClearMyTracksByProcess 264", vbMinimizedNoFocus
VBA.Shell "RunDll32.exe InetCpl.Cpl, ClearMyTracksByProcess 258", vbMinimizedNoFocus
IE4.Navigate strURL3
While IE4.Busy: DoEvents: Wend 'Break of 4 seconds after loading
Application.Wait (Now + TimeValue("0:00:04"))
Set divs1 = IE4.Document.getelementsbytagname("div")
Worksheets("Industry").Activate
ActiveSheet.Cells("1,2").Select
For Each div In divs1
Set divs2 = IE4.Document.getelementsbytagname("p")
For Each p In divs2
If p.classname = "D(ib) Va(t)" Then
Debug.Print p.innertext
'Cells(rowd, 2) = p.innertext
'rowd = rowd + 1
End If
Next p
Next div
'Sheets("Panel").Select
' ActiveCell.Offset(1, 0).Select
'Loop
'Sheets("Panel").Select
'Range("B9").Select 'range that selects rows and columns to paste in every company sheet
'Range(Selection, Selection.End(xlDown)).Select
'Range(Selection, Selection.End(xlDown)).Select
'Range(Selection, Selection.End(xlDown)).Select
' Selection.Copy
' Sheets("Industry").Select
'Range("A1").Select
'ActiveSheet.Paste
' Application.CutCopyMode = False
IE4.Quit
'turn calculation on
Application.Calculation = xlCalculationAutomatic
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub
It does not capture the class, can anyone help me with this?
It's not a good practice to go for selecting compound classes as long as there is a way to avoid that. Check out the following implementation to achieve the same result:
Sub FetchText()
Const Link$ = "https://finance.yahoo.com/quote/ASUR?p=ASUR"
Dim Http As New XMLHTTP60, Htmldoc As New HTMLDocument
With Http
.Open "GET", Link, False
.send
Htmldoc.body.innerHTML = .responseText
End With
MsgBox Htmldoc.querySelector("p.businessSummary").PreviousSibling.LastChild.innerText
End Sub
The p tag element with that class includes the company sector, industry and employee info. You can use a faster method of xmlhttp to retrieve by avoiding opening a browser. Then use a css selector combination to target the element
Option Explicit
Public Sub GetInfo()
Const URL As String = "https://finance.yahoo.com/quote/ASUR?p=ASUR"
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.Send
html.body.innerHTML = .responsetext
End With
Debug.Print html.querySelector("p.D\(ib\).Va\(t\)").innerText
End Sub
If you want to avoid compound classes you can use the following:
Debug.Print html.querySelector("p + .D\(ib\)").innerText
References (VBE > Tools > References):
Microsoft HTML Object Library

Web scraping using VBA

I would like to extract data from this URL.
I want to extract Title, mobile contact number and address from each of 10 business cards.
Here is some code I tried but did not get success.
Public Sub GetValueFromBrowser()
On Error Resume Next
Dim Sn As Integer
Dim ie As Object
Dim url As String
Dim Doc As HTMLDocument
Dim element As IHTMLElement
Dim elements As IHTMLElementCollection
For Sn = 1 To 1
url = Sheets("Infos").Range("C" & Sn).Value
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = 0
.navigate url
While .Busy Or .readyState <> 4
DoEvents
Wend
End With
Set Doc = ie.document
Set elements = Doc.getElementsByClassName(" col-sm-5 col-xs-8 store-details sp-detail paddingR0")
Dim count As Long
Dim erow As Long
count = 0
For Each element In elements
If element.className = "lng_cont_name" Then
erow = Sheet1.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
Cells(erow, 1) = Doc.getElementsByClassName("Store-Name")(count).innerText
Cells(erow, 2) = Doc.getElementsByClassName("cont_fl_addr")(count).innerText
count = count + 1
End If
Next element
If Val(Left(Sn, 2)) = 99 Then
ActiveWorkbook.Save
End If
Next Sn
End Sub
The tel numbers were not easy as I think they have purposefully been made difficult to scrape. I have found a way to decipher the values from the CSS pseudo ::before element content. The addresses and titles were straightforward CSS selections.
I have since written a cleaner script in python here.
So, how do the various parts of the code work?
titles:
Set titles = .querySelectorAll(".jcn [title]")
I target the titles as elements that have a title attribute with a parent jcn class attribute. The "." indicates a class selector, the "[]" an attribute selector, and the " " in between is a descendant combinator.
querySelectorAll method of document returns a nodeList of all matching elements on the page i.e. the 10 titles.
addresses:
Set addresses = .querySelectorAll(".desk-add.jaddt")
Addresses are targeted by their class attribute desk-add jaddt. As compound class names are not allowed, an additional "." has to be replace the white space in the name.
Telephone numbers (via deciphering contents within storesTextToDecipher) :
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
This is where the magic happens. The numbers are not available via the DOM directly as they are pseudo element content.
If you inspect the relevant HTML you will find a series of pseudo ::before elements. VBA exposes no mechanism for applying pseudo selectors to try and get at this info in the CSS for the page.
What you see is in fact a series of span elements that each have a class attribute beginning with mobilesv. These elements sit within a single parent element of class col-sm-5 col-xs-8 store-details sp-detail paddingR0 (note again the compound class name).
I initially gather a nodeList of all the parent elements.
Sample of returned elements:
Each of these parent elements houses the class name (beginning with mobilesv) elements that constitute the characters of the telephone number string. Some characters are numbers in the string, others represent the +()-. Note: The 2|3 letter strings, in the class names, after icon- e.g. dc, fe.
For example, the first search result on the page, for the initial number 9 in the telephone number:
The actual CSS content for this pseudo element /telephone character can be observed in the CSS styling:
Note the class name and before pseudo element selector: .icon-ji:before
And the content is \9d010.
Long story short.... you can extract the 2 or 3 letters after icon- i.e. ji in this case, and the number string after \9d0, i.e. 10 in this case, and use these two bits of info to decipher the telephone number. This info is available in the response:
See the same 2/3 letter strings that are associated with the class names of the telephone string on the left, and the content instructions on the right. A little maths deduces that the number on the right is 1 greater than the telephone number, for that class, shown in the image of the webpage. I simply create a dictionary that then maps the 2/3 letter abbreviation to the telephone number by parsing this section of the html.
When looping over storesTextToDecipher, I use this dictionary to decipher the actual telephone number from the matching 2/3 letter abbreviation in the class name.
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
html.body.innerHTML = sResponse
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
With html
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
End With
For i = 0 To titles.Length - 1
Debug.Print "title: " & titles.item(i).innerText
Debug.Print "address: " & addresses.item(i).innerText
Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Sample output:
Edit: All page results
As you now want more than 10 the following uses the expected page result count (NUMBER_RESULTS_ON_PAGE) to gather the information from the page. It scrolls the page until the expected number of telephone numbers (which should be unique) are found, or the MAX_WAIT_SEC is hit. This means you avoid an infinite loop and can set your expected result count if you expect a different number. It does rely on each store having a telephone number listed - this seems a fairly reasonable assumption.
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
Const NUMBER_RESULTS_ON_PAGE As Long = 80
Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
Application.ScreenUpdating = True
Set resultCountDict = CreateObject("Scripting.Dictionary")
Set cipherDict = GetCipherDict(URL)
With IE
.Visible = True
.Navigate2 URL
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
t = Timer
Do
DoEvents
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
Dim telNumber As String, i As Long
For i = 0 To titles.Length - 1
telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
If Not resultCountDict.Exists(telNumber) Then
resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
End If
Next
.parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
While IE.Busy Or IE.readyState < 4: DoEvents: Wend
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE
End With
.Quit
End With
Dim key As Variant, rowCounter As Long
rowCounter = 1
With ThisWorkbook.Worksheets("Sheet1")
For Each key In resultCountDict.keys
.Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
rowCounter = rowCounter + 1
Next
End With
Application.ScreenUpdating = True
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Public Function GetCipherDict(ByVal URL As String) As Object
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
Set GetCipherDict = cipherDict
End Function
EDIT:
Version for where more than one number is present at top (Please note that if you make too many requests or too quickly server will serve you random pages):
Option Explicit
Public Sub GetDetails()
Dim re As Object, decodeDict As Object, i As Long
Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
Set decodeDict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
Set html = New MSHTML.htmlDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
.setRequestHeader "User-Agent", "Mozilla/4.0"
.send
responseText = .responseText
html.body.innerHTML = responseText
End With
keys = GetMatches(re, responseText, "-(\w+):before")
If UBound(keys) = 0 Then Exit Sub
values = GetMatches(re, responseText, "9d0(\d+)", True)
For i = LBound(values) To UBound(values)
decodeDict(keys(i)) = values(i)
Next
Dim itemsToDecode()
decodeDict(keys(UBound(keys))) = "+"
itemsToDecode = GetValuesToDecode(html)
PrintNumbers re, html, itemsToDecode, decodeDict
End Sub
Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .Test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
If numeric Then
arrMatches(i) = iMatch.SubMatches.item(0) - 1
Else
If spanSearch Then
arrMatches(i) = iMatch
Else
arrMatches(i) = iMatch.SubMatches.item(0)
End If
End If
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
Dim i As Long, elements As Object, results(), Class As String
Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
ReDim results(elements.Length - 1)
For i = 0 To elements.Length - 1
Class = elements.item(i).className
results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
Next
GetValuesToDecode = results
End Function
Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
Dim output As String, i As Long
For i = LBound(itemsToDecode) To UBound(itemsToDecode)
output = output & decodeDict(itemsToDecode(i))
Next
Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
htmlToSearch = html.querySelector(".telCntct").outerHTML
groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
startPos = 1
Dim totalNumbers As Long
For i = LBound(groups) To UBound(groups)
If InStr(groups(i), ",") > 0 Then
totalNumbers = totalNumbers + 1
Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
startPos = i + 1
End If
Next
If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub

Parsing HTML with VBA

I am trying to pull data from some 500 urls of a website. All the pages are same in structure. I am facing a problem with understanding the HTML of this particular site
https://www.coworker.com/s-f/6033/united-states_hawaii_honolulu_impact-hub-honolulu
I want to extract Name, Address, Tel and website. My current code:
Sub GetData()
Dim wsSheet As Worksheet, Rows As Long, links As Variant, IE As Object, link As Variant
Set wb = ThisWorkbook
Set wsSheet = wb.Sheets("Sheet1")
Set IE = New InternetExplorer
Rows = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row
links = wsSheet.Range("A1:A" & Rows)
With IE
.Visible = True
For Each link In links
.navigate (link)
While .Busy Or .readyState <> 4: DoEvents: Wend
Next
End With
End Sub
Here you go. Without more links to test with this is very fragile. It relies heavily on consistent styling across pages.
XHR Looping link list:
Option Explicit
Public Sub GetInfo()
Dim wsSheet As Worksheet, Rows As Long, links(), link As Long, wb As Workbook, html As HTMLDocument
Set wb = ThisWorkbook: Set wsSheet = wb.Sheets("Sheet1")
Application.ScreenUpdating = False
With wsSheet
Rows = .Cells(.Rows.Count, "A").End(xlUp).Row
If Rows = 1 Then
ReDim links(1 To 1, 1 To 1)
links(1, 1) = wsSheet.Range("A1")
Else
links = wsSheet.Range("A1:A" & Rows).Value
End If
Dim r As Long
For link = LBound(links, 1) To UBound(links, 1)
r = r + 1
Set html = GetHTML(links(link, 1))
On Error Resume Next
Dim aNodeList As Object: Set aNodeList = html.querySelectorAll(".col-xs-12.pade_none.muchroom_mail")
.Cells(r, 2) = "Name: " & html.querySelector(".col-sm-9.col-md-9.col-xs-12.pade_none.kohub_space_headings h2").innerText
.Cells(r, 3) = "Address: " & aNodeList.item(0).innerText
.Cells(r, 4) = "Tel: " & aNodeList.item(1).innerText
.Cells(r, 5) = "Website: " & html.querySelector(".website-link-text a[href]").getAttribute("href")
On Error GoTo 0
Next link
End With
Application.ScreenUpdating = True
End Sub
Public Function GetHTML(ByVal url As String) As HTMLDocument
Dim sResponse As String, html As New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
With html
.body.innerHTML = sResponse
End With
Set GetHTML = html
End Function
Output:
References (VBE>Tools>References):
HTML object Library
Internet Explorer:
Option Explicit
Public Sub GetInfo()
Dim wsSheet As Worksheet, Rows As Long, links(), link As Long, wb As Workbook, html As HTMLDocument, ie As InternetExplorer
Set wb = ThisWorkbook: Set wsSheet = wb.Sheets("Sheet1")
Application.ScreenUpdating = False
With wsSheet
Rows = .Cells(.Rows.Count, "A").End(xlUp).Row
If Rows = 1 Then
ReDim links(1, 1)
links(1, 1) = wsSheet.Range("A1")
Else
links = wsSheet.Range("A1:A" & Rows).Value
End If
Dim r As Long
Set ie = New InternetExplorer
ie.Visible = True
For link = LBound(links, 1) To UBound(links, 1)
ie.navigate links(link, 1)
While ie.Busy Or ie.readyState < 4: DoEvents: Wend
' Application.Wait Now + TimeSerial(0, 0, 10)
On Error Resume Next
r = r + 1: Set html = ie.document
.Cells(r, 2) = "Name: " & html.querySelector(".col-sm-9.col-md-9.col-xs-12.pade_none.kohub_space_headings h2").innerText
.Cells(r, 3) = "Address: " & html.querySelector(".col-xs-12.pade_none.muchroom_mail").innerText
.Cells(r, 4) = "Tel: " & html.querySelector(".fa.fa-phone.fa-rotate-270 ~ a").innerText
.Cells(r, 5) = "Website: " & html.querySelector(".website-link-text a[href]").getAttribute("href")
On Error GoTo 0
Next link
ie.Quit
End With
Application.ScreenUpdating = True
End Sub
References (VBE>Tools>References):
HTML object Library
Microsoft Internet Controls

Web-scraping from Excel List of PDGA Numbers using VBA

I have a list of numbers (PDGA Numbers) in MS Excel. I would like to automatically search the PDGA website (https://www.pdga.com/players/) from the list and automatically paste the player's location next to the corresponding PDGA Number. Currently, I am able to search the number and paste the location individually, but not the entire list.
First I select an excel cell and 'Define Name' as PDGA, and another as Location.
https://imgur.com/AcGtuX8
Then I basically followed this YouTube video. https://www.youtube.com/watch?v=7sZRcaaAVbg
And ultimately got this VBA code to work. (Make sure the proper VBA References are checked)
https://imgur.com/a/OYSM7Am
Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Column = Range("PDGA").Column Then
Dim IE As New InternetExplorer
IE.Visible = False
IE.navigate "https://www.pdga.com/player/" & Range("PDGA").Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
Dim sDD As String
sDD = Trim(Doc.getElementsByTagName("li")(127).outerText)
Range("Location").Value = sDD
End If
End Sub
I think I need some For Each loop, but I'm not sure. It should look like this when completed.
https://imgur.com/a/qOiW4JJ
Thanks in advance for any help.
If you have a specific list of players then you loop and issue XHR requests to get the info. Here I have the PDGA# in an array which is looped:
playerPDGA = Array(1, 5, 23, 46, 789, 567)
Code:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As New HTMLDocument, playerPDGA(), results(), i As Long
playerPDGA = Array(1, 5, 23, 46, 789, 567)
ReDim results(0 To UBound(playerPDGA), 0 To 1)
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(playerPDGA) To UBound(playerPDGA)
.Open "GET", "https://www.pdga.com/player/" & playerPDGA(i), False
.send
sResponse = StrConv(.responseBody, vbUnicode)
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
With html
.body.innerHTML = sResponse
results(i, 0) = .querySelector(".pane-content > h1").innerText
results(i, 1) = .querySelector(".location").innerText
End With
Next i
End With
ActiveSheet.Cells(1, 1).Resize(UBound(results, 1) + 1, UBound(results, 2) + 1) = results
End Sub
For any page listing players:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As New HTMLDocument
Application.ScreenUpdating = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.pdga.com/players/", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim nameList As Object, cityList As Object, stateList As Object, countryList As Object, r As Long
With html
.body.innerHTML = sResponse
Set nameList = .querySelectorAll(".views-field.views-field-nothing")
Set cityList = .querySelectorAll(".views-field.views-field-City.city")
Set stateList = .querySelectorAll(".views-field.views-field-StateProv.state")
Set countryList = .querySelectorAll(".views-field.views-field-Country.country")
End With
With ActiveSheet
Dim i As Long
For i = 0 To nameList.Length - 1
r = r + 1
.Cells(r, 1) = nameList.item(i).innerText
.Cells(r, 2) = Trim$(cityList.item(i).innerText & Chr$(32) & stateList.item(i).innerText & Chr$(32) & countryList.item(i).innerText)
Next i
End With
Application.ScreenUpdating = True
End Sub
Reference:
HTML Object library
You can achieve your desired output in several ways. Here is one of such.
Sub FetchData()
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim post As HTMLDivElement, Idic As New Scripting.Dictionary
Dim key As Variant, N$, CT$, S$, C$, R&
With Http
.Open "GET", "https://www.pdga.com/players/", False
.send
Html.body.innerHTML = .responseText
End With
For Each post In Html.querySelector("table.views-table tbody").getElementsByTagName("tr")
N = post.querySelector("a[title]").innerText
CT = post.querySelector(".city").innerText
S = post.querySelector(".state").innerText
C = post.querySelector(".country").innerText
Idic(N & "|" & CT & " " & S & " " & C) = 1
Next post
For Each key In Idic.Keys
R = R + 1: Cells(R, 1) = Split(key, "|")(0)
Cells(R, 2) = Split(key, "|")(1)
Next key
End Sub
Reference to add to the library:
Microsoft XML, v6.0
Microsoft HTML Object Library
Microsoft Scripting Runtime
Sub test()
Dim IE As New InternetExplorer
Dim Doc As HTMLDocument
Dim lastRow As Long, i As Long
Dim sDD As String
IE.Visible = False
lastRow = Range("A" & Rows.Count).End(xlUp).Row
For i = 1 To lastRow
IE.navigate "https://www.pdga.com/player/" & Range("PDGA").Cells(i).Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Set Doc = IE.document
sDD = Trim(Doc.getElementsByTagName("li")(127).outerText)
Range("Location").Cells(i) = sDD
Next
Set IE = Nothing
Set Doc = Nothing
End Sub