Scraping data from website to Excel using a macro...lost - html

I am totally new to this but here is my scope.
I am running a macro to pull data from a business system.
After this info is pulled, I want a macro to take certain fields, put them into a website form, click submit and then scrape and paste certain data results back into excel.
Everything works minus the scraping and pasting back into excel.
Help please!
I have searched all over stack overflow and watched vids to try and figure out what I need to do but I must be misunderstanding something.
Sub Track()
Range("B2").Select
'This should call to PT and deliver tracking info
Dim IE As Object
Dim tbl As Object, td As Object
Set IE = CreateObject("InternetExplorer.Application") 'Set IEapp = InternetExplorer
IE.Visible = True
IE.Navigate "https://www.partstown.com/track-my-order"
With IEapp
Do
DoEvents
Loop Until IE.readyState = 4
'Input PO and zip
Call IE.Document.getElementById("orderNo").SetAttribute("value", "4500969111")
'ActiveCell.Offset(0, 2).Select
Call IE.Document.getElementById("postalCode").SetAttribute("value", "37040")
IE.Document.forms(7).Submit
Application.Wait Now + TimeValue("00:00:09")
'this is where i am stuck. I know this isnt right but tried to piece it together
Set elemCollection = IE.Document.getelElementsByTagname("table.account-table details _tc_table_highlighted")
For t = 0 To (elemCollection.Length - 1)
For r = 0 To (elemCollection(t).Rows.Length - 1)
For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows.Cells(c).innertext
Next c
Next r
Next t
End With
End Sub
Here is what I want it to pull:
Shipping column
QTY ordered
QTY shipped
Product
And to display in a linear fashion:
Shipping, QTY ordered, QTY shipped, Product

Internet Explorer:
I have made this a little more verbose than usual so you can see each step.
Key things:
1) proper page-load waits with While .Busy Or .readyState < 4: DoEvents: Wend
2) selecting elements by id where possible. The # is a css id selector. css selectors are applied by querySelector method of .document and retrieve the first element in the page which matches the specified pattern
3) a timed loop is needed to wait for results to be present
4) the order qty etc info is a newline divided string. It seemed easiest to split on these newlines and then access individual items from the resultant array by index
5) I order, per your specification, the results in an array and write that array out in one go to the sheet
6) The "." is a class selector in .order-history__item-descript--min i.e. return the first element with class of order-history__item-descript--min
7) The [x=y] is an attribute = value selector in [data-label=Shipping] i.e. return the first element with data-label attribute having value Shipping
8) The combination of .details-table a is using a descendant combinator, " ", to specify I want a tag elements that sit anywhere inside an element with class details-table (an ancestor, not necessarily the direct parent)
VBA:
Option Explicit
'VBE > Tools > References:
' Microsoft Internet Controls
Public Sub RetrieveInfo()
Dim ie As InternetExplorer, ele As Object, t As Date
Const MAX_WAIT_SEC As Long = 5
Set ie = New InternetExplorer
With ie
.Visible = True
.Navigate2 "https://www.partstown.com/track-my-order"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
.querySelector("#orderNo").Value = "4500969111"
.querySelector("#postalCode").Value = "37040"
.querySelector("#orderLookUpForm").submit
End With
While .Busy Or .readyState < 4: DoEvents: Wend
Dim shipping As String, order As String, items() As String
With .document
t = Timer
Do
On Error Resume Next
Set ele = .querySelector("[data-label=Shipping]")
On Error GoTo 0
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ele Is Nothing
If ele Is Nothing Then Exit Sub
shipping = ele.innerText
order = .querySelector(".order-history__item-descript--min").innerText
items = Split(order, vbNewLine)
Dim qtyOrdered As Long, qtyShipped As String, product As String
qtyOrdered = CLng(Replace$(items(0), "Qty Ordered: ", vbNullString))
qtyShipped = CLng(Replace$(items(1), "Qty Shipped: ", vbNullString))
product = .querySelector(".details-table a").Title
Dim results()
results = Array(shipping, qtyOrdered, qtyShipped, product)
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(1, UBound(results) + 1) = results
End With
.Quit
End With
End Sub
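The parsing step from point 4 can be tried on its own, without a browser. What follows is only a minimal sketch: ParseQty and DemoParseOrderText are hypothetical names, and the sample text is made up to mirror the "Qty Ordered: " / "Qty Shipped: " labels used in the code above, so the real page content may differ.

```vba
'Hypothetical helper (not part of the answer's code): strips a known label
'from one line of the order text and converts what is left to a number.
Public Function ParseQty(ByVal textLine As String, ByVal label As String) As Long
    ParseQty = CLng(Trim$(Replace$(textLine, label, vbNullString)))
End Function

Public Sub DemoParseOrderText()
    Dim order As String, items() As String
    'Made-up sample text; assumed to resemble the element's innerText
    order = "Qty Ordered: 3" & vbNewLine & "Qty Shipped: 2"
    items = Split(order, vbNewLine)          'one array item per line
    Debug.Print ParseQty(items(0), "Qty Ordered: ")
    Debug.Print ParseQty(items(1), "Qty Shipped: ")
End Sub
```

If the page ever changes the label text, CLng will raise a type mismatch, so it is probably worth checking the lines in the Immediate window first.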
If new to HTML please look at:
https://developer.mozilla.org/en-US/docs/Web/HTML
If new to css selectors please look at:
https://flukeout.github.io/
XMLHTTP:
The whole thing can also be done with XHR. This is much faster than opening a browser.
XHR:
Use XMLHttpRequest (XHR) objects to interact with servers. You can
retrieve data from a URL without having to do a full page [render]
In this case I do an initial GET request to the landing page to retrieve the CSRFToken to use in my re-enactment of the POST request the page makes to the server when you manually input data and press submit. You get the data you want in the server response. I pass a query string in the body of the POST send line, .send "orderNo=4500969111&postalCode=37040&CSRFToken=" & csrft, where you can see your parameters.
Option Explicit
Public Sub GetInfo()
Dim html As HTMLDocument, csrft As String '< VBE > Tools > References > Microsoft HTML Object Library
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.partstown.com", False
.send
html.body.innerHTML = .responseText
csrft = html.querySelector("[name=CSRFToken]").Value
.Open "POST", "https://www.partstown.com/track-my-order", False
.setRequestHeader "Referer", "https://www.partstown.com/track-my-order"
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
.setRequestHeader "Accept-Encoding", "gzip, deflate"
.setRequestHeader "Accept-Language", "en-US,en;q=0.9"
.send "orderNo=4500969111&postalCode=37040&CSRFToken=" & csrft
html.body.innerHTML = .responseText
End With
Dim shipping As String, order As String, items() As String
shipping = html.querySelector("[data-label=Shipping]").innerText
order = html.querySelector(".order-history__item-descript--min").innerText
items = Split(order, vbNewLine)
Dim qtyOrdered As Long, qtyShipped As String, product As String
qtyOrdered = CLng(Replace$(items(0), "Qty Ordered: ", vbNullString))
qtyShipped = CLng(Replace$(items(1), "Qty Shipped: ", vbNullString))
product = html.querySelector(".details-table a").Title
Dim results()
results = Array(shipping, qtyOrdered, qtyShipped, product)
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(1, UBound(results) + 1) = results
End Sub
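The POST body above is built by plain string concatenation, which is fine for digits-only values. If an order number, postal code or token ever contains characters such as a space or an ampersand, it would need percent-encoding first. A minimal sketch of that idea follows; UrlEncodeAscii and BuildOrderBody are hypothetical helper names, and the encoder assumes plain ASCII input.

```vba
'Hypothetical helper: percent-encodes a value for an
'application/x-www-form-urlencoded body. Assumes ASCII input only.
Public Function UrlEncodeAscii(ByVal rawText As String) As String
    Dim i As Long, ch As String, code As Long, result As String
    For i = 1 To Len(rawText)
        ch = Mid$(rawText, i, 1)
        code = AscW(ch)
        Select Case code
            Case 48 To 57, 65 To 90, 97 To 122, 45, 46, 95, 126
                result = result & ch                                  'unreserved: 0-9 A-Z a-z - . _ ~
            Case Else
                result = result & "%" & Right$("0" & Hex$(code), 2)   'everything else as %XX
        End Select
    Next i
    UrlEncodeAscii = result
End Function

'Hypothetical helper: assembles the same three parameters used in the .send line above
Public Function BuildOrderBody(ByVal orderNo As String, ByVal postalCode As String, ByVal token As String) As String
    BuildOrderBody = "orderNo=" & UrlEncodeAscii(orderNo) & _
                     "&postalCode=" & UrlEncodeAscii(postalCode) & _
                     "&CSRFToken=" & UrlEncodeAscii(token)
End Function
```

Usage would then be along the lines of .send BuildOrderBody("4500969111", "37040", csrft).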
Example of loop:
Option Explicit
Public Sub GetInfo()
Dim html As HTMLDocument, csrft As String, lastRow As Long, sourceValues() '< VBE > Tools > References > Microsoft HTML Object Library
Set html = New HTMLDocument
Dim ws As Worksheet, i As Long
Set ws = ThisWorkbook.Worksheets("Sheet4")
lastRow = ws.Cells(ws.rows.Count, "B").End(xlUp).Row
sourceValues = ws.Range("B2:D" & lastRow).Value
Dim results()
ReDim results(1 To UBound(sourceValues, 1), 1 To 4)
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.partstown.com", False
.send
html.body.innerHTML = .responseText
csrft = html.querySelector("[name=CSRFToken]").Value
For i = LBound(sourceValues, 1) To UBound(sourceValues, 1)
If sourceValues(i, 1) <> vbNullString And sourceValues(i, 3) <> vbNullString Then
DoEvents
.Open "POST", "https://www.partstown.com/track-my-order", False
.setRequestHeader "Referer", "https://www.partstown.com/track-my-order"
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
.setRequestHeader "Accept-Encoding", "gzip, deflate"
.setRequestHeader "Accept-Language", "en-US,en;q=0.9"
.send "orderNo=" & sourceValues(i, 1) & "&postalCode=" & sourceValues(i, 3) & "&CSRFToken=" & csrft
html.body.innerHTML = .responseText
Dim shipping As String, order As String, items() As String
shipping = html.querySelector("[data-label=Shipping]").innerText
order = html.querySelector(".order-history__item-descript--min").innerText
items = Split(order, vbNewLine)
Dim qtyOrdered As Long, qtyShipped As String, product As String
qtyOrdered = CLng(Replace$(items(0), "Qty Ordered: ", vbNullString))
qtyShipped = CLng(Replace$(items(1), "Qty Shipped: ", vbNullString))
product = html.querySelector(".details-table a").Title
results(i, 1) = shipping
results(i, 2) = qtyOrdered
results(i, 3) = qtyShipped
results(i, 4) = product
End If
'Application.Wait Now + TimeSerial(0, 0, 1)
Next
End With
'results written out from row 2 column E
ws.Cells(2, 5).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Related

Cannot click search result elements after submitting HTML web form with embedded results table - VBA web scrape

I am trying to scrape data from the following URL: http://iswdataclient.azurewebsites.net/webSearchID.aspx?dbkey=parkercad.
I can insert and query a property ID but after loading the search results, I am unable to successfully click the "View Property" link in the results table.
My initial debugging suggested that the form had not actually submitted, meaning the link was not present on the webpage. However, the HTML in the subsequent results page shows the additional elements for the search results. I have unsuccessfully tried the following to wait for the webpage to load, but I do not think it is a timing issue:
Do While ie.Busy Or ie.readyState <> 4: DoEvents: Loop
Do While oIE.ReadyState = 4: WScript.Sleep 100: Loop
Do While oIE.ReadyState <> 4: WScript.Sleep 100: Loop
Do While IE.ReadyState = 4: DoEvents: Loop
Do Until IE.ReadyState = 4: DoEvents: Loop
Application.Wait (Now + TimeValue("00:00:03"))
I have parsed the HTML a number of ways, also considering an event handling issue, beginning with a drill down at the form level:
Set ie = CreateObject("internetexplorer.application")
With ie
.navigate "http://iswdataclient.azurewebsites.net/webProperty.aspx?dbkey=parkercad"
.Visible = True
Do While .Busy Or .readyState <> 4
DoEvents
Loop
End With
For Each propid In Range(Cells(2, 8), Cells(2, 8)) 'Cells(Range("H" & Rows.Count).End(xlUp).Row, 8)) 'propid = R000001972
If propid <> "N/A" Then
On Error Resume Next
With ie.document.body
For iFRM = 0 To .getElementsByTagName("form").Length - 1
If .getElementsByTagName("form")(iFRM).ID = "searchForm" Then
With .getElementsByTagName("form")(iFRM)
For iNPT = 0 To .getElementsByTagName("input").Length - 1
Select Case .getElementsByTagName("input")(iNPT).Name
Case "ucSearchID$searchid"
.getElementsByTagName("input")(iNPT).Value = propid
Case "ucSearchID$ButtonSearch"
.getElementsByTagName("input")(iNPT).Click
End Select
Next iNPT
Do While ie.Busy Or ie.readyState <> 4: DoEvents: Loop
Application.Wait (Now + TimeValue("00:00:02"))
Exit For
End With
Exit For
End If
Next iFRM
End With
As well as a simple parse of the required elements:
Set ie = CreateObject("internetexplorer.application")
With ie
.navigate "http://iswdataclient.azurewebsites.net/webProperty.aspx?dbkey=parkercad"
.Visible = True
Do While .Busy Or .readyState <> 4
DoEvents
Loop
End With
Set intags = ie.document.getElementsByTagName("input")
For Each intag In intags
If intag.classname = "searchid" Then
intag.Value = propid
Set evt = ie.document.createEvent("keyboardevent")
evt.initEvent "change", True, False
intag.dispatchEvent evt
End If
Next intag
ie.document.getelementbyid("ucSearchID_ButtonSearch").Click
While ie.readyState <> 4 Or ie.Busy: DoEvents: Wend
As well as a drill down of table cells, the code for which I deleted. Although I thought there could be an event handling issue, the webpage updates, I just cannot parse the updated HTML from the results table.
Debug.Print ie.document.getelementbyid("lblResults").innerText
The Debug.Print returns "Your search of ' ' returned 0 result(s)", while the webpage reflects a successful search with "Your search of 'R000001972' returned 1 result(s)". So my code successfully submits the form, but the click on the "View Property" link in the results page never executes, because the code fails to parse the updated HTML:
For at = 0 To ie.document.getElementsByTagName("a").Length - 1
Select Case ie.document.getElementsByTagName("a")(at).ID
Case "ucResultsGrid_" & propid
ie.document.getElementsByTagName("a")(at).Click
End Select
Next at
It does not seem to be either a timing or event handling issue. Unsure of how to proceed. Any help would be much appreciated.
It's an aspx page. You can perform the same GET and POST requests it does in a simplified form. I use clipboard to write out sample tables. You can amend as you choose.
Option Explicit
Public Sub GetPropertyInfo()
Dim html As MSHTML.HTMLDocument, xhr As Object
Application.ScreenUpdating = False
Set html = New MSHTML.HTMLDocument
Set xhr = CreateObject("MSXML2.ServerXMLHTTP")
Dim body As String, propertyId As String
propertyId = "R000001972"
With xhr
.Open "GET", "http://iswdataclient.azurewebsites.net/webSearchID.aspx?dbkey=parkercad&stype=id&sdata=" & propertyId, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
If html.querySelectorAll("#dvPrimary table tr").Length <= 1 Then Exit Sub
body = GetPostBody(html, propertyId)
.Open "POST", "http://iswdataclient.azurewebsites.net/webProperty.aspx?dbkey=parkercad&stype=id&sdata=" _
& propertyId & "&id=" & propertyId, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send body
html.body.innerHTML = .responseText
End With
Dim ws As Worksheet, clipboard As Object, i As Long
Set ws = ThisWorkbook.Worksheets(1)
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
With ws.Cells
.ClearContents
.ClearFormats
End With
With html.querySelectorAll("table")
For i = 8 To .Length - 1
clipboard.SetText .Item(i).outerHTML
clipboard.PutInClipboard
ws.Range("A" & GetLastRow(ws) + 2).PasteSpecial
Next
End With
Application.ScreenUpdating = True
End Sub
Public Function GetPostBody(ByVal html As MSHTML.HTMLDocument, ByVal propertyId As String) As String
Dim i As Long, result As String
With html.querySelectorAll("input[type=hidden]")
For i = 0 To .Length - 1
result = result & .Item(i).ID & "=" & .Item(i).Value & "&"
Next
End With
result = result & "__EVENTTARGET=ucResultsGrid$" & propertyId
GetPostBody = result
End Function
Public Function GetLastRow(ByVal sh As Worksheet) As Long
On Error Resume Next
GetLastRow = sh.Cells.Find(What:="*", _
After:=sh.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row
On Error GoTo 0
End Function
References (VBE > Tools > References):
Microsoft HTML Object Library

Inner Loop design for webscraping

I want to import restaurant data such as restaurant name, phone number, website and address into Excel. Unfortunately, I am getting sponsored results, and I am not getting the website or the full address, because those are on the inner page that opens when you click on the restaurant name. With some help from other forums I have put together the code below, but it is not working. Please help me fix the issue in my code. Website: https://www.yelp.com/search?cflt=restaurants&find_loc=San%20Francisco%2C%20CA&start=
Here is my code:
Sub GetInfo()
Const URL$ = "https://www.yelp.com/search?cflt=restaurants&find_loc=San%20Francisco%2C%20CA&start="
Dim Http As New XMLHTTP60, Html As New HTMLDocument, Htmldoc As New HTMLDocument, page&, I&
For page = 0 To 1 ' this is where you change the last number for the pages to traverse
With Http
.Open "GET", URL & page * 30, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll("[class*='searchResult']")
For I = 0 To .Length - 1
Htmldoc.body.innerHTML = .Item(I).outerHTML
On Error Resume Next
r = r + 1: Cells(r, 1) = Htmldoc.querySelector("[class*='heading--h3'] > a").innerText
Cells(r, 2) = Htmldoc.querySelector("[class*='container'] > [class*='display--inline-block']").innerText
' Cells(r, 3) = Htmldoc.querySelector("[class*='container'] > address").innerText
'Cells(r, 4) = Htmldoc.querySelector("[class*='container'] > address").NextSibling.innerText
'Inner loop creation
Cells(r, 5) = Htmldoc.querySelector("[class*='container'] > website").href ' Extract from window after clicking on hotel name
Cells(r, 6) = Htmldoc.querySelector("[class*='container'] > fulladdress").innerText ' Extract from window after clicking on hotel name
On Error GoTo 0
Next I
End With
Next page
End Sub
You can use the free API to get the top 50 from the business_search endpoint. Pass sort parameter in query string to get top rated.
Use a json parser, such as jsonconverter.bas to handle the response. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
API instructions are here. You need to set up a test app, which requires some basic user info, and verify your email. You will then receive an API key for authentication which is passed in the authorization header as shown below.
There is other info returned which you can parse if wanted.
Option Explicit
Public Sub GetTopRestuarants()
Dim json As Object, headers(), r As Long, c As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://api.yelp.com/v3/businesses/search?term=restaurant&location=san-francisco&limit=50&sort_by=rating", False
.setRequestHeader "Authorization", "Bearer yourAPIkey"
.send
Set json = JsonConverter.ParseJson(.responseText)("businesses")
headers = Array("Restaurant name", "phone", "website", "address")
Dim results(), item As Object
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each item In json
r = r + 1
results(r, 1) = item("name")
results(r, 2) = item("phone")
results(r, 3) = item("url")
Dim subItem As Variant, address As String
address = vbNullString
For Each subItem In item("location")("display_address")
address = address & Chr$(32) & subItem
Next
results(r, 4) = Trim$(address)
Next
End With
With ActiveSheet
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Caveat emptor
Note that specifying the sort_by is a suggestion (not strictly enforced) to Yelp's search, which considers multiple input parameters to return the most relevant results. For example, the rating sort is not strictly sorted by the rating value, but by an adjusted rating value that takes into account the number of ratings, similar to a Bayesian average. This is to prevent skewing results to businesses with a single review.
This is one of the ways which will let you parse the results from its inner pages. I can't access that webpage anymore to assist you further. However, give it a shot. I suppose it will work:
Sub GetInfo()
Const URL$ = "https://www.yelp.com/search?cflt=restaurants&find_loc=San%20Francisco%2C%20CA&start="
Const base$ = "https://www.yelp.com"
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim oTitle$, oPhone As Object, Htmldoc As New HTMLDocument
Dim R&, newUrl$, I&, oWeb As Object, page&, oAddress As Object
[A1:D1] = [{"Name","Phone","Address","Website"}]
For page = 1 To 3 'this is where you change the last number for this script to traverse
With Http
.Open "GET", URL & page * 30, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll("[class*='searchResult'] [class*='heading--h3'] > a")
For I = 0 To .Length - 1
If Not InStr(.item(I).getAttribute("href"), "/adredir?") > 0 Then
oTitle = .item(I).innerText
newUrl = Replace(.item(I).getAttribute("href"), "about:", base)
With Http
.Open "GET", newUrl, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
Htmldoc.body.innerHTML = .responseText
End With
R = R + 1: Cells(R + 1, 1) = oTitle
Set oPhone = Htmldoc.querySelector(".biz-phone")
If Not oPhone Is Nothing Then
Cells(R + 1, 2) = oPhone.innerText
End If
Set oAddress = Htmldoc.querySelector(".map-box-address")
If Not oAddress Is Nothing Then
Cells(R + 1, 3) = WorksheetFunction.Clean(oAddress.innerText)
End If
Set oWeb = Htmldoc.querySelector(".biz-website > a")
If Not oWeb Is Nothing Then
Cells(R + 1, 4) = oWeb.innerText
End If
End If
Next I
End With
Next page
End Sub
Btw, the ads have been kicked out.

Retrieving all Excel file links from a webpage

I'm trying to get all the downloadable Excel file links from the website, but having difficulty. Please help to guide me. Thanks.
Sub TYEX()
Dim internet As Object
Dim internetdata As Object
Dim div_result As Object
Dim header_links As Object
Dim link As Object
Dim URL As String
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = True
URL = "https://www.jpx.co.jp/markets/public/short-selling/index.html"
internet.Navigate URL
Do Until internet.ReadyState >= 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 5)
Set internetdata = internet.Document
Set div_result = internetdata.getElementById("readArea")
Set header_links = div_result.getElementsByTagName("td")
For Each h In header_links
Set link = h.ChildNodes.item(0)
Cells(Range("A" & Rows.Count).End(xlUp).Row + 1, 1) = link.href
Next
MsgBox "done"
End Sub
You had the idea down correctly, but here's a different approach:
Sub TYEX()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.navigate "https://www.jpx.co.jp/markets/public/short-selling/index.html"
.Visible = True
Do While .Busy Or .readyState < 4
DoEvents
Loop
Dim doc As Object, tbl As Object
Set doc = .document
Set tbl = doc.getElementsByClassName("component-normal-table")(0).Children(0)
Dim r As Long, xlsArr(), a As Object
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
End With
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With
End Sub
Public Function NextRow(ByVal ws As Worksheet, Optional ByVal col As Variant = 1) As Long
With ws
NextRow = .Cells(.Rows.Count, col).End(xlUp).Row + 1
End With
End Function
Breaking Down the Code
This will loop your html table rows. We start at 1, because 0 is actually just the table header.
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
This uses regular expressions to extract the url from the innerHTML property. You can see how this particular regex works here: Regex101
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
You will size your range to be the same size as your array that contains the links, then write the array to the worksheet. This is usually much faster than writing cells one-by-one.
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With
You can use an attribute = value CSS selector with $ operator to say the href value must end with .xls. Then use querySelectorAll to retrieve all matched results. Using CSS selectors is a very fast and generally robust method.
Dim list As Object
Set list = ie.document.querySelectorAll("[href$='.xls']")
It is much faster to use XMLHTTP as well, rather than opening IE. Note that you can then pass these links to a function to perform a binary download or to URLMon for downloading.
Option Explicit
Public Sub Links()
Dim sResponse As String, html As HTMLDocument, list As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.jpx.co.jp/markets/public/short-selling/index.html", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set list = html.querySelectorAll("[href$='.xls']")
End With
For i = 0 To list.Length - 1
Debug.Print Replace$(list.item(i), "about:", "https://www.jpx.co.jp")
Next
End Sub
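The Replace$ call above deals with the "about:" prefix that shows up on relative links once the HTML has been loaded into an HTMLDocument. A minimal sketch of the same idea as a reusable function, which only touches links that actually start with that prefix (ToAbsoluteUrl is a hypothetical name, not something from the answer):

```vba
'Hypothetical helper: turns an "about:"-prefixed relative link into an
'absolute URL and leaves links that are already absolute unchanged.
Public Function ToAbsoluteUrl(ByVal href As String, ByVal baseUrl As String) As String
    If Left$(href, 6) = "about:" Then
        ToAbsoluteUrl = baseUrl & Mid$(href, 7)   'drop the 6-character prefix
    Else
        ToAbsoluteUrl = href
    End If
End Function
```

Inside the loop this could be used as Debug.Print ToAbsoluteUrl(list.item(i).href, "https://www.jpx.co.jp").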
Example download function (though you could re-use your existing XMLHTTP object - this is just to illustrate):
Public Function DownloadFile(ByVal downloadFolder As String, ByVal downloadURL As String) As String
Dim http As Object, tempArr As Variant
Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
http.Open "GET", downloadURL, False
http.send
On Error GoTo errhand
With CreateObject("ADODB.Stream")
.Open
.Type = 1
.write http.responseBody
tempArr = Split(downloadURL, "/")
tempArr = tempArr(UBound(tempArr))
.SaveToFile downloadFolder & tempArr, 2 '< downloadFolder must end with a path separator. 2 = overwrite, which is OK if the existing file has not been modified.
.Close
End With
DownloadFile = downloadFolder & tempArr
Exit Function
errhand:
If Err.Number <> 0 Then
Debug.Print Err.Number, Err.Description
MsgBox "Download failed"
End If
DownloadFile = vbNullString
End Function
References (VBE > Tools > References):
Microsoft HTML Object Library

How do I extract specific data (name, details) from a website into Excel with Excel VBA?

Below I am trying to get processor and warranty:
Option Explicit
Sub GetData()
Dim objIE As InternetExplorer
Dim itemELE As Object
Dim html As IHTMLDocument
Dim Processor As String
Dim warranty As String
Dim y As Integer
'start a new browser instance
Set objIE = New InternetExplorer
'make browser visible
objIE.Visible = True
'navigate to page with needed data
objIE.navigate "https://www.harveynorman.com.sg/computers-tablets-and-gaming/computers/laptops/"
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each itemELE In objIE.document.getElementsByClassName("expandabaleContent")
Processor = itemELE.getElementsByTagName("d1")(0).innerText
warranty = itemELE.getElementsByClassName("d1")(0).getElementsByTagName("a")(0).textContent
Sheets("Sheet1").Range("A" & y).Value = Processor
Sheets("Sheet1").Range("B" & y).Value = warranty
y = y + 1
Next
End Sub
Screenshot of the page:
For the page shown (in your image) you can issue an XMLHTTP (XHR) GET request to grab the product info without opening a slow IE browser instance.
For the specific information:
Processor and warranty info:
If you inspect the page the info about processor and warranty appears associated with a classname facetedResults-feature-list
You can see the classname and then a dl tag housing a dt tag which has sibling dd tags. Two of these sibling dd tags are associated with the info for processor and warranty.
I use a CSS selector to grab all these dd tags, which can be simplified, in this instance, to ignore the sibling dt and parent dl tags and use just:
.facetedResults-feature-list dd
The "." is a class selector. The CSS combination selection above says get the dd tags within elements with class facetedResults-feature-list
Product titles info:
The titles I get using another CSS selector of:
.facetedResults-title
This is elements with class facetedResults-title. This contains the product title.
Writing out product titles, processor and warranty info to the sheet:
A little maths shows me that the processor info repeats every 14, and that if I add 8 to the index for the processor I get the warranty info. You can see how you could write out each of the details as they occur at indices that repeat every 14. I combine the loop over the nodeList of dd elements with the titles to write out to the sheet.
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument
Application.ScreenUpdating = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim titles As Object, targetedInfo As Object, rowCounter As Long
With html
.body.innerHTML = sResponse
Set titles = .querySelectorAll(".facetedResults-title")
Set targetedInfo = .querySelectorAll(".facetedResults-feature-list dd")
End With
With Worksheets("Sheet1")
For i = 0 To targetedInfo.Length - 1
If i Mod 14 = 0 Then
rowCounter = rowCounter + 1
.Cells(rowCounter, 1) = titles(rowCounter - 1).innerText
.Cells(rowCounter, 2) = targetedInfo(i).innerText
.Cells(rowCounter, 3) = targetedInfo(i + 8).innerText
End If
Next i
End With
Application.ScreenUpdating = True
End Sub
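The index arithmetic behind the loop above can be written out explicitly. This is only a sketch of the "repeats every 14, warranty is 8 further on" observation from the answer. The constant and function names are hypothetical, and the numbers hold only as long as the page keeps the same layout.

```vba
'Place these constants at the top of the module, before any procedures.
Private Const FIELDS_PER_PRODUCT As Long = 14   'dd elements per product, as observed in the answer
Private Const WARRANTY_OFFSET As Long = 8       'warranty sits 8 positions after the processor

'productNumber is zero-based: 0 is the first product on the page
Public Function ProcessorIndex(ByVal productNumber As Long) As Long
    ProcessorIndex = productNumber * FIELDS_PER_PRODUCT
End Function

Public Function WarrantyIndex(ByVal productNumber As Long) As Long
    WarrantyIndex = ProcessorIndex(productNumber) + WARRANTY_OFFSET
End Function
```

With these, the third product (productNumber = 2) would read its processor from targetedInfo(28) and its warranty from targetedInfo(36).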
More general info:
CSS selectors:
The product info is associated with an 'expandableContent facetedResults-expandableContent-features expandableContent-is-collapsed facetedResults-expandableContent-69' class name
The prices are associated with an 'expandableContent facetedResults-expandableContent-price expandableContent-is-collapsed' class name.
You can select these by the traditional .getElementsByClassName and then loop over the collection, or, in my case, use a CSS selector for class to do the same thing, and then traverse the length of the returned nodeList.
.getElementsByClassName("expandableContent facetedResults-expandableContent-features expandableContent-is-collapsed facetedResults-expandableContent-69")
is the same as
.querySelectorAll(".expandableContent.facetedResults-expandableContent-features.expandableContent-is-collapsed.facetedResults-expandableContent-69")
The "." is the class selector.
Titles are associated with a class facetedResults-title
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument
Application.ScreenUpdating = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim info As Object, prices As Object, titles As Object
With html
.body.innerHTML = sResponse
Set titles = .querySelectorAll(".facetedResults-title")
Set info = .querySelectorAll(".expandableContent.facetedResults-expandableContent-features.expandableContent-is-collapsed.facetedResults-expandableContent-69")
Set prices = .querySelectorAll(".expandableContent.facetedResults-expandableContent-price.expandableContent-is-collapsed")
End With
With Worksheets("Sheet1")
For i = 0 To titles.Length - 1
.Cells(i + 1, 1) = titles(i).innerText
.Cells(i + 1, 2) = info(i).innerText
.Cells(i + 1, 3) = prices(i).innerText
Next i
End With
Application.ScreenUpdating = True
End Sub
References required (VBE>Tools>References):
Microsoft HTML Object Library
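The equivalence described in this answer, between a space-separated class name passed to getElementsByClassName and a dot-joined selector passed to querySelectorAll, can be sketched as a small conversion helper. ClassNamesToSelector is a hypothetical name, and the function assumes the class names contain no characters that would need escaping in CSS.

```vba
'Hypothetical helper: converts "classA classB" into ".classA.classB"
Public Function ClassNamesToSelector(ByVal classNames As String) As String
    Dim parts() As String, i As Long, result As String
    parts = Split(Trim$(classNames), " ")
    For i = LBound(parts) To UBound(parts)
        'skip empty items produced by repeated spaces
        If Len(parts(i)) > 0 Then result = result & "." & parts(i)
    Next i
    ClassNamesToSelector = result
End Function
```

For example, html.querySelectorAll(ClassNamesToSelector("expandableContent facetedResults-expandableContent-price expandableContent-is-collapsed")) should select the same elements as the prices selector used above.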
QHarr has already provided some good options, but in case you still want to try IE, see the code below:
Option Explicit
Sub GetData()
Dim objIE As InternetExplorer
Dim itemELE As Object
Dim html As IHTMLDocument
Dim Processor As String
Dim warranty As String
Dim y As Integer
'start a new browser instance
Set objIE = New InternetExplorer
'make browser visible
objIE.Visible = True
'navigate to page with needed data
objIE.navigate "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS"
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Application.Wait Now + TimeSerial(0, 0, 3)
y = 1
For Each itemELE In objIE.document.getElementsByClassName("facetedResults-feature-list")
If InStr(1, itemELE.className, "bundleList", vbTextCompare) = 0 Then
Application.Wait Now + TimeSerial(0, 0, 2)
Processor = itemELE.getElementsByTagName("dl")(0).innerText
warranty = itemELE.getElementsByTagName("dl")(4).innerText
Sheets("Sheet1").Range("A" & y).Value = Processor
Sheets("Sheet1").Range("B" & y).Value = warranty
y = y + 1
End If
Next
End Sub

VBA reads HTML from the old page after clicking submit button

I am not a programmer, but I have managed to learn a few things in VBA. On one particular website I have run into a problem that does not occur on others.
The macro should fill in a form on the page, click the submit button, and then read some data from the result page.
The first part works fine, but no matter what I do, VBA still reads data from the page as it was before submit was clicked.
The code is:
Sub VIES2()
'Start Internet Explorer and pause further action until it reaches the ready state
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
Do While IE.ReadyState <> 4: DoEvents: Loop
'Fill in the form with the appropriate values and click the check button
IE.document.getElementbyId("countryCombobox").Value = "IT"
IE.document.getElementbyId("number").Value = "01802840023"
IE.document.getElementbyId("requesterCountryCombobox").Value = "IT"
IE.document.getElementbyId("requesterNumber").Value = "01802840023"
IE.document.getElementbyId("submit").Click
'Test retrieving the description and the request identifier
For t = 1 To 999999
Next t
Application.Wait Now + TimeValue("00:00:10")
Do While IE.ReadyState <> 4: DoEvents: Loop
For t = 1 To 999999
Next t
Application.Wait Now + TimeValue("00:00:10")
MsgBox IE.LocationURL
Set Text = IE.document.getElementsbyClassName("layout-content")
For Each Element In Text
MsgBox Element.innerText
Next
Set Test = IE.document.getElementsbyTagName("TABLE")
For Each Element In Test
MsgBox Element.innerText
Next
End Sub
I have tried breakpoints, various wait loops and Application.Wait, as suggested in similar questions where they seem to have worked. Here, even long after the page has fully loaded, the code still reads the old page; at least, the URL and the data it pulls suggest that this is the case.
UPDATE: I should add that I have also tried making the macro refresh the page, but that clears the form input. Interestingly, the target URL is:
http://ec.europa.eu/taxation_customs/vies/vatResponse.html
If I use this as the initial page, the browser instantly redirects to the original page with a notice that input data is required. The macro then fills in the data and clicks the submit button. In this case IE.LocationURL reports this URL:
http://ec.europa.eu/taxation_customs/vies/vatResponse.html
but judging by the content, getElementsbyClassName still reads elements from the initial page:
http://ec.europa.eu/taxation_customs/vies/?locale=pl
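This worked for me and writes the VAT response table to a new worksheet.
Note:
PtrSafe is required on 64-bit Office and is also accepted by 32-bit Office 2010 and later (VBA7). Remove it only if you are on Office 2007 or earlier, where the keyword is not recognised.
Code: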
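The PtrSafe caveat can also be handled without editing the declaration by hand. Below is a minimal sketch using conditional compilation, assuming the only difference you need to cover is whether the host understands PtrSafe (VBA7, i.e. Office 2010 and later, does; older hosts do not). It is meant to replace the single Declare line in the code that follows, not to sit alongside it.

```vba
Option Explicit

' Sketch: pick the right declaration at compile time.
' VBA7 is a built-in compiler constant that is True in Office 2010 and later.
#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

Public Sub PauseExample()
    Sleep 1000   ' pause for one second
End Sub
```

On older hosts the editor may show the unused branch in red, but it still compiles because that branch is skipped.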
Option Explicit
Declare PtrSafe Sub sleep Lib "kernel32" Alias "Sleep" (ByVal dwmilliseconds As Long)
Public Sub VIES2()
Application.ScreenUpdating = False
Dim IE As Object
'Start Internet Explorer and pause further action until it reaches the ready state
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.navigate "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
Do While IE.ReadyState <> 4: DoEvents: Loop
'Fill in the form with the appropriate values and click the check button
IE.document.getElementById("countryCombobox").Value = "IT"
IE.document.getElementById("number").Value = "01802840023"
IE.document.getElementById("requesterCountryCombobox").Value = "IT"
IE.document.getElementById("requesterNumber").Value = "01802840023"
IE.document.getElementById("submit").Click
sleep (5000) 'or increase to 10000
Dim tbl As Object
Set tbl = IE.document.getElementById("vatResponseFormTable")
Dim ws As Worksheet
Set ws = ActiveWorkbook.Worksheets.Add
ws.Name = "Results"
Dim rng As Range, currentRow As Object, currentColumn As Object, i As Long, outputRow As Long
outputRow = outputRow + 1
Set rng = ws.Range("B" & outputRow)
For Each currentRow In tbl.Rows
For Each currentColumn In currentRow.Cells
rng.Value = currentColumn.outerText
Set rng = rng.Offset(, 1)
i = i + 1
Next currentColumn
outputRow = outputRow + 1
Set rng = rng.Offset(1, -i)
i = 0
Next currentRow
Application.ScreenUpdating = True
End Sub
Although QHarr's solution works on my end, here is another one with no hardcoded delay in the script.
Using IE, as in your question:
Sub Get_Data()
Dim HTML As HTMLDocument, post As Object, elems As Object
Dim elem As Object, r&, c&
With New InternetExplorer
.Visible = False
.navigate "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
With HTML
.getElementById("countryCombobox").Value = "IT"
.getElementById("number").Value = "01802840023"
.getElementById("requesterCountryCombobox").Value = "IT"
.getElementById("requesterNumber").Value = "01802840023"
.getElementById("submit").Click
Do: Set post = .getElementById("vatResponseFormTable"): DoEvents: Loop While post Is Nothing
For Each elems In post.Rows
For Each elem In elems.Cells
c = c + 1: Cells(r + 1, c) = elem.innerText
Next elem
c = 0: r = r + 1
Next elems
End With
.Quit
End With
End Sub
References to add (VBE > Tools > References):
1. Microsoft Internet Controls
2. Microsoft HTML Object Library
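One caveat with polling for the result table is that the loop never ends if the element never appears (for example, when the site rejects the input). Below is a rough sketch of a polling helper with a timeout. WaitForElementById and its parameters are hypothetical names of mine, not part of the code above, and it assumes the browser object is held in a variable so it can be passed in.

```vba
Option Explicit

' Hypothetical helper: polls the browser for an element id and gives up
' after timeoutSeconds instead of looping forever.
Public Function WaitForElementById(ByVal ie As Object, ByVal elementId As String, _
                                   Optional ByVal timeoutSeconds As Double = 15) As Object
    Dim startedAt As Double, elapsed As Double
    Dim found As Object

    startedAt = Timer
    Do
        DoEvents
        ' Re-read ie.document on every pass so a freshly loaded page is seen.
        ' Errors are ignored because the document can be unavailable mid-navigation.
        On Error Resume Next
        Set found = ie.document.getElementById(elementId)
        On Error GoTo 0

        elapsed = Timer - startedAt
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop Until Not found Is Nothing Or elapsed >= timeoutSeconds

    Set WaitForElementById = found   ' Nothing means the timeout was reached
End Function
```

Usage would look something like Set post = WaitForElementById(ie, "vatResponseFormTable"), followed by an If post Is Nothing Then check before reading post.Rows.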
Using an XMLHTTP request (much faster than IE):
Sub Get_Data()
Dim elems As Object, elem As Object
Dim QueryString$, S$, r&, c&
QueryString = "memberStateCode=IT&number=01802840023&traderName=&traderStreet=&traderPostalCode=&traderCity=&requesterMemberStateCode=IT&requesterNumber=01802840023&action=check&check=Weryfikuj"
With New XMLHTTP60 'XMLHTTP60 is the class exposed by the "Microsoft XML, v6.0" reference
.Open "POST", "http://ec.europa.eu/taxation_customs/vies/vatResponse.html", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send QueryString
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
For Each elems In .getElementById("vatResponseFormTable").Rows
For Each elem In elems.Cells
c = c + 1: Cells(r + 1, c) = elem.innerText
Next elem
c = 0: r = r + 1
Next elems
End With
End Sub
References to add (VBE > Tools > References):
1. Microsoft XML, V6
2. Microsoft HTML Object Library
Before automating a browser, it is usually worth checking whether a REST or SOAP service is available for the task.
Using an Internet Explorer instance for this is overkill.
Try this simple function, which uses the SOAP service to validate VAT numbers:
Function IsVatValid(ByVal country_code As String, ByVal vat_number As String) As Boolean
Dim objHTTP As Object
Dim xmlDoc As Object
Dim sURL As String, sEnv As String
Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
sURL = "http://ec.europa.eu/taxation_customs/vies/services/checkVatService"
sEnv = "<s11:Envelope xmlns:s11='http://schemas.xmlsoap.org/soap/envelope/'>" & _
"<s11:Body>" & _
"<tns1:checkVat xmlns:tns1='urn:ec.europa.eu:taxud:vies:services:checkVat:types'>" & _
"<tns1:countryCode>" & country_code & "</tns1:countryCode>" & _
"<tns1:vatNumber>" & vat_number & "</tns1:vatNumber>" & _
"</tns1:checkVat>" & _
"</s11:Body>" & _
"</s11:Envelope>"
objHTTP.Open "POST", sURL, False 'HTTP method names are case-sensitive
objHTTP.setRequestHeader "Content-Type", "text/xml"
objHTTP.setRequestHeader "SOAPAction", "checkVatService"
objHTTP.send (sEnv)
objHTTP.waitForResponse
Set xmlDoc = CreateObject("HTMLFile")
xmlDoc.body.innerHTML = objHTTP.responsetext
IsVatValid = CBool(xmlDoc.getElementsByTagName("valid")(0).innerHTML)
Set xmlDoc = Nothing
Set objHTTP = Nothing
End Function
And then you can simply validate all your vat numbers:
Debug.Print IsVatValid("IT", "01802840023")
>>> True
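To check a whole list rather than one number at a time, something along these lines should work. This is only a sketch: the sheet name "Sheet1", the header in row 1, and the column layout (country code in A, VAT number in B, result in C) are assumptions of mine, and it relies on the IsVatValid function above being in the same project.

```vba
Option Explicit

' Sketch: run IsVatValid over a column of VAT numbers.
' Assumed layout: row 1 = headers, A = country code, B = VAT number, C = result.
Public Sub ValidateVatColumn()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long

    Set ws = ThisWorkbook.Worksheets("Sheet1")          ' assumed sheet name
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row  ' last used row in column A

    For r = 2 To lastRow
        ' CStr keeps the value as text; format column B as Text so that
        ' leading zeros in VAT numbers are not lost.
        ws.Cells(r, 3).Value = IsVatValid(CStr(ws.Cells(r, 1).Value), _
                                          CStr(ws.Cells(r, 2).Value))
    Next r
End Sub
```

Each row triggers one request to the service, so a long list will take a while and it may be worth adding error handling for rows where the service does not respond.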