Scraping a website whose URL contains a Unix timestamp in milliseconds - html

The website I am scraping spreads its data over several pages. I want to either click the paging elements or manipulate the URL so that I can copy the data from every page. The initial URL ends in page=1&playerType=ALL&ts=1558502019375. My code has a loop that is supposed to go through the pages one after another and collect the data, but I cannot get it to work.
Sub UPDATE_DATA_MLB()
Application.ScreenUpdating = False
'dimension (set aside memory for) our variables
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer
Dim EstaPagina As Byte
Dim EstaURL As String
'initial page
EstaPagina = 1
'we will output data to excel, starting on row 1
y = 1
EstaURL = "http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2018&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&playerType=ALL&ts=1558502019375" '&ts=1526432697176"
'start a new browser instance
Set objIE = New InternetExplorer
'keep the browser hidden (set to True to watch it work)
objIE.Visible = False
Do Until EstaPagina = 255
'navigate to page with needed data
objIE.navigate EstaURL & EstaPagina
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'If UCase(Left(EstaURL, 211) & "1") = UCase(Left(objIE.LocationURL, (211 + Len(EstaPagina)))) And y > 1 Then Exit Do
'look at all the 'tr' elements in the element with id 'datagrid',
'and evaluate each, one at a time, using 'ele' variable
For Each ele In objIE.document.getElementById("datagrid").getElementsByTagName("tr")
'show the text content of 'tr' element being looked at
'Debug.Print ele.textContent
'each 'tr' (table row) holds one 'td' per column; Children() is zero-based
'put text of 1st 'td' in col A
Sheets("Sheet1").Range("A" & y).Value = ele.Children(0).textContent
'put text of 2nd 'td' in col B
Sheets("Sheet1").Range("B" & y).Value = ele.Children(1).textContent
'put text of 3rd 'td' in col C
Sheets("Sheet1").Range("C" & y).Value = ele.Children(2).textContent
'put text of 6th 'td' in col D
Sheets("Sheet1").Range("D" & y).Value = ele.Children(5).textContent
'put text of 23rd 'td' in col E
Sheets("Sheet1").Range("E" & y).Value = ele.Children(22).textContent
'increment row counter by 1
y = y + 1
Next
EstaPagina = EstaPagina + 1
Loop
objIE.Quit
Set objIE = Nothing
Set ele = Nothing
Range("A1").Select
Range(Selection, Selection.End(xlToRight)).Select
Range(Selection, Selection.End(xlDown)).Select
Selection.RemoveDuplicates Columns:=Array(1, 2, 3, 4, 5), _
Header:=xlNo
Application.ScreenUpdating = True
MsgBox "Dump finished", vbInformation
Range("A1").Select
'save the Excel workbook
ActiveWorkbook.Save
End Sub

XMLHTTP:
The page makes AJAX calls to fetch JSON, which it uses to update the content for each page. It does this via query string parameters, one of which is the number of records per page (default 50). You can watch this activity in the Network tab of the browser's dev tools (F12).
You can avoid using a browser altogether and issue XMLHTTP requests to the same endpoint, altering the query string parameters to get all the results. In my testing, it seems you cannot request all the results in one go. Instead, I issue a request for 1000 results, check whether there are additional results to obtain, and issue further requests in batches of 1000 to get them.
I am using jsonconverter.bas to parse the JSON. I extract the info from the JSON and load it into an array, results, so I can write everything out to the sheet in one go; this is much more efficient because it reduces I/O with the sheet.
After copying the jsonconverter.bas code into a module called JsonConverter, add a reference to Microsoft Scripting Runtime via VBE > Tools > References.
TODO:
Add error handling for failed request
VBA:
Option Explicit
Public Sub GetResults()
'VBE > Tools > References > Microsoft Scripting Runtime
Dim ws As Worksheet, results(), i As Long, totalResults As Long
Dim headers(), columnCount As Long, pageNumber As Long
Dim numberOfPages As Long, resultsPerPage As Long, json As Object
resultsPerPage = 1000
pageNumber = 1
Set ws = ThisWorkbook.Worksheets("Sheet1")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2018&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=" & resultsPerPage & "&recSP=" & pageNumber & "&recPP=" & resultsPerPage, False
.send
Set json = JsonConverter.ParseJson(.responseText)
totalResults = json("stats_sortable_player")("queryResults")("totalSize")
headers = json("stats_sortable_player")("queryResults")("row").item(1).keys
numberOfPages = json("stats_sortable_player")("queryResults")("totalP")
columnCount = UBound(headers) + 1
ReDim results(1 To totalResults, 1 To columnCount)
Dim r As Long, c As Long, dict As Object, key As Variant
For pageNumber = 1 To numberOfPages
If pageNumber > 1 Then
.Open "GET", "http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2018&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=" & resultsPerPage & "&recSP=" & pageNumber & "&recPP=" & resultsPerPage, False
.send
Set json = JsonConverter.ParseJson(.responseText)
End If
For Each dict In json("stats_sortable_player")("queryResults")("row")
r = r + 1: c = 1
For Each key In dict.keys
results(r, c) = dict(key)
c = c + 1
Next
Next
Next
End With
With ws
.Cells(1, 1).Resize(1, columnCount) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Sample of output (I appreciate it is hard to read at this size, but it gives an idea of the layout):
image link: https://i.stack.imgur.com/jiDTP.png
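On the TODO above (error handling for a failed request): a minimal sketch of one way to do it, not part of the tested code. IsSuccessStatus and TryGetText are names I made up for illustration; the only library calls used are Open, send, Status and responseText on MSXML2.XMLHTTP.

```vba
Option Explicit

' True for any 2xx HTTP status code.
Public Function IsSuccessStatus(ByVal statusCode As Long) As Boolean
    IsSuccessStatus = (statusCode >= 200 And statusCode < 300)
End Function

' Hypothetical helper: performs a synchronous GET and returns True only if
' the request completed with a 2xx status. On success, body holds the response text.
Public Function TryGetText(ByVal url As String, ByRef body As String) As Boolean
    Dim http As Object
    On Error GoTo RequestFailed          ' e.g. no connection or DNS failure
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False
    http.send
    If IsSuccessStatus(http.Status) Then
        body = http.responseText
        TryGetText = True
    End If
    Exit Function
RequestFailed:
    TryGetText = False
End Function
```

In GetResults you could then call TryGetText before JsonConverter.ParseJson and exit (or retry) when it returns False, rather than parsing whatever came back.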
Internet Explorer:
If you want to use the slower browser solution, you can concatenate the page number into the URL and loop to cover all pages. The number of pages can be extracted from the pagination on page 1.
You can see how to write the tables out below each other by looking at this answer. Change the lines with GetLastRow(ws, 1) + 2 to GetLastRow(ws, 1) + 1
Option Explicit
'VBE > Tools > References: Microsoft Internet Controls
Public Sub GetData()
Dim ie As New InternetExplorer, numberOfPages As Long
Dim url As String, i As Long
Const PLAYERS_PER_PAGE = 50
url = "http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2018&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&playerType=ALL&ts="
With ie
.Visible = True
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
numberOfPages = CLng(.querySelector(".paginationWidget-last").innerText)
'do something with page 1
If numberOfPages > 1 Then
For i = 2 To numberOfPages
ie.Navigate2 Replace$(url, "page=1", "page=" & CStr(i))
While ie.Busy Or ie.readyState < 4: DoEvents: Wend
'do something with other pages
Next
Stop 'delete me later
End If
End With
.Quit
End With
End Sub
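About the ts value at the end of the URL: judging by the question title and its size, it looks like a Unix timestamp in milliseconds, probably a cache-buster. Whether the site needs it at all is an assumption I have not verified (the code above leaves it empty). If you want to supply one, here is a minimal sketch; UnixMillis is a name I made up, and it only has one-second resolution.

```vba
Option Explicit

' Milliseconds elapsed between 1 Jan 1970 00:00:00 and the given date/time.
' The caller is responsible for passing a UTC value; Now returns local time,
' so the result is only approximate if you pass Now directly.
Public Function UnixMillis(ByVal utcMoment As Date) As Double
    UnixMillis = CDbl(DateDiff("s", #1/1/1970#, utcMoment)) * 1000#
End Function
```

Usage would be along the lines of url & Format$(UnixMillis(Now), "0"). For reference, the value in the question, 1558502019375, corresponds to 22 May 2019 05:13:39 UTC.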

I have managed to combine your second code sample with my own, although the result is not very polished. I would still like to know how to convert the URL of the "official page" into the corresponding URL that returns JSON.
Option Explicit
'VBE > Tools > References: Microsoft Internet Controls
Public Sub GetData()
Dim ele As Object
Dim y As Integer
Dim EstaPagina As Byte
EstaPagina = 1
'we will output data to excel, starting on row 1
y = 1
Dim ie As New InternetExplorer, numberOfPages As Long
Dim url As String, i As Long
Const PLAYERS_PER_PAGE = 50
url = "http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+pitching&game_type='R'&season=2018&season_type=ANY&league_code='MLB'&sectionType=sp&statType=pitching&page=1&playerType=ALL&ts="
With ie
.Visible = True
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
numberOfPages = CLng(.querySelector(".paginationWidget-last").innerText)
'do something with page 1
If numberOfPages > 1 Then
For i = 1 To numberOfPages
ie.Navigate2 Replace$(url, "page=1", "page=" & CStr(i))
While ie.Busy Or ie.readyState < 4: DoEvents: Wend 'wait for the new page before reading its table
For Each ele In ie.document.getElementById("datagrid").getElementsByTagName("tr")
'show the text content of 'tr' element being looked at
'Debug.Print ele.textContent
'each 'tr' (table row) holds one 'td' per column; Children() is zero-based
'put text of 2nd 'td' in col A
Sheets("Sheet1").Range("A" & y).value = ele.Children(1).textContent
'put text of 3rd 'td' in col B
Sheets("Sheet1").Range("B" & y).value = ele.Children(2).textContent
'put text of 4th 'td' in col C
Sheets("Sheet1").Range("C" & y).value = ele.Children(3).textContent
'put text of 5th 'td' in col D
Sheets("Sheet1").Range("D" & y).value = ele.Children(4).textContent
'put text of 6th 'td' in col E
Sheets("Sheet1").Range("E" & y).value = ele.Children(5).textContent
'increment row counter by 1
y = y + 1
Next
While ie.Busy Or ie.readyState < 4: DoEvents: Wend
' do something with other pages
Next
' Stop 'delete me later
End If
End With
.Quit
End With
On Error Resume Next
Range("A1").Select
Range(Selection, Selection.End(xlToRight)).Select
Range(Selection, Selection.End(xlDown)).Select
Selection.RemoveDuplicates Columns:=Array(1, 2, 3, 4), _
header:=xlNo
On Error Resume Next
Application.ScreenUpdating = True
MsgBox "Dump finished", vbInformation
Range("A1").Select
'save the Excel workbook
ActiveWorkbook.Save
End Sub
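On converting the page URL into the JSON URL: as far as this thread shows, there is no mechanical conversion; the endpoint was found by watching the Network tab. What can be said is that values in the page URL (season, statType) reappear in the endpoint's query string (season, stat_type). The sketch below simply parameterises the endpoint from the XMLHTTP answer. BuildStatsUrl is a hypothetical helper, and it is an unverified assumption that other values, such as stat_type=pitching with a pitching sort column, are accepted in the same way.

```vba
Option Explicit

' Hypothetical helper: rebuilds the endpoint URL used in the XMLHTTP answer
' with the season, stat type, sort column and paging values supplied by the caller.
' Assumption (not verified): the endpoint accepts values other than those shown there.
Public Function BuildStatsUrl(ByVal season As Long, ByVal statType As String, _
                              ByVal sortColumn As String, ByVal resultsPerPage As Long, _
                              ByVal pageNumber As Long) As String
    BuildStatsUrl = "http://mlb.mlb.com/pubajax/wf/flow/stats.splayer" & _
        "?season=" & season & _
        "&sort_order=%27desc%27" & _
        "&sort_column=%27" & sortColumn & "%27" & _
        "&stat_type=" & statType & _
        "&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL" & _
        "&season_type=ANY&sport_code=%27mlb%27" & _
        "&results=" & resultsPerPage & _
        "&recSP=" & pageNumber & _
        "&recPP=" & resultsPerPage
End Function
```

BuildStatsUrl(2018, "hitting", "avg", 1000, 1) reproduces the first request made in GetResults, so the two .Open lines there could share this one function.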

Related

Getting the text contents from a HTML Table without ID by using VBA

I am trying to parse an HTML table (it has no id, only a class name) from a website. Because it has no id, I am having difficulty printing the contents of the table, and I have not been able to figure it out.
Edited
Here you can see the image of the Excel file. GTIP numbers are located in column A. My aim is that, when the VBA code runs, each GTIP number from column A is entered into the search field named "GTİP Ara" on the website https://www.isib.gov.tr/urun-arama/. The companies holding that GTIP are then written to the columns next to it in the same row.
For the third row, GTIP number "841013000000" was entered into the "GTİP Ara" search field and, as a result, Company 2, ... Company 9 were returned to the neighbouring columns.
Sometimes a GTIP number returns nothing because no company holds that number.
For example, 841410819000 returns companies, but 841112101000 returns the error "Aradığınız Sonuç Bulunamadı!" (the result you searched for could not be found). That is why I am trying to add an If statement, but it doesn't work properly.
Now, because of a mistake somewhere in my code block, the returned values are the same for every GTIP; you can see the result in the second image.
Sub GrabLastNames()
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.navigate "https://www.isib.gov.tr/urun-arama"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 2
For i = 2 To 269
objIE.document.getElementById("gtip-ara").Value = _
Sheets("Sheet1").Range("A:A").Cells(i, 1).Value
objIE.document.getElementById("ara").Click
'If objIE.document.getElementsByClassName("error").getElementsByTagName("span").Value <> "Aradığınız Sonuç Bulunamadı!" Then
For Each ele In objIE.document.getElementsByClassName("urun-arama-table table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
Sheets("Sheet1").Cells(i, y).Value = ele.Children(0).textContent
y = y + 1
Next
y = 2
Next i
End Sub
I only had one working GTIP, 841410819000, so I can't test this more thoroughly.
You don't really need to know the ID. The page has one table, so getting it with getElementsByClassName like you did, or just getElementsByTagName as in my example, should work fine. My code is probably much the same as yours, just with a few pauses to (a) not spam the website and (b) give IE a chance to get itself together (it's IE, after all).
If you can provide a few more working GTIP values, I can test a bit more.
' requires references 'Microsoft HTML Object Library' and 'Microsoft Internet Controls'
Sub Main()
Dim Browser As New InternetExplorer
Browser.Visible = True
Browser.navigate "https://www.isib.gov.tr/urun-arama"
Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
Dim TableRows As IHTMLElementCollection
Dim TableRow As IHTMLElement
Dim SourceRow As Integer
Dim ResultColumn As Integer
Application.Wait (Now + TimeValue("0:00:05"))
SourceRow = 2 ' Skip Header
Do
Debug.Print "Trying " & Sheet.Cells(SourceRow, 1).Value
Browser.Document.getElementById("gtip-ara").Value = Sheet.Cells(SourceRow, 1).Value
Browser.Document.getElementById("ara").Click
Application.Wait (Now + TimeValue("0:00:02"))
Do While Browser.Busy
DoEvents
Loop
If Browser.Document.getElementsByTagName("table").Length > 0 Then
Debug.Print " > Found Results"
Set TableRows = Browser.Document.getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
ResultColumn = 2 ' dont overwrite search term
For Each TableRow In TableRows
Sheet.Cells(SourceRow, ResultColumn).Value = TableRow.innerText
ResultColumn = ResultColumn + 1
Next TableRow
Else
Debug.Print " - No Results Found"
End If
If Sheet.Cells(SourceRow + 1, 1).Value = "" Then
Exit Do
Else
SourceRow = SourceRow + 1
End If
Application.Wait (Now + TimeValue("0:00:05"))
Loop
Browser.Quit
Set Browser = Nothing
End Sub
Update
Updated my code again: it no longer spawns lots of windows and prints only the company name (as your example did).
' requires Microsoft HTML Object Library
' requires Microsoft XML, v6.0
Sub Main()
Dim XHR As XMLHTTP60
Dim Document As HTMLDocument
Dim ResultRows As IHTMLElementCollection
Dim ResultRow As IHTMLElement
Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
Dim SheetRow As Integer
Dim SheetColumn As Integer
Dim LastRow As Integer
LastRow = Sheet.Cells(Sheet.Rows.Count, "A").End(xlUp).Row
For SheetRow = 2 To LastRow
Debug.Print "Trying GTIP:" & Sheet.Cells(SheetRow, 1).Value
Application.StatusBar = "Status: " & Right(String(Len(CStr(LastRow - 1)), "0") & CStr(SheetRow - 1), Len(CStr(LastRow - 1))) & "/" & CStr(LastRow - 1)
Set XHR = New XMLHTTP60
XHR.Open "POST", "https://www.isib.gov.tr/urun-arama", False
XHR.setRequestHeader "content-type", "application/x-www-form-urlencoded"
XHR.send "gtipkategori=" & Sheet.Cells(SheetRow, 1).Value
Set Document = New HTMLDocument
Document.body.innerHTML = XHR.responseText
If Document.getElementsByTagName("table").Length > 0 Then
Debug.Print " > Found Results"
SheetColumn = 2 ' First Column to output data into
Set ResultRows = Document.getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
For Each ResultRow In ResultRows
Sheet.Cells(SheetRow, SheetColumn).Value = ResultRow.getElementsByTagName("td")(0).innerText ' 0 - company name
SheetColumn = SheetColumn + 1
Next
Else
Debug.Print " - No Results"
End If
Set XHR = Nothing
Set Document = Nothing
Application.Wait (Now + TimeValue("0:00:01")) ' slow down requests
Next
Application.StatusBar = "Complete"
End Sub

Data Scraping Elements By ClassName

I am trying to pull data from a website. I want to copy the unit size, e.g. 10' x 5' (class name "unit_size medium"), into column 1, which I can do successfully. I also want the promo, e.g. "1st Month Free!" (class name "promo_offers"), in column 2. The problem is that the promo is only present for some units, so the data ends up misaligned: I get promos in the first 4 cells and then an error. I want to copy the promo only for those units that have one; otherwise the cell should be blank or set to some other value. Below is the code.
Please suggest how to structure the code.
Sub GetClassNames()
Dim html As HTMLDocument
Dim objIE As Object
Dim element As IHTMLElement
Dim ie As InternetExplorer
Dim elements As IHTMLElementCollection
Dim result As String 'string variable that will hold our result link
Dim count As Long
Dim erow As Long
'initiating a new instance of Internet Explorer and asigning it to objIE
Set objIE = New InternetExplorer
'make IE browser visible (False would allow IE to run in the background)
objIE.Visible = True
'navigate IE to this web page (a pretty neat search engine really)
objIE.navigate "https://www.allstorageonline.com/storage-units/texas/amarillo/all-storage-hardy-115423/#utm_source=GoogleLocal&utm_medium=WWLocal&utm_campaign=115423"
'wait here a few seconds while the browser is busy
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
count = 0
Set html = objIE.document
Set elements = html.getElementsByClassName("unit_size medium")
For Each element In elements
If element.className = "unit_size medium" Then
erow = Sheet2.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
Cells(erow, 1) = html.getElementsByClassName("unit_size medium")(count).innerText
Cells(erow, 2) = html.getElementsByClassName("promo_offers")(count).innerText
count = count + 1
End If
Next element
End Sub
I would simply wrap the attempt to access each element in On Error Resume Next. Reserve a place for it in an output array so that, if the element is not present, that place remains empty.
Option Explicit
'VBE > Tools > References:
' Microsoft Internet Controls
Public Sub GetData()
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 "https://www.allstorageonline.com/storage-units/texas/amarillo/all-storage-hardy-115423/#utm_source=GoogleLocal&utm_medium=WWLocal&utm_campaign=115423"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim listings As Object, listing As Object, headers(), results(), r As Long, c As Long
headers = Array("size", "features", "promo", "in store", "web")
Set listings = .document.getElementById("small_units_accordion_panel").getElementsByTagName("li")
'.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
ReDim results(1 To listings.Length, 1 To UBound(headers) + 1)
For Each listing In listings
r = r + 1
On Error Resume Next
results(r, 1) = listing.getElementsByClassName("unit_size medium")(0).innerText
results(r, 2) = listing.getElementsByClassName("features")(0).innerText
results(r, 3) = listing.getElementsByClassName("promo_offers")(0).innerText
results(r, 4) = listing.getElementsByClassName("board_rate")(0).innerText
results(r, 5) = listing.getElementsByClassName("price")(0).innerText
On Error GoTo 0
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
End Sub
All boxes:
Option Explicit
'VBE > Tools > References:
' Microsoft Internet Controls
Public Sub GetData()
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 "https://www.allstorageonline.com/storage-units/texas/amarillo/all-storage-hardy-115423/#utm_source=GoogleLocal&utm_medium=WWLocal&utm_campaign=115423"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features", "promo", "in store", "web")
Set list = .document.getElementsByClassName("main_unit")
'.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
Dim rowCount As Long
rowCount = .document.querySelectorAll(".main_unit li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByTagName("li")
r = r + 1
On Error Resume Next
results(r, 1) = item.getElementsByClassName("unit_size medium")(0).innerText
results(r, 2) = item.getElementsByClassName("features")(0).innerText
results(r, 3) = item.getElementsByClassName("promo_offers")(0).innerText
results(r, 4) = item.getElementsByClassName("board_rate")(0).innerText
results(r, 5) = item.getElementsByClassName("price")(0).innerText
On Error GoTo 0
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
End Sub

Extract table from webpage using VBA

I would like to extract the table from an HTML page into Excel using VBA.
I have tried the following code several times, changing parts of it, but I keep getting an error.
Sub GrabTable()
'dimension (set aside memory for) our variables
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer
'start a new browser instance
Set objIE = New InternetExplorer
'keep the browser hidden (set to True to watch it work)
objIE.Visible = False
'navigate to page with needed data
objIE.navigate "http://www.bursamalaysia.com/market/listed-companies/company-announcements/5923061"
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'we will output data to excel, starting on row 1
y = 1
'look at all the 'tr' elements in the 'table' with id 'InputTable2',
'and evaluate each, one at a time, using 'ele' variable
For Each ele In objIE.document.getElementByClassName("InputTable2").getElementsByTagName("tr")
'show the text content of 'td' element being looked at
Debug.Print ele.textContent
'each 'tr' (table row) element contains 2 children ('td') elements
'put text of 1st 'td' in col A
Sheets("Sheet1").Range("A" & y).Value = ele.Children(0).textContent
'put text of 2nd 'td' in col B
Sheets("Sheet1").Range("B" & y).Value = ele.Children(1).textContent
y = y + 1
'repeat until last ele has been evaluated
Next
End Sub
I show you two methods:
1. Using IE: the data is inside an iframe, which needs to be negotiated.
2. Using an XMLHTTP request: much faster and without opening a browser. It uses the first part of the iframe document URL, which is what the iframe is navigating to.
In both cases I access the tables containing the company name and then the disclosure info table. For the main disclosure info table I copy the outerHTML to the clipboard and paste it into Excel to avoid looping over all the rows and columns. You can simply loop over the tr (table rows) and the td (table cells) within them instead.
IE:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, clipboard As Object
With IE
.Visible = True
.navigate "http://www.bursamalaysia.com/market/listed-companies/company-announcements/5923061"
While .Busy Or .readyState < 4: DoEvents: Wend
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
With .document.getElementById("bm_ann_detail_iframe").contentDocument
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1) = .getElementsByClassName("company_name")(0).innerText
clipboard.SetText .getElementsByTagName("table")(1).outerHTML
clipboard.PutInClipboard
End With
ThisWorkbook.Worksheets("Sheet1").Cells(2, 1).PasteSpecial
.Quit
End With
End Sub
XMLHTTP:
You can extract a different URL from the front-end of the iframe URL and use that as shown below.
Here is the section of your original HTML that shows the iframe and the associated new URL info:
Option Explicit
Public Sub GetTable()
Dim sResponse As String, html As HTMLDocument, clipboard As Object
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://disclosure.bursamalaysia.com/FileAccess/viewHtml?e=2891609", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
html.body.innerHTML = sResponse
With html
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1) = .querySelector(".company_name").innerText
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText .querySelector(".InputTable2").outerHTML
clipboard.PutInClipboard
End With
ThisWorkbook.Worksheets("Sheet1").Cells(2, 1).PasteSpecial
End Sub
Try it this way.
Sub Web_Table_Option_Two()
Dim HTMLDoc As New HTMLDocument
Dim objTable As Object
Dim lRow As Long
Dim lngTable As Long
Dim lngRow As Long
Dim lngCol As Long
Dim ActRw As Long
Dim objIE As InternetExplorer
Set objIE = New InternetExplorer
objIE.Navigate "http://www.bursamalaysia.com/market/listed-companies/company-announcements/5923061"
Do Until objIE.ReadyState = 4 And Not objIE.Busy
DoEvents
Loop
Application.Wait (Now + TimeValue("0:00:03")) 'wait for java script to load
HTMLDoc.body.innerHTML = objIE.Document.body.innerHTML
With HTMLDoc.body
Set objTable = .getElementsByTagName("table")
For lngTable = 0 To objTable.Length - 1
For lngRow = 0 To objTable(lngTable).Rows.Length - 1
For lngCol = 0 To objTable(lngTable).Rows(lngRow).Cells.Length - 1
ThisWorkbook.Sheets("Sheet1").Cells(ActRw + lngRow + 1, lngCol + 1) = objTable(lngTable).Rows(lngRow).Cells(lngCol).innerText
Next lngCol
Next lngRow
ActRw = ActRw + objTable(lngTable).Rows.Length + 1
Next lngTable
End With
objIE.Quit
End Sub

VBA reads HTML from the old page after clicking submit button

I am not a programmer, but I have managed to learn a few things in VBA. Now, on a certain website, I face a problem that does not occur on others.
What should happen is that a page form should be completed with data, submit button clicked and then I want to get some data from the result page.
The first phase works fine but it seems that no matter what I do the VBA still reads data from the page before submit was clicked.
The code is:
Sub VIES2()
'Launch Internet Explorer and pause further action until it reaches the ready state
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
Do While IE.ReadyState <> 4: DoEvents: Loop
'Fill in the form with the appropriate values and click the verify button
IE.document.getElementbyId("countryCombobox").Value = "IT"
IE.document.getElementbyId("number").Value = "01802840023"
IE.document.getElementbyId("requesterCountryCombobox").Value = "IT"
IE.document.getElementbyId("requesterNumber").Value = "01802840023"
IE.document.getElementbyId("submit").Click
'Test of obtaining the description and the request identifier
For t = 1 To 999999
Next t
Application.Wait Now + TimeValue("00:00:10")
Do While IE.ReadyState <> 4: DoEvents: Loop
For t = 1 To 999999
Next t
Application.Wait Now + TimeValue("00:00:10")
MsgBox IE.LocationURL
Set Text = IE.document.getElementsbyClassName("layout-content")
For Each Element In Text
MsgBox Element.innerText
Next
Set Test = IE.document.getElementsbyTagName("TABLE")
For Each Element In Test
MsgBox Element.innerText
Next
End Sub
I have tried breakpoints, various wait loops and Application.Wait, as suggested in similar questions where they seem to have worked. Here, even long after the page has fully loaded, the code still reads the old page; at least, the URL and some of the data it pulls point to that.
UPDATE: I should also add that I have tried to make the macro refresh the page, but that clears the input content. What is interesting is that the target URL is:
http://ec.europa.eu/taxation_customs/vies/vatResponse.html
If I change the initial page to this the browser instantly redirects to the original page with notification that initial data is needed. The macro then completes the data and clicks submit button. In this case IE.LocationURL indicates this URL:
http://ec.europa.eu/taxation_customs/vies/vatResponse.html
but according to the content I get with getElementsbyClassName still reads elements from the initial page:
http://ec.europa.eu/taxation_customs/vies/?locale=pl
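This worked for me to print out the VAT response table.
Note:
PtrSafe is required on 64-bit Office and accepted by any VBA7 host (Office 2010 or later); remove it only if you are on an older version that does not recognise the keyword.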
Code:
Option Explicit
Declare PtrSafe Sub sleep Lib "kernel32" Alias "Sleep" (ByVal dwmilliseconds As Long)
Public Sub VIES2()
Application.ScreenUpdating = False
Dim IE As Object
'Launch Internet Explorer and pause further action until it reaches the ready state
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.navigate "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
Do While IE.ReadyState <> 4: DoEvents: Loop
'Fill in the form with the appropriate values and click the verify button
IE.document.getElementById("countryCombobox").Value = "IT"
IE.document.getElementById("number").Value = "01802840023"
IE.document.getElementById("requesterCountryCombobox").Value = "IT"
IE.document.getElementById("requesterNumber").Value = "01802840023"
IE.document.getElementById("submit").Click
sleep (5000) 'or increase to 10000
Dim tbl As Object
Set tbl = IE.document.getElementById("vatResponseFormTable")
Dim ws As Worksheet
Set ws = ActiveWorkbook.Worksheets.Add
ws.Name = "Results"
Dim rng As Range, currentRow As Object, currentColumn As Object, i As Long, outputRow As Long
outputRow = outputRow + 1
Set rng = ws.Range("B" & outputRow)
For Each currentRow In tbl.Rows
For Each currentColumn In currentRow.Cells
rng.Value = currentColumn.outerText
Set rng = rng.Offset(, 1)
i = i + 1
Next currentColumn
outputRow = outputRow + 1
Set rng = rng.Offset(1, -i)
i = 0
Next currentRow
Application.ScreenUpdating = True
End Sub
Output:
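On the PtrSafe note above: a common way to keep one module compiling on both old and new Office versions is conditional compilation on the built-in VBA7 constant. This is a sketch of that pattern, not a change to the tested code; PauseHalfSecond is just an illustrative name.

```vba
Option Explicit

#If VBA7 Then
    ' Office 2010 and later (32-bit and 64-bit): PtrSafe is accepted, and required on 64-bit.
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    ' Office 2007 and earlier: the PtrSafe keyword is not recognised.
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

' Illustrative caller: blocks the current thread for roughly 500 ms.
Public Sub PauseHalfSecond()
    Sleep 500
End Sub
```

The Declare lines have to sit at the top of the module, before any Sub or Function.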
Although QHarr's solution works on my end, I'm providing another one with no hardcoded delay in the script.
Using IE as your question was:
Sub Get_Data()
Dim HTML As HTMLDocument, post As Object, elems As Object
Dim elem As Object, r&, c&
With New InternetExplorer
.Visible = False
.navigate "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
With HTML
.getElementById("countryCombobox").Value = "IT"
.getElementById("number").Value = "01802840023"
.getElementById("requesterCountryCombobox").Value = "IT"
.getElementById("requesterNumber").Value = "01802840023"
.getElementById("submit").Click
Do: Set post = .getElementById("vatResponseFormTable"): DoEvents: Loop While post Is Nothing
For Each elems In post.Rows
For Each elem In elems.Cells
c = c + 1: Cells(r + 1, c) = elem.innerText
Next elem
c = 0: r = r + 1
Next elems
End With
.Quit
End With
End Sub
References to add to the project:
1. Microsoft Internet Controls
2. Microsoft HTML Object Library
Using xmlhttp request (It is way faster than IE):
Sub Get_Data()
Dim elems, elem As Object
Dim QueryString$, S$, r&, c&
QueryString = "memberStateCode=IT&number=01802840023&traderName=&traderStreet=&traderPostalCode=&traderCity=&requesterMemberStateCode=IT&requesterNumber=01802840023&action=check&check=Weryfikuj"
With New XMLHTTP60 'class name exposed by the Microsoft XML, v6.0 reference
.Open "POST", "http://ec.europa.eu/taxation_customs/vies/vatResponse.html", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send QueryString
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
For Each elems In .getElementById("vatResponseFormTable").Rows
For Each elem In elems.Cells
c = c + 1: Cells(r + 1, c) = elem.innerText
Next elem
c = 0: r = r + 1
Next elems
End With
End Sub
References to add to the project:
1. Microsoft XML, V6
2. Microsoft HTML Object Library
Most of the time you should first check whether a REST or SOAP service is available for this kind of task.
Using an Internet Explorer instance for this is total overkill.
Try this simple function, which uses the SOAP service to validate VAT numbers:
Function IsVatValid(country_code, vat_number)
Dim objHTTP As Object
Dim xmlDoc As Object
Dim sURL As String, sEnv As String
Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
sURL = "http://ec.europa.eu/taxation_customs/vies/services/checkVatService"
sEnv = "<s11:Envelope xmlns:s11='http://schemas.xmlsoap.org/soap/envelope/'>" & _
"<s11:Body>" & _
"<tns1:checkVat xmlns:tns1='urn:ec.europa.eu:taxud:vies:services:checkVat:types'>" & _
"<tns1:countryCode>" & country_code & "</tns1:countryCode>" & _
"<tns1:vatNumber>" & vat_number & "</tns1:vatNumber>" & _
"</tns1:checkVat>" & _
"</s11:Body>" & _
"</s11:Envelope>"
objHTTP.Open "Post", sURL, False
objHTTP.setRequestHeader "Content-Type", "text/xml"
objHTTP.setRequestHeader "SOAPAction", "checkVatService"
objHTTP.send (sEnv)
objHTTP.waitForResponse
Set xmlDoc = CreateObject("HTMLFile")
xmlDoc.body.innerHTML = objHTTP.responsetext
IsVatValid = CBool(xmlDoc.getElementsByTagName("valid")(0).innerHTML)
Set xmlDoc = Nothing
Set objHTTP = Nothing
End Function
You can then validate each VAT number with a single call:
Debug.Print IsVatValid("IT", "01802840023")
>>> True

Fetching data from web page

I am trying to fetch the publication date corresponding to each patent number.
Here is the Excel sheet:
The database is espacenet.com
Here's the link for the first patent you see in the Excel sheet:
http://worldwide.espacenet.com/searchResults?compact=false&PN=US7055777B2&ST=advanced&locale=en_EP&DB=EPODOC
Under the "Publication Info" header, I need to get the date after matching the patent number with the one in the Excel sheet.
Here's the code:
Sub tryextraction()
Dim ie As New InternetExplorer
Dim sdd As String
Dim tdd() As String
Dim num0 As Integer
Dim num1 As Integer
Dim doc As HTMLDocument
Dim i As Integer
Dim j As Integer
ie.Visible = True
num1 = ActiveSheet.UsedRange.Rows.Count
For num0 = 2 To num1
ie.navigate "http://worldwide.espacenet.com/searchResults?compact=false&PN=" & Range("A" & num0) & "&ST=advanced&locale=en_EP&DB=EPODOC"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.document
sdd = Trim(doc.getElementsByTagName("td")(5).innerText)
tdd() = Split(sdd, vbLf)
j = UBound(tdd)
For i = 0 To j
If InStr(tdd(i), "(") <> 0 Then
tdd(i) = Replace(tdd(i), " ", "")
tdd(i) = Replace(tdd(i), "(", "")
tdd(i) = Replace(tdd(i), ")", "")
If tdd(i) = Range("A" & num0).Value Then
Range("B" & num0).Value = tdd(i + 1)
End If
End If
Next i
Next num0
ie.Quit
End Sub
The code does not raise any error, but the "Publication Date" column is still blank after it finishes running.
The HTML element that contains the publication info is being retrieved correctly.
The text in the document has trailing whitespace characters after the ID you are searching for, so tdd(i) = Range("A" & num0).Value never evaluates to True. It is not just a space, so a simple Trim(tdd(i)) = Range("A" & num0).Value does not help. Try InStr(tdd(i), Range("A" & num0).Value) > 0 instead. If that is not good enough, you will have to explicitly remove the CR/LF characters from the end of the string before comparing.
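A minimal sketch of that last suggestion, assuming the stray characters are carriage returns, line feeds, tabs or non-breaking spaces (which characters the page actually contains is not confirmed here). StripWhitespace is a hypothetical helper name, not part of the original code:

```vba
' Hypothetical helper: removes spaces, tabs, CR, LF and non-breaking spaces
' so that two identifiers can be compared on their visible characters only.
Public Function StripWhitespace(ByVal s As String) As String
    Dim ch As Variant
    For Each ch In Array(" ", vbTab, vbCr, vbLf, ChrW$(160))
        s = Replace(s, ch, "")
    Next ch
    StripWhitespace = s
End Function
```

In the question's loop the comparison could then be written as If StripWhitespace(tdd(i)) = StripWhitespace(CStr(Range("A" & num0).Value)) Then.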
There are often multiple publication dates under the publication info header.
The following script obtains all of these dates, together with the line preceding each one (so you have the associated publication identifier along with the date).
It loops from row 2 of the ActiveSheet to the last populated row, picking up the publication numbers from column A and writing the results starting at column B. Depending on how many dates there are, the data will extend across several columns from B onwards.
Regex:
A regex of ^(.*)\s\d{4}-\d{2}-\d{2} is used to retrieve the date pattern and the preceding line, i.e. the publication identifier and the date.
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, html As New HTMLDocument, url As String, pubInfo As Object
Dim loopRange As Range, iRow As Range, counter As Long
'example US7055777B2
Application.ScreenUpdating = False
With ActiveSheet
Set loopRange = .Range("A2:A" & .Cells(.Rows.Count, "A").End(xlUp).Row)
End With
With IE
.Visible = True
counter = 2 '<== start from row 2
For Each iRow In loopRange
If Not IsEmpty(iRow) Then
url = "https://worldwide.espacenet.com/searchResults?compact=false&PN=" & iRow.Value & "&ST=advanced&locale=en_EP&DB=EPODOC"
.navigate url
While .Busy Or .readyState < 4: DoEvents: Wend
Set html = .document
Do
DoEvents
On Error Resume Next
Set pubInfo = html.querySelector(".publicationInfoColumn")
On Error GoTo 0
Loop While pubInfo Is Nothing
Dim tempArr()
tempArr = GetDateAndPatent(pubInfo.innerText, "^(.*)\s\d{4}-\d{2}-\d{2}") '"(?m)^(.*)\s\d{4}-\d{2}-\d{2}" '<==This is not supported
With ActiveSheet
.Cells(counter, 2).Resize(1, UBound(tempArr) + 1) = tempArr
End With
End If
counter = counter + 1
Next iRow
.Quit '<== Remember to quit application
End With
Application.ScreenUpdating = True
End Sub
Public Function GetDateAndPatent(ByVal inputString As String, ByVal sPattern As String) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .test(inputString) Then
Set matches = .Execute(inputString)
For Each iMatch In matches
ReDim Preserve arrMatches(i)
arrMatches(i) = iMatch.Value
i = i + 1
Next iMatch
End If
End With
GetDateAndPatent = arrMatches
End Function
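To see what the pattern captures without opening a browser, a small sketch like the one below can be run on a made-up string. The sample text and the DemoDatePattern name are invented for illustration and are not taken from Espacenet. The sketch assumes that each identifier line is followed by a date on the next line, separated by a line feed:

```vba
Public Sub DemoDatePattern()
    Dim sample As String, m As Object
    ' Hypothetical input: two identifier lines, each followed by a date line
    sample = "US7055777 (B2)" & vbLf & "2006-06-06" & vbLf & _
             "US2004026566 (A1)" & vbLf & "2004-02-12"
    With CreateObject("VBScript.RegExp")
        .Global = True      ' find every match, not just the first
        .MultiLine = True   ' let ^ match at the start of each line
        .Pattern = "^(.*)\s\d{4}-\d{2}-\d{2}"
        For Each m In .Execute(sample)
            ' SubMatches(0) is the identifier line; the last 10 characters
            ' of the full match are the date
            Debug.Print m.SubMatches(0) & " -> " & Right$(m.Value, 10)
        Next m
    End With
End Sub
```

Because . does not match a line break while \s does, the group (.*) stops at the end of the identifier line and \s consumes the break before the date.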