I am trying to pull cash flows for a bunch of companies in a folder I have created. I am pulling the information from market watch. An example of the websites I am pulling tables from is https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow. All of the Ticker symbols for each company is in column A. My code is breaking on the following line with an error of "Runtime Error "91".
Set tRow = hTable.getElementsByTagName("tr")
I know that there are trs in the HTML code. Also, I ran the code for a couple of the companies) and then when I went to do it again the code never got past the first one (I didn't have the save and close functions in there the first time because I was testing it so I exited out of each of the workbooks I did and I didn't save them).
Public Sub Companies()
Dim sResponse As String, html As HTMLDocument, hTable As Object
Application.ScreenUpdating = False
Dim Last As Long
Dim i As Integer
Dim ws As Worksheet
Last = Cells(Rows.Count, "A").End(xlUp).Row
For i = Last To 572 Step -1
M = 0
Workbooks.Open "C:***\Desktop\Stock Portfolio\Stock Valuations\Temporary Valuations\" & Cells(i, "A").Value & ".xlsx"
ThisWorkbook.Activate
Set ws = Workbooks(Cells(i, "A").Value).Sheets.Add(After:= _
Workbooks(Cells(i, "A").Value).Sheets(Workbooks(Cells(i, "A").Value).Sheets.Count))
ws.Name = "Cash Flow"
ThisWorkbook.Activate
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.marketwatch.com/investing/stock/" & Cells(i, "A").Value & "/financials/cash-flow", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
ThisWorkbook.Activate
With html
.body.innerHTML = sResponse
Set hTable = .getElementsByTagName("tbody")(0)
WriteTable hTable, 1, Workbooks(Cells(i, "A").Value).Sheets("Cash Flow")
End With
ThisWorkbook.Activate
M = 3
With html
.body.innerHTML = sResponse
Set hTable = .getElementsByTagName("tbody")(1)
WriteTable hTable, 1, Workbooks(Cells(i, "A").Value).Sheets("Cash Flow")
End With
Workbooks(Cells(i, "A")).Save
Workbooks(Cells(i, "A")).Close
Next
End Sub
I used the code above and then I used the public code below (where the problem occurs) to get the table.
Public Sub WriteTable(ByVal hTable As Object, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
If ws Is Nothing Then Set ws = ActiveSheet
Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
startRow = (M * 20) + 1
r = startRow
With ws
Set tRow = hTable.getElementsByTagName("tr")
For Each tr In tRow
r = r + 1: c = 1
Set tCell = tr.getElementsByTagName("td")
For Each td In tCell
.Cells(r, c).Value = td.innerText
c = c + 1
Next td
Next tr
End With
End Sub
Not the ideal answer but always inspect the response you are getting. Furthermmore, check whether hTable is nothing. If I inspect the response I notice the site is on the look out for bots and blocking with a captcha.
Pardon Our Interruption...
As you were browsing www.marketwatch.com something about your browser made us think you were a bot. There are a few reasons this
might happen:
You're a power user moving through this website with super-human
speed. You've disabled JavaScript in your web browser. A third-party
browser plugin, such as Ghostery or NoScript, is preventing JavaScript
from running. Additional information is available in this support
article.
After completing the CAPTCHA below, you will immediately regain access
to www.marketwatch.com.
If this is indeed the case for you, you have a few options:
1) Search for an alternative source of the info
2) Use browser automation (selenium basic) and hope that this alone, or with some appropriate waits may get you there
3) Change IP and user agent. If you were able to originally run XHR against this page then it may be that now you have been added to a watch list for suspected bots by the site. Alternating IP and user-agent is not something I would do.
Related
I passing EAN numbers of the certain movies and extracting movie name and ASIN number in Amazon.
"https://www.amazon.de/s?k=7321925005738&__mk_de_DE=ÅMÅŽÕÑ&ref=nb_sb_noss"
But in Amazon website,i face problem like some time the search results contains Sponsored product result also (It may or may not come),How ever i want to extract other than Sponsored products.
So when ever i debug.print the Amazon ASIN number and Movie name,it prints all the ASIN number and movie name (Which includes sponsored product).
For identifying the sponsored product,the way i am using is data-component-type="sp-sponsored-result"
in the response text,
where as actual product does not contain this id at "data-component-type" all,so i am not able to separate the actual movie name (Other than sponsored result)
I tried if not xxxx then ,still my code prints all here i am attaching my code
Here is my code
Sub Amazon_Pull()
Dim Link_2 As String
Link_2 = "https://www.amazon.de/s?k=7321925005738&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
Dim xhr As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
'Debug.Print html.body.innerHTMLDebug.Print html.getElementsByTagName("div").getAttribute("data-index").Length
Dim hTable As Object
Dim hba As Object
Set hTable = html.getElementsByTagName("div")
For Each hba In hTable
If Left(hba.getAttribute("data-asin"), 1) = "B" Then
If hba.getElementsByTagName("div")(2).getAttribute("data-component-type") <> "sp-sponsored-result" Then
Debug.Print hba.getAttribute("data-asin")
End If
End If
Next hba
Set xhr = Nothing
Set html = Nothing
'-------------
End Sub
Use a css attribute = value selector to restrict to the appropriate nodes
Dim nodeList As Object, i As Long
Set nodelist = hba.querySelectorall("[data-asin]")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).getAttribute("data-asin")
Next
You can remove your conditional statements and add all the conditional logic into the css selector with starts with ^ operator for the character B
Dim nodeList As Object, i As Long
Set nodelist = hba.querySelectorall("[data-asin^=B]")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).getAttribute("data-asin")
Next
You can do it quick and dirty like this. But if the word "Gesponsert" is part of film title, it fails ;-)
Sub Amazon_Pull()
Dim Link_2 As String
Dim xhr As MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument
Dim hTable As Object
Dim hba As Object
Dim i As Long
Link_2 = "https://www.amazon.de/s?k=7321925005738"
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.querySelectorAll("div[data-index]")
For i = 0 To hTable.Length - 1
If InStr(1, hTable(i).innerText, "Gesponsert") = 0 Then
Debug.Print hTable(i).getAttribute("data-asin") & " " & hTable(i).getElementsByTagName("h2")(0).innerText
End If
Next i
Set xhr = Nothing
Set html = Nothing
End Sub
In my opinion it's always better to use the code structure of a page and not a part of the content. I know, that's not always possible and it's often more complex.
To check if an offer on amazon is sponsored, you can use the structure of the page code like this. One advantage is that it also works on the international Amazon platforms, regardless of the national language.
(Not tested, because Amazon blocked me as a bot.)
Sub Amazon_Pull()
Dim Link_2 As String
Dim xhr As MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument
Dim hTable As Object
Dim hba As Object
Dim i As Long
Dim check As Long
Dim sponsored As Boolean
Dim checkSponsored As Object
Link_2 = "https://www.amazon.de/s?k=7321925005738"
'Link_2 = "https://www.amazon.de/s?k=apple"
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.querySelectorAll("div[data-index]")
For i = 0 To hTable.Length - 1
sponsored = False
Set checkSponsored = hTable(i).querySelectorAll("div[data-component-type]")
For check = 0 To checkSponsored.Length - 1
If checkSponsored.getAttribute("data-component-type") = "sp-sponsored-result" Then
sponsored = True
End If
Next check
If Not sponsored Then
Debug.Print hTable(i).getAttribute("data-asin") & " " & hTable(i).getElementsByTagName("h2")(0).innerText
End If
Next i
Set xhr = Nothing
Set html = Nothing
End Sub
Public Sub D_Galoplar()
Application.ScreenUpdating = False
Dim Asay(1 To 250)
Dim Jsay(1 To 100)
For q = 2 To Sheets("Y").Columns("A:A").Find(What:="boş").Row - 1
Asay(q - 1) = Sheets("Y").Range("A" & q)
Next q
For q = 2 To Sheets("Y").Columns("C:C").Find(What:="boş").Row - 1
Jsay(q - 1) = Sheets("Y").Range("C" & q)
Next q
For w = 1 To 250
Cells.Delete Shift:=xlUp
Range("A1").Select
If Asay(w) < 1 Then Exit For
Dim elem As Object, trow As Object
Dim R&, C&, s$
With New XMLHTTP60
.Open "POST", "https://yenibeygir.com/at/getatdetaytab", False
.setRequestHeader "content-type", "application/x-www-form-urlencoded; charset=UTF-8"
.send "tab=galopTab&id=" & Asay(w)
s = .responseText
End With
With New HTMLDocument
.body.innerHTML = s
For Each elem In .getElementsByClassName("at_Galoplar")(0).Rows
For Each trow In elem.Cells
C = C + 1: Cells(R + 1, C) = trow.innerText
Next trow
C = 0: R = R + 1
Next elem
End With
Cells.UnMerge
Range("A1").Select
If Range("A1048576").End(xlUp).Row < 2 Then GoTo ATLA2
Columns("A:A").Insert
For i = 2 To Range("B1048576").End(xlUp).Row - 1
Range("A" & i) = Asay(w)
Next i
Range("O2").FormulaR1C1 = "=IF(ISBLANK(RC[-9]),""-"",IF(ISNUMBER(RC[-9]),RC[-9]/4,((LEFT(RC[-9],1)*6000)+(MID(RC[-9],3,2)*100)+(RIGHT(RC[-9],1)*10))/400))"
Range("P2").FormulaR1C1 = "=IF(ISBLANK(RC[-9]),""-"",IF(ISNUMBER(RC[-9]),RC[-9]/6,((LEFT(RC[-9],1)*6000)+(MID(RC[-9],3,2)*100)+(RIGHT(RC[-9],1)*10))/600))"
Range("Q2").FormulaR1C1 = "=IF(ISBLANK(RC[-9]),""-"",IF(ISNUMBER(RC[-9]),RC[-9]/8,((LEFT(RC[-9],1)*6000)+(MID(RC[-9],3,2)*100)+(RIGHT(RC[-9],1)*10))/800))"
Range("R2").FormulaR1C1 = "=IF(ISBLANK(RC[-9]),""-"",IF(ISNUMBER(RC[-9]),RC[-9]/10,((LEFT(RC[-9],1)*6000)+(MID(RC[-9],3,2)*100)+(RIGHT(RC[-9],1)*10))/1000))"
Range("S2").FormulaR1C1 = "=IF(ISBLANK(RC[-9]),""-"",IF(ISNUMBER(RC[-9]),RC[-9]/12,((LEFT(RC[-9],1)*6000)+(MID(RC[-9],3,2)*100)+(RIGHT(RC[-9],1)*10))/1200))"
Range("T2").FormulaR1C1 = "=IF(ISBLANK(RC[-9]),""-"",IF(ISNUMBER(RC[-9]),RC[-9]/14,((LEFT(RC[-9],1)*6000)+(MID(RC[-9],3,2)*100)+(RIGHT(RC[-9],1)*10))/1400))"
Range("O2:T2").Copy
Range("O2:O" & Range("A1048576").End(xlUp).Row).PasteSpecial Paste:=xlPasteFormulas
Cells.Copy
Cells.PasteSpecial Paste:=xlPasteValues
Columns("O:T").Cut Columns("F:K")
Range("A2:N" & Range("A1048576").End(xlUp).Row).Copy
Sheets("Galop").Range("A" & Sheets("Galop").Range("A1048576").End(xlUp).Row + 1).PasteSpecial Paste:=xlPasteValues
ATLA2:
Cells.Delete Shift:=xlUp
Next w
End Sub
I want to get a lot of data with the For Next cycle, but after a while the page hangs. How can I reset objects at the end of each cycle?
Asay numbers
10182
10221
10279
10303
10316
10325
10360
10370
10680
11598
11629
11715
11745
12335
12385
12533
12559
13154
13393
13635
13641
13669
13673
14027
14057
14062
14228
14619
14674
14687
14743
14770
14778
15197
15217
15323
15382
15507
15775
15828
16077
16335
16510
17149
17513
17867
18532
37964
60176
66067
66255
66581
66582
66896
66998
67056
67309
67356
67379
67473
68008
68012
68162
68298
68312
68320
68332
68333
68353
68383
68545
68702
68775
68922
69445
69606
69817
69963
69968
69985
69986
70048
70202
71372
(boş)
Slowing down maybe due to throttling of network if you are trying to hit the site too many times in quick succession. This is particularly likely given your access method. Better would be to see if an API is available to bulk access info. You are likely going through many networks to get to this page as well. It may be possible to get some basic info about delays from TRACERT command from a command prompt.
You are doing a POST so remember there is a fair amount of server side stuff going on as well.
You don't need to set elem to Nothing as it only exists during your For Loop. Same for tRow.
Putting .getElementsByClassName("at_Galoplar")(0).Rows into a variable will provided faster referencing.
Write the results to an array first and then dump the array out to the sheet in one go will provide significant improvement in speed.
Using New keyword can lead to unexpected behaviour. You can create one instance of HTMLDocument and work with that provided you have good error handling in. I have had occassional cases in a loop where I have had to set HTMLDocument to Nothing before looping back round.
Personally, I would cheat and re-write this to leverage that you can issue GET requests to get the same info. I use a class to hold the XMLHTTP object, and an array to hold the results. I write the results out in one go. This takes a few seconds to run for me. The asay numbers are in Sheet1 range A1:A84.
Class module clsHTTP
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal url As String) As String
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Standard module 1
Option Explicit
Public Sub DGaloplar()
Dim asays(), ws As Worksheet, asay As Long, html As HTMLDocument
Dim http As clsHTTP, url As String, headers(), numberOfRequests As Long
headers = Array("Asay", "Tarih", "Sehir", "Kg", "Jokey", "400", "600", "800", "1000", "1200", "1400", "Ç", "Pist", "Durum")
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
asays = Application.Transpose(ws.Range("A1:A84").Value) 'Load asay values from sheet 1
Const numTableRows As Long = 11
Const numTableColumns As Long = 15
Const BASE_URL As String = "https://yenibeygir.com/at/getatdetaytab/?tab=galopTab&id="
numberOfRequests = UBound(asays)
Dim results(), headerRow As Boolean, tRows As Object, tRow As Object, iRow As Long
Dim tCells As Object, tCell As Object, r As Long, c As Long, hTable As HTMLTable
ReDim results(1 To numTableRows * numberOfRequests, 1 To numTableColumns)
Application.ScreenUpdating = False
For asay = 1 To numberOfRequests
headerRow = True
url = BASE_URL & asays(asay)
html.body.innerHTML = http.GetString(url)
Set hTable = html.querySelector(".at_Galoplar")
Set tRows = hTable.getElementsByTagName("tr")
For Each tRow In tRows
If Not headerRow Then
c = 2: r = r + 1
results(r, 1) = asays(asay)
Set tCells = tRow.getElementsByTagName("td")
For Each tCell In tCells
results(r, c) = tCell.innerText
c = c + 1
Next
End If
headerRow = False
Next
Next
With ws
.Cells(1, 3).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 3).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
Application.ScreenUpdating = True
End Sub
Refereces:
Microsoft HTML Object Library
In general Set elem = Nothing is what you need.
In your code you are assigning the variables within a for-each loop, thus even if you set them to Nothing later, there would not be a performance bonus.
Try setting those object to Nothing, like below:
Set elem = Nothing
Set trow = Nothing
I am not sure whether you need variable declarations in your loop, you can take them out of the loop, this might save some time.
But I think your HTTP requests are taking so long, not any VBA code.
UPDATE
Try setting Application.EnableEvents and Application.ScreenUpdating to False at the beggining of macro and setting them back to True at the end.
I have the following error, the data is updated in the webpage, but it is not in Excel. I use Application.OnTime to refresh the webpage.
Below are the code
Sub RefreshAction()
Dim htm As Object
Dim Tr As Object
Dim Td As Object
Dim Tab1 As Object
Dim URL As String
Dim Colstart As Long
Dim HTML As Variant
Dim i As Long
Dim j As Long
Dim n As Long
Dim ss As Long
Application.ScreenUpdating = False
URL = "http://register.um.edu.my/kok_kosong_bi.asp"
Set HTML = CreateObject("htmlfile") 'Create HTMLFile Object
With CreateObject("msxml2.xmlhttp") 'Get the WebPage Content
.Open "GET", URL, False
.send
HTML.Body.Innerhtml = .responseText
End With
Colstart = 1
j = 1
i = Colstart
n = 0
'Loop Through website tables
For Each Tab1 In HTML.getElementsByTagName("table")
With HTML.getElementsByTagName("table")(n)
For Each Tr In .Rows
For Each Td In Tr.Cells
Sheet1.Cells(j, i) = Td.innerText
i = i + 1
Next Td
i = Colstart
j = j + 1
Next Tr
End With
n = n + 1
i = Colstart
j = j + 1
Next Tab1
Application.ScreenUpdating = True
Application.EnableEvents = True
Debug.Print Now() + TimeValue("00:00:05")
Application.OnTime Now() + TimeValue("00:00:05"), "RefreshAction", Schedule = True
End Sub
The snapshot
As per the snapshot, the website has 7 row, but excel only capture 5 row only. I have tried every possible way, still can't find the reason. I am expecting to clear the web cache, but I could not find the reference to do so.
The numbers change on the website. It was 6 rows when I first looked, then 5, and then later 6 again.
Your code is fine, but you need Schedule:=True rather than Schedule = True (typo?), and do you really need to loop all tables? You could also Dim HTML As Object.
I think the website is rather sketchy, to be honest, if exhibiting this type of inconsistency in results.
An easy way to get all rows, at any given time, is to simply copy paste the entire table as shown below. You can link that in with your refresh code.
Option Explicit
Public Sub GetTable()
Dim sResponse As String, html As New HTMLDocument, clipboard As Object
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://register.um.edu.my/kok_kosong_bi.asp", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
html.body.innerHTML = sResponse
With ThisWorkbook.Worksheets("Sheet1")
.Cells.ClearContents
.Cells.ClearFormats
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText html.getElementsByTagName("table")(3).outerHTML
clipboard.PutInClipboard
.Cells(1, 1).PasteSpecial
End With
End Sub
I have written code that grabs a table from a website and pulls out each cell from that table and then drops them into an excel spreadsheet. The code works flawlessly when the website loads correctly.
The issue is the website does not play nice with internet explorer, therefore the code only executes successfully about half of the time. I could write a routine that checks to see if the website loaded successfully and repeat if it did not, However I want to see if I can get it to work with a WinHTTPRequest.
The lines below are how I access the table using internet explorer based webscraping, with the last line being how i load the table into a variable.
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate "https://weather.com/weather/tenday/l/12345:4:US"
IE.Visible = True
Application.Wait (Now() + TimeValue("00:02:00"))
Set doc = IE.document
Set WeatherTable = doc.getElementsByClassName("twc-table")(0)
I can load the website in question via WinHTTPRequest using the code below.
Set doc = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://weather.com/weather/tenday/l/12345:4:US", False
.send
doc.body.innerHTML = .responseText
End With
However when I try and grab the table using the line below i get "Run-time error '438': Object doesn't support this property or method.
Set WeatherTable = doc.getElementByclassname("twc-table")(0)
Basically i need the equivalent of this line for WinHTTP webscraping.
I have looked at descending down through the html document(doc.body.all.item(1), etc) but I don't get very far before I run into errors. I have also looked at the Selenium addon, but I don't remember being able to download and install it successfully, and I am not sure if it is even still maintained for current versions of chrome / firefox.
Here is the full code that allows me to get the table via internet explorer webscraping and then drop it onto an excel spreadsheet.
Any help is appreciated.
Sub GetTable2()
Dim IE As Object
Dim doc As HTMLDocument
Dim WeatherTable As HTMLTable
Dim WeatherTableRows As HTMLTableRow
Dim HTMLTableCell As HTMLTableCell
Dim HeaderRow As Boolean
Dim RowCount As Long
Dim ColumnCount As Long
Dim i As Long
RowCount = 1
ColumnCount = 1
HeaderRow = True
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate "https://weather.com/weather/tenday/l/12345:4:US"
IE.Visible = True
'Application.Wait (Now() + TimeValue("00:02:00"))
Set doc = IE.document
Set WeatherTable = doc.getElementsByClassName("twc-table")(0)
For Each WeatherTableRows In WeatherTable.Rows
i = 1
For Each HTMLTableCell In WeatherTableRows.Cells
If HeaderRow = True Then
ThisWorkbook.Sheets("Sheet5").Cells(RowCount, ColumnCount).Value = HTMLTableCell.innerText
ColumnCount = ColumnCount + 1
Else
If i = 1 Then
i = i + 1
Else
ThisWorkbook.Sheets("Sheet5").Cells(RowCount, ColumnCount).Value = HTMLTableCell.innerText
ColumnCount = ColumnCount + 1
End If
End If
Next HTMLTableCell
HeaderRow = False
ColumnCount = 1
RowCount = RowCount + 1
Next WeatherTableRows
IE.Quit
Set IE = Nothing
Set doc = Nothing
End Sub
You missed an s. It is plural as you are getting a collection of elements by className.
Set WeatherTable = doc.getElementsByClassName("twc-table")(0)
To make your approach slightly cleaner, you can try this way as well.
Sub FetchTabularData()
Dim elem As Object, trow As Object, S$, R&, C&
[B1:G1] = [{"Day","Description","High/Low","Precip","Wind","Humidity"}]
With New WinHttp.WinHttpRequest
.Open "GET", "https://weather.com/weather/tenday/l/12345:4:US", False
.send
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
For Each elem In .querySelector(".twc-table").getElementsByTagName("tr")
For Each trow In elem.getElementsByTagName("td")
C = C + 1: Cells(R + 1, C) = trow.innerText
Next trow
C = 0: R = R + 1
Next elem
End With
End Sub
Reference to add:
Microsoft HTML Object Library
Microsoft WinHTTP Services, version 5.1
I'm trying to pull some info from a website that provides oil well data by API number (API is a unique number for every well in the US)
Website: http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1
API example: 1708300502
The issue is, when I get to the 2nd page, IE.document.getElementsByTagName("body")(0).innerText still returns data from the initial page. How do I fetch the updated page data?
The ultimate goal is to get to the 2nd page, click on "30570" via IE.document.getElementsByTagName("a")(0).Click and then read the final 3rd page. I just cannot figure out how to read the updated page :(
Option Explicit
Sub sonris_WellData()
Dim IE As InternetExplorer
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
Dim i As Integer
'Open SONRIS website
Application.StatusBar = "Opening Website"
IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
Do While IE.readyState <> 4: DoEvents: Loop
Application.Wait Now() + TimeValue("00:00:01")
Application.StatusBar = False
IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
IE.document.forms(0).submit
' Wait until the next page opens
Application.StatusBar = "Opening Website"
Do While IE.readyState <> 4: DoEvents: Loop
Application.Wait Now() + TimeValue("00:00:01")
Application.StatusBar = False
' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
MsgBox IE.document.getElementsByTagName("body")(0).innerText
IE.Quit
End Sub
This seems to be working. Rather than DoEvents use the WinAPI Sleep function. I also added a call to the Sleep function after the form submit.
MOre often we are seeing sites that are dynamically served by some javascript/etc., in these cases the browser may appear to be READYSTATE_COMPLETE or not Busy but the page has not yet rendered the "new" results.
Option Explicit
Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Sub sonris_WellData()
Dim IE As Object 'InternetExplorer
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
Dim i As Integer
'Open SONRIS website
Application.StatusBar = "Opening Website"
IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
Do While IE.readyState <> 4
Sleep 1000
Loop
Application.StatusBar = False
IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
IE.document.forms(0).submit
Sleep 1000
' Wait until the next page opens
Application.StatusBar = "Opening Website"
Do While IE.readyState <> 4
Sleep 1000
Loop
Application.StatusBar = False
' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
MsgBox IE.document.getElementsByTagName("body")(0).innerText
IE.Quit
End Sub
You can experiment maybe with a slightly longer Sleep after the .submit.
Alternatively, I notice that after you submit, the URL changes, so you could also try changing the second waiting loop to:
Do While IE.LocationURL ="http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
Sleep 1000
Loop
This should put the Excel.Application to wait until the URL has changed.
Alternatively, you may have better luck using an XMLHTTPRequest (there are many examples of this here on SO and elsewhere on the internet). This allows you to send a request just like the browser, without actually using a web browser. Then you can simply parse the return text as HTML or XML. I would use the Microsoft XML, v6.0 library reference for this.
POST requests:
① Entering the Well API number
I examined the web page making the selections you mention. I inspected the web traffic using fiddler and noticed that the initial request, when you submit the API number is handled by a POST request.
② POST request:
The POST body has the following parameter:
p_apinum is the key and the associated value is the original Well API number.
Using this info I formulated a POST request direct thus avoiding your first landing page.
③ Pressing the hyperlink:
Next, I noticed that the element you wanted to press:
Looking at the associated HTML it has an associated relative hyperlink:
I use a helper function to parse the page HTML to get this relative link and construct the absolute path: GetNextURL(page.body.innerHTML).
④ Making a new request:
I re-use my HTTPRequest function GetPage to send a second request, with an empty body, and grab all the tables from the HTML document returned via: page.getElementsByTagName("table").
⑤ Writing the tables to the Excel worksheet:
I loop all the tables on the page using helper function AddHeaders to write out the table headers, and WriteTables to write the current table to the sheet.
Example page content:
Example code output:
VBA:
Option Explicit
Public Sub GetWellInfo()
Dim ws As Worksheet, page As HTMLDocument, targetTable As HTMLTable, apiNumbers(), currNumber As Long
Const PARAM1 As String = "p_apinum"
Const BASESTRING As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/"
apiNumbers = Array(1708300502, 1708300503)
Application.ScreenUpdating = False
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ws
.Cells.ClearContents
For currNumber = LBound(apiNumbers) To UBound(apiNumbers)
Set page = GetPage(BASESTRING & "cart_con_wellapi2", apiNumbers(currNumber), PARAM1)
Set page = GetPage(BASESTRING & GetNextURL(page.body.innerHTML))
Dim allTables As Object
Set allTables = page.getElementsByTagName("table")
For Each targetTable In allTables
AddHeaders targetTable, GetLastRow(ws, 1) + 2, ws
WriteTables targetTable, GetLastRow(ws, 1), ws
Next targetTable
Next currNumber
End With
Application.ScreenUpdating = True
End Sub
Public Function GetPage(ByVal url As String, Optional ByVal apiNumber As Long, Optional ByVal paramN As String = vbNullString) As HTMLDocument
Dim objHTTP As Object, html As New HTMLDocument
Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
Dim sBody As String
If Not paramN = vbNullString Then sBody = paramN & "=" & apiNumber
With objHTTP
.SetTimeouts 10000, 10000, 10000, 10000
.Open "POST", url, False
.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
On Error Resume Next
.send (sBody)
If Err.Number = 0 Then
If .Status = "200" Then
html.body.innerHTML = .responseText
Set GetPage = html
Else
Debug.Print "HTTP " & .Status & " " & .statusText
Exit Function
End If
Else
Debug.Print "Error " & Err.Number & " " & Err.Source & " " & Err.Description
Exit Function
End If
On Error GoTo 0
End With
End Function
Public Function GetNextURL(ByVal inputString As String)
GetNextURL = Replace$(Replace$(Split(Split(inputString, "href=")(1), ">")(0), Chr$(34), vbNullString), "about:", vbNullString)
End Function
Public Sub AddHeaders(ByVal hTable As Object, ByVal startRow As Long, ByVal ws As Worksheet)
Dim headers As Object, header As Object, columnCounter As Long
Set headers = hTable.getElementsByTagName("th")
For Each header In headers
columnCounter = columnCounter + 1
ws.Cells(startRow, columnCounter) = header.innerText
Next header
End Sub
Public Sub WriteTables(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByRef ws As Worksheet)
If ws Is Nothing Then Set ws = ActiveSheet
Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
r = startRow
With ActiveSheet
Set tRow = hTable.getElementsByTagName("tr")
For Each tr In tRow
Set tCell = tr.getElementsByTagName("td")
For Each td In tCell
.Cells(r, c).Value = td.innerText
c = c + 1
Next td
r = r + 1: c = 1
Next tr
End With
End Sub
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
End With
End Function
References:
VBE > Tools > References > HTML Object Library.