XMLHTTP60 request not showing entire HTML document

I am trying to obtain the HTML document from a website in order to (what else?) scrape it for data.
Unfortunately, I cannot obtain the entire HTML document associated with the web page. My Debug.Print statement doesn't show the whole page as I would like; the output gets cut off. I'm somewhat new to programming, so help would be greatly appreciated!
My code is below:
Const SecForm4 As String = "https://www.secform4.com/significant-buys.htm"
Sub LoadWebPage()
Dim XMLReq As New MSXML2.XMLHTTP60
XMLReq.Open "GET", SecForm4, False
XMLReq.send
If XMLReq.Status <> 200 Or XMLReq.readyState <> 4 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & "-" & XMLReq.statusText
Exit Sub
End If
ParsingHTMLDocument XMLReq.responseText
End Sub
Sub ParsingHTMLDocument(HTMLText As String)
Dim HTMLDoc As New MSHTML.HTMLDocument
HTMLDoc.body.innerHTML = HTMLText
Debug.Print HTMLText
End Sub

The following works in terms of grabbing the document, and the table is present. You are unlikely to be able to print the entire document to the Immediate window, as its capacity is limited. Instead, you could write the response to a text file and inspect it there.
Change the file path, "C:\Users\User\Desktop\Test.txt", to one that exists on your machine.
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument, hTable As HTMLTable
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.secform4.com/significant-buys.htm", False
.Send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
WriteTxtFile sResponse
With html
.body.innerHTML = sResponse
Set hTable = .getElementById("filing_table")
MsgBox hTable.localName
End With
End Sub
Public Sub WriteTxtFile(ByVal aString As String, Optional ByVal filePath As String = "C:\Users\User\Desktop\Test.txt")
Dim fso As Object, Fileout As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Set Fileout = fso.CreateTextFile(filePath, True, True)
Fileout.Write aString
Fileout.Close
End Sub
A reference to the Microsoft HTML Object Library is required.
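Before writing to a file, a quick way to check whether the response itself is truncated, or only the Immediate window output, is to look at its length and its tail. The following is a minimal sketch; the names LooksComplete and ReportResponse are hypothetical helpers, and the test for a closing </html> tag is an assumption about what a complete page looks like.

```vba
Option Explicit

' Hypothetical helper: True when the markup contains a closing </html> tag.
Public Function LooksComplete(ByVal markup As String) As Boolean
    LooksComplete = InStr(1, markup, "</html>", vbTextCompare) > 0
End Function

' Hypothetical helper: prints a short summary instead of the whole document.
Public Sub ReportResponse(ByVal markup As String)
    Debug.Print "Characters received: " & Len(markup)
    Debug.Print "Closing </html> found: " & LooksComplete(markup)
    Debug.Print "Last 200 characters: " & Right$(markup, 200)
End Sub
```

Calling ReportResponse sResponse after the request prints only three short lines, so the Immediate window limit does not get in the way.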

Related

Extract data from a site that does query with Java script

I want to get data from a website, but I get run-time error '91'.
I adapted the code below from earlier examples.
Sub DENEME()
Dim S As String
Dim html As HTMLDocument
Dim hTable As HTMLTable
Dim clipboard As Object
Set html = New HTMLDocument
With New XMLHTTP60
.Open "GET", "https://www.scorespro.com/basketball/results/date/2019-02-15", False
.setRequestHeader "content-type", "application/x-www-form-urlencoded; charset=UTF-8"
.send
S = .responseText
End With
html.body.innerHTML = S
Set hTable = html.querySelector(".matches-data")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
Range("A1").PasteSpecial
End Sub
A few points to note:
Yes, matches-data is an id, not a class, for that element. However, that element is a div, so copying its outer HTML via the clipboard pastes only the HTML of the div, not all the tables within it.
What you want is a list of tables, so you need to change your selector to get the tables within that div.
Because you are pasting via the clipboard and the output contains merged cells, you need to find the last used row independent of column, then add 1 so that the next table is written to a new row.
If you are unsure about your selector, test it using the search bar in the browser.
VBA:
Option Explicit
Public Sub Deneme()
Dim s As String, ws As Worksheet, tables As Object, i As Long
Dim html As HTMLDocument, clipboard As Object
Set html = New HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
Application.ScreenUpdating = False
With New XMLHTTP60
.Open "GET", "https://www.scorespro.com/basketball/results/date/2019-02-15", False
.setRequestHeader "content-type", "application/x-www-form-urlencoded; charset=UTF-8"
.send
s = .responseText
End With
html.body.innerHTML = s
Set tables = html.querySelectorAll("#matches-data table")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
For i = 0 To tables.Length - 1
clipboard.SetText tables.item(i).outerHTML
clipboard.PutInClipboard
ws.Range("A" & GetLastRow(ws) + 1).PasteSpecial
Next
Application.ScreenUpdating = True
End Sub
Public Function GetLastRow(ByVal sh As Worksheet) As Long
On Error Resume Next
GetLastRow = sh.Cells.Find(What:="*", _
After:=sh.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row
On Error GoTo 0
End Function
Your selector string is set up so that it looks for the first element whose class is matches-data. However, matches-data is the id. As a result, hTable is being assigned Nothing, hence the error. Instead, try the following...
Set hTable = html.querySelector("#matches-data")
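Since querySelector returns Nothing when no element matches, it can help to test the result before using it. The following is a minimal sketch; OuterHtmlOrEmpty is a hypothetical helper name, and the code assumes a reference to the Microsoft HTML Object Library.

```vba
Option Explicit

' Hypothetical helper: returns the outer HTML of the first match,
' or an empty string when the selector matches nothing.
' Requires a reference to Microsoft HTML Object Library.
Public Function OuterHtmlOrEmpty(ByVal doc As HTMLDocument, ByVal selector As String) As String
    Dim node As Object
    Set node = doc.querySelector(selector)
    If node Is Nothing Then
        OuterHtmlOrEmpty = vbNullString   ' no match: avoid run-time error '91'
    Else
        OuterHtmlOrEmpty = node.outerHTML
    End If
End Function
```

With this guard, a wrong selector such as ".matches-data" yields an empty string instead of an error, which makes the class/id mix-up easier to spot.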

VBA: Scraping exact elements from HTMLTable

Could you please help me understand how to find the a tags in a table with class name bptable?
I receive the error "Object does not support this method" and I don't know how to solve it.
Sub ListVideosOnPage(VidCatName As String, VidCatURL As String)
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim VidRow As MSHTML.IHTMLElement
Dim VidInnerRow As MSHTML.IHTMLElement
Dim VidRows As MSHTML.IHTMLElementCollection
Dim VidInnerRows As MSHTML.IHTMLElementCollection
Dim VidInnerCatID As Integer
XMLReq.Open "GET", VidCatURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
Set VidRows = HTMLDoc.getElementsByClassName("bptable")
Set VidInnerRows = VidRows.getElementsByTagName("a") ' the error is raised on VidRows here
With VidRows
For VidInnerCatID = 2 To VidInnerRows.Length
Set VidInnerRow = VidInnerRows(VidInnerCatID)
'Debug.Print
Next VidInnerCatID
End With
End Sub
I would use CSS selectors, as they are generally a faster selection method and reduce the number of loops, so the code is less complex. You loop over a single nodeList.
Dim nodeList As Object, i As Long
Set nodeList = HTMLDoc.querySelectorAll(".bptable a")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).innerText
Next
The . in front of bptable is a class selector, the space after it is a descendant combinator, and the final a is a type selector. Together they say: select a elements that have an ancestor with class bptable.
I am printing to the Immediate window (Ctrl + G).
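To see the difference between the descendant combinator (a space) and the child combinator (>), the two can be compared on a small fragment. This is a minimal sketch; CountMatches is a hypothetical helper, the markup is made up for illustration, and a reference to the Microsoft HTML Object Library is assumed.

```vba
Option Explicit

' Hypothetical helper: counts the elements in the markup that match a CSS selector.
' Requires a reference to Microsoft HTML Object Library.
Public Function CountMatches(ByVal markup As String, ByVal selector As String) As Long
    Dim doc As New HTMLDocument, nodeList As Object
    doc.body.innerHTML = markup
    Set nodeList = doc.querySelectorAll(selector)
    CountMatches = nodeList.Length
End Function

Public Sub CompareCombinators()
    Dim markup As String
    ' Illustrative markup: the link sits inside a cell, not directly under the table.
    markup = "<table class=""bptable""><tr><td><a href=""#"">One</a></td></tr></table>"
    Debug.Print ".bptable a   -> " & CountMatches(markup, ".bptable a")
    Debug.Print ".bptable > a -> " & CountMatches(markup, ".bptable > a")
End Sub
```

The first selector should find the link because it is nested somewhere inside the table, while the second should find nothing because the link is not a direct child of the table.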
You cannot call .getElementsByTagName("a") on a collection of elements. Instead, either use VidRows(0) or loop over the collection and apply .getElementsByTagName("a") to each individual element. I would opt for the loop. The following is one way to get the content.
Sub ListVideosOnPage(VidCatName As String, VidCatURL As String)
Dim XMLReq As New XMLHTTP60
Dim HTMLDoc As New HTMLDocument
Dim VidInnerRows As Object
Dim R As Long
XMLReq.Open "GET", VidCatURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
For Each VidInnerRows In HTMLDoc.getElementsByClassName("bptable")
With VidInnerRows.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
End With
Next VidInnerRows
End Sub

VBA Web Scraping, SPAN issue

I'm currently trying to scrape some data from a website using VBA, but I'm completely stuck.
My VBA code is as follows:
Sub WISE()
Dim IE As InternetExplorer
Dim HTML As HTMLDocument
Dim WPage As String
WPage = "www.thencc.org.uk/Our_Members/MemDetails.aspx?CompID=AAA-01-01"
Set IE = New InternetExplorer
IE.Visible = True
IE.navigate WPage
Do While IE.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to " & WPage
DoEvents
Loop
Dim hlpe As String
'Clearly missing something
hlpe = _
HTML.getElementsByTagName("span").getElementsByTagName("b").innerText
Range("a5").Value = hlpe
Set IE = Nothing
Application.StatusBar = ""
End Sub
The HTML containing the details I'm trying to get is shown below.
Essentially I was going to pull the lot and then parse it in Excel; ideally I want the address and the phone number.
<span id="MainContent_lblDetails"><table class=tabLayout width='90%'> <tr><td style='height:20px'><b>AA Autovan Leisure Ltd</b><br/><br/>Servicing and repairs – mobile specialists in servicing and repairing touring caravans and motorhomes. Contact us for more information<br/><br/>7 Sycamore Lane, Holmfirth, Huddersfield, HD9 7RT<br/>West Yorkshire, England<br/><br/><b>Tel - </b>01484 683636<br/><b>Web - </b><a href='http://www.aaautovanleisure.com' style='color:#0061a0' target='_blank'>www.aaautovanleisure.com</div></td></tr><tr><td align='right'><a href='javascript:history.go(-1)' style='color:#0061a0'> Back </a></td></tr></table></span>
We can parse text related to the element, but we want to use its html so as to have useful delimiters to split on. Also, we can do away with the slow browser opening and issue an XMLHTTP GET request.
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, HTML As New HTMLDocument, arrayItems() As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://www.thencc.org.uk/Our_Members/MemDetails.aspx?CompID=AAA-01-01", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
With HTML
.body.innerHTML = sResponse
arrayItems = Split(.querySelector("#MainContent_lblDetails").outerHTML, "<BR>")
Debug.Print "Address: " & arrayItems(4) & Chr$(32) & arrayItems(5)
Debug.Print Replace$(Replace$(Replace$(arrayItems(7), "<B>", vbNullString), "</B", vbNullString), "- >", vbNullString)
End With
End Sub
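The split above relies on the serialized markup containing exactly "<BR>". If the casing or form of the tag might differ (an assumption, since it depends on how the document is serialized), a case-insensitive split is a little more forgiving. SplitOnBr is a hypothetical helper name.

```vba
Option Explicit

' Hypothetical helper: splits markup on <br>, <BR> or <br/>, ignoring case.
Public Function SplitOnBr(ByVal markup As String) As String()
    Dim normalised As String
    ' Fold the self-closing form into the plain form first.
    normalised = Replace$(markup, "<br/>", "<br>", 1, -1, vbTextCompare)
    SplitOnBr = Split(normalised, "<br>", -1, vbTextCompare)
End Function
```

The returned array can then be indexed in the same way as arrayItems in the answer.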

Web-scraping from Excel List of PDGA Numbers using VBA

I have a list of numbers (PDGA Numbers) in MS Excel. I would like to automatically search the PDGA website (https://www.pdga.com/players/) from the list and automatically paste the player's location next to the corresponding PDGA Number. Currently, I am able to search the number and paste the location individually, but not the entire list.
First I select an excel cell and 'Define Name' as PDGA, and another as Location.
https://imgur.com/AcGtuX8
Then I basically followed this YouTube video. https://www.youtube.com/watch?v=7sZRcaaAVbg
And ultimately got this VBA code to work. (Make sure the proper VBA References are checked)
https://imgur.com/a/OYSM7Am
Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Column = Range("PDGA").Column Then
Dim IE As New InternetExplorer
IE.Visible = False
IE.navigate "https://www.pdga.com/player/" & Range("PDGA").Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
Dim sDD As String
sDD = Trim(Doc.getElementsByTagName("li")(127).outerText)
Range("Location").Value = sDD
End If
End Sub
I think I need some For Each loop, but I'm not sure. It should look like this when completed.
https://imgur.com/a/qOiW4JJ
Thanks in advance for any help.
If you have a specific list of players then you loop and issue XHR requests to get the info. Here I have the PDGA# in an array which is looped:
playerPDGA = Array(1, 5, 23, 46, 789, 567)
Code:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As New HTMLDocument, playerPDGA(), results(), i As Long
playerPDGA = Array(1, 5, 23, 46, 789, 567)
ReDim results(0 To UBound(playerPDGA), 0 To 1)
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(playerPDGA) To UBound(playerPDGA)
.Open "GET", "https://www.pdga.com/player/" & playerPDGA(i), False
.send
sResponse = StrConv(.responseBody, vbUnicode)
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
With html
.body.innerHTML = sResponse
results(i, 0) = .querySelector(".pane-content > h1").innerText
results(i, 1) = .querySelector(".location").innerText
End With
Next i
End With
ActiveSheet.Cells(1, 1).Resize(UBound(results, 1) + 1, UBound(results, 2) + 1) = results
End Sub
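If the PDGA numbers live in a worksheet column rather than in a hard-coded array, the request URLs could be built from the range values first. This is a minimal sketch; BuildPlayerUrls is a hypothetical helper, and it assumes the input is the two-dimensional array that Range.Value returns for a range of two or more cells.

```vba
Option Explicit

' Hypothetical helper: turns a one-column 2D array of PDGA numbers
' (as returned by Range.Value for a multi-cell range) into player URLs.
Public Function BuildPlayerUrls(ByVal ids As Variant) As String()
    Dim urls() As String, r As Long
    ReDim urls(LBound(ids, 1) To UBound(ids, 1))
    For r = LBound(ids, 1) To UBound(ids, 1)
        urls(r) = "https://www.pdga.com/player/" & CStr(ids(r, 1))
    Next r
    BuildPlayerUrls = urls
End Function
```

For example, BuildPlayerUrls(ActiveSheet.Range("A1:A6").Value) would give one URL per row, which can then be looped over in place of playerPDGA. The range address here is only an illustration.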
For any page listing players:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As New HTMLDocument
Application.ScreenUpdating = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.pdga.com/players/", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim nameList As Object, cityList As Object, stateList As Object, countryList As Object, r As Long
With html
.body.innerHTML = sResponse
Set nameList = .querySelectorAll(".views-field.views-field-nothing")
Set cityList = .querySelectorAll(".views-field.views-field-City.city")
Set stateList = .querySelectorAll(".views-field.views-field-StateProv.state")
Set countryList = .querySelectorAll(".views-field.views-field-Country.country")
End With
With ActiveSheet
Dim i As Long
For i = 0 To nameList.Length - 1
r = r + 1
.Cells(r, 1) = nameList.item(i).innerText
.Cells(r, 2) = Trim$(cityList.item(i).innerText & Chr$(32) & stateList.item(i).innerText & Chr$(32) & countryList.item(i).innerText)
Next i
End With
Application.ScreenUpdating = True
End Sub
Reference:
HTML Object library
You can achieve your desired output in several ways. Here is one of them.
Sub FetchData()
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim post As HTMLDivElement, Idic As New Scripting.Dictionary
Dim key As Variant, N$, CT$, S$, C$, R&
With Http
.Open "GET", "https://www.pdga.com/players/", False
.send
Html.body.innerHTML = .responseText
End With
For Each post In Html.querySelector("table.views-table tbody").getElementsByTagName("tr")
N = post.querySelector("a[title]").innerText
CT = post.querySelector(".city").innerText
S = post.querySelector(".state").innerText
C = post.querySelector(".country").innerText
Idic(N & "|" & CT & " " & S & " " & C) = 1
Next post
For Each key In Idic.Keys
R = R + 1: Cells(R, 1) = Split(key, "|")(0)
Cells(R, 2) = Split(key, "|")(1)
Next key
End Sub
References to add to the project:
Microsoft XML, v6.0
Microsoft HTML Object Library
Microsoft Scripting Runtime
Sub test()
Dim IE As New InternetExplorer
Dim Doc As HTMLDocument
Dim lastRow As Long, i As Long
Dim sDD As String
IE.Visible = False
lastRow = Range("A" & Rows.Count).End(xlUp).Row
For i = 1 To lastRow
IE.navigate "https://www.pdga.com/player/" & Range("PDGA").Cells(i).Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Set Doc = IE.document
sDD = Trim(Doc.getElementsByTagName("li")(127).outerText)
Range("Location").Cells(i) = sDD
Next
IE.Quit ' close the hidden browser instance so it does not stay running
Set IE = Nothing
Set Doc = Nothing
End Sub

Web scraping html page with no tags as delimiter

I'm trying to import all lines of text in a web page into a string array. The URL is here: Vaticano-La Sacra Bibbia-Genesi-Cap.1.
Unfortunately (maybe by choice of the web designer), the tags have no id or class attributes. All the rows are separated by one or more <BR> elements. The start and end of the text are separated from a simple menu by two <HR> tags.
A clean extract of the page code is here: jsfiddle.
I am looking for a way to extract the text. Here is what I have in VBA so far:
Note: objDoc is a Public variable coming from another module, filled from a .responseText without problems.
Public Sub ScriviXHTML(strBook As String, intNumCap As Integer)
Dim strDati2 As String
Dim TagBr As IHTMLElementCollection
Dim BrElement As IHTMLElement
Dim intElement As Integer
Dim objChild as Object
Dim strData As String, strTextCont As String, strNodeVal As String, strWholeText As String
Set objDoc2 = New HTMLDocument
Set objDoc2 = objDoc
Set objDoc = Nothing
'Put in variable string HTML code of the web page.
strDati2 = objDoc2.body.innerHTML
'Set in the variable object TAG type BR.
Set TagBr = objDoc2.body.getElementsByTagName("BR")
'Loop for all BRs in the page.
For Each BrElement In TagBr
'Here I try to get the NextSibling element of the <br>
' because seems contain the text I'm looking for.
Set objChild = BrElement.NextSibling
With objChild
' Here I try to put in the variables
strData = Trim("" & .Data & "")
strTextCont = Trim("" & .textContent & "")
strNodeVal = Trim("" & .NodeValue & "")
strWholeText = Trim("" & .wholeText & "")
End With
intElement = intElement + 1
Next BrElement
End Sub
Two questions:
1) In your opinion, is this the best way to achieve what I'm trying to do?
2) Sometimes Element.NextSibling.Data doesn't exist, which raises run-time error '438', so I manually move the execution point in the debugger to bypass the error. How can I intercept this error? [Please, not with a simple On Error Resume Next!] Better still: how can I use an If...Then...End If statement to check whether NextSibling has a Data member?
Thanks to all.
Well you can get all the text as follows:
Public Sub GetInfo()
Dim sResponse As String, xhr As Object, html As New HTMLDocument
Set xhr = CreateObject("MSXML2.XMLHTTP")
With xhr
.Open "GET", "http://www.vatican.va/archive/ITA0001/__P1.HTM", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
html.body.innerHTML = sResponse
[A1] = Replace$(Replace$(regexRemove(html.body.innerHTML, "<([^>]+)>"), "&nbsp;", Chr$(32)), Chr$(10), Chr$(32))
End With
End Sub
Public Function regexRemove(ByVal s As String, ByVal pattern As String) As String
Dim regex As Object
Set regex = CreateObject("VBScript.RegExp")
With regex
.Global = True
.MultiLine = True
.IgnoreCase = False
.pattern = pattern
End With
If regex.test(s) Then
regexRemove = regex.Replace(s, vbNullString)
Else
regexRemove = s
End If
End Function
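On the second question (checking NextSibling without On Error Resume Next), one option is to test the node type before reading text from it: in the DOM, a nodeType of 3 denotes a text node, and element nodes such as another <br> have nodeType 1. The following is a minimal sketch; TextAfter is a hypothetical helper name.

```vba
Option Explicit

' Hypothetical helper: returns the trimmed text that directly follows a node,
' or an empty string when the next sibling is missing or is not a text node.
Public Function TextAfter(ByVal node As Object) As String
    Const TEXT_NODE As Long = 3
    Dim sib As Object
    Set sib = node.NextSibling
    If sib Is Nothing Then Exit Function              ' last node in its parent
    If sib.nodeType = TEXT_NODE Then
        TextAfter = Trim$(sib.NodeValue & vbNullString)
    End If
End Function
```

Inside the For Each BrElement loop, strData = TextAfter(BrElement) would then replace the direct .Data access, and consecutive <br> elements simply return an empty string.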