VBA: Scraping exact elements from HTMLTable - html

Please can you help me to understand how to find tags a in Table with class name bptable?
I receive Object does not support this method and I don't know how to solve this problem.
Sub ListVideosOnPage(VidCatName As String, VidCatURL As String)
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim VidRow As MSHTML.IHTMLElement
Dim VidInnerRow As MSHTML.IHTMLElement
Dim VidRows As MSHTML.IHTMLElementCollection
Dim VidInnerRows As MSHTML.IHTMLElementCollection
Dim VidInnerCatID As Integer
XMLReq.Open "GET", VidCatURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
Set VidRows = HTMLDoc.getElementsByClassName("bptable")
Set VidInnerRows = ***VidRows***.getElementsByTagName("a")
With VidRows
For VidInnerCatID = 2 To VidInnerRows.Length
Set VidInnerRow = VidInnerRows(VidInnerCatID)
'Debug.Print
Next VidInnerCatID
End With
End Sub

I would use css selectors as generally a faster selector method and reduces the loops so lower complexity. You loop a single nodeList.
Dim nodeList As Object, i As long
Set nodeList = HTMLDoc.querySelectorAll(".bptable a")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).innerText
Next
The . in front of bptable is a class selector; the space after is a descendant combinator, and the final a is a type selector. It says select a tag elements who parent has class bptable.
I am printing to the immediate window Ctrl + G

Instead of using .getElementsByTagName("a") on collection of elements you can either use VidRows(0) or try a for loop to get individual element in order to apply .getElementsByTagName("a") on them. I would opt for a for loop to serve the purpose. The following is one such way to get the content.
Sub ListVideosOnPage(VidCatName As String, VidCatURL As String)
Dim XMLReq As New XMLHTTP60
Dim HTMLDoc As New HTMLDocument
Dim VidInnerRows As Object
Dim R As Long
XMLReq.Open "GET", VidCatURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
For Each VidInnerRows In HTMLDoc.getElementsByClassName("bptable")
With VidInnerRows.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
End With
Next VidInnerRows
End Sub

Related

Loop through line of code and change integer for getElementsByClassName

Previously posted on the MrExcel forum
www.mrexcel.com/board/threads/change-integer-in-code-line-for-htmldoc-getelementsbyclassname.1146814/
My original line of code was
Set DogRows1 = HTMLDoc.getElementsByClassName("rpb-greyhound rpb-greyhound-1 hover-opacity"
It works perfectly for the integer 1. However, I require to increment this by 1 and change to 2, 3, 4, 5 and 6 for other webpages, as below.
Set DogRows1 = HTMLDoc.getElementsByClassName("rpb-greyhound rpb-greyhound-6 hover-opacity"
I tried declaring some variables and adding a For Next Loop, however it will not loop through. What am I doing wrong? Have I put the For Next Loops in the wrong place?
Dim StartRaceNumber As Integer
Dim LastRaceNumber As Integer
XMLReq.Open "GET", DogPageURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerhtml = XMLReq.responseText
Set XMLReq = Nothing
LastRaceNumber = 6
For StartRaceNumber = 1 To LastRaceNumber
Set DogRows1 = HTMLDoc.getElementsByClassName("rpb-greyhound rpb-greyhound-" & StartRaceNumber & " hover-opacity")
For Each DogRow1 In DogRows1
Set DogNameLink1 = DogRow1.getElementsByTagName("a")(0)
NextHref = DogRow1.getAttribute("href")
NextURL = DogURL & Mid(NextHref, InStr(NextHref, ":") + 28)
Debug.Print DogRow1.innerText, NextURL
Next DogRow1
Next StartRaceNumber
Sure SIM
The scraping order is as follows:
Get Greyhound URL racecards
Greyhound Races
Get Greyhound URL Dog information
List of Greyhounds in the race
Get Greyhound Form details, this is an example for Greyhound#1
Form of Each Greyhound #1
Then loop to the next race and repeat.
As I said, from the code I can scrape only the form for greyhound#1 details for each race. I need to get the other dogs too if you can help?
These are my modules, hopefully they have imported correctly >
Option Explicit
Const DogURL As String = "https://www.timeform.com/greyhound-racing/racecards"
Sub ListDogRace()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim TFRaceList As MSHTML.IHTMLElement
Dim TFRaces As MSHTML.IHTMLElementCollection
Dim TFRace As MSHTML.IHTMLElement
Dim NextHref As String
Dim NextURL As String
XMLReq.Open "GET", DogURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerhtml = XMLReq.responseText
Set XMLReq = Nothing
Set TFRaces = HTMLDoc.getElementsByClassName("wfr-race bg-light-gray hover-opacity")
For Each TFRace In TFRaces
NextHref = TFRace.getAttribute("href")
NextURL = DogURL & Mid(NextHref, InStr(NextHref, ":") + 28)
ListDogsOnPage TFRace.innerText, NextURL
Next TFRace
End Sub
Sub ListDogsOnPage(DogName As String, DogPageURL As String)
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim DogRow1 As MSHTML.IHTMLElement
Dim DogRows1 As MSHTML.IHTMLElementCollection
Dim DogNameLink1 As MSHTML.IHTMLElement
Dim NextHref As String
Dim NextURL As String
Dim StartRaceNumber As Integer
Dim LastRaceNumber As Integer
XMLReq.Open "GET", DogPageURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerhtml = XMLReq.responseText
Set XMLReq = Nothing
LastRaceNumber = 6
For StartRaceNumber = 1 To LastRaceNumber
Set DogRows1 = HTMLDoc.getElementsByClassName("rpb-greyhound rpb-greyhound-" & StartRaceNumber & " hover-opacity"
For Each DogRow1 In DogRows1
Set DogNameLink1 = DogRow1.getElementsByTagName("a")(0)
NextHref = DogRow1.getAttribute("href")
NextURL = DogURL & Mid(NextHref, InStr(NextHref, ":") + 28)
Debug.Print DogRow1.innerText, NextURL
Next DogRow1
Next StartRaceNumber
End Sub
Can I just confirm, it is only the URL on the page of each race for each greyhound I need, so I can scrape the greyhound's form.
As an example:
Nottingham 11.06
#1 BALLYBOUGH GARY
https://www.timeform.com/greyhound-racing/greyhound-form/ballybough-gary/59297
#2 SALACRES BRUISER
https://www.timeform.com/greyhound-racing/greyhound-form/salacres-bruiser/59746
#3 FOLLOW MY LEAD
https://www.timeform.com/greyhound-racing/greyhound-form/follow-my-lead/54898
#4 HONOUR SAMURAI
https://www.timeform.com/greyhound-racing/greyhound-form/honour-samurai/53100
#5 NIDDERDALEFLURRY
https://www.timeform.com/greyhound-racing/greyhound-form/nidderdaleflurry/56446
#6 SPORTY MELODY
https://www.timeform.com/greyhound-racing/greyhound-form/sporty-melody/58746
I already have a Power Query function I have developed to scrape the form data from that url page. I am just struggling to get that full list of 6x greyhound form urls (as above) for each and every race.
If that makes sense?

VBA issue with MSXML2.ServerXMLHTTP.6.0 that works with InternetExplorerMedium

I like to use MSXML2.ServerXMLHTTP.6.0 when I can as it is faster. However, I have not been able to figure out how to use it when I need to interact with the website.
Thats probably for another question.
However, I am having an issue on why I get different URL returned. When I use the following code, I get the required results
...
Sub GoogleSfund()
Set objIExplorer = New InternetExplorerMedium
objIExplorer.Silent = True
objIExplorer.Visible = False 'for testing change to true
objIExplorer.Navigate "https://www.google.com/search?q=DWCPF"
Do While objIExplorer.Busy or Not objIExplorer.ReadyState = 4: DoEvents: Loop
a = objIExplorer.Document.body.getElementsByTagName("g-card-section")
pos1 = InStr(a.innerText, "INDEXDJX: DWCPF")
pos2 = InStr(a.innerText, "Disclaimer")
b = Mid(a.innerText, pos1, pos2 - pos1)
b = Replace(b, vbCrLf & vbCrLf, vbCrLf)
MsgBox b
TSP_Test.lblSfund.Caption = b
objIExplorer = ""
End Sub...
With (MSXML2.ServerXMLHTTP.6.0) it does not grab the page with the same URL
Sub GoogleSfundFAST()
Dim sSourceUrl As String
Dim HttpReq as Object
Set HttpReq = CreateObject("MSXML2.ServerXMLHTTP.6.0")
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim HTMLH3Doc As New MSHTML.HTMLDocument
Dim HTMLInstProcDoc As New MSHTML.HTMLDocument
sSourceUrl = "https://www.google.com/search?q=dwcpf"
'sSourceUrl = "https://www.google.com/search"
HttpReq.Open "GET", sSourceUrl, False
HttpReq.send
If HttpReq.Status = 200 Then
HttpReq.getAllResponseHeaders
HTMLDoc.body.innerHTML = HttpReq.responseText
End If
Dim Obj As MSHTML.HTMLGenericElement
Dim Heading As MSHTML.IHTMLElementCollection
Dim HD As HTMLElementCollection
Debug.Print HTMLDoc.body.innerHTML
End Sub
Any Ideas why it is different?

Excel VBA - Error 91 problem when HTML value is nothing

Hi I recently discovered excel VBA and am using it to aid my study of German.
I have a list of German words but no meaning/part of speech, example sentences, etc.
I wrote a macro to go to website (https://dictionary.cambridge.org/dictionary/german-english/) and fetch html data.
However, for some words, the example sentences are not provided (Hence the html returning no value and the error 91).
I have referred to other posts concerning this and added If Not HTMLDoc.getElementsByClassName() Is Nothing Then statements, but no luck so far.
Could you please tell me how to write a code such that if there is no html value, the macro moves on and go to the next word? (word is set by integer corresponding to the cell number in the excel sheet)
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim i As Integer
Dim strURL As String
For i = 2 To 3493
strURL = "https://dictionary.cambridge.org/dictionary/german-english/" & Range("A" & i)
XMLReq.Open "Get", strURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Error."
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
'Part
If IsObject(HTMLDoc.getElementsByClassName("pos dpos")) Then
Range("B" & i) = HTMLDoc.getElementsByClassName("pos dpos")(0).innerText
End If
'Meaning
If IsObject(HTMLDoc.getElementsByClassName("ddef_h")) Then
Range("C" & i) = HTMLDoc.getElementsByClassName("ddef_h")(0).innerText
End If
'ExampleGer
If Not HTMLDoc.getElementsByClassName("eg deg") Is Nothing Then
i = i + 1
Else
Range("D" & i) = HTMLDoc.getElementsByClassName("eg deg")(0).innerText
End If
'ExampleEng
If Not HTMLDoc.getElementsByClassName("trans dtrans hdb") Is Nothing Then
i = i + 1
Else
Range("E" & i) = HTMLDoc.getElementsByClassName("trans dtrans hdb")(0).innerText
End If
Next i
End Sub
Ok, I'm a German and therefore did not need any example words.
A word that delivers all 4 values: Haus (house)
A word that delivers only 2 values: Gummibaum (rubber plant)
Try the following code and please ...
NEVER! NEVER! NEVER! manipulate the counting variable of a for loop in the code block of the loop. Never use this i = i + 1 if i is the counting variable of the for loop. If you do that you run into problems in 99.9%
Sub Dictionary()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim i As Integer
Dim strURL As String
'Use object variables for each node you want to read from the DOM tree
'In the code below, these variables are then used to check whether an object exists or not
Dim nodePart As Object
Dim nodeMeaning As Object
Dim nodeExampleGer As Object
Dim nodeExampleEng As Object
For i = 2 To 3493
strURL = "https://dictionary.cambridge.org/dictionary/german-english/" & Range("A" & i)
'strURL = "https://dictionary.cambridge.org/dictionary/german-english/haus"
XMLReq.Open "Get", strURL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Error."
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
'Part
Set nodePart = HTMLDoc.getElementsByClassName("pos dpos")(0)
If Not nodePart Is Nothing Then
Range("B" & i) = nodePart.innerText
End If
'Meaning
Set nodeMeaning = HTMLDoc.getElementsByClassName("ddef_h")(0)
If Not nodeMeaning Is Nothing Then
Range("C" & i) = nodeMeaning.innerText
End If
'ExampleGer
Set nodeExampleGer = HTMLDoc.getElementsByClassName("eg deg")(0)
If Not nodeExampleGer Is Nothing Then
Range("D" & i) = nodeExampleGer.innerText
End If
'ExampleEng
Set nodeExampleEng = HTMLDoc.getElementsByClassName("trans dtrans hdb")(0)
If Not nodeExampleEng Is Nothing Then
Range("E" & i) = nodeExampleEng.innerText
End If
Next i
End Sub

Webscrape a specific part of a webpage

My webscrape stopped working. The owner changed the html.
I believe it is the Set allElements = doc.getElementsByClassName("el-col el-col-8") line that needs changing.
I am trying to grab text from the webpage that includes the "52-week Range (undefined)" section. I managed to grab text from before and after but not the section I need. An example webpage is https://www.gurufocus.com/stock/gliba/summary and my code should fill my cell with "38.72 - 73.63" after I do some trimming.
I need to do it this way so I can get my head round it and change it in the future when necessary so please just focus on correcting my set line of code (assuming that is the problem!) rather than a whole new more sophisticated method as it will be beyond me. (My other set line of code does what I want it to do.)
Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim incomeStmtURLs As Variant
Dim sURL As String
Dim lastrow As Long
Dim allRowOfData As Object
Dim i As Integer
Dim allElements As IHTMLElementCollection
Dim anElement As IHTMLElement
Dim aCell As HTMLTableCell
Application.DisplayAlerts = False
Call ToggleEvents(False)
incomeStmtURLs = Range("Sheet1!h1:h2").Value
For i = 1 To UBound(incomeStmtURLs)
Set wb = CreateObject("internetExplorer.Application")
sURL = incomeStmtURLs(i, 1)
wb.navigate sURL
wb.Visible = False
While wb.Busy
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
Set doc = wb.document
On Error GoTo err_clear
Set allElements = doc.getElementsByClassName("el-col el-col-8")
While allElements.Length = 0
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
x = allElements(0).innerText
' Debug.Print x
Sheet6.Cells(i + 1, 2).Value = Trim(Replace(Mid(x, InStr(1, x, "52-Week Range (undefined)") + 25, 25), vbLf, ""))
Set allElements = doc.getElementsByClassName("fs-x-large fc-primary fw-bolder")
x = allElements(0).innerText
Sheet6.Cells(i + 1, 4).Value = Trim(Replace(Mid(x, InStr(1, x, "$") + 1, 7), vbLf, ""))
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Next i
Call ToggleEvents(True)
End Sub
Sub ToggleEvents(blnState As Boolean)
Application.DisplayAlerts = blnState
Application.EnableEvents = blnState
If blnState Then Application.CutCopyMode = False
If blnState Then Application.StatusBar = False
End Sub
The page dynamically updates content as you scroll down. You likely need to scroll that part of the page into view then use grab all the elements with classname statictics-item then take the n-2 index e.g. Without the scrolling part:
Set elems = ie.document.getElementsByClassName("statictics-item")
If elems.length > 1 Then Debug.print elems(elems.length-2).innerText
For future readers (I know OP doesn't want this):
I would avoid the whole scrolling pickle, dynamic html and browser and issue an xmlhttp request and regex out the appropriate values from the javscript objects the web page uses for updating. N.B. I would probably add in validation on regex match positions.
Public Sub test()
Dim r As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.gurufocus.com/stock/gliba/summary", False
.send
r = GetMatches(.responseText, "price52wlow:(.*?),|price52whigh:(.*?),")
If r <> "NA" Then MsgBox r
End With
End Sub
Public Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As String
Dim matches As Object
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
'If .test(inputString) Then
Set matches = .Execute(inputString)
If matches.Count = 2 Then
GetMatches = matches.Item(0).submatches(0) & "-" & matches.Item(1).submatches(1)
Else
GetMatches = "NA"
End If
End With
End Function
Regex:

XMLHTTP60 Req not showing entire HTML Document

I am trying to obtain the HTML document from a website to what else, scrape for data!
Unfortunately, I cannot obtain the entire HTML Document associated with the web page. My debug.print statement doesn't show the entire web page as I would like, it gets cut off. I'm somewhat new to programming, help would be greatly appreciated!
My code is below:
Const SecForm4 As String = "https://www.secform4.com/significant-buys.htm"
Sub LoadWebPage()
Dim XMLReq As New MSXML2.XMLHTTP60
XMLReq.Open "GET", SecForm4, False
XMLReq.send
If XMLReq.Status <> 200 Or XMLReq.readyState <> 4 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & "-" & XMLReq.statusText
Exit Sub
End If
ParsingHTMLDocument XMLReq.responseText
End Sub
Sub ParsingHTMLDocument(HTMLText As String)
Dim HTMLDoc As New MSHTML.HTMLDocument
HTMLDoc.body.innerHTML = HTMLText
Debug.Print HTMLText
End Sub
The following works in terms of grabbing the document and the table is present. You are unlikely to be able to print the entire document to the immediate window as it has limitations on capacity. Instead you could write to a text file and inspect.
Change the filepath ,"C:\Users\User\Desktop\Test.txt", to one for you.
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument, hTable As HTMLTable
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.secform4.com/significant-buys.htm", False
.Send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
WriteTxtFile sResponse
With html
.body.innerHTML = sResponse
Set hTable = .getElementById("filing_table")
MsgBox hTable.localName
End With
End Sub
Public Sub WriteTxtFile(ByVal aString As String, Optional ByVal filePath As String = "C:\Users\User\Desktop\Test.txt")
Dim fso As Object, Fileout As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Set Fileout = fso.CreateTextFile(filePath, True, True)
Fileout.Write aString
Fileout.Close
End Sub
Reference to HTML Object Library required.