Using XMLHTTP object to parse some websites in VBA - html

I am trying to pick up the "Key people" field from a Wikipedia page, https://en.wikipedia.org/wiki/Abbott_Laboratories, and copy that value into my Excel spreadsheet.
I managed to do it using XMLHTTP, a method I like for its speed; the working code is below.
However, the code is not flexible enough, because the structure of the wiki page can change. For example, it doesn't work on this page: https://en.wikipedia.org/wiki/3M
because the tr/td structure is not exactly the same (on the 3M page, "Key people" is no longer in the tr that my code indexes).
How can I improve my code?
Public Sub parsehtml()
Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://en.wikipedia.org/wiki/Abbott_Laboratories", False
http.send
html.body.innerHTML = http.responseText
Set topic = html.getElementsByTagName("tr")(8)
Set titleElem = topic.getElementsByTagName("td")(0)
ThisWorkbook.Sheets(1).Cells(1, 1).Value = titleElem.innerText
End Sub

If the row holding "Key people" is not fixed, why not loop over the table rows looking for "Key people"?
I tested the following modification and found it works correctly.
In the declaration section:
Dim topic As HTMLTable, Rw As HTMLTableRow
and then:
html.body.innerHTML = http.responseText
Set topic = html.getElementsByClassName("infobox vcard")(0)
For Each Rw In topic.Rows
If Rw.Cells(0).innerText = "Key people" Then
ThisWorkbook.Sheets(1).Cells(1, 1).Value = Rw.Cells(1).innerText
Exit For
End If
Next
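Putting the fragments above together with the request from the question gives roughly the following. This is only a sketch: the procedure name GetKeyPeopleByLabel and the Trim$ call are my additions, and it assumes the first element with class "infobox vcard" is the infobox table on both pages.

```vba
Option Explicit

' Requires a reference to "Microsoft HTML Object Library"
' (VBE > Tools > References).
Public Sub GetKeyPeopleByLabel()
    Dim http As Object, html As New HTMLDocument
    Dim infobox As HTMLTable, rw As HTMLTableRow

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://en.wikipedia.org/wiki/3M", False
    http.send
    html.body.innerHTML = http.responseText

    ' First element carrying both classes; assumed to be the infobox table.
    Set infobox = html.getElementsByClassName("infobox vcard")(0)
    If infobox Is Nothing Then Exit Sub

    For Each rw In infobox.Rows
        ' Match on the label text instead of a fixed row index.
        If Trim$(rw.Cells(0).innerText) = "Key people" Then
            ThisWorkbook.Sheets(1).Cells(1, 1).Value = rw.Cells(1).innerText
            Exit For
        End If
    Next rw
End Sub
```

Because the row is found by its label, the same procedure should cope with pages where the infobox has more or fewer rows above "Key people".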

There is a better, faster way, at least for the given URLs: match on the class name of the element and index into the returned nodeList. There are fewer returned items to deal with, the path to the element is shorter, and matching on class name is faster than matching on element type.
Option Explicit
Public Sub GetKeyPeople()
Dim html As HTMLDocument, body As String, urls(), i As Long, keyPeople
Set html = New HTMLDocument
urls = Array("https://en.wikipedia.org/wiki/Abbott_Laboratories", "https://en.wikipedia.org/wiki/3M")
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(urls) To UBound(urls)
.Open "GET", urls(i), False
.send
html.body.innerHTML = .responseText
keyPeople = html.querySelectorAll(".agent").item(1).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(i + 1, 1).Value = keyPeople
Next
End With
End Sub

Related

Amazon DVD details Web Scraping not able to pick the required Element

I am passing the EAN numbers of certain movies to Amazon and extracting the movie name and ASIN.
"https://www.amazon.de/s?k=7321925005738&__mk_de_DE=ÅMÅŽÕÑ&ref=nb_sb_noss"
However, the search results sometimes also contain sponsored products (they may or may not appear), and I want to extract only the non-sponsored ones.
When I Debug.Print the ASIN and movie name, every ASIN and movie name is printed, including the sponsored products.
To identify a sponsored product I look for data-component-type="sp-sponsored-result"
in the response text.
An actual product does not have this value in its data-component-type attribute, yet I am not able to separate out the actual movie names (the non-sponsored results).
I tried an If Not xxxx Then test, but my code still prints everything.
Here is my code:
Sub Amazon_Pull()
Dim Link_2 As String
Link_2 = "https://www.amazon.de/s?k=7321925005738&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
Dim xhr As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
'Debug.Print html.body.innerHTML
'Debug.Print html.getElementsByTagName("div").getAttribute("data-index").Length
Dim hTable As Object
Dim hba As Object
Set hTable = html.getElementsByTagName("div")
For Each hba In hTable
If Left(hba.getAttribute("data-asin"), 1) = "B" Then
If hba.getElementsByTagName("div")(2).getAttribute("data-component-type") <> "sp-sponsored-result" Then
Debug.Print hba.getAttribute("data-asin")
End If
End If
Next hba
Set xhr = Nothing
Set html = Nothing
'-------------
End Sub
Use a CSS attribute selector to restrict the match to the appropriate nodes:
Dim nodeList As Object, i As Long
Set nodeList = html.querySelectorAll("[data-asin]")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).getAttribute("data-asin")
Next
You can remove your conditional statements and move the logic into the CSS selector, using the starts-with operator ^ to match values beginning with the character B:
Dim nodeList As Object, i As Long
Set nodeList = html.querySelectorAll("[data-asin^=B]")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).getAttribute("data-asin")
Next
You can do it quick and dirty like this. But if the word "Gesponsert" is part of film title, it fails ;-)
Sub Amazon_Pull()
Dim Link_2 As String
Dim xhr As MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument
Dim hTable As Object
Dim hba As Object
Dim i As Long
Link_2 = "https://www.amazon.de/s?k=7321925005738"
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.querySelectorAll("div[data-index]")
For i = 0 To hTable.Length - 1
If InStr(1, hTable(i).innerText, "Gesponsert") = 0 Then
Debug.Print hTable(i).getAttribute("data-asin") & " " & hTable(i).getElementsByTagName("h2")(0).innerText
End If
Next i
Set xhr = Nothing
Set html = Nothing
End Sub
In my opinion it's always better to use the code structure of a page and not a part of the content. I know, that's not always possible and it's often more complex.
To check if an offer on amazon is sponsored, you can use the structure of the page code like this. One advantage is that it also works on the international Amazon platforms, regardless of the national language.
(Not tested, because Amazon blocked me as a bot.)
Sub Amazon_Pull()
Dim Link_2 As String
Dim xhr As MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument
Dim hTable As Object
Dim hba As Object
Dim i As Long
Dim check As Long
Dim sponsored As Boolean
Dim checkSponsored As Object
Link_2 = "https://www.amazon.de/s?k=7321925005738"
'Link_2 = "https://www.amazon.de/s?k=apple"
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.querySelectorAll("div[data-index]")
For i = 0 To hTable.Length - 1
sponsored = False
Set checkSponsored = hTable(i).querySelectorAll("div[data-component-type]")
For check = 0 To checkSponsored.Length - 1
If checkSponsored(check).getAttribute("data-component-type") = "sp-sponsored-result" Then
sponsored = True
End If
Next check
If Not sponsored Then
Debug.Print hTable(i).getAttribute("data-asin") & " " & hTable(i).getElementsByTagName("h2")(0).innerText
End If
Next i
Set xhr = Nothing
Set html = Nothing
End Sub
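The inner loop can probably be folded into a single selector test. The sketch below is untested against Amazon itself; it runs on a hand-written HTML fragment with placeholder ASINs, and the helper name IsSponsored is made up for illustration. It assumes, like the code above, that the sponsored marker sits on a descendant of the div[data-index] node rather than on the node itself.

```vba
Option Explicit

' Requires a reference to "Microsoft HTML Object Library".

' Hypothetical helper: True if any descendant of resultNode carries
' data-component-type="sp-sponsored-result".
Private Function IsSponsored(ByVal resultNode As Object) As Boolean
    IsSponsored = resultNode.querySelectorAll( _
        "[data-component-type='sp-sponsored-result']").Length > 0
End Function

Public Sub DemoIsSponsored()
    Dim html As New HTMLDocument, results As Object, i As Long

    ' Hand-written stand-in for the search results; not real Amazon markup.
    html.body.innerHTML = _
        "<div data-index='0' data-asin='B000000001'>" & _
        "<div data-component-type='sp-sponsored-result'><h2>Ad</h2></div></div>" & _
        "<div data-index='1' data-asin='B000000002'><div><h2>Film</h2></div></div>"

    Set results = html.querySelectorAll("div[data-index]")
    For i = 0 To results.Length - 1
        If Not IsSponsored(results.item(i)) Then
            Debug.Print results.item(i).getAttribute("data-asin")
        End If
    Next i
End Sub
```

With this fragment only the second, non-sponsored entry should be printed. Against the live page, the same function could be called on each hTable(i) node in place of the inner For loop.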

Extract value from HTML Source

I had a macro that went to a website, using a value from column A (for example 517167000) to build the URL, pulled a value from a particular part of the page source, and returned that value to a cell.
The HTML source has now changed and I can't seem to get it to work.
My original code was:
Public Function UnitPerBox(searchTerm As String) As String
Static request As Object
If request Is Nothing Then Set request = CreateObject("msxml2.xmlhttp")
With request
.Open "GET", "https://larsonjuhl.co.uk/mouldings/larson-juhl-essentials/arq-essentials-moulding-" & searchTerm, False
.send
UnitPerBox = Trim(Split(Split(.responseText, "Units per box</td>")(1), "<tr")(0))
End With
End Function
A working example of the website is
https://larsonjuhl.co.uk/mouldings/larson-juhl-essentials/arq-essentials-moulding-517167000
so you can go to the website and view the source.
The new HTML looks like the snippet below. It has been so long since I wrote the original macro that I assumed I could simply change
"Units per box</td>")(1), "<tr"
to
"Units per pack</td> <td class="value">")(1), "<tr"
The HTML below is what is now on the site, and I need the value 2.74 from it, for example, but it's not working.
<tr>
<td class="name">Units per pack</td>
<td class="value">2.74</td>
</tr>
Any help would be much appreciated.
Cheers
If you are going to do text manipulation on .responseText with Split(), you might as well use a regular expression, leaving its Global property unset so that only the first match is returned:
Public Function UnitPerBox(searchTerm As String) As String
Static request As Object
If request Is Nothing Then Set request = CreateObject("msxml2.xmlhttp")
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Pattern = "\d+(?:\.\d+)?"
With request
.Open "GET", "https://larsonjuhl.co.uk/mouldings/larson-juhl-essentials/arq-essentials-moulding-" & searchTerm, False
.send
UnitPerBox = RegEx.Execute(Split(.responsetext, "Units per pack</td>")(1))(0)
End With
End Function
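To see what the pattern picks out without calling the site, the same Split-then-match step can be run on the HTML fragment quoted in the question. This is a minimal sketch; the procedure name and the sample string are only for illustration.

```vba
Option Explicit

Public Sub DemoUnitsPerPackRegex()
    Dim sample As String, tail As String, re As Object

    ' The table row from the question, as a single string.
    sample = "<tr><td class=""name"">Units per pack</td>" & _
             "<td class=""value"">2.74</td></tr>"

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d+(?:\.\d+)?"   ' digits, optionally followed by .digits

    ' Everything after the label cell; the first number in it is the value.
    tail = Split(sample, "Units per pack</td>")(1)

    If re.Test(tail) Then
        Debug.Print re.Execute(tail)(0).Value   ' prints 2.74
    End If
End Sub
```

The Test call guards against the case where nothing numeric follows the label, in which case indexing the empty match collection would raise an error.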
Neater (IMO), however, is to avoid text manipulation on .responseText altogether and work through the HTML document, retrieving the data straight from the HTML table by element ID and table indexes:
Public Function UnitPerBox(searchTerm As String) As String
Static request As Object
If request Is Nothing Then Set request = CreateObject("msxml2.xmlhttp")
Dim htmlResponse As Object: Set htmlResponse = CreateObject("htmlfile")
With request
.Open "GET", "https://larsonjuhl.co.uk/mouldings/larson-juhl-essentials/arq-essentials-moulding-" & searchTerm, False
.send
htmlResponse.body.innerHTML = .responseText
UnitPerBox = htmlResponse.body.document.getElementById("specifications").getElementsByTagName("tr")(10).getElementsByTagName("td")(1).innerText
End With
End Function
Note that the collections are 0-indexed, meaning we are actually retrieving our value from the 11th row, second column. If you are not sure that the table content is always found at the same indexes, you could also loop over the rows:
Public Function UnitPerBox(searchTerm As String) As String
Static request As Object
If request Is Nothing Then Set request = CreateObject("msxml2.xmlhttp")
Dim htmlResponse As Object: Set htmlResponse = CreateObject("htmlfile")
Dim Rws As Object, Rw As Object
With request
.Open "GET", "https://larsonjuhl.co.uk/mouldings/larson-juhl-essentials/arq-essentials-moulding-" & searchTerm, False
.send
htmlResponse.body.innerHTML = .responseText
Set Rws = htmlResponse.body.document.getElementById("specifications").getElementsByTagName("tr")
For Each Rw In Rws
If Rw.getElementsByTagName("td")(0).InnerText = "Units per pack" Then
UnitPerBox = Rw.getElementsByTagName("td")(1).InnerText
Exit For
End If
Next
End With
End Function
While I personally prefer using the HTML document over text manipulation, all of the above options retrieve your value =)

How to fix 'for each' iteration with getELementsbyTagName?

I'm using MSXML and WinHTTP within VBA/Excel. I'm attempting to extract the innerText from all p tag elements within a parent element.
How can the sub iterate through all the tags within a specific class and populate a worksheet?
Thanks in advance.
I'm trying to adapt this strategy [0] to this website [1]
[0] https://codingislove.com/parse-html-in-excel-vba/
[1] https://www.fool.com/earnings/call-transcripts/2019/07/17/netflix-inc-nflx-q2-2019-earnings-call-transcript.aspx
Sub tryKeywordsearch()
Dim http As Object, html As New HTMLDocument
Dim paras As Object, titleElem As Object, detailsElem As Object, para As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://www.fool.com/earnings/call-transcripts/2019/07/17/netflix-inc-nflx-q2-2019-earnings-call-transcript.aspx", False
http.Send
html.body.innerHTML = http.responseText
Set paras = html.getElementsByClassName("article-content")
i = 1
For Each para In paras
Set para = para.getElementsByTagName("p")(i)
Sheets(1).Cells(i, 1).Value = para.innerText
i = i + 1
Next
End Sub
There is in fact only one element with that class name, article-content, so you are doing an outer loop of one and thus get no further than i = 1. Additionally, during your first loop you are changing the variable you are looping over which will most likely lead to an error.
For Each para In paras
Set para = para.getElementsByTagName("p")(i)
In the above, para is your loop variable.
Also, the collection returned by para.getElementsByTagName("p") will start at 0.
Your code would work if you indexed into the initial collection returned by getElementsByClassName, chained on getElementsByTagName, and used the result as the collection to For Each over. Leave the index starting at 1, since you can then use it to write out to the correct row, and use your loop variable para to get the current node's innerText:
Option Explicit
Public Sub TryKeywordSearch()
Dim http As Object, html As New HTMLDocument
Dim paras As Object, para As Object, i As Long
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://www.fool.com/earnings/call-transcripts/2019/07/17/netflix-inc-nflx-q2-2019-earnings-call-transcript.aspx", False
http.send
html.body.innerHTML = http.responseText
Set paras = html.getElementsByClassName("article-content")(0).getElementsByTagName("p")
i = 1
For Each para In paras
ThisWorkbook.Worksheets("Sheet1").Cells(i, 1).Value = para.innerText
i = i + 1
Next
End Sub
Instead, you can use a faster, and more readable IMO, css selector combination to get all p tags within a parent with class article-content:
Option Explicit
Public Sub GetParagraphs()
Dim http As Object, html As HTMLDocument, paragraphs As Object, i As Long
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.fool.com/earnings/call-transcripts/2019/07/17/netflix-inc-nflx-q2-2019-earnings-call-transcript.aspx", False
.send
html.body.innerHTML = .responseText
End With
Set paragraphs = html.querySelectorAll(".article-content p")
For i = 0 To paragraphs.Length - 1
ThisWorkbook.Worksheets("Sheet1").Cells(i + 1, 1) = paragraphs.item(i).innerText
Next i
End Sub

Exploring the Instr VBA Function In Webscraping

I want to scrape this URL, https://www.realtor.com/realestateandhomes-search/06510, using the VBA InStr function and extract all URLs containing the substring "06510".
Here is the sample code I've been trying to make work.
Option Explicit
Sub GetLinks()
'
'To use HTMLDocument you need to set a reference to Tools -> References -> Microsoft HTML Object Library
Dim HTML As New HTMLDocument
Dim http As Object
Dim links As Object
Dim link As HTMLHtmlElement
Dim counter As Long
Dim website As Range
Dim LastRange As Range
Dim row As Long
Dim continue As Boolean
Dim respHead As String
Dim lRow As Long
Application.ScreenUpdating = False
' The row where website addresses start
row = 30
continue = True
lRow = Cells(Rows.count, 1).End(xlUp).row + 1
' XMLHTTP gives errors where ServerXMLHTTP does not
' even when using the same URL's
'Set http = CreateObject("MSXML2.XMLHTTP")
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
Do While continue
' Could set this to first cell with URL then OFFSET columns to get next web site
Set website = Range("A" & row)
Set LastRange = Range("B" & lRow)
If Len(website.Value) < 1 Then
continue = False
Exit Sub
End If
If website Is Nothing Then
continue = False
End If
'Debug.Print website
With http
On Error Resume Next
.Open "GET", website.Value, False
.send
' If Err.Num is not 0 then an error occurred accessing the website
' This checks for badly formatted URL's. The website can still return an error
' which should be checked in .Status
'Debug.Print Err.Number
' Clear the row of any previous results
Range("B" & row & ":e" & row).Clear
' If the website sent a valid response to our request
If Err.Number = 0 Then
If .Status = 200 Then
HTML.body.innerHTML = http.responseText
Set links = HTML.getElementsByTagName("a")
For Each link In links
If InStr(link.outerHTML, "06510") Then
LastRange.Value = link.href
End If
Next
End If
Set website = Nothing
Else
'Debug.Print "Error loading page"
LastRange.Value = "Error with website address"
End If
On Error GoTo 0
End With
row = row + 1
Loop
Application.ScreenUpdating = True
End Sub
After inspecting the page, here's a sample of the kind of URL to extract - https://www.realtor.com/realestateandhomes-detail/239-Bradley-St_New-Haven_CT_06510_M36855-92189. Any help will be appreciated
Using QHarr's code in a simplified way...
Sub GetLinks()
Dim url As String, links_count As Integer
Dim j As Integer, row As Integer
Dim XMLHTTP As Object, html As Object
'Dim tr_coll As Object, tr As Object
'Dim elements As Object
Dim i As Long, allLinksOfInterest As Object
'Dim td_coll As Object, td As Object, td_col, objT
url = "https://www.realtor.com/realestateandhomes-search/06510"
Set XMLHTTP = CreateObject("MSXML2.XMLHTTP")
XMLHTTP.Open "GET", url, False
XMLHTTP.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = XMLHTTP.responseText
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
Debug.Print allLinksOfInterest.Item(i).href
Next
End Sub
Please check if I'm missing anything. I'm still getting the error "Object doesn't support this property or method"
Don't use InStr on the entire outerHTML of each node while looping over all a tags. There are times when this is required, but this shouldn't be one of them (hopefully).
You want to use an attribute = value CSS selector with the contains operator, *. It exists specifically for matching substrings in attribute values, and it is more efficient.
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
So,
Dim i As Long, allLinksOfInterest As Object
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
Debug.Print allLinksOfInterest.Item(i).href
Next
Attribute = value with contains operator:
[attr*=value]
Represents elements with an attribute name of attr whose
value contains at least one occurrence of value within the string.
VBA:
This currently produces 26 links. All are relative links, so the domain needs to be added, as shown in the loop. Some are duplicates, so consider adding them to a dictionary as keys to remove the duplicates.
Option Explicit
Public Sub GetLinks()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.realtor.com/realestateandhomes-search/06510", False
.send
html.body.innerHTML = .responseText
End With
Dim i As Long, allLinksOfInterest As Object
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
Debug.Print Replace$(allLinksOfInterest.item(i).href,"about:","https://www.realtor.com")
Next
End Sub
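The dictionary idea for dropping duplicates could look something like this. It is a sketch on a hard-coded array so it runs without the page; the hrefs are placeholders in the "about:/..." form that MSHTML returns for relative links, and the procedure name is made up.

```vba
Option Explicit

Public Sub DemoUniqueLinks()
    Dim hrefs As Variant, seen As Object
    Dim i As Long, fullUrl As String, k As Variant

    ' Placeholder hrefs; the first and third are deliberate duplicates.
    hrefs = Array("about:/realestateandhomes-detail/A_06510", _
                  "about:/realestateandhomes-detail/B_06510", _
                  "about:/realestateandhomes-detail/A_06510")

    Set seen = CreateObject("Scripting.Dictionary")

    For i = LBound(hrefs) To UBound(hrefs)
        fullUrl = Replace$(hrefs(i), "about:", "https://www.realtor.com")
        ' Keys are unique, so only the first occurrence is stored.
        If Not seen.Exists(fullUrl) Then seen.Add fullUrl, True
    Next i

    For Each k In seen.Keys
        Debug.Print k   ' two distinct URLs
    Next k
End Sub
```

In the real procedure the loop over hrefs would be the loop over allLinksOfInterest, with allLinksOfInterest.item(i).href in place of hrefs(i).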
If InStr(link.outerHTML, "06510") Then
In the code above, InStr is used as if it were a Boolean function. It is not: it returns an integer position. So you should add a comparison operator after the function call, for example:
If InStr(link.outerHTML, "06510")>0 Then
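A quick way to see what InStr actually returns is to print it for a hit and a miss. The sample string below is built from the example URL given in the question; the procedure name is only for illustration.

```vba
Option Explicit

Public Sub DemoInStr()
    Dim s As String
    s = "<a href=""/realestateandhomes-detail/" & _
        "239-Bradley-St_New-Haven_CT_06510_M36855-92189"">"

    Debug.Print InStr(s, "06510")        ' a position greater than 0
    Debug.Print InStr(s, "99999")        ' 0, substring not found
    Debug.Print InStr(s, "06510") > 0    ' True
End Sub
```

Comparing against 0 makes the intent explicit: the test is "was the substring found", not "what is its position".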

scraping web vba tags nested

I always have problems web scraping with VBA when I find nested tags, as on this site:
http://forebet.com
I am scraping the link data from the menu on the left, but it goes wrong when I get to the nested championships such as England and Spain.
Sub championshipforebet()
Dim objIE As Object
Dim itemEle As Object
Dim itemEle1 As Object
Dim away As Object
Dim desc As String, pt1 As String, pt2 As String, price As String
Dim i As Integer
Cells.Select
Selection.ClearContents
Selection.NumberFormat = "#"
Set objIE = CreateObject("internetexplorer.application")
objIE.Visible = True
objIE.navigate "https://www.forebet.com/it/"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Set itemEle = objIE.document.getElementsByClassName("moduletable_foremenu")(1)
Set itemEle1 = itemEle.getElementsByClassName("tree_foremenu")
i = 1
For Each ele1 In itemEle1(0).getElementsByTagName("li")
i = i + 1
Cells(i, 1) = ele1(0).getElementsByClassName("mainlevel_foremenu").href
Next ele1
End Sub
Although there is no remarkable difference between the two answers, I decided to post mine as I had already written it. The For loop and the Split function might be helpful to future readers. As QHarr has already described .querySelector(), I won't repeat that here.
This is the code you can try as well:
Sub GrabLinks()
Const Baseurl$ = "https://www.forebet.com"
Dim S$, I&
With New XMLHTTP60
.Open "GET", Baseurl & "/", False
.send
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
With .querySelectorAll(".mainlevel_foremenu,.sublevel_foremenu")
For I = 0 To .Length - 1
Cells(I + 1, 1) = Baseurl & Split(.Item(I).getAttribute("href"), "about:")(1)
Next I
End With
End With
End Sub
References to add:
Microsoft XML, v6.0
Microsoft HTML Object Library
You can use css selectors including the OR operator ",". Also, use XMLHTTPRequest (XHR) to retrieve the data far more quickly than by opening IE.
The links are in one of two classes:
The top level have class:
mainlevel_foremenu
The nested have class:
sublevel_foremenu
You can use a CSS selector combination to get all elements with either of these two classes, combining them with the OR operator.
a.mainlevel_foremenu,a.sublevel_foremenu
The a means a tag (element selector) and the . is a class selector. So this matches a tags that have the class name mainlevel_foremenu or (",") sublevel_foremenu. The a is not actually required in this instance, so you could shorten it to .mainlevel_foremenu, .sublevel_foremenu.
The CSS selector returns more than one item, so the .querySelectorAll method of document is used to return a nodeList. The nodeList is then iterated by index, up to its length, to access the individual links.
The links are relative, so the Replace function is used to remove the "about:" prefix, and the base path is prepended to the link address.
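The "about:" clean-up can be tried on its own, without fetching the page. In this sketch the href value is a made-up placeholder and the procedure name is hypothetical.

```vba
Option Explicit

Public Sub DemoFixRelativeHref()
    Const BASE_URL As String = "https://www.forebet.com"
    Dim rawHref As String

    ' Placeholder: a relative link as it comes back from an HTMLDocument
    ' that was filled via innerHTML.
    rawHref = "about:/en/football-tips"

    ' Strip the "about:" prefix, then prepend the site root.
    Debug.Print BASE_URL & Replace$(rawHref, "about:", vbNullString)
    ' prints https://www.forebet.com/en/football-tips
End Sub
```

Replace leaves the string untouched when "about:" is absent, whereas Split(href, "about:")(1) would raise a subscript error in that case, so Replace is the more forgiving of the two.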
If we look at the first switch on the page:
The combined CSS selector results show we are also getting the nested level (note I have cut out some of the intermediary results).
VBA:
Option Explicit
Public Sub GetInfo()
Application.Screenupdating = False
Dim sResponse As String, i As Long, html As New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.forebet.com/", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim links As Object
With html
.body.innerHTML = sResponse
Set links = .querySelectorAll("a.mainlevel_foremenu,a.sublevel_foremenu")
End With
For i = 0 To links.Length - 1
Cells(i + 1, 1) = "https://www.forebet.com" & Replace$(links(i).href, "about:", vbNullString)
Next i
Application.Screenupdating = True
End Sub
References (VBE > Tools > References):
Microsoft HTML Object Library