Amazon DVD details Web Scraping not able to pick the required Element - html

I am passing the EAN numbers of certain movies to Amazon and extracting each movie's name and ASIN from the search results, for example:
"https://www.amazon.de/s?k=7321925005738&__mk_de_DE=ÅMÅŽÕÑ&ref=nb_sb_noss"
The problem is that the search results sometimes also contain sponsored products (they may or may not appear), and I want to extract only the non-sponsored results.
When I Debug.Print the ASIN and movie name, every result is printed, including the sponsored ones.
To identify a sponsored product I look for data-component-type="sp-sponsored-result" in the response text.
An actual product does not have this value in its data-component-type attribute, yet I am still not able to separate the actual movies from the sponsored results.
I tried an If Not ... Then check, but my code still prints everything.
Here is my code:
Sub Amazon_Pull()
Dim Link_2 As String
Link_2 = "https://www.amazon.de/s?k=7321925005738&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
Dim xhr As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
'Debug.Print html.body.innerHTML
'Debug.Print html.getElementsByTagName("div").getAttribute("data-index").Length
Dim hTable As Object
Dim hba As Object
Set hTable = html.getElementsByTagName("div")
For Each hba In hTable
If Left(hba.getAttribute("data-asin"), 1) = "B" Then
If hba.getElementsByTagName("div")(2).getAttribute("data-component-type") <> "sp-sponsored-result" Then
Debug.Print hba.getAttribute("data-asin")
End If
End If
Next hba
Set xhr = Nothing
Set html = Nothing
'-------------
End Sub

Use a CSS attribute selector to restrict the match to the appropriate nodes:
Dim nodeList As Object, i As Long
Set nodeList = html.querySelectorAll("[data-asin]")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).getAttribute("data-asin")
Next
You can remove your conditional statements and move that logic into the CSS selector, using the starts-with operator ^= to match values beginning with the character B:
Dim nodeList As Object, i As Long
Set nodeList = html.querySelectorAll("[data-asin^=B]")
For i = 0 To nodeList.Length - 1
Debug.Print nodeList.item(i).getAttribute("data-asin")
Next

You can do it quick and dirty like this, but it fails if the word "Gesponsert" is part of a film title ;-)
Sub Amazon_Pull()
Dim Link_2 As String
Dim xhr As MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument
Dim hTable As Object
Dim hba As Object
Dim i As Long
Link_2 = "https://www.amazon.de/s?k=7321925005738"
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.querySelectorAll("div[data-index]")
For i = 0 To hTable.Length - 1
If InStr(1, hTable(i).innerText, "Gesponsert") = 0 Then
Debug.Print hTable(i).getAttribute("data-asin") & " " & hTable(i).getElementsByTagName("h2")(0).innerText
End If
Next i
Set xhr = Nothing
Set html = Nothing
End Sub
In my opinion it's always better to use the code structure of a page and not a part of the content. I know, that's not always possible and it's often more complex.
To check if an offer on amazon is sponsored, you can use the structure of the page code like this. One advantage is that it also works on the international Amazon platforms, regardless of the national language.
(Not tested, because Amazon blocked me as a bot.)
Sub Amazon_Pull()
Dim Link_2 As String
Dim xhr As MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument
Dim hTable As Object
Dim hba As Object
Dim i As Long
Dim check As Long
Dim sponsored As Boolean
Dim checkSponsored As Object
Link_2 = "https://www.amazon.de/s?k=7321925005738"
'Link_2 = "https://www.amazon.de/s?k=apple"
Set xhr = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", Link_2, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.querySelectorAll("div[data-index]")
For i = 0 To hTable.Length - 1
sponsored = False
Set checkSponsored = hTable(i).querySelectorAll("div[data-component-type]")
For check = 0 To checkSponsored.Length - 1
If checkSponsored(check).getAttribute("data-component-type") = "sp-sponsored-result" Then
sponsored = True
End If
Next check
If Not sponsored Then
Debug.Print hTable(i).getAttribute("data-asin") & " " & hTable(i).getElementsByTagName("h2")(0).innerText
End If
Next i
Set xhr = Nothing
Set html = Nothing
End Sub
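The structural check above can also be sketched outside VBA. The following is a minimal Python illustration using only the standard library, not the answer's own method. The sample markup is hypothetical and much simpler than a real Amazon result page; only the data-index, data-asin and data-component-type="sp-sponsored-result" attributes are taken from the thread.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified markup (an assumption for illustration only).
SAMPLE = """
<div data-index="0" data-asin="B00AAA">
  <div data-component-type="sp-sponsored-result"><h2>Sponsored film</h2></div>
</div>
<div data-index="1" data-asin="B00BBB">
  <div class="result"><h2>Actual film</h2></div>
</div>
<div data-index="2" data-asin="">
  <div><h2>Banner without an ASIN</h2></div>
</div>
"""

SPONSORED = "sp-sponsored-result"


class ResultFilter(HTMLParser):
    """Collects each div[data-index] and notes whether it holds a sponsored marker."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # open <div> count inside the current result
        self.current = None  # result being read, or None between results
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        is_sponsored = attrs.get("data-component-type") == SPONSORED
        if self.current is None:
            # Only a div carrying data-index starts a new result.
            if "data-index" in attrs:
                self.current = {"asin": attrs.get("data-asin") or "",
                                "sponsored": is_sponsored}
                self.depth = 1
            return
        self.depth += 1
        if is_sponsored:
            self.current["sponsored"] = True

    def handle_endtag(self, tag):
        if tag != "div" or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:  # the result's own div just closed
            self.results.append(self.current)
            self.current = None


def non_sponsored_asins(html):
    parser = ResultFilter()
    parser.feed(html)
    return [r["asin"] for r in parser.results
            if r["asin"].startswith("B") and not r["sponsored"]]


print(non_sponsored_asins(SAMPLE))  # ['B00BBB']
```

Because the decision rests on an attribute rather than on the visible word "Gesponsert", the same check should carry over to other language versions of the page, provided the markup uses the same attribute.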

Related

Why am I not able to add an HTML Classname to an Element Collection using MSXML2 with VBA

I have tried many proven methods from various posts to get some data from a web page without success. I am able to get a list of linked items on the opening page but once I navigate to any other page, I draw a blank with the code below.
When I run the code, I get no results in Cats.
Sub Main()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim Cats As MSHTML.IHTMLElementCollection
Dim Cat As MSHTML.IHTMLElement
Dim NextHref As String
Dim NextURL As String
XMLReq.Open "GET", URL, False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem"
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set XMLReq = Nothing
Set Cats = HTMLDoc.getElementsByClassName("ng-tns-c329-5 product-grid--tile ng-star-inserted")
Debug.Print Cats.Length 'Returns 0
'For Each Cat In Cats
' NextHref = Cat.getAttribute("href")
' NextURL = URL & Mid(NextHref, InStr(NextHref, ":") + 2)
' ListItemsInCats Cat.innerText, NextURL
'Next Cat
End Sub
Expanded Element structure
Collapsed structure
Thanks for any assistance.
There are two problems with the website you are trying to scrape:
With the XMLHTTP request method, the product details are dynamic content loaded through Fetch/XHR calls, which XMLHTTP does not execute. XMLHTTP only gives you the HTML document as served, without running any scripts.
With the Internet Explorer method, the page is reported as ready before the product details have actually loaded, so the usual loop that checks Busy and ReadyState is not sufficient.
The code below uses Internet Explorer. To work around the second issue, I added some checks (not perfect, I believe, but they have worked in my testing so far) that wait until the first product has loaded before pulling the product details:
Private Sub GetBakeryProducts()
Const URL As String = "https://www.woolworths.com.au/shop/browse/bakery"
Dim ieObj As InternetExplorer
Set ieObj = New InternetExplorer
ieObj.navigate URL
ieObj.Visible = True
Do While ieObj.Busy Or ieObj.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Do While ieObj.document.getElementsByClassName("productCarousel-header").Length = 0
DoEvents
Loop
Dim ieDoc As MSHTML.HTMLDocument
Set ieDoc = ieObj.document
Dim productList As Object
Set productList = ieDoc.getElementsByClassName("product-grid--tile")
'==== Test if the website has finished loading the 1st product details
On Error Resume Next
Dim testStatus As String
Do
Err.Clear
testStatus = productList(0).getElementsByClassName("shelfProductTile-descriptionLink")(0).innerText
Loop Until Err.Number = 0
'====
Dim outputArr() As String
ReDim outputArr(1 To productList.Length, 1 To 2) As String
Dim outputIndex As Long
Dim i As Long
For i = 0 To productList.Length - 1
If productList(i).getElementsByClassName("shelfProductTile-descriptionLink").Length <> 0 Then
If Err.Number <> 0 Then
Err.Clear
Exit For
End If
Dim productName As String
Dim productPrice As String
productName = productList(i).getElementsByClassName("shelfProductTile-descriptionLink")(0).innerText
productPrice = Replace(productList(i).getElementsByClassName("price")(0).innerText, vbNewLine, vbNullString)
outputIndex = outputIndex + 1
outputArr(outputIndex, 1) = productName
outputArr(outputIndex, 2) = productPrice
End If
Next i
'No trimming here: ReDim Preserve can only resize the last dimension of an array,
'and Resize(outputIndex, ...) below writes only the filled rows
ieObj.Quit
Set ieObj = Nothing
ThisWorkbook.Sheets("Sheet1").Range("A1").Resize(outputIndex, UBound(outputArr, 2)).Value = outputArr
End Sub
Running this pulls the data from the website and pastes the output starting at cell A1 of Sheet1; change the worksheet name and range as you see fit.
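The wait loops in the code above run until the content appears, so they would never end if the page failed to load. A common refinement is to poll with a timeout. Below is a minimal Python sketch of that idea, not part of the original answer. FakePage and its product name are hypothetical stand-ins for a browser document whose content shows up after a few polls.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose products appear after a few polls."""

    def __init__(self, polls_before_ready):
        self.polls_left = polls_before_ready

    def product_names(self):
        if self.polls_left > 0:
            self.polls_left -= 1
            return []            # nothing rendered yet
        return ["Sourdough loaf"]


page = FakePage(polls_before_ready=3)
names = wait_until(page.product_names, timeout=2.0, interval=0.01)
print(names)  # ['Sourdough loaf']
```

In the VBA version, the equivalent would be to compare Timer (or Now) against a deadline inside each Do loop and exit with an error message once the deadline has passed.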

Webscraping in VBA where some HTML information has no way to refer to it

I have this VBA script scraping from this URL https://accessgudid.nlm.nih.gov/devices/10806378034350
I want the LOT, SERIAL, and EXPIRATION information, each of which has a "Yes" or "No" inside the HTML.
How do I return just that Yes or No information?
Sub GetInnerInformation(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLResult As MSHTML.IHTMLElement
Dim HTMLResults As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
Set HTMLResults = HTMLPage.getElementsByClassName("device-attribute")
For Each HTMLResult In HTMLResults
If (HTMLResult.innerText Like "*Lot*") = True Then
Debug.Print HTMLResult.innerText, HTMLResult.outerText, HTMLResult.innerHTML
End If
Next HTMLResult
End Sub
In my Immediate Window I get:
Lot or Batch Number: Lot or Batch Number: Lot or Batch Number:
So no reference to the Yes or No that is in the HTML.
HTML Parser:
You could use a CSS attribute selector to target the span with [?] that sits just before the div of interest. Then climb up to the shared parent with parentElement and move to the div of interest with NextSibling. From there, use getElementsByTagName to grab the label nodes and loop over that nodeList to write out the required info. To get the value associated with each label, you again need NextSibling, because of the br children within the parent div.
I use xmlhttp to make the request, which is faster than opening a browser.
Option Explicit
Public Sub WriteOutYesNos()
Dim html As MSHTML.HTMLDocument
Set html = New MSHTML.HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350", False
.send
html.body.innerHTML = .responseText
End With
Dim nodes As Object, i As Long
Set nodes = html.querySelector("[title*='A production identifier (PI) is a variable']").parentElement.NextSibling.getElementsByTagName("LABEL")
For i = 0 To nodes.Length - 3
With ActiveSheet
.Cells(i + 1, 1) = nodes(i).innerText
.Cells(i + 1, 2) = nodes(i).NextSibling.NodeValue
End With
Next
End Sub
JSON Parser:
The data is also available as JSON, which means you can handle it with a JSON parser. I use jsonconverter.bas as the JSON parser. Download the raw code and add it to a standard module called JsonConverter, removing the top Attribute line from the copied code. You then need to go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
Option Explicit
Public Sub WriteOutYesNos()
Dim json As Object, ws As Worksheet, results(), i As Long, s As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
results = Array("lotBatch", "serialNumber", "manufacturingDate")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350.json", False
.send
Set json = JsonConverter.ParseJson(.responseText)
End With
With ws
For i = LBound(results) To UBound(results)
.Cells(i + 1, 1) = results(i)
.Cells(i + 1, 2).Value = IIf(json(results(i)), "Yes", "No")
Next
End With
End Sub
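For comparison, here is a minimal Python sketch of the same Yes/No mapping using the standard json module. The sample payload is hypothetical: the key names come from the code above, but the nesting of the real response is not shown in this thread, so the sketch searches for each key at any depth instead of assuming a layout.

```python
import json

# Hypothetical payload; only the three key names are taken from the thread.
SAMPLE = '{"device": {"lotBatch": true, "serialNumber": false, "manufacturingDate": false}}'


def find_key(node, key):
    """Return the first non-null value stored under key at any depth, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def yes_no_flags(text, keys):
    """Map each key's boolean value to 'Yes' or 'No'; a missing key counts as 'No'."""
    data = json.loads(text)
    return {key: "Yes" if find_key(data, key) else "No" for key in keys}


print(yes_no_flags(SAMPLE, ["lotBatch", "serialNumber", "manufacturingDate"]))
# {'lotBatch': 'Yes', 'serialNumber': 'No', 'manufacturingDate': 'No'}
```

Treating a missing key as "No" is a design choice of this sketch; it can hide a wrong key name, so inspecting the raw response once is worthwhile.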
XML Parser:
The results are also available as XML, which you can parse with an XML parser provided you handle the default namespace appropriately:
Option Explicit
Public Sub WriteOutYesNos()
Dim xmlDoc As Object, ws As Worksheet, results(), i As Long
Set xmlDoc = CreateObject("MSXML2.DOMDocument")
Set ws = ThisWorkbook.Worksheets("Sheet1")
results = Array("lotBatch", "serialNumber", "manufacturingDate")
With xmlDoc
.validateOnParse = True
.setProperty "SelectionLanguage", "XPath"
.setProperty "SelectionNamespaces", "xmlns:i='http://www.fda.gov/cdrh/gudid'"
.async = False
If Not .Load("https://accessgudid.nlm.nih.gov/devices/10806378034350.xml") Then
Err.Raise .parseError.ErrorCode, , .parseError.reason
Exit Sub
End If
End With
With ws
For i = LBound(results) To UBound(results)
.Cells(i + 1, 1) = results(i)
.Cells(i + 1, 2).Value = IIf(xmlDoc.SelectSingleNode("//i:" & results(i)).Text, "Yes", "No")
Next
End With
End Sub
Tinkered around and found it. I had to hardcode the results a little but here is what I got. Let me know if you've found a more elegant answer!
Sub GetInnerInformation(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLResult As MSHTML.IHTMLElement
Dim HTMLResults As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
Dim Lot As Boolean
Dim Serial As Boolean
Dim Expiration As Boolean
Set HTMLResults = HTMLPage.getElementsByClassName("expandable-device-content")
For Each HTMLResult In HTMLResults
If (HTMLResult.innerText Like "*Lot or Batch Number*") = True Then
Debug.Print HTMLResult.innerText
If HTMLResult.innerText Like "*Lot or Batch Number: Yes*" Then
Lot = True
End If
If HTMLResult.innerText Like "*Lot or Batch Number: No*" Then
Lot = False
End If
If HTMLResult.innerText Like "*Serial Number: Yes*" Then
Serial = True
End If
If HTMLResult.innerText Like "*Serial Number: No*" Then
Serial = False
End If
If HTMLResult.innerText Like "*Expiration Date: Yes*" Then
Expiration = True
End If
If HTMLResult.innerText Like "*Expiration Date: No*" Then
Expiration = False
End If
Debug.Print Lot, Serial, Expiration
End If
Next HTMLResult
End Sub

How to Isolate multiple innertext entries when using get elementbyID

I'm trying to isolate 2 different innerText strings from a webpage, but cannot single them out. The innerText for all the tags comes as a whole.
The date and season number are the issue.
I am using getElementById and this gives me a single element. The div with id "next_episode" has what looks like 2 different entries for inner text that I'm interested in. When I loop through the inner text of its children, these 2 entries are skipped. I can't figure out how to isolate the 2 different innerText entries of just the "next_episode" tag. I'm isolating the text I require by using the index number in the arrays my code returns.
Dim IE_00 As SHDocVw.InternetExplorer
Dim HTMLDoc_00 As MSHTML.HTMLDocument
Set IE_00 = New SHDocVw.InternetExplorer
IE_00.Visible = True
IE_00.navigate "https://next-episode.net/final-space"
Do While IE_00.readyState <> READYSTATE_COMPLETE
Loop
Set HTMLDoc_00 = IE_00.document
Dim NETC_05 As MSHTML.IHTMLElementCollection
Dim NET_05 As MSHTML.IHTMLElement
'Can loop through the inner text of the children one by one and find what I need
Set NETC_05 = HTMLDoc_00.getElementById("next_episode").Children
For Each NET_05 In NETC_05
Debug.Print NET_05.innerText
Next NET_05
'This just gives a big block of text that includes the missing inner text I need
Set NET_05 = HTMLDoc_00.getElementById("next_episode")
Debug.Print NET_05.innerText
Data (for the most part) is in NextSiblings:
The Node.nextSibling read-only property returns the node immediately
following the specified one in their parent's childNodes, or returns
null if the specified node is the last child in the parent element.
You could write a function, like GetNextSiblings, which checks the current node for specific search strings and then extracts required values from NextSibling. I have re-ordered output columns to make for less code, but you could easily loop an alternative headers array, and use that ordering to access from dict info to write out values in a different order. I determine order of output by order of entry of keys in the dict. I loop the headers array to populate the dict keys and later update the dict with the values scraped.
The overhead of a browser is not needed as the required content is not dynamically loaded. A simple, and much faster, xhr request will suffice.
Side-note:
I would recommend, for this type of page, to use Python 3 and BeautifulSoup (bs4 4.7.1+) as this gives you access to pseudo selector :contains. The code could then be much more concise and the program faster. I show this at the end.
VBA:
Option Explicit
Public Sub GetShowInfo()
Dim html As MSHTML.HTMLDocument, headers(), i As Long, aCollection As Object, info As Object
headers = Array("Name:", "Countdown:", "Date:", "Season:", "Episode:", "Status:")
Set html = New HTMLDocument
With CreateObject("Msxml2.xmlhttp")
.Open "GET", "https://next-episode.net/final-space", False
.send
html.body.innerHTML = .responseText
End With
Set info = CreateObject("Scripting.Dictionary")
For i = LBound(headers) To UBound(headers)
info(headers(i)) = vbNullString
Next
info("Name:") = html.querySelector("#next_episode .sub_main").innerText
info("Countdown:") = html.querySelector("#next_episode span").innerText
Set aCollection = html.getElementById("middle_section").getElementsByTagName("div")
Set info = GetNextSiblings(aCollection, headers, info)
Set aCollection = html.getElementById("next_episode").getElementsByTagName("div")
Set info = GetNextSiblings(aCollection, headers, info)
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 1).Resize(1, info.Count) = info.keys
.Cells(2, 1).Resize(1, info.Count) = info.items
End With
End Sub
Public Function GetNextSiblings(ByVal aCollection As Object, ByRef headers(), ByVal info As Object) As Object
Dim item As Object, i As Long
For Each item In aCollection
For i = 2 To UBound(headers)
If InStr(item.outerHTML, headers(i)) > 0 Then
If headers(i) = "Episode:" Then
info(headers(i)) = item.NextSibling.innerText
Else
info(headers(i)) = item.NextSibling.NodeValue
End If
Exit For
End If
Next
Next
Set GetNextSiblings = info
End Function
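The next-sibling idea behind GetNextSiblings can be shown on a small fragment. This is a minimal Python sketch using the standard library, not a port of the VBA above. The markup and the date are hypothetical; the sketch only assumes, as the answer describes, that each label sits in its own div and that the value is the text that follows that div.

```python
from html.parser import HTMLParser

# Hypothetical fragment (an assumption): label in a div, value as the following text.
SAMPLE = """
<div id="next_episode">
  <div class="subheadline">Date:</div> Sat Mar 20, 2021<br>
  <div class="subheadline">Season:</div> 3<br>
</div>
"""


class SiblingText(HTMLParser):
    """For each label div, records the first non-blank text that follows it."""

    def __init__(self, labels):
        super().__init__()
        self.labels = set(labels)
        self.in_div = False   # True while reading text of the most recently opened div
        self.div_text = ""
        self.pending = None   # label whose following text is still awaited
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.in_div = True
            self.div_text = ""
            self.pending = None

    def handle_endtag(self, tag):
        if tag == "div":
            label = self.div_text.strip()
            matched = self.in_div and label in self.labels
            self.pending = label if matched else None
            self.in_div = False

    def handle_data(self, data):
        if self.in_div:
            self.div_text += data
        elif self.pending and data.strip():
            self.values[self.pending] = data.strip()
            self.pending = None


def next_sibling_text(html, labels):
    parser = SiblingText(labels)
    parser.feed(html)
    parser.close()
    return parser.values


print(next_sibling_text(SAMPLE, ["Date:", "Season:"]))
# {'Date:': 'Sat Mar 20, 2021', 'Season:': '3'}
```

This also shows why looping over the children of next_episode skips these values: they are text nodes, not child elements, so only sibling traversal (or the parent's full text) reaches them.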
Reading:
NextSibling
CSS selectors
querySelector
Python (with bs4 4.7.1+):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://next-episode.net/final-space')
soup = bs(r.content, 'lxml')
current_nodes = ['Status:','Name:', 'Countdown:','Date:','Season:','Episode:']
for node in current_nodes:
    selector = f'#middle_section div:contains("{node}"), #next_episode div:contains("{node}")'
    if node in ['Episode:','Name:']:
        print(node, soup.select_one(selector).text.replace(node,''))
    elif node == 'Countdown:':
        print(node, soup.select_one(selector).next_sibling.text)
    else:
        print(node, soup.select_one(selector).next_sibling)
'Setting XML 05 as an Object
Dim XML_05 As New MSXML2.XMLHTTP60
'Setting HTML Document 05 as an Object
Dim HTML_05 As New MSHTML.HTMLDocument
XML_05.Open "GET", Cells(Row, NextEpisodeURL).Value, False
XML_05.send
HTML_05.body.innerHTML = XML_05.responseText
'Setting Net Element Tag Collection 05 as an Object
Dim NETC_05 As MSHTML.IHTMLElementCollection
'Setting Net Element Tag 05 as an Object
Dim NET_05 As MSHTML.IHTMLElement
'Setting Reg EX 05 as an Object
Dim REO_05 As VBScript_RegExp_55.RegExp
'Setting Match Object 05 as Object
Dim MO_05 As Object
'Setting Season array as Array
Dim SN_05() As String
'Setting Episode Name 05 as Array
Dim ENA_05() As String
'Setting Episode Number 05 as Array
Dim EN_05() As String
'Getting Episode Name Episode Number and Season Number From Net
'Set NETC_05 = HTML_05.getElementsByClassName("sub_main")
Set NET_05 = HTML_05.getElementById("previous_episode")
Set REO_05 = New VBScript_RegExp_55.RegExp
REO_05.Global = True
REO_05.IgnoreCase = True
'Getting Episode Name
REO_05.Pattern = "(Name:(.*))"
Set MO_05 = REO_05.Execute(NET_05.innerText)
Debug.Print MO_05.Count
Debug.Print MO_05(0).Value
ENA_05 = Split(MO_05(0), ":")
Debug.Print ENA_05(1)
Cells(Row, NextEpName).Value = ENA_05(1)
'Getting Episode Number
REO_05.Pattern = "(Episode:([0-9]*))"
Set MO_05 = REO_05.Execute(NET_05.innerText)
Debug.Print MO_05.Count
Debug.Print MO_05(0).Value
EN_05 = Split(MO_05(0), ":")
Debug.Print EN_05(1)
Cells(Row, EpisodeNet).Value = EN_05(1)
'Getting Season Number
REO_05.Pattern = "(Season:([0-9]*))"
Set MO_05 = REO_05.Execute(NET_05.innerText)
Debug.Print MO_05.Count
Debug.Print MO_05(0).Value
SN_05 = Split(MO_05(0), ":")
Debug.Print SN_05(1)
Cells(Row, SeasonNet).Value = SN_05(1)
'Getting Countdown From Net
Set NETC_05 = HTML_05.getElementById("next_episode").Children
Cells(Row, Countdown).Value = NETC_05(5).innerText
Debug.Print NETC_05(5).innerText
Public Sub GetShowInfo()
Dim html As MSHTML.HTMLDocument, headers(), i As Long, aCollection As Object, info As Object
Set html = New HTMLDocument
With CreateObject("Msxml2.xmlhttp")
.Open "GET", "https://next-episode.net/chicago-fire", False
.send
html.body.innerHTML = .responseText
End With
Set aCollection = html.getElementById("next_episode").getElementsByTagName("div")
Set aCollection = html.getElementById("next_episode").getElementsByClassName("subheadline")
On Error Resume Next
For Each ele In aCollection
Debug.Print ele.innerText
Debug.Print ele.outerText
Debug.Print ele.nextElementSibling.innerText
Debug.Print ele.nextElementSibling.innerText
Next ele
End Sub

Exploring the Instr VBA Function In Webscraping

I want to scrape this URL https://www.realtor.com/realestateandhomes-search/06510 using the VBA InStr function and extract all URLs that contain the substring "06510".
Here is the sample code I have been trying to make work.
Option Explicit
Sub GetLinks()
'
'To use HTMLDocument you need to set a reference to Tools -> References -> Microsoft HTML Object Library
Dim HTML As New HTMLDocument
Dim http As Object
Dim links As Object
Dim link As HTMLHtmlElement
Dim counter As Long
Dim website As Range
Dim LastRange As Range
Dim row As Long
Dim continue As Boolean
Dim respHead As String
Dim lRow As Long
Application.ScreenUpdating = False
' The row where website addresses start
row = 30
continue = True
lRow = Cells(Rows.count, 1).End(xlUp).row + 1
' XMLHTTP gives errors where ServerXMLHTTP does not
' even when using the same URL's
'Set http = CreateObject("MSXML2.XMLHTTP")
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
Do While continue
' Could set this to first cell with URL then OFFSET columns to get next web site
Set website = Range("A" & row)
Set LastRange = Range("B" & lRow)
If Len(website.Value) < 1 Then
continue = False
Exit Sub
End If
If website Is Nothing Then
continue = False
End If
'Debug.Print website
With http
On Error Resume Next
.Open "GET", website.Value, False
.send
' If Err.Num is not 0 then an error occurred accessing the website
' This checks for badly formatted URL's. The website can still return an error
' which should be checked in .Status
'Debug.Print Err.Number
' Clear the row of any previous results
Range("B" & row & ":e" & row).Clear
' If the website sent a valid response to our request
If Err.Number = 0 Then
If .Status = 200 Then
HTML.body.innerHTML = http.responseText
Set links = HTML.getElementsByTagName("a")
For Each link In links
If InStr(link.outerHTML, "06510") Then
LastRange.Value = link.href
End If
Next
End If
Set website = Nothing
Else
'Debug.Print "Error loading page"
LastRange.Value = "Error with website address"
End If
On Error GoTo 0
End With
row = row + 1
Loop
Application.ScreenUpdating = True
End Sub
After inspecting the page, here's a sample of the kind of URL to extract - https://www.realtor.com/realestateandhomes-detail/239-Bradley-St_New-Haven_CT_06510_M36855-92189. Any help will be appreciated
Using QHarr's code in a simplified way...
Sub GetLinks()
Dim url As String, links_count As Integer
Dim j As Integer, row As Integer
Dim XMLHTTP As Object, html As Object
'Dim tr_coll As Object, tr As Object
'Dim elements As Object
Dim i As Long, allLinksOfInterest As Object
'Dim td_coll As Object, td As Object, td_col, objT
url = "https://www.realtor.com/realestateandhomes-search/06510"
Set XMLHTTP = CreateObject("MSXML2.XMLHTTP")
XMLHTTP.Open "GET", url, False
XMLHTTP.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = XMLHTTP.responseText
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
Debug.Print allLinksOfInterest.Item(i).href
Next
End Sub
Please check if I'm missing anything. I'm still getting the error "Object doesn't support this property or method"
Don't use InStr on the entire outerHTML of each node while looping over all a tags. There are times when this is required, but this shouldn't be one of them (hopefully).
Instead, use a CSS attribute selector with the contains operator, *=. It exists specifically for matching substrings in attribute values, and it is more efficient.
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
So,
Dim i As Long, allLinksOfInterest As Object
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
Debug.Print allLinksOfInterest.Item(i).href
Next
Attribute = value with contains operator:
[attr*=value]
Represents elements with an attribute name of attr whose
value contains at least one occurrence of value within the string.
VBA:
Currently produces 26 links. All are relative links, so the domain needs to be added, as shown in the loop. Some are duplicates, so consider adding them to a dictionary as keys to remove duplicates.
Option Explicit
Public Sub GetLinks()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.realtor.com/realestateandhomes-search/06510", False
.send
html.body.innerHTML = .responseText
End With
Dim i As Long, allLinksOfInterest As Object
Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
Debug.Print Replace$(allLinksOfInterest.item(i).href,"about:","https://www.realtor.com")
Next
End Sub
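The three steps mentioned above (substring match on href, making relative links absolute, removing duplicates) can be sketched in Python with the standard library. This is an illustration rather than the answer's code. The first path in the sample comes from the question; the other links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# First href is the example from the question; the rest are made up for the demo.
SAMPLE = """
<a href="/realestateandhomes-detail/239-Bradley-St_New-Haven_CT_06510_M36855-92189">Listing</a>
<a href="/realestateandhomes-detail/239-Bradley-St_New-Haven_CT_06510_M36855-92189">Photo</a>
<a href="/realestateandhomes-search/06511">Other ZIP</a>
<a href="/realestateandhomes-search/06510/pg-2">Next</a>
"""


class HrefCollector(HTMLParser):
    """Collects the href attribute of every element that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is not None:
            self.hrefs.append(href)


def links_containing(html, needle, base_url):
    collector = HrefCollector()
    collector.feed(html)
    # Same test as [href*='needle']: keep hrefs that contain the substring.
    absolute = (urljoin(base_url, h) for h in collector.hrefs if needle in h)
    # dict keys keep insertion order and drop repeats, like a dictionary in VBA.
    return list(dict.fromkeys(absolute))


for link in links_containing(SAMPLE, "06510", "https://www.realtor.com"):
    print(link)
```

In VBA, the same de-duplication can be done by adding each link as a key of a Scripting.Dictionary and writing out its keys at the end.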
If InStr(link.outerHTML, "06510") Then
In the code above, InStr is used as if it were a Boolean function. It is not: it returns the position of the match as an integer (0 when there is no match). VBA treats a non-zero number as True in an If statement, but an explicit comparison makes the intent clear, for example:
If InStr(link.outerHTML, "06510")>0 Then

Scraping a specific <p> class from Yahoo Finance (VBA, Excel)

I have been trying to extract specific information from a certain nested class from the code at the following location
https://finance.yahoo.com/quote/ASUR?p=ASUR
The class in which the innerText is located is named "D(ib) Va(t)", and as far as I have seen this text is unique to that class name. I am using the following code to get the data.
Private Sub CommandButton1_Click()
Dim IE4 As Object
Dim strURL3 As String
Dim divs1 As Object
Dim symbol1 As String
Dim rowd As Integer
Dim divs2 As Object
'turn calculation off
Application.ScreenUpdating = False
Application.DisplayAlerts = False
Application.Calculation = xlCalculationManual
rowd = 1
'Sheets.Add After:=Sheets(ActiveWorkbook.Sheets.Count)
'Sheets(ActiveSheet.Name).Name = "Industry"
'Worksheets("Panel").Activate
'Range("B9").Select
Set IE4 = CreateObject("InternetExplorer.Application") 'Create only one IE instance
'Do Until ActiveCell.Value = "" 'Loop
'symbol1 = ActiveCell.Value
strURL3 = "https://finance.yahoo.com/quote/ASUR?p=ASUR"
IE4.Visible = True 'Flag to remove IE visibility
VBA.Shell "RunDll32.exe InetCpl.Cpl, ClearMyTracksByProcess 264", vbMinimizedNoFocus
VBA.Shell "RunDll32.exe InetCpl.Cpl, ClearMyTracksByProcess 258", vbMinimizedNoFocus
IE4.Navigate strURL3
While IE4.Busy: DoEvents: Wend 'Break of 4 seconds after loading
Application.Wait (Now + TimeValue("0:00:04"))
Set divs1 = IE4.Document.getelementsbytagname("div")
Worksheets("Industry").Activate
ActiveSheet.Cells("1,2").Select
For Each div In divs1
Set divs2 = IE4.Document.getelementsbytagname("p")
For Each p In divs2
If p.classname = "D(ib) Va(t)" Then
Debug.Print p.innertext
'Cells(rowd, 2) = p.innertext
'rowd = rowd + 1
End If
Next p
Next div
'Sheets("Panel").Select
' ActiveCell.Offset(1, 0).Select
'Loop
'Sheets("Panel").Select
'Range("B9").Select 'range that selects rows and columns to paste in every company sheet
'Range(Selection, Selection.End(xlDown)).Select
'Range(Selection, Selection.End(xlDown)).Select
'Range(Selection, Selection.End(xlDown)).Select
' Selection.Copy
' Sheets("Industry").Select
'Range("A1").Select
'ActiveSheet.Paste
' Application.CutCopyMode = False
IE4.Quit
'turn calculation on
Application.Calculation = xlCalculationAutomatic
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub
It does not capture the class. Can anyone help me with this?
Selecting on compound classes is not good practice as long as there is a way to avoid it. The following implementation achieves the same result:
Sub FetchText()
Const Link$ = "https://finance.yahoo.com/quote/ASUR?p=ASUR"
Dim Http As New XMLHTTP60, Htmldoc As New HTMLDocument
With Http
.Open "GET", Link, False
.send
Htmldoc.body.innerHTML = .responseText
End With
MsgBox Htmldoc.querySelector("p.businessSummary").PreviousSibling.LastChild.innerText
End Sub
The p element with that class includes the company sector, industry and employee info. You can retrieve it faster with xmlhttp, which avoids opening a browser, and then use a CSS selector combination to target the element:
Option Explicit
Public Sub GetInfo()
Const URL As String = "https://finance.yahoo.com/quote/ASUR?p=ASUR"
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.Send
html.body.innerHTML = .responsetext
End With
Debug.Print html.querySelector("p.D\(ib\).Va\(t\)").innerText
End Sub
If you want to avoid compound classes you can use the following:
Debug.Print html.querySelector("p + .D\(ib\)").innerText
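Two details explain why these selectors work where the className comparison in the question may not: class names are matched as separate tokens regardless of order or extra classes, and characters such as parentheses must be escaped in a selector. The Python sketch below illustrates both. The helper names are hypothetical, the escaping rule is deliberately simplified (it ignores cases such as a leading digit), and the "Fz(s)" class is made up for the demo.

```python
import re


def has_classes(class_attr, *wanted):
    """True if every wanted class is one of the whitespace-separated tokens."""
    tokens = set(class_attr.split())
    return all(name in tokens for name in wanted)


def escape_class(name):
    """Backslash-escape every character other than letters, digits, _ and -."""
    return re.sub(r"([^\w-])", r"\\\1", name)


def class_selector(tag, *classes):
    """Build a selector such as p.first.second with each class name escaped."""
    return tag + "".join("." + escape_class(c) for c in classes)


# A different order and an extra class break an exact string comparison...
class_attr = "Va(t) D(ib) Fz(s)"
print(class_attr == "D(ib) Va(t)")                 # False
# ...but token-based matching still succeeds.
print(has_classes(class_attr, "D(ib)", "Va(t)"))   # True

print(class_selector("p", "D(ib)", "Va(t)"))       # p.D\(ib\).Va\(t\)
```

The last line reproduces the escaped selector used with querySelector above.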
References (VBE > Tools > References):
Microsoft HTML Object Library