Excel VBA extracting href value - html

I have a macro that tries to extract all the href values from a page but it only seems to get the first one. If someone could help me out that would be greatly appreciated.
The URL I used is https://www.facebook.com/marketplace/vancouver/entertainment
Screenshot of HTML
<div class="_3-98" data-testid="marketplace_home_feed">
<div>
<div>
<div class="_65db">
<a class="_1oem" href="/marketplace/item/920841554781924" data-testid="marketplace_feed_item">
<a class="_1oem" href="/marketplace/item/580124349088759" data-testid="marketplace_feed_item">
<a class="_1oem" href="/marketplace/item/1060730340772072" data-testid="marketplace_feed_item">
Sub Macro1()
marker = 0
Set objShell = CreateObject("Shell.Application")
IE_count = objShell.Windows.Count
For x = 0 To (IE_count - 1)
On Error Resume Next ' sometimes more web pages are counted than are open
my_url = objShell.Windows(x).document.Location
my_title = objShell.Windows(x).document.Title
If my_title Like "Facebook" & "*" Then 'compare to find if the desired web page is already open
Set ie = objShell.Windows(x)
marker = 1
Exit For
Else
End If
Next
Set my_data = ie.document.getElementsByClassName("_3-98")
Dim link
i = 1
For Each elem In my_data
Set link = elem.getElementsByTagName("a")(0)
i = i + 1
'copy the data to the excel sheet
ActiveSheet.Cells(i, 4).Value = link.href
Next
End Sub

You can use a CSS selector combination to get the elements. If you provide the actual HTML as text rather than as an image, it will be easier to test and determine the best combination. The selector is applied via the querySelectorAll method, which returns a nodeList of all matching elements. You loop from 0 to .Length - 1 to access the items by index.
VBA:
Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll("._1oem[href]")
For i = 0 To aNodeList.Length-1
Activesheet.Cells(i + 2,4) = aNodeList.item(i)
Next
The CSS selector combination is ._1oem[href], which selects elements that have a class of _1oem and also have an href attribute. The "." is a class selector and the [] is an attribute selector. It is a fast and robust method.
The above assumes there are no parent form/frame/iframe tags to negotiate.
An alternative selector that matches on the two attributes, rather than the class would be:
html.querySelectorAll("[data-testid='marketplace_feed_item'][href]")
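To try either selector without opening a browser, a minimal sketch along these lines may help. It assumes a reference to the Microsoft HTML Object Library and loads markup adapted from the question into an in-memory document; the procedure name DemoSelectors, the link text, and the third anchor (which has no href) are made up for illustration.

```vba
Option Explicit

' Requires: VBE > Tools > References > Microsoft HTML Object Library
Public Sub DemoSelectors()
    Dim html As MSHTML.HTMLDocument, nodes As Object, i As Long
    Set html = New MSHTML.HTMLDocument

    ' Markup adapted from the question; closing </a> tags and the last anchor are assumptions.
    html.body.innerHTML = _
        "<div class=""_65db"">" & _
        "<a class=""_1oem"" href=""/marketplace/item/920841554781924"" data-testid=""marketplace_feed_item"">one</a>" & _
        "<a class=""_1oem"" href=""/marketplace/item/580124349088759"" data-testid=""marketplace_feed_item"">two</a>" & _
        "<a class=""_1oem"">anchor without an href attribute</a>" & _
        "</div>"

    ' Class selector combined with an attribute selector.
    Set nodes = html.querySelectorAll("._1oem[href]")
    Debug.Print "Matched: " & nodes.Length

    ' Loop by index rather than For Each when walking a nodeList.
    For i = 0 To nodes.Length - 1
        Debug.Print nodes.Item(i).getAttribute("href")
    Next i
End Sub
```

Swapping the selector string for "[data-testid='marketplace_feed_item'][href]" lets you compare the two approaches on the same markup.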
Full example:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer
With IE
.Visible = True
.navigate "https://www.facebook.com/marketplace/vancouver/entertainment"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim aNodeList As Object, i As Long
Set aNodeList = IE.document.querySelectorAll("._1oem[href]")
For i = 0 To aNodeList.Length - 1
ActiveSheet.Cells(i + 2, 4) = aNodeList.item(i)
Next
'Quit '<== Remember to quit application
End With
End Sub

You only ask for the first anchor element within each element with a _3-98 class. Iterate through the collection of anchor elements within the parent element.
...
Dim j As Long
Set my_data = ie.document.getElementsByClassName("_65db")
For Each elem In my_data
'HTML element collections expose .Length rather than .Count
For i = 0 To elem.getElementsByTagName("a").Length - 1
j = j + 1
ActiveSheet.Cells(j, 4).Value = elem.getElementsByTagName("a")(i).href
Next i
Next elem
...

How can I scrape a child span class using VBA?

I would like to use the below code to scrape the ranking position in one of amazon's sub categories (Candy & Chocolate Bars). Please see attached picture for the html code. Referring to "a-list-item" gives me a lot of results, however, the "#204" is not part of it. Thus, my question is two-fold: (i) is there a way to get all the items labeled "a-list-item" and (ii) is there a way that just gives me back the ranking position without pulling all other items. Option (ii) would be superior for my purposes.
Thanks for your time and efforts.
html code with target value
Public Sub social()
'--------------------------------------------------------------------------
Dim WSactive As Worksheet
Dim IE As New InternetExplorer
Set WBactive = ActiveWorkbook
Set WSactive = WBactive.Sheets("Tabelle1")
'-----------------------------------------------------------------------------
On Error Resume Next
'-------------------------------------------------------------------------
With WSactive
Dim results(0 To 4) ', counter As Long, i As Long
With IE
.Visible = False
'If TBLurl.DataBodyRange.Cells(i, w).Value <> "" Then
url = "https://www.amazon.com/dp/B08X19ZCHS?ref=myi_title_dp"
.navigate url
'--------------------------------------------------------------------------
While .Busy Or .readyState < 4: DoEvents: Wend
Dim aNodeList As Object, ele As Object, t As Date
Const MAX_WAIT_SEC As Long = 5
t = Timer
Do
DoEvents
On Error Resume Next
Set ele = .document.querySelector(".rhpdm")
On Error GoTo 0
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ele Is Nothing
'--------------------------------------------------------------------------
Set aNodeList = .document.querySelectorAll(".a-list-item")
Dim j As Long
For j = 0 To aNodeList.Length - 1
Debug.Print aNodeList.Item(j).innerText
Next j
Set aNodeList = Nothing: Set ele = Nothing
.Quit 'close IE
End With
End With
'-------------------------------------------------------------------------
End Sub
Only one URL is shown, and a CSS path to that node can be:
Debug.Print .document.querySelector("#detailBulletsWrapper_feature_div #detailBullets_feature_div + ul .a-list-item .a-list-item").innerText
The + is an adjacent sibling combinator, specifying a ul that immediately follows, at the same level, an element with id detailBullets_feature_div. The spaces are descendant combinators, meaning elements on the right are descendants (not necessarily direct children) of elements on the left of the space. The . are class selectors for the class names of elements.
The html in your image appears to be slightly different from the provided url however. Your mileage with the above solution may vary. I have gone for positional matching on assumption other pages will have similar layout.
Even if you were doing a loop over different pages, pages as per the provided url would have only that one ranking to pick up (if as shown in your image).
You don't need the overhead of the browser. The following shows you how to get both rankings on the page (not just the one you showed) using xhr:
Option Explicit
Public Sub PrintRankings()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.amazon.com/dp/B07YG6MTD3?ref=myi_title_dp&th=1", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
Debug.Print html.querySelector("#detailBullets_feature_div + ul span > span").NextSibling.NodeValue
Debug.Print html.querySelector("#detailBulletsWrapper_feature_div #detailBullets_feature_div + ul .a-list-item .a-list-item").innerText
End Sub

VBA Web Scraping using getElementsByClassName to names and addresses

I'm trying to extract the clinic name and corresponding address for all the clinics from the following web page: https://medimap.ca/Location/Calgary,%20AB,%20Canada
I'm having issues locating the exact area where I should be drilling down into. All the clinic names have the same class name of "_1FLG5" and the addresses are all "_1-Gov" . However, when I run through the below code nothing happens - no errors just nothing.
I'm also unsure if the reference after .getElementsByClassName is correct, as I want the inner text from the same row as where the "_1FLG5" is I referenced (0) and since I wanted the text from two rows below "_1-Gov" I referenced (2).
Option Explicit
Sub GetClinicData()
Dim objIE As InternetExplorer
Dim clinicEle As Object
Dim clinicAdd As Object
Dim clinicName As String
Dim address As String
Dim y As Integer
Dim x As Integer
Set objIE = New InternetExplorer
objIE.Visible = False
objIE.navigate "https://medimap.ca/Location/Calgary,%20AB,%20Canada"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each clinicEle In objIE.document.getElementsByClassName("_1FLG5")
clinicName = clinicEle.getElementsByClassName("_1FLG5")(0).innerText
Sheets("Sheet1").Range("A" & y).Value = clinicName
y = y + 1
Next
x = 1
For Each clinicAdd In objIE.document.getElementsByClassName("_1-Gov")
clinicAdd = clinicAdd.getElementsByClassName("_1-Gov")(2).innerText
Sheets("Sheet1").Range("B" & x).Value = clinicAdd
x = x + 1
Next
End Sub
Content is dynamically loaded, so you need a wait condition to ensure the content has loaded; otherwise your collections end up with a length of 0. I use querySelectorAll to apply the class names, which returns a nodeList that you loop over using its .Length. Ideally you should also add a timeout condition to the loop so it cannot run forever if the content never appears.
Option Explicit
'VBE > Tools > References: Microsoft Internet Controls
Public Sub GetData()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.Navigate2 "https://medimap.ca/Location/Calgary,%20AB,%20Canada"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim clinics As Object, addresses As Object, i As Long
With .document
Do
Set clinics = .querySelectorAll("._1FLG5")
Set addresses = .querySelectorAll("._1-Gov")
Loop While clinics.Length = 0
For i = 0 To clinics.Length - 1
With ThisWorkbook.Worksheets("Sheet1")
.Cells(i + 1, 1) = Trim$(clinics.item(i).innerText)
.Cells(i + 1, 2) = Trim$(addresses.item(i).innerText)
End With
Next
End With
.Quit
End With
End Sub
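A timeout for the wait loop could be sketched roughly as follows. The helper name WaitForNodes and the 10 second default are assumptions rather than part of the answer above; the idea is simply to poll querySelectorAll until something matches or the time limit passes.

```vba
Option Explicit

' Hypothetical helper: polls doc until selector matches at least one node
' or maxWaitSec seconds have elapsed. The caller should check .Length on the result.
Public Function WaitForNodes(ByVal doc As Object, ByVal selector As String, _
                             Optional ByVal maxWaitSec As Double = 10) As Object
    Dim startTime As Double, elapsed As Double, nodes As Object
    startTime = Timer
    Do
        DoEvents                                   ' keep Excel responsive while polling
        Set nodes = doc.querySelectorAll(selector)
        If nodes.Length > 0 Then Exit Do           ' content has arrived
        elapsed = Timer - startTime
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop While elapsed <= maxWaitSec
    Set WaitForNodes = nodes
End Function
```

Inside the With ie block it might be called as Set clinics = WaitForNodes(.document, "._1FLG5"), followed by a check such as If clinics.Length = 0 Then to handle the case where the wait timed out.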

Extract hyperlink from website using VBA facing error

I am trying to extract all the hyperlinks which contain "http://www.bursamalaysia.com/market/listed-companies/company-announcements/" from the web pages I input.
At first the code ran well, but now I can no longer extract the URL links that I need. They are missing every time I run the sub.
Link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All
Sub scrapeHyperlinks()
Dim IE As InternetExplorer
Dim html As HTMLDocument
Dim ElementCol As Object
Dim Link As Object
Dim erow As Long
Application.ScreenUpdating = False
Set IE = New InternetExplorer
For u = 1 To 50
IE.Visible = False
IE.navigate Cells(u, 2).Value
Do While IE.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to websitehahaha"
DoEvents
Loop
Set html = IE.document
Set ElementCol = html.getElementsByTagName("a")
For Each Link In ElementCol
erow = Worksheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Row
Cells(erow, 1).Value = Link
Cells(erow, 1).Columns.AutoFit
Next
Next u
ActiveSheet.Range("$A$1:$A$152184").AutoFilter Field:=1, Criteria1:="http://www.bursamalaysia.com/market/listed-companies/company-announcements/???????", Operator:=xlAnd
For k = 1 To [A65536].End(xlUp).Row
If Rows(k).Hidden = True Then
Rows(k).EntireRow.Delete
k = k - 1
End If
Next k
Set IE = Nothing
Application.StatusBar = ""
Application.ScreenUpdating = True
End Sub
Just to get the qualifying hrefs that you mention from the URL given I would use the following. It uses a CSS selector combination to target the URLs of interest from the specified page.
The CSS selector combination is
#bm_ajax_container [href^='/market/listed-companies/company-announcements/']
This is a descendant combination that looks for elements with an href attribute whose value starts with /market/listed-companies/company-announcements/ and that have an ancestor element with an id of bm_ajax_container. That ancestor element is the ajax container div. The "#" is an id selector and the "[]" indicates an attribute selector. The "^" means starts with.
Example of container div and first matching href:
As more than one element is to be matched the CSS selector combination is applied via querySelectorAll method. This returns a nodeList whose .Length can be traversed to access individual items by index.
The full set of qualifying links are written out to the worksheet.
Example CSS query results from page using selector (sample):
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer
Application.ScreenUpdating = False
With IE
.Visible = True
.navigate "http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim links As Object, i As Long
Set links = .document.querySelectorAll("#bm_ajax_container [href^='/market/listed-companies/company-announcements/']")
For i = 0 To links.Length - 1
With ThisWorkbook.Worksheets("Sheet1")
.Cells(i + 1, 1) = links.item(i)
End With
Next i
.Quit
End With
Application.ScreenUpdating = True
End Sub

Extracting a series of URL using VBA

I am just trying to run through a list of URL links, but it keeps showing run-time error '91': Object variable or With block variable not set.
The data I want to extract is inside iframes. It does show some of the values, but it gets stuck in the middle of the process with the error.
Below is a sample URL that I want to extract values from: http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201
Public Sub GetInfo()
Dim IE As New InternetExplorer As Object
With IE
.Visible = False
For u = 2 To 100
.navigate Cells(u, 1).Value
While .Busy Or .readyState < 4: DoEvents: Wend
With .document.getElementById("bm_ann_detail_iframe").contentDocument
ThisWorkbook.Worksheets("Sheet1").Cells(u, 3) = .getElementById("main").innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 4) = .getElementsByClassName("company_name")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 5) = .getElementsByClassName("formContentData")(0).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 6) = .getElementsByClassName("formContentData")(5).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 7) = .getElementsByClassName("formContentData")(7).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 8) = .getElementsByClassName("formContentData")(8).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 10) = .getElementsByClassName("formContentData")(10).innerText
ThisWorkbook.Worksheets("Sheet1").Cells(u, 11) = .getElementsByClassName("formContentData")(11).innerText
End With
Next u
End With
End Sub
tl;dr
Your error is due to the fact that there are different numbers of elements for the given class name depending on the results per page, so you can't use fixed indexes. For the page you indicated, the last index for that class, via the iframe, is 9, i.e. ThisWorkbook.Worksheets("Sheet1").Cells(u, 9) = .getElementsByClassName("formContentData")(9).innerText. Indexes 10 and 11 are invalid. Below I show a way to determine the number of results and extract info from each result row.
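If you do want to keep index-based access, one way to avoid error 91 is to check the index against .Length before reading the item. The helper below is a hypothetical sketch (SafeInnerText is not part of any library) that returns an empty string when the index does not exist on the current page.

```vba
Option Explicit

' Hypothetical helper: returns the innerText of nodes.Item(idx),
' or an empty string when the collection is missing or idx is out of range.
Public Function SafeInnerText(ByVal nodes As Object, ByVal idx As Long) As String
    If nodes Is Nothing Then Exit Function
    If idx < 0 Or idx >= nodes.Length Then Exit Function
    SafeInnerText = nodes.Item(idx).innerText
End Function
```

Inside the iframe's With block you could then write something like Set items = .getElementsByClassName("formContentData") once, followed by Cells(u, 10) = SafeInnerText(items, 10), so pages with fewer elements leave the cell blank instead of raising an error.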
General principle:
Ok... so the following works on the principle of targeting the Details of Changes table for most of the info.
Example extract:
More specifically, I target the rows that repeat the info for No, Date of Change, #Securities, Type of Transaction and Nature of Interest. These values are stored in an array of arrays (one array per row of information). Then the results arrays are stored in a collection to later be written out to the sheet. I loop each table cell in the targeted rows (td tag elements within parent tr) to populate the arrays.
I add in the Name from the table above on the page and also, because there can be more than one row of results, depending on the webpage, and because I am writing the results to a new Results sheet, I add in the URL before each result to indicate source of information.
TODO:
Refactor the code to be more modular
Potentially add in some error handling
CSS selectors:
① I select the Name element, which I refer to as title, from the Particulars of substantial Securities Holder table.
Example name element:
Inspecting the HTML for this element shows it has a class of formContentLabel, and that it is the first class with this value on the page.
Example HTML for target Name:
This means I can use a class selector , .formContentLabel, to target the element. As it is a single element I want I use the querySelector method to apply the CSS selector.
② I target the rows of interest in the Details of Changes table with a selector combination of .ven_table tr. This is a descendant combination that selects elements with a tr tag that have an ancestor with class ven_table. As these are multiple elements, I use the querySelectorAll method to apply the CSS selector combination.
Example of a target row:
Example results returned by CSS selector (sample):
The rows I am interested in start at index 1 and repeat every 4 rows after that, e.g. rows 5, 9, etc.
So I use a little maths in the code to return just the rows of interest:
Set currentRow = data.item(i * 4 + 1)
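As a quick check of that arithmetic, the mapping from result number to row index can be pulled out into a small function. TargetRowIndex and ShowTargetRows are hypothetical names used only for this sketch.

```vba
Option Explicit

' Hypothetical helper: maps a 0-based result number to the index of its
' row in the nodeList returned by the ".ven_table tr" selector.
Public Function TargetRowIndex(ByVal resultNumber As Long) As Long
    TargetRowIndex = resultNumber * 4 + 1
End Function

Public Sub ShowTargetRows()
    Dim i As Long
    For i = 0 To 2
        Debug.Print TargetRowIndex(i)   ' prints 1, then 5, then 9
    Next i
End Sub
```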
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, headers(), u As Long, resultCollection As Collection
headers = Array("URL", "Name", "No", "Date of change", "# Securities", "Type of Transaction", "Nature of Interest")
Set resultCollection = New Collection
Dim links()
links = Application.Transpose(ThisWorkbook.Worksheets("Sheet1").Range("A2:A3")) 'A100
With IE
.Visible = True
For u = LBound(links) To UBound(links)
If InStr(links(u), "http") > 0 Then
.navigate links(u)
While .Busy Or .readyState < 4: DoEvents: Wend
Application.Wait Now + TimeSerial(0, 0, 1) '<you may not always need this. Or may need to increase.
Dim data As Object, title As Object
With .document.getElementById("bm_ann_detail_iframe").contentDocument
Set title = .querySelector(".formContentData")
Set data = .querySelectorAll(".ven_table tr")
End With
Dim results(), numberOfRows As Long, i As Long, currentRow As Object, td As Object, c As Long, r As Long
numberOfRows = Round(data.Length / 4, 0)
ReDim results(1 To numberOfRows, 1 To 7)
For i = 0 To numberOfRows - 1
r = i + 1
results(r, 1) = links(u): results(r, 2) = title.innerText
Set currentRow = data.item(i * 4 + 1)
c = 3
For Each td In currentRow.getElementsByTagName("td")
results(r, c) = Replace$(td.innerText, "document.write(rownum++);", vbNullString)
c = c + 1
Next td
Next i
resultCollection.Add results
Set data = Nothing: Set title = Nothing
End If
Next u
.Quit
End With
Dim ws As Worksheet, item As Long
If Not resultCollection.Count > 0 Then Exit Sub
If Not Evaluate("ISREF('Results'!A1)") Then '<==Credit to @Rory for this test
Set ws = Worksheets.Add
ws.NAME = "Results"
Else
Set ws = ThisWorkbook.Worksheets("Results")
ws.cells.Clear
End If
Dim outputRow As Long: outputRow = 2
With ws
.cells(1, 1).Resize(1, UBound(headers) + 1) = headers
For item = 1 To resultCollection.Count
Dim arr()
arr = resultCollection(item)
For i = LBound(arr, 1) To UBound(arr, 1)
.cells(outputRow, 1).Resize(1, 7) = Application.WorksheetFunction.Index(arr, i, 0)
outputRow = outputRow + 1
Next
Next
End With
End Sub
Example results using 2 provided tests URLs:
Sample URLs in sheet1:
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5928057
http://www.bursamalaysia.com/market/listed-companies/company-announcements/5927201

find a string in DOM explorer using VBA

I need to find the string in the attached picture using a vba in excel. I have the code below, but it is not finding the date that I am looking for.
The for loop to find is at the last "For Each Element In Elements2"
Dim Doc As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim elements1 As IHTMLElementCollection
Dim Elements2 As IHTMLElementCollection
Dim iCnt As Integer
Dim Element As IHTMLElement
Dim appIE As InternetExplorerMedium
Sheets(1).Cells(1, 1).Value = ""
complete = 0
Set appIE = New InternetExplorerMedium
sURL = "https://example.com"
With appIE
.Navigate sURL
.Visible = True
Do While appIE.Busy Or appIE.ReadyState <> 4
DoEvents
Loop
Do While complete <> 1
Set Doc = appIE.Document
Set Elements = Doc.GetElementsByTagName("input")
Set elements1 = Doc.GetElementsByTagName("strong")
Set Elements2 = Doc.GetElementsByTagName("td")
For Each Element In Elements
If Element.ID = "form-id" Then
requestnumber = Element.GetAttribute("Value")
End If
If Element.ID = "remedy-case-info" Then
CaseInfo = Element.GetAttribute("Value")
End If
Next Element
For Each Element In elements1
If InStr(1, Element.InnerHtml, "EM") Then
For iCnt = 1 To Len(Element.InnerText)
If IsNumeric(Left(Element.InnerText, 2)) Then
NumericOnly (Element.InnerText)
End If
Next iCnt
End If
Next Element
AClientCount = tempcount
For Each Element In Elements2
' If InStr(1, Element.InnerHtml, "td") Then
If InStr(1, Element.InnerHtml, "value-field align-top") Then
Requestdate = Element.GetAttribute("Value")
End If
' End If
Next Element
Set Elements = Nothing
If requestnumber <> "" Then
Sheets(1).Cells(1, 1).Value = requestnumber & " - " & CaseInfo & " - " & tempcount & " - " & Requestdate
complete = 1
End If
Loop
.Quit
End With
The innerHtml property picks up the content within a tag but does not pick up the tag itself. The outerHTML property includes the tag itself as well as the tag's content.
Example:
HTML <p class="fee fie foe fum">bar <b>bat</b> <i>cat</i> car</p>
innerHTML bar <B>bat</B> <I>cat</I> car
outerHTML <P class="fee fie foe fum">bar <B>bat</B> <I>cat</I> car</P>
To perform a text match on an attribute value of an element, you would need to look at the outerHTML property of the element and not the innerHTML property.
However, the class attribute of an element can be accessed via the className property so you could replace the InStr on innerHTML with this:
If InStr(1, Element.className, "value-field align-top") Then
This is not ideal because it would be perfectly valid to write the class names in a different order - e.g. class="align-top value-field" - and this would not be picked up by the InStr function.
It would be better to start with getElementsByClassName (which doesn't care about which order the class names are in) and then use the tagName property to check we have the correct tag, like this:
Set Elements2 = Doc.getElementsByClassName("value-field align-top")
' code for the loops on Elements and Elements1 goes here
For Each Element In Elements2
If UCase$(Element.tagName) = "TD" Then 'tagName is returned in upper case for HTML elements
Finally, Element.getAttribute("value") will return Null unless the element has a named attribute called "value". To get the text value of the element, use this instead:
Requestdate = Element.innerText
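If you ever need to test a className string directly, the class-order problem described above can be avoided by checking each class name separately instead of matching the whole string with InStr. The following is a minimal sketch; HasAllClasses is a hypothetical helper, and it assumes the class names are separated by spaces.

```vba
Option Explicit

' Hypothetical helper: True when every space-separated name in wanted
' appears as a whole class name in classAttr, in any order.
Public Function HasAllClasses(ByVal classAttr As String, ByVal wanted As String) As Boolean
    Dim padded As String, token As Variant
    padded = " " & classAttr & " "          ' pad so whole names can be matched
    For Each token In Split(Trim$(wanted), " ")
        If Len(token) > 0 Then
            ' A missing name leaves the result at its default of False.
            If InStr(1, padded, " " & token & " ", vbBinaryCompare) = 0 Then Exit Function
        End If
    Next token
    HasAllClasses = True
End Function
```

For example, HasAllClasses(Element.className, "value-field align-top") matches both class="value-field align-top" and class="align-top value-field", but not an element that carries only one of the two names.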