How can I scrape a child span class using VBA?

I would like to use the code below to scrape the ranking position in one of Amazon's subcategories (Candy & Chocolate Bars). Please see the attached picture for the HTML code. Referring to "a-list-item" gives me a lot of results; however, the "#204" is not part of them. Thus, my question is two-fold: (i) is there a way to get all the items labeled "a-list-item", and (ii) is there a way to get back just the ranking position without pulling all the other items? Option (ii) would be superior for my purposes.
Thanks for your time and efforts.
html code with target value
Public Sub social()
    Dim WBactive As Workbook, WSactive As Worksheet
    Dim IE As New InternetExplorer
    Dim url As String
    Set WBactive = ActiveWorkbook
    Set WSactive = WBactive.Sheets("Tabelle1")

    With IE
        .Visible = False
        url = "https://www.amazon.com/dp/B08X19ZCHS?ref=myi_title_dp"
        .navigate url
        While .Busy Or .readyState < 4: DoEvents: Wend

        'wait up to MAX_WAIT_SEC seconds for the target element to appear
        Dim ele As Object, t As Date
        Const MAX_WAIT_SEC As Long = 5
        t = Timer
        Do
            DoEvents
            On Error Resume Next
            Set ele = .document.querySelector(".rhpdm")
            On Error GoTo 0
            If Timer - t > MAX_WAIT_SEC Then Exit Do
        Loop While ele Is Nothing

        'print the text of every element with class a-list-item
        Dim aNodeList As Object, j As Long
        Set aNodeList = .document.querySelectorAll(".a-list-item")
        For j = 0 To aNodeList.Length - 1
            Debug.Print aNodeList.Item(j).innerText
        Next j

        Set aNodeList = Nothing: Set ele = Nothing
        .Quit 'close IE
    End With
End Sub

There is only one URL shown, and a CSS path to that node can be:
Debug.Print .document.querySelector("#detailBulletsWrapper_feature_div #detailBullets_feature_div + ul .a-list-item .a-list-item").innerText
The + is an adjacent sibling combinator, specifying a ul that follows, at the same level, an element with id detailBullets_feature_div. The spaces are descendant combinators, meaning elements on the right are descendants of elements on the left of the space. The . is a class selector for the class name of an element.
The HTML in your image appears to be slightly different from the provided URL, however, so your mileage with the above solution may vary. I have gone for positional matching on the assumption that other pages will have a similar layout.
Even if you were doing a loop over different pages, pages as per the provided url would have only that one ranking to pick up (if as shown in your image).
You don't need the overhead of the browser. The following shows you how to get both rankings on the page (not just the one you showed) using XHR:
Option Explicit
Public Sub PrintRankings()
    'Tools > References > Microsoft HTML Object Library
    Dim html As MSHTML.HTMLDocument, xhr As Object
    Set xhr = CreateObject("MSXML2.XMLHTTP")
    Set html = New MSHTML.HTMLDocument
    With xhr
        .Open "GET", "https://www.amazon.com/dp/B07YG6MTD3?ref=myi_title_dp&th=1", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
    End With
    Debug.Print html.querySelector("#detailBullets_feature_div + ul span > span").NextSibling.NodeValue
    Debug.Print html.querySelector("#detailBulletsWrapper_feature_div #detailBullets_feature_div + ul .a-list-item .a-list-item").innerText
End Sub


How do I pull the second td tag from a table using Excel-VBA

I'm trying to pull the second td tag, or the US Ten Year Treasury rate from https://www.bankrate.com/rates/interest-rates/federal-funds-rate.aspx using Excel VBA. Here's what I have so far:
Sub Ten_Year_Treasury()
    'Record the US Ten Year Treasury rate from https://www.bankrate.com/rates/interest-rates/federal-funds-rate.aspx
    Range("A2").ClearContents
    Dim ie As InternetExplorer
    Dim htmlEle As IHTMLElement
    Dim Element As Object
    Set ie = New InternetExplorer
    ie.Visible = False
    ie.navigate "https://www.bankrate.com/rates/interest-rates/federal-funds-rate.aspx"
    Application.Wait (Now + TimeValue("00:00:04"))
    Set Element = ie.document.getElementsByClassName("table-inline__caption")
    For Each htmlEle In Element
        With Sheets("10-Year Treasury")
            .Range("A2").Value = htmlEle.Children(0).innerText
        End With
    Next
    ie.Quit
    'Remove underline
    Range("A2").Font.Underline = False
    'Make font bold
    Range("A2").Font.Bold = True
End Sub
I know it has something to do with my "Element", and I've seen videos where they talk about using "children" or "sibling". Advice on how to fix this?
You are using the wrong class name and, in so doing, are selecting the caption rather than the table. You can use a CSS class selector combined with nth-of-type to get the 2nd td. I use one of the class values present in the table element.
.Range("A2").Value = ie.document.querySelector(".table-inline td:nth-of-type(2)").innerText
As that content is static, you can use a faster XHR request rather than the browser to retrieve the value. I show a variety of ways of then getting the node you want.
Option Explicit
Public Sub GetInterestRate()
    Dim xhr As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    'Required: VBE (Alt+F11) > Tools > References > Microsoft HTML Object Library ; Microsoft XML, v6.0 (your version may vary)
    Set xhr = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument
    With xhr
        .Open "GET", "https://www.bankrate.com/rates/interest-rates/federal-funds-rate.aspx", False
        .send
        html.body.innerHTML = .responseText
    End With
    ActiveSheet.Cells(1, 1) = html.querySelectorAll(".table-inline td")(1).innerText
    'Alternatives:
    'html.querySelector(".table-inline").rows(1).cells(1).innerText
    'html.querySelector(".table-inline").rows(1).children(1).innerText
    'html.querySelector(".table-inline td + td").innerText
    'html.querySelector(".table-inline td").nextSibling.innerText
End Sub
Read about:
css selectors
xhr
nextSibling
querySelector

Search a website using Excel VBA with Excel data, extract the active state in the flowchart of the search result, and map it into a column

I am hoping someone can help....
I have around 7,000 values in an Excel spreadsheet that I need to search on a website, then record the active state of the result flowchart from the website back into the spreadsheet. Since I am new to macros and web scraping, I reused code that automates web searches and modified the input IDs for the website from which I want to extract information (https://nacionalidade.justica.gov.pt/). I am a bit confused about how to apply an If condition to get the active state, as there are seven classes in the flowchart. Here is the flow chart.
Now that I have the access codes, each will be at a different stage; I only want to pick the state and put it in column E in front of the access code (currently doing this manually).
I am unclear how to extract that info, being new to this type of web data extraction - any help would be incredible!
Here is my code (I couldn't get it to work for the mentioned website past this line):
objIE.document.getElementById("btnPesquisa").Click
Code:
'start a new subroutine called SearchBot
Sub SearchBot()
    'dimension (declare or set aside memory for) our variables
    Dim objIE As InternetExplorer 'special object variable representing the IE browser
    Dim aEle As HTMLLinkElement 'special object variable for an <a> (link) element
    Dim y As Integer 'integer variable we'll use as a counter
    Dim result As String 'string variable that will hold our result link
    'initiate a new instance of Internet Explorer and assign it to objIE
    Set objIE = New InternetExplorer
    'make IE browser visible (False would allow IE to run in the background)
    objIE.Visible = True
    'navigate IE to this web page
    objIE.navigate "https://nacionalidade.justica.gov.pt/"
    'wait here a few seconds while the browser is busy
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    'in the search box put cell "D2" value
    objIE.document.getElementById("SenhaAcesso").Value = _
        Sheets("Guy Touti").Range("D2").Value
    'click the 'go' button
    objIE.document.getElementById("btnPesquisa").Click
    'wait again for the browser
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    'the first search result will go in row 2
    y = 2
    'for each <a> element in the collection of objects with class of 'result__a'...
    For Each aEle In objIE.document.getElementsByClassName("result__a")
        '...get the href link and print it to the sheet in col E, row y
        result = aEle
        Sheets("Guy Touti").Range("E" & y).Value = result
        '...get the text within the element and print it to the sheet in col D
        Sheets("Guy Touti").Range("D" & y).Value = aEle.innerText
        Debug.Print aEle.innerText
        'is it a yellowpages link?
        If InStr(result, "yellowpages.com") > 0 Or InStr(result, "yp.com") > 0 Then
            'make the result red
            Sheets("Guy Touti").Range("C" & y).Interior.ColorIndex = 3
            'place a 1 to the left
            Sheets("Guy Touti").Range("B" & y).Value = 1
        End If
        'increment our row counter, so the next result goes below
        y = y + 1
    Next
    'add up the yellowpages listings
    Sheets("Guy Touti").Range("B1").Value = _
        Application.WorksheetFunction.Sum(Sheets("Guy Touti").Range("B2:B100"))
    'close the browser
    objIE.Quit
End Sub
I did try this first, but after a while I started searching for a better way. Can you help?
You can replicate the POST XHR request the page makes to get the data, and use the class names to limit to nodes with either active1 or active3. Take the last node in that nodeList, extract the step number, and convert the colour via a lookup (if wanted). With 7,000 requests it might be considerate to add a delay of 1-2 seconds every 50 requests, or fewer. You can use i Mod 50 to determine this in the loop, and Application.Wait Now + TimeSerial(0, 0, 2).
Option Explicit
Public Sub GetStatus()
    Dim html As MSHTML.HTMLDocument, xhr As Object, colourLkup As Object
    Dim ws As Worksheet, senhas(), i As Long, results()
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    senhas = Application.Transpose(ws.Range("D2:D" & ws.Cells(ws.Rows.Count, "D").End(xlUp).Row))
    ReDim results(1 To UBound(senhas))
    Set colourLkup = CreateObject("Scripting.Dictionary")
    colourLkup.Add "active1", "green"
    colourLkup.Add "active3", "orange"
    Set html = New MSHTML.HTMLDocument
    Set xhr = CreateObject("MSXML2.XMLHTTP")
    For i = LBound(senhas) To UBound(senhas)
        If senhas(i) <> vbNullString Then
            With xhr
                .Open "POST", "https://nacionalidade.justica.gov.pt/Home/GetEstadoProcessoAjax", False
                .setRequestHeader "User-Agent", "Mozilla/5.0"
                .setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
                .send "SenhaAcesso=" & senhas(i)
                html.body.innerHTML = .responseText
            End With
            Dim nodes As Object, classinfo() As String
            Set nodes = html.querySelectorAll(".active1, .active3")
            classinfo = Split(nodes(nodes.Length - 1).className, Chr$(32))
            results(i) = Replace$(classinfo(1), "step", vbNullString) & "-" & colourLkup(classinfo(2))
        End If
        Set nodes = Nothing
    Next
    ws.Cells(2, 5).Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
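The delay suggested in the answer text is not shown in the code above; inside the loop it could be sketched as follows (the exact placement and timing are assumptions, not part of the original answer):

```vba
'Hypothetical throttle: after every 50th request, pause 2 seconds
'before continuing, per the i Mod 50 suggestion in the answer text.
If i Mod 50 = 0 Then Application.Wait Now + TimeSerial(0, 0, 2)
```

This would sit at the end of the For loop body, after the response has been processed.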

Extract hyperlink from website using VBA facing error

I am trying to extract all the hyperlinks which contain "http://www.bursamalaysia.com/market/listed-companies/company-announcements/" from the webpages I input.
At first the code ran well, but now I am facing a problem: I cannot extract the URL links that I need. They are simply missing every time I run the sub.
Link:http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All
Sub scrapeHyperlinks()
    Dim IE As InternetExplorer
    Dim html As HTMLDocument
    Dim ElementCol As Object
    Dim Link As Object
    Dim erow As Long
    Dim u As Long, k As Long
    Application.ScreenUpdating = False
    Set IE = New InternetExplorer
    For u = 1 To 50
        IE.Visible = False
        IE.navigate Cells(u, 2).Value
        Do While IE.readyState <> READYSTATE_COMPLETE
            Application.StatusBar = "Trying to go to website"
            DoEvents
        Loop
        Set html = IE.document
        Set ElementCol = html.getElementsByTagName("a")
        For Each Link In ElementCol
            erow = Worksheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Row
            Cells(erow, 1).Value = Link
            Cells(erow, 1).Columns.AutoFit
        Next
    Next u
    ActiveSheet.Range("$A$1:$A$152184").AutoFilter Field:=1, Criteria1:="http://www.bursamalaysia.com/market/listed-companies/company-announcements/???????", Operator:=xlAnd
    For k = 1 To [A65536].End(xlUp).Row
        If Rows(k).Hidden = True Then
            Rows(k).EntireRow.Delete
            k = k - 1
        End If
    Next k
    Set IE = Nothing
    Application.StatusBar = ""
    Application.ScreenUpdating = True
End Sub
Just to get the qualifying hrefs that you mention from the given URL, I would use the following. It uses a CSS selector combination to target the URLs of interest on the specified page.
The CSS selector combination is
#bm_ajax_container [href^='/market/listed-companies/company-announcements/']
This is a descendant selector looking for elements with an href attribute whose value starts with /market/listed-companies/company-announcements/ and that have an ancestor element with id bm_ajax_container. That ancestor element is the ajax container div. The "#" is an id selector, the "[]" indicates an attribute selector, and the "^" means "starts with".
Example of container div and first matching href:
As more than one element is to be matched, the CSS selector combination is applied via the querySelectorAll method. This returns a nodeList whose .Length can be traversed to access individual items by index.
The full set of qualifying links are written out to the worksheet.
Example CSS query results from page using selector (sample):
VBA:
Option Explicit
Public Sub GetInfo()
    Dim IE As New InternetExplorer
    Application.ScreenUpdating = False
    With IE
        .Visible = True
        .navigate "http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All"
        While .Busy Or .readyState < 4: DoEvents: Wend
        Dim links As Object, i As Long
        Set links = .document.querySelectorAll("#bm_ajax_container [href^='/market/listed-companies/company-announcements/']")
        For i = 0 To links.Length - 1
            With ThisWorkbook.Worksheets("Sheet1")
                .Cells(i + 1, 1) = links.item(i)
            End With
        Next i
        .Quit
    End With
    Application.ScreenUpdating = True
End Sub

Scraping nested tags with VBA

I always have problems when web scraping with VBA if I find nested tags like those at the link http://forebet.com. I can scrape the link data from the menu on the left, but I go wrong when I get to the nested championships such as England and Spain.
Sub championshipforebet()
    Dim objIE As Object
    Dim itemEle As Object
    Dim itemEle1 As Object
    Dim ele1 As Object
    Dim away As Object
    Dim desc As String, pt1 As String, pt2 As String, price As String
    Dim i As Integer
    Cells.Select
    Selection.ClearContents
    Selection.NumberFormat = "#"
    Set objIE = CreateObject("internetexplorer.application")
    objIE.Visible = True
    objIE.navigate "https://www.forebet.com/it/"
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    Set itemEle = objIE.document.getElementsByClassName("moduletable_foremenu")(1)
    Set itemEle1 = itemEle.getElementsByClassName("tree_foremenu")
    i = 1
    For Each ele1 In itemEle1(0).getElementsByTagName("li")
        i = i + 1
        Cells(i, 1) = ele1(0).getElementsByClassName("mainlevel_foremenu").href
    Next ele1
End Sub
Although there is no remarkable difference between the two answers, I decided to post mine as I had already created one. The For loop part and the Split function might be helpful for future readers. As QHarr has already described .querySelector(), I won't repeat the same.
This is the code you can try as well:
Sub GrabLinks()
    Const Baseurl$ = "https://www.forebet.com"
    Dim S$, I&
    With New XMLHTTP60
        .Open "GET", Baseurl & "/", False
        .send
        S = .responseText
    End With
    With New HTMLDocument
        .body.innerHTML = S
        With .querySelectorAll(".mainlevel_foremenu,.sublevel_foremenu")
            For I = 0 To .Length - 1
                Cells(I + 1, 1) = Baseurl & Split(.Item(I).getAttribute("href"), "about:")(1)
            Next I
        End With
    End With
End Sub
References to add (VBE > Tools > References):
Microsoft XML, v6.0
Microsoft HTML Object Library
You can use css selectors including the OR operator ",". Also, use XMLHTTPRequest (XHR) to retrieve the data far more quickly than by opening IE.
The links are in one of two classes:
The top level have class:
mainlevel_foremenu
The nested have class:
sublevel_foremenu
You can use a CSS selector combination to get all elements with either of these two classes, combining them with the OR operator.
a.mainlevel_foremenu,a.sublevel_foremenu
The a means an a tag (element selector) and the . is a class selector: elements with an a tag that have class name mainlevel_foremenu or (",") sublevel_foremenu. The a is not actually required in this instance, so you could shorten it to .mainlevel_foremenu, .sublevel_foremenu.
The CSS selector returns more than one item so .querySelectorAll method of document is used to return a nodeList. The length of the nodeList is then iterated to access individual links by index.
The links are relative, so the Replace function is used to remove the "about:" prefix, and the base path is prefixed to the link address.
If we look at the first switch on the page:
The combined CSS selector results show we are also getting the nested level (note I have cut out some of the intermediary results).
VBA:
Option Explicit
Public Sub GetInfo()
    Application.ScreenUpdating = False
    Dim sResponse As String, i As Long, html As New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.forebet.com/", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
    Dim links As Object
    With html
        .body.innerHTML = sResponse
        Set links = .querySelectorAll("a.mainlevel_foremenu,a.sublevel_foremenu")
    End With
    For i = 0 To links.Length - 1
        Cells(i + 1, 1) = "https://www.forebet.com" & Replace$(links(i).href, "about:", vbNullString)
    Next i
    Application.ScreenUpdating = True
End Sub
Sample results:
References (VBE > Tools > References):
Microsoft HTML Object Library

Cycling Through List of URLs Using Excel VBA

I am much more familiar with Excel now, but one thing still baffles me: how to cycle through URLs in a loop. My current conundrum is that I have a list of package URLs and need to obtain the status of each package from each page's HTML. What I currently have to cycle through the list is:
Sub TrackingDeliveryStatusResults()
    Dim IE As Object
    Dim URL As Range
    Dim wb1 As Workbook, ws1 As Worksheet
    Dim filterRange As Range
    Dim copyRange As Range
    Dim lastRow As Long
    Set wb1 = Application.Workbooks.Open("\\S51\******\Folders\******\TrackingDeliveryStatus.xls")
    Set ws1 = wb1.Worksheets("TrackingDeliveryStatusResults")
    Set IE = New InternetExplorer
    With IE
        .Visible = True
        For Each URL In Range("C2:C & lastRow")
            .Navigate URL.Value
            While .Busy Or .ReadyState <> 4: DoEvents: Wend
            MsgBox .Document.body.innerText
        Next
    End With
End Sub
And the list of URLs
My goal here is:
Cycle through each URL (inserts URL in IE and keeps going without opening new tabs)
Obtain the status of the item for each URL from the HTML element
FedEx: Delivered (td class="status")
UPS: Delivered (id="tt_spStatus")
USPS: Arrived at USPS Facility (class="info-text first")
Finish the loop and save as a csv if at all possible (I've already done that, so I'm just posting the code portion I'm having a problem with).
My understanding is that I have to code a different If statement for each URL, since they all have different HTML tags for their delivery status. Loops are simple, but looping through webpages is new to me. The code throws errors no matter what changes I make.
The IE object opens up but then Excel hits an error and the code stops running.
OK, I'll start with the proper syntax to get your code going, and I will edit this answer with further code.
Sub Sample()
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Dim wb As Workbook, wsSheet As Worksheet, Rows As Long, links As Variant, IE As Object, link As Variant
    Set wb = ThisWorkbook
    Set wsSheet = wb.Sheets("Sheet1")
    Set IE = New InternetExplorer
    Rows = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row
    links = wsSheet.Range("A1:A" & Rows)
    With IE
        .Visible = True
        For Each link In links
            .navigate link
            While .Busy Or .ReadyState <> 4: DoEvents: Wend
            MsgBox .Document.body.innerText
        Next link
    End With
    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True
    Application.EnableEvents = True
End Sub
This will get you looping. I think you had some general syntax issues, which you can see addressed in my code. In order to loop through with For Each, the link has to be of type Object or Variant; links I set to Variant, assuming it will default to a String.
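For the per-carrier If statements the question asks about, one possible sketch is a helper that branches on the URL. The selectors are those quoted in the question (td class="status", id="tt_spStatus", class="info-text first") and are assumptions that should be verified against the live pages:

```vba
'Hedged sketch: return the delivery status from a loaded page by branching
'on the carrier found in the URL. Selector strings are taken from the
'question and may need adjusting for the live pages.
Public Function DeliveryStatus(doc As Object, ByVal url As String) As String
    If InStr(1, url, "fedex", vbTextCompare) > 0 Then
        DeliveryStatus = doc.querySelector("td.status").innerText
    ElseIf InStr(1, url, "ups", vbTextCompare) > 0 Then
        DeliveryStatus = doc.getElementById("tt_spStatus").innerText
    ElseIf InStr(1, url, "usps", vbTextCompare) > 0 Then
        DeliveryStatus = doc.querySelector(".info-text.first").innerText
    End If
End Function
```

Inside the For Each loop you could then call DeliveryStatus(.Document, link) instead of the MsgBox line.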