Identify NextSibling in XMLHTTP response - html

I am still learning about NextSibling, and I am using XMLHTTP in Excel VBA.
Here's the HTML for the element:
<ul class="list-unstyled list-specification">
<li><span>ID</span> <span class="text-info">22928</span></li>
<li><span>Category</span> <span class="text-info">Mechanical</span></li>
<li><span>Discipline</span> <span class="text-info">Mechanical </span></li>
<li><span>Commodity</span> <span class="text-info">Pipe</span></li>
<li><span>Sub commodity</span> <span class="text-info">12 In Pipe </span></li>
<li><span>UOM</span> <span class="text-info">EA</span></li>
<li><span>Available quantity</span> <span class="text-info">30</span></li>
<li><span>Age</span> <span class="text-info">8</span></li>
</ul>
I used this line to select the spans inside each li (list item) so that I can identify the header of each entry:
Set post = html.querySelectorAll(".list-specification li span")
Then I looped over them like this:
For j = 0 To post.Length - 1
If post.Item(j).innerText = "ID" Then
Debug.Print post.Item(j).NextSibling.innerText
End If
Next j
I got an error when trying to use NextSibling, and I am stuck on it. Can you guide me?
For example, ID is the first header in the list, and I would like to get its value with this approach.
I also got an error when trying nextElementSibling:
Sub Test()
Dim html As New HTMLDocument, post As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "Get", "C:\Sample.html", False
.send
html.body.innerHTML = .responseText
End With
Set post = html.querySelectorAll(".list-specification li span")
For i = 0 To post.Length - 1
If post.Item(i).innerText = "ID" Then
MsgBox post.Item(i).nextElementSibling.innerText: Exit For
End If
Next i
End Sub

Chain a second NextSibling and you should find it working:
Set post = Html.querySelectorAll(".list-specification li span")
For j = 0 To post.Length - 1
If post.Item(j).innerText = "ID" Then
MsgBox post.Item(j).NextSibling.NextSibling.innerText
Exit For
End If
Next j

I was expecting the correct property to be nextElementSibling, but it seems VBA does not implement it.
The NonDocumentTypeChildNode.nextElementSibling read-only property
returns the element immediately following the specified one in its
parent's children list, or null if the specified element is the last
one in the list.
You can, however, more simply take the next index in post, i.e. post.item(1). You are collecting both headers and values in the same nodeList, so you can use the odd/even distinction to separate headers from values.
You can see this if you run the following in console:
post = document.querySelectorAll(".list-specification li span");
var res = ''; for (let [i] of Object.entries(post)) {res += post.item(`${i}`).innerText + ' '};console.log(res);
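The odd/even idea can be sketched in VBA as follows. This is only a minimal sketch: it assumes html is the HTMLDocument already populated in the question and that every li contains exactly two spans (a header followed by a value). The procedure name PrintSpecPairs is hypothetical.

```vba
' Hypothetical helper: prints each header/value pair from the specification list.
' Assumes every <li> holds exactly two <span> elements, header first.
Public Sub PrintSpecPairs(ByVal html As Object)
    Dim post As Object, i As Long
    Set post = html.querySelectorAll(".list-specification li span")
    ' Even indexes (0, 2, 4, ...) are headers; the following odd index is the value.
    For i = 0 To post.Length - 2 Step 2
        Debug.Print post.Item(i).innerText & ": " & post.Item(i + 1).innerText
    Next i
End Sub
```

If a list item ever carried a different number of spans, the pairing would drift, so this only suits markup as regular as the sample.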
Spans are inline containers, and you can see from the HTML that there is a space between the spans. That space belongs to the parent li and becomes a child text node. This is why your nextSibling hits a text node and errors when you attempt the .innerText accessor. You would want a text node property such as .nodeValue (if you were at the right node).
You can step through in the console and see the different properties in action.
As nextElementSibling is not implemented in VBA, you would need to chain nextSibling, as per @Sim's answer, if you want to use nextSibling for this particular navigation. However, note that testing nodeType first would avoid throwing an error, as you could then apply the appropriate accessor.
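The nodeType test could look something like the sketch below. The function name NextElement is hypothetical, and the code assumes the usual DOM constants (1 for an element node, 3 for a text node) and that nextSibling yields Nothing once the last sibling has been passed.

```vba
' Hypothetical helper: walks forward from a node and returns the first
' sibling that is an element, skipping whitespace text nodes on the way.
Public Function NextElement(ByVal node As Object) As Object
    Dim candidate As Object
    Set NextElement = Nothing
    Set candidate = node.NextSibling
    Do While Not candidate Is Nothing
        If candidate.NodeType = 1 Then      ' 1 = element node, 3 = text node
            Set NextElement = candidate
            Exit Function
        End If
        Set candidate = candidate.NextSibling
    Loop
End Function
```

In the loop from the question it would be used as Set valueSpan = NextElement(post.Item(j)), followed by an Is Nothing check before reading valueSpan.innerText.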

Related

Data extraction from HTML

I am trying to pull data from html text.
I am having an issue with the extraction code.
Normally I deal with div or Li, this html seems to be a bit more complicated.
It is using Div id, ul Class and Span Class.
What do I put in for Class or Li extraction?
For Each li In HTMLdoc.getElementsByTagName("li")
If li.getAttribute("class") = "a-link-normal" Then
Set link = li.getElementsByTagName("a")(0)
.Cells(i, 1).Value = link.getAttribute("href")
i = i + 1
End If
Next li
I have also posted this here.
The new code from PEH seems to work.
However I am getting an error message.
With the line If li.getAttribute("class") = "a-link-normal" Then you check whether the current li has the class attribute a-link-normal, like <li class="a-link-normal">. But it is actually the link element that has the class a-link-normal, not the list element. So I think it should be something like this:
For Each li In HTMLdoc.getElementsByTagName("li")
Set link = li.getElementsByTagName("a")(0)
If link.getAttribute("class") = "a-link-normal" Then
.Cells(i, 1).Value = link.getAttribute("href")
i = i + 1
End If
Next li
You might come across <li> elements that have no links <a> inside, so guard the lookup:
For Each li In HTMLdoc.getElementsByTagName("li")
Set link = Nothing
On Error Resume Next
Set link = li.getElementsByTagName("a")(0)
On Error Goto 0
If Not link Is Nothing Then
If link.getAttribute("class") = "a-link-normal" Then
.Cells(i, 1).Value = link.getAttribute("href")
i = i + 1
End If
End If
Next li
It is simpler and faster to just use the class directly. The CSS class selector "." shown below is combined with the href attribute selector [href], so you only retrieve elements that match that class and have an href attribute:
Dim items As Object
Set items = HTMLdoc.querySelectorAll(".a-link-normal[href]")
For i = 0 To items.Length - 1
.Cells(i + 1, 1).Value = items.item(i).href
Next i

VBA: Access element inside a div loaded with angularjs, when ID and Classname changes randomly

I am a beginner in VBA, and I am building an application that fills a website that uses AngularJS with data from an XLS workbook.
The website contains a table to fill out, and the last four characters of each cell's ID and class name change randomly with every refresh. (For example, in the source below, the 02XC part of the ID will change to something different, but the prefix will remain the same.) How can I manipulate this table?
Source code of the website:
<div tabindex="-1" class="ui-grid-cell ng-scope ui-grid-coluiGrid-02XC cell editable" id="1554799061475-1-uiGrid-02XC-cell" role="gridcell" aria-selected="false" ng-class="{ 'ui-grid-row-header-cell': col.isRowHeader }" ng-repeat="(colRenderIndex, col) in colContainer.renderedColumns track by col.uid" ui-grid-cell="" ui-grid-one-bind-id-grid="rowRenderIndex + '-' + col.uid + '-cell'">[Here is the web table that i want to upload values][1]</div>
My code:
Set objHtml = New HTMLDocument
Set objHtml = internetExplorer.document
Set modric = objHtml.getElementsByClassName("ui-grid-cell ng-scope ui-grid-coluiGrid-02XC cell editable")
If modric.Length <> 0 Then
For q = 0 To modric.Length - 1
modric(q).Click
modric(q).innerText = Workbooks("remplir").Worksheets("sheet1").Range("F1").Offset(n + 1, 0).Value
Next q
End If
As I said, on the next refresh 02XC will change to four other random characters.
I want to get the ID or class name after each refresh so that I can manipulate the table.
Any other approach that helps me manage the table is also welcome.
Sorry, my English is poor. I hope you understand me :)
You can use an attribute = value selector with the starts-with (^) operator:
[id^='1554799061475-1-uiGrid-']
I am not sure whether you want a single match or multiple matches.
Single match:
Dim ele As Object
Set ele = objHtml.querySelector("[id^='1554799061475-1-uiGrid-']")
This assumes the id starts with 1554799061475-1-uiGrid- and only the end bit changes.
You can use querySelectorAll to return a nodeList if more than one match is possible.
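For the multiple-match case, a minimal sketch might look like this. It assumes objHtml is the document from the question and that the id prefix shown above stays stable between refreshes; the procedure name ListGridCells is hypothetical.

```vba
' Hypothetical helper: visits every element whose id starts with the assumed prefix.
Public Sub ListGridCells(ByVal objHtml As Object)
    Dim cells As Object, k As Long
    Set cells = objHtml.querySelectorAll("[id^='1554799061475-1-uiGrid-']")
    For k = 0 To cells.Length - 1
        ' Print the full id so the random suffix of each match is visible.
        Debug.Print cells.Item(k).ID
    Next k
End Sub
```

Inside the loop, cells.Item(k) can be clicked or written to in the same way as modric(q) in the question.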

Selecting HTML dropdown with ReactJS in VBA

I'm trying to use VBA to select a dropdown item from an HTML website that uses ReactJS. For this example, we can use the following website:
https://jedwatson.github.io/react-select/
<span class="Select-value-label" role="option" aria-selected="true" id="react-select-2--value-item">New South Wales</span>
If an HTML page lists all the select options on the dropdown, I can easily set the elementID to one of the dropdown values.
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.Navigate "about:blank"
'wait for page load
ieWaitForURL
.Navigate "https://jedwatson.github.io/react-select/"
ie.Document.getelementbyid("react-select-2--value-item").Value = "Victoria"
End With
But the HTML of the ReactJS website doesn't list all the options of the dropdown, and the value of the innertext changes as I make a different selection.
Is there a way to select from a ReactJS dropdown using VBA if all the options aren't listed in the HTML?
It was actually a lot easier than I thought. The following uses SeleniumBasic. Install SeleniumBasic, ensure the latest chromedriver.exe is in the Selenium folder, then add a reference to the Selenium Type Library via VBE > Tools > References.
I show grabbing all the option values into a dictionary. Also, selecting an item from the dropdown.
The key here is that the option menu is not a traditional select element with child options, but uses React Select. The range of possible values is pulled via Ajax from this script.
Once the dropdown is clicked, the list of available values can be collected. At the end I also show how you could retrieve the possible values directly from that script using Python; I am happy to translate that to VBA if you are interested.
If you want to go down the IE route you can use the same approach, but you need to trigger the events that open the dropdown. These are also detailed in the js script, I think.
Option Explicit
Public Sub MakeSelection()
Dim d As WebDriver, i As Long, dropDownOptions As Object
Const URL = "https://jedwatson.github.io/react-select/"
Set d = New ChromeDriver
Set dropDownOptions = CreateObject("Scripting.Dictionary")
With d
.Start "Chrome"
.get URL
.FindElementByCss("button:nth-of-type(2)").Click
.FindElementByCss(".Select-arrow-zone").ClickAndHold
Dim item As Object
For Each item In .FindElementsByCss(".Select-menu div") 'put list of options in dictionary
dropDownOptions(item.Text) = i
i = i + 1
Next
For Each item In .FindElementsByCss(".Select-menu div") 'loop to select an option
If item.Text = "Victoria" Then 'If item.Text = dropDownOptions.item(3) etc....
item.Click
Exit For
End If
Next
Stop
.Quit
End With
End Sub
Python script to parse possible dropdown values from json:
This shows the 3 different element parts which are updated through the dropdown (labels, classes and values)
import requests
import re
import json
r = requests.get('https://jedwatson.github.io/react-select/app.js')
s = str(r.content)
p1 = re.compile(r't\.AU=(.*)')
p2 = re.compile(r'.+?(?=,\[\d+\])')
data1 = re.findall(p1, s)[0]
data2 = re.findall(p2, data1)[0].replace(',disabled:!0','')
replacements = ['value:','label:','className:']
for item in replacements:
    data2 = re.sub(item, '"' + item[:-1] + '":' , data2)
finals = data2.split(',t.US=')
finalAus = json.loads(finals[0])
# finalUs = json.loads(finals[1])
d = {}
i = 0
for item in finalAus:
    d[item['label']] = i
    # item['label']
    # item['value']
    # item['className']
    i += 1
print(d)

Cannot click a link using the click() event

I am trying to click a link within an unordered list. The unordered list is within frames and I am not exactly sure of the frame name, so I used a recursive search (code obtained from this forum):
Dim elem2 As Object
Set elem2 = FindInputByName(ie.document, "0/2")
If Not elem2 Is Nothing Then
elem2.Click 'THIS IS NOT WORKING
End If
Function FindInputByName(document As Object, name As String) As Object
Dim i As Integer, subdocument As Object, elem As Variant
Set FindInputByName = Nothing
For i = 0 To document.frames.Length - 1
Set subdocument = document.frames.Item(i).document
Set FindInputByName = FindInputByName(subdocument, name)
If Not FindInputByName Is Nothing Then Exit Function
Next i
For Each elem In document.getElementsByTagName("a")
If elem.ID = name Then
Set FindInputByName = elem
Exit Function
End If
Next elem
End Function
With this code, no click is carried out.
Instead of Click, I tried elem2.Focus followed by elem2.FireEvent ("tree[i].onclick"); the link then gets selected, but again there is no click.
The HTML snippet is:
<a id="0/2" style="padding-left: 13px;" href="#">GENERAL INFORMATION</a>
But the element has a click event, 'tree[i].onclick'. So what should I do to click the link?
Thanks in advance.
After adding Application.Wait, the click is executed.
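The post does not say where the wait was added or how long it was, so the following is only a sketch under assumptions: it pauses before the lookup so the frame content has time to load, uses an arbitrary two-second delay, and relies on Excel's Application.Wait. The procedure name ClickAfterWait is hypothetical; FindInputByName is the function from the question.

```vba
' Hypothetical wrapper: waits, then looks up and clicks the link.
' The two-second delay is an arbitrary assumption; tune it for the page.
Public Sub ClickAfterWait(ByVal ie As Object)
    Dim elem2 As Object
    Application.Wait Now + TimeSerial(0, 0, 2)
    Set elem2 = FindInputByName(ie.document, "0/2")
    If Not elem2 Is Nothing Then
        elem2.Click
    End If
End Sub
```

A fixed delay is fragile on slow connections; polling until the element is found would likely be more robust, but that goes beyond what the post describes.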

Retrieve attributes and span using HTMLAgilityPack library

In this piece of HTML code:
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
<img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
</div>
<div class="release">
<h3>Wolf Eyes</h3>
<h4>
Lower Demos
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
Rock
Pop
</div>
</div>
I know how to parse it in other ways, but I would like to retrieve this Info using HTMLAgilityPack library:
Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year : 2013
Genres: Rock, Pop
URL : http://www.mp3crank.com/wolf-eyes/lower-demos-121866
Which are these html lines:
Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year : <span>2013</span>
Genre1: Rock
Genre2: Pop
URL : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"
This is what I'm trying, but I always get an "object reference not set" exception when trying to select a single node.
Sorry, I'm new to HTML. I've tried to follow the steps of this question: HtmlAgilityPack basic how to get title and link?
Public Class Form1
Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing
Private Title As String = String.Empty
Private Cover As String = String.Empty
Private Genres As String() = {String.Empty}
Private Year As Integer = -0
Private URL as String = String.Empty
Private Sub Test() Handles MyBase.Shown
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
Genres = ¿select multiple nodes?
URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value
Next
End Sub
End Class
Your mistake here is trying to access an attribute of a child node from the node you've found.
When you call node.SelectSingleNode("//div[@class='release']") you get the correct div returned, but calling .Attributes returns just the attributes of the div tag itself, not those of any inner HTML elements.
It's possible to write XPath queries that select the sub-node, e.g. //div[@class='release']/a - see http://www.w3schools.com/xpath/xpath_syntax.asp for more information on XPath. Although the examples are for XML, most of the principles should apply to an HTML document.
Another approach is to use further XPATH calls on the node you've found. I've amended your code to make it work using this approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
'Assumes we find the node and it has an a-tag
Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value
Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value
Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)
Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
Dim genreLinks = genreNode.SelectNodes(".//a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Next
Note that in this code we're assuming the document is correctly formed and that each node/element/attribute exists and is correct. You might want to add a lot of error checking to this, e.g. If someNode Is Nothing Then ....
Edit: I've amended the code above slightly, to ensure each .SelectSingleNode uses the ".//" prefix - this ensures it works if there are several "item" nodes, otherwise it selects the first match from the document not the current node.
If you want a shorter XPATH solution, here is the same code using that approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value
Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)
Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Console.WriteLine()
Next
You were not that far from the solution. Two important notes:
// is a recursive call. It can have some heavy performance impact, and also it may select nodes you don't want, so I suggest you only use it when the hierarchy is deep or complex or variable, and you don't want to specify the whole path.
There is a useful helper method on HtmlNode named GetAttributeValue, which will get you an attribute's value even if the attribute does not exist (you need to specify the default value).
Here is a sample that seems to work:
' select the base/parent DIV (here we use a discriminant CLASS attribute)
' all select calls below will use this DIV element as a starting point
Dim node As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//div[@class='item']")
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Title :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("title", CStr(Nothing))))
' get to the IMG tag which is a child or grand child (//) of a 'thumb' DIV
Console.WriteLine(("Cover :" & node.SelectSingleNode("div[@class='thumb']//img").GetAttributeValue("src", CStr(Nothing))))
' get to the SPAN tag which is a child or grand child (//) of a 'release-year' DIV
Console.WriteLine(("Year :" & node.SelectSingleNode("div[@class='release-year']//span").InnerText))
' get all A elements which are child or grand child(//) of a 'genre' DIV
Dim nodes As HtmlNodeCollection = node.SelectNodes("div[@class='genre']//a")
Dim i As Integer
For i = 0 To nodes.Count - 1
Console.WriteLine(String.Concat(New Object() { "Genre", (i + 1), ":", nodes.Item(i).InnerText }))
Next i
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Url :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("href", CStr(Nothing))))