Data extraction from HTML - html

I am trying to pull data from HTML text.
I am having an issue with the extraction code.
Normally I deal with div or li elements, but this HTML seems a bit more complicated.
It uses a div id, a ul class and a span class.
What do I put in for the class or li extraction?
For Each li In HTMLdoc.getElementsByTagName("li")
If li.getAttribute("class") = "a-link-normal" Then
Set link = li.getElementsByTagName("a")(0)
.Cells(i, 1).Value = link.getAttribute("href")
i = i + 1
End If
Next li
I have also posted this here.
The new code from PEH seems to work.
However I am getting an error message.

With the line If li.getAttribute("class") = "a-link-normal" Then you check whether the current li has the class a-link-normal, like <li class="a-link-normal">. But a-link-normal is actually the class of a link (<a>) element, not of a list element. So I think it should look something like this:
For Each li In HTMLdoc.getElementsByTagName("li")
Set link = li.getElementsByTagName("a")(0)
If link.getAttribute("class") = "a-link-normal" Then
.Cells(i, 1).Value = link.getAttribute("href")
i = i + 1
End If
Next li
You might come across <li> elements that have no <a> links inside, so guard against that:
For Each li In HTMLdoc.getElementsByTagName("li")
Set link = Nothing
On Error Resume Next
Set link = li.getElementsByTagName("a")(0)
On Error Goto 0
If Not link Is Nothing Then
If link.getAttribute("class") = "a-link-normal" Then
.Cells(i, 1).Value = link.getAttribute("href")
i = i + 1
End If
End If
Next li
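The same guard can be written without On Error by checking how many <a> elements each <li> contains. The following is only a sketch: the function name CollectNormalLinks is hypothetical, it assumes a reference to the Microsoft HTML Object Library (early binding), and it compares className instead of getAttribute("class").

```vba
' Assumption: the VBA project references "Microsoft HTML Object Library".
' Hypothetical helper: returns the href of the first <a> inside each <li>,
' but only when that <a> has the class "a-link-normal".
Public Function CollectNormalLinks(ByVal doc As HTMLDocument) As Collection
    Dim result As New Collection
    Dim li As Object, anchors As Object, link As Object

    For Each li In doc.getElementsByTagName("li")
        Set anchors = li.getElementsByTagName("a")
        If anchors.Length > 0 Then                 ' skip <li> elements without links
            Set link = anchors.Item(0)
            If link.className = "a-link-normal" Then
                result.Add link.getAttribute("href")
            End If
        End If
    Next li

    Set CollectNormalLinks = result
End Function
```

The caller can then loop over the returned Collection and write each entry to a cell, which keeps the DOM walking separate from the worksheet code.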

It is simpler and faster to use the class directly. The CSS class selector "." shown below is combined with the attribute selector [href], so you only retrieve elements that have that class and an href attribute.
Dim items As Object
Set items = HTMLdoc.querySelectorAll(".a-link-normal[href]")
For i = 0 To items.Length - 1
.Cells(i + 1, 1).Value = items.item(i).href
Next i

Related

Identify NextSibling in XMLHTTP response

I am still trying to learn about NextSibling and I am using XMLHTTP in excel VBA.
Here's the HTML for the element
<ul class="list-unstyled list-specification">
<li><span>ID</span> <span class="text-info">22928</span></li>
<li><span>Category</span> <span class="text-info">Mechanical</span></li>
<li><span>Discipline</span> <span class="text-info">Mechanical </span></li>
<li><span>Commodity</span> <span class="text-info">Pipe</span></li>
<li><span>Sub commodity</span> <span class="text-info">12 In Pipe </span></li>
<li><span>UOM</span> <span class="text-info">EA</span></li>
<li><span>Available quantity</span> <span class="text-info">30</span></li>
<li><span>Age</span> <span class="text-info">8</span></li>
</ul>
I used this line to select the spans inside each li (list item), so that I can identify the header of each entry:
Set post = html.querySelectorAll(".list-specification li span")
Then I used a loop like this:
For j = 0 To post.Length - 1
If post.Item(j).innerText = "ID" Then
Debug.Print post.Item(j).NextSibling.innerText
End If
Next j
I got an error when trying to use NextSibling, and I feel stuck. Can you guide me?
For example, ID is the first header in the list, and I would like to get its value with this approach.
I also got an error when trying nextElementSibling:
Sub Test()
Dim html As New HTMLDocument, post As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "Get", "C:\Sample.html", False
.send
html.body.innerHTML = .responseText
End With
Set post = html.querySelectorAll(".list-specification li span")
For i = 0 To post.Length - 1
If post.Item(i).innerText = "ID" Then
MsgBox post.Item(i).nextElementSibling.innerText: Exit For
End If
Next i
End Sub
Try doing another NextSibling and then you should find it working:
Set post = Html.querySelectorAll(".list-specification li span")
For j = 0 To post.Length - 1
If post.Item(j).innerText = "ID" Then
MsgBox post.Item(j).NextSibling.NextSibling.innerText
Exit For
End If
Next j
I was expecting nextElementSibling to be the correct property to use, but it seems VBA does not implement it.
The NonDocumentTypeChildNode.nextElementSibling read-only property
returns the element immediately following the specified one in its
parent's children list, or null if the specified element is the last
one in the list.
You can however, more correctly, simply take the next index in post i.e. post.item(1). You are collecting both headers and values in the same nodeList so you can use odd/even distinction to separate headers from values.
You can see this if you run the following in console:
post = document.querySelectorAll(".list-specification li span");
var res = ''; for (let [i] of Object.entries(post)) {res += post.item(`${i}`).innerText + ' '};console.log(res);
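The odd/even pairing described above could be sketched in VBA roughly as follows. The function name SpecValue is hypothetical, a reference to the Microsoft HTML Object Library is assumed, and the sketch also assumes every li holds exactly two spans (header first, value second).

```vba
' Assumption: the VBA project references "Microsoft HTML Object Library".
' Hypothetical helper: looks up the value span that follows a given header span.
Public Function SpecValue(ByVal doc As HTMLDocument, ByVal header As String) As String
    Dim post As Object, i As Long

    Set post = doc.querySelectorAll(".list-specification li span")

    ' Even indexes (0, 2, 4, ...) hold the headers;
    ' the odd index right after each one holds its value.
    For i = 0 To post.Length - 2 Step 2
        If Trim$(post.Item(i).innerText) = header Then
            SpecValue = Trim$(post.Item(i + 1).innerText)
            Exit Function
        End If
    Next i
    ' Falls through with an empty string when the header is not found.
End Function
```

With the HTML from the question loaded into doc, SpecValue(doc, "ID") would be expected to return the text of the span next to the ID header.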
Spans are inline containers, and you can see from the HTML that there is a space between the spans. That space belongs to the parent li and becomes a child text node. This is why your nextSibling hits a text node and errors when you try the .innerText accessor. You would want a text node property such as .nodeValue (if you were at the right node).
You can step through, in the console, and see the different properties in action:
As nextElementSibling is not implemented in VBA you would need to chain nextSibling, as per @Sim's answer, if you want to explore nextSibling to solve this particular navigation. However, note that a test of nodeType would avoid throwing an error as you could then apply the appropriate accessor.
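A nodeType test along those lines might look like the sketch below. NextElement is a hypothetical helper name, not part of MSHTML. It walks nextSibling until it reaches an element node (nodeType 1) and skips text nodes (nodeType 3) such as the whitespace between the two spans.

```vba
' Hypothetical stand-in for nextElementSibling:
' returns the next sibling that is an element node, or Nothing if there is none.
Public Function NextElement(ByVal node As Object) As Object
    Dim sib As Object

    Set sib = node.NextSibling
    Do While Not sib Is Nothing
        If sib.NodeType = 1 Then Exit Do    ' 1 = element node, 3 = text node
        Set sib = sib.NextSibling
    Loop

    Set NextElement = sib
End Function
```

In the loop from the question this would be used as Set valueSpan = NextElement(post.Item(j)), followed by a check that valueSpan is not Nothing before reading its innerText.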

How to copy HTML span attribute to excel by vba

I am trying to copy a span value from a web page to an Excel sheet with Excel VBA. I tried copying from the hidden input, which holds the same ID as the span, but nothing is copied (null value). This is the HTML to copy from:
<td nowrap class="iddisplay"><span style="font-size: 14px" tabIndex="0">IAR/19326/8JM3Z</span>
<input type="hidden" name="transactionId" id="transactionId" value="IAR193268JM3Z"></td>
So I want to copy the value IAR/19326/8JM3Z from the span, or IAR193268JM3Z from the transactionId value.
The Excel VBA code that I use:
Set str_val12 = IE.document.getElementById("transactionId")
clip.SetText str_val12.innerText
'clip.SetText str_val12.value
clip.PutInClipboard
test.Cells(i + 2, 6).Select
test.Cells(i + 2, 6).PasteSpecial "Unicode Text"
Thanks
If the css class iddisplay is only used in that specific td tag:
Cells(i + 2, 6) = IE.document.getElementsByClassName("iddisplay")(0).FirstChild.innertext
I haven't tested this but something along the lines of:
Dim element As Object
Set element = IE.document.getElementById("transactionId")
If Not element Is Nothing Then
If element.Value = "IAR193268JM3Z" Then
test.Cells(i + 2, 6).Value = element.Value
End If
End If
Let me know if this throws an error and on which line. I am unsure if .Value is a way to check the value="IAR193268JM3Z" embedded within the input tag, but let's see.

Compare index of 2 elements in a collection

Issue: I am having trouble figuring out a way to select the elements in my HTMLDocument that are below a certain point on the page.
In the following code sample, as you can see in the comments, I first select the ones that match my querySelectorAll criteria:
IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")
In this example I have 10 elements in this collection. Each of these elements is contained in a table, which is its 7th-level ancestor.
MsgBox TypeName(IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")(2).ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode) ' HTMLTable
Some of those elements are in the same table.
You can see here the form which contains all the tables.
Now, the thing is that I want to select the innerHTML of only some of those elements, not all of them. The criterion for whether one of those 10 elements interests me is its position on the web page: I want all the elements which are below the message Part Usage. There is only one table containing the Part Usage text, so my idea was to check whether the table containing each element has a higher or lower index in the "form" collection than that table.
If the index is higher I want this element; otherwise I discard it.
What I did for this is the following code :
1. I set the class name Bim on all the tables containing one or more of the 10 elements.
For Each Element In IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") ' for each of the 10 numbers found with querySelectorAll, we find its table in the (form) collection and set that table's class name to "Bim". Since some of the numbers are in the same table, we won't end up with 10 tables with the class name "Bim", only x of them
Element.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.className = "Bim"
Next
2. I set the ID Stop on the table containing the text Part Usage.
For Each Element In IEDoc.getElementsByClassName("SectionHead")
If Element.innerHTML = "Part Usage" Then
'MsgBox TypeName(Element.ParentNode.ParentNode.ParentNode)' HTMLTable
Element.ParentNode.ParentNode.ParentNode.ID = "Stop"
End If
Next
3. I check which tables with the class name Bim are below (= have a higher index than) the table with the ID Stop. For each table matching that criterion (there is actually only one), I apply IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") inside it so that I get all the elements it contains, and more particularly their innerHTML.
For Each Element In IEDoc.getElementsByClassName("Bim") ' Here we check all the x tables which have the Classname "Bim"
If Element.indexInTheWholeForm > IEDoc.getElementById("Stop").indexInTheWholeForm Then ' compare somehow whether their index in the (form) collection is higher than that of the table with the ID "Stop" (in this case this is similar to checking whether the element is lower on the web page; we only want the elements which have a higher index, i.e. below the Part Usage table)
For Each Element2 In Element.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") ' Now we are in the table which contains the part numbers and we'll look for all the part numbers it contains by applying the queryselectorall again, but this time only in this specific table
array_parts2(iteration2) = Element.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")(iteration2).innerHTML
ActiveWorkbook.Worksheets(1).Cells(iteration2 + 1, 19) = array_parts2(iteration2)
iteration2 = iteration2 + 1
Next
End If
Next
Of course, what doesn't work is the indexInTheWholeForm property, which doesn't exist. Any ideas on how to do this?
Thanks for reading this far :)
Untested but I would do something like this (assuming I understood you correctly)
Sub Tester()
Const S_MATCH As String = "td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']"
Dim e, tbl, bHit As Boolean
'...
'load page etc
'...
'get all the matching rows and cycle through them
For Each e In IEDoc.querySelectorAll(S_MATCH)
'did we get to the table of interest yet?
If Not bHit Then
Set tbl = e.ParentNode.ParentNode.ParentNode.ParentNode. _
ParentNode.ParentNode.ParentNode
If IsPartUsageTable(tbl) Then bHit = True
End If
If bHit Then
'we reached the table of interest, so
' do something with e
End If
Next
End Sub
Function IsPartUsageTable(tbl) As Boolean
Dim e, rv As Boolean
For Each e In tbl.getElementsByClassName("SectionHead")
If e.innerHTML = "Part Usage" Then
rv = True
Exit For
End If
Next
IsPartUsageTable = rv
End Function
OK, so as unexpected as it sounds, I think I found a solution to my own question. I will confirm that it works as soon as I can run it with my colleague.
I kept points 1 and 2 from my initial post and replaced point 3 with the following:
For i = 0 To IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table").length - 1
If IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).ID = "Stop" Then
index_Part_Usage = i
Position_Part_Usage = index_Part_Usage + 1
Exit For
End If
Next
'MsgBox Position_Part_Usage
For i = 0 To IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table").length - 1
If IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).className = "Bim" Then
index = i
Position = index + 1
If index > index_Part_Usage Then
For Each Element2 In IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") ' Now we are in the table which contains the part numbers and we'll look for all the part numbers it contains by applying the queryselectorall again, but this time only in this specific table
array_parts2(iteration2) = IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")(iteration2).innerHTML
ActiveWorkbook.Worksheets(1).Cells(iteration2 + 1, 19) = array_parts2(iteration2)
iteration2 = iteration2 + 1
Next
End If
End If
Next i
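As a possible alternative to searching the table collection for positions, MSHTML elements expose a sourceIndex property, which is the element's ordinal position in document (source) order. The sketch below is untested against the page in question, and the helper name IsBelow is hypothetical.

```vba
' Hypothetical helper: True when candidate comes after marker in source order.
' Relies on the MSHTML sourceIndex property of both elements.
Public Function IsBelow(ByVal candidate As Object, ByVal marker As Object) As Boolean
    IsBelow = (candidate.sourceIndex > marker.sourceIndex)
End Function
```

In the loop over the tables with the class name Bim, the test would then read If IsBelow(Element, IEDoc.getElementById("Stop")) Then. Note that a descendant of the marker also has a higher sourceIndex than the marker itself, so this only approximates "lower on the page" when the tables being compared are not nested inside each other.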

How to create img tags dynamically from the code behind

I have the following img tag:
<img id="cross" runat="server" />
I would like to create several img tags according to the data that I'm getting back from the database, but I keep getting the error 'System.Web.UI.HtmlControls.HtmlImage' does not allow child controls.
How can I create img tags dynamically?
For example, if I'm getting back 5 dots (XCoord, YCoord) from the database, I want to create 5 new images.
Here is my code:
Dim ds As DataSet = dba.GetIncidentsByZone(135, "02/11/2015", "05/12/2015", m_User.CompanyCode)
If Not ds Is Nothing Then
For Each dr As DataRow In ds.Tables(0).Rows()
cross.Controls.Add(New HtmlImage() With {.Src = "C:/Inetpub/temp/pointer.gif", .Alt = ""})
cross.Style.Add("left", dr("XCoord").ToString() + "px")
cross.Style.Add("top", dr("YCoord").ToString() + "px")
cross.Attributes.Add("style", "visibility:visible")
Next
End If
Based on your comment, your loop needs to look more like this:
If Not ds Is Nothing Then
For Each dr As DataRow In ds.Tables(0).Rows()
Dim newImage as New HtmlImage() With {.Src = "C:/Inetpub/temp/pointer.gif", .Alt = ""}
newImage.Style.Add("left", dr("XCoord").ToString() + "px")
newImage.Style.Add("top", dr("YCoord").ToString() + "px")
someOtherContainerItemLikeADiv.Controls.Add(newImage)
Next
End If
On each iteration of the loop you need to make a new image, set its properties and then add it to some kind of control that allows child controls, like a div.

Retrieve attributes and span using HTMLAgilityPack library

In this piece of HTML code:
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
<img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
</div>
<div class="release">
<h3>Wolf Eyes</h3>
<h4>
Lower Demos
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
Rock
Pop
</div>
</div>
I know how to parse it in other ways, but I would like to retrieve this Info using HTMLAgilityPack library:
Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year : 2013
Genres: Rock, Pop
URL : http://www.mp3crank.com/wolf-eyes/lower-demos-121866
Which are these html lines:
Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year : <span>2013</span>
Genre1: Rock
Genre2: Pop
URL : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"
This is what I'm trying, but I always get an "object reference not set" exception when trying to select a single node.
Sorry, but I'm very new to HTML. I've tried to follow the steps of this question: HtmlAgilityPack basic how to get title and link?
Public Class Form1
Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing
Private Title As String = String.Empty
Private Cover As String = String.Empty
Private Genres As String() = {String.Empty}
Private Year As Integer = -0
Private URL as String = String.Empty
Private Sub Test() Handles MyBase.Shown
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
Genres = ¿select multiple nodes?
URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value
Next
End Sub
End Class
Your mistake here is trying to read an attribute of a child node from the node you've found.
When you call node.SelectSingleNode("//div[@class='release']") you get the correct div returned, but calling .Attributes returns just the attributes of the div tag itself, not those of any of the inner HTML elements.
It's possible to write XPATH queries that select the sub-node, e.g. //div[@class='release']/a - see http://www.w3schools.com/xpath/xpath_syntax.asp for more information on XPATH. Although the examples are for XML, most of the principles should apply to an HTML document.
Another approach is to use further XPATH calls on the node you've found. I've amended your code to make it work using this approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
'Assumes we find the node and it has an a-tag
Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value
Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value
Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)
Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
Dim genreLinks = genreNode.SelectNodes(".//a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Next
Note that in this code we're assuming the document is correctly formed and that each node/element/attribute exists and is correct. You might want to add a lot of error checking to this, e.g. If someNode Is Nothing Then ....
Edit: I've amended the code above slightly, to ensure each .SelectSingleNode uses the ".//" prefix - this ensures it works if there are several "item" nodes, otherwise it selects the first match from the document not the current node.
If you want a shorter XPATH solution, here is the same code using that approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value
Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)
Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Console.WriteLine()
Next
You were not that far from the solution. Two important notes:
// is a recursive call. It can have some heavy performance impact, and also it may select nodes you don't want, so I suggest you only use it when the hierarchy is deep or complex or variable, and you don't want to specify the whole path.
There is a useful helper method on HtmlNode named GetAttributeValue which will get you an attribute's value even if the attribute does not exist (you need to specify the default value).
Here is a sample that seems to work:
' select the base/parent DIV (here we use a discriminant CLASS attribute)
' all select calls below will use this DIV element as a starting point
Dim node As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//div[@class='item']")
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Title :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("title", CStr(Nothing))))
' get to the IMG tag which is a child or grand child (//) of a 'thumb' DIV
Console.WriteLine(("Cover :" & node.SelectSingleNode("div[@class='thumb']//img").GetAttributeValue("src", CStr(Nothing))))
' get to the SPAN tag which is a child or grand child (//) of a 'release-year' DIV
Console.WriteLine(("Year :" & node.SelectSingleNode("div[@class='release-year']//span").InnerText))
' get all A elements which are child or grand child(//) of a 'genre' DIV
Dim nodes As HtmlNodeCollection = node.SelectNodes("div[@class='genre']//a")
Dim i As Integer
For i = 0 To nodes.Count - 1
Console.WriteLine(String.Concat(New Object() { "Genre", (i + 1), ":", nodes.Item(i).InnerText }))
Next i
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Url :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("href", CStr(Nothing))))