Get data from a web table with table tag - html

I have this code in HTML:
<table cellspacing = "0" cellpadding = "0" width = "100%" border="0">
<td class="TOlinha2"><span id="Co">140200586125</span>
I already have a VBA function that accesses a web site, logs in and goes to the right page. Now I'm trying to take the td tags inside a table in HTML. The value I want is 140200586125, but I want a lot of td tags, so I intend to use a for loop to get those tds and put them in a worksheet.
I have tried both:
.document.getElementByClass()
and:
.document.getElementyById()
but neither worked.
Appreciate the help. I'm from Brazil, so sorry about any English mistakes.

There is not enough HTML to determine whether TOlinha2 is a consistent class name for all the tds within the table of interest, or whether it is limited to this table. If both are true, you can indeed use .querySelectorAll.
You could use the CSS selector:
ie.document.querySelectorAll(".TOlinha2")
Where "." stands for className.
You cannot iterate over the returned NodeList with a For Each Loop. See my question Excel crashes when attempting to inspect DispStaticNodeList. Excel will crash and you will lose any unsaved data.
You have to loop the length of the nodeList e.g.
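Dim nodeList As Object, i As Long
Set nodeList = ie.document.querySelectorAll(".TOlinha2")
' A NodeList exposes its size through .Length; Len() is for strings
For i = 0 To nodeList.Length - 1
    Debug.Print nodeList(i).innerText
Next i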
Sometimes you need different syntax which is:
Debug.Print nodeList.Item(i).innerText
You can narrow this CSS selector further with more qualifying elements, for example by requiring that the element be inside a tbody (i.e. within a table) and a tr (table row), and have the class name TOlinha2:
ie.document.querySelectorAll("tbody tr .TOlinha2")

Since you mentioned you need to retrieve multiple <td> tags, it would make more sense to retrieve the entire collection rather than using getElementById() to get them one-at-a-time.
Based on your HTML above, this would match all <span> nodes within a <td> with a class='TOlinha2':
Dim node, nodeList
Set nodeList = ie.document.querySelectorAll("td.TOlinha2 > span")
For Each node In nodeList
MsgBox node.innerText ' This should return the text within the <span>
Next
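To put the matched values into a worksheet, as the question asks, the same selector can be combined with an index-based loop. This is only a sketch: the procedure name WriteNodesToSheet is made up for illustration, and it assumes ie is an InternetExplorer object that is already logged in and on the right page, and that column A of the active sheet is free to be overwritten.

```vba
' Hypothetical helper: copies the text of every matching <span> into column A.
' Assumes ie is an InternetExplorer object already showing the target page.
Sub WriteNodesToSheet(ByVal ie As Object)
    Dim nodeList As Object
    Dim i As Long

    Set nodeList = ie.document.querySelectorAll("td.TOlinha2 > span")

    ' Item i of the NodeList goes into worksheet row i + 1
    For i = 0 To nodeList.Length - 1
        With ActiveSheet.Cells(i + 1, 1)
            .NumberFormat = "@"   ' store as text so long digit strings display in full
            .Value = nodeList.Item(i).innerText
        End With
    Next i
End Sub
```

The index-based loop is used here rather than For Each so the sketch does not depend on how the NodeList behaves when enumerated.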

Related

How to click on a button on a webpage using <td> and <tr>?

I am trying to click on the first "Completed" button in the highlighted part of the webpage below.
Here is a piece of the VBA code of the website page:
I tried to click on the FIRST completed button in many different ways such as :
For Each element In ie3.getElementsByTagName("main_table_data_right_border main_table_data_bottom_border")(5)
If element.innerText = "Completed" Then
' Application.Wait (Now + TimeValue("0:03:00"))
element.Click
Application.Wait (Now + TimeValue("0:00:20"))
Exit For
Else
End If
Next
Or
doc.querySelector("#divPage > table.advancedSearch_table > tbody").getElementsByTagName("tr")(3).getElementsByTagName("td")(5).Children(0).Click
But none of them seem to work. When I debug the code and I go through this part and this particular line, nothing really happens. So the button is not being clicked.
Can anyone help me with that?
You could use the getElementsByTagName method to find the hyperlink. Please refer to the following sample:
VBA code to find the hyperlink and click it (in this sample, I only find the specific cell in the first data row. If you want to go through every hyperlink, use a For Each statement to loop through the collection).
Sub Test()
    Dim ie As Object
    Dim doc As Object
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate "http://localhost:54382/HtmlPage47.html"
    ' Wait until the page has finished loading, yielding to the system meanwhile
    Do While ie.Busy Or ie.ReadyState <> 4
        DoEvents
    Loop
    Set doc = ie.document
    ' First data row (index 1), sixth cell (index 5), first link in that cell
    doc.getElementsByTagName("tr")(1).getElementsByTagName("td")(5).getElementsByTagName("a")(0).Click
End Sub
Code in the Web page:
<div>
<table class="main_table" style="text-align:center;">
<tr class="main_table_header">
<td></td>
<td>Export Type</td>
<td>Criteria</td>
<td>Rep./List</td>
<td>Creation Date</td>
<td>Status</td>
<td>Reference</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello AA')" id="link1" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello BB')" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello CC')" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello DD')" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
</table>
</div>
I see you are a bit confused as to how to access HTML elements, so I'll take this opportunity to demonstrate the logic of doing so in a very detailed manner, which I also believe to be very intuitive. There are other ways to do it, but I believe the following one is the most comprehensive and intuitive one and ideal for a beginner.
Firstly, I will go ahead and assume that ie3 is an InternetExplorer object.
When you use this object to navigate to a page, you can access the html of that page by using the ie3.document, which holds an HTML document object.
To take full advantage of the HTML document object you should add a reference to the Microsoft HTML Object Library. This Library will allow you to use a number of HTML elements which make your life easier.
In your case, the elements you want to be able to access are:
HTML tables and their rows and cells
HTML anchor elements (<a></a>)
So my declarations would be the following:
Dim ie3 As New InternetExplorer 'To be used to navigate to the page of interest
Dim doc As HTMLDocument 'this will hold the HTML document corresponding to the page
Dim toBeClicked As HTMLAnchorElement 'To be used to store the <a></a> element
Dim table As HTMLTable 'To be used to store the table element
Dim tableRow As HTMLTableRow 'To be used to store a row of the table element
Dim tableCell As HTMLTableCell 'To be used to store a cell of the table element
Assuming that you have already used ie3 to navigate to the website of interest, you can store its HTML document in doc like so:
Set doc = ie3.document
Once you have access to the HTML document of the webpage, you can also get access to its elements in a number of ways, some more targeted than others. Below I am demonstrating the most common methods to do that, using the table element as an example.
If the table has a unique ID, you can get access to it by using the .getElementById() method. This method returns a single element. In your case, the table you're after, doesn't have an ID.
If the table belongs to a class, you can get access to it by using the .getElementsByClassName() method. This method returns a collection of elements, all of which belong to the same class. To get access to a member of this collection you can use a (item index) kind of notation. The first member has an index of 0. In your case the table belongs to class "advancedSearch_table", which happens to only have one member.
If there's no class or ID you can use the .getElementsByTagName method. This method returns a collection of all the elements that have the same tag. In your case you would need all the tables in the document. To get access to a member of this collection you can use a (item index) kind of notation. The first member has an index of 0. Tags in HTML look like so <tagName attribute="something">Something</tagName>.
Below I demonstrate all three methods. You can use either one of the first two:
Set table = doc.getElementsByClassName("advancedSearch_table")(0)
Set table = doc.getElementsByTagName("table")(0)
Set table = doc.getElementById("ID of the table") 'only for demonstration purposes, it doesn't apply to your case, as the table has no ID.
Keep in mind that in your case, there is only one table in the document and there's only one element that belongs to the class "advancedSearch_table". This means that you need the first element of the corresponding collections. That's why I use 0 as index.
By the same logic as above, now that the table has been stored, you can get access to its rows and cells. More specifically, you need the 5th cell of the 4th row. That's where the link that you want to click is:
Set tableRow = table.getElementsByTagName("tr")(3)
Set tableCell = tableRow.getElementsByTagName("td")(4)
Finally, now that the cell of interest has been stored, you can access the anchor element and click it. Again, there's only one anchor element in the cell, so it's going to be the first one in the corresponding collection:
Set toBeClicked = tableCell.getElementsByTagName("a")(0)
toBeClicked.Click
BONUS
If you want to click on all the "Completed" links, one by one, you need to loop through the corresponding elements. Here are two ways to do it:
Click on the anchor in the 5th cell of each row:
For Each tableRow In table.Rows
Set toBeClicked = tableRow.getElementsByTagName("td")(4).getElementsByTagName("a")(0)
toBeClicked.Click
Next tableRow
Loop through all rows and through all cells of the table, find the inner text that you're looking for and click the corresponding anchor:
For Each tableRow In table.Rows
    For Each tableCell In tableRow.Cells
        If tableCell.innerText = "Something" Then
            Set toBeClicked = tableCell.getElementsByTagName("a")(0)
            toBeClicked.Click
        End If
    Next tableCell
Next tableRow
Here, once you click on the Completed hyperlink, JavaScript is executed and it opens an Excel file. You can call the same script directly with ie3.Navigate "javascript:openExcelFile('t83_Kerrfinancialadvisorsinc/455X3/ExportActivity_66260230820190703122002139.xlsx')"
Since it's tied with a hyperlink, you can also try using
element.Click
element.FireEvent ("onclick")
or you can use execScript
Call ie3.document.parentWindow.execScript("your script in webpage", "JavaScript")

VBA to click a dynamic href

I'm trying to click a link on a website with the tag:
<a href="/dbget-bin/www_bget?dr:D01441:>D01441</a>
However, I'm doing this after searching for a unique item (I have an array of >9000 unique items), and the "D01441" part is different for each item, and I don't know in advance what it will be for each. The following code is in a loop that goes through each item and searches for it one at a time. After searching, I would like to click on a link that appears (the code above) and do more things on that next web page.
Dim IE As Object
Dim ele As Object
Set IE = CreateObject("InternetExplorer.Application")
...
For Each ele In IE.document.getElementsByTagName("a")
If ele.Href = "/dbget-bin/www_bget?dr:D01441" Then
ele.Click
Exit For
End If
Next
The above code doesn't work and I'm not sure why. But once I get it to work, I don't know how to modify the "D01441" part so that I can click on any searched item's link. Here's more html around the link I want:
<tbody>
<tr> ... </tr>
<tr>
<td class = "data1">
<a href = "/dbget-bin/www_bget?dr:D01441:>D01441</a>
</td>
<td class = "data1">..</td>
<td class = "data1">..</td>
...
EDIT: To try to deal with the changing "D01441", I tried using InStr but it doesn't work either:
For Each ele In IE.document.getElementsByTagName("a")
If InStr(ele.Href, "/dbget-bin/www_bget?dr:") = 1 Then
MsgBox "There"
ele.Click
Exit For
End If
Next
CSS selectors:
Try using a CSS selector combination applied via querySelector method of document to target the common start part of the href.
Applying the selector combination:
IE.document.querySelector("a[href^='/dbget-bin/www_bget?dr:']").Click
Understanding the selector combination:
This uses a CSS selector combination to target the element with:
a[href^='/dbget-bin/www_bget?dr:']
This says element with a tag having attribute href whose value starts with
'/dbget-bin/www_bget?dr:' . The ^ means starts with.
Side note:
If you have multiple elements with a tags and an href that starts with /dbget-bin/www_bget?dr:, it will match the first one, in most instances. If that is the case seeing more HTML would help. I think there are a few problems with that HTML sample because in theory a more selective CSS query might be .data1 a[href^='/dbget-bin/www_bget?dr:'], so as to include the parent element class of data1, "." being a class selector.
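If several links match, one option is to collect them all with querySelectorAll and choose by index. The following is a sketch under assumptions: the procedure name ClickFirstDrugLink is invented for illustration, IE is assumed to be an InternetExplorer object already showing the search results, and the .data1 qualifier only helps if the real page marks the parent cell with that class as the posted sample suggests.

```vba
' Hypothetical helper: clicks the first link whose href starts with the common prefix.
' Assumes IE is an InternetExplorer object already showing the results page.
Sub ClickFirstDrugLink(ByVal IE As Object)
    Dim links As Object

    ' ^= means "attribute value starts with"
    Set links = IE.document.querySelectorAll(".data1 a[href^='/dbget-bin/www_bget?dr:']")

    If links.Length > 0 Then
        ' The link text is the item code, so it can be logged before clicking
        Debug.Print links.Item(0).innerText
        links.Item(0).Click
    Else
        Debug.Print "No matching link found"
    End If
End Sub
```

Checking .Length first avoids a run-time error when a search returns no matching link, which seems likely to happen at least once across several thousand items.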
@QHarr's answer is the elegant and best solution, but...
To address your issue of getting the part number from the href, you can use InStr and InStrRev like this:
Dim partNumber As String
Dim colonPosition As Long
For Each ele In IE.document.getElementsByTagName("a")
    ' ele.Href is the fully resolved URL, so it also contains the colon in "http:".
    ' Only look at the links of interest and search for the colon from the right.
    If InStr(1, ele.Href, "www_bget?dr:", vbTextCompare) > 0 Then
        colonPosition = InStrRev(ele.Href, ":")
        partNumber = Mid$(ele.Href, colonPosition + 1)
        Debug.Print partNumber
    End If
Next ele

VBA Excel Scraping

I am getting started with scraping. The page I am working with is behind a login. I remember reading that you should avoid indexing with (1), (2) or (3) after getElementsByTagName and should instead target something more unique, such as a class name or ID. Can someone please tell me why the following happens?
This getElementsByTagName version works:
Dim Companyname As String
Companyname = ie.document.getElementsByTagName("span")(1).innertext
This getElementsByClassName version does not work:
Dim Companyname As String
Companyname = ie.document.getElementsByClassName("account-website-name").innertext
This is the text that I am scraping
<span class="account-website-name" data-journey-name="true">Dwellington Journey</span>
getELEMENTbyProperty vs getELEMENTSbyProperty
There are primarily two distinct types of commands to retrieve one or more elements from a web page's .Document; those that return a single object and those that return a collection of objects.
Getting an ELEMENT
When getElementById is used, you are asking for a single object (e.g. MSHTML.IHTMLElement). In this case the properties (e.g. .Value, .innerText, .outerHtml, etc) can be retrieved directly. There isn't supposed to be more than a single unique id property within an HTML body so this function should safely return the only element within the ie.document that matches.
'typical VBA use of getElementById
Dim CompanyName As String
CompanyName = ie.document.getElementById("CompanyID").innerText
Caveat: I've noticed a growing number of web designers who seem to think that using the same id for multiple elements is oh-key-doh-key as long as the id's are within different parent elements like different <div> elements. AFAIK, this is patently wrong but seems to be a growing practise. Be careful on what is returned when using .getElementById.
Getting ELEMENTS
When using getElementsByTagName, getElementsByClassName, etc. where the word Elements is plural, you are returning a collection (e.g. MSHTML.IHTMLElementCollection) of objects, even if that collection contains only one or even none. If you want to use these to directly access a property of one of the elements within the collection, an ordinal index number must be supplied so that a single element within the collection is referenced. The index number within these collections is zero based (i.e. the first starts at (0)).
'retrieve the text from the third <span> element on a webpage
Dim CompanyName As String
CompanyName = ie.document.getElementsByTagName("span")(2).innerText
'output all <span> classnames to the Immediate window until the right one comes along
'retrieve the text from the first <span> element with a classname of 'account-website-name'
Dim e as long, es as long
es = ie.document.getElementsByTagName("span").Length - 1
For e = 0 To es
Debug.Print ie.document.getElementsByTagName("span")(e).className
If ie.document.getElementsByTagName("span")(e).className = "account-website-name" Then
CompanyName = ie.document.getElementsByTagName("span")(e).innerText
Exit For
End If
Next e
'same thing, different method
Dim eSPN As MSHTML.IHTMLElement, ecSPNs As MSHTML.IHTMLElementCollection
Set ecSPNs = ie.document.getElementsByTagName("span") 'object references need Set
For Each eSPN in ecSPNs
Debug.Print eSPN.className
If eSPN.className = "account-website-name" Then
CompanyName = eSPN.innerText
Exit For
End If
Next eSPN
Set eSPN = Nothing: Set ecSPNs = Nothing
To summarize, if your InternetExplorer document method uses Elements (plural) rather than Element (singular), you are returning a collection which must have an index number appended if you wish to treat one of the elements within the collection as a single element.
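Applied to the line from the question, that means adding an index after getElementsByClassName. A minimal sketch, assuming ie is an InternetExplorer object already on the page and that the page renders in a document mode that supports getElementsByClassName (IE9 or later); the function name GetCompanyName is made up for illustration:

```vba
' Hypothetical helper: returns the text of the first element with the class,
' or an empty string when no element matches.
Function GetCompanyName(ByVal ie As Object) As String
    Dim matches As Object

    Set matches = ie.document.getElementsByClassName("account-website-name")

    ' The collection may be empty, so check before indexing into it
    If matches.Length > 0 Then
        GetCompanyName = matches(0).innerText
    End If
End Function
```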
CSS selector:
You can achieve the same thing with a CSS selector of .account-website-name
The "." means className. This will return a collection of matching elements if there are more than one.
VBA:
You apply the selector with the .querySelectorAll method of .document. This returns a nodeList which you traverse the .Length of, accessing items by index, starting from 0.
Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll(".account-website-name")
For i = 0 To aNodeList.Length -1
Debug.Print aNodeList.Item(i).innerText
' Debug.Print aNodeList(i).innerText ''<== sometimes this syntax instead
Next

Excel VBA: using HTML DOM

I need to know what I can do with the HTML DOM from Excel VBA. For example, I know that I can find an element by its ID:
ie.document.getElementByID().
I will be working with an HTML table whose elements have no IDs, so I think the path will look like child -> child -> sibling -> child, and so on.
Can anybody please show me a piece of code that gets the text "hello" from this example table? The first node will be found by its ID.
...
<table id="something">
<tr>
<td></td><td></td>
</tr>
<tr>
<td></td><td>hello</td>
</tr>
...
I'm looking at this type of thing at the moment...
I believe something like the below should do it:
ie.document.getElementById("something").getElementsByTagName("td")(3).innerText
How it works:
It searches the HTML document for the element whose ID is "something" (it takes the first match if there is more than one). It then gets the table data (td) tags inside that element and goes to index 3, where the text is (index 0 is the first td).
Let us know if this works.
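An alternative that follows the table structure more directly is to go through the table's Rows and Cells collections instead of counting td tags. This is a sketch, assuming ie is an InternetExplorer object that has already loaded the page with the example table; the function name GetHelloCell is invented for illustration:

```vba
' Hypothetical helper: reads the second cell of the second row of table "something".
Function GetHelloCell(ByVal ie As Object) As String
    Dim tbl As Object

    Set tbl = ie.document.getElementById("something")

    ' Both collections are zero-based: Rows(1) is the second row,
    ' Cells(1) is the second cell within that row
    GetHelloCell = tbl.Rows(1).Cells(1).innerText
End Function
```

Addressing by row and column keeps working if cells are added to earlier rows, whereas a flat td index would shift.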
Many HTML documents have names in generalized tables as opposed to IDs.
I commonly use some form of what is shown below:
Set HTML = IE.Document
' Collect every td inside the table; .Children would only return the table's direct children
Set Cells = HTML.getElementById("something").getElementsByTagName("td")
For Each Cell In Cells
    If Cell.getAttribute("name") = "myname" Then
        'code in here if you are looking for a name to the element
    ElseIf Cell.getAttribute("value") = "myValue" Then
        'Code in here if you are looking for a specific value
    ElseIf Cell.innerText = "mytext" Then
        'more code in here if you are looking for a specific inner text
    End If
Next Cell

extracting text from a specific <h> element using GetElementById

I have created a VBS script file that looks at an XML data file.
Within the XML data file, the HTML data I need is embedded within the
<![CDATA[ 'other interesting HTML data here' ]]>.
I have stripped out this HTML data using XPath and inserted it into a Div object (myDiv) that is held in a variable (it's not written to a document).
So for example, the contents of myDiv.innerHTML looks like this;
<table>
<tr><td>text in cell 1</td></tr>
<tr><td><h1 id="myId1">my text for H1</h1></td></tr>
<tr><td><h2 id="myId2">my text for h2</h2></td></tr>
</table>
What I want to do at first is simply select the appropriate tag with the Id that matches "myId1", therefore, I used a statement like this;
MyIdText = MyDiv.getElementById("myId1")
However, the application I am using says "Err 438, Object doesn't support this property or method".
I am a bit of a newbie with code and can understand some of the basic fundamentals, but get a bit lost when it becomes more complex (sorry). I have looked through other postings on this board, and all of them seem to relate to HTML and JavaScript, not VBScript (the application I am using will not allow JavaScript).
Am I using the code wrong?
getElementById() is a method of the document object, not of individual elements such as your Div, which is why calling it on MyDiv raises error 438. Write document.getElementById("myId1") instead, so that the browser searches inside 'document' for the specified ID.
To extract the text inside the specific H element:
MyIdText = document.getElementById("myId1").textContent
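When the markup lives in an element that has not been added to the document, as with the Div described in the question, one workaround is to walk the element's descendants and compare their id values. The helper below is a sketch; FindById is a made-up name, not a DOM method, and it is written without type declarations so that the same text should be usable from VBScript as well as VBA.

```vba
' Hypothetical helper: returns the first descendant of container whose id
' equals targetId, or Nothing when there is no match.
Function FindById(container, targetId)
    Dim el

    Set FindById = Nothing

    ' "*" asks for every descendant element regardless of tag name
    For Each el In container.getElementsByTagName("*")
        If el.id = targetId Then
            Set FindById = el
            Exit Function
        End If
    Next
End Function
```

A call might look like Set MyHeading = FindById(MyDiv, "myId1"), followed by a check such as If Not MyHeading Is Nothing Then before reading MyHeading.innerHTML.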
Many thanks for the help. Unfortunately, I know a little VBS and even less about the DOM, and I am trying to learn both by experimenting. There are certain restrictions within the environment/application I am working with (it's called ASCE and it's a tool for managing Safety Cases, but that's not important right now).
However, so that we are comparing apples with apples, I have tried to experiment within an HTML page to give me a better understanding of what the DOM/VBS commands can actually do. I have had some partial success, but still can't understand why it falls over where it does.
Here is the exact file I am experimenting with. I have added comment text for each section:
<html>
<head>
<table border=1>
<tr>
<td>text in cell 1</td>
</tr>
<tr>
<td><h1 id="myId1">my text for H1</h1></td>
</tr>
<tr>
<td><h1 id="myId2">my text for h2</h2></td>
</tr>
</table>
<script type="text/vbscript">
DoStuff
Sub DoStuff
' Section 1: Get a node with the Id value of "myId1" from the above HTML
' and assign it to the variable 'GetValue'
' This works fine :-)
Dim GetValue
GetValue = document.getElementById("myId1").innerHTML
MsgBox "the text=" & GetValue
' Section 2: Create a query that assigs to the variable 'MyH1Tags' to all of the <h1>
' tags in the document.
' I assumed that this would be a 'collection of <h1> tags so I set up a loop to itterate
' through however many there were, but this fails as the browser says that this object
' doesn't support this property or method - This is where I am stuck
Dim MyH1Tags
Dim H1Tag
MyH1Tags = document.getElementsByTagName("h1") ' this works
For Each H1Tag in MyH1Tags ' this is where it falls over
MSgbox "Hello"
Next
' Section 3: Create a new Div element 'NewDiv' and then insert some HTML 'MyHTML'
' into 'NewDiv'. Create a query 'MyHeadings' that extracts all h1 headings from 'NewDiv'
' then loop round for however many h1 headings there are in 'MyHeadings'
' and display the text content. This works Ok
Dim NewDiv
Dim MyHTML
Dim MyHeadings
Dim MyHeading
Set NewDiv = document.createElement("DIV")
MyHTML="<h1 id=""a"">heading1</h1><h2 id=""b"">Heading2</h2>"
NewDiv.innerHTML=MyHTML
Set MyHeadings = NewDiv.getElementsByTagName("h1")
For Each MyHeading in MyHeadings
Msgbox "MyHeading=" & MyHeading.innerHTML
Next
'Section 4: Do a combination of Section 1 (that works) and Section 3 (that works)
' by creating a new Div element 'NewDiv2' and then paste into it some HTML
' 'MyHTML2' and then attempt to create a query that extracts the inner HTML from
' an id attribute with the value of "a". But this doesnt work either.
' I have tried "Set MyId = NewDiv2.getElementById("a").innerHTML" and
' also tried "Set MyId = NewDiv2.getElementById("a")" and it always falls over
' at the same line.
Dim NewDiv2
Dim MyHTML2
Dim MyId
Set NewDiv2 = document.createElement("DIV")
MyHTML2="<h1 id=""a"">heading1</h1><h2 id=""b"">Heading2</h2>"
NewDiv2.innerHTML=MyHTML
MyId = NewDiv2.getElementById("a").innerHTML
End Sub
</script>
</head>
<body>