Extract data from a web page that may not be formatted as a table - html

For starters I am by no means an expert in VBA. Just know enough to be dangerous 8).
I started out by doing a search on how to extract a table from a web page and saw many many people have asked the same question. Unfortunately most of what I was reading was over my head. One article I read pointed me to this detailed article by Siddharth Rout, but alas I could not follow what was going on other than there are two methods internet explorer or some other methods. Since I only have IE11 installed and MS Office I would prefer to go the IE route.
I have encountered this problem several times in the past and have always dropped the project or done things manually. Today I thought I would try and learn how to do this and make my future life hopefully a little easier. As such I am going to use data from a gaming website since it mimics other things I have encountered in the past.
So today's (this week's..no this month's..I am an optimist!) project is to build a list of every team involved in a tournament and copy their results into excel. This would be akin to pulling cricket, hockey, baseball, soccer, or football stats. I tried using Excel's built in Get Data From Web process, but it did not identify the table on the web page.
The address for the web page is: http://worldoftanks.com/en/tournaments/1000000017/
and is in the image below
So the basics and my starting point is to simply pull the list of teams from 1 group and paste it in an excel page with no formatting. Basically the area in yellow in the image above. The image could not fit the whole page but there are actually 10 teams in this group. However I would like to make it variable as sometimes you may have more or less than 10 teams in a group. I am going to assume the number of rows is a minor issue at this point.
Once I get that part figured out I am hoping it will be relatively easy to switch to the next group, grab that list of teams and results and add them to the end of the list I am building in excel. On the web page this would be done by selecting the blue areas.
Now once I have those two things figured out I would need to build the list again from scratch based on the stage of the tournament areas in green and put that list on a new page. I have some ideas how to achieve this but it will really depend on what the previous two steps look like.
I have a bonus task for myself too which is to pull the schedule for each team in a group to see how they did against various other teams. Who beat who type deal. I am hoping I can figure that part out based on the information learned from the task above.
So I am pretty sure there are other languages/programs that are better suited for the task at hand, but I would like to stick with what I have...and the little I know so far. So I tried a wee bit of VBA code and commented on what I need to achieve. So far I think I have opened the webpage! and built a bit of a thought process in comments on how to do some of the things.
Sub GetTeamData()
    Dim IE As Object
    Dim roundcounter As Integer
    Dim groupcounter As Integer
    Dim TeamList As Variant
    Dim WebAddress As String
    Dim Number_of_rounds As Integer
    Dim Number_of_groups As Integer
    'set webaddress of site to link to
    WebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"
    Set IE = CreateObject("InternetExplorer.Application")
    With IE
        .Visible = True
        .navigate (WebAddress)
    End With
    'What does this chunk of code do? Wait for webpage to finish loading?
    While IE.readyState <> 4
        DoEvents
    Wend
    'set initial parameters for loops. I am ok with hardcoding this for now.
    Number_of_groups = 125
    Number_of_rounds = 5
    'start pulling teamdata
    'For roundcounter = 1 To number_of_rounds
    '    select roundcounter on webpage
    '    for groupcounter = 1 to number_of_groups
    '        select groupcounter on webpage
    '        grab table of 6-10 teams (position, team name, battles, wins, losses, ties, and points)
    '        add table to TeamList
    '    next groupcounter
    '    paste TeamList to sheet roundcounter cell A1
    '    clear TeamList
    'next roundcounter
    'Next task
    'based on results on how to pull group table data, pull individual team schedule results to build matrix result
    Set IE = Nothing
End Sub
One thing I was thinking about was that instead of using for next loops with a counter is if it would be easier to set it up to do a loop until an error had occurred like exceeding the number of groups or rounds. Now I am rambling.
Anyhow if someone would be so kind to get me started on how to pull the yellow area from the image above that would be much appreciated! Please be gentle! I do realize that this question has been asked many a time... I just did not understand what I was reading. Also if this is not possible or extremely difficult to do please let me know. Thank you in advance for your assistance in educating me.
UPDATE 16/03/19 0900
So I tried the Get Data From Web process again this morning with a bit more luck...but not much.
after 1 error window which I click yes to I get the web page to load
I got the little yellow arrow to show up once on the page in the very top left corner. So I tried it and it did pull in information.
but I did notice there were no yellow boxes next to the table I want which makes me wonder if it is not a table.
When I did pull in information, it was not the information I was looking for. When I scanned through the results, I could see where the data I am looking for should be, but all the results are missing, just the table column headers show up in about Row 263 or so.
So then I tried doing a copy and paste method from the web page using select all for the copy on the web page. For the paste I tried different methods. keeping source formatting resulted in nothing. keep destination formatting brought in information. I tried paste special (html, Unicode and text) HTML made things look pretty and the other two put everything into a single column. More importantly the results were in the table.
Now if I only needed round 1 group 1 team list and results I could work with this. Simply delete all the rows above and below the table and voila! however since the web address is the same for every group and every round I have no idea how to "click" on the blue or green areas to update the info. If I knew this I could automate the process by copying and pasting each page, then editing the results to just the table, and moving the table to another sheet just below the last results.
To me there seems like there should be a better method.
16/03/19 1600
<!-- ko if: visibleBracketType() === ROUND_ROBIN -->
<table class="tournament-table tournament-table__indent" cellpadding="0" cellspacing="0">
<tr class="tournament-table_tr">
<th class="tournament-table_th tournament-table_th__numb">#</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-team">Team</span>
</div>
<div class="tournament-table_heading-text">
Team
</div>
</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-battles">Battles</span>
</div>
<div class="tournament-table_heading-text">
Battles
</div>
</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-victory">Victories</span>
</div>
<div class="tournament-table_heading-text">
Victories
</div>
</th>
<th class="tournament-table_th tournament-table_th__mobile-hide">
<div class="tournament-table_ico-holder">
<span class="ico-flag">Defeats</span>
</div>
<div class="tournament-table_heading-text">
Defeats
</div>
</th>
<th class="tournament-table_th tournament-table_th__mobile-hide">
<div class="tournament-table_ico-holder">
<span class="ico-division">Draws</span>
</div>
<div class="tournament-table_heading-text">
Draws
</div>
</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-points">Points</span>
</div>
<div class="tournament-table_heading-text">
Points
</div>
</th>
</tr>
<!-- ko foreach: {data: rrBrackets().teams, as: 'team' } -->
<tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<td class="tournament-table_td" data-bind="text: team.position"></td>
<td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<a class="tournament-table_team tournament-table_team__big" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}"></a>
</td>
<td class="tournament-table_td" data-bind="text: team.battle_played"></td>
<td class="tournament-table_td" data-bind="text: team.wins"></td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses"></td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws"></td>
<td class="tournament-table_td" data-bind="text: team.extra_statistics.points"></td>
</tr>
<!-- /ko -->
</table>​
ok, from what I am gathering from the various posts I have been reading and videos I have been watching, I need to find some critical "Tag" in the coding of the web page and from that I can eventually start pulling data. I hit F12 on IE to view the code, and then in the code area I did a search on some of the display text in the area I was looking and found the above chunk of "code". With a lot of GUESSING I am hoping I grabbed the right chunk. Now to figure out what that critical tag is and how to use it. By the way, what code is that web page in?

Although extracting data from a webpage can be automated with VBA (see below), the specific example webpage you provided comes with some obstacles:
This webpage loads and displays only a small portion of the desired data at a time. This is probably done for performance reasons, since the whole table of Teams would consist of several thousand entries.
Only the Teams of the currently displayed Round and currently displayed Group are loaded.
If you click on another Group, a JavaScript program (running in your browser) is started that connects to the server, fetches the Teams of that Group and replaces the data in the webpage. You can verify this by yourself if you press F12 and observe the Network tab that lists all requests to the server.
Thus, the webpage does not provide at any point a complete list of Teams. You would have to work around this:
Make your program automatically click on each Round, then click on each Group and finally extract the 9 teams of that Group, merging everything together afterwards.
Hook into the JavaScript code that loads each Group's Teams and call it in a loop, or reverse-engineer the requests made by that code and try to re-create them in VBA. Although this could be an elegant solution, many website owners do not like having their API used in ways they did not intend.
A misuse could create a huge server load. I would only recommend this method if the API was designed for this purpose (some websites do this, like Twitter or Steam).
The following will focus on just extracting content from a given page, that is, retrieving the Teams of the currently loaded Group. I won't use any of the workarounds mentioned above.
The program basically consists of these three parts:
Open Webpage
The following is a helper function that opens a webpage and returns an object with the webpage's content.
It needs the libraries Microsoft Internet Controls and Microsoft HTML Object Library referenced (see here for instructions).
' return the document containing the DOM of the page strWebAddress
' returns Nothing if the timeout lngTimeoutInSeconds was reached
Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
    Dim IE As SHDocVw.InternetExplorer
    Dim IEDocument As MSHTML.HTMLDocument
    Dim dateNow As Date
    ' create an IE application, representing a tab
    Set IE = New SHDocVw.InternetExplorer
    ' optionally make the application visible, though it will work perfectly fine in the background otherwise
    IE.Visible = True
    ' open a webpage in the tab represented by IE and wait until the main request successfully finished
    ' on timeout the function exits early and returns Nothing (any warning is left to the caller)
    IE.Navigate strWebAddress
    dateNow = Now
    Do While IE.Busy
        If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
        DoEvents ' keep Excel responsive while waiting
    Loop
    ' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
    ' on timeout the function exits early and returns Nothing
    Set IEDocument = IE.Document
    dateNow = Now
    Do While IEDocument.ReadyState <> "complete"
        If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
        DoEvents
    Loop
    Set GetIEDocument = IEDocument
End Function
Extract Information
You can now load the webpage by using Set IEDocument = GetIEDocument("http://worldoftanks.com/en/tournaments/1000000017/"). The object IEDocument then contains everything you need to extract the desired data.
First you need to find the part that you want to extract (the critical "Tag", as you called it).
Since the content of a webpage is represented as a tree of HTML tags, you need to find the table tag that contains all other tags that you are interested in. You already spotted it in your 16/03/19 1600 update. The <table> tag contains two <tr> tags (table row), the first being the header row filled with <th> tags (table header) representing the header of a single column.
The second row is a dummy row representing the entry of one Team.
The preceding line <!-- ko foreach: {data: rrBrackets().teams, as: 'team' } --> is part of the Knockout framework, a JavaScript library employed by the website to dynamically fill the bare HTML tags with content. This is the reason why there is only one row in the HTML source, but in the rendered page you see nine rows: After the page is loaded, the JavaScript code loops over the list of Teams and creates a new row for each, populated with their respective data.
This, however, does not need to concern us: IEDocument contains the final version of the HTML DOM, after all loading was done (also see edit at the bottom). The first row looks actually like this (press F12 and have a look at the DOM Explorer tab for yourself):
<tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<td class="tournament-table_td" data-bind="text: team.position">1</td>
<td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<a class="tournament-table_team tournament-table_team__big" href="/en/tournaments/1000000017/team/1000006728/" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}">Pubbies</a>
</td>
<td class="tournament-table_td" data-bind="text: team.battle_played">8</td>
<td class="tournament-table_td" data-bind="text: team.wins">7</td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses">1</td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws">0</td>
<td class="tournament-table_td" data-bind="text: team.extra_statistics.points">21</td>
</tr>
Programmatically finding the tag in the first place is, however, a bit more complicated. Usually structurally important tags have an id attribute that is unique. In such a case we could simply find it by using IEDocument.getElementById("id_of_table_tag").
In this case our best bet is probably searching for the heading Tournament brackets:
<div class="wrapper">
<h2 class="tournament-heading">Tournament brackets</h2>
</div>
If you inspect the following tree of HTML tags, to get to our <table> tag we need to go one step up in the hierarchy, skip the next two tags and from there on, use the first child tag for the next two tags:
' retrieve anchor element
For Each objH2 In IEDocument.getElementsByTagName("h2")
If objH2.innerText = "Tournament brackets" Then Exit For
Next objH2
' traverse HTML tree to desired table element
' * move up one element in the hierarchy
' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
' * move down two elements in the hierarchy
' this may fail if the JavaScript code has not already populated the table
Set objTable = objH2.parentElement _
.nextSibling.nextSibling _
.nextSibling.nextSibling _
.nextSibling.nextSibling _
.children(0) _
.children(0)
As you can imagine, this is not very robust and is bound to break at any time if the layout of the webpage changes. There are other possible ways how to traverse the tree of HTML tags to finally reach the tag you seek.
See the documentation of the Document object for more.
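As one example of such an alternative, the table could be looked up by the class names visible in the HTML excerpt from the question instead of walking siblings from the heading. This is only a sketch: FindTournamentTable is a hypothetical helper name, it assumes the page is rendered in a document mode that supports querySelector, and it assumes the first element carrying both classes is the table of interest.

```vba
' Sketch: locate the results table by its CSS classes instead of traversing siblings.
' Requires a reference to the Microsoft HTML Object Library.
' FindTournamentTable is a hypothetical name, not part of the code above.
Public Function FindTournamentTable(ByVal IEDocument As MSHTML.HTMLDocument) As MSHTML.HTMLTable
    Dim objElement As Object

    ' querySelector returns Nothing when no element matches the selector
    Set objElement = IEDocument.querySelector("table.tournament-table.tournament-table__indent")
    If objElement Is Nothing Then Exit Function

    ' assumption: the first match is the table of the currently displayed group
    Set FindTournamentTable = objElement
End Function
```

A caller would check the result with Is Nothing before iterating over its rows. This still breaks if the site renames its classes, but it no longer depends on the exact number of sibling elements.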
All we need to do now is loop over the Rows of objTable and output each of its Cells.
Output to Excel
As for the output, in this example, we keep it as simple as possible.
Put together with the above, the following code just outputs the table to the current worksheet in Excel:
Public Sub GetTeamData()
    Dim strWebAddress As String
    Dim strH2AnchorContent As String
    Dim IEDocument As MSHTML.HTMLDocument
    Dim objH2 As MSHTML.HTMLHeaderElement
    Dim objTable As MSHTML.HTMLTable
    Dim objRow As MSHTML.HTMLTableRow
    Dim objCell As MSHTML.HTMLTableCell
    Dim lngRow As Long
    Dim lngColumn As Long
    ' initialize some variables that should probably better be passed as parameters or defined as constants
    strWebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"
    strH2AnchorContent = "Tournament brackets"
    ' open page
    Set IEDocument = GetIEDocument(strWebAddress)
    If IEDocument Is Nothing Then
        MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
        Exit Sub
    End If
    ' retrieve anchor element
    For Each objH2 In IEDocument.getElementsByTagName("h2")
        If objH2.innerText = strH2AnchorContent Then Exit For
    Next objH2
    If objH2 Is Nothing Then
        MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
        Exit Sub
    End If
    ' traverse HTML tree to desired table element
    ' * move up one element in the hierarchy
    ' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
    ' * move down two elements in the hierarchy
    Set objTable = objH2.parentElement _
        .nextSibling.nextSibling _
        .nextSibling.nextSibling _
        .nextSibling.nextSibling _
        .children(0) _
        .children(0)
    ' iterate over the table and output its contents
    lngRow = 1
    For Each objRow In objTable.rows
        lngColumn = 1
        For Each objCell In objRow.cells
            Cells(lngRow, lngColumn) = objCell.innerText
            lngColumn = lngColumn + 1
        Next objCell
        lngRow = lngRow + 1
    Next objRow
End Sub
Although this is only a partial solution for your current problem, this offers a general solution for how to programmatically extract data from a website using VBA.
As you said that you regularly encounter such problems, this might be of some use to you nonetheless.
Edit
In his answer, Doktor OSwaldo rightfully declares the objects as exactly what they are - in contrast to my previous version where everything was of type Object. I didn't know of the Microsoft HTML Object Library. Thanks @Doktor OSwaldo. :)
I incorporated the use of the library in my code above.
You should be aware that at the moment where objTable is set, the element might not yet exist in the DOM because of the JavaScript having not yet completely filled in all the data. You could put a loop around this statement checking if objTable was indeed successfully set:
On Error Resume Next
Do
Err.Clear
Set objTable = ...
Loop While Err
On Error GoTo 0
You should probably include a timeout option as shown in function GetIEDocument(). All of this is best moved to a separate function that also clicks the Round and Group buttons as shown in Doktor OSwaldo's answer.
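A polling loop with such a timeout could look roughly like the following. It is a sketch under assumptions: WaitForPopulatedTable is a hypothetical name, the table is looked up by the class names from the question's HTML excerpt rather than by the sibling traversal, and "populated" is taken to mean "has at least one row besides the header row".

```vba
' Sketch: wait until the results table contains data rows, or give up after a timeout.
' Requires a reference to the Microsoft HTML Object Library.
' Returns Nothing if the table was not populated in time.
Public Function WaitForPopulatedTable(ByVal IEDocument As MSHTML.HTMLDocument, _
                                      Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLTable
    Dim objCandidates As Object
    Dim objTable As MSHTML.HTMLTable
    Dim dateStart As Date

    dateStart = Now
    Do
        Set objCandidates = IEDocument.getElementsByClassName("tournament-table tournament-table__indent")
        If objCandidates.Length > 0 Then
            Set objTable = objCandidates(0)
            ' assumption: row 0 is the static header row, data rows appear once the JavaScript has run
            If objTable.Rows.Length > 1 Then
                Set WaitForPopulatedTable = objTable
                Exit Function
            End If
        End If
        DoEvents ' keep Excel responsive while polling
    Loop Until Now > DateAdd("s", lngTimeoutInSeconds, dateStart)
End Function
```

Compared with the On Error Resume Next loop, this variant cannot spin forever, and the caller can tell a timeout apart from success by testing the result with Is Nothing.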
As you probably have already noticed, the header columns are output twice. This is actually correct because of the way the icon is shown before the header text.
You can identify this with objCell.tagName = "TH" And objCell.children.length = 2, in which case you should use objCell.children(1).innerText instead of objCell.innerText to output to Excel.
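Wrapped into a small helper, that rule might look like this. HeaderAwareText is a hypothetical name; the condition and the children(1) index are taken directly from the sentence above.

```vba
' Sketch: return the text of a table cell, avoiding the doubled header text.
' Header cells hold two child elements (icon holder and heading text); only the second is used.
Public Function HeaderAwareText(ByVal objCell As MSHTML.HTMLTableCell) As String
    If objCell.tagName = "TH" And objCell.Children.Length = 2 Then
        HeaderAwareText = objCell.Children(1).innerText
    Else
        HeaderAwareText = objCell.innerText
    End If
End Function
```

In the output loop, Cells(lngRow, lngColumn) = HeaderAwareText(objCell) would then replace the direct use of objCell.innerText.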

So I've written a small Sub which I think should solve your problem, if I understood you correctly. Of course you will have to invest some work, since it only reads one stage right now. But it reads the data from every group:
Option Explicit
Private Sub CommandButton1_Click()
'make sure you add references to Microsoft Internet Controls (shdocvw.dll) and
'Microsoft HTML object Library.
'Code will NOT run otherwise.
Dim objIE As SHDocVw.InternetExplorer 'microsoft internet controls (shdocvw.dll)
Dim htmlDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library
Dim htmlInput As MSHTML.HTMLInputElement
Dim htmlColl As MSHTML.IHTMLElementCollection
Set objIE = New SHDocVw.InternetExplorer
Dim htmlCurrentDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library
Dim RowNumber As Integer
RowNumber = 1
With objIE
.Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
.Visible = 0
Do While .READYSTATE <> 4: DoEvents: Loop
Application.Wait (Now + TimeValue("0:00:01"))
Set htmlDoc = .document
Dim ButtonRoundData As Variant
Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")
Dim ButtonData As Variant
Set ButtonData = htmlDoc.getElementsByClassName("groups_link")
Dim button As HTMLLinkElement
For Each button In ButtonData
Debug.Print button.nodeName
button.Click
Application.Wait (Now + TimeValue("0:00:02")) ' This is to prevent double entries, but it is not clean. You should definitely check whether the table is still the same and wait in that case
Set htmlCurrentDoc = .document
Dim RawData As HTMLTable
Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)
Dim ColumnNumber As Integer
ColumnNumber = 1
Dim hRow As HTMLTableRow
Dim hCell As HTMLTableCell
For Each hRow In RawData.Rows
For Each hCell In hRow.Cells
Cells(RowNumber, ColumnNumber).Value = hCell.innerText
ColumnNumber = ColumnNumber + 1
Next hCell
ColumnNumber = 1
RowNumber = RowNumber + 1
Next hRow
RowNumber = RowNumber + 3
Next button
End With
End Sub
What it does is start an invisible IE, read the data, click the button, read the next group and so on ...
For debugging I suggest setting .Visible to 1, so you will see what happens.
EDIT 1: If you get a debugging error, try to abort and run it again. It definitely needs some error handling in case the website isn't loaded right.
EDIT 2: Made it a bit more stable. You should really pay attention: since the webpage takes some time to load, you MUST check whether the data has changed before writing it. If it hasn't changed, wait a second or so and then try again.
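The "check whether the data has changed" step could be sketched as follows. WaitForTableChange and strPreviousText are hypothetical names; the idea is to remember the table's innerText before clicking a group button and to poll until it differs, with a timeout so the loop cannot hang.

```vba
' Sketch: after button.Click, wait until the table text differs from the previous snapshot.
' Requires a reference to the Microsoft HTML Object Library.
' Returns True if a change was seen, False if the timeout was reached.
Public Function WaitForTableChange(ByVal htmlDoc As MSHTML.HTMLDocument, _
                                   ByVal strPreviousText As String, _
                                   Optional ByVal lngTimeoutInSeconds As Long = 10) As Boolean
    Dim objTables As Object
    Dim dateStart As Date

    dateStart = Now
    Do
        Set objTables = htmlDoc.getElementsByClassName("tournament-table tournament-table__indent")
        If objTables.Length > 0 Then
            If objTables(0).innerText <> strPreviousText Then
                WaitForTableChange = True
                Exit Function
            End If
        End If
        DoEvents ' keep Excel responsive while polling
    Loop Until Now > DateAdd("s", lngTimeoutInSeconds, dateStart)
End Function
```

This would replace the fixed two-second Application.Wait. Two caveats: clicking the group that is already displayed never changes the text, so that case ends in a timeout, and a change may be detected while the page is still filling in rows, so a short extra wait may still be needed.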
Here is some sample data I got in Excel:

Related

Obtain Innertext from Web Element with Variable Path - Selenium

I have a VBA macro that I'm running in Excel 2016. The macro brings back information from the internet using Chrome and Selenium WebDriver. The macro iterates through several similar webpages, but some pages have a few more or less lines than others. Hence, the XPath to the innertext I'm interested in varies slightly from page to page. Here is a snippet of the source code for the element, it is the "242" that I'm trying to locate and extract.
<div ng-repeat="squarefootage in improvement.SquareFootage" class="ng-scope">
<div>
<span class="labelSquareFootage ng-binding">ATTACHED GARAGE AREA </span><span class="result ng-binding">242</span>
</div>
</div>
As a workaround I'm just grabbing the entire source code for the page and then parsing it with INSTR to find what I'm looking for. I was wondering if there was a more elegant method to find an element with a variable path? Is there something in WebDriver that would work like
WDriver.FindElementbyInnerHTML
?
Here is a link to the website, you can look at a few different addresses and see how the path changes from page (address) to page (next address).
You could gather all nodes with matching class and loop until desired garage text found then take the nextSibling
Public Sub Demo()
'Your code to get to page and enter address and search, open heading, then....
Dim html As MSHTML.HTMLDocument
Set html = New MSHTML.HTMLDocument
html.body.innerHTML = WDriver.PageSource
Dim nodes As Object, node As Object, i As Long
Set nodes = html.querySelectorAll(".labelSquareFootage")
For i = 0 To nodes.Length - 1
Set node = nodes.Item(i)
If InStr(node.innerText, "ATTACHED GARAGE AREA") > 0 Then
Debug.Print node.NextSibling.innerText
Exit For
End If
Next i
End Sub
For xpath, you could try
//*[text()[contains(.,'ATTACHED GARAGE AREA')]]/following-sibling::span
if the desired value is the next span node. This searches for the desired text in the .innerText then takes the nextSibling span.
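Used from VBA, the XPath route might look like the sketch below. It assumes WDriver is a driver object from the SeleniumBasic wrapper that has already navigated to the page, and that its FindElementByXPath method and the element's Text member behave as in that wrapper; GarageAreaText is a hypothetical name.

```vba
' Sketch: read the value next to the "ATTACHED GARAGE AREA" label via XPath.
' Assumption: WDriver is a SeleniumBasic driver that is already on the results page.
Public Function GarageAreaText(ByVal WDriver As Object) As String
    Dim strXPath As String

    strXPath = "//*[text()[contains(.,'ATTACHED GARAGE AREA')]]/following-sibling::span"
    GarageAreaText = WDriver.FindElementByXPath(strXPath).Text
End Function
```

Because the expression anchors on the label text rather than on a position in the page, it should keep working when a page has more or fewer lines above the garage entry.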
CSS selectors

Internet Explorer click an icon and a button of a specific line of a table

I asked a similar question two days ago, but I now stumble on a similar yet somewhat different problem (previous question asked on a related problem).
I have a report of many lines with the same structure. I need to click an icon that is on the nth line. That report is structured in cells so I know that my icon is in the first position (column) of that report. After I have click that icon I'll also have to click on a button in the 10th column.
I already know how to access the page in question with that code
Sub click_button_no_hlink()
Dim i As Long
Dim IE As Object
Dim Doc As Object
Dim objElement As Object
Dim objCollection As Object
Set IE = CreateObject("InternetExplorer.Application") 'create IE instance
IE.Visible = True
IE.Navigate "https://apex.xyz.qc.ca/apex/prd1/f?p=135:LOGIN_DESKTOP::::::" ' Adress of web page
While IE.Busy: DoEvents: Wend 'loading page
This first part is easy, isn't it? And I know how to handle it. Afterward I tried different variations around this, but it either does nothing or I get an error message. Obviously I don't fully understand what I'm doing with the "querySelector" thing…
dim step_target as string
step_target = 2
'identify all the lines of my table containing lines, containing icons
'and button to click on
Set objCollection = IE.document.getElementsByClassName("highlight-row")
i = 0
Do While i < objCollection.Length
'cell 2 is the one containing the step I'm targetting
If objCollection.Item(i).Cells(2).innerText = step_target Then
'that's not doing anything
objCollection.Item(i).Cells(9).Click
'tried many syntax around this with no luck
IE.document.querySelector([objCollection.Item(i).Cells(9)]).FireEvent ("onclick")
End If
i = i + 1
Loop
Here's images of the code of the page
Showing all the lines of the report
Showing all code lines of a particular line
and now the code of that first icon I need to click on (this is where I need help ;-) how can I call that action)
and finally the code of that button I also need to click on
Again, I thank you all in advance, for the time you'll take to help me along this.
For the first (the icon), you could try an attribute selector in combination with a descendant combinator and a type selector:
ie.document.querySelector("[headers='ID_DET_DEM_TRAV_STD'] a").click
For the second (the button), you could try an attribute selector in combination with a descendant combinator and an input type selector:
ie.document.querySelector("[headers='BOUTON1'] input").click
An alternative for the second is:
ie.document.querySelector("[value=Fait]").click
Typically, if you want to select by position e.g. 1 and 10th columns you would use
td:nth-of-type(1)
td:nth-of-type(10)
Though you would also use a tr:nth-of-type(n) to get the right row as well e.g. first row, first col. Then add in any child type selector, for example, that you might need.
ie.document.querySelector("tr:nth-of-type(1) td:nth-of-type(1)")
Child a tag:
ie.document.querySelector("tr:nth-of-type(1) td:nth-of-type(1) a")
A child input tag, here in the fourth row and tenth column, would then be:
IE.document.querySelector("tr:nth-of-type(4) td:nth-of-type(10) input").Click
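Since the row and column numbers are the only parts that vary, the selector string can be assembled by a small function. This is a sketch; CellSelector is a hypothetical helper name and it only builds the string, it does not touch the page.

```vba
' Sketch: build a positional CSS selector such as "tr:nth-of-type(4) td:nth-of-type(10) input".
' lngRow and lngColumn are 1-based, as nth-of-type expects.
Public Function CellSelector(ByVal lngRow As Long, ByVal lngColumn As Long, _
                             Optional ByVal strChildTag As String = "") As String
    CellSelector = "tr:nth-of-type(" & lngRow & ") td:nth-of-type(" & lngColumn & ")"
    ' optionally narrow down to a child element inside the cell, e.g. "a" or "input"
    If Len(strChildTag) > 0 Then CellSelector = CellSelector & " " & strChildTag
End Function
```

For instance, IE.document.querySelector(CellSelector(4, 10, "input")).Click would target the button in the tenth column of the fourth row. Note that nth-of-type counts rows within their parent element, so header rows in the same parent are counted too.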

How to click on a button on a webpage using <td> and <tr>?

I am trying to click on the first "Completed" button in the highlighted part of the webpage below.
Here is a piece of the HTML code of the web page:
I tried to click on the FIRST completed button in many different ways such as :
For Each element In ie3.getElementsByTagName("main_table_data_right_border main_table_data_bottom_border")(5)
If element.innerText = "Completed" Then
' Application.Wait (Now + TimeValue("0:03:00"))
element.Click
Application.Wait (Now + TimeValue("0:00:20"))
Exit For
Else
End If
Next
Or
doc.querySelector("#divPage > table.advancedSearch_table > tbody") _
.getElementsByTagName("tr")(3).getElementsByTagName("td")(5).Children(0).Click
But none of them seem to work. When I debug the code and I go through this part and this particular line, nothing really happens. So the button is not being clicked.
Can anyone help me with that?
You could use the getElementsByTagName method to find the hyperlink. Please refer to the following sample:
VBA code to find the hyperlink and click the button (in this sample, I just find the special cell in the first row. If you want to loop through the hyperlink, you need to use For Each statement to loop through the array).
Sub Test()
    Dim ie As Object
    Dim doc As Object
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate "http://localhost:54382/HtmlPage47.html"
    ' wait until the page has finished loading (READYSTATE_COMPLETE = 4)
    Do While ie.Busy Or ie.ReadyState <> 4
        DoEvents
    Loop
    Set doc = ie.document
    ' second <tr> (first data row), sixth <td> (Status column), first <a> inside that cell
    doc.getElementsByTagName("tr")(1).getElementsByTagName("td")(5).getElementsByTagName("a")(0).Click
End Sub
Code in the Web page:
<div>
<table class="main_table" style="text-align:center;">
<tr class="main_table_header">
<td></td>
<td>Export Type</td>
<td>Criteria</td>
<td>Rep./List</td>
<td>Creation Date</td>
<td>Status</td>
<td>Reference</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello AA')" id="link1" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello BB')" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello CC')" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
<tr class="main_table_data">
<td>
<input id="Checkbox1" type="checkbox" />
</td>
<td>Activites</td>
<td>Process Date from 2019/07/02 to 2019/07/02</td>
<td>For an advanced search</td>
<td>2019/07/03</td>
<td><a onclick="javascript:alert('hello DD')" href="#">Completed</a> (601 lines)</td>
<td>662602308</td>
</tr>
</table>
</div>
The result is like this:
I see you are a bit confused as to how to access HTML elements, so I'll take this opportunity to demonstrate the logic of doing so in a very detailed manner, which I also believe to be very intuitive. There are other ways to do it, but I believe the following one is the most comprehensive and intuitive one and ideal for a beginner.
Firstly, I will go ahead and assume that ie3 is an InternetExplorer object.
When you use this object to navigate to a page, you can access the html of that page by using the ie3.document, which holds an HTML document object.
To take full advantage of the HTML document object you should add a reference to the Microsoft HTML Object Library. This Library will allow you to use a number of HTML elements which make your life easier.
In your case, the elements you want to be able to access are
HTML tables and their rows and cells
HTML anchor elements (<a></a>)
So my declarations would be the following:
Dim ie3 As New InternetExplorer 'To be used to navigate to the page of interest
Dim doc As HTMLDocument 'this will hold the HTML document corresponding to the page
Dim toBeClicked As HTMLAnchorElement 'To be used to store the <a></a> element
Dim table As HTMLTable 'To be used to store the table element
Dim tableRow As HTMLTableRow 'To be used to store a row of the table element
Dim tableCell As HTMLTableCell 'To be used to store a cell of the table element
Assuming that you have already used ie3 to navigate to the website of interest, you can store its HTML document in doc like so:
Set doc = ie3.document
Once you have access to the HTML document of the webpage, you can also get access to its elements in a number of ways, some more targeted than others. Below I am demonstrating the most common methods to do that, using the table element as an example.
If the table has a unique ID, you can get access to it by using the .getElementById() method. This method returns a single element. In your case, the table you're after doesn't have an ID.
If the table belongs to a class, you can get access to it by using the .getElementsByClassName() method. This method returns a collection of elements, all of which belong to the same class. To get access to a member of this collection you index it in parentheses, e.g. collection(0). The first member has an index of 0. In your case the table belongs to class "advancedSearch_table", which happens to only have one member.
If there's no class or ID you can use the .getElementsByTagName() method. This method returns a collection of all the elements that have the same tag. In your case you would need all the tables in the document. Members of this collection are indexed in the same way, starting at 0. Tags in HTML look like so <tagName attribute="something">Something</tagName>.
Below I demonstrate all three methods. You can use either one of the first two:
Set table = doc.getElementsByClassName("advancedSearch_table")(0)
Set table = doc.getElementsByTagName("table")(0)
Set table = doc.getElementById("ID of the table") 'only for demonstration purposes, it doesn't apply to your case, as the table has no ID.
Keep in mind that in your case, there is only one table in the document and there's only one element that belongs to the class "advancedSearch_table". This means that you need the first element of the corresponding collections. That's why I use 0 as index.
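To try the three lookups without a live page, here is a minimal sketch that builds a small HTML document in memory. The markup and the id "demoTable" are invented for illustration (the real table has no ID), and the sketch assumes a reference to the Microsoft HTML Object Library and an MSHTML version that supports getElementsByClassName.

```vba
Option Explicit

' Requires a reference to the Microsoft HTML Object Library.
' Finds the same table three ways and returns the row count seen by each lookup.
Public Function DemoLookups() As String
    Dim doc As New HTMLDocument
    Dim byClass As HTMLTable, byTag As HTMLTable, byId As HTMLTable

    ' Hypothetical markup; the id "demoTable" is an assumption for this demo only
    doc.body.innerHTML = _
        "<table id=""demoTable"" class=""advancedSearch_table"">" & _
        "<tr><td>r1c1</td><td>r1c2</td></tr>" & _
        "<tr><td>r2c1</td><td>r2c2</td></tr>" & _
        "</table>"

    ' Collections are zero-based, so (0) picks the first (here: only) match
    Set byClass = doc.getElementsByClassName("advancedSearch_table")(0)
    Set byTag = doc.getElementsByTagName("table")(0)
    ' getElementById returns a single element, so no index is needed
    Set byId = doc.getElementById("demoTable")

    DemoLookups = byClass.Rows.Length & "," & byTag.Rows.Length & "," & byId.Rows.Length
End Function
```

Calling Debug.Print DemoLookups() from the Immediate window should show the same row count three times, since all three variables point at the same table.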
By the same logic as above, now that the table has been stored, you can get access to its rows and cells. More specifically, you need the 5th cell of the 4th row. That's where the link that you want to click is:
Set tableRow = table.getElementsByTagName("tr")(3)
Set tableCell = tableRow.getElementsByTagName("td")(4)
Finally, now that the cell of interest has been stored, you can access the anchor element and click it. Again, there's only one anchor element in the cell, so it's going to be the first one in the corresponding collection:
Set toBeClicked = tableCell.getElementsByTagName("a")(0)
toBeClicked.Click
BONUS
If you want to click on all the "Completed" links, one by one, you need to loop through the corresponding elements. Here are two ways to do it:
Click on the anchor in the 5th cell of each row:
For Each tableRow In table.Rows
Set toBeClicked = tableRow.getElementsByTagName("td")(4).getElementsByTagName("a")(0)
toBeClicked.Click
Next tableRow
Loop through all rows and through all cells of the table, find the inner text that you're looking for and click the corresponding anchor:
For Each tableRow In table.Rows
For Each tableCell In tableRow.Cells
If tableCell.innerText = "Something" Then
Set toBeClicked = tableCell.getElementsByTagName("a")(0)
toBeClicked.Click
End If
Next tableCell
Next tableRow
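Both loops above assume that every row contains the expected cell and anchor. As a hedged variation, the sketch below checks for them first, so a header or spacer row does not raise an error. The function name ClickAnchorsInColumn and its parameters are hypothetical helpers, not part of any library.

```vba
Option Explicit

' Requires a reference to the Microsoft HTML Object Library.
' Clicks the first anchor found in the given (0-based) cell of every row that
' has one, and returns how many anchors were clicked.
Public Function ClickAnchorsInColumn(ByVal table As HTMLTable, ByVal cellIndex As Long) As Long
    Dim tableRow As HTMLTableRow
    Dim anchors As Object
    Dim clicked As Long

    For Each tableRow In table.Rows
        ' Header or spacer rows may have fewer cells, so check before indexing
        If tableRow.Cells.Length > cellIndex Then
            Set anchors = tableRow.Cells(cellIndex).getElementsByTagName("a")
            ' Only click when the cell actually contains an anchor
            If anchors.Length > 0 Then
                anchors.Item(0).Click
                clicked = clicked + 1
            End If
        End If
    Next tableRow

    ClickAnchorsInColumn = clicked
End Function
```

With the declarations from earlier, a call such as ClickAnchorsInColumn(table, 4) would correspond to the first loop, with the 5th cell of each row as the target.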
Here, once you click on the Completed hyperlink, JavaScript gets executed and it opens an Excel file. You can call that script directly with ie3.Navigate "javascript:openExcelFile('t83_Kerrfinancialadvisorsinc/455X3/ExportActivity_66260230820190703122002139.xlsx')"
Since it's tied with a hyperlink, you can also try using
element.Click
element.FireEvent ("onclick")
or you can use execScript
Call ie3.document.parentWindow.execScript("your script in webpage", "JavaScript")

Using MS Excel VBA to extract data from complex HTML/JS

A short introduction: I consider myself an intermediate VBA coder without any significant HTML experience. I would like to extract data from an HTML/JS webpage using MS Excel VBA. I have spent a couple of hours testing my code on various pages, as well as looking through training materials, forums, and Q&A pages.
I am desperately asking for your help. (Office 2013, IE 11.0.96)
The goal is to get the FX rate from a certain Bloomberg webpage. The long-term goal is to run a macro on various exchange rates and write the daily rate to an Excel table each working day, but I will handle that part myself.
I would be happy either with
(1)the current rate (span class="priceText__1853e8a5") or
(2) previous closing (section class="dataBox opreviousclosingpriceonetradingdayago numeric") or
(3) opening rate (section class="dataBox openprice numeric").
My issue is that I cannot fetch the part of the html code where the rate is.
Dim IE As Object
Dim div As Object, holdingsClass As Object, botoes As Object
Dim html As HTMLDocument
Set IE = CreateObject("InternetExplorer.Application")
With IE
.Visible = False
.Navigate "https://www.bloomberg.com/quote/EURHKD:CUR"
Do Until .ReadyState = 4: DoEvents: Loop
End With
Set html = IE.document
Set div = IE.document.getElementById("leaderboard") 'works just fine, populates the objects
Set holdingsClass = IE.document.getElementsByclass("dataBox opreviousclosingpriceonetradingdayago numeric") 'i am not sure is it a class element at all
Set botoes = IE.document.getElementsByTagName("dataBox openprice numeric") 'i am not sure is it a tag name at all
Range("a1").Value = div.textContent 'example how i would place it by using .textContent
Range("A2").Value = holdingsClass.textContent
Range("A3").Value = botoes.textContent
Much appreciate your help!
Instead of digging through HTML, why not use the Bloomberg API to request the specific rate?
It would likely be faster and would save you a lot of time in the future when doing the same kind of thing.
Please see my similar project, where I created a macro to pull historical FX rates from the European Central Bank.
https://github.com/dmegaffi/VBA-GET-Requests/blob/master/FX%20-%20GET.xlsm
If you right-click the webpage element you want in Chrome and select Inspect, it'll bring up the details of that element. You can also press F12 to bring up the HTML of any page. This also works in other browsers.
Is this the element you're looking for?
[Screenshot of the mentioned webpage]
Based on your code above, you could reference this element with IE.document.getElementsByClassName("priceText__1853e8a5")(0). Note that the method is getElementsByClassName, not getElementsByclass, and that it returns a collection rather than a single element: elements in HTML can share classes but can't share IDs, so if another element with the class priceText__1853e8a5 comes first, you need a different index. Then, of course, you have to read the text within the element, since at this point you'd just have the <span> element itself and would need the text inside of it.
Hope this helps.
To address your questions generally, see below.
(1)the current rate (span class="priceText__1853e8a5")
That can be written as a CSS query selector of:
span.priceText__1853e8a5
(2) previous closing (section class="dataBox opreviousclosingpriceonetradingdayago numeric")
That can be written as a CSS query selector of:
.dataBox.opreviousclosingpriceonetradingdayago.numeric
(3) opening rate (section class="dataBox openprice numeric")
That can be written as a CSS query selector of:
.dataBox.openprice.numeric
They are applied with querySelector or querySelectorAll (if more than one match and a later match than the first is required) of HTMLDocument.
E.g.
Debug.Print IE.document.querySelector("span.priceText__1853e8a5").innerText
If there is more than one match, use querySelectorAll:
Debug.Print IE.document.querySelectorAll("span.priceText__1853e8a5")(0).innerText
In the above you replace 0 with the appropriate index where your target element is found.
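To see which index holds the target element, it can help to list every match first. The following is a minimal sketch, assuming a reference to the Microsoft HTML Object Library; TextOfMatches is a hypothetical helper name. It uses an index loop with .Length and .Item because iterating a querySelectorAll result with For Each tends to be unreliable in VBA.

```vba
Option Explicit

' Requires a reference to the Microsoft HTML Object Library.
' Joins the inner text of every element matching a CSS selector with "|",
' so the position of the wanted element can be read off.
Public Function TextOfMatches(ByVal doc As HTMLDocument, ByVal selector As String) As String
    Dim matches As Object
    Dim i As Long
    Dim result As String

    Set matches = doc.querySelectorAll(selector)

    ' Zero-based index loop over the static node list
    For i = 0 To matches.Length - 1
        If i > 0 Then result = result & "|"
        result = result & matches.Item(i).innerText
    Next i

    TextOfMatches = result
End Function
```

For example, Debug.Print TextOfMatches(IE.document, ".dataBox.openprice.numeric") would print all matching texts in document order; an empty string means the selector matched nothing in the document that IE actually rendered.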
Observing the page, the actual selectors appear to be as follows. However, I think this website is probably using ECMAScript syntax that is not supported on legacy browsers such as Internet Explorer, or is attempting blocked cross-domain requests, so the page may not render in IE.
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer
With IE
.Visible = True
.navigate "https://www.bloomberg.com/quote/EURHKD:CUR"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Debug.Print "Current: " & .querySelector(".priceText__1853e8a5").innerText
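Debug.Print "Prev close: " & .querySelector(".value__b93f12ea").innerText
' Note: same selector as the line above, so querySelector returns the same first match;
' querySelectorAll with the appropriate index is needed to reach a later element
Debug.Print "Open: " & .querySelector(".value__b93f12ea").innerText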
End With
.Quit
End With
End Sub
Using Selenium Basic and Chrome the page renders fine:
Option Explicit
Public Sub GetInfo()
Dim d As WebDriver
Set d = New ChromeDriver
Const URL = "https://www.bloomberg.com/quote/EURHKD:CUR"
With d
.Start "Chrome"
.get URL
Debug.Print "Current: " & .FindElementByCss(".priceText__1853e8a5").Text
Debug.Print "Prev close: " & .FindElementByCss(".value__b93f12ea").Text
' Note: same selector as the line above, so this prints the same first match
Debug.Print "Open: " & .FindElementByCss(".value__b93f12ea").Text
.Quit
End With
End Sub

extracting text from a specific <h> element using GetElementById

I have created a VBS script file that looks at an XML data file.
Within the XML data file, the HTML data I need is embedded within a CDATA section:
<![CDATA[ 'other interesting HTML data here' ]]>
I have stripped out this HTML data using XPath and inserted it into a Div element (myDiv) that is held in a variable (it's not written to a document).
So for example, the contents of myDiv.innerHTML looks like this;
<table>
<tr><td>text in cell 1</td></tr>
<tr><td><h1 id="myId1">my text for H1</h1></td></tr>
<tr><td><h2 id="myId2">my text for h2</h2></td></tr>
</table>
What I want to do at first is simply select the appropriate tag with the Id that matches "myId1", therefore, I used a statement like this;
MyIdText = MyDiv.getElementById("myId1")
However, the application I am using says "Err 438, Object doesn't support this property or method".
I am a bit of a newbie with code and can understand some of the basic fundamentals, but get a bit lost when it becomes more complex (sorry). I have looked through other postings on this board, and all of them seem to relate to HTML and JavaScript, not VBScript (the application I am using will not allow JavaScript).
Am I using the code wrong?
To use getElementById() you should write: document.getElementById("myId1"). This tells the browser to search inside 'document' for the specified ID. getElementById is a method of the document object, not of individual elements, so your MyDiv variable does not have this method attached and your code generates the above error.
To extract the text inside the specific H element:
MyIdText = document.getElementById("myId1").textContent
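When the markup lives in a detached element that is never written to a document, as described in the question, document.getElementById cannot see it. One possible workaround, sketched here in VBA rather than VBScript (in VBScript the As clauses would simply be dropped), is to scan the element's descendants and compare their id. FindDescendantById is a hypothetical helper, not a DOM method.

```vba
Option Explicit

' Returns the first descendant of parent whose id equals targetId,
' or Nothing when no descendant matches.
Public Function FindDescendantById(ByVal parent As Object, ByVal targetId As String) As Object
    Dim candidates As Object
    Dim i As Long

    ' "*" asks for every descendant element, whatever its tag
    Set candidates = parent.getElementsByTagName("*")

    For i = 0 To candidates.Length - 1
        If candidates.Item(i).ID = targetId Then
            ' Set is required because the result is an object reference
            Set FindDescendantById = candidates.Item(i)
            Exit Function
        End If
    Next i
End Function
```

A call such as FindDescendantById(MyDiv, "myId1").innerHTML would then read the heading text; checking the result with Is Nothing first avoids an error when the id is absent.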
Many thanks for the help. Unfortunately, I know a little VBS and even less about the DOM, and I am trying to learn both by experimenting. There are certain restrictions within the environment/application I am working with (it's called ASCE and it's a tool for managing Safety Cases, but that's not important right now).
However, so that we are comparing apples with apples, I have tried to experiment within an HTML page to give me a better understanding of what the DOM/VBS commands can actually do. I have had some partial success, but still can't understand why it falls over where it does.
Here is the exact file I am experimenting with; I have added comment text for each section:
<html>
<head>
<table border=1>
<tr>
<td>text in cell 1</td>
</tr>
<tr>
<td><h1 id="myId1">my text for H1</h1></td>
</tr>
<tr>
<td><h1 id="myId2">my text for h2</h2></td>
</tr>
</table>
<script type="text/vbscript">
DoStuff
Sub DoStuff
' Section 1: Get a node with the Id value of "myId1" from the above HTML
' and assign it to the variable 'GetValue'
' This works fine :-)
Dim GetValue
GetValue = document.getElementById("myId1").innerHTML
MsgBox "the text=" & GetValue
' Section 2: Create a query that assigns all of the <h1> tags in the document
' to the variable 'MyH1Tags'.
' I assumed that this would be a collection of <h1> tags, so I set up a loop to iterate
' through however many there were, but this fails as the browser says that this object
' doesn't support this property or method - This is where I am stuck
Dim MyH1Tags
Dim H1Tag
MyH1Tags = document.getElementsByTagName("h1") ' this works
For Each H1Tag in MyH1Tags ' this is where it falls over
MSgbox "Hello"
Next
' Section 3: Create a new Div element 'NewDiv' and then insert some HTML 'MyHTML'
' into 'NewDiv'. Create a query 'MyHeadings' that extracts all h1 headings from 'NewDiv'
' then loop round for however many h1 headings there are in 'MyHeadings'
' and display the text content. This works Ok
Dim NewDiv
Dim MyHTML
Dim MyHeadings
Dim MyHeading
Set NewDiv = document.createElement("DIV")
MyHTML="<h1 id=""a"">heading1</h1><h2 id=""b"">Heading2</h2>"
NewDiv.innerHTML=MyHTML
Set MyHeadings = NewDiv.getElementsByTagName("h1")
For Each MyHeading in MyHeadings
Msgbox "MyHeading=" & MyHeading.innerHTML
Next
'Section 4: Do a combination of Section 1 (that works) and Section 3 (that works)
' by creating a new Div element 'NewDiv2' and then paste into it some HTML
' 'MyHTML2' and then attempt to create a query that extracts the inner HTML from
' an id attribute with the value of "a". But this doesnt work either.
' I have tried "Set MyId = NewDiv2.getElementById("a").innerHTML" and
' also tried "Set MyId = NewDiv2.getElementById("a")" and it always falls over
' at the same line.
Dim NewDiv2
Dim MyHTML2
Dim MyId
Set NewDiv2 = document.createElement("DIV")
MyHTML2="<h1 id=""a"">heading1</h1><h2 id=""b"">Heading2</h2>"
NewDiv2.innerHTML=MyHTML
MyId = NewDiv2.getElementById("a").innerHTML
End Sub
</script>
</head>
<body>