VBA Excel Scraping - html

I am just getting started with scraping. I have a page that sits behind a login, and I remember reading that you should not rely on the (1), (2) or (3) index after getElementsByTagName, but should instead target something more unique, such as a class name or an ID. Can someone please tell me why the first snippet below works and the second does not?
This GetTag version works:
Dim Companyname As String
Companyname = ie.document.getElementsByTagName("span")(1).innertext
This GetClass version does not work:
Dim Companyname As String
Companyname = ie.document.getElementsByClassName("account-website-name").innertext
This is the text that I am scraping
<span class="account-website-name" data-journey-name="true">Dwellington Journey</span>

getELEMENTbyProperty vs getELEMENTSbyProperty
There are primarily two distinct types of commands to retrieve one or more elements from a web page's .Document; those that return a single object and those that return a collection of objects.
Getting an ELEMENT
When getElementById is used, you are asking for a single object (e.g. MSHTML.IHTMLElement). In this case the properties (e.g. .Value, .innerText, .outerHtml, etc.) can be retrieved directly. There isn't supposed to be more than a single unique id property within an HTML body, so this function should safely return the only element within the ie.document that matches.
'typical VBA use of getElementById
Dim CompanyName As String
CompanyName = ie.document.getElementById("CompanyID").innerText
Caveat: I've noticed a growing number of web designers who seem to think that using the same id for multiple elements is oh-key-doh-key as long as the id's are within different parent elements like different <div> elements. AFAIK, this is patently wrong but seems to be a growing practise. Be careful on what is returned when using .getElementById.
Getting ELEMENTS
When using getElementsByTagName, getElementsByClassName, etc., where the word Elements is plural, you are returning a collection (e.g. MSHTML.IHTMLElementCollection) of objects, even if that collection contains only one element or none at all. If you want to use these to directly access a property of one of the elements within the collection, an ordinal index number must be supplied so that a single element within the collection is referenced. The index number within these collections is zero-based (i.e. the first element is at (0)).
'retrieve the text from the third <span> element on a webpage
Dim CompanyName As String
CompanyName = ie.document.getElementsByTagName("span")(2).innerText
'output all <span> classnames to the Immediate window until the right one comes along
'retrieve the text from the first <span> element with a classname of 'account-website-name'
Dim e as long, es as long
es = ie.document.getElementsByTagName("span").Length - 1
For e = 0 To es
Debug.Print ie.document.getElementsByTagName("span")(e).className
If ie.document.getElementsByTagName("span")(e).className = "account-website-name" Then
CompanyName = ie.document.getElementsByTagName("span")(e).innerText
Exit For
End If
Next e
'same thing, different method
Dim eSPN as MSHTML.IHTMLElement, ecSPNs as MSHTML.IHTMLElementCollection
Set ecSPNs = ie.document.getElementsByTagName("span") 'object references must be assigned with Set
For Each eSPN in ecSPNs
Debug.Print eSPN.className
If eSPN.className = "account-website-name" Then
CompanyName = eSPN.innerText
Exit For
End If
Next eSPN
Set eSPN = Nothing: Set ecSPNs = Nothing
To summarize, if your InternetExplorer method uses Elements (plural) rather than Element (singular), you are returning a collection, which must have an index number appended if you wish to treat one of the elements within the collection as a single element.
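Applied to the failing line in the question, that rule suggests indexing the class collection before reading .innerText. The following is a minimal sketch, assuming ie.document is the already-loaded page; FirstTextByClass is a hypothetical helper name, not something provided by MSHTML.

```vba
' Hypothetical helper: returns the text of the first element that has the
' given class name, or an empty string when nothing on the page matches.
Public Function FirstTextByClass(ByVal doc As Object, ByVal className As String) As String
    Dim matches As Object

    ' getElementsByClassName returns a collection, never a single element
    Set matches = doc.getElementsByClassName(className)

    ' Guard against an empty collection before indexing into it
    If matches.Length > 0 Then
        FirstTextByClass = matches(0).innerText   ' index 0 is the first match
    End If
End Function
```

It could then be called as CompanyName = FirstTextByClass(ie.document, "account-website-name"). The length check is a design choice: it returns an empty string instead of raising a run-time error when the class is missing.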

CSS selector:
You can achieve the same thing with a CSS selector of .account-website-name
The "." means className. This will return a collection of matching elements if there are more than one.
VBA:
You apply the selector with the .querySelectorAll method of .document. This returns a nodeList which you traverse the .Length of, accessing items by index, starting from 0.
Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll(".account-website-name")
For i = 0 To aNodeList.Length -1
Debug.Print aNodeList.Item(i).innerText
' Debug.Print aNodeList(i).innerText ''<== sometimes this syntax instead
Next

Related

Concatenate Rich Text Fields (HTML) and display result on Access form

I have an access database which deals with "articles" and "items" which are all textual stuff. An article is composed of several items. Each item has a rich text field and I wish to display the textual content of an article by concatenating all rich text fields of its items.
I have written a VBA program which concatenates the items rich text fields and feeds this into an independent TextBox control on my form (Textbox.Text = resulting string) but it does not work, I get an error message saying "this property parameter is too long".
If I try to feed a single textual field into the Textbox control, I get another error stating "Impossible to update the recordset" which I do not understand, what recordset is this about ?
Each item field is typically something like this (I use square brackets instead of "<" and ">" because otherwise the post does not display correctly): [div][font ...]Content[/font] [/div], with "[em]" tags also included.
In front of my problem, I have a number of questions :
1) How do you feed an HTML string into an independent Textbox control ?
2) Is it OK to concatenate these HTML strings or should I modify tags, for example have only one "[div]" block instead of several in a row (suppress intermediate div tags) ?
3) What control should I use to display the result ?
You might well answer that I might as well use a subform displaying the different items of which an article is made up. Yes, but it is impossible to have a variable height for each item, and the reading of the whole article is very cumbersome
Thank you for any advice you may provide
It works for me with a simple function:
Public Function ConcatHtml() As String
Dim RS As DAO.Recordset
Dim S As String
Set RS = CurrentDb.OpenRecordset("tRichtext")
Do While Not RS.EOF
' Visually separate the records, it works with and without this line
If S <> "" Then S = S & "<br>"
S = S & RS!rText & vbCrLf
RS.MoveNext
Loop
RS.Close
ConcatHtml = S
End Function
and an unbound textbox with control source =ConcatHtml().
In your case you'd have to add the article foreign key as parameter to limit the item records you concatenate.
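The parameterised version mentioned above might look like the sketch below. tRichtext and rText come from the function above; the ArticleID column and the function name ConcatHtmlForArticle are assumptions made for illustration, so substitute your own foreign key field.

```vba
' Sketch: concatenate the rich text of the items that belong to one article.
' ArticleID is an assumed foreign key column in tRichtext.
Public Function ConcatHtmlForArticle(ByVal articleId As Long) As String
    Dim rs As DAO.Recordset
    Dim sql As String
    Dim s As String

    sql = "SELECT rText FROM tRichtext WHERE ArticleID = " & articleId
    Set rs = CurrentDb.OpenRecordset(sql, dbOpenSnapshot)

    Do While Not rs.EOF
        ' Visually separate the items
        If Len(s) > 0 Then s = s & "<br>"
        ' Nz turns a Null rich text field into an empty string
        s = s & Nz(rs!rText, "") & vbCrLf
        rs.MoveNext
    Loop

    rs.Close
    ConcatHtmlForArticle = s
End Function
```

The unbound textbox would then use a control source along the lines of =ConcatHtmlForArticle([ArticleID]), assuming the form exposes a field with that name.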
The "rich text" feature of a textbox is only intended for simple text.
We use the web browser control to display a larger amount of HTML text, and load it like this:
Private Sub Form_Current()
LoadWebPreview
End Sub
Private Sub HtmlKode_AfterUpdate()
LoadWebPreview
End Sub
Private Sub LoadWebPreview()
' Let the browser control finish the rendering of its standard content.
While Me!WebPreview.ReadyState <> acComplete
DoEvents
Wend
' Avoid the pop-up warning about running scripts.
Me!WebPreview.Silent = True
' Show body as it would be displayed in Outlook.
Me!WebPreview.Document.body.innerHTML = Me!HtmlBody.Value
End Sub

Retrieving the text between the <div> with VBA

I am trying to get a text string from inside a div on a webpage, but I can't seem to figure out how it is stored in the element.
Set eleval = objIE.Document.getElementsByClassName("outputValue")(0)
Debug.Print (eleval.innerText)
I have tried this and variations thereof, but my string just reads as "".
I mainly need help understanding how this type of data is referenced in VBA.
<div class="outputValue">"text data that I want"</div>
Here is a screenshot of the page in question, I cannot give a link since it requires a company login to reach.
With the .querySelector method, make sure the page is fully loaded before attempting this.
Example delays can be added with Application.Wait Now + TimeSerial(h,m,s)
Set eleval = objIE.Document.querySelector("div[class='outputValue']")
Debug.Print eleval.innerText
If it is the first of its className on the page you could also use:
Set eleval = objIE.Document.querySelector(".outputValue")
If there is more than one and it is at a later index you can use
Set eleval = objIE.Document.querySelectorAll(".outputValue")
And then access items from the nodeList returned with
Debug.Print eleval.Item(0).innerText 'or replace 0 with the appropriate index.
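The advice about making sure the page is fully loaded can also be handled with a polling loop rather than a fixed delay. The following is a minimal sketch, assuming objIE is an InternetExplorer automation object; WaitForPage and its timeout parameter are hypothetical names. Content that the page injects by script after loading may still need an extra wait.

```vba
' Sketch: wait until the browser reports the page as loaded, giving up after
' timeoutSeconds. Returns True when the page is ready, False on timeout.
Public Function WaitForPage(ByVal ie As Object, Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Const READYSTATE_COMPLETE As Long = 4
    Dim deadline As Date

    deadline = DateAdd("s", timeoutSeconds, Now)

    Do While ie.Busy Or ie.readyState <> READYSTATE_COMPLETE
        DoEvents                                ' keep the host application responsive
        If Now > deadline Then Exit Function    ' timed out, result stays False
    Loop

    WaitForPage = True
End Function
```

A call such as If WaitForPage(objIE) Then Set eleval = objIE.Document.querySelector(".outputValue") keeps the element lookup from running against a half-loaded document.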
Dim elaval as Variant
elaval = Trim(Doc.getElementsByTagName("div")(X).innerText)
msgbox elaval
Where X is the zero-based position of your div among all the <div> elements on the page.

Get value from HTML element for display in Textbox

I have tried adapting a handful of solutions I've found on here and cannot get any of them to work. Most recently:
Private Sub Button4_Click(sender As Object, e As EventArgs) Handles Button4.Click
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://MyWebSearch.com/s/" + TextBox1.Text)
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(<div class="inline-block"></div>)
Textbox5.text(table.InnerText)
Next
End Sub
I am trying to conduct a search against a fixed address, where TextBox1 contains the search item. After the search, I need to return the value of one element on the page into Textbox5. I can't for the life of me get this to work. I've tried obtaining the XPath, but that failed too. What am I doing wrong?
The web page is rbx.trade/s/"username"
I am trying to return the users "Rap" and display in textbox5
You're searching for the class of an element. You probably want to search for the ID instead.
' SelectNodes expects an XPath string, not an HTML literal
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//div[@id='elementID']")
TextBox5.Text = table.InnerText
Next
Or, if you do mean to use the class, store each value in an array or append it to the text box instead of assigning it directly. Say there are 8 elements with that class: assigning inside the loop as you do now will always leave only the value of the 8th matched element in the text box.

Get values from a website with same ID

I'm trying to make a local VBscript to get some values from a webpage. I know that I can use the next code to get a value from a specific element:
IE.document.GetElementById("id-to-find")
My problem is that I have the same ID ("hiddencardetailsenrollid") in more than one element so I need to extract all of them. This is the code repeated:
carId: <span id="hiddencardetailscarid">10972203</span>,
enrollId: <span id="hiddencardetailsenrollid">11147540</span>.
Do you have any suggestion to do this? I thought on a for condition to read all the HTML document but I do not know how to approach it.
Any help will be appreciated.
Edit: Here is it the full screenshot of the sourcecode. As you can see, they have exactly the same labels, but carId and enrollId have different values. I can't copypaste the code, stackoverflow returns me an error (I suppose because "table" tag):
If you did have multiple elements with the same ID, which you shouldn't, you could use the answer from this question (courtesy of #peter) and slightly modify it:
Dim HTMLDoc, XML, URL, table
Set HTMLDoc = CreateObject("HTMLFile")
Set XML = CreateObject("MSXML2.XMLHTTP")
URL = "http://www.verizonwireless.com/b2c/store/controller?item=phoneFirst&action=viewPhoneDetail&selectedPhoneId=5723"
With XML
.Open "GET", URL, False
.Send
HTMLDoc.Write .responseText
End With
Set spans = HTMLDoc.getElementsByTagName("span")
for each span in spans
WScript.Echo span.innerHTML
next
'=><SPAN>Set Location</SPAN>
'=>Set Location
'=><SPAN>Submit</SPAN>
'=>Submit
Simply swapping getElementsByTagName for getElementById will not work here: getElementById returns at most one element, not a collection, so there is nothing to loop over. Instead, keep the getElementsByTagName loop and check each element's id inside it. But again, you should not have multiple HTML elements with the same id.
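Filtering a tag collection by id can be sketched as follows. This is VBA rather than VBScript, and the HTML string is made-up sample data modelled on the question (the second enrollId value is invented); in practice the markup would come from the page itself.

```vba
' Sketch: print the text of every <span> that shares a given id, even though
' a page is not supposed to reuse ids. The HTML below is invented sample data.
Public Sub ListSpansSharingId()
    Dim doc As Object
    Dim span As Object

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = _
        "<span id=""hiddencardetailscarid"">10972203</span>" & _
        "<span id=""hiddencardetailsenrollid"">11147540</span>" & _
        "<span id=""hiddencardetailsenrollid"">11147541</span>"

    ' Walk every <span> and keep only those whose id matches
    For Each span In doc.getElementsByTagName("span")
        If span.ID = "hiddencardetailsenrollid" Then
            Debug.Print span.innerText
        End If
    Next span
End Sub
```

Looping over the tag collection sidesteps the question of which duplicate getElementById would return, because every element is inspected.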

Get data from a web table with table tag

I have this code in HTML:
<table cellspacing = "0" cellpadding = "0" width = "100%" border="0">
<td class="TOlinha2"><span id="Co">140200586125</span>
I already have a VBA function that accesses a web site, logs in and goes to the right page. Now I'm trying to take the td tags inside a table in HTML. The value I want is 140200586125, but I want a lot of td tags, so I intend to use a for loop to get those tds and put them in a worksheet.
I have tried both:
.document.getElementByClass()
and:
.document.getElementyById()
but neither worked.
Appreciate the help. I'm from Brazil, so sorry about any English mistakes.
There is not enough HTML to determine whether TOlinha2 is a consistent class name for all the tds within the table of interest, and whether it is limited to this table. If it is, then you can indeed use .querySelectorAll.
You could use the CSS selector:
ie.document.querySelectorAll(".TOlinha2")
Where "." stands for className.
You cannot iterate over the returned NodeList with a For Each Loop. See my question Excel crashes when attempting to inspect DispStaticNodeList. Excel will crash and you will lose any unsaved data.
You have to loop the length of the nodeList e.g.
Dim i As Long
For i = 0 To nodeList.Length - 1
Debug.Print nodeList(i).innerText
Next i
Sometimes you need different syntax which is:
Debug.Print nodeList.Item(i).innerText
You can narrow this CSS selector down further with more qualifying elements, for example requiring that the element sits inside a tbody (i.e. a table), is a descendant of a tr (table row) and has the class name TOlinha2:
ie.document.querySelectorAll("tbody tr .TOlinha2")
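Since the goal is to put the values in a worksheet, the index-based loop above could write each match into a cell. The following is a minimal sketch, assuming doc is ie.document after the page has loaded; WriteMatchesToSheet is a hypothetical helper name and column A is an arbitrary choice.

```vba
' Sketch: write the text of every element matching a CSS selector into
' column A of the given worksheet, one row per match.
Public Sub WriteMatchesToSheet(ByVal doc As Object, ByVal ws As Worksheet, ByVal selector As String)
    Dim nodeList As Object
    Dim i As Long

    Set nodeList = doc.querySelectorAll(selector)

    ' Loop by index over the NodeList length, starting from 0
    For i = 0 To nodeList.Length - 1
        ws.Cells(i + 1, 1).Value = nodeList.Item(i).innerText
    Next i
End Sub
```

A call might look like WriteMatchesToSheet ie.document, ThisWorkbook.Worksheets(1), ".TOlinha2".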
Since you mentioned you need to retrieve multiple <td> tags, it would make more sense to retrieve the entire collection rather than using getElementById() to get them one-at-a-time.
Based on your HTML above, this would match all <span> nodes within a <td> with a class='TOlinha2':
Dim node, nodeList
Set nodeList = ie.document.querySelectorAll("td.TOlinha2 > span")
For Each node In nodeList
MsgBox node.innerText ' This should return the text within the <span>
Next