I need to find the string shown in the attached picture using VBA in Excel. I have the code below, but it is not finding the date I am looking for.
The relevant loop is the last one, "For Each Element In Elements2".
Dim Doc As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim elements1 As IHTMLElementCollection
Dim Elements2 As IHTMLElementCollection
Dim iCnt As Integer
Dim Element As IHTMLElement
Dim appIE As InternetExplorerMedium

Sheets(1).Cells(1, 1).Value = ""
complete = 0
Set appIE = New InternetExplorerMedium
sURL = "https://example.com"

With appIE
    .Navigate sURL
    .Visible = True
    Do While appIE.Busy Or appIE.ReadyState <> 4
        DoEvents
    Loop

    Do While complete <> 1
        Set Doc = appIE.Document
        Set Elements = Doc.GetElementsByTagName("input")
        Set elements1 = Doc.GetElementsByTagName("strong")
        Set Elements2 = Doc.GetElementsByTagName("td")

        For Each Element In Elements
            If Element.ID = "form-id" Then
                requestnumber = Element.GetAttribute("Value")
            End If
            If Element.ID = "remedy-case-info" Then
                CaseInfo = Element.GetAttribute("Value")
            End If
        Next Element

        For Each Element In elements1
            If InStr(1, Element.InnerHtml, "EM") Then
                For iCnt = 1 To Len(Element.InnerText)
                    If IsNumeric(Left(Element.InnerText, 2)) Then
                        NumericOnly (Element.InnerText)
                    End If
                Next iCnt
            End If
        Next Element

        AClientCount = tempcount

        For Each Element In Elements2
            ' If InStr(1, Element.InnerHtml, "td") Then
            If InStr(1, Element.InnerHtml, "value-field align-top") Then
                Requestdate = Element.GetAttribute("Value")
            End If
            ' End If
        Next Element

        Set Elements = Nothing
        If requestnumber <> "" Then
            Sheets(1).Cells(1, 1).Value = requestnumber & " - " & CaseInfo & " - " & tempcount & " - " & Requestdate
            complete = 1
        End If
    Loop
    .Quit
End With
The innerHtml property picks up the content within a tag but does not pick up the tag itself. The outerHTML property includes the tag itself as well as the tag's content.
Example:
HTML:      <p class="fee fie foe fum">bar <b>bat</b> <i>cat</i> car</p>
innerHTML: bar <B>bat</B> <I>cat</I> car
outerHTML: <P class="fee fie foe fum">bar <B>bat</B> <I>cat</I> car</P>
To perform a text match on an attribute value of an element, you would need to look at the outerHTML property of the element and not the innerHTML property.
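For example, the original test could be pointed at outerHTML instead (a sketch only; the class names are the ones from the question's HTML):
If InStr(1, Element.outerHTML, "value-field align-top") Then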
However, the class attribute of an element can be accessed via the className property so you could replace the InStr on innerHTML with this:
If InStr(1, Element.className, "value-field align-top") Then
This is not ideal because it would be perfectly valid to write the class names in a different order - e.g. class="align-top value-field" - and this would not be picked up by the InStr function.
It would be better to start with getElementsByClassName (which doesn't care about which order the class names are in) and then use the tagName property to check we have the correct tag, like this:
Set Elements2 = Doc.getElementsByClassName("value-field align-top")
' code for the loops on Elements and Elements1 goes here
For Each Element In Elements2
    If UCase(Element.tagName) = "TD" Then
Finally, Element.getAttribute("value") will return Null unless the element has a named attribute called "value". To get the text value of the element, use this instead:
Requestdate = Element.innerText
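Putting those suggestions together, the final loop from the question might be rewritten along these lines (a minimal sketch based only on the code above; UCase is used because tagName is returned in upper case):
Set Elements2 = Doc.getElementsByClassName("value-field align-top")
For Each Element In Elements2
    If UCase(Element.tagName) = "TD" Then
        Requestdate = Element.innerText
    End If
Next Element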
Ok, this is the target webpage(s):
http://dnd.arkalseif.info/items/index.html_page=27
Here's my current code:
Sub GetItemsList()
    ' This macro uses manually entered links to scrape the content of the target page.
    ' It does not (yet) capture hyperlinks, it only grabs text.
    Dim ie As Object
    Dim retStr As String
    Dim sht As Worksheet
    Dim LastRow As Long
    Dim rCell As Range
    Dim rRng As Range
    Dim Count As Long
    Dim Status As String
    Dim BadCount As Long

    Set sht = ThisWorkbook.Worksheets("List")
    BadCount = 0
    LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
    Set ie = CreateObject("internetexplorer.application")
    Set rRng = sht.Range("b1:b" & LastRow)
    Status = "Starting at row "

    For Each rCell In rRng.Cells
        Count = rCell.Row
        Application.StatusBar = BadCount & " dead links so far. " & Status & Count & " of " & LastRow & "."
        Wait 1 ' Wait is assumed to be a helper Sub (not shown) that pauses for the given number of seconds
        If rCell = "" Then
            With ie
                .Navigate rCell.Offset(0, -1).Value
                .Visible = False
            End With
            Do While ie.Busy
                DoEvents
            Loop
            Wait 1
            On Error GoTo ErrHandler
            ' rCell.Value = ie.Document.getElementById("content").innerText
            rCell.Value = ie.Document.getElementsByClassName("common").innerText
            rCell.WrapText = False
            Status = "This row successfully scraped. Moving on to row "
            Application.StatusBar = BadCount & " dead links so far. " & Status & Count + 1 & " of " & LastRow & "."
            Status = "Previous row succeeded. Now at row "
98          Wait 1
        End If
    Next rCell

    If BadCount > 0 Then
        Application.StatusBar = "Macro finished running with " & BadCount & " errors."
    Else
        Application.StatusBar = "Finished."
    End If
    Exit Sub

ErrHandler:
    rCell.Value = ""
    Status = "Previous row failed. Moving on to row "
    BadCount = BadCount + 1
    Application.StatusBar = "This row is a dead link. " & BadCount & " dead links so far. Moving on to row " & Count + 1 & " of " & LastRow & "."
    Resume 98
End Sub
(try to ignore all my StatusBar updates, this code was originally meant for a looooong list of hyperlinks, and I needed (at the time) to know when things buggered up)
Now, the commented-out line works, in that it grabs the entire body of text from the div with id "content". But I want to grab the hyperlinks nestled inside the first column of the table (class "common") that is nested inside that div, which is what the following line was for. It just fails: Excel does nothing, treats it as an error, and proceeds to the next link.
I presume that I need to tell Excel to look for the table class inside the div id, but I don't know how to do that and haven't been able to figure it out.
Thanks everyone.
I would use CSS selectors to target the links and XMLHTTP as a faster retrieval method than launching a browser.
CSS selectors:
The following:
td:first-child [href]
The td:first-child part applies the :first-child CSS pseudo-class to td elements; the space (" ") is a descendant combinator, and [] is an attribute selector. In this case it selects the first td element in each row, i.e. the first column, and then any element within it that has an href attribute.
The :first-child CSS pseudo-class represents the first element among a
group of sibling elements.
Sadly, the VBA/MSHTML implementation doesn't support the :not selector; otherwise the exact elements could also be matched with .common tr + tr td :not([href*='rule'],br). Support for pseudo-class selectors is very limited. The :nth-child() pseudo-class selector td:nth-child(1) would likewise have retrieved the specific items, had it been supported in descendant combination as td:nth-child(1) [href]. I keep meaning to do a write-up on what is and isn't supported, in case anyone wants a reference. It is useful to be aware of even non-VBA-supported methods, in case you later switch to a language that does support them.
The selector is applied via the querySelectorAll method of, in this case, HTMLDocument. It returns all matches as a nodeList whose .Length can be traversed to access individual matched elements by index.
VBA:
Option Explicit

Public Sub GetLinks()
    Dim sResponse As String, html As HTMLDocument, nodeList As Object, i As Long

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://dnd.arkalseif.info/items/index.html_page=27", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    Set html = New HTMLDocument
    With html
        .body.innerHTML = sResponse
        Set nodeList = .querySelectorAll("td:first-child [href]")
        For i = 0 To nodeList.Length - 1
            Debug.Print Replace$(nodeList.item(i), "about:", "http://dnd.arkalseif.info/items/")
        Next
    End With
End Sub
References (VBE > Tools > References):
Microsoft HTML Object Library
I am trying to extract all the hyperlinks which contain "http://www.bursamalaysia.com/market/listed-companies/company-announcements/" from the webpages I input.
At first the code ran well, but now I am facing a problem: I cannot extract the URL links that I need. They just go missing every time I run the sub.
Link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All
Sub scrapeHyperlinks()
    Dim IE As InternetExplorer
    Dim html As HTMLDocument
    Dim ElementCol As Object
    Dim Link As Object
    Dim erow As Long

    Application.ScreenUpdating = False
    Set IE = New InternetExplorer

    For u = 1 To 50
        IE.Visible = False
        IE.navigate Cells(u, 2).Value
        Do While IE.readyState <> READYSTATE_COMPLETE
            Application.StatusBar = "Trying to go to website... hahaha"
            DoEvents
        Loop
        Set html = IE.document
        Set ElementCol = html.getElementsByTagName("a")
        For Each Link In ElementCol
            erow = Worksheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Row
            Cells(erow, 1).Value = Link
            Cells(erow, 1).Columns.AutoFit
        Next
    Next u

    ActiveSheet.Range("$A$1:$A$152184").AutoFilter Field:=1, Criteria1:="http://www.bursamalaysia.com/market/listed-companies/company-announcements/???????", Operator:=xlAnd
    For k = 1 To [A65536].End(xlUp).Row
        If Rows(k).Hidden = True Then
            Rows(k).EntireRow.Delete
            k = k - 1
        End If
    Next k

    Set IE = Nothing
    Application.StatusBar = ""
    Application.ScreenUpdating = True
End Sub
Just to get the qualifying hrefs that you mention from the given URL, I would use the following. It uses a CSS selector combination to target the URLs of interest from the specified page.
The CSS selector combination is
#bm_ajax_container [href^='/market/listed-companies/company-announcements/']
This is a descendant selector looking for elements with attribute href whose value starts with /market/listed-companies/company-announcements/, and having an ancestor element with an id of bm_ajax_container. That ancestor element is the ajax container div. The "#" is an id selector, the "[]" indicates an attribute selector, and the "^" means starts with.
As more than one element is to be matched the CSS selector combination is applied via querySelectorAll method. This returns a nodeList whose .Length can be traversed to access individual items by index.
The full set of qualifying links is written out to the worksheet.
VBA:
Option Explicit

Public Sub GetInfo()
    Dim IE As New InternetExplorer
    Application.ScreenUpdating = False
    With IE
        .Visible = True
        .navigate "http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All"

        While .Busy Or .readyState < 4: DoEvents: Wend

        Dim links As Object, i As Long
        Set links = .document.querySelectorAll("#bm_ajax_container [href^='/market/listed-companies/company-announcements/']")
        For i = 0 To links.Length - 1
            With ThisWorkbook.Worksheets("Sheet1")
                .Cells(i + 1, 1) = links.item(i)
            End With
        Next i
        .Quit
    End With
    Application.ScreenUpdating = True
End Sub
I have a macro that tries to extract all the href values from a page but it only seems to get the first one. If someone could help me out that would be greatly appreciated.
The URL I used is https://www.facebook.com/marketplace/vancouver/entertainment
HTML (from the screenshot):
<div class="_3-98" data-testid="marketplace_home_feed">
<div>
<div>
<div class="_65db">
<a class="_1oem" href="/marketplace/item/920841554781924" data-testid="marketplace_feed_item">
<a class="_1oem" href="/marketplace/item/580124349088759" data-testid="marketplace_feed_item">
<a class="_1oem" href="/marketplace/item/1060730340772072" data-testid="marketplace_feed_item">
Sub Macro1()
    marker = 0
    Set objShell = CreateObject("Shell.Application")
    IE_count = objShell.Windows.Count
    For x = 0 To (IE_count - 1)
        On Error Resume Next ' sometimes more web pages are counted than are open
        my_url = objShell.Windows(x).document.Location
        my_title = objShell.Windows(x).document.Title
        If my_title Like "Facebook" & "*" Then ' compare to find if the desired web page is already open
            Set ie = objShell.Windows(x)
            marker = 1
            Exit For
        Else
        End If
    Next

    Set my_data = ie.document.getElementsByClassName("_3-98")

    Dim link
    i = 1
    For Each elem In my_data
        Set link = elem.getElementsByTagName("a")(0)
        i = i + 1
        ' copy the data to the excel sheet
        ActiveSheet.Cells(i, 4).Value = link.href
    Next
End Sub
You can use a CSS selector combination to get the elements. If you provide the actual HTML, rather than an image, it will be easier to test and determine the best combination. The selector is applied via the querySelectorAll method to return a nodeList of all matching elements. You traverse the .Length of the nodeList to access items by index, from 0 to .Length - 1.
VBA:
Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll("._1oem[href]")
For i = 0 To aNodeList.Length - 1
    ActiveSheet.Cells(i + 2, 4) = aNodeList.item(i)
Next
The CSS selector combination is ._1oem[href], which matches elements that have a class of _1oem and an href attribute. The "." is a class selector and the [] an attribute selector. It is a fast and robust method.
The above assumes there are no parent form/frame/iframe tags to negotiate.
An alternative selector that matches on the two attributes, rather than the class would be:
html.querySelectorAll("[data-testid='marketplace_feed_item'][href]")
Full example:
Option Explicit

Public Sub GetInfo()
    Dim IE As New InternetExplorer
    With IE
        .Visible = True
        .navigate "https://www.facebook.com/marketplace/vancouver/entertainment"

        While .Busy Or .readyState < 4: DoEvents: Wend

        Dim aNodeList As Object, i As Long
        Set aNodeList = IE.document.querySelectorAll("._1oem[href]")
        For i = 0 To aNodeList.Length - 1
            ActiveSheet.Cells(i + 2, 4) = aNodeList.item(i)
        Next
        'Quit '<== Remember to quit application
    End With
End Sub
You only ask for the first anchor element within each element with a _3-98 class. Iterate through the collection of anchor elements within the parent element.
...
Dim j As Long
Set my_data = ie.document.getElementsByClassName("_65db")
For Each elem In my_data
    For i = 0 To elem.getElementsByTagName("a").Length - 1
        j = j + 1
        ActiveSheet.Cells(j, 4).Value = elem.getElementsByTagName("a")(i).href
    Next i
Next elem
...
I always have problems when web scraping with VBA if I find tags nested like the ones at this link:
http://forebet.com
I want to scrape the link data from the menu on the left, but I go wrong when I get to the championships nested under countries such as England and Spain.
Sub championshipforebet()
    Dim objIE As Object
    Dim itemEle As Object
    Dim itemEle1 As Object
    Dim away As Object
    Dim desc As String, pt1 As String, pt2 As String, price As String
    Dim i As Integer

    Cells.Select
    Selection.ClearContents
    Selection.NumberFormat = "#"

    Set objIE = CreateObject("internetexplorer.application")
    objIE.Visible = True
    objIE.navigate "https://www.forebet.com/it/"
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop

    Set itemEle = objIE.document.getElementsByClassName("moduletable_foremenu")(1)
    Set itemEle1 = itemEle.getElementsByClassName("tree_foremenu")
    i = 1
    For Each ele1 In itemEle1(0).getElementsByTagName("li")
        i = i + 1
        Cells(i, 1) = ele1(0).getElementsByClassName("mainlevel_foremenu").href
    Next ele1
End Sub
Although there is no remarkable difference between the two answers, I decided to post mine as I had already created one. The For loop and the Split function might be helpful for future readers. As QHarr has already described .querySelector(), I won't repeat that here.
This is the code you can try as well:
Sub GrabLinks()
    Const Baseurl$ = "https://www.forebet.com"
    Dim S$, I&

    With New XMLHTTP60
        .Open "GET", Baseurl & "/", False
        .send
        S = .responseText
    End With

    With New HTMLDocument
        .body.innerHTML = S
        With .querySelectorAll(".mainlevel_foremenu,.sublevel_foremenu")
            For I = 0 To .Length - 1
                Cells(I + 1, 1) = Baseurl & Split(.Item(I).getAttribute("href"), "about:")(1)
            Next I
        End With
    End With
End Sub
References (VBE > Tools > References):
Microsoft XML, v6.0
Microsoft HTML Object Library
You can use CSS selectors, including the OR operator ",". Also, use XMLHTTPRequest (XHR) to retrieve the data far more quickly than by opening IE.
The links are in one of two classes:
The top-level links have the class:
mainlevel_foremenu
The nested links have the class:
sublevel_foremenu
You can use a CSS selector combination to get all elements with either of these two classes, combining them with the OR operator.
a.mainlevel_foremenu,a.sublevel_foremenu
The a is an element (tag) selector and the . is a class selector, so this matches a elements that have the class name mainlevel_foremenu or (",") sublevel_foremenu. The a is not actually required in this instance, so you could shorten it to .mainlevel_foremenu, .sublevel_foremenu.
The CSS selector returns more than one item so .querySelectorAll method of document is used to return a nodeList. The length of the nodeList is then iterated to access individual links by index.
The links are relative so Replace function is used to remove the "about:" and the base path is prefixed to the link address.
The combined CSS selector also picks up the nested (sub-level) links under each country, not just the top-level ones.
VBA:
Option Explicit

Public Sub GetInfo()
    Application.ScreenUpdating = False
    Dim sResponse As String, i As Long, html As New HTMLDocument

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.forebet.com/", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    Dim links As Object
    With html
        .body.innerHTML = sResponse
        Set links = .querySelectorAll("a.mainlevel_foremenu,a.sublevel_foremenu")
    End With

    For i = 0 To links.Length - 1
        Cells(i + 1, 1) = "https://www.forebet.com" & Replace$(links(i).href, "about:", vbNullString)
    Next i

    Application.ScreenUpdating = True
End Sub
References (VBE > Tools > References):
Microsoft HTML Object Library
I have a VBA module for extracting all the links in a page. However, I would like to ignore all the links in certain tags such as <header> and <footer> (and all their child tags). Can anyone tell me how this can be done?
Sub Fetch_click()
    Dim LinkArr As Variant
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    IE.Navigate Cells(1, 1).Text
    While IE.Busy
        DoEvents
    Wend

    Dim i As Integer
    i = 3
    Set LinkArr = IE.Document.getElementsByTagName("a")
    For Each LinkObj In LinkArr
        Cells(i, 1).Value = LinkObj.href
        i = i + 1
    Next
End Sub
Thank you
I would prefer to use objects from the Microsoft HTML Object Library and the Microsoft Internet Controls library (add references to both!), e.g.
Sub StartTest()
    Dim Browser As SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument

    ' start browser
    Set Browser = New SHDocVw.InternetExplorer
    Browser.Visible = True
    Browser.navigate "www.dauda.at"
    ' wait for the page to finish loading before grabbing the document
    Do While Browser.Busy Or Browser.readyState <> 4
        DoEvents
    Loop
    Set HTMLDoc = Browser.document

    Dim ECol As MSHTML.IHTMLElementCollection
    Dim IFld As MSHTML.IHTMLElement

    ' search all <a> tags
    Set ECol = HTMLDoc.getElementsByTagName("a")
    For Each IFld In ECol
        ' etc ...
    Next IFld

    ' clean up
    Set IFld = Nothing
    Set ECol = Nothing
    Set HTMLDoc = Nothing
    Browser.Quit
    Set Browser = Nothing
End Sub
Checking where your <a> tag is sitting can be as easy as inspecting IFld.ParentNode.nodeName to get the tag of the enclosing parent.
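A minimal sketch of that single-level check, placed inside the For Each loop above (HEADER and FOOTER are the tags this question wants to exclude; nodeName comes back in upper case):
If UCase$(IFld.ParentNode.nodeName) = "HEADER" Or UCase$(IFld.ParentNode.nodeName) = "FOOTER" Then
    ' the direct parent is a <header> or <footer>, so skip this link
Else
    Debug.Print IFld.href
End If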
If it is unclear how deeply nested your <a> is, you can make use of a recursive function examining the next higher parent all the way up to the document root ("#document") or the contained "HTML", e.g.
Function BadParentRec(TestFld As MSHTML.IHTMLElement) As Boolean
    Dim MyTag As String, MyTempResult As Boolean
    BadParentRec = False
    MyTag = TestFld.ParentNode.nodeName
    ' Debug.Print MyTag
    If MyTag = "#document" Then
        MyTempResult = False ' reached the document root, so all ancestors were good
    ElseIf MyTag = "XXX" Then ' your own criteria for bad tags go here, e.g. "HEADER" or "FOOTER"
        MyTempResult = True ' send "bad" back up the recursion chain
    Else
        MyTempResult = BadParentRec(TestFld.parentElement) ' climb to the next level up
    End If
    BadParentRec = MyTempResult
End Function
... so inside the For Each loop you would say
If Not BadParentRec(IFld) Then
    Debug.Print IFld.href ' check here for href = ""
End If