Web scraping using VBA

Web scraping using VBA - html

I would like to extract data from this URL.
I want to extract Title, mobile contact number and address from each of 10 business cards.
Here is some code I tried but did not get success.
Public Sub GetValueFromBrowser()
On Error Resume Next
Dim Sn As Integer
Dim ie As Object
Dim url As String
Dim Doc As HTMLDocument
Dim element As IHTMLElement
Dim elements As IHTMLElementCollection
For Sn = 1 To 1
url = Sheets("Infos").Range("C" & Sn).Value
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = 0
.navigate url
While .Busy Or .readyState <> 4
DoEvents
Wend
End With
Set Doc = ie.document
Set elements = Doc.getElementsByClassName(" col-sm-5 col-xs-8 store-details sp-detail paddingR0")
Dim count As Long
Dim erow As Long
count = 0
For Each element In elements
If element.className = "lng_cont_name" Then
erow = Sheet1.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
Cells(erow, 1) = Doc.getElementsByClassName("Store-Name")(count).innerText
Cells(erow, 2) = Doc.getElementsByClassName("cont_fl_addr")(count).innerText
count = count + 1
End If
Next element
If Val(Left(Sn, 2)) = 99 Then
ActiveWorkbook.Save
End If
Next Sn
End Sub

The tel numbers were not easy as I think they have purposefully been made difficult to scrape. I have found a way to decipher the values from the CSS pseudo ::before element content. The addresses and titles were straightforward CSS selections.
I have since written a cleaner script in python here.
So, how do the various parts of the code work?
titles:
Set titles = .querySelectorAll(".jcn [title]")
I target the titles as elements that have a title attribute with a parent jcn class attribute. The "." indicates a class selector, the "[]" an attribute selector, and the " " in between is a descendant combinator.
querySelectorAll method of document returns a nodeList of all matching elements on the page i.e. the 10 titles.
addresses:
Set addresses = .querySelectorAll(".desk-add.jaddt")
Addresses are targeted by their class attribute desk-add jaddt. As compound class names are not allowed, an additional "." has to be replace the white space in the name.
Telephone numbers (via deciphering contents within storesTextToDecipher) :
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
This is where the magic happens. The numbers are not available via the DOM directly as they are pseudo element content.
If you inspect the relevant HTML you will find a series of pseudo ::before elements. VBA exposes no mechanism for applying pseudo selectors to try and get at this info in the CSS for the page.
What you see is in fact a series of span elements that each have a class attribute beginning with mobilesv. These elements sit within a single parent element of class col-sm-5 col-xs-8 store-details sp-detail paddingR0 (note again the compound class name).
I initially gather a nodeList of all the parent elements.
Sample of returned elements:
Each of these parent elements houses the class name (beginning with mobilesv) elements that constitute the characters of the telephone number string. Some characters are numbers in the string, others represent the +()-. Note: The 2|3 letter strings, in the class names, after icon- e.g. dc, fe.
For example, the first search result on the page, for the initial number 9 in the telephone number:
The actual CSS content for this pseudo element /telephone character can be observed in the CSS styling:
Note the class name and before pseudo element selector: .icon-ji:before
And the content is \9d010.
Long story short.... you can extract the 2 or 3 letters after icon- i.e. ji in this case, and the number string after \9d0, i.e. 10 in this case, and use these two bits of info to decipher the telephone number. This info is available in the response:
See the same 2/3 letter strings that are associated with the class names of the telephone string on the left, and the content instructions on the right. A little maths deduces that the number on the right is 1 greater than the telephone number, for that class, shown in the image of the webpage. I simply create a dictionary that then maps the 2/3 letter abbreviation to the telephone number by parsing this section of the html.
When looping over storesTextToDecipher, I use this dictionary to decipher the actual telephone number from the matching 2/3 letter abbreviation in the class name.
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
html.body.innerHTML = sResponse
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
With html
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
End With
For i = 0 To titles.Length - 1
Debug.Print "title: " & titles.item(i).innerText
Debug.Print "address: " & addresses.item(i).innerText
Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Sample output:
Edit: All page results
As you now want more than 10 the following uses the expected page result count (NUMBER_RESULTS_ON_PAGE) to gather the information from the page. It scrolls the page until the expected number of telephone numbers (which should be unique) are found, or the MAX_WAIT_SEC is hit. This means you avoid an infinite loop and can set your expected result count if you expect a different number. It does rely on each store having a telephone number listed - this seems a fairly reasonable assumption.
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
Const NUMBER_RESULTS_ON_PAGE As Long = 80
Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
Application.ScreenUpdating = True
Set resultCountDict = CreateObject("Scripting.Dictionary")
Set cipherDict = GetCipherDict(URL)
With IE
.Visible = True
.Navigate2 URL
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
t = Timer
Do
DoEvents
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
Dim telNumber As String, i As Long
For i = 0 To titles.Length - 1
telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
If Not resultCountDict.Exists(telNumber) Then
resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
End If
Next
.parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
While IE.Busy Or IE.readyState < 4: DoEvents: Wend
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE
End With
.Quit
End With
Dim key As Variant, rowCounter As Long
rowCounter = 1
With ThisWorkbook.Worksheets("Sheet1")
For Each key In resultCountDict.keys
.Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
rowCounter = rowCounter + 1
Next
End With
Application.ScreenUpdating = True
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Public Function GetCipherDict(ByVal URL As String) As Object
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
Set GetCipherDict = cipherDict
End Function
EDIT:
Version for where more than one number is present at top (Please note that if you make too many requests or too quickly server will serve you random pages):
Option Explicit
Public Sub GetDetails()
Dim re As Object, decodeDict As Object, i As Long
Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
Set decodeDict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
Set html = New MSHTML.htmlDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
.setRequestHeader "User-Agent", "Mozilla/4.0"
.send
responseText = .responseText
html.body.innerHTML = responseText
End With
keys = GetMatches(re, responseText, "-(\w+):before")
If UBound(keys) = 0 Then Exit Sub
values = GetMatches(re, responseText, "9d0(\d+)", True)
For i = LBound(values) To UBound(values)
decodeDict(keys(i)) = values(i)
Next
Dim itemsToDecode()
decodeDict(keys(UBound(keys))) = "+"
itemsToDecode = GetValuesToDecode(html)
PrintNumbers re, html, itemsToDecode, decodeDict
End Sub
Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .Test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
If numeric Then
arrMatches(i) = iMatch.SubMatches.item(0) - 1
Else
If spanSearch Then
arrMatches(i) = iMatch
Else
arrMatches(i) = iMatch.SubMatches.item(0)
End If
End If
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
Dim i As Long, elements As Object, results(), Class As String
Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
ReDim results(elements.Length - 1)
For i = 0 To elements.Length - 1
Class = elements.item(i).className
results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
Next
GetValuesToDecode = results
End Function
Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
Dim output As String, i As Long
For i = LBound(itemsToDecode) To UBound(itemsToDecode)
output = output & decodeDict(itemsToDecode(i))
Next
Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
htmlToSearch = html.querySelector(".telCntct").outerHTML
groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
startPos = 1
Dim totalNumbers As Long
For i = LBound(groups) To UBound(groups)
If InStr(groups(i), ",") > 0 Then
totalNumbers = totalNumbers + 1
Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
startPos = i + 1
End If
Next
If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub

Related

Webscrape a specific part of a webpage

My webscrape stopped working. The owner changed the html.
I believe it is the Set allElements = doc.getElementsByClassName("el-col el-col-8") line that needs changing.
I am trying to grab text from the webpage that includes the "52-week Range (undefined)" section. I managed to grab text from before and after but not the section I need. An example webpage is https://www.gurufocus.com/stock/gliba/summary and my code should fill my cell with "38.72 - 73.63" after I do some trimming.
I need to do it this way so I can get my head round it and change it in the future when necessary so please just focus on correcting my set line of code (assuming that is the problem!) rather than a whole new more sophisticated method as it will be beyond me. (My other set line of code does what I want it to do.)
Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim incomeStmtURLs As Variant
Dim sURL As String
Dim lastrow As Long
Dim allRowOfData As Object
Dim i As Integer
Dim allElements As IHTMLElementCollection
Dim anElement As IHTMLElement
Dim aCell As HTMLTableCell
Application.DisplayAlerts = False
Call ToggleEvents(False)
incomeStmtURLs = Range("Sheet1!h1:h2").Value
For i = 1 To UBound(incomeStmtURLs)
Set wb = CreateObject("internetExplorer.Application")
sURL = incomeStmtURLs(i, 1)
wb.navigate sURL
wb.Visible = False
While wb.Busy
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
Set doc = wb.document
On Error GoTo err_clear
Set allElements = doc.getElementsByClassName("el-col el-col-8")
While allElements.Length = 0
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
x = allElements(0).innerText
' Debug.Print x
Sheet6.Cells(i + 1, 2).Value = Trim(Replace(Mid(x, InStr(1, x, "52-Week Range (undefined)") + 25, 25), vbLf, ""))
Set allElements = doc.getElementsByClassName("fs-x-large fc-primary fw-bolder")
x = allElements(0).innerText
Sheet6.Cells(i + 1, 4).Value = Trim(Replace(Mid(x, InStr(1, x, "$") + 1, 7), vbLf, ""))
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Next i
Call ToggleEvents(True)
End Sub
Sub ToggleEvents(blnState As Boolean)
Application.DisplayAlerts = blnState
Application.EnableEvents = blnState
If blnState Then Application.CutCopyMode = False
If blnState Then Application.StatusBar = False
End Sub

The page dynamically updates content as you scroll down. You likely need to scroll that part of the page into view then use grab all the elements with classname statictics-item then take the n-2 index e.g. Without the scrolling part:
Set elems = ie.document.getElementsByClassName("statictics-item")
If elems.length > 1 Then Debug.print elems(elems.length-2).innerText
For future readers (I know OP doesn't want this):
I would avoid the whole scrolling pickle, dynamic html and browser and issue an xmlhttp request and regex out the appropriate values from the javscript objects the web page uses for updating. N.B. I would probably add in validation on regex match positions.
Public Sub test()
Dim r As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.gurufocus.com/stock/gliba/summary", False
.send
r = GetMatches(.responseText, "price52wlow:(.*?),|price52whigh:(.*?),")
If r <> "NA" Then MsgBox r
End With
End Sub
Public Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As String
Dim matches As Object
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
'If .test(inputString) Then
Set matches = .Execute(inputString)
If matches.Count = 2 Then
GetMatches = matches.Item(0).submatches(0) & "-" & matches.Item(1).submatches(1)
Else
GetMatches = "NA"
End If
End With
End Function
Regex:

How to Isolate multiple innertext entries when using get elementbyID

I'm trying to isolate 2 different innerText strings from a webpage, but cannot single them out. The innerText for all the tags comes as a whole.
The date and season number are the issue.
I am using getElementById and this gives me a single element. The div with id "next_episode" has what looks like 2 different entries for inner text that I'm interested in. When I loop through the inner text of its children, these 2 entries are skipped. I can't figure out how to isolate the 2 different innerText entries of just the "next_episode" tag. I'm isolating the text I require by using the index number in the arrays my code returns.
Dim IE_00 As SHDocVw.InternetExplorer
Dim HTMLDoc_00 As MSHTML.HTMLDocument
Set IE_00 = New SHDocVw.InternetExplorer
IE_00.Visible = True
IE_00.navigate "https://next-episode.net/final-space"
Do While IE_00.readyState <> READYSTATE_COMPLETE
Loop
Set HTMLDoc_00 = IE_00.document
Dim NETC_05 As MSHTML.IHTMLElementCollection
Dim NET_05 As MSHTML.IHTMLElement
'Can loop through the inner text of the children one by one and find what
I need
Set NETC_05 = HTMLDoc_00.getElementById("next_episode").Children
For Each NET_05 In NETC_05
Debug.Print NET_05.innerText
Next NET_05
'This just gives a big block of text that includes the missing inner text
I need
Set NET_05 = HTMLDoc_00.getElementById("next_episode")
Debug.Print NET_05.innerText

Data (for the most part) is in NextSiblings:
The Node.nextSibling read-only property returns the node immediately
following the specified one in their parent's childNodes, or returns
null if the specified node is the last child in the parent element.
*1
You could write a function, like GetNextSiblings, which checks the current node for specific search strings and then extracts required values from NextSibling. I have re-ordered output columns to make for less code, but you could easily loop an alternative headers array, and use that ordering to access from dict info to write out values in a different order. I determine order of output by order of entry of keys in the dict. I loop the headers array to populate the dict keys and later update the dict with the values scraped.
The overhead of a browser is not needed as the required content is not dynamically loaded. A simple, and much faster, xhr request will suffice.
Side-note:
I would recommend, for this type of page, to use Python 3 and BeautifulSoup (bs4 4.7.1+) as this gives you access to pseudo selector :contains. The code could then be much more concise and the program faster. I show this at the end.
VBA:
Option Explicit
Public Sub GetShowInfo()
Dim html As MSHTML.HTMLDocument, headers(), i As Long, aCollection As Object, info As Object
headers = Array("Name:", "Countdown:", "Date:", "Season:", "Episode:", "Status:")
Set html = New HTMLDocument
With CreateObject("Msxml2.xmlhttp")
.Open "GET", "https://next-episode.net/final-space", False
.send
html.body.innerHTML = .responseText
End With
Set info = CreateObject("Scripting.Dictionary")
For i = LBound(headers) To UBound(headers)
info(headers(i)) = vbNullString
Next
info("Name:") = html.querySelector("#next_episode .sub_main").innerText
info("Countdown:") = html.querySelector("#next_episode span").innerText
Set aCollection = html.getElementById("middle_section").getElementsByTagName("div")
Set info = GetNextSiblings(aCollection, headers, info)
Set aCollection = html.getElementById("next_episode").getElementsByTagName("div")
Set info = GetNextSiblings(aCollection, headers, info)
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 1).Resize(1, info.Count) = info.keys
.Cells(2, 1).Resize(1, info.Count) = info.items
End With
End Sub
Public Function GetNextSiblings(ByVal aCollection As Object, ByRef headers(), ByVal info As Object) As Object
Dim item As Object, i As Long
For Each item In aCollection
For i = 2 To UBound(headers)
If InStr(item.outerHTML, headers(i)) > 0 Then
If headers(i) = "Episode:" Then
info(headers(i)) = item.NextSibling.innerText
Else
info(headers(i)) = item.NextSibling.NodeValue
End If
Exit For
End If
Next
Next
Set GetNextSiblings = info
End Function
Reading:
NextSibling
CSS selectors
querySelector
Python (with bs4 4.7.1+):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://next-episode.net/final-space')
soup = bs(r.content, 'lxml')
current_nodes = ['Status:','Name:', 'Countdown:','Date:','Season:','Episode:']
for node in current_nodes:
selector = f'#middle_section div:contains("{node}"), #next_episode div:contains("{node}")'
if node in ['Episode:','Name:']:
print(node, soup.select_one(selector).text.replace(node,''))
elif node == 'Countdown:':
print(node, soup.select_one(selector).next_sibling.text)
else:
print(node, soup.select_one(selector).next_sibling)

'Setting XML 05 as an Object
Dim XML_05 As New MSXML2.XMLHTTP60
'Setting HTML Document 05 as an Object
Dim HTML_05 As New MSHTML.HTMLDocument
XML_05.Open "GET", Cells(Row, NextEpisodeURL).Value, False
XML_05.send
HTML_05.body.innerHTML = XML_05.responseText
'Setting Net Element Tag Collection 05 as an Object
Dim NETC_05 As MSHTML.IHTMLElementCollection
'Setting Net Element Tag 05 as an Object
Dim NET_05 As MSHTML.IHTMLElement
'Setting Reg EX 05 as an Object
Dim REO_05 As VBScript_RegExp_55.RegExp
'Setting Match Object 05 as Object
Dim MO_05 As Object
'Setting Season array as Array
Dim SN_05() As String
'Setting Episode Name 05 as Array
Dim ENA_05() As String
'Setting Episode Number 05 as Array
Dim EN_05() As String
'Getting Episode Name Episode Number and Season Number From Net
'Set NETC_05 = HTML_05.getElementsByClassName("sub_main")
Set NET_05 = HTML_05.getElementById("previous_episode")
Set REO_05 = New VBScript_RegExp_55.RegExp
REO_05.Global = True
REO_05.IgnoreCase = True
'Getting Episode Name
REO_05.Pattern = "(Name:(.*))"
Set MO_05 = REO_05.Execute(NET_05.innerText)
Debug.Print MO_05.Count
Debug.Print MO_05(0).Value
ENA_05 = Split(MO_05(0), ":")
Debug.Print ENA_05(1)
Cells(Row, NextEpName).Value = ENA_05(1)
'Getting Episode Number
REO_05.Pattern = "(Episode:([0-9]*))"
Set MO_05 = REO_05.Execute(NET_05.innerText)
Debug.Print MO_05.Count
Debug.Print MO_05(0).Value
EN_05 = Split(MO_05(0), ":")
Debug.Print EN_05(1)
Cells(Row, EpisodeNet).Value = EN_05(1)
'Getting Season Number
REO_05.Pattern = "(Season:([0-9]*))"
Set MO_05 = REO_05.Execute(NET_05.innerText)
Debug.Print MO_05.Count
Debug.Print MO_05(0).Value
SN_05 = Split(MO_05(0), ":")
Debug.Print SN_05(1)
Cells(Row, SeasonNet).Value = SN_05(1)
'Getting Countdown From Net
Set NETC_05 = HTML_05.getElementById("next_episode").Children
Cells(Row, Countdown).Value = NETC_05(5).innerText
Debug.Print NETC_05(5).innerText
Public Sub GetShowInfo()
Dim html As MSHTML.HTMLDocument, headers(), i As Long, aCollection As Object, info As Object
Set html = New HTMLDocument
With CreateObject("Msxml2.xmlhttp")
.Open "GET", "https://next-episode.net/chicago-fire", False
.send
html.body.innerHTML = .responseText
End With
Set aCollection = html.getElementById("next_episode").getElementsByTagName("div")
Set aCollection = html.getElementById("next_episode").getElementsByClassName("subheadline")
On Error Resume Next
For Each ele In aCollection
Debug.Print ele.innerText
Debug.Print ele.outerText
Debug.Print ele.nextElementSibling.innerText
Debug.Print ele.nextElementSibling.innerText
Next ele
End Sub

Retrieving all Excel file links from a webpage

I'm trying to get all the downloadable Excel file links from the website, but having difficulty. Please help to guide me. Thanks.
Sub TYEX()
Dim internet As Object
Dim internetdata As Object
Dim div_result As Object
Dim header_links As Object
Dim link As Object
Dim URL As String
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = True
URL = "https://www.jpx.co.jp/markets/public/short-selling/index.html"
internet.Navigate URL
Do Until internet.ReadyState >= 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 5)
Set internetdata = internet.Document
Set div_result = internetdata.getElementById("readArea")
Set header_links = div_result.getElementsByTagName("td")
For Each h In header_links
Set link = h.ChildNodes.item(0)
Cells(Range("A" & Rows.Count).End(xlUp).Row + 1, 1) = link.href
Next
MsgBox "done"
End Sub

You had the idea down correctly, but here's a different approach:
Sub TYEX()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.navigate "https://www.jpx.co.jp/markets/public/short-selling/index.html"
.Visible = True
Do While .Busy Or .readyState < 4
DoEvents
Loop
Dim doc As Object, tbl As Object
Set doc = .document
Set tbl = doc.getElementsByClassName("component-normal-table")(0).Children(0)
Dim r As Long, xlsArr(), a As Object
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
End With
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With
End Sub
Public Function NextRow(ByVal ws As Worksheet, Optional ByVal col As Variant = 1) As Long
With ws
NextRow = .Cells(.Rows.Count, col).End(xlUp).Row + 1
End With
End Function
Breaking Down the Code
This will loop your html table rows. We start at 1, because 0 is actually just the table header.
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
This uses regular expressions to extract the url from the innerHTML property. You can see how this particular regex works here: Regex101
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
You will size your range to be the same size as your array that contains the links, then write the array to the worksheet. This is usually much faster than writing cells one-by-one.
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With

You can use an attribute = value CSS selector with $ operator to say the href value must end with .xls. Then use querySelectorAll to retrieve all matched results. Using CSS selectors is a very fast and generally robust method.
Dim list As Object
Set list = ie.document.querySelectorAll("[href$='.xls']")
It is much faster to use XMLHTTP as well, rather than opening IE. Note that you can then pass these links to a function to perform a binary download or to URLMon for downloading.
Option Explicit
Public Sub Links()
Dim sResponse As String, html As HTMLDocument, list As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.jpx.co.jp/markets/public/short-selling/index.html", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set list = html.querySelectorAll("[href$='.xls']")
End With
For i = 0 To list.Length - 1
Debug.Print Replace$(list.item(i), "about:", "https://www.jpx.co.jp")
Next
End Sub
Example download function (though you could re-use your existing XMLHTTP object - this is just to illustrate):
Public Function DownloadFile(ByVal downloadFolder As String, ByVal downloadURL As String) As String
Dim http As Object , tempArr As Variant
Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
http.Open "GET", downloadURL, False
http.send
On Error GoTo errhand
With CreateObject("ADODB.Stream")
.Open
.Type = 1
.write http.responseBody
tempArr = Split(downloadURL, "/")
tempArr = tempArr(UBound(tempArr))
.SaveToFile downloadFolder & tempArr, 2 '< "/" on enter of downloadFolder. 2 for overwrite which is Ok if no file modifications.
.Close
End With
DownloadFile = downloadFolder & tempArr
Exit Function
errhand:
If Err.Number <> 0 Then
Debug.Print Err.Number, Err.Description
MsgBox "Download failed"
End If
DownloadFile = vbNullString
End Function
References (VBE > Tools > References):
Microsoft HTML Object Library

How do i extract specific data (name, details )from the website into excel with excel vba?

How do i extract specific data (name, details )from the website into excel with excel vba?
Below I am trying to get processor and warranty:
Option Explicit
Sub GetData()
Dim objIE As InternetExplorer
Dim itemELE As Object
Dim html As IHTMLDocument
Dim Processor As String
Dim warranty As String
Dim y As Integer
'start a new browser instance
Set objIE = New InternetExplorer
'make browser visible
objIE.Visible = True
'navigate to page with needed data
objIE.navigate "https://www.harveynorman.com.sg/computers-tablets-and-gaming/computers/laptops/"
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each itemELE In objIE.document.getElementsByClassName("expandabaleContent")
Processor = itemELE.getElementsByTagName("d1")(0).innerText
warranty = itemELE.getElementsByClassName("d1")(0).getElementsByTagName("a")(0).textContent
Sheets("Sheet1").Range("A" & y).Value = Processor
Sheets("Sheet1").Range("B" & y).Value = warranty
y = y + 1
Next
End Sub
Screenshot of the page:

For the page shown (in your image) you can issue an XMLHTTP (XHR) GET request to grab the product info without opening a slow IE browser instance.
For the specific information:
Processor and warranty info:
If you inspect the page the info about processor and warranty appears associated with a classname facetedResults-feature-list
You can see the classname and then a dl tag housing a dt tag which has sibling dd tags. Two of these sibling dd tags are associated with the info for processor and warranty.
I use a CSS selector to grab all these dd tags which can be simplified,in this instance, to ignore the sibling dt and parent dl tags and use just:
.facetedResults-feature-list dd
The "." is a class selector. The CSS combination selection above says get the dd tags within elements with class facetedResults-feature-list
Product titles info:
The titles I get using another CSS selector of:
.facetedResults-title
This is elements with class facetedResults-title. This contains the product title.
Writing out product titles, processor and warranty info to the sheet:
A little maths shows me that the processor info repeats every 14, and that if I add 8 to the index for the processor I get the warranty info. You can see how you could write out each of the details as they occur at indices that repeat every 14. I combine the loop over the nodeList of dd elements with the titles to write out to the sheet.
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument
Application.ScreenUpdating = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim titles As Object, targetedInfo As Object, rowCounter As Long
With html
.body.innerHTML = sResponse
Set titles = .querySelectorAll(".facetedResults-title")
Set targetedInfo = .querySelectorAll(".facetedResults-feature-list dd")
End With
With Worksheets("Sheet1")
For i = 0 To targetedInfo.Length - 1
If i Mod 14 = 0 Then
rowCounter = rowCounter + 1
.Cells(rowCounter, 1) = titles(rowCounter - 1).innerText
.Cells(rowCounter, 2) = targetedInfo(i).innerText
.Cells(rowCounter, 3) = targetedInfo(i + 8).innerText
End If
Next i
End With
Application.ScreenUpdating = True
End Sub
Output sample:
More general info:
CSS selectors:
The product info is associated with an 'expandableContent facetedResults-expandableContent-features expandableContent-is-collapsed facetedResults-expandableContent-69' class name
The prices are associated with an 'expandableContent facetedResults-expandableContent-price expandableContent-is-collapsed' class name.
You can select these by the traditional .getElementsByClassName and then loop over the collection, or, in my case, use a CSS selector for class to do the same thing, and then traverse the length of the returned nodeList.
.getElementsByClassName("expandableContent facetedResults-expandableContent-features expandableContent-is-collapsed facetedResults-expandableContent-69")
is the same as
.querySelectorAll(".expandableContent.facetedResults-expandableContent-features.expandableContent-is-collapsed.facetedResults-expandableContent-69")
The "." is the class selector.
Titles are associated with a class facetedResults-title
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, i As Long, html As New HTMLDocument
Application.ScreenUpdating = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS", False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Dim info As Object, prices As Object, titles As Object
With html
.body.innerHTML = sResponse
Set titles = .querySelectorAll(".facetedResults-title")
Set info = .querySelectorAll(".expandableContent.facetedResults-expandableContent-features.expandableContent-is-collapsed.facetedResults-expandableContent-69")
Set prices = .querySelectorAll(".expandableContent.facetedResults-expandableContent-price.expandableContent-is-collapsed")
End With
With Worksheets("Sheet1")
For i = 0 To titles.Length - 1
.Cells(i + 1, 1) = titles(i).innerText
.Cells(i + 1, 2) = info(i).innerText
.Cells(i + 1, 3) = prices(i).innerText
Next i
End With
Application.ScreenUpdating = True
End Sub
References required (VBE>Tools>References):
Microsoft HTML Object Library

Qharr has already provided some good options but in case still you want to try IE then see below code
Option Explicit
Sub GetData()
Dim objIE As InternetExplorer
Dim itemELE As Object
Dim html As IHTMLDocument
Dim Processor As String
Dim warranty As String
Dim y As Integer
'start a new browser instance
Set objIE = New InternetExplorer
'make browser visible
objIE.Visible = True
'navigate to page with needed data
objIE.navigate "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS"
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Application.Wait Now + TimeSerial(0, 0, 3)
y = 1
For Each itemELE In objIE.document.getElementsByClassName("facetedResults-feature-list")
If InStr(1, itemELE.className, "bundleList", vbTextCompare) = 0 Then
Application.Wait Now + TimeSerial(0, 0, 2)
Processor = itemELE.getElementsByTagName("dl")(0).innerText
warranty = itemELE.getElementsByTagName("dl")(4).innerText
Sheets("Sheet1").Range("A" & y).Value = Processor
Sheets("Sheet1").Range("B" & y).Value = warranty
y = y + 1
End If
Next
End Sub
Results

Fetching data from web page

I am trying to fetch the publication date corresponding to each patent number.
Here is the Excel sheet:
The database is espacenet.com
Here's the link for the first patent you see in the Excel sheet:
http://worldwide.espacenet.com/searchResults?compact=false&PN=US7055777B2&ST=advanced&locale=en_EP&DB=EPODOC
Under the "Publication Info" header, I need to get the date after matching the patent number with the one in the Excel sheet.
Here's the code:
Sub tryextraction()
Dim ie As New InternetExplorer
Dim sdd As String
Dim tdd() As String
Dim num0 As Integer
Dim num1 As Integer
Dim doc As HTMLDocument
Dim i As Integer
Dim j As Integer
ie.Visible = True
num1 = ActiveSheet.UsedRange.Rows.Count
For num0 = 2 To num1
ie.navigate "http://worldwide.espacenet.com/searchResults?compact=false&PN=" & Range("A" & num0) & "&ST=advanced&locale=en_EP&DB=EPODOC"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.document
sdd = Trim(doc.getElementsByTagName("td")(5).innerText)
tdd() = Split(sdd, vbLf)
j = UBound(tdd)
For i = 0 To j
If InStr(tdd(i), "(") <> 0 Then
tdd(i) = Replace(tdd(i), " ", "")
tdd(i) = Replace(tdd(i), "(", "")
tdd(i) = Replace(tdd(i), ")", "")
If tdd(i) = Range("A" & num0).Value Then
Range("B" & num0).Value = tdd(i + 1)
End If
End If
Next i
Next num0
ie.Quit
End Sub
The code is not giving any error. The column "Publication Date" remains blank after the code finishes running.
The html tag which contains the publication info has been taken correctly.

There are some trailing white space characters after the ID you are searching for in the document so tdd(i) = Range("A" & num0).Value never evaluates to true. It's not just a space, so a simple Trim(tdd(i)) = Range("A" & num0).Value call does not help. Try instead InStr(tdd(i), Range("A" & num0).Value) If that is not good enough, you'll have to specifically remove CRLF from the end of the string before doing the compare.

There are often multiple publication dates under the publication info header.
Example:
The following script obtains all of these and the preceeding line (so you have the associated publication along with date).
It loops from row 2 of the Activesheet, to the last populated row, picking up the Publication Numbers from column A and writing out the results starting from column B. Depending on how many dates there are, the data will extend across multiple columns from B.
Regex:
A regex of ^(.*)\s\d{4}-\d{2}-\d{2} is used to retrieve the date pattern and the preceeding line i.e. The publication identifier and the date. Try it
Example output:
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, html As New HTMLDocument, url As String, pubInfo As Object
Dim loopRange As Range, iRow As Range, counter As Long
'example US7055777B2
Application.ScreenUpdating = False
With ActiveSheet
Set loopRange = Range("A2:A" & .Cells(.Rows.Count, "A").End(xlUp).Row)
End With
With IE
.Visible = True
counter = 2 '<== start from row 2
For Each iRow In loopRange
If Not IsEmpty(iRow) Then
url = "https://worldwide.espacenet.com/searchResults?compact=false&PN=" & iRow.Value & "&ST=advanced&locale=en_EP&DB=EPODOC"
.navigate url
While .Busy Or .readyState < 4: DoEvents: Wend
Set html = .document
Do
DoEvents
On Error Resume Next
Set pubInfo = html.querySelector(".publicationInfoColumn")
On Error GoTo 0
Loop While pubInfo Is Nothing
Dim tempArr()
tempArr = GetDateAndPatent(pubInfo.innerText, "^(.*)\s\d{4}-\d{2}-\d{2}") '"(?m)^(.*)\s\d{4}-\d{2}-\d{2}" '<==This is not supported
With ActiveSheet
.Cells(counter, 2).Resize(1, UBound(tempArr) + 1) = tempArr
End With
End If
counter = counter + 1
Next iRow
.Quit '<== Remember to quit application
End With
Application.ScreenUpdating = True
End Sub
Public Function GetDateAndPatent(ByVal inputString As String, ByVal sPattern As String) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .test(inputString) Then
Set matches = .Execute(inputString)
For Each iMatch In matches
ReDim Preserve arrMatches(i)
arrMatches(i) = iMatch.Value
i = i + 1
Next iMatch
End If
End With
GetDateAndPatent = arrMatches
End Function

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Web scraping using VBA - html

Related

Webscrape a specific part of a webpage

How to Isolate multiple innertext entries when using get elementbyID

Retrieving all Excel file links from a webpage

How do i extract specific data (name, details )from the website into excel with excel vba?

Fetching data from web page

Categories

Resources