Fetching data from web page

Fetching data from web page - html

I am trying to fetch the publication date corresponding to each patent number.
Here is the Excel sheet:
The database is espacenet.com
Here's the link for the first patent you see in the Excel sheet:
http://worldwide.espacenet.com/searchResults?compact=false&PN=US7055777B2&ST=advanced&locale=en_EP&DB=EPODOC
Under the "Publication Info" header, I need to get the date after matching the patent number with the one in the Excel sheet.
Here's the code:
Sub tryextraction()
Dim ie As New InternetExplorer
Dim sdd As String
Dim tdd() As String
Dim num0 As Integer
Dim num1 As Integer
Dim doc As HTMLDocument
Dim i As Integer
Dim j As Integer
ie.Visible = True
num1 = ActiveSheet.UsedRange.Rows.Count
For num0 = 2 To num1
ie.navigate "http://worldwide.espacenet.com/searchResults?compact=false&PN=" & Range("A" & num0) & "&ST=advanced&locale=en_EP&DB=EPODOC"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.document
sdd = Trim(doc.getElementsByTagName("td")(5).innerText)
tdd() = Split(sdd, vbLf)
j = UBound(tdd)
For i = 0 To j
If InStr(tdd(i), "(") <> 0 Then
tdd(i) = Replace(tdd(i), " ", "")
tdd(i) = Replace(tdd(i), "(", "")
tdd(i) = Replace(tdd(i), ")", "")
If tdd(i) = Range("A" & num0).Value Then
Range("B" & num0).Value = tdd(i + 1)
End If
End If
Next i
Next num0
ie.Quit
End Sub
The code is not giving any error. The column "Publication Date" remains blank after the code finishes running.
The html tag which contains the publication info has been taken correctly.

There are some trailing white space characters after the ID you are searching for in the document so tdd(i) = Range("A" & num0).Value never evaluates to true. It's not just a space, so a simple Trim(tdd(i)) = Range("A" & num0).Value call does not help. Try instead InStr(tdd(i), Range("A" & num0).Value) If that is not good enough, you'll have to specifically remove CRLF from the end of the string before doing the compare.

There are often multiple publication dates under the publication info header.
Example:
The following script obtains all of these and the preceeding line (so you have the associated publication along with date).
It loops from row 2 of the Activesheet, to the last populated row, picking up the Publication Numbers from column A and writing out the results starting from column B. Depending on how many dates there are, the data will extend across multiple columns from B.
Regex:
A regex of ^(.*)\s\d{4}-\d{2}-\d{2} is used to retrieve the date pattern and the preceeding line i.e. The publication identifier and the date. Try it
Example output:
VBA:
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, html As New HTMLDocument, url As String, pubInfo As Object
Dim loopRange As Range, iRow As Range, counter As Long
'example US7055777B2
Application.ScreenUpdating = False
With ActiveSheet
Set loopRange = Range("A2:A" & .Cells(.Rows.Count, "A").End(xlUp).Row)
End With
With IE
.Visible = True
counter = 2 '<== start from row 2
For Each iRow In loopRange
If Not IsEmpty(iRow) Then
url = "https://worldwide.espacenet.com/searchResults?compact=false&PN=" & iRow.Value & "&ST=advanced&locale=en_EP&DB=EPODOC"
.navigate url
While .Busy Or .readyState < 4: DoEvents: Wend
Set html = .document
Do
DoEvents
On Error Resume Next
Set pubInfo = html.querySelector(".publicationInfoColumn")
On Error GoTo 0
Loop While pubInfo Is Nothing
Dim tempArr()
tempArr = GetDateAndPatent(pubInfo.innerText, "^(.*)\s\d{4}-\d{2}-\d{2}") '"(?m)^(.*)\s\d{4}-\d{2}-\d{2}" '<==This is not supported
With ActiveSheet
.Cells(counter, 2).Resize(1, UBound(tempArr) + 1) = tempArr
End With
End If
counter = counter + 1
Next iRow
.Quit '<== Remember to quit application
End With
Application.ScreenUpdating = True
End Sub
Public Function GetDateAndPatent(ByVal inputString As String, ByVal sPattern As String) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .test(inputString) Then
Set matches = .Execute(inputString)
For Each iMatch In matches
ReDim Preserve arrMatches(i)
arrMatches(i) = iMatch.Value
i = i + 1
Next iMatch
End If
End With
GetDateAndPatent = arrMatches
End Function

Related

Getting the text contents from a HTML Table without ID by using VBA

I am trying to parse a HTML table (it has no id but class name) from a website. However, since it has no id, I have difficulty in printing the contents from the table. But I couldn't figure it out.
Edited
Here you can see the image of Excel file. GTIP Numbers are located at column A. My aim is when the VBA code runs, these GTIP numbers from column A are forwarded to the Search Field named "GTİP Ara" in the website https://www.isib.gov.tr/urun-arama/. As a result, the companies having the selected GTIP will be returned to the columns next specified row.
For the third row, GTIP number "841013000000" has forwarded to the Search Field named GTIP Ara and as a result; Company 2, ... Company 9 are returned to neighbor columns.
Sometimes GTIP numbers return nothing since none of the companies are holding specified number.
For example: 841410819000 will return the companies but 841112101000 will return an error "Aradığınız Sonuç Bulunamadı!". That's why I am trying to add an if statement but it doesn't work properly.
Now, because of a mistake somewhere in my code block; the returned values are same for each GTIP, here you can see the result on second image.
Sub GrabLastNames()
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.navigate "https://www.isib.gov.tr/urun-arama"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 2
For i = 2 To 269
objIE.document.getElementById("gtip-ara").Value = _
Sheets("Sheet1").Range("A:A").Cells(i, 1).Value
objIE.document.getElementById("ara").Click
'If objIE.document.getElementsByClassName("error").getElementsByTagName("span").Value <> "Aradığınız Sonuç Bulunamadı!" Then
For Each ele In objIE.document.getElementsByClassName("urun-arama-table table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
Sheets("Sheet1").Cells(i, y).Value = ele.Children(0).textContent
y = y + 1
Next
y = 2
Next i
End Sub

I only had one working GTIP 841410819000, so cant test this more thoroughly.
You don't really need to know the ID, the page has one table, so getting it using getElemenetsByClassName like you did, or just getElementsByTagName like in my example should work fine. My code is probably the same as yours, just with a few pauses to a: not spam the website, and b: give IE a chance to get itself together (its IE after all).
If you can provide a few more working GTIP values I can test a bit more.
' requires reference 'Microft HTML Object Library'
Sub Main()
Dim Browser As New InternetExplorer
Browser.Visible = True
Browser.navigate "https://www.isib.gov.tr/urun-arama"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
Dim TableRows As IHTMLElementCollection
Dim TableRow As IHTMLElement
Dim SourceRow As Integer
Dim ResultColumn As Integer
Application.Wait (Now + TimeValue("0:00:05"))
SourceRow = 2 ' Skip Header
Do
Debug.Print "Trying " & Sheet.Cells(SourceRow, 1).Value
Browser.Document.getElementById("gtip-ara").Value = Sheet.Cells(SourceRow, 1).Value
Browser.Document.getElementById("ara").Click
Application.Wait (Now + TimeValue("0:00:02"))
Do While Browser.Busy
DoEvents
Loop
If Browser.Document.getElementsByTagName("table").Length > 0 Then
Debug.Print " > Found Results"
Set TableRows = Browser.Document.getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
ResultColumn = 2 ' dont overwrite search term
For Each TableRow In TableRows
Sheet.Cells(SourceRow, ResultColumn).Value = TableRow.innerText
ResultColumn = ResultColumn + 1
Next TableRow
Else
Debug.Print " - No Results Found"
End If
If Sheet.Cells(SourceRow + 1, 1).Value = "" Then
Exit Do
Else
SourceRow = SourceRow + 1
End If
Application.Wait (Now + TimeValue("0:00:05"))
Loop
Browser.Quit
Set Browser = Nothing
End Sub
Update
Updated my code again, it doesn't spawn lots of windows anymore and prints only the company name (as your example did).
' requires Microsoft HTML Object Library
' requires Microsoft XML, v6.0
Sub Main()
Dim XHR As XMLHTTP60
Dim Document As HTMLDocument
Dim ResultRows As IHTMLElementCollection
Dim ResultRow As IHTMLElement
Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
Dim SheetRow As Integer
Dim SheetColumn As Integer
Dim LastRow As Integer
LastRow = Sheet.Cells(Sheet.Rows.Count, "A").End(xlUp).Row
For SheetRow = 2 To LastRow
Debug.Print "Trying GTIP:" & Sheet.Cells(SheetRow, 1).Value
Application.StatusBar = "Status: " & Right(String(Len(CStr(LastRow - 1)), "0") & CStr(SheetRow - 1), Len(CStr(LastRow - 1))) & "/" & CStr(LastRow - 1)
Set XHR = New XMLHTTP60
XHR.Open "POST", "https://www.isib.gov.tr/urun-arama", False
XHR.setRequestHeader "content-type", "application/x-www-form-urlencoded"
XHR.send "gtipkategori=" & Sheet.Cells(SheetRow, 1).Value
Set Document = New HTMLDocument
Document.body.innerHTML = XHR.responseText
If Document.getElementsByTagName("table").Length > 0 Then
Debug.Print " > Found Results"
SheetColumn = 2 ' First Column to output data into
Set ResultRows = Document.getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
For Each ResultRow In ResultRows
Sheet.Cells(SheetRow, SheetColumn).Value = ResultRow.getElementsByTagName("td")(0).innerText ' 0 - company name
SheetColumn = SheetColumn + 1
Next
Else
Debug.Print " - No Results"
End If
Set XHR = Nothing
Set Document = Nothing
Application.Wait (Now + TimeValue("0:00:01")) ' slow down requests
Next
Application.StatusBar = "Complete"
End Sub

Webscrape a specific part of a webpage

My webscrape stopped working. The owner changed the html.
I believe it is the Set allElements = doc.getElementsByClassName("el-col el-col-8") line that needs changing.
I am trying to grab text from the webpage that includes the "52-week Range (undefined)" section. I managed to grab text from before and after but not the section I need. An example webpage is https://www.gurufocus.com/stock/gliba/summary and my code should fill my cell with "38.72 - 73.63" after I do some trimming.
I need to do it this way so I can get my head round it and change it in the future when necessary so please just focus on correcting my set line of code (assuming that is the problem!) rather than a whole new more sophisticated method as it will be beyond me. (My other set line of code does what I want it to do.)
Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim incomeStmtURLs As Variant
Dim sURL As String
Dim lastrow As Long
Dim allRowOfData As Object
Dim i As Integer
Dim allElements As IHTMLElementCollection
Dim anElement As IHTMLElement
Dim aCell As HTMLTableCell
Application.DisplayAlerts = False
Call ToggleEvents(False)
incomeStmtURLs = Range("Sheet1!h1:h2").Value
For i = 1 To UBound(incomeStmtURLs)
Set wb = CreateObject("internetExplorer.Application")
sURL = incomeStmtURLs(i, 1)
wb.navigate sURL
wb.Visible = False
While wb.Busy
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
Set doc = wb.document
On Error GoTo err_clear
Set allElements = doc.getElementsByClassName("el-col el-col-8")
While allElements.Length = 0
Application.Wait Now + #12:00:01 AM#
DoEvents
Wend
x = allElements(0).innerText
' Debug.Print x
Sheet6.Cells(i + 1, 2).Value = Trim(Replace(Mid(x, InStr(1, x, "52-Week Range (undefined)") + 25, 25), vbLf, ""))
Set allElements = doc.getElementsByClassName("fs-x-large fc-primary fw-bolder")
x = allElements(0).innerText
Sheet6.Cells(i + 1, 4).Value = Trim(Replace(Mid(x, InStr(1, x, "$") + 1, 7), vbLf, ""))
err_clear:
If Err <> 0 Then
Err.Clear
Resume Next
End If
wb.Quit
Next i
Call ToggleEvents(True)
End Sub
Sub ToggleEvents(blnState As Boolean)
Application.DisplayAlerts = blnState
Application.EnableEvents = blnState
If blnState Then Application.CutCopyMode = False
If blnState Then Application.StatusBar = False
End Sub

The page dynamically updates content as you scroll down. You likely need to scroll that part of the page into view then use grab all the elements with classname statictics-item then take the n-2 index e.g. Without the scrolling part:
Set elems = ie.document.getElementsByClassName("statictics-item")
If elems.length > 1 Then Debug.print elems(elems.length-2).innerText
For future readers (I know OP doesn't want this):
I would avoid the whole scrolling pickle, dynamic html and browser and issue an xmlhttp request and regex out the appropriate values from the javscript objects the web page uses for updating. N.B. I would probably add in validation on regex match positions.
Public Sub test()
Dim r As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.gurufocus.com/stock/gliba/summary", False
.send
r = GetMatches(.responseText, "price52wlow:(.*?),|price52whigh:(.*?),")
If r <> "NA" Then MsgBox r
End With
End Sub
Public Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As String
Dim matches As Object
With CreateObject("vbscript.regexp")
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
'If .test(inputString) Then
Set matches = .Execute(inputString)
If matches.Count = 2 Then
GetMatches = matches.Item(0).submatches(0) & "-" & matches.Item(1).submatches(1)
Else
GetMatches = "NA"
End If
End With
End Function
Regex:

Retrieving all Excel file links from a webpage

I'm trying to get all the downloadable Excel file links from the website, but having difficulty. Please help to guide me. Thanks.
Sub TYEX()
Dim internet As Object
Dim internetdata As Object
Dim div_result As Object
Dim header_links As Object
Dim link As Object
Dim URL As String
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = True
URL = "https://www.jpx.co.jp/markets/public/short-selling/index.html"
internet.Navigate URL
Do Until internet.ReadyState >= 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 5)
Set internetdata = internet.Document
Set div_result = internetdata.getElementById("readArea")
Set header_links = div_result.getElementsByTagName("td")
For Each h In header_links
Set link = h.ChildNodes.item(0)
Cells(Range("A" & Rows.Count).End(xlUp).Row + 1, 1) = link.href
Next
MsgBox "done"
End Sub

You had the idea down correctly, but here's a different approach:
Sub TYEX()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.navigate "https://www.jpx.co.jp/markets/public/short-selling/index.html"
.Visible = True
Do While .Busy Or .readyState < 4
DoEvents
Loop
Dim doc As Object, tbl As Object
Set doc = .document
Set tbl = doc.getElementsByClassName("component-normal-table")(0).Children(0)
Dim r As Long, xlsArr(), a As Object
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
End With
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With
End Sub
Public Function NextRow(ByVal ws As Worksheet, Optional ByVal col As Variant = 1) As Long
With ws
NextRow = .Cells(.Rows.Count, col).End(xlUp).Row + 1
End With
End Function
Breaking Down the Code
This will loop your html table rows. We start at 1, because 0 is actually just the table header.
With tbl.Rows
ReDim xlsArr(1 To .Length - 1)
For r = 1 To .Length - 1 ' 0 is the table header
xlsArr(r) = .Item(r).Children(1).innerHTML
Next r
End With
This uses regular expressions to extract the url from the innerHTML property. You can see how this particular regex works here: Regex101
With CreateObject("VBScript.RegExp")
.Pattern = "<a href=""(\/markets.*?\.xls)"
For r = 1 To UBound(xlsArr)
xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
Debug.Print xlsArr(r)
Next
End With
You will size your range to be the same size as your array that contains the links, then write the array to the worksheet. This is usually much faster than writing cells one-by-one.
'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
xlsArr) - 1, 1))
rng.Value = Application.Transpose(xlsArr)
End With

You can use an attribute = value CSS selector with $ operator to say the href value must end with .xls. Then use querySelectorAll to retrieve all matched results. Using CSS selectors is a very fast and generally robust method.
Dim list As Object
Set list = ie.document.querySelectorAll("[href$='.xls']")
It is much faster to use XMLHTTP as well, rather than opening IE. Note that you can then pass these links to a function to perform a binary download or to URLMon for downloading.
Option Explicit
Public Sub Links()
Dim sResponse As String, html As HTMLDocument, list As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.jpx.co.jp/markets/public/short-selling/index.html", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set list = html.querySelectorAll("[href$='.xls']")
End With
For i = 0 To list.Length - 1
Debug.Print Replace$(list.item(i), "about:", "https://www.jpx.co.jp")
Next
End Sub
Example download function (though you could re-use your existing XMLHTTP object - this is just to illustrate):
Public Function DownloadFile(ByVal downloadFolder As String, ByVal downloadURL As String) As String
Dim http As Object , tempArr As Variant
Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
http.Open "GET", downloadURL, False
http.send
On Error GoTo errhand
With CreateObject("ADODB.Stream")
.Open
.Type = 1
.write http.responseBody
tempArr = Split(downloadURL, "/")
tempArr = tempArr(UBound(tempArr))
.SaveToFile downloadFolder & tempArr, 2 '< "/" on enter of downloadFolder. 2 for overwrite which is Ok if no file modifications.
.Close
End With
DownloadFile = downloadFolder & tempArr
Exit Function
errhand:
If Err.Number <> 0 Then
Debug.Print Err.Number, Err.Description
MsgBox "Download failed"
End If
DownloadFile = vbNullString
End Function
References (VBE > Tools > References):
Microsoft HTML Object Library

Web scraping using VBA

I would like to extract data from this URL.
I want to extract Title, mobile contact number and address from each of 10 business cards.
Here is some code I tried but did not get success.
Public Sub GetValueFromBrowser()
On Error Resume Next
Dim Sn As Integer
Dim ie As Object
Dim url As String
Dim Doc As HTMLDocument
Dim element As IHTMLElement
Dim elements As IHTMLElementCollection
For Sn = 1 To 1
url = Sheets("Infos").Range("C" & Sn).Value
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = 0
.navigate url
While .Busy Or .readyState <> 4
DoEvents
Wend
End With
Set Doc = ie.document
Set elements = Doc.getElementsByClassName(" col-sm-5 col-xs-8 store-details sp-detail paddingR0")
Dim count As Long
Dim erow As Long
count = 0
For Each element In elements
If element.className = "lng_cont_name" Then
erow = Sheet1.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
Cells(erow, 1) = Doc.getElementsByClassName("Store-Name")(count).innerText
Cells(erow, 2) = Doc.getElementsByClassName("cont_fl_addr")(count).innerText
count = count + 1
End If
Next element
If Val(Left(Sn, 2)) = 99 Then
ActiveWorkbook.Save
End If
Next Sn
End Sub

The tel numbers were not easy as I think they have purposefully been made difficult to scrape. I have found a way to decipher the values from the CSS pseudo ::before element content. The addresses and titles were straightforward CSS selections.
I have since written a cleaner script in python here.
So, how do the various parts of the code work?
titles:
Set titles = .querySelectorAll(".jcn [title]")
I target the titles as elements that have a title attribute with a parent jcn class attribute. The "." indicates a class selector, the "[]" an attribute selector, and the " " in between is a descendant combinator.
querySelectorAll method of document returns a nodeList of all matching elements on the page i.e. the 10 titles.
addresses:
Set addresses = .querySelectorAll(".desk-add.jaddt")
Addresses are targeted by their class attribute desk-add jaddt. As compound class names are not allowed, an additional "." has to be replace the white space in the name.
Telephone numbers (via deciphering contents within storesTextToDecipher) :
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
This is where the magic happens. The numbers are not available via the DOM directly as they are pseudo element content.
If you inspect the relevant HTML you will find a series of pseudo ::before elements. VBA exposes no mechanism for applying pseudo selectors to try and get at this info in the CSS for the page.
What you see is in fact a series of span elements that each have a class attribute beginning with mobilesv. These elements sit within a single parent element of class col-sm-5 col-xs-8 store-details sp-detail paddingR0 (note again the compound class name).
I initially gather a nodeList of all the parent elements.
Sample of returned elements:
Each of these parent elements houses the class name (beginning with mobilesv) elements that constitute the characters of the telephone number string. Some characters are numbers in the string, others represent the +()-. Note: The 2|3 letter strings, in the class names, after icon- e.g. dc, fe.
For example, the first search result on the page, for the initial number 9 in the telephone number:
The actual CSS content for this pseudo element /telephone character can be observed in the CSS styling:
Note the class name and before pseudo element selector: .icon-ji:before
And the content is \9d010.
Long story short.... you can extract the 2 or 3 letters after icon- i.e. ji in this case, and the number string after \9d0, i.e. 10 in this case, and use these two bits of info to decipher the telephone number. This info is available in the response:
See the same 2/3 letter strings that are associated with the class names of the telephone string on the left, and the content instructions on the right. A little maths deduces that the number on the right is 1 greater than the telephone number, for that class, shown in the image of the webpage. I simply create a dictionary that then maps the 2/3 letter abbreviation to the telephone number by parsing this section of the html.
When looping over storesTextToDecipher, I use this dictionary to decipher the actual telephone number from the matching 2/3 letter abbreviation in the class name.
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
html.body.innerHTML = sResponse
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
With html
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
End With
For i = 0 To titles.Length - 1
Debug.Print "title: " & titles.item(i).innerText
Debug.Print "address: " & addresses.item(i).innerText
Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Sample output:
Edit: All page results
As you now want more than 10 the following uses the expected page result count (NUMBER_RESULTS_ON_PAGE) to gather the information from the page. It scrolls the page until the expected number of telephone numbers (which should be unique) are found, or the MAX_WAIT_SEC is hit. This means you avoid an infinite loop and can set your expected result count if you expect a different number. It does rely on each store having a telephone number listed - this seems a fairly reasonable assumption.
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
Const NUMBER_RESULTS_ON_PAGE As Long = 80
Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
Application.ScreenUpdating = True
Set resultCountDict = CreateObject("Scripting.Dictionary")
Set cipherDict = GetCipherDict(URL)
With IE
.Visible = True
.Navigate2 URL
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
t = Timer
Do
DoEvents
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
Dim telNumber As String, i As Long
For i = 0 To titles.Length - 1
telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
If Not resultCountDict.Exists(telNumber) Then
resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
End If
Next
.parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
While IE.Busy Or IE.readyState < 4: DoEvents: Wend
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE
End With
.Quit
End With
Dim key As Variant, rowCounter As Long
rowCounter = 1
With ThisWorkbook.Worksheets("Sheet1")
For Each key In resultCountDict.keys
.Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
rowCounter = rowCounter + 1
Next
End With
Application.ScreenUpdating = True
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Public Function GetCipherDict(ByVal URL As String) As Object
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
Set GetCipherDict = cipherDict
End Function
EDIT:
Version for where more than one number is present at top (Please note that if you make too many requests or too quickly server will serve you random pages):
Option Explicit
Public Sub GetDetails()
Dim re As Object, decodeDict As Object, i As Long
Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
Set decodeDict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
Set html = New MSHTML.htmlDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
.setRequestHeader "User-Agent", "Mozilla/4.0"
.send
responseText = .responseText
html.body.innerHTML = responseText
End With
keys = GetMatches(re, responseText, "-(\w+):before")
If UBound(keys) = 0 Then Exit Sub
values = GetMatches(re, responseText, "9d0(\d+)", True)
For i = LBound(values) To UBound(values)
decodeDict(keys(i)) = values(i)
Next
Dim itemsToDecode()
decodeDict(keys(UBound(keys))) = "+"
itemsToDecode = GetValuesToDecode(html)
PrintNumbers re, html, itemsToDecode, decodeDict
End Sub
Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .Test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
If numeric Then
arrMatches(i) = iMatch.SubMatches.item(0) - 1
Else
If spanSearch Then
arrMatches(i) = iMatch
Else
arrMatches(i) = iMatch.SubMatches.item(0)
End If
End If
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
Dim i As Long, elements As Object, results(), Class As String
Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
ReDim results(elements.Length - 1)
For i = 0 To elements.Length - 1
Class = elements.item(i).className
results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
Next
GetValuesToDecode = results
End Function
Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
Dim output As String, i As Long
For i = LBound(itemsToDecode) To UBound(itemsToDecode)
output = output & decodeDict(itemsToDecode(i))
Next
Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
htmlToSearch = html.querySelector(".telCntct").outerHTML
groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
startPos = 1
Dim totalNumbers As Long
For i = LBound(groups) To UBound(groups)
If InStr(groups(i), ",") > 0 Then
totalNumbers = totalNumbers + 1
Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
startPos = i + 1
End If
Next
If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub

Issue With Looping/Variables Through HTML Table Elements

I have this project I'm working on which involves tracking the order statuses of individual shipments from a list of URLs on an Excel sheet. Currently, the code is able to loop through the URLs fine and extract information from them, but when I try to add exceptions and variables to the loop for each different URL, I get a type mismatch. Currently, there are 3 variables I need in order to extract the HTML information from the HTMLCollection/Table. What I'm trying to do here is:
Cycle through each URL (inserts URL in IE and keeps going without opening new tabs)
Obtain the status of the item for each URL from the HTML element
FedEx: Delivered (td class="status")
UPS: Delivered (id="tt_spStatus")
USPS: Arrived at USPS Facility (class= "info-text first)
My code:
Sub TrackingDeliveryStatusUpdate()
Application.Calculation = xlCalculationManual
Application.ScreenUpdating = False
Application.EnableEvents = True
Dim rngURL As Range
Dim wb1 As Workbook, ws1 As Worksheet, ws2 As Worksheet
Dim MyURL As String
Dim Rows As Long, links As Variant, IE As InternetExplorer, link As Variant
Dim i As Long
Dim sID As String
Dim rngLinks As Range, rngLink As Range
Dim filterRange As Range
Dim copyRange As Range
'Dim doc As Object
Set wb1 = Application.Workbooks.Open("\\S51\Store51\Employee Folders\Jason\TrackingDeliveryStatus.xls")
Set ws1 = wb1.Worksheets("TrackingDeliveryStatusResults")
'Set rngURL = ws1.Range("C2:C" & lastRow)
'Arr = rngURL
Set IE = New InternetExplorer
Rows = ws1.Cells(ws1.Rows.Count, "A").End(xlUp).Row
Set rngLinks = ws1.Range("C2:C" & Rows)
'Loop through each used cell in column C on sheet URLs
'For Each rngURL In Worksheets("TrackingDeliveryStatusResults").Range("C2", Worksheets("TrackingDeliveryStatusResults").Range("C" & Rows.Count).End(xlUp))
'Set HTMLDoc = IE.Document
With IE
.Visible = True
i = 2
For Each rngLink In rngLinks
.Navigate (rngLink)
While .Busy Or .ReadyState <> 4: DoEvents: Wend
'If InStr(1, URL.Value, "fedex") Then sID = "status"
Dim doc As Object
Set doc = IE.Document
'I know this line right below is a big mess. Not sure what to do here.
If InStr(1, rngLink.Value2, "ups") Then
sID = "tt_spStatus"
'If InStr(3, URL.Value, "usps") Then sID = "info-text first"
ws1.Range("D" & i).Value = doc.getElementById(sID).Items(0).Value
'ws1.Range("D" & i).Value = .Document.body.innerText
End If
Next rngLink
i = i + 1
End With
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True
Application.EnableEvents = True
End Sub
I figured this would work, but apparently it doesn't with referencing URLs from an Excel spreadsheet.

You should dim Link and Links as range and change:
links = ws1.Range("C2:C" & Rows)
to:
Set Links = ws1.Range("C2:C" & Rows)
You are trying to make a Variant equal a Range which can't happen. But once you set Links to the range of cells, then when you're looping through each cell reference the URL with link.Value2.

The reason is not working is because you have several concepts mixed up in your statements. I have refactored the relevant parts of the code and made some suggestions on variable type and name changes
Dim rngLinks as Range, rngLink as Range
Set rng Links = ws1.Range("C2:C" & Rows)
For each rngLink in rngLinks
Just because here you are looping through cells in a worksheet range.
Then inside the For each rngLink in rngLinks statement, after the statement where the code waits for IE to load.
Dim doc as Object
Set doc = .document
If InStr(1, rngLink.Value2, "ups") Then
sID = "tt_spStatus"
ws1.Range("D" & I).Value = doc.getElementById(sID).Items(0).Value
End If
You can also delete the For Each TDelement In sID block. It is not needed.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Fetching data from web page - html

Related

Getting the text contents from a HTML Table without ID by using VBA

Webscrape a specific part of a webpage

Retrieving all Excel file links from a webpage

Web scraping using VBA

Issue With Looping/Variables Through HTML Table Elements

Categories

Resources