html parsing of cricinfo scorecards - html

Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel
As an example the current Australian Big Bash 2011/12 scorecards are available from
Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE or using XMLHTTP and then using regular expressions) to scrape data from websites, ie
Extract values from HTML TD and Tr
In that same question a comment was posted suggesting html parsing - which I hadn't come accross before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags
Query
While I could write a regex to parse the cricket data below I would like advice as to how I could efficiently retrieve these results with html parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
the date/name of the match
Team 1 name
the output should dump up to 11 records for Team 1 (blank records where players haven't batted, ie "Did Not Bat")
Team 2 name
the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
Nirvana for me would be a solution that I could deploy using VBA or VBscript so I could fully automate my analysis, but I presume I will have to use a separate tool for the html parse.
Sample Site links and Data to be Extracted

There are 2 techniques that I use for "VBA". I will describe them 1 by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many so I will even cover the obvious. Please feel free to skip whatever part you know
1) Using FireFox / Firebug Addon / Fiddler
FireFox : http://en.wikipedia.org/wiki/Firefox
Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29
Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29
Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scrapped.
I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of BOTS which scrape the sites, most sites now, to prevent automatic scrapping, capture your mouse coordinates and pass that information and fiddler actually helps you in debugging that info that is being passed. I will not get into much details here about it as this info can be used maliciously.
Now let's take a simple example on how to scrape the URL posted in your question
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1" If we can extract the contents of that table to an Excel file then we can definitely work with the data to do our analysis. Here is sample code which will dump that table in Sheet1
Before we proceed, I would recommend, closing all Excel and starting a fresh instance.
Launch VBA and insert a Userform. Place a command button and a webcrowser control. Your Userform might look like this
Paste this code in the Userform code area
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Sub CommandButton1_Click()
Dim URL As String
Dim oSheet As Worksheet
Set oSheet = Sheets("Sheet1")
URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
PopulateDataSheets oSheet, URL
MsgBox "Data Scrapped. Please check " & oSheet.Name
End Sub
Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim insertRow As Long, Row As Long, col As Long
On Error GoTo whoa
WebBrowser1.navigate URL
WaitForWBReady
Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
With wsk
.Cells.Clear
insertRow = 0
For Row = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(Row)
If Trim(tr.innerText) <> "" Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Total" Then
insertRow = insertRow + 1
For col = 0 To tr.Cells.Length - 1
.Cells(insertRow, col + 1) = tr.Cells(col).innerText
Next
End If
End If
End If
Next
End With
whoa:
Unload Me
End Sub
Private Sub Wait(ByVal nSec As Long)
nSec = nSec + Timer
While Timer < nSec
DoEvents
Sleep 100
Wend
End Sub
Private Sub WaitForWBReady()
Wait 1
While WebBrowser1.ReadyState <> 4
Wait 3
Wend
End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the url in the "New Web Query Window" and click on "Go"
Once the page is uploaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
, Destination:=Range("$A$1"))
.Name = "524915"
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = """inningsBat1"""
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
End Sub
Hope this helps. Let me know if you still have some queries.
Sid

For anyone else interested in this I ended up using the code below based on Siddhart Rout's earlier answer
XMLHttp was significantly quicker than automating IE
the code generates a CSV file for each series to be dowloaded (held in the X variable)
the code dumps each match to a regular 29 row range (regardless of how many players batted) to facillitate easier analysis later on
Public Sub PopulateDataSheets_XML()
Dim URL As String
Dim ws As Worksheet
Dim lngRow As Long
Dim lngRecords As Long
Dim lngWrite As Long
Dim lngSpare As Long
Dim lngInnings As Long
Dim lngRow1 As Long
Dim X(1 To 15, 1 To 4) As String
Dim objFSO As Object
Dim objTF As Object
Dim xmlHttp As Object
Dim htmldoc As HTMLDocument
Dim htmlbody As htmlbody
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim strInnings As String
s = Timer()
Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
Set objFSO = CreateObject("scripting.filesystemobject")
X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
X(1, 2) = 501198
X(1, 3) = 501271
X(1, 4) = "indian-premier-league-2011"
X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
X(2, 2) = 524915
X(2, 3) = 524945
X(2, 4) = "big-bash-league-2011"
X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
X(3, 2) = 461028
X(3, 3) = 461047
X(3, 4) = "big-bash-league-2010"
Set htmldoc = New HTMLDocument
Set htmlbody = htmldoc.body
For lngRow = 1 To UBound(X, 1)
If Len(X(lngRow, 1)) = 0 Then Exit For
Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
For lngRecords = X(lngRow, 2) To X(lngRow, 3)
URL = X(lngRow, 1) & lngRecords & ".html"
xmlHttp.Open "GET", URL
xmlHttp.send
Do While xmlHttp.Status <> 200
DoEvents
Loop
htmlbody.innerHTML = xmlHttp.responseText
objTF.writeline X(lngRow, 1) & lngRecords & ".html"
For lngInnings = 1 To 2
strInnings = "Innings " & lngInnings
objTF.writeline strInnings
Set tbl = Nothing
On Error Resume Next
Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
On Error GoTo 0
If Not tbl Is Nothing Then
lngWrite = 0
For lngRow1 = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(lngRow1)
If Trim(tr.innerText) <> vbNewLine Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Extras" Then
If Len(tr.Cells(1).innerText) > 0 Then
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
End If
Else
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
Exit For
End If
End If
End If
Next
For lngSpare = 12 To lngWrite Step -1
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
Else
For lngSpare = 1 To 13
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
End If
Next
Next
Next
'Call ConsolidateSheets
End Sub

RegEx is not a complete solution for parsing HTML because it is not guaranteed to be regular.
You should use the HtmlAgilityPack to query the HTML. This will allow you to use the CSS selectors to query the HTML similar to how you do it with jQuery.

As quite a few people may see this I thought I would use it as a chance to demonstrate a few features I rarely see people using in VBA web-scraping: deleteRow, querySelector and use of clipboard to write out a table (complete with formatting and hyperlinks) to a sheet based on the table.outerHTML.
deleteRow is used to remove the unwanted rows. querySelector is used to apply faster css selectors to match on nodes. Modern browsers/html parsers are optimized for css and class selectors (which I use) are the second fastest selector type (after id).
Use of css selectors and understanding htmlTable methods/properties will allow for much greater flexibility in your web-scraping endeavours. Understanding the use of the clipboard means a simple copy paste method for transferring a table to Excel.
Execution could easily be tied to a button push and the url read in from a cell.
VBA:
Option Explicit
Public Sub test()
WriteOutTable "https://www.espncricinfo.com/series/8044/scorecard/524935/hobart-hurricanes-vs-melbourne-stars-big-bash-league-2011-12"
End Sub
Public Sub WriteOutTable(ByVal url As String)
'required VBE (Alt+F11) > Tools > References > Microsoft HTML Object Library ; Microsoft XML, v6 (your version may vary)
Dim hTable As MSHTML.HTMLTable, clipboard As Object
Dim xhr As MSXML2.xmlhttp60, html As MSHTML.htmlDocument
Set xhr = New MSXML2.xmlhttp60
Set html = New MSHTML.htmlDocument
With xhr
.Open "GET", url, False
.Send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".batsman")
rowCount = hTable.Rows.Length - 1
For i = rowCount To 0 Step -1
Select Case True
Case i = rowCount Or i = rowCount - 1 Or InStr(hTable.Rows(i).outerHTML, "wicket-details") > 0
hTable.deleteRow i
End Select
Next
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ActiveSheet.Cells(1, 1).PasteSpecial
End Sub

Related

Return URL From First Search Result

I have an Excel workbook of around 25,000 company keywords from which I'd like to get the company website URL.
I am looking to run a VBA script which can run these keywords as a Google search, and pull the URL of the first result into a spreadsheet.
I found a similar thread.
The results of this to be hit-and-miss; some keywords return the URL in the next column, others remain blank.
It also seemed to pull the URL of Google's optimised sub-links in the first search result rather than the main website URL: Google Search Result example
I then found the below code here which I ran on a sample list of 1,000 keywords. The author of this blog stipulates that this code works for Mozilla Firefox.
I tested IE code that he has also written but this did not achieve the same results (it was adding hyperlinks consisting of descriptive text from the search results rather than the raw URL).
The Firefox code worked until the 714th row, then returned a error message
"Run time error 91: object variable or with block variable not set"
Spreadsheet layout showing successful results and row at which macro stopped
Sub GoogleURL ()
Dim url As String, lastRow As Long
Dim XMLHTTP As Object
Dim html As Object
Dim objResultDiv As Object
Dim objH As Object
lastRow = Range(“A” & Rows.Count).End(xlUp).Row
For i = 2 To lastRow
url = “https://www.google.co.uk/search?q=” & Cells(i, 1) & “&rnd=” & WorksheetFunction.RandBetween(1, 10000)
Set XMLHTTP = CreateObject(“MSXML2.serverXMLHTTP”)
XMLHTTP.Open “GET”, url, False
XMLHTTP.setRequestHeader “Content-Type”, “text/xml”
XMLHTTP.setRequestHeader “User-Agent”, “Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0”
XMLHTTP.send
Set html = CreateObject(“htmlfile”)
html.body.innerHTML = XMLHTTP.ResponseText
Set objResultDiv = html.getelementbyid(“rso”)
Set objH = objResultDiv.getelementsbytagname(“h3”)(0)
Cells(i, 2).Value = objH.innerText
Set html = CreateObject(“htmlfile”)
html.body.innerHTML = XMLHTTP.ResponseText
Set objResultDiv = html.getelementbyid(“rso”)
Set objH = objResultDiv.getelementsbytagname(“cite”)(0)
Cells(i, 3).Value = objH.innerText
DoEvents
Next
End Sub
As Firefox is a third party browser for the support scope of Microsoft, I can help you to check the VBA code for the IE browser.
You said that the VBA code given in this link for the IE browser generates the description with the link and your requirement is to store description and link in a separate column.
I tried to modify that sample code as per your requirement.
Here is the modified code from that sample.
Option Explicit
Const TargetItemsQty = 1 ' results for each keyword
Sub GWebSearchIECtl()
Dim objSheet As Worksheet
Dim objIE As Object
Dim x As Long
Dim y As Long
Dim strSearch As String
Dim lngFound As Long
Dim st As String
Dim colGItems As Object
Dim varGItem As Variant
Dim strHLink As String
Dim strDescr As String
Dim strNextURL As String
Set objSheet = Sheets("Sheet1")
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Visible = True ' for debug or captcha request cases
y = 1 ' start searching for the keyword in the first row
With objSheet
.Select
.Range(.Columns("B:B"), .Columns("B:B").End(xlToRight)).Delete ' clear previous results
.Range(.Columns("C:C"), .Columns("C:C").End(xlToRight)).Delete ' clear previous results
.Range("A1").Select
Do Until .Cells(y, 1) = ""
x = 2 ' start writing results from column B
.Cells(y, 1).Select
strSearch = .Cells(y, 1) ' current keyword
With objIE
lngFound = 0
.navigate "https://www.google.com/search?q=" & EncodeUriComponent(strSearch) ' go to first search results page
Do
Do While .Busy Or Not .READYSTATE = 4: DoEvents: Loop ' wait IE
Do Until .document.READYSTATE = "complete": DoEvents: Loop ' wait document
Do While TypeName(.document.getelementbyid("res")) = "Null": DoEvents: Loop ' wait [#res] element
Set colGItems = .document.getelementbyid("res").getElementsByClassName("g") ' collection of search result [.g] items
For Each varGItem In colGItems ' process each item in collection
If varGItem.getelementsbytagname("a").Length > 0 And varGItem.getElementsByClassName("st").Length > 0 Then ' must have hyperlink and description
strHLink = varGItem.getelementsbytagname("a")(0).href ' get first hyperlink [a] found in current item
strDescr = GetInnerText(varGItem.getElementsByClassName("st")(0).innerHTML) ' get first description [span.st] found in current item
lngFound = lngFound + 1
'Debug.Print (strHLink)
'Debug.Print (strDescr)
With objSheet ' put result into cell
.Cells(y, x).Value = strDescr
.Hyperlinks.Add .Cells(y, x + 1), strHLink
.Cells(y, x).WrapText = True
x = x + 1 ' next column
End With
If lngFound = TargetItemsQty Then Exit Do ' continue with next keyword - necessary quantity of the results for current keyword found
End If
DoEvents
Next
If TypeName(.document.getelementbyid("pnnext")) = "Null" Then Exit Do ' continue with next keyword - no [a#pnnext.pn] next page button exists
strNextURL = .document.getelementbyid("pnnext").href ' get next page url
.navigate strNextURL ' go to next search results page
Loop
End With
y = y + 1 ' next row
Loop
End With
objIE.Quit
' google web search page contains the elements:
' [div#res] - main search results block
' [div.g] - each result item block within [div#res]
' [a] - hyperlink ancor(s) within each [div.g]
' [span.st] - description(s) within each [div.g]
' [a#pnnext.pn] - hyperlink ancor to the next search results page
End Sub
Function EncodeUriComponent(strText As String) As String
Static objHtmlfile As Object
If objHtmlfile Is Nothing Then
Set objHtmlfile = CreateObject("htmlfile")
objHtmlfile.parentWindow.execScript "function encode(s) {return encodeURIComponent(s)}", "jscript"
End If
EncodeUriComponent = objHtmlfile.parentWindow.encode(strText)
End Function
Function GetInnerText(strText As String) As String
Static objHtmlfile As Object
If objHtmlfile Is Nothing Then
Set objHtmlfile = CreateObject("htmlfile")
objHtmlfile.Open
objHtmlfile.Write "<body></body>"
End If
objHtmlfile.body.innerHTML = strText
GetInnerText = objHtmlfile.body.innerText
End Function
Output in IE 11 browser:
You can try to run it on your side to see the results with large amount of data.
If you meet with any performance issue then I suggest you try it with a smaller amount of data.

Search a website using excel vba with excel data and extract the active state in flowchart of search result and mapping it into column

I am hoping someone can help....
I have around 7000 values in a excel spreadsheet that I need to search in a website and then record active state of result flowchart from the website to be inputted back into the excel spreadsheet. Since I am new to macros web scrape I used to automate web code modified input ids for the website which I want to extract information (https://nacionalidade.justica.gov.pt/). I am a bit confused in how to apply if condition to get the active state having seven classes in flowhchart, Here is the flow chart.
Now that I have access codes each will be on different stage, I only want to pick the state and put it in column E in front of the access code(currently doing manually)
I am unclear how to extract that info being new to this type of web data extraction - any help would be incredible!
Here is my code:(couldn't be able to change for mentioned web after this)
objIE.document.getElementById("btnPesquisa").Click
Code:
'start a new subroutine called SearchBot
Sub SearchBot()
'dimension (declare or set aside memory for) our variables
Dim objIE As InternetExplorer 'special object variable representing the IE browser
Dim aEle As HTMLLinkElement 'special object variable for an <a> (link) element
Dim y As Integer 'integer variable we'll use as a counter
Dim result As String 'string variable that will hold our result link
'initiating a new instance of Internet Explorer and asigning it to objIE
Set objIE = New InternetExplorer
'make IE browser visible (False would allow IE to run in the background)
objIE.Visible = True
'navigate IE to this web page (a pretty neat search engine really)
objIE.navigate "https://nacionalidade.justica.gov.pt/"
'wait here a few seconds while the browser is busy
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'in the search box put cell "A2" value, the word "in" and cell "C1" value
objIE.document.getElementById("SenhaAcesso").Value = _
Sheets("Guy Touti").Range("D2").Value
'click the 'go' button
objIE.document.getElementById("btnPesquisa").Click
'wait again for the browser
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'the first search result will go in row 2
y = 2
'for each <a> element in the collection of objects with class of 'result__a'...
For Each aEle In objIE.document.getElementsByClassName("result__a")
'...get the href link and print it to the sheet in col C, row y
result = aEle
Sheets("Guy Touti").Range("E" & y).Value = result
'...get the text within the element and print it to the sheet in col D
Sheets("Guy Touti").Range("D" & y).Value = aEle.innerText
Debug.Print aEle.innerText
'is it a yellowpages link?
If InStr(result, "yellowpages.com") > 0 Or InStr(result, "yp.com") > 0 Then
'make the result red
Sheets("Guy Touti").Range("C" & y).Interior.ColorIndex = 3
'place a 1 to the left
Sheets("Guy Touti").Range("B" & y).Value = 1
End If
'increment our row counter, so the next result goes below
y = y + 1
'repeat times the # of ele's we have in the collection
Next
'add up the yellowpages listings
Sheets("Guy Touti").Range("B1").Value = _
Application.WorksheetFunction.Sum(Sheets("Guy Touti").Range("B2:B100"))
'close the browser
objIE.Quit
'exit our SearchBot subroutine
End Sub
I did try this first but after a while started searching for a better way. Can you help????
You can simplify the POST XHR request the page makes to get data and use the classnames to limit to nodes with either active1 or active3. Take the last node in that nodelist and extract the step number and convert colour via lookup (if wanted). With 7,000 requests it might be considerate to add a delay in every 50 requests, or less, of 1-2 seconds. You can i mod 50 to determine this in the loop and use Application.Wait Now + Timeserial(0,0,2)
Option Explicit
Public Sub GetStatus()
Dim html As MSHTML.HTMLDocument, xhr As Object, colourLkup As Object
Dim ws As Worksheet, senhas(), i As Long, results()
Set ws = ThisWorkbook.Worksheets("Sheet1")
senhas = Application.Transpose(ws.Range("D2:D" & ws.Cells(ws.Rows.Count, "D").End(xlUp).Row))
ReDim results(1 To UBound(senhas))
Set colourLkup = CreateObject("Scripting.Dictionary")
colourLkup.Add "active1", "green"
colourLkup.Add "active3", "orange"
Set html = New MSHTML.HTMLDocument
Set xhr = CreateObject("MSXML2.XMLHTTP")
For i = LBound(senhas) To UBound(senhas)
If senhas(i) <> vbNullString Then
With xhr
.Open "POST", "https://nacionalidade.justica.gov.pt/Home/GetEstadoProcessoAjax", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send "SenhaAcesso=" & senhas(i)
html.body.innerHTML = .responseText
End With
Dim nodes As Object, classinfo() As String
Set nodes = html.querySelectorAll(".active1, .active3")
classinfo = Split(nodes(nodes.Length - 1).className, Chr$(32))
results(i) = Replace$(classinfo(1), "step", vbNullString) & "-" & colourLkup(classinfo(2))
End If
Set nodes = Nothing
Next
ws.Cells(2, 5).Resize(UBound(results), 1) = Application.Transpose(results)
End Sub

How can I retrieve Amazon's keyword/phrase suggestions from the search bar

Below is some code I've found and altered to attempt to capture the keyword/phrase suggestions from Amazon's search bar. I'm very new to the concept of web scraping, so I know the code presented here may be very ineffective and inefficient. I've manually captured some data from the F12 DOM Explorer and Network windows. If the best answer is web scraping, I need that in the form of excel vba. I see in some of the below images that it appears as though some of the content type from the Network window is "application/json" and the Initiator/Type is "XMLHttpRequest", but this is only after it shows a connection and authentication to "https://completion.amazon.com". If that's the route, I have no idea how to complete those requests. Any help would be much appreciated.
So far I've tried invoking the search bar programmatically, via the scripts in the code, but that does nothing that I can see. Simply 'pasting' the keyword into the search bar with a 'space' appended to it does not produce the suggested keywords. However, typing into the search bar does. If I type the keyword in, then choose 'inspect element' of the dropdown suggestions, dynamic HTML is produced to show the HTML content of the suggestions (at which time I can get what I need). I've been unsuccessful in getting to that point.
Private Sub CommandButton1_Click()
Dim MyHTML_Element As IHTMLElement
Dim MyURL As String
Dim AASearchRank As Workbook
Dim AAws As Worksheet
Dim InputSearch As HTMLInputTextElement
Dim elems As IHTMLElementCollection
Dim TDelement As HTMLTableCell
Dim elems2 As IHTMLElementCollection
Dim TDelement2 As HTMLDivElement
'Dim TDelement2 As HTMLInputTextElement
Dim InputSearchButton As HTMLInputButtonElement
Dim IE As InternetExplorer
Dim x As Integer
Dim i As Long
MyURL = "https://www.amazon.com/"
Set IE = New InternetExplorer
With IE
.Silent = True
.Navigate MyURL
.Visible = True
Do
DoEvents
Loop Until .ReadyState = READYSTATE_COMPLETE
End With
Set HTMLDoc = IE.Document
Set AASearchRank = Application.ThisWorkbook
Set AAws = AASearchRank.Worksheets("Sheet2")
Set InputSearchButton = HTMLDoc.getElementById("nav-search-submit-text")
Set InputSearchOrder = HTMLDoc.getElementById("twotabsearchtextbox")
If Not InputSearchOrder Is Nothing Then
InputSearchButton.Click
Do
DoEvents
Loop Until IE.ReadyState = READYSTATE_COMPLETE
End If
x = 2
If AAws.Range("D" & x).Value = "" Then
Do Until AAws.Range("B" & x) = ""
Set InputSearch = HTMLDoc.getElementById("twotabsearchtextbox")
InputSearch.Focus
'When a keyword is typed in the search bar with a 'space' after, it invokes the suggestions I'm looking for.
InputSearch.Value = "Travel "
'InputSearch.Value = AAws.Range("C" & x) & " "
Set InputSearchButton = HTMLDoc.getElementsByClassName("nav-input")(0)
InputSearch.Focus
'Here I was trying to invoke some script to see if it had any effect on the search bar drop down
HTMLDoc.parentWindow.execScript "window.navmet.push({key:'UpNav',end:+new Date(),begin:window.navmet.tmp});"
HTMLDoc.parentWindow.execScript "window.navmet.push({key:'Search',end:+new Date(),begin:window.navmet.tmp});"
HTMLDoc.parentWindow.execScript "window.navmet.push({key:'NavBar',end:+new Date(),begin:window.navmet.main});"
Do
DoEvents
Loop Until IE.ReadyState = READYSTATE_COMPLETE
'Application.Wait (Now + TimeValue("0:00:05"))
Set elems2 = HTMLDoc.getElementsByClassName("nav-issFlyout nav-flyout")
i = 0
For Each TDelement2 In elems2
'Debug statements strictly for learning what each option/query returns
Debug.Print TDelement2.innerText
Debug.Print TDelement2.className
Debug.Print TDelement2.dataFld
Debug.Print TDelement2.innerHTML
Debug.Print TDelement2.outerText
Debug.Print TDelement2.outerHTML
Debug.Print TDelement2.parentElement.className
Debug.Print TDelement2.tagName
Debug.Print TDelement2.ID
Next
'Once the searchbar is populated, and the drop down list provides suggestions,
'the below code will give me what I want. If there's an easier solution,
'I'm all for it
Set elems = HTMLDoc.getElementsByClassName("s-suggestion")
i = 0
For Each TDelement In elems
If Left(TDelement.ID, 6) = "issDiv" Then
Debug.Print TDelement.innerText
Debug.Print TDelement.ID
End If
Next
x = x + 1
Loop
End If
End Sub
An ideal solution would be to obtain these suggested keywords through either invoking the search bar dynamic HTML or via Amazon's completion site, but it appears as though that might not be open to the general public. Thank you for any help, and apologies up front for any posting deficiencies.
There is an API call you can find in the network tab. It returns a json string you can parse with as jsonparser to get the suggestions. I use jsonconverter.bas which, once downloaded I add to the project and then go VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
The url itself is a queryString i.e. it is constructed of different parameters. For example, there is a limit parameter, whose value is 11, which specifies the number of suggestions to return. You may be able to alter and/or remove some of these. Below, I concatenate the SEARCH_TERM constant into the query string to represent your search value (that which would be typed into the search box).
I don't know whether any of the params are time-based (i.e. expire over time - I have made a number of requests without problem since you posted your question). It may be that necessary time based values can be pulled via a prior GET request to Amazon search page.
params = (
('session-id', '141-0042012-2829544'),
('customer-id', ''),
('request-id', '7E7YCB7AZZM1HQEZF2G1'),
('page-type', 'Search'),
('lop', 'en_US'),
('site-variant', 'desktop'),
('client-info', 'amazon-search-ui'),
('mid', 'ATVPDKIKX0DER'),
('alias', 'aps'),
('b2b', '0'),
('fresh', '0'),
('ks', '76'),
('prefix', 'TRAVEL'),
('event', 'onKeyPress'),
('limit', '11'),
('fb', '1'),
('suggestion-type', ['KEYWORD', 'WIDGET']),
('_', '1556820864750')
)
VBA:
Option Explicit
Public Sub GetTable()
Dim json As Object, suggestion As Object '< VBE > Tools > References > Microsoft Scripting Runtime
Const SEARCH_TERM As String = "TRAVEL"
Const SEARCH_TERM2 As String = "BOOKS"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://completion.amazon.com/api/2017/suggestions?session-id=141-0042012-2829544" & _
"&customer-id=&request-id=7E7YCB7AZZM1HQEZF2G1&page-type=Search&lop=en_US&site-variant=" & _
"desktop&client-info=amazon-search-ui&mid=ATVPDKIKX0DER&alias=aps&b2b=0&fresh=0&ks=76&" & _
"prefix=" & SEARCH_TERM & "&event=onKeyPress&limit=11&fb=1&suggestion-type=KEYWORD&suggestion-type=" & _
"WIDGET&_=1556820864750", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
Set json = JsonConverter.ParseJson(.responseText)("suggestions")
End With
For Each suggestion In json
Debug.Print suggestion("value")
Next
End Sub

Running a VBA Loop with Selenium to scrape html data into Excel

I was recently assisted in scraping data from a webpage by the guys at Stackoverflow. It's a great community. I was given a function that pulls data into excel from a cell containing a url. Unfortunately I'm running into some problems because I need a loop function so that Excel does not restart all my functions once I save or refresh the page.
So far I have tried to build this, but am next to useless in VBA. Wondering if anyone can provide a little extra assistance.
Sub POSTPageViews()
Dim InputSheet As Worksheet
Dim i As Long
Dim AllWords As Range
Dim text As String
Dim OutValue As String
Dim driver As SeleniumWrapper.WebDriver
On Error Resume Next
Set driver = New SeleniumWrapper.WebDriver
driver.Start "chrome", "https://re.po.st/"
driver.Open strLocation
Set InputSheet = Active
Set WorkRng = Application.Selection
WordListSheet.Range("E1") = "All Words"
InputSheet.Activate
r = 1
Do While Cells(r, 1) <> ""
Cells(r, 1).Value = txt
OutValue = driver.findElementById("sguidtotaltable").findElementByTagName("span").text
Next i
r = r + 1
driver.stop 'Stops the browser
Loop
End Sub
But naturally it is not working... Anybody see what is wrong? Basically in Column E I have all the URLs and in column K I would like to see the accompanying values.
Thanks
Does this work (in the spirit of my comments)?
Sub POSTPageViews()
Dim driver As SeleniumWrapper.WebDriver
Set driver = New SeleniumWrapper.WebDriver
driver.Start "chrome", "https://re.po.st/"
With Worksheets("Trial")
r = 2
Do While .Cells(r, 5) <> ""
driver.Open .Cells(r, 5).Value
.Cells(r, 11) = driver.findElementById("sguidtotaltable").findElementByTagName("span").text
r = r + 1
Loop
driver.stop 'Stops the browser
End Sub

Extracting multiple search results

I have created a VBA application that allows you to extract search results from the canada411.ca site. You simply insert values into to the values "Where" and "What" and "Title", "Location, and "Phone" will spit out. In my code What = "Name". Here is my code:
Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Row = Range("Name").Row And _
Target.Column = Range("Name").Column Then
End If
If Target.Row = Range("Where").Row And _
Target.Column = Range("Where").Column Then
'Set Variables What and Where from Canada411.ca to Values on Excel WorkSheet
Dim IE As New InternetExplorer
IE.Visible = True
IE.navigate ("http://canada411.yellowpages.ca/search/si/1/") & _
Range("Name").Value & "/" & Range("Where").Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
'Extract from Canada411.ca Source element (first search result)
Range("Title").Value = Trim(Doc.getElementsByTagName("h3")(0).innerText)
Range("Phone").Value = Trim(Doc.getElementsByTagName("h4")(0).innerText)
Range("Location").Value = Trim(Doc.getElementsByClassName("address")(0).innerText)
IE.Quit
'Extract for Second Search result
'Third Search result etc.
End If
End Sub
My problem is that I don't know how to get the remaining results on the page, I can only get the first result on the first page. The source code for the subsequent search results are the same as the first, but I cannot seem to make it work. (Perhaps there is a shortcut after you have the code for the first one, to get the rest?) I am new to VBA and HTML and appreciate the help.
Well, you have two options.
1) Learn how to navigate the DOM using the Tools->References library 'Microsoft HTML Object Library' and extract that way.
2) It is possible to pull the web page into a Excel worksheet and then you only need to pull out data from each cell. Much easier but ties you to Excel. Use the Macro recorder and then use the GUI , on the Ribbon Data->From Web and follow the wizard.
In your link change the « 1 » for a 2, 3, 4 ... These are the page numbers !
http://canada411.yellowpages.ca/search/si/1/
http://canada411.yellowpages.ca/search/si/2/
http://canada411.yellowpages.ca/search/si/3/
...