Extracting multiple search results - html

I have created a VBA application that extracts search results from the canada411.ca site. You enter values in the "What" and "Where" cells, and the "Title", "Location", and "Phone" values are returned. In my code, "What" is the range named "Name". Here is my code:
Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Row = Range("Name").Row And _
Target.Column = Range("Name").Column Then
End If
If Target.Row = Range("Where").Row And _
Target.Column = Range("Where").Column Then
'Set Variables What and Where from Canada411.ca to Values on Excel WorkSheet
Dim IE As New InternetExplorer
IE.Visible = True
IE.navigate ("http://canada411.yellowpages.ca/search/si/1/") & _
Range("Name").Value & "/" & Range("Where").Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
'Extract from Canada411.ca Source element (first search result)
Range("Title").Value = Trim(Doc.getElementsByTagName("h3")(0).innerText)
Range("Phone").Value = Trim(Doc.getElementsByTagName("h4")(0).innerText)
Range("Location").Value = Trim(Doc.getElementsByClassName("address")(0).innerText)
IE.Quit
'Extract for Second Search result
'Third Search result etc.
End If
End Sub
My problem is that I don't know how to get the remaining results: I can only get the first result on the first page. The source code for the subsequent search results is the same as for the first, but I cannot make it work. (Perhaps, once you have the code for the first result, there is a shortcut to get the rest?) I am new to VBA and HTML and appreciate the help.

Well, you have two options.
1) Learn how to navigate the DOM using the 'Microsoft HTML Object Library' (add it under Tools->References) and extract the data that way.
2) Pull the web page into an Excel worksheet, after which you only need to read the data out of each cell. This is much easier but ties you to Excel. Start the Macro recorder, then on the Ribbon choose Data->From Web and follow the wizard.

In your link, change the "1" to 2, 3, 4, ... These are the page numbers:
http://canada411.yellowpages.ca/search/si/1/
http://canada411.yellowpages.ca/search/si/2/
http://canada411.yellowpages.ca/search/si/3/
...
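Putting the two ideas together (loop over the page number, and loop over every index of the collections instead of only index 0), a minimal sketch might look like the following. It assumes, as in the question, that each result contributes one h3, one h4 and one element with class "address", in the same order. The names ExtractAllResults and lastPage and the output layout (columns A:C from row 2) are hypothetical.

```vba
' Sketch only. Requires references to "Microsoft Internet Controls" and
' "Microsoft HTML Object Library". ExtractAllResults, lastPage and the
' output layout (columns A:C from row 2) are hypothetical.
Public Sub ExtractAllResults(ByVal what As String, ByVal where As String, ByVal lastPage As Long)
    Dim IE As New InternetExplorer
    Dim Doc As HTMLDocument
    Dim titles As Object, phones As Object, addresses As Object
    Dim pageNo As Long, i As Long, outRow As Long

    outRow = 2
    For pageNo = 1 To lastPage
        ' The number after /si/ is the page number
        IE.navigate "http://canada411.yellowpages.ca/search/si/" & pageNo & "/" & what & "/" & where
        Do
            DoEvents
        Loop While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        Set Doc = IE.document

        Set titles = Doc.getElementsByTagName("h3")
        Set phones = Doc.getElementsByTagName("h4")
        Set addresses = Doc.getElementsByClassName("address")

        ' Assumption: the i-th h3, h4 and "address" element belong to the same result
        For i = 0 To addresses.Length - 1
            If i < titles.Length And i < phones.Length Then
                ActiveSheet.Cells(outRow, 1).Value = Trim(titles(i).innerText)
                ActiveSheet.Cells(outRow, 2).Value = Trim(phones(i).innerText)
                ActiveSheet.Cells(outRow, 3).Value = Trim(addresses(i).innerText)
                outRow = outRow + 1
            End If
        Next i
    Next pageNo
    IE.Quit
End Sub
```

It could be called as ExtractAllResults Range("Name").Value, Range("Where").Value, 3. If the page contains other h3 or h4 elements (headers, ads), the indices will not line up, and the collections would need to be narrowed to the result container first.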

Related

How to access the Web using VBA? Please check my code

To automate repetitive work, I tried to use VBA to access a website that is used in my company.
I wrote the VBA code below and confirmed that it can access normal sites such as Google and YouTube.
But I don't know why it cannot access the company site.
VBA stops at this line:
Set HTMLDoc = IE_ctrl.document
Thank you in advance.
I also noticed one difference between the normal site and the company site (in the VBA Locals window: the values and type).
Please see the two pictures below.
Sub a()
Dim IE_ctrl As InternetExplorer
Dim HTMLDoc As HTMLDocument
Dim input_Data As IHTMLElement
Dim URL As String
URL = "https://www.google.com"
Set IE_ctrl = New InternetExplorer
IE_ctrl.Silent = True
IE_ctrl.Visible = True
IE_ctrl.navigate URL
Wait_Browser IE_ctrl
Set HTMLDoc = IE_ctrl.document
Wait_Browser IE_ctrl
Set input_Data = HTMLDoc.getElementsByClassName("text").Item
input_Data.Click
End Sub
Sub Wait_Browser(Browser As InternetExplorer, Optional t As Integer = 1)
While Browser.Busy
DoEvents
Wend
Application.Wait DateAdd("s", t, Now)
End Sub
[Image: normal site (works correctly)]
[Image: company site (raises the error)]
You can try the following code. Please read the comments. I can't say any more because I don't know the page or its HTML.
Sub a()
'Use late binding for what you need
Dim ie As Object
Dim nodeInputData As Object
Dim url As String
url = "https://www.google.com"
'Use the Windows GUID to initialize Internet Explorer if you
'want to get access to a company page. This helps if there are
'security rules you can't get past with other ways of initializing IE.
'This doesn't work in most cases for pages on the "real" web.
'Read here for more information:
'https://blogs.msdn.microsoft.com/ieinternals/2011/08/03/default-integrity-level-and-automation/
Set ie = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
ie.Visible = True
ie.navigate url
'Waiting for the document to load
Do Until ie.readyState = 4: DoEvents: Loop
'If necessary (dynamic content may still be loading after IE
'reports that loading is complete), wait a little longer.
'(The three TimeSerial arguments are: hours, minutes, seconds)
Application.Wait (Now + TimeSerial(0, 0, 1))
'I don't know your HTML. If you only want to click a button,
'you don't need a variable:
'ie.document.getElementsByClassName("text")(0).Click
'does the same as
Set nodeInputData = ie.document.getElementsByClassName("text")(0)
nodeInputData.Click
'A short explanation of getElementsByClassName() and getElementsByTagName():
'Both methods create a node collection of all HTML elements that were found
'by the criteria in the brackets. This is because there can be any number of
'HTML elements with the specified class name or tag name. If, for example,
'3 HTML elements with the class name "Text" were found, a node collection
'with three elements is created by getElementsByClassName("Text").
'These have the indices 0 to 2, as in an array. The individual elements are
'addressed via these indices, which are given in round brackets.
End Sub
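Because the collection can also be empty, indexing it with (0) and calling Click directly raises a run-time error when nothing matches. A small hedged sketch of a guard is shown below. ClickFirstByClass is a hypothetical helper name, and doc is assumed to be the late-bound ie.document from the code above.

```vba
' Hypothetical helper: clicks the first element with the given class name.
' Returns False instead of raising an error when the collection is empty.
Public Function ClickFirstByClass(ByVal doc As Object, ByVal className As String) As Boolean
    Dim nodes As Object

    Set nodes = doc.getElementsByClassName(className)
    If nodes.Length > 0 Then
        nodes(0).Click              'index 0 = first match
        ClickFirstByClass = True
    End If
End Function
```

Used as: If Not ClickFirstByClass(ie.document, "text") Then Debug.Print "No element with class 'text' found".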

How can I retrieve Amazon's keyword/phrase suggestions from the search bar

Below is some code I've found and altered to attempt to capture the keyword/phrase suggestions from Amazon's search bar. I'm very new to the concept of web scraping, so I know the code presented here may be very ineffective and inefficient. I've manually captured some data from the F12 DOM Explorer and Network windows. If the best answer is web scraping, I need that in the form of excel vba. I see in some of the below images that it appears as though some of the content type from the Network window is "application/json" and the Initiator/Type is "XMLHttpRequest", but this is only after it shows a connection and authentication to "https://completion.amazon.com". If that's the route, I have no idea how to complete those requests. Any help would be much appreciated.
So far I've tried invoking the search bar programmatically, via the scripts in the code, but that does nothing that I can see. Simply 'pasting' the keyword into the search bar with a 'space' appended to it does not produce the suggested keywords. However, typing into the search bar does. If I type the keyword in, then choose 'inspect element' of the dropdown suggestions, dynamic HTML is produced to show the HTML content of the suggestions (at which time I can get what I need). I've been unsuccessful in getting to that point.
Private Sub CommandButton1_Click()
Dim MyHTML_Element As IHTMLElement
Dim MyURL As String
Dim AASearchRank As Workbook
Dim AAws As Worksheet
Dim InputSearch As HTMLInputTextElement
Dim elems As IHTMLElementCollection
Dim TDelement As HTMLTableCell
Dim elems2 As IHTMLElementCollection
Dim TDelement2 As HTMLDivElement
'Dim TDelement2 As HTMLInputTextElement
Dim InputSearchButton As HTMLInputButtonElement
Dim IE As InternetExplorer
Dim x As Integer
Dim i As Long
MyURL = "https://www.amazon.com/"
Set IE = New InternetExplorer
With IE
.Silent = True
.Navigate MyURL
.Visible = True
Do
DoEvents
Loop Until .ReadyState = READYSTATE_COMPLETE
End With
Set HTMLDoc = IE.Document
Set AASearchRank = Application.ThisWorkbook
Set AAws = AASearchRank.Worksheets("Sheet2")
Set InputSearchButton = HTMLDoc.getElementById("nav-search-submit-text")
Set InputSearchOrder = HTMLDoc.getElementById("twotabsearchtextbox")
If Not InputSearchOrder Is Nothing Then
InputSearchButton.Click
Do
DoEvents
Loop Until IE.ReadyState = READYSTATE_COMPLETE
End If
x = 2
If AAws.Range("D" & x).Value = "" Then
Do Until AAws.Range("B" & x) = ""
Set InputSearch = HTMLDoc.getElementById("twotabsearchtextbox")
InputSearch.Focus
'When a keyword is typed in the search bar with a 'space' after, it invokes the suggestions I'm looking for.
InputSearch.Value = "Travel "
'InputSearch.Value = AAws.Range("C" & x) & " "
Set InputSearchButton = HTMLDoc.getElementsByClassName("nav-input")(0)
InputSearch.Focus
'Here I was trying to invoke some script to see if it had any effect on the search bar drop down
HTMLDoc.parentWindow.execScript "window.navmet.push({key:'UpNav',end:+new Date(),begin:window.navmet.tmp});"
HTMLDoc.parentWindow.execScript "window.navmet.push({key:'Search',end:+new Date(),begin:window.navmet.tmp});"
HTMLDoc.parentWindow.execScript "window.navmet.push({key:'NavBar',end:+new Date(),begin:window.navmet.main});"
Do
DoEvents
Loop Until IE.ReadyState = READYSTATE_COMPLETE
'Application.Wait (Now + TimeValue("0:00:05"))
Set elems2 = HTMLDoc.getElementsByClassName("nav-issFlyout nav-flyout")
i = 0
For Each TDelement2 In elems2
'Debug statements strictly for learning what each option/query returns
Debug.Print TDelement2.innerText
Debug.Print TDelement2.className
Debug.Print TDelement2.dataFld
Debug.Print TDelement2.innerHTML
Debug.Print TDelement2.outerText
Debug.Print TDelement2.outerHTML
Debug.Print TDelement2.parentElement.className
Debug.Print TDelement2.tagName
Debug.Print TDelement2.ID
Next
'Once the searchbar is populated, and the drop down list provides suggestions,
'the below code will give me what I want. If there's an easier solution,
'I'm all for it
Set elems = HTMLDoc.getElementsByClassName("s-suggestion")
i = 0
For Each TDelement In elems
If Left(TDelement.ID, 6) = "issDiv" Then
Debug.Print TDelement.innerText
Debug.Print TDelement.ID
End If
Next
x = x + 1
Loop
End If
End Sub
An ideal solution would be to obtain these suggested keywords through either invoking the search bar dynamic HTML or via Amazon's completion site, but it appears as though that might not be open to the general public. Thank you for any help, and apologies up front for any posting deficiencies.
There is an API call you can find in the network tab. It returns a JSON string that you can parse with a JSON parser to get the suggestions. I use jsonconverter.bas: once downloaded, I add it to the project and then go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
The URL carries a query string, i.e. it is built from several parameters. For example, there is a limit parameter, whose value is 11, which specifies the number of suggestions to return. You may be able to alter and/or remove some of these. Below, I concatenate the SEARCH_TERM constant into the query string to represent your search value (the text that would be typed into the search box).
I don't know whether any of the params are time-based (i.e. expire over time - I have made a number of requests without problem since you posted your question). It may be that necessary time based values can be pulled via a prior GET request to Amazon search page.
params = (
('session-id', '141-0042012-2829544'),
('customer-id', ''),
('request-id', '7E7YCB7AZZM1HQEZF2G1'),
('page-type', 'Search'),
('lop', 'en_US'),
('site-variant', 'desktop'),
('client-info', 'amazon-search-ui'),
('mid', 'ATVPDKIKX0DER'),
('alias', 'aps'),
('b2b', '0'),
('fresh', '0'),
('ks', '76'),
('prefix', 'TRAVEL'),
('event', 'onKeyPress'),
('limit', '11'),
('fb', '1'),
('suggestion-type', ['KEYWORD', 'WIDGET']),
('_', '1556820864750')
)
VBA:
Option Explicit
Public Sub GetTable()
Dim json As Object, suggestion As Object '< VBE > Tools > References > Microsoft Scripting Runtime
Const SEARCH_TERM As String = "TRAVEL"
Const SEARCH_TERM2 As String = "BOOKS"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://completion.amazon.com/api/2017/suggestions?session-id=141-0042012-2829544" & _
"&customer-id=&request-id=7E7YCB7AZZM1HQEZF2G1&page-type=Search&lop=en_US&site-variant=" & _
"desktop&client-info=amazon-search-ui&mid=ATVPDKIKX0DER&alias=aps&b2b=0&fresh=0&ks=76&" & _
"prefix=" & SEARCH_TERM & "&event=onKeyPress&limit=11&fb=1&suggestion-type=KEYWORD&suggestion-type=" & _
"WIDGET&_=1556820864750", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
Set json = JsonConverter.ParseJson(.responseText)("suggestions")
End With
For Each suggestion In json
Debug.Print suggestion("value")
Next
End Sub
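To run the same request for each keyword on a sheet rather than for a constant, the URL construction can be moved into a function. This is only a sketch: SuggestionUrl and maxItems are hypothetical names, the parameter values are copied unchanged from the request above, and WorksheetFunction.EncodeURL assumes Excel 2013 or later.

```vba
' Sketch: builds the suggestions URL for an arbitrary search term.
' All parameter values are copied from the request above; whether the
' endpoint still accepts them is an assumption.
Public Function SuggestionUrl(ByVal searchTerm As String, Optional ByVal maxItems As Long = 11) As String
    SuggestionUrl = "https://completion.amazon.com/api/2017/suggestions?session-id=141-0042012-2829544" & _
        "&customer-id=&request-id=7E7YCB7AZZM1HQEZF2G1&page-type=Search&lop=en_US&site-variant=" & _
        "desktop&client-info=amazon-search-ui&mid=ATVPDKIKX0DER&alias=aps&b2b=0&fresh=0&ks=76&" & _
        "prefix=" & WorksheetFunction.EncodeURL(searchTerm) & _
        "&event=onKeyPress&limit=" & maxItems & _
        "&fb=1&suggestion-type=KEYWORD&suggestion-type=WIDGET&_=1556820864750"
End Function
```

Inside the loop from the question, the request line would then become .Open "GET", SuggestionUrl(AAws.Range("B" & x).Value), False. Encoding the term matters because keywords with spaces would otherwise produce an invalid query string.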

vba: How to click on element within iframe

My goal is to click an element within an HTML iframe, but nothing has worked for me so far. I hope someone can advise how to approach this task correctly, as I have been going in circles for weeks now.
I have tried to click on a div id and on a span title, but nothing has worked. I believe it is because of wrong syntax.
Option Explicit
Sub it_will_work()
'make the app work faster?
Application.ScreenUpdating = False
Application.DisplayAlerts = False
'--------------------------------
Dim sht As Worksheet
Set sht = ThisWorkbook.Sheets("Fields") 'my data will be stored here
Dim LastRow As Long
LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row 'range definition
Dim i As Long 'Will be used for a loop that navigate to different url
For i = 2 To LastRow 'First url is in row 2; continue until the last row
Dim IE As Object 'Internet Explorer declaration
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.navigate sht.Range("A" & i).Value 'My url that I want to navigate to
While IE.readyState <> 4 Or IE.Busy: DoEvents: Wend
Dim Doc As New HTMLDocument 'Will be used for the main html page
Set Doc = IE.document
Doc.getElementById("tab7").Click 'data that needs to be updated is here
'Global workgroup data that will affect the workgroup data (dependency)
Doc.getElementById("mcdResourceGlobalWorkgroup_ddltxt").Value = sht.Range("W" & i).Value
Doc.getElementById("mcdResourceGlobalWorkgroup_ddltxt").Focus
Doc.getElementById("mcdResourceGlobalWorkgroup_ddlimg").Click
'Workgroup dropdown that needs to be chosen within the iframe:
Doc.getElementById("ResourceWorkgroup").Value = sht.Range("X" & i).Value '1) workgroup that I want to insert
Doc.getElementById("ResourceWorkgroup").Focus
Doc.getElementById("_IB_imgResourceWorkgroup").Click '2) Clicking here generates dropdown values according to the value inserted above
Application.Wait Now + TimeValue("00:00:5") 'before referring to the iframe, I let the values load
'***From this point I have the issue, where I try to access the iframe and click on the desired element:***
'Here I declare Iframe
Dim objIFRAME As Object
Set objIFRAME = IE.document.getElementsByTagName("iframe")
Debug.Print TypeName(objIFRAME)
'Here I ask to click on a title within the Iframe where value = X
objIFRAME.getElementsByName("title").Value = sht.Range("X" & i).Value.Click
Next i
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub
After the URL loads, the following steps should happen:
1) Click on tab 7 -> this opens the correct tab to work on
2) Insert the value from column "W" into the "Global workgroup" field
3) Focus on the "Global workgroup" field
4) Click on the image that validates the "Global workgroup" field (validates the inserted value)
5) Insert the value from column "X" into the "Workgroup" field
6) Focus on the "Workgroup" field
7) Click on the image that opens the drop-down options, which are generated according to the value inserted into the "Workgroup" field
8) Within the iframe, click on the title that is equal to the value which was inserted into the "Workgroup" field
I have also tried Selenium IDE, so that I could see how the recorded macro accesses the iframe and clicks the desired element:
Command: Select frame | Target: Index=2
Click | Target: css=span[title="APAC"]
I have tried to mimic the steps above in the VBE, but couldn't find a way to write them properly. I even tried to download and apply the Selenium driver and run the code using the Selenium library, but got stuck as well.
The image below shows the HTML of the iframe and the element I want to click on:
You should be able to use the following syntax
ie.document.querySelector("[id='_CPDDWRCC_ifr']").contentDocument.querySelector("span[title=APAC]").click
With selenium you can use
driver.SwitchToFrame driver.FindElementByCss("[id='_CPDDWRCC_ifr']")
driver.FindElementByCss("span[title=APAC]").click
With your existing tag-based solution, getElementsByTagName returns a collection, so you need to use an index. For example,
objIFRAME(0)
Then call querySelector on the contentDocument of that element.
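A minimal sketch of that last variant, with guards for a missing frame or a missing element, could look like this. ClickInFrame and frameIndex are hypothetical names, and the approach is only expected to work when the iframe content comes from the same domain as the parent page (otherwise accessing contentDocument is likely to fail with a permission error).

```vba
' Hypothetical helper: clicks the first element matching cssSelector inside
' the iframe at position frameIndex (0-based) of the given document.
' Returns False when the frame or the element is not found.
Public Function ClickInFrame(ByVal doc As Object, ByVal frameIndex As Long, ByVal cssSelector As String) As Boolean
    Dim frames As Object, target As Object

    Set frames = doc.getElementsByTagName("iframe")
    If frameIndex < 0 Or frameIndex >= frames.Length Then Exit Function

    ' contentDocument is the document loaded inside the iframe
    Set target = frames(frameIndex).contentDocument.querySelector(cssSelector)
    If Not target Is Nothing Then
        target.Click
        ClickInFrame = True
    End If
End Function
```

In the loop from the question, the failing line could then be replaced by something like ClickInFrame IE.document, 0, "span[title='" & sht.Range("X" & i).Value & "']". The index 0 is an assumption: the Selenium recording used Index=2, which counts frames in the window and may not match the position among the iframe tags.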

Incorrect data returned when web-scraping from Internet Explorer using VBA

I am using VBA code (in MS Access, but this problem should occur regardless of the VBA platform) to scrape specific web pages for particular data:
Option Compare Database
Option Explicit
' Requires references to "Microsoft Internet Controls"
' Requires references to "Microsoft HTML Object Library"
Private mFound As Boolean
Private Sub cmdGetFromIE(BaseURL as string)
Const SND_ALIAS_SYSTEMEXCLAMATION = 8531
Dim SW As SHDocVw.ShellWindows
Dim IE As SHDocVw.InternetExplorer
Dim CtrA As Long
Dim TStart As Single
Dim Doc As MSHTML.HTMLDocument
Dim IncElement As MSHTML.IHTMLElement, TitleElement As MSHTML.IHTMLElement, UserElement As MSHTML.IHTMLElement
' ...
Set SW = New SHDocVw.ShellWindows
If SW.Count > 0 Then
For CtrA = 0 To SW.Count - 1
Set IE = SW.Item(CtrA)
If Left(IE.LocationURL, Len(BaseURL)) = BaseURL Then
TStart = Timer
IE.Refresh
Do Until (IE.ReadyState = READYSTATE_COMPLETE And Not IE.Busy) Or Timer > TStart + 30
DoEvents ' Sleep 1 ' Sleep Windows API call procedure to sleep 1s
Debug.Print IE.LocationName, IE.LocationURL, IE.ReadyState, IE.Busy
Loop
If IE.ReadyState = READYSTATE_COMPLETE And Not IE.Busy Then
Set Doc = IE.Document
Set IncElement = Doc.getElementsByClassName("history-item__title ng-binding").Item
Set TitleElement = Doc.getElementsByClassName("history-item__details ng-binding").Item
Set UserElement = Doc.getElementsByClassName("person-summary__full-name_link font-size-xxl ng-binding").Item
Debug.Print CtrA & ";" & Val(Right(IncElement.innerText, 12)) & ";" & TitleElement.innerText & ";" & UserElement.innerText
' Do stuff with the data...
End If
End If
Next
' Do more stuff with the data...
Else
' ... Do 'No IE open' stuff...
End If
End Sub
My problem is that if I open Navigation Page A, and from there navigate to Data Page B 1, the correct data is returned from that page, but if I then navigate back to Navigation Page A and then navigate to Data Page B 2, which is the same form, but contains different data, this code returns some or all of the same data for Data Page B 2 as was returned from Data Page B 1, despite the data pages being refreshed.
I can even navigate to Navigation Page A and then open Data Page B 1 from it in a new tab, and then go back to the Navigation Page A tab and then open Data Page B 2 in yet another new tab, and I still get the problem that I get some or all of Data Page B 1's data from Data Page B 2.
Data Page B appears to be a stock form populated by AngularJS, with different data depending on the data section of the URL. That shouldn't matter, however: I want the page's data as it stands at the instant I run the procedure. Whether or not my code does an IE.Refresh, I still have this problem.
If I try Set Doc = New MSHTML.HTMLDocument:Set Doc = Doc.createDocumentFromUrl(IE.LocationURL, ""), I get a "Permission Denied" error at Doc.getElementsByClassName.
How can I scrape Data Page B for the correct current data for each IE tab?
EDIT:
I don't necessarily even have to use IE to load the pages, I just need to get the URLs from IE, and if there is another way accessible via VBA to load and parse the resultant HTML DOM that will work, I'm open to it.

html parsing of cricinfo scorecards

Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel
As an example the current Australian Big Bash 2011/12 scorecards are available from
Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE, or using XMLHTTP and then regular expressions) to scrape data from websites, e.g.
Extract values from HTML TD and Tr
In that same question a comment was posted suggesting HTML parsing, which I hadn't come across before, so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags
Query
While I could write a regex to parse the cricket data below I would like advice as to how I could efficiently retrieve these results with html parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
the date/name of the match
Team 1 name
the output should dump up to 11 records for Team 1 (blank records where players haven't batted, ie "Did Not Bat")
Team 2 name
the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
Nirvana for me would be a solution that I could deploy using VBA or VBscript so I could fully automate my analysis, but I presume I will have to use a separate tool for the html parse.
Sample Site links and Data to be Extracted
There are two techniques that I use with VBA. I will describe them one by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many, I will cover even the obvious. Please feel free to skip whatever part you already know.
1) Using FireFox / Firebug Addon / Fiddler
FireFox : http://en.wikipedia.org/wiki/Firefox
Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29
Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29
Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scraped.
I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of bots which scrape sites, most sites now capture your mouse coordinates and pass that information along to prevent automatic scraping, and Fiddler helps you debug the info that is being passed. I will not go into much detail about it here, as this info can be used maliciously.
Now let's take a simple example on how to scrape the URL posted in your question
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the snapshot below.
So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file, then we can definitely work with the data to do our analysis. Here is sample code which will dump that table into Sheet1.
Before we proceed, I would recommend closing all Excel instances and starting a fresh one.
Launch the VBA editor and insert a Userform. Place a command button and a WebBrowser control on it. Your Userform might look like this:
Paste this code in the Userform code area
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Sub CommandButton1_Click()
Dim URL As String
Dim oSheet As Worksheet
Set oSheet = Sheets("Sheet1")
URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
PopulateDataSheets oSheet, URL
MsgBox "Data scraped. Please check " & oSheet.Name
End Sub
Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim insertRow As Long, Row As Long, col As Long
On Error GoTo whoa
WebBrowser1.navigate URL
WaitForWBReady
Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
With wsk
.Cells.Clear
insertRow = 0
For Row = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(Row)
If Trim(tr.innerText) <> "" Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Total" Then
insertRow = insertRow + 1
For col = 0 To tr.Cells.Length - 1
.Cells(insertRow, col + 1) = tr.Cells(col).innerText
Next
End If
End If
End If
Next
End With
whoa:
Unload Me
End Sub
Private Sub Wait(ByVal nSec As Long)
nSec = nSec + Timer
While Timer < nSec
DoEvents
Sleep 100
Wend
End Sub
Private Sub WaitForWBReady()
Wait 1
While WebBrowser1.ReadyState <> 4
Wait 3
Wend
End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the url in the "New Web Query Window" and click on "Go"
Once the page is loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
, Destination:=Range("$A$1"))
.Name = "524915"
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = """inningsBat1"""
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
End Sub
Hope this helps. Let me know if you still have some queries.
Sid
For anyone else interested in this, I ended up using the code below, based on Siddharth Rout's earlier answer:
XMLHttp was significantly quicker than automating IE
the code generates a CSV file for each series to be downloaded (held in the X variable)
the code dumps each match to a regular 29 row range (regardless of how many players batted) to facilitate easier analysis later on
Public Sub PopulateDataSheets_XML()
Dim URL As String
Dim ws As Worksheet
Dim lngRow As Long
Dim lngRecords As Long
Dim lngWrite As Long
Dim lngSpare As Long
Dim lngInnings As Long
Dim lngRow1 As Long
Dim X(1 To 15, 1 To 4) As String
Dim objFSO As Object
Dim objTF As Object
Dim xmlHttp As Object
Dim htmldoc As HTMLDocument
Dim htmlbody As htmlbody
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim strInnings As String
s = Timer()
Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
Set objFSO = CreateObject("scripting.filesystemobject")
X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
X(1, 2) = 501198
X(1, 3) = 501271
X(1, 4) = "indian-premier-league-2011"
X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
X(2, 2) = 524915
X(2, 3) = 524945
X(2, 4) = "big-bash-league-2011"
X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
X(3, 2) = 461028
X(3, 3) = 461047
X(3, 4) = "big-bash-league-2010"
Set htmldoc = New HTMLDocument
Set htmlbody = htmldoc.body
For lngRow = 1 To UBound(X, 1)
If Len(X(lngRow, 1)) = 0 Then Exit For
Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
For lngRecords = X(lngRow, 2) To X(lngRow, 3)
URL = X(lngRow, 1) & lngRecords & ".html"
xmlHttp.Open "GET", URL
xmlHttp.send
Do While xmlHttp.Status <> 200
DoEvents
Loop
htmlbody.innerHTML = xmlHttp.responseText
objTF.writeline X(lngRow, 1) & lngRecords & ".html"
For lngInnings = 1 To 2
strInnings = "Innings " & lngInnings
objTF.writeline strInnings
Set tbl = Nothing
On Error Resume Next
Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
On Error GoTo 0
If Not tbl Is Nothing Then
lngWrite = 0
For lngRow1 = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(lngRow1)
If Trim(tr.innerText) <> vbNewLine Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Extras" Then
If Len(tr.Cells(1).innerText) > 0 Then
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
End If
Else
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
Exit For
End If
End If
End If
Next
For lngSpare = 12 To lngWrite Step -1
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
Else
For lngSpare = 1 To 13
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
End If
Next
Next
Next
'Call ConsolidateSheets
End Sub
RegEx is not a complete solution for parsing HTML, because HTML is not guaranteed to be a regular language.
You should use the HtmlAgilityPack to query the HTML. This will allow you to use the CSS selectors to query the HTML similar to how you do it with jQuery.
As quite a few people may see this I thought I would use it as a chance to demonstrate a few features I rarely see people using in VBA web-scraping: deleteRow, querySelector and use of clipboard to write out a table (complete with formatting and hyperlinks) to a sheet based on the table.outerHTML.
deleteRow is used to remove the unwanted rows. querySelector is used to apply faster css selectors to match on nodes. Modern browsers/html parsers are optimized for css and class selectors (which I use) are the second fastest selector type (after id).
Use of css selectors and understanding htmlTable methods/properties will allow for much greater flexibility in your web-scraping endeavours. Understanding the use of the clipboard means a simple copy paste method for transferring a table to Excel.
Execution could easily be tied to a button push and the url read in from a cell.
VBA:
Option Explicit
Public Sub test()
WriteOutTable "https://www.espncricinfo.com/series/8044/scorecard/524935/hobart-hurricanes-vs-melbourne-stars-big-bash-league-2011-12"
End Sub
Public Sub WriteOutTable(ByVal url As String)
'required VBE (Alt+F11) > Tools > References > Microsoft HTML Object Library ; Microsoft XML, v6 (your version may vary)
Dim hTable As MSHTML.HTMLTable, clipboard As Object
Dim xhr As MSXML2.xmlhttp60, html As MSHTML.htmlDocument
Dim rowCount As Long, i As Long 'must be declared because Option Explicit is on
Set xhr = New MSXML2.xmlhttp60
Set html = New MSHTML.htmlDocument
With xhr
.Open "GET", url, False
.Send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".batsman")
rowCount = hTable.Rows.Length - 1
For i = rowCount To 0 Step -1
Select Case True
Case i = rowCount Or i = rowCount - 1 Or InStr(hTable.Rows(i).outerHTML, "wicket-details") > 0
hTable.deleteRow i
End Select
Next
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ActiveSheet.Cells(1, 1).PasteSpecial
End Sub