Running a VBA Loop with Selenium to scrape html data into Excel - html

I was recently helped with scraping data from a webpage by the people at Stack Overflow. It's a great community. I was given a function that pulls data into Excel from a cell containing a URL. Unfortunately I'm running into some problems: I need a loop so that Excel does not re-run all my functions every time I save or refresh the page.
So far I have tried to build this, but I am next to useless in VBA. I am wondering if anyone can provide a little extra assistance.
Sub POSTPageViews()
Dim InputSheet As Worksheet
Dim i As Long
Dim AllWords As Range
Dim text As String
Dim OutValue As String
Dim driver As SeleniumWrapper.WebDriver
On Error Resume Next
Set driver = New SeleniumWrapper.WebDriver
driver.Start "chrome", "https://re.po.st/"
driver.Open strLocation
Set InputSheet = Active
Set WorkRng = Application.Selection
WordListSheet.Range("E1") = "All Words"
InputSheet.Activate
r = 1
Do While Cells(r, 1) <> ""
Cells(r, 1).Value = txt
OutValue = driver.findElementById("sguidtotaltable").findElementByTagName("span").text
Next i
r = r + 1
driver.stop 'Stops the browser
Loop
End Sub
But naturally it is not working... Anybody see what is wrong? Basically in Column E I have all the URLs and in column K I would like to see the accompanying values.
Thanks

Does this work (in the spirit of my comments)?
Sub POSTPageViews()
Dim driver As SeleniumWrapper.WebDriver
Dim r As Long
Set driver = New SeleniumWrapper.WebDriver
driver.Start "chrome", "https://re.po.st/"
With Worksheets("Trial")
r = 2 'First data row; the URLs are in column E (5)
Do While .Cells(r, 5) <> ""
driver.Open .Cells(r, 5).Value
'Write the scraped value next to the URL, in column K (11)
.Cells(r, 11) = driver.findElementById("sguidtotaltable").findElementByTagName("span").text
r = r + 1
Loop
End With
driver.stop 'Stops the browser
End Sub

Related

Parsing Html in VBA using a query get request

I am using someone else's code because this is an old file other people are using, I want to update it to make it more efficient but I need a little help. Below is the vba operation. What I need is it to get the information but delete everything but a certain word which changes every time the operation is run. I could use regex and objRE.Pattern = "|" but the word changes depending on the status.
HTML:
<span onmouseover="ShowText('Message','blahblah'); return true;"
onmouseout="HideText('Message'); return true;"
href="javascript:ShowText('Message')">---(PSA)---</span>
</font><a href='?srn=numbers12131131'target='_self'><font color='#6666FF'
size='3'>numbers123232343</font></a><font size='3'>----Installed----MUM
Indication:In Scope-<font color='#00CC00'>PASS WITH WARNING</font>--- (20181018)
</td><tr></table> </b><br>
<table class="OrderForm" width="1000"> '
I just want the Installed status in my excel sheet.
VBA code that needs work:
Sub GetComment()
Dim book As Workbook
Dim sheet As Worksheet
Dim row As Integer
Dim SRN As String
Dim whttp As Object
Set book = ThisWorkbook
Set sheet = book.Worksheets("CMT Data")
Set whttp = CreateObject("WinHTTP.WinHTTPrequest.5.1")
row = 2
SRN = sheet.Cells(row, 1)
Do While SRN <> ""
Debug.Print SRN
whttp.Open "GET", "www.websitedatgoeshere.com" & SRN, False
whttp.SetRequestHeader "Cookie", "mycookiefromwebsite;"
whttp.send
Debug.Print whttp.responseText
sheet.Cells(row, 2) = whttp.responseText
row = row + 1
SRN = sheet.Cells(row, 1)
Loop
Set whttp = Nothing
End Sub
This works if, and only if, the word always sits between "----" and "----", and it is the first such occurrence in the response. If it is not the first, adjust the index 1 as required.
Debug.Print Split(Split(whttp.responseText, "----")(1), "----")(0)
sheet.Cells(row, 2) = Split(Split(whttp.responseText, "----")(1), "----")(0)
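The nested Split raises a "Subscript out of range" error when the delimiter is missing from the response. Here is a minimal sketch of the same idea wrapped in a helper that returns an empty string instead. The name TextBetween and the empty-string fallback are my own assumptions, not part of the original code.

```vba
' Returns the text between the first and second occurrence of delim,
' or "" when delim does not occur twice. (Hypothetical helper name.)
Public Function TextBetween(ByVal src As String, ByVal delim As String) As String
    Dim startPos As Long, endPos As Long

    If Len(delim) = 0 Then Exit Function

    ' Position of the first delimiter
    startPos = InStr(1, src, delim)
    If startPos = 0 Then Exit Function

    ' Move past the first delimiter and look for the second one
    startPos = startPos + Len(delim)
    endPos = InStr(startPos, src, delim)
    If endPos = 0 Then Exit Function

    TextBetween = Mid$(src, startPos, endPos - startPos)
End Function
```

Inside the loop it could be used as sheet.Cells(row, 2) = TextBetween(whttp.responseText, "----").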

Accessing a website's table with a WinHTTPRequest in Excel VBA

I have written code that grabs a table from a website and pulls out each cell from that table and then drops them into an excel spreadsheet. The code works flawlessly when the website loads correctly.
The issue is that the website does not play nicely with Internet Explorer, so the code only executes successfully about half of the time. I could write a routine that checks whether the website loaded successfully and retries if it did not; however, I want to see if I can get it to work with a WinHTTPRequest.
The lines below are how I access the table using Internet Explorer based web scraping, with the last line being how I load the table into a variable.
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate "https://weather.com/weather/tenday/l/12345:4:US"
IE.Visible = True
Application.Wait (Now() + TimeValue("00:02:00"))
Set doc = IE.document
Set WeatherTable = doc.getElementsByClassName("twc-table")(0)
I can load the website in question via WinHTTPRequest using the code below.
Set doc = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://weather.com/weather/tenday/l/12345:4:US", False
.send
doc.body.innerHTML = .responseText
End With
However, when I try to grab the table using the line below I get "Run-time error '438': Object doesn't support this property or method".
Set WeatherTable = doc.getElementByclassname("twc-table")(0)
Basically I need the equivalent of this line for WinHTTP web scraping.
I have looked at descending down through the html document(doc.body.all.item(1), etc) but I don't get very far before I run into errors. I have also looked at the Selenium addon, but I don't remember being able to download and install it successfully, and I am not sure if it is even still maintained for current versions of chrome / firefox.
Here is the full code that allows me to get the table via internet explorer webscraping and then drop it onto an excel spreadsheet.
Any help is appreciated.
Sub GetTable2()
Dim IE As Object
Dim doc As HTMLDocument
Dim WeatherTable As HTMLTable
Dim WeatherTableRows As HTMLTableRow
Dim HTMLTableCell As HTMLTableCell
Dim HeaderRow As Boolean
Dim RowCount As Long
Dim ColumnCount As Long
Dim i As Long
RowCount = 1
ColumnCount = 1
HeaderRow = True
Set IE = CreateObject("InternetExplorer.Application")
IE.navigate "https://weather.com/weather/tenday/l/12345:4:US"
IE.Visible = True
'Application.Wait (Now() + TimeValue("00:02:00"))
Set doc = IE.document
Set WeatherTable = doc.getElementsByClassName("twc-table")(0)
For Each WeatherTableRows In WeatherTable.Rows
i = 1
For Each HTMLTableCell In WeatherTableRows.Cells
If HeaderRow = True Then
ThisWorkbook.Sheets("Sheet5").Cells(RowCount, ColumnCount).Value = HTMLTableCell.innerText
ColumnCount = ColumnCount + 1
Else
If i = 1 Then
i = i + 1
Else
ThisWorkbook.Sheets("Sheet5").Cells(RowCount, ColumnCount).Value = HTMLTableCell.innerText
ColumnCount = ColumnCount + 1
End If
End If
Next HTMLTableCell
HeaderRow = False
ColumnCount = 1
RowCount = RowCount + 1
Next WeatherTableRows
IE.Quit
Set IE = Nothing
Set doc = Nothing
End Sub
You missed an s. It is plural as you are getting a collection of elements by className.
Set WeatherTable = doc.getElementsByClassName("twc-table")(0)
To make your approach slightly cleaner, you can try this way as well.
Sub FetchTabularData()
Dim elem As Object, trow As Object, S$, R&, C&
[B1:G1] = [{"Day","Description","High/Low","Precip","Wind","Humidity"}]
With New WinHttp.WinHttpRequest
.Open "GET", "https://weather.com/weather/tenday/l/12345:4:US", False
.send
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
For Each elem In .querySelector(".twc-table").getElementsByTagName("tr")
For Each trow In elem.getElementsByTagName("td")
C = C + 1: Cells(R + 1, C) = trow.innerText
Next trow
C = 0: R = R + 1
Next elem
End With
End Sub
Reference to add:
Microsoft HTML Object Library
Microsoft WinHTTP Services, version 5.1

libre office macro find replace formatted text

I want to go through a document, find all center-aligned text and delete it. I can set up formatted text in the Find & Replace tool, but when I record a macro, it doesn't save the formatting. Does anyone know how to edit the Basic code to do this?
Also, is the OpenOffice documentation compatible with LibreOffice?
Recording in OpenOffice generates dispatcher code, which usually isn't very good. It's better to use the UNO API when writing macros. Here is some code that does what you want:
Sub DeleteCenteredLines
oDoc = ThisComponent
Dim vDescriptor, vFound
' Create a descriptor from a searchable document.
vDescriptor = oDoc.createSearchDescriptor()
' Set the text for which to search and other
With vDescriptor
.searchString = ""
.searchAll=True
End With
Dim srchAttributes(0) As New com.sun.star.beans.PropertyValue
srchAttributes(0).Name = "ParaAdjust"
srchAttributes(0).Value = com.sun.star.style.ParagraphAdjust.CENTER
vDescriptor.SetSearchAttributes(srchAttributes())
' Find the first one
vFound = oDoc.findFirst(vDescriptor)
Do While Not IsNull(vFound)
vFound.setPropertyValue("ParaAdjust", com.sun.star.style.ParagraphAdjust.LEFT)
oTC = oDoc.Text.createTextCursorByRange(vFound)
oTC.gotoStartOfParagraph(false)
oTC.gotoEndOfParagraph(true)
oTC.String = ""
oTC.goRight(1,true)
oTC.String = ""
vFound = oDoc.findNext( vFound.End, vDescriptor)
Loop
End Sub
Check out http://www.pitonyak.org/AndrewMacro.odt for examples of many common tasks. In my experience, looking for examples in this document is usually easier than trying to record macros and make sense of what was recorded.
This works for OpenOffice as well as LibreOffice. Generally the API is the same for both.
My solution, which wraps italic and superscript text in tags.
(It is extremely slow. Maybe someone can improve it.)
Sub replace_italico_sobrescrito_por_tag()
MsgBox "It takes long to run."
Dim vartemp As String
theDoc = thisComponent
iSheetsCount = theDoc.Sheets.Count
Dim theCell As Object, rText As String, textSlice As String, textItalic As Long, textSup As Integer
Dim theParEnum As Object, theParElement As Object
Dim theSubEnum As Object, theSubElement As Object
For k=0 to iSheetsCount-1
Sheet = theDoc.getSheets().getByIndex(k)
dim pX as integer, pY as integer, maxcol as integer, maxrow as integer
maxcol = 100
maxrow = 500
For pX=0 to maxrow
For pY=0 to maxcol
theCell = Sheet.GetCellByPosition(pY, pX) ' getCellByPosition takes (column, row)
theParEnum = theCell.GetText().CreateEnumeration
rText = ""
Do While theParEnum.HasMoreElements
theParElement = theParEnum.NextElement
theSubEnum = theParElement.CreateEnumeration
Do While theSubEnum.HasMoreElements
textSlice = ""
theSubElement = theSubEnum.NextElement
If theCell.Type = 2 Then
textSlice = theSubElement.String
textItalic = theSubElement.CharPosture
textSup = theSubElement.CharEscapement
Else
textSlice = theCell.String
End If
If theSubElement.CharPosture >= 1 Then
textSlice = "<i>" & textSlice & "</i>"
End If
If theSubElement.CharEscapement > 0 Then
textSlice = "<sup>" & textSlice & "</sup>"
End If
rText = rText & textSlice
Loop
Loop
theCell.String=rText
Next pY
Next pX
Next k
MsgBox "End"
End Sub

VBA in Access 2010 to import emails located in Outlook Public (Sub)Folders - Including Folder Name & Attachments?

I am trying to develop an Access database for keeping track of emails in Outlook. I was able to develop the following code by combining bits and pieces from many internet searches. The attached code finally works and took me more time than I want to admit to develop. I am new to VBA programming and am trying to grunt my way through the process. Anyway, out of frustration and dread that this project could end up taking way longer than I wanted it to, I thought I would finally ask for some help. The following are features, in order of priority, that I would eventually like to add to the below code:
High Priority:
(1) Need recursive VBA code to import emails located in all subfolders.
(2) Need VBA code to insert the Folder name where the email is located into Access Database. Folder Path is not necessary.
(3) Need VBA code to insert the file name of any user attached documents.
Low Priority (Access can be used to remove duplicates until issue is resolved):
(4) Want VBA code to append data with new emails when macro is run.
Nice future options:
(5) VBA code to allow me to pick a folder. Option would allow for future flexibility.
I am running Access and Outlook 2010 on Windows 7 (64-bit computer). The following is my code so far:
Sub ImportContactsFromOutlook()
' This code is based in Microsoft Access.
' Set up DAO objects (uses existing "Email" table)
Dim rst As DAO.Recordset
Set rst = CurrentDb.OpenRecordset("Email")
' Set up Outlook objects.
Dim ol As New Outlook.Application
Dim olns As Outlook.NameSpace
Dim cf As Outlook.MAPIFolder
Dim c As Outlook.MailItem
Dim objItems As Outlook.Items
Dim Prop As Outlook.UserProperty
Set olns = ol.GetNamespace("MAPI")
'--- (5) --- VBA code to allow me to pick a folder. Option would allow for future flexibility.
Set cf = olns.GetDefaultFolder(olPublicFoldersAllPublicFolders)
'--- (1) --- Need recursive VBA code to import emails located in all subfolders.
Set objItems = cf.Items
iNumMessages = objItems.Count
If iNumMessages <> 0 Then
For i = 1 To iNumMessages
If TypeName(objItems(i)) = "MailItem" Then
Set c = objItems(i)
rst.AddNew
rst!EntryID = c.EntryID
rst!ConversationID = c.ConversationID
rst!Sender = c.Sender
rst!SenderName = c.SenderName
rst!SentOn = c.SentOn
rst!To = c.To
rst!CC = c.CC
rst!BCC = c.BCC
rst!Subject = c.Subject
rst!Attachments = c.Attachments.Count
'--- (3) --- Need VBA code to insert the file name of any user attached documents. ".Count" is used to avoid error and can be replaced.
rst!Body = c.Body
rst!HTMLBody = c.HTMLBody
rst!Importance = c.Importance
rst!Size = c.Size
rst!CreationTime = c.CreationTime
rst!ReceivedTime = c.ReceivedTime
rst!ExpiryTime = c.ExpiryTime
'--- (2) --- Need VBA code to insert the Folder name where the email is located into Access Database. Folder Path is not necessary.
rst.Update
End If
Next i
rst.Close
MsgBox "Finished."
Else
MsgBox "No e-mails to export."
End If
'--- (4) --- Want VBA code to append data with new emails when macro is run.
End Sub
Here is some helpful reference material I tried to use. Some of it has what looked like fancy tools. Because I am still learning, I either could not implement or did not understand some of them.
msdn.microsoft.com/en-us/library/ee861519(v=office.14).aspx
msdn.microsoft.com/en-us/library/office/ee861520(v=office.14).aspx
accessexperts.net/blog/2011/07/07/importing-outlook-emails-into-access/
add-in-express.com/creating-addins-blog/2011/08/15/how-to-get-list-of-attachments/
databasejournal.com/features/msaccess/article.php/3827996/Working-With-Outlook-from-Access.htm
stackoverflow.com/questions/7298591/copying-all-incoming-emails-in-outlook-inbox-and-personal-subfolders-to-excel-th
Any recommendations or direction is welcome. Thanks for the help. It is appreciated.
Here is my code as it stands now (see below). There are still a few problems when I run it. On the first time the code is run, since there are no records in the Access database table, I receive the following error:
Run-time error ‘3021’: No current record.
Is there an error check or way I can code around this? Also, after the Access database is populated, the following code only excludes those emails found in the primary folder, not the sub folder:
If ([rst]![EmailLocation] <> ofProp.Name) And ([rst]![EntryID] <> cMail.EntryID) Then
I am trying to figure out why. Last, I still need to know how to pull a list of user-attached documents into the Access database. The following code pulls all attachments, including the embedded ones, and only returns the first attachment in the document:
Set cAtch = cMail.Attachments
cntAtch = cAtch.Count
If cntAtch > 0 Then
For j = cntAtch To 1 Step -1
strAtch = cAtch.Item(j).FileName
rst!Attachments = strAtch
Next
Else
rst!Attachments = "No Attachments"
End If
Again, any help would be appreciated. Thanks.
Sub ImportMailPropFromOutlook()
' Code for specifying top level folder and initializing routine.
' Set up Outlook objects.
Dim ol As New Outlook.Application
Dim olns As Outlook.NameSpace
Dim ofO As Outlook.MAPIFolder
Dim ofSubO As Outlook.MAPIFolder
Dim objItems As Outlook.Items
Set olns = ol.GetNamespace("MAPI")
Set ofO = olns.GetDefaultFolder(olFolderInbox) '--- Specifies top level folder for importing Outlook mail.
'Set ofO = olns.PickFolder '--- Allows user to select top level folder for importing Outlook mail.
'Set info and call GetMailProp code.
Set objItems = ofO.Items
GetMailProp objItems, ofO
'Set info and call ProcessSubFolders.
For Each ofSubO In ofO.Folders
Set objItems = ofSubO.Items
ProcessSubFolders objItems, ofSubO
Next
End Sub
Sub GetMailProp(objProp As Outlook.Items, ofProp As Outlook.MAPIFolder)
' Code for writing Outlook mail properties to Access.
' Set up DAO objects (uses existing Access "Email" table).
Dim rst As DAO.Recordset
Set rst = CurrentDb.OpenRecordset("Email")
'Set Up Outlook objects.
Dim cMail As Outlook.MailItem
Dim cAtch As Outlook.Attachments
'Write Outlook mail properties to Access "Email" table.
iNumMessages = objProp.Count
If iNumMessages <> 0 Then
For i = 1 To iNumMessages
If TypeName(objProp(i)) = "MailItem" Then
Set cMail = objProp(i)
If ([rst]![EmailLocation] <> ofProp.Name) And ([rst]![EntryID] <> cMail.EntryID) Then
rst.AddNew
rst!EntryID = cMail.EntryID
rst!ConversationID = cMail.ConversationID
rst!Sender = cMail.Sender
rst!SenderName = cMail.SenderName
rst!SentOn = cMail.SentOn
rst!To = cMail.To
rst!CC = cMail.CC
rst!BCC = cMail.BCC
rst!Subject = cMail.Subject
Set cAtch = cMail.Attachments
cntAtch = cAtch.Count
If cntAtch > 0 Then
For j = cntAtch To 1 Step -1
strAtch = cAtch.Item(j).FileName
rst!Attachments = strAtch
Next
Else
rst!Attachments = "No Attachments"
End If
rst!Count = cMail.Attachments.Count
rst!Body = cMail.Body
rst!HTMLBody = cMail.HTMLBody
rst!Importance = cMail.Importance
rst!Size = cMail.Size
rst!CreationTime = cMail.CreationTime
rst!ReceivedTime = cMail.ReceivedTime
rst!ExpiryTime = cMail.ExpiryTime
rst!EmailLocation = ofProp.Name
rst.Update
End If
End If
Next i
End If
End Sub
Sub ProcessSubFolders(objItemsR As Outlook.Items, OfR As Outlook.MAPIFolder)
'Code for processing subfolders
' Set up Outlook objects.
Dim ofSubR As Outlook.MAPIFolder
'Set info and call GetMailProp code.
GetMailProp objItemsR, OfR
'Set info and call ProcessSubFolders. Recursive.
For Each ofSubR In OfR.Folders
Set objItemsR = ofSubR.Items
ProcessSubFolders objItemsR, ofSubR
Next
End Sub
I had an opportunity to work on the code some more. What I am trying to do is import emails located within all the sub-folders of my Outlook account into Access. The VBA code is in Access. I only need certain mail item properties. Mostly the ones you would need to replicate the print memo function in Outlook.
I added a few more that I thought I would need to help exclude duplicates located in the same folder. There are duplicate emails in different public sub-folders, but I need to know that in my database record.
I still need a recursive sub or function to make sure I get all the sub-folders. I tried a For/Next loop but this only searches one level of sub-folders. I could definitely use some help on this. This seems like the tough part.
My updated code is:
Sub ImportContactsFromOutlook()
' This code is based in Microsoft Access.
' Set up DAO objects (uses existing "Email" table)
Dim rst As DAO.Recordset
Set rst = CurrentDb.OpenRecordset("Email")
' Set up Outlook objects.
Dim ol As New Outlook.Application
Dim olns As Outlook.NameSpace
Dim cf As Outlook.MAPIFolder
Dim cMail As Outlook.MailItem
Dim cAtch As Outlook.Attachments
Dim objItems As Outlook.Items
Dim of As Outlook.Folder
Dim ofSub As Outlook.Folder
Set olns = ol.GetNamespace("MAPI")
'--- (5) ---
'Would eventually be nice to allow a user to select a folder. Folderpicker? Lowest priority.
Set of = olns.GetDefaultFolder(olFolderInbox)
'--- (1) ---
'Loop only searches one level down. I will need all subfolders. Most examples I saw call external Sub? Recursive?
For Each ofSub In of.Folders
Set objItems = ofSub.Items
iNumMessages = objItems.Count
If iNumMessages <> 0 Then
For i = 1 To iNumMessages
If TypeName(objItems(i)) = "MailItem" Then
Set cMail = objItems(i)
rst.AddNew
rst!EntryID = cMail.EntryID
rst!ConversationID = cMail.ConversationID
rst!Sender = cMail.Sender
rst!SenderName = cMail.SenderName
rst!SentOn = cMail.SentOn
rst!To = cMail.To
rst!CC = cMail.CC
rst!BCC = cMail.BCC
rst!Subject = cMail.Subject
'--- (3) ---
'Code only inserts first attachment. Code Also inserts embedded attachments.
'Need code to insert all user selected attachments (ex. PDF Document) and no embedded attachments.
Set cAtch = cMail.Attachments
cntAtch = cAtch.Count
If cntAtch > 0 Then
For j = cntAtch To 1 Step -1
strAtch = cAtch.Item(j).FileName
rst!Attachments = strAtch
Next
Else
rst!Attachments = "No Attachments"
End If
rst!Count = cMail.Attachments.Count
rst!Body = cMail.Body
rst!HTMLBody = cMail.HTMLBody
rst!Importance = cMail.Importance
rst!Size = cMail.Size
rst!CreationTime = cMail.CreationTime
rst!ReceivedTime = cMail.ReceivedTime
rst!ExpiryTime = cMail.ExpiryTime
'--- (2) ---
' Solved - Figured out how to call folder location into database.
rst!EmailLocation = ofSub.Name
rst.Update
End If
Next i
End If
Next
'--- (4) ---
'Still need code to append Access database with only new records.
'Duplicate email can exist in different subfolders but not same subfolder.
End Sub
Any help would be appreciated.
I was able to find some examples on the web to resolve the exclude duplicate mail records and Run-time error '3021' with the following code:
' Skip mail that is already stored: a record counts as a duplicate when both [EntryID] and [EmailLocation] match.
If DCount("*", "Email", "[EntryID] = """ & cMail.EntryID & """ And [EmailLocation] = """ & ofProp.Name & """") = 0 Then
'Code used to insert individual outlook mail properties.
End If
Still need to resolve the issue with attachments. Any help would be appreciated. Thank you.
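On the attachment issue: the loop shown earlier assigns rst!Attachments on every pass, so each file name overwrites the previous one and only one name survives. A minimal sketch of one way around that is to build a single delimited string first and assign it once. The function name AttachmentNames and the "; " separator are my own assumptions. It does not filter out embedded attachments, and it assumes every attachment exposes a FileName and that the [Attachments] field is long enough to hold the joined string.

```vba
' Joins the file names of all attachments on a mail item into one string.
' mail is expected to be an Outlook.MailItem (late bound here).
' (Hypothetical helper; embedded attachments are NOT filtered out.)
Public Function AttachmentNames(ByVal mail As Object, _
                                Optional ByVal separator As String = "; ") As String
    Dim j As Long
    Dim names As String

    For j = 1 To mail.Attachments.Count
        ' Add the separator before every name except the first
        If Len(names) > 0 Then names = names & separator
        names = names & mail.Attachments.Item(j).FileName
    Next j

    ' Keep the placeholder text used in the original code
    If Len(names) = 0 Then names = "No Attachments"
    AttachmentNames = names
End Function
```

In GetMailProp the whole If/For block could then be reduced to rst!Attachments = AttachmentNames(cMail).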
Check this example for selecting the Outlook contact, from code written by Helen Feddema.
"Exporting Calendar Items to Excel"
http://www.helenfeddema.com/Code%20Samples.htm

html parsing of cricinfo scorecards

Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel
As an example the current Australian Big Bash 2011/12 scorecards are available from
Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE or using XMLHTTP and then using regular expressions) to scrape data from websites, ie
Extract values from HTML TD and Tr
In that same question a comment was posted suggesting html parsing - which I hadn't come across before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags
Query
While I could write a regex to parse the cricket data below I would like advice as to how I could efficiently retrieve these results with html parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
the date/name of the match
Team 1 name
the output should dump up to 11 records for Team 1 (blank records where players haven't batted, ie "Did Not Bat")
Team 2 name
the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
Nirvana for me would be a solution that I could deploy using VBA or VBscript so I could fully automate my analysis, but I presume I will have to use a separate tool for the html parse.
Sample Site links and Data to be Extracted
There are 2 techniques that I use for "VBA". I will describe them one by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many, I will cover even the obvious. Please feel free to skip whatever part you already know.
1) Using FireFox / Firebug Addon / Fiddler
FireFox : http://en.wikipedia.org/wiki/Firefox
Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29
Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29
Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scraped.
I use Fiddler only when I am using XMLHTTP. It helps me see the exact info being passed when you click on a button. Because of the increase in the number of bots that scrape sites, most sites now capture your mouse coordinates and pass that information along to prevent automated scraping, and Fiddler helps you debug the info that is being passed. I will not go into much detail about it here, as this info can be used maliciously.
Now let's take a simple example on how to scrape the URL posted in your question
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file then we can definitely work with the data to do our analysis. Here is sample code which will dump that table in Sheet1
Before we proceed, I would recommend closing all Excel instances and starting a fresh one.
Launch VBA and insert a Userform. Place a command button and a WebBrowser control. Your Userform might look like this
Paste this code in the Userform code area
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
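'~~> PtrSafe is required on 64-bit Office (VBA7); older hosts use the plain Declare
#If VBA7 Then
Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If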
Private Sub CommandButton1_Click()
Dim URL As String
Dim oSheet As Worksheet
Set oSheet = Sheets("Sheet1")
URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
PopulateDataSheets oSheet, URL
MsgBox "Data scraped. Please check " & oSheet.Name
End Sub
Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim insertRow As Long, Row As Long, col As Long
On Error GoTo whoa
WebBrowser1.navigate URL
WaitForWBReady
Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
With wsk
.Cells.Clear
insertRow = 0
For Row = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(Row)
If Trim(tr.innerText) <> "" Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Total" Then
insertRow = insertRow + 1
For col = 0 To tr.Cells.Length - 1
.Cells(insertRow, col + 1) = tr.Cells(col).innerText
Next
End If
End If
End If
Next
End With
whoa:
Unload Me
End Sub
Private Sub Wait(ByVal nSec As Long)
nSec = nSec + Timer
While Timer < nSec
DoEvents
Sleep 100
Wend
End Sub
Private Sub WaitForWBReady()
Wait 1
While WebBrowser1.ReadyState <> 4
Wait 3
Wend
End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the url in the "New Web Query Window" and click on "Go"
Once the page has loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
, Destination:=Range("$A$1"))
.Name = "524915"
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = """inningsBat1"""
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
End Sub
Hope this helps. Let me know if you still have some queries.
Sid
For anyone else interested in this I ended up using the code below based on Siddharth Rout's earlier answer
XMLHttp was significantly quicker than automating IE
the code generates a CSV file for each series to be downloaded (held in the X variable)
the code dumps each match to a regular 29 row range (regardless of how many players batted) to facilitate easier analysis later on
Public Sub PopulateDataSheets_XML()
Dim URL As String
Dim ws As Worksheet
Dim lngRow As Long
Dim lngRecords As Long
Dim lngWrite As Long
Dim lngSpare As Long
Dim lngInnings As Long
Dim lngRow1 As Long
Dim X(1 To 15, 1 To 4) As String
Dim objFSO As Object
Dim objTF As Object
Dim xmlHttp As Object
Dim htmldoc As HTMLDocument
Dim htmlbody As htmlbody
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim strInnings As String
s = Timer()
Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
Set objFSO = CreateObject("scripting.filesystemobject")
X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
X(1, 2) = 501198
X(1, 3) = 501271
X(1, 4) = "indian-premier-league-2011"
X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
X(2, 2) = 524915
X(2, 3) = 524945
X(2, 4) = "big-bash-league-2011"
X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
X(3, 2) = 461028
X(3, 3) = 461047
X(3, 4) = "big-bash-league-2010"
Set htmldoc = New HTMLDocument
Set htmlbody = htmldoc.body
For lngRow = 1 To UBound(X, 1)
If Len(X(lngRow, 1)) = 0 Then Exit For
Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
For lngRecords = X(lngRow, 2) To X(lngRow, 3)
URL = X(lngRow, 1) & lngRecords & ".html"
xmlHttp.Open "GET", URL
xmlHttp.send
Do While xmlHttp.Status <> 200
DoEvents
Loop
htmlbody.innerHTML = xmlHttp.responseText
objTF.writeline X(lngRow, 1) & lngRecords & ".html"
For lngInnings = 1 To 2
strInnings = "Innings " & lngInnings
objTF.writeline strInnings
Set tbl = Nothing
On Error Resume Next
Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
On Error GoTo 0
If Not tbl Is Nothing Then
lngWrite = 0
For lngRow1 = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(lngRow1)
If Trim(tr.innerText) <> vbNewLine Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Extras" Then
If Len(tr.Cells(1).innerText) > 0 Then
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
End If
Else
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
Exit For
End If
End If
End If
Next
For lngSpare = 12 To lngWrite Step -1
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
Else
For lngSpare = 1 To 13
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
End If
Next
Next
Next
'Call ConsolidateSheets
End Sub
RegEx is not a complete solution for parsing HTML because HTML is not a regular language.
You should use the HtmlAgilityPack to query the HTML. It lets you query the document with XPath, or with jQuery-style CSS selectors through an add-on such as Fizzler.
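HtmlAgilityPack is a .NET library, so from VBA itself a comparable parser-based approach is to load the markup into an MSHTML document and walk the table through the DOM. Below is a minimal sketch, assuming a Windows machine where the late-bound "htmlfile" object is available; the function name FirstCellTexts is hypothetical.

```vba
' Parses an HTML string with MSHTML (late bound) and returns the text of
' the first cell of each row in the table with the given id.
' (Hypothetical helper; returns an empty Collection if the id is not found.)
Public Function FirstCellTexts(ByVal htmlText As String, _
                               ByVal tableId As String) As Collection
    Dim doc As Object, tbl As Object, tr As Object
    Dim result As New Collection

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = htmlText

    Set tbl = doc.getElementById(tableId)
    If Not tbl Is Nothing Then
        For Each tr In tbl.Rows
            ' Skip rows without cells
            If tr.Cells.Length > 0 Then result.Add Trim$(tr.Cells(0).innerText)
        Next tr
    End If

    Set FirstCellTexts = result
End Function
```

For the scorecards in the question it could be called as FirstCellTexts(xmlHttp.responseText, "inningsBat1"), with no pattern matching on the raw markup.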
As quite a few people may see this I thought I would use it as a chance to demonstrate a few features I rarely see people using in VBA web-scraping: deleteRow, querySelector and use of clipboard to write out a table (complete with formatting and hyperlinks) to a sheet based on the table.outerHTML.
deleteRow is used to remove the unwanted rows. querySelector is used to apply faster CSS selectors to match on nodes. Modern browsers/HTML parsers are optimized for CSS, and class selectors (which I use) are the second fastest selector type (after id).
Use of css selectors and understanding htmlTable methods/properties will allow for much greater flexibility in your web-scraping endeavours. Understanding the use of the clipboard means a simple copy paste method for transferring a table to Excel.
Execution could easily be tied to a button push and the url read in from a cell.
VBA:
Option Explicit
Public Sub test()
WriteOutTable "https://www.espncricinfo.com/series/8044/scorecard/524935/hobart-hurricanes-vs-melbourne-stars-big-bash-league-2011-12"
End Sub
Public Sub WriteOutTable(ByVal url As String)
'required VBE (Alt+F11) > Tools > References > Microsoft HTML Object Library ; Microsoft XML, v6 (your version may vary)
Dim hTable As MSHTML.HTMLTable, clipboard As Object
Dim xhr As MSXML2.xmlhttp60, html As MSHTML.htmlDocument
Dim rowCount As Long, i As Long 'needed because of Option Explicit
Set xhr = New MSXML2.xmlhttp60
Set html = New MSHTML.htmlDocument
With xhr
.Open "GET", url, False
.Send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".batsman")
rowCount = hTable.Rows.Length - 1
For i = rowCount To 0 Step -1
Select Case True
Case i = rowCount Or i = rowCount - 1 Or InStr(hTable.Rows(i).outerHTML, "wicket-details") > 0
hTable.deleteRow i
End Select
Next
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ActiveSheet.Cells(1, 1).PasteSpecial
End Sub
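As mentioned before the code, execution could be tied to a button with the URL read from a cell. A minimal sketch of such a wrapper follows. The sheet name "Config" and cell A1 are assumptions of mine; a separate sheet is used because WriteOutTable pastes into A1 of the active sheet and would otherwise overwrite the URL.

```vba
' Reads the scorecard URL from a cell and passes it to WriteOutTable (defined above).
' Assumption: the URL lives in cell A1 of a sheet named "Config" (hypothetical).
Public Sub WriteOutTableFromCell()
    Dim url As String

    url = Trim$(CStr(ThisWorkbook.Worksheets("Config").Range("A1").Value))

    If Len(url) = 0 Then
        MsgBox "No URL found in Config!A1."
        Exit Sub
    End If

    WriteOutTable url
End Sub
```

Assign WriteOutTableFromCell to a form button (right click > Assign Macro) on the sheet where the table should be written.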