I am working with someone else's code in an old file that other people still use. I want to update it to make it more efficient, but I need a little help. The VBA procedure is below. I need it to fetch the response but keep only one word, which changes every time the procedure runs. I could use a regex with objRE.Pattern = "|", but the word changes depending on the status.
HTML:
<span onmouseover="ShowText('Message','blahblah'); return true;"
onmouseout="HideText('Message'); return true;"
href="javascript:ShowText('Message')">---(PSA)---</span>
</font><a href='?srn=numbers12131131'target='_self'><font color='#6666FF'
size='3'>numbers123232343</font></a><font size='3'>----Installed----MUM
Indication:In Scope-<font color='#00CC00'>PASS WITH WARNING</font>--- (20181018)
</td><tr></table> </b><br>
<table class="OrderForm" width="1000"> '
I just want the Installed status in my excel sheet.
VBA code that needs work:
Sub GetComment()
Dim book As Workbook
Dim sheet As Worksheet
Dim row As Integer
Dim SRN As String
Dim whttp As Object
Set book = ThisWorkbook
Set sheet = book.Worksheets("CMT Data")
Set whttp = CreateObject("WinHTTP.WinHTTPrequest.5.1")
row = 2
SRN = sheet.Cells(row, 1)
Do While SRN <> ""
Debug.Print SRN
whttp.Open "GET", "www.websitedatgoeshere.com" & SRN, False
whttp.SetRequestHeader "Cookie", "mycookiefromwebsite;"
whttp.send
Debug.Print whttp.responseText
sheet.Cells(row, 2) = whttp.responseText
row = row + 1
SRN = sheet.Cells(row, 1)
Loop
Set whttp = Nothing
End Sub
This works only if the word is always between "----" and "----" and is the first such occurrence in the response. If it is not the first, adjust the index 1 as required.
Debug.Print Split(Split(whttp.responseText, "----")(1), "----")(0)
sheet.Cells(row, 2) = Split(Split(whttp.responseText, "----")(1), "----")(0)
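A slightly more defensive take on the same Split idea, as a minimal sketch. ExtractBetween is a hypothetical helper name, not part of the original code. It returns an empty string instead of raising a "subscript out of range" error when the response contains no delimiter at the requested position.

```vba
Option Explicit

' Hypothetical helper: splits text on delim and returns piece number index
' (index 1 = the text after the first delimiter, up to the next delimiter
' or the end of the string). Returns "" when there are too few pieces.
Public Function ExtractBetween(ByVal text As String, _
                               ByVal delim As String, _
                               Optional ByVal index As Long = 1) As String
    Dim parts() As String
    parts = Split(text, delim)
    If index >= 0 And index <= UBound(parts) Then
        ExtractBetween = Trim$(parts(index))
    End If
End Function

Public Sub DemoExtractBetween()
    ' Sample text taken from the HTML fragment in the question.
    Debug.Print ExtractBetween("<font size='3'>----Installed----MUM", "----")
End Sub
```

Inside the loop this would be used as sheet.Cells(row, 2) = ExtractBetween(whttp.responseText, "----").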
I am writing a macro that formats a database into a table and then selects ranges from this table to send to different people, depending on the range.
Depending on the range, several columns can be empty. In my loop, when the temporary workbook is created to paste the subtable I want to send, I would like to add a function or step that checks whether a column is empty (I have headers) and, if so, hides that column for this range only. The range, without the hidden empty columns, should then be converted to HTML in the email body, and the loop should continue through the rest of the table.
Thanks to a previous post, my VBA code runs smoothly, but as soon as I add the part that is supposed to hide columns, it stops working. I guess I am not adding it in the right place, but I am not sure.
I tried adding it just after RangeToEmail is set, and in RangetoHTML, the function that creates TempWB, but neither works (see both pieces of code below).
The code I used on a static range and which is working, to hide the column is
Dim iFirstCol As Integer, iLastCol As Integer, i As Integer
'variables to hold the first and last column numbers
iFirstCol = Range("A2").Column
iLastCol = Range("W2").Column
LastRow = Range(Range("A2"), Range("A2").End(xlDown))
'count backwards through columns
For i = iLastCol To iFirstCol Step -1
'if all cells are blank, hide the column
If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then
Columns(i).EntireColumn.Hidden = True
End If
Next i
and here is the code I use to go from my table to the different subtable and then through TemporaryWB convert to html in my email body
Option Explicit
Sub GetNames()
Dim NameArray() As String
Dim NameRange As Range
Dim C As Range
Dim Counter As Integer
Dim NameFilter As Variant
Dim RangeToEmail As Range
Dim EmailAddress() As String
'Email Stuff
Dim objOutlook As Object
Set objOutlook = CreateObject("Outlook.application")
Dim objEmail As Object
Set NameRange = Range(Range("H2"), Range("H2").End(xlDown))
ReDim NameArray(1 To Range(Range("H2"), Range("H2").End(xlDown)).Rows.Count)
ReDim EmailAddress(1 To Range(Range("H2"), Range("H2").End(xlDown)).Rows.Count)
Counter = 0
For Each C In NameRange
Counter = Counter + 1
NameArray(Counter) = C.Value
EmailAddress(Counter) = C.Offset(, 3)
Next
NameArray = ArrayRemoveDups(NameArray)
EmailAddress = ArrayRemoveDups(EmailAddress)
Counter = 0
For Each NameFilter In NameArray
Counter = Counter + 1
ActiveSheet.Range("A1").AutoFilter Field:=8, Criteria1:=NameFilter
Set RangeToEmail = ActiveSheet.ListObjects("DataTable").Range
Set objEmail = objOutlook.CreateItem(olMailItem)
With objEmail
.To = EmailAddress(Counter)
.Subject = "TestSubject"
.HTMLBody = "Hello, <br><br>Please see the latest report:<br><br>" & RangetoHTML(RangeToEmail)
.Display
End With
Set objEmail = Nothing
Next
ActiveSheet.Range("A1").AutoFilter
End Sub
Function ArrayRemoveDups(MyArray As Variant) As Variant
Dim nFirst As Long, nLast As Long, i As Long
Dim item As String
Dim arrTemp() As String
Dim Coll As New Collection
'Get First and Last Array Positions
nFirst = LBound(MyArray)
nLast = UBound(MyArray)
ReDim arrTemp(nFirst To nLast)
'Convert Array to String
For i = nFirst To nLast
arrTemp(i) = CStr(MyArray(i))
Next i
'Populate Temporary Collection
On Error Resume Next
For i = nFirst To nLast
Coll.Add arrTemp(i), arrTemp(i)
Next i
Err.Clear
On Error GoTo 0
'Resize Array
nLast = Coll.Count + nFirst - 1
ReDim arrTemp(nFirst To nLast)
'Populate Array
For i = nFirst To nLast
arrTemp(i) = Coll(i - nFirst + 1)
Next i
'Output Array
ArrayRemoveDups = arrTemp
End Function
Function RangetoHTML(rng As Range)
Dim fso As Object
Dim ts As Object
Dim TempFile As String
Dim TempWB As Workbook
TempFile = Environ$("temp") & "/" & Format(Now, "dd-mm-yy h-mm-ss") & ".htm"
' Copy the range and create a workbook to receive the data.
rng.Copy
Set TempWB = Workbooks.Add(1)
With TempWB.Sheets(1)
.Cells(1).PasteSpecial Paste:=8
.Cells(1).PasteSpecial xlPasteValues, , False, False
.Cells(1).PasteSpecial xlPasteFormats, , False, False
.Cells(1).Select
Application.CutCopyMode = False
On Error Resume Next
.DrawingObjects.Visible = True
.DrawingObjects.Delete
On Error GoTo 0
End With
' Publish the sheet to an .htm file.
With TempWB.PublishObjects.Add( _
SourceType:=xlSourceRange, _
Filename:=TempFile, _
Sheet:=TempWB.Sheets(1).Name, _
Source:=TempWB.Sheets(1).UsedRange.Address, _
HtmlType:=xlHtmlStatic)
.Publish (True)
End With
' Read all data from the .htm file into the RangetoHTML subroutine.
Set fso = CreateObject("Scripting.FileSystemObject")
Set ts = fso.GetFile(TempFile).OpenAsTextStream(1, -2)
RangetoHTML = ts.ReadAll
ts.Close
RangetoHTML = Replace(RangetoHTML, "align=center x:publishsource=", _
"align=left x:publishsource=")
' Close TempWB.
TempWB.Close savechanges:=False
' Delete the htm file.
Kill TempFile
Set ts = Nothing
Set fso = Nothing
Set TempWB = Nothing
End Function
First, LastRow is not declared as a variable, and therefore you did not notice that
LastRow = Range(Range("A2"), Range("A2").End(xlDown))
actually writes an array of values into LastRow, so your first code cannot work properly. Make sure you use Option Explicit (I recommend always activating it: in the VBA editor go to Tools › Options › Require Variable Declaration).
The issue is probably that your empty columns have headers too, in which case
WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i)))
will never be 0, because your range includes the header in row 1, Cells(1, i). To exclude the header, start the range at row 2, i.e. Cells(2, i).
Finally, all this code applies to ActiveSheet, which is not very reliable because the active sheet can change with a single mouse click. If you can specify the worksheet precisely by name, do so. If it really has to run on multiple sheets (so you really want the active one), at least make sure the sheet cannot change while the code executes by reading it once into a variable: Set ws = ThisWorkbook.ActiveSheet.
I would use
Dim ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1") 'name your sheet here!
'or if it really is the active sheet do
'Set ws = ThisWorkbook.ActiveSheet 'and make sure you only use `ws` from now!
'variables to hold the first and last column numbers
Dim iFirstCol As Long
iFirstCol = ws.Columns("A").Column
Dim iLastCol As Long
iLastCol = ws.Columns("W").Column
Dim LastRow As Long
LastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row 'search upwards from the bottom of column A
'count backwards through columns
Dim i As Long
For i = iLastCol To iFirstCol Step -1
'if all cells are blank, hide the column
If Application.WorksheetFunction.CountA(ws.Range(ws.Cells(2, i), ws.Cells(LastRow, i))) = 0 Then
ws.Columns(i).EntireColumn.Hidden = True
End If
Next i
Apply the same to the rest of your code to make it more reliable.
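To make the column check reusable for whichever sheet holds the pasted subtable, the same logic can be wrapped in a procedure that takes the worksheet as a parameter. This is only a sketch: HideEmptyColumns and its headerRows parameter are names I made up, and it assumes the data starts at the top of the sheet's used range with one header row.

```vba
Option Explicit

' Hides every column of ws whose cells below the header rows are all empty.
' headerRows is the number of header rows to ignore (assumed to be 1).
Public Sub HideEmptyColumns(ByVal ws As Worksheet, _
                            Optional ByVal headerRows As Long = 1)
    Dim used As Range, body As Range
    Dim bodyRows As Long, i As Long

    Set used = ws.UsedRange
    bodyRows = used.Rows.Count - headerRows
    If bodyRows <= 0 Then Exit Sub          ' nothing below the headers

    For i = used.Columns.Count To 1 Step -1
        ' Shrink the column to the body height first, then shift it down
        ' past the headers, so the range never runs off the sheet.
        Set body = used.Columns(i).Resize(bodyRows).Offset(headerRows)
        If Application.WorksheetFunction.CountA(body) = 0 Then
            body.EntireColumn.Hidden = True
        End If
    Next i
End Sub
```

One place to try it would be inside RangetoHTML, right after the paste, as HideEmptyColumns TempWB.Sheets(1). Whether the published HTML then leaves the hidden columns out is something to check on your data. Note that CountA also counts cells holding a formula that returns "".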
With Google Translate, the French to English is not returning the correct values when using the code below.
If I manually use the Google Translate app I get the correct translation.
For example;
The code translates "salle de l'émetteur" as "director's room".
The Google Translate app correctly returns "transmitter room".
If I inspect the elements in the Google Translate app, I see the correct translation at
span class="tlid-translation translation" lang = "en"
I can't figure out how to get the innerText from this span using its class.
Is there a way to get the translation from
span class="tlid-translation translation" lang = "en"
instead of
objDivs = objHTML.getElementsByTagName("div"), objDiv.className = "t0"
Public Function Translate(strInput As String, strSourceLng As String, strTargetLng As String) As String
Dim strURL As String
Dim objHTTP As Object
Dim objHTML As Object
Dim objDivs As Object
Dim objDiv As Object
Dim strTranslated As String
' send query to web page
strURL = "https://translate.google.com/m?hl=" & strSourceLng & _
"&sl=" & strSourceLng & _
"&tl=" & strTargetLng & _
"&ie=UTF-8&prev=_m&q=" & strInput
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP") 'late binding
objHTTP.Open "GET", strURL, False
objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
objHTTP.send ""
' create an html document
Set objHTML = CreateObject("htmlfile")
With objHTML
.Open
.Write objHTTP.responseText
.Close
End With
Set objDivs = objHTML.getElementsByTagName("div")
For Each objDiv In objDivs
If objDiv.className = "t0" Then
strTranslated = objDiv.innerText
If strTranslated <> "" Then
Translate = strTranslated
End If
End If
Next objDiv
Set objHTML = Nothing
Set objHTTP = Nothing
End Function
If you use early bound html document i.e. MSHTML.HTMLDocument then you get access to querySelector and can try using css selectors to target that element
e.g.
Dim html As MSHTML.HTMLDocument
Set html = New MSHTML.HTMLDocument
html.body.innerHTML = objHTTP.responseText
Debug.Print html.querySelector(".translation[lang=en]").innerText
Requires VBA>Tools>References> Add reference to Microsoft HTML Object Library.
This assumes the translation is in the .innerText. It would help to share the relevant part of the response showing the translation.
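If no element matches the selector, querySelector returns Nothing and reading .innerText raises error 91 ("Object variable or With block not set"). A minimal sketch of a guard follows; TextOrEmpty is a hypothetical helper name, and the sample HTML in the demo is made up for illustration rather than taken from a real response.

```vba
Option Explicit

' Requires a reference to Microsoft HTML Object Library.
' Returns the inner text of the first node matching selector,
' or "" when nothing matches (instead of raising error 91).
Public Function TextOrEmpty(ByVal html As MSHTML.HTMLDocument, _
                            ByVal selector As String) As String
    Dim node As Object
    Set node = html.querySelector(selector)
    If Not node Is Nothing Then TextOrEmpty = node.innerText
End Function

Public Sub DemoTextOrEmpty()
    Dim html As MSHTML.HTMLDocument
    Set html = New MSHTML.HTMLDocument
    ' Made-up markup that mimics the span seen in the browser inspector.
    html.body.innerHTML = _
        "<span class=""translation"" lang=""en"">transmitter room</span>"
    Debug.Print TextOrEmpty(html, ".translation[lang=en]")
    Debug.Print Len(TextOrEmpty(html, ".missing"))
End Sub
```

An empty result from the real response would suggest that the span is not present in the HTML returned to the XMLHTTP request, which is worth checking by printing responseText.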
Thanks QHarr,
I tried your recommendation, but get error 91 "Object variable or With block not set"... I must be missing something or it's just not clear to me.
Maybe the translation issue is with XMLHTTP, because it returns the same translation when you enter =Translate(A1,"fr","en") in a cell in Excel.
I did, however, try your recommendation from "Excel vba Translate IE.Document empty",
and this does return the correct translation: "salle de l'émetteur" now gives "transmitter room" with ie.navigate2 "https://translate.google.com/#" & "fr" & "/" & "en" & "/" & str. However, this method is too slow for translating worksheets, since I need to translate worksheets of 60 rows x 36 columns.
I tweaked your code so that it opens IE once and then loops through each cell with a single IE (Google Translate) window open, instead of opening and closing the window for each cell. It works much faster in comparison. I just need to sort out timing issues, because it doesn't always put the correct data in the cells when it fetches the translation from Google Translate.
I sometimes get a run-time error at MyStrings = ie.document.querySelector(".translation").innerText because the element is empty, so I added a dwell time to the process. Less than 0.5 sec seems too short, but more than 0.8 sec per translated cell seems to work.
At least it's a step in the right direction. I'll keep playing with the timing to optimize the process, unless there are other workable solutions or recommendations.
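One alternative to a fixed dwell time is to poll for the element until it has text or a timeout expires. The sketch below is an assumption-laden illustration: WaitForText is a hypothetical helper, and the 5 second default is arbitrary.

```vba
Option Explicit

' Polls doc (an HTML document object such as ie.document) until the first
' node matching selector has non-empty text, or until timeoutSec elapses.
' Returns "" on timeout.
Public Function WaitForText(ByVal doc As Object, ByVal selector As String, _
                            Optional ByVal timeoutSec As Single = 5) As String
    Dim startTime As Single, elapsed As Single
    Dim node As Object

    startTime = Timer
    Do
        Set node = Nothing
        On Error Resume Next            ' the document may not be ready yet
        Set node = doc.querySelector(selector)
        On Error GoTo 0

        If Not node Is Nothing Then
            If Len(node.innerText) > 0 Then
                WaitForText = node.innerText
                Exit Function
            End If
        End If

        DoEvents
        elapsed = Timer - startTime
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop While elapsed < timeoutSec
End Function
```

It would be called as strTranslated = WaitForText(ie.document, ".translation"). It does not replace a check for stale text from the previous cell, so a comparison against the last result is still needed.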
So after playing around with this, my crude code below seems to do the job and seems to translate faster... so far!
Function Translate_fr_en()
'***This function loops through an array called Data0 loaded with all the values of the worksheet "Translate_Sheet".
' It translates each row of select columns to a mirror image temp worksheet called "Temp".
' It skip past columns not requiring translations since the values are numeric.
Dim str, strTranslated, TempStr As String
Dim ws As Worksheet
Dim i, ii As Integer
Dim col As Integer
Dim LastRow As Integer
Dim ie As Object
Dim Flag As Boolean
Set wb = ThisWorkbook
Set ws = wb.Sheets("Temp")
Set ie = CreateObject("InternetExplorer.Application")
rcount = 3 'Temp worksheet row counter
ii = 0 'array row counter
lrow = wb.Sheets("Translate_Sheet").UsedRange.rows.count 'worksheet requiring translation
LastRow = lrow - 2
With ie
For col = 1 To 36
Select Case col
Case 1, 5, 6, 9, 10, 11, 18, 21, 31, 32, 33, 34, 36 'Selected columns containing text to translate...other columns not need translation because of numeric values
'Translate columns with text values requiring translation
Do Until ii = LastRow '# of rows
i = col - 1
str = Data0(i, ii) '"Translate_Worksheet" worksheet data preloaded into Data0() array
If str = "" Then 'If blank row than do nothing and skip translation
ii = ii + 1 'Array row counter
rcount = rcount + 1 'keep the Temp sheet row aligned with the array row
Else 'then translate columns with text
skip:
.navigate2 "https://translate.google.com/#" & "fr" & "/" & "en" & "/" & str
While .busy Or .readystate < 4: DoEvents: Wend
If Flag = False Then
Application.Wait (Now + TimeValue("0:00:01") / 1.3) 'wait 0.769 second for the first translation otherwise returns blank value
Flag = True
End If
strTranslated = ie.document.querySelector(".translation").innerText 'get translated text
'String comparison...if translated text has "..." at the end of string do again.
If strTranslated = CStr(TempStr & "...") Then
GoTo skip
End If
ws.Cells(rcount, col) = strTranslated 'load cell with translation
TempStr = ws.Cells(rcount, col) 'TempStr is temp string to hold value for comparison
rcount = rcount + 1
ii = ii + 1 'advance to the next array row, otherwise the loop never ends
End If
Loop
ii = 0
rcount = 3
Case Else 'load Temp worksheet columns with numeric values not requiring translation
i = col - 1 'array column index for this worksheet column
ii = 0
rcount = 3
Do Until ii = LastRow
str = Data0(i, ii) 'worksheet data
ws.Cells(rcount, col) = str
ii = ii + 1
rcount = rcount + 1
Loop
ii = 0
rcount = 3
End Select
Next col
End With
Set ie = Nothing
End Function
I am currently trying to write some VBA code whose general purpose is:
Type input into a cell "A" in a defined range.
Next to this particular cell (within the range), I want different pieces of information from a website search to be printed beside the input cell "A".
More specifically, I want to type stock tickers into a column (e.g. "IBM"). When I do, a procedure should start that goes to finance.yahoo.com, collects information about this specific stock (e.g. "International Business Machines"), and prints it to the right of the cell where I typed the ticker.
The goal is to type in 20-30 stock tickers and have the information retrieved for every ticker and printed to the right of it.
I have already figured out how to type one ticker into a specific cell, retrieve the desired data from the website, and print it into specific cells to the right of the input cell.
My challenge now is to do this for a large range of cells below the first input cell.
My code as of now:
Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Row = Range("Sheet1!$A$2").Row And _
Target.Column = Range("Sheet1!$A$2").Column Then
Dim IE As New InternetExplorer
IE.Visible = False
IE.navigate "http://finance.yahoo.com/q;_ylt=AiMiBWm16z_q5Ai0SlNb3jaiuYdG;_ylu=X3oDMTBxdGVyNzJxBHNlYwNVSCAzIERlc2t0b3AgU2VhcmNoIDEx;_ylg=X3oDMTBsdWsyY2FpBGxhbmcDZW4tVVMEcHQDMgR0ZXN0Aw--;_ylv=3?s=" & Range("Sheet1!$A$2").Value 'This types in the value from my input-cell into the website, so i get directed to the webpage for this particular company.
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
Dim Name_001 As String 'Info-cell with name of the company
Dim Ticker_001 As String 'Info-cell with current price of the company
Name_001 = Trim(Doc.getElementsByClassName("title")(0).innerText)
Ticker_001 = Trim(Doc.getElementsByClassName("yfi_rt_quote_summary_rt_top sigfig_promo_1")(0).innerText)
IE.Quit
Dim Nam_001 As Variant
Dim Tic_001 As Variant
Nam_001 = Split(Name_001, "(")
Tic_001 = Split(Ticker_001, " ")
Range("Sheet1!$B$2").Value = Nam_001(0)
Range("Sheet1!$C$2").Value = Tic_001
End If
End Sub
"Sheet1!$A$2": The input-cell
"Sheet1!$B$2": First output cell for the name of the stock ticker.
"Sheet1!$C$2": Second output cell for the current price of the stock ticker.
Right now my code only works for one row. I want it to work for all the rows beneath the current input cell.
I hope my question is clear and that you can help me with my problem (I am not allowed to post any pictures, so I can't show you my workbook; however, I have tickers in column "A", names in column "B" and prices in column "C").
Thanks in advance - Juhlers.
I modified your code: (1) changed the sheet name; (2) removed the anchored row ($2) and replaced it with Target.Row; (3) added an error trap, since the lookup sometimes raises an error; (4) changed the cursor to 'busy', since it takes a few seconds. Try the following:
Option Explicit
Private Sub Worksheet_Change(ByVal Target As Range)
Dim IE As New InternetExplorer
Dim Doc As HTMLDocument
Dim Name_001 As String 'Info-cell with name of the company
Dim Ticker_001 As String 'Info-cell with current price of the company
Dim Nam_001 As Variant
Dim Tic_001 As Variant
On Error GoTo Error_Trap
If Target.column <> 1 Then
Exit Sub
End If
Application.Cursor = xlWait
IE.Visible = False
IE.navigate "http://finance.yahoo.com/q;_ylt=AiMiBWm16z_q5Ai0SlNb3jaiuYdG;_ylu=X3oDMTBxdGVyNzJxBHNlYwNVSCAzIERlc2t0b3AgU2VhcmNoIDEx;_ylg=X3oDMTBsdWsyY2FpBGxhbmcDZW4tVVMEcHQDMgR0ZXN0Aw--;_ylv=3?s=" & Range("Stocks!$A" & Target.row).value 'This types in the value from my input-cell into the website, so i get directed to the webpage for this particular company.
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Set Doc = IE.document
Name_001 = Trim(Doc.getElementsByClassName("title")(0).innerText)
Ticker_001 = Trim(Doc.getElementsByClassName("yfi_rt_quote_summary_rt_top sigfig_promo_1")(0).innerText)
IE.Quit
Nam_001 = Split(Name_001, "(")
Tic_001 = Split(Ticker_001, " ")
Range("Stocks!$B" & Target.row).value = Nam_001(0)
Range("Stocks!$C" & Target.row).value = Tic_001
Application.Cursor = xlNormal
Exit Sub
Error_Trap:
Application.Cursor = xlNormal
MsgBox "Error: " & Err.Number & vbTab & Err.Description
Exit Sub
End Sub
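When 20-30 tickers are pasted in one go, Target spans several cells, so Target.Row only covers the first of them. A possible variation is to loop over every changed cell in column A, sketched below. FillTickerRow is a hypothetical helper standing in for the lookup code above, and it only writes placeholders here. This handler would replace the one above, since a sheet module can hold only one Worksheet_Change.

```vba
Option Explicit

Private Sub Worksheet_Change(ByVal Target As Range)
    Dim changed As Range, cell As Range

    ' Only react to edits in column A; Target may span several cells.
    Set changed = Intersect(Target, Me.Columns(1), Me.UsedRange)
    If changed Is Nothing Then Exit Sub

    On Error GoTo CleanUp
    Application.EnableEvents = False    ' do not re-enter while writing results
    For Each cell In changed.Cells
        If Len(Trim$(cell.Value)) > 0 Then
            FillTickerRow cell
        End If
    Next cell

CleanUp:
    Application.EnableEvents = True
End Sub

' Hypothetical helper: the lookup logic (navigate, read name and price)
' would go here and write to the two cells to the right of tickerCell.
Private Sub FillTickerRow(ByVal tickerCell As Range)
    tickerCell.Offset(0, 1).Value = "name for " & tickerCell.Value   ' placeholder
    tickerCell.Offset(0, 2).Value = "price for " & tickerCell.Value  ' placeholder
End Sub
```

Events are switched back on in the CleanUp block even if a lookup fails, so one bad ticker does not leave the sheet unresponsive to later edits.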
I am pulling out my hair trying to parse data into, or edit, an MSGraph series collection.
I get error 438 - object does not support this property or method.
I can manipulate other properties of the object, such as ChartTitle.Font.Size, but not the SeriesCollection.
IntelliSense is not working with this object, which leads me to suspect that I have not set a particular reference.
Sections of the code are below.
The main routine gets the object:
strReportName = "Security Selection"
strChartName = "MACD_Chart"
DoCmd.OpenReport strReportName, acViewDesign
Set rptMACD = Reports(strReportName)
Set chartMACD = rptMACD(strChartName)
A data recordset is built then all of it is passed into the subroutine:
Call UpdateChart(chartMACD, rstMACD)
Public Sub UpdateChart(chartPlot As Object, rstChart As ADODB.Recordset)
'FUNCTION:
' a chart object is passed into the routine,
' source data is update to the recordset being passed in.
Dim lngType As Long
Dim i, j, iFieldCount As Integer
Dim rst As Recordset
Dim arXValues() As Date
Dim arValues() As Double
Dim strChartName, strYAxis, strXAxis As String
Dim ChrtCollection As ChartObjects
Dim colmCount As Integer
chartPlot.RowSourceType = "Table/Query"
'get number of columns in chart table/Query
iFieldCount = rstChart.Fields.Count
With chartPlot
'change chart data to arrays of data from recordset
.Activate
j = 0
rstChart.MoveFirst
Do While Not rstChart.EOF
j = j + 1
ReDim Preserve arXValues(1 To j)
arXValues(j) = rstChart.Fields("Date").Value
rstChart.MoveNext
Loop
For i = 1 To iFieldCount - 1 'Date is first field
j = 0
rstChart.MoveFirst
Do While Not rstChart.EOF 'get next array of data
j = j + 1
ReDim Preserve arValues(1 To j)
arValues(j) = rstChart.Fields(i + 1).Value
rstChart.MoveNext
Loop
.SeriesCollection(i).Name = rstChart.Fields(i + 1).Name
.SeriesCollection(1).XValues = arXValues
.SeriesCollection(i).Values = arValues
Next i
End With
End Sub
I've tried many things and now I'm totally confused. I've also been trying to parse in recordsets (which is my preference) but i'll take anything at the moment.
Before continuing: I recommend setting the Chart's Rowsource property to a query that returns the data you want and then Requerying the Chart. This is WAY easier than the following.
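A rough sketch of that RowSource approach, assuming the chart control is passed in as an object the way chartPlot is in the question. SetChartSource is a made-up name, and the SQL is whatever query returns your series data.

```vba
Option Explicit

' chartCtl is the chart control on the form or report (e.g. chartMACD);
' sql is a saved query name or SQL string returning the data to plot.
Public Sub SetChartSource(ByVal chartCtl As Object, ByVal sql As String)
    chartCtl.RowSourceType = "Table/Query"
    chartCtl.RowSource = sql
    chartCtl.Requery        ' ask the chart to reload from the new source
End Sub
```

I have only used this pattern with the object open for viewing. With the report open in design view, as in the question, the Requery call may be unnecessary or may fail, so treat that line as something to test.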
You are getting the Error 438 because Name, XValues, Values are not properties of the Series Object. MSDN Info
That being said, here is a go at your method and some recommendations for doing it that way. The SeriesCollection doesn't contain the values associated with MSGraph points like it does in Excel. You need to edit the data in the DataSheet, which is VERY finicky. A reference to the Microsoft Graph Library must be included. This was tested to work with my database. Microsoft Graph MSDN info
DAO
Public Sub testing()
Dim rstChart As Recordset
Dim seri As Object, fld As Field
Dim app As Graph.Chart
chartPlot.SetFocus
Set app = chartPlot.Object
Set rstChart = CurrentDb.OpenRecordset("SELECT DateTime, ASIMeasured FROM Surv_ASI WHERE CycleID = 2 ORDER BY DateTime")
app.Application.DataSheet.Range("00:AA1000").Clear
With rstChart
For Each fld In .Fields
app.Application.DataSheet.Range("a1:AA1").Cells(0, fld.OrdinalPosition) = fld.Name
Next
Do While Not .EOF
For Each fld In .Fields
app.Application.DataSheet.Range("a2:AA1000").Cells(.AbsolutePosition, fld.OrdinalPosition).Value = fld
Next
.MoveNext
Loop
End With
app.Refresh
End Sub
ADO (Assuming rstChart is already a valid ADODB.Recordset)
Public Sub testing()
Dim app As Graph.Chart, i As Integer
chartPlot.SetFocus
Set app = chartPlot.Object
app.Application.DataSheet.Range("00:AA1000").Clear
With rstChart
.MoveFirst 'Since I don't know where it was left off before this procedure.
For i = 0 To .Fields.Count - 1
app.Application.DataSheet.Range("a1:AA1").Cells(0, i) = .Fields(i).Name
Next
Do While Not .EOF
For i = 0 To .Fields.Count - 1
app.Application.DataSheet.Range("a2:AA1000").Cells(.AbsolutePosition, i).Value = .Fields(i)
Next
.MoveNext
Loop
End With
app.Refresh
End Sub
Some notes about my changes:
1. I prefer having my With point to the Recordset being cycled, instead of the Object being operated on, especially since more calls are made to the Recordset's properties in your procedure.
2. You don't need to specify the variable to which a Next applies (Next i). Just put Next.
3. Please pick my answer if it helped :)
Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel
As an example the current Australian Big Bash 2011/12 scorecards are available from
Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE or using XMLHTTP and then using regular expressions) to scrape data from websites, ie
Extract values from HTML TD and Tr
In that same question a comment was posted suggesting html parsing - which I hadn't come across before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags
Query
While I could write a regex to parse the cricket data below I would like advice as to how I could efficiently retrieve these results with html parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
the date/name of the match
Team 1 name
the output should dump up to 11 records for Team 1 (blank records where players haven't batted, ie "Did Not Bat")
Team 2 name
the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
Nirvana for me would be a solution that I could deploy using VBA or VBscript so I could fully automate my analysis, but I presume I will have to use a separate tool for the html parse.
Sample Site links and Data to be Extracted
There are two techniques that I use with VBA. I will describe them one by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many, I will cover even the obvious. Please feel free to skip whatever part you already know.
1) Using FireFox / Firebug Addon / Fiddler
FireFox : http://en.wikipedia.org/wiki/Firefox
Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29
Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29
Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website that holds the data you need scraped.
I use Fiddler only when I am using XMLHTTP. It helps me see the exact information being passed when you click a button. Because of the increase in the number of bots that scrape sites, most sites now try to prevent automated scraping by capturing your mouse coordinates and passing that information along, and Fiddler helps in debugging what is being passed. I will not go into much detail here, as this information can be used maliciously.
Now let's take a simple example on how to scrape the URL posted in your question
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file, then we can work with the data to do our analysis. Here is sample code which will dump that table into Sheet1.
Before we proceed, I would recommend closing all Excel instances and starting a fresh one.
Launch the VBA editor and insert a Userform. Place a command button and a WebBrowser control on it. Your Userform might look like this
Paste this code in the Userform code area
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Sub CommandButton1_Click()
Dim URL As String
Dim oSheet As Worksheet
Set oSheet = Sheets("Sheet1")
URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
PopulateDataSheets oSheet, URL
MsgBox "Data Scrapped. Please check " & oSheet.Name
End Sub
Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim insertRow As Long, Row As Long, col As Long
On Error GoTo whoa
WebBrowser1.navigate URL
WaitForWBReady
Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
With wsk
.Cells.Clear
insertRow = 0
For Row = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(Row)
If Trim(tr.innerText) <> "" Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Total" Then
insertRow = insertRow + 1
For col = 0 To tr.Cells.Length - 1
.Cells(insertRow, col + 1) = tr.Cells(col).innerText
Next
End If
End If
End If
Next
End With
whoa:
Unload Me
End Sub
Private Sub Wait(ByVal nSec As Long)
nSec = nSec + Timer
While Timer < nSec
DoEvents
Sleep 100
Wend
End Sub
Private Sub WaitForWBReady()
Wait 1
While WebBrowser1.ReadyState <> 4
Wait 3
Wend
End Sub
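A note for 64-bit Office: the Declare line for Sleep above will not compile there without the PtrSafe keyword. A common way to keep one code base working on both older and newer versions is conditional compilation, as in this sketch, which would replace the single Declare line at the top of the Userform code.

```vba
' VBA7 is defined in Office 2010 and later, where PtrSafe is available
' (and required when running 64-bit Office).
#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If
```

The argument stays a Long in both branches, so the Wait procedure does not need to change.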
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the url in the "New Web Query Window" and click on "Go"
Once the page has loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
, Destination:=Range("$A$1"))
.Name = "524915"
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = """inningsBat1"""
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
End Sub
Hope this helps. Let me know if you still have some queries.
Sid
For anyone else interested in this, I ended up using the code below, based on Siddharth Rout's earlier answer
XMLHttp was significantly quicker than automating IE
the code generates a CSV file for each series to be downloaded (held in the X variable)
the code dumps each match to a regular 29 row range (regardless of how many players batted) to facilitate easier analysis later on
Public Sub PopulateDataSheets_XML()
Dim URL As String
Dim ws As Worksheet
Dim lngRow As Long
Dim lngRecords As Long
Dim lngWrite As Long
Dim lngSpare As Long
Dim lngInnings As Long
Dim lngRow1 As Long
Dim X(1 To 15, 1 To 4) As String
Dim objFSO As Object
Dim objTF As Object
Dim xmlHttp As Object
Dim htmldoc As HTMLDocument
Dim htmlbody As htmlbody
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim strInnings As String
s = Timer()
Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
Set objFSO = CreateObject("scripting.filesystemobject")
X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
X(1, 2) = 501198
X(1, 3) = 501271
X(1, 4) = "indian-premier-league-2011"
X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
X(2, 2) = 524915
X(2, 3) = 524945
X(2, 4) = "big-bash-league-2011"
X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
X(3, 2) = 461028
X(3, 3) = 461047
X(3, 4) = "big-bash-league-2010"
Set htmldoc = New HTMLDocument
Set htmlbody = htmldoc.body
For lngRow = 1 To UBound(X, 1)
If Len(X(lngRow, 1)) = 0 Then Exit For
Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
For lngRecords = X(lngRow, 2) To X(lngRow, 3)
URL = X(lngRow, 1) & lngRecords & ".html"
xmlHttp.Open "GET", URL
xmlHttp.send
Do While xmlHttp.Status <> 200
DoEvents
Loop
htmlbody.innerHTML = xmlHttp.responseText
objTF.writeline X(lngRow, 1) & lngRecords & ".html"
For lngInnings = 1 To 2
strInnings = "Innings " & lngInnings
objTF.writeline strInnings
Set tbl = Nothing
On Error Resume Next
Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
On Error GoTo 0
If Not tbl Is Nothing Then
lngWrite = 0
For lngRow1 = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(lngRow1)
If Trim(tr.innerText) <> vbNewLine Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Extras" Then
If Len(tr.Cells(1).innerText) > 0 Then
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
End If
Else
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
Exit For
End If
End If
End If
Next
For lngSpare = 12 To lngWrite Step -1
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
Else
For lngSpare = 1 To 13
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
End If
Next
Next
Next
'Call ConsolidateSheets
End Sub
RegEx is not a complete solution for parsing HTML because HTML is not guaranteed to be regular.
You should use the HtmlAgilityPack to query the HTML. It is queried with XPath out of the box, and add-on packages provide CSS selectors, so you can query the HTML similar to how you do it with jQuery.
As quite a few people may see this I thought I would use it as a chance to demonstrate a few features I rarely see people using in VBA web-scraping: deleteRow, querySelector and use of clipboard to write out a table (complete with formatting and hyperlinks) to a sheet based on the table.outerHTML.
deleteRow is used to remove the unwanted rows. querySelector is used to apply faster css selectors to match on nodes. Modern browsers/html parsers are optimized for css and class selectors (which I use) are the second fastest selector type (after id).
Use of css selectors and understanding htmlTable methods/properties will allow for much greater flexibility in your web-scraping endeavours. Understanding the use of the clipboard means a simple copy paste method for transferring a table to Excel.
Execution could easily be tied to a button push and the url read in from a cell.
VBA:
Option Explicit
Public Sub test()
WriteOutTable "https://www.espncricinfo.com/series/8044/scorecard/524935/hobart-hurricanes-vs-melbourne-stars-big-bash-league-2011-12"
End Sub
Public Sub WriteOutTable(ByVal url As String)
'required VBE (Alt+F11) > Tools > References > Microsoft HTML Object Library ; Microsoft XML, v6 (your version may vary)
Dim hTable As MSHTML.HTMLTable, clipboard As Object
Dim xhr As MSXML2.xmlhttp60, html As MSHTML.htmlDocument
Dim rowCount As Long, i As Long 'needed because Option Explicit is on
Set xhr = New MSXML2.xmlhttp60
Set html = New MSHTML.htmlDocument
With xhr
.Open "GET", url, False
.Send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".batsman")
rowCount = hTable.Rows.Length - 1
For i = rowCount To 0 Step -1
Select Case True
Case i = rowCount Or i = rowCount - 1 Or InStr(hTable.Rows(i).outerHTML, "wicket-details") > 0
hTable.deleteRow i
End Select
Next
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ActiveSheet.Cells(1, 1).PasteSpecial
End Sub
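As mentioned, execution could be tied to a button with the url read from a cell. A minimal sketch follows; the sheet name "Config" and the cell A1 are placeholders I picked, and WriteOutTable is assumed to be in the same project.

```vba
Option Explicit

' Assign this macro to a button. It assumes the scorecard URL sits in
' cell A1 of a sheet named "Config" (both are placeholder choices).
Public Sub RunFromCell()
    Dim url As String

    url = Trim$(CStr(ThisWorkbook.Worksheets("Config").Range("A1").Value))
    If Len(url) = 0 Then
        MsgBox "No URL found in Config!A1."
        Exit Sub
    End If

    WriteOutTable url
End Sub
```

WriteOutTable pastes at cell A1 of the active sheet, so run this with an output sheet active rather than the sheet holding the URL.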