VBScript to recursively scrape a local intranet page for links - csv

I've been tasked with identifying all of the many links we have on our team's intranet. The goal is to declutter (find duplicate links or dead links).
I wrote this script that goes to our page and scrapes every link while identifying the file extension. What I'm not sure how to do is make it recursive. Once it goes to our site and scrapes those links, if it finds another URL (such as .htm or .html) I want it to follow THAT link, scrape the same information from there, and continue on until every link associated with the initial URL is exhausted. I'd like it to create a kind of hierarchy in the csv, with example headers such as:
lvl0_Link_Title,lvl0_File_Type,lvl0_URL,lvl1_Link_Title,lvl1_File_Type,lvl1_URL,lvl2_Link_Title,lvl2_File_Type,lvl2_URL,lvl3_Link...etc.
Obviously, this would end up with a pretty massive csv. If there is a better/cleaner method to achieve the same, I'm open to it.
Set objWshShell = WScript.CreateObject("WScript.Shell")
Set fso = CreateObject("Scripting.FileSystemObject")
Set IE = CreateObject("InternetExplorer.Application")
On Error Resume Next
filename = fso.GetParentFolderName(WScript.ScriptFullName) & "\URL_Dump_Oldsite.csv"
'==============================================
'Create headers for CSV
Set output = fso.OpenTextFile(filename, 2, True)
output.WriteLine "Link Title,File Type,URL"
output.Close
'==============================================
IE.Visible = False
IE.Navigate "URL OF OUR INTRANET"
Do While IE.Busy Or IE.ReadyState <> 4: WScript.Sleep 100: Loop
Do Until IE.Document.ReadyState = "complete": WScript.Sleep 100: Loop
For Each url In IE.Document.getElementsByTagName("a")
    If url.href <> "" Then    'href is a string, so compare against "" rather than Nothing
        ext = Mid(url.href, InStrRev(url.href, "."))
        Set output = fso.OpenTextFile(filename, 8, True)
        output.WriteLine Replace(url.innerText, ",", " / ") & "," & ext & ",=HYPERLINK(" & Chr(34) & url.href & Chr(34) & ")"
        output.Close
    End If
Next
'===========================================
'Keyword filter for removal
Dim arrFilter
arrFilter = Array("bakpcweb", _
                  "aims", _
                  "element", _
                  "objid", _
                  "nodeid", _
                  "objaction", _
                  "javascript", _
                  "itemtype")
'===========================================
'Delete lines from csv file containing keywords
strFile1 = fso.GetParentFolderName(WScript.ScriptFullName) & "\URL_Dump_Oldsite.csv"
Set objFile1 = fso.OpenTextFile(strFile1)
Do Until objFile1.AtEndOfStream
    i = 0
    strLine1 = Trim(LCase(objFile1.ReadLine))
    For a = LBound(arrFilter) To UBound(arrFilter)
        If InStr(strLine1, arrFilter(a)) <> 0 Then
            i = i + 1
        End If
    Next
    If i = 0 Then
        strNewContents1 = strNewContents1 & strLine1 & vbCrLf
    End If
Loop
objFile1.Close
Set objFile1 = fso.OpenTextFile(strFile1, 2, True)
objFile1.Write strNewContents1
objFile1.Close
'===========================================
'Delete blank lines from csv file
strFile = fso.GetParentFolderName(WScript.ScriptFullName) & "\URL_Dump_Oldsite.csv"
Set objFile = fso.OpenTextFile(strFile)
Do Until objFile.AtEndOfStream
    strLine = objFile.ReadLine
    strLine = Trim(strLine)
    If Len(strLine) > 0 Then
        strNewContents = strNewContents & strLine & vbCrLf
    End If
Loop
objFile.Close
Set objFile = fso.OpenTextFile(strFile, 2, True)
objFile.Write strNewContents
objFile.Close
'===========================================
'Remove duplicate lines from csv file
Set objDictionary = CreateObject("Scripting.Dictionary")
strFile = fso.GetParentFolderName(WScript.ScriptFullName) & "\URL_Dump_Oldsite.csv"
Set objFile = fso.OpenTextFile(strFile)
Do Until objFile.AtEndOfStream
    strLine = objFile.ReadLine
    strLine = Trim(strLine)
    If Not objDictionary.Exists(strLine) Then
        objDictionary.Add strLine, strLine
    End If
Loop
objFile.Close
Set objFile = fso.OpenTextFile(strFile, 2, True)
For Each strKey In objDictionary.Keys
    objFile.WriteLine strKey
Next
objFile.Close
objDictionary.RemoveAll    'Dictionary exposes RemoveAll, not ClearAll
'===========================================
WScript.Echo "Done!"
IE.Quit
WScript.Quit
Thank you!

This might not be the answer you were expecting, but it sounds like you're reinventing the wheel here, and with a sub-standard tool. In my experience, I also wouldn't find the lvl0, lvl1 etc. format particularly useful when reporting later.
I would strongly recommend you instead use an existing program to scan your intranet, such as Xenu, or, for a more in-depth analysis, Screaming Frog SEO Spider (the free version is limited to about 500 pages, as I recall, but you can give it a try). These tools have features to save reports, which should suit your needs.
If that doesn't work for you, please comment or edit your question to explain why you must do this yourself and report in the specified format.
Edit: Here's an example screenshot from the free Xenu program, which lists every resource it tried, its status, links in/out, and type, which can assist with your requirement to report on filetypes. It will also generate full HTML reports if you want stats.
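If you do decide to roll your own in VBScript anyway, the usual shape is a recursive Sub plus a Dictionary of already-visited URLs so pages that link back to each other don't loop forever, writing one row per link with a depth column rather than the lvl0/lvl1 layout. Below is a minimal sketch under those assumptions: START_URL, MAX_DEPTH and the "follow only .htm/.html" rule are placeholders, and the keyword filtering and dedup steps from the original script are omitted.
Option Explicit

Const START_URL = "http://intranet.example.local/index.htm"   'placeholder
Const MAX_DEPTH = 3                                            'placeholder

Dim fso, ie, visited, output
Set fso = CreateObject("Scripting.FileSystemObject")
Set ie = CreateObject("InternetExplorer.Application")
Set visited = CreateObject("Scripting.Dictionary")
Set output = fso.OpenTextFile(fso.GetParentFolderName(WScript.ScriptFullName) & "\URL_Dump_Oldsite.csv", 2, True)

output.WriteLine "Depth,Link Title,File Type,URL"
ie.Visible = False

Crawl START_URL, 0

output.Close
ie.Quit
WScript.Echo "Done!"

Sub Crawl(url, depth)
    Dim a, href, ext, links, key
    If depth > MAX_DEPTH Then Exit Sub
    If visited.Exists(LCase(url)) Then Exit Sub
    visited.Add LCase(url), True

    ie.Navigate url
    Do While ie.Busy Or ie.ReadyState <> 4: WScript.Sleep 100: Loop

    'Collect the links first; navigating away invalidates ie.Document,
    'so recursion only happens after this loop has finished.
    Set links = CreateObject("Scripting.Dictionary")
    For Each a In ie.Document.getElementsByTagName("a")
        href = a.href
        If href <> "" Then
            If InStrRev(href, ".") > 0 Then
                ext = LCase(Mid(href, InStrRev(href, ".") + 1))
            Else
                ext = ""
            End If
            output.WriteLine depth & "," & Replace(a.innerText, ",", " / ") & "," & ext & "," & href
            If (ext = "htm" Or ext = "html") And Not links.Exists(LCase(href)) Then
                links.Add LCase(href), href
            End If
        End If
    Next

    For Each key In links.Keys
        Crawl links(key), depth + 1
    Next
End Sub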


Display Pdf preview in Ms Access Report using pdf file path

I am new to MS Access. I have a PDF file location in a textbox. When the Access report loads, I want it to preview that specific PDF file (read from the file location) in the report. How can I achieve this? Please help.
You can display a PDF in a report by converting its pages to images and displaying them. With wsh.Run you can extract the pages during the Report_Load event, then store their paths in a temporary table.
Have IrfanView with the PDF plugin installed.
In the front-end, create a table named TmpExtractedPages with one Short Text field named Path to store the paths of the extracted pages.
Create a report with this Record Source:
SELECT TmpExtractedPages.Path FROM TmpExtractedPages;
Add a picture control in the Detail section (no Header/Footer sections) that fits the page, and bind it to Path.
Put the following code in the Report_Load event:
Private Sub Report_Load()
    Dim TempPath As String
    TempPath = CurrentProject.Path & "\TempPdf"
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    If fso.FolderExists(TempPath) Then
        fso.DeleteFolder TempPath
    End If
    fso.CreateFolder TempPath
    Dim PdfFile As String
    PdfFile = Me.OpenArgs
    Const PathToIrfanView As String = "C:\Program Files (x86)\IrfanView\i_view32.exe"
    Dim CmdArgs As String
    CmdArgs = Chr(34) & PdfFile & Chr(34) & " /extract=(" & Chr(34) & TempPath & Chr(34) & ",jpg) /cmdexit" 'see i_options.txt in IrfanView folder for command line options
    Dim ShellCmd As String
    ShellCmd = Chr(34) & PathToIrfanView & Chr(34) & " " & CmdArgs
    Debug.Print ShellCmd
    Dim wsh As Object
    Set wsh = CreateObject("WScript.Shell")
    Const WaitOnReturn As Boolean = True
    Const WindowStyle As Long = 0
    wsh.Run ShellCmd, WindowStyle, WaitOnReturn
    With CurrentDb
        .Execute "Delete * From TmpExtractedPages", dbFailOnError
        Dim f As Object
        For Each f In fso.GetFolder(TempPath).Files
            .Execute "Insert Into TmpExtractedPages (Path) Values ('" & Replace(f.Path, "'", "''") & "');", dbFailOnError
        Next f
    End With
    Set fso = Nothing
    Set wsh = Nothing
End Sub
You provide the path of the PDF to display as the OpenArgs argument when opening the report:
DoCmd.OpenReport "rpt_pdf", acViewPreview, , , , "path\to\pdf"
Keep in mind that adding and then deleting records in the temp table will bloat your database if you don't compact it later (or just deploy a fresh front-end copy on start, as I do).
If you just need to display the PDF file, you could create a button next to the textbox and put this in its On Click event:
Private Sub cmdView_Click()
    If Nz(Me.txtPdfLocation) <> "" Then
        Application.FollowHyperlink Me.txtPdfLocation
    End If
End Sub

Write lines to a file and execute a statement when there are 5 lines (VBScript)

Here is some code which I want to run in the background, so no window messages. What it does is check a connection. If there isn't a connection, it writes an error to a file. A function reads that file, and if there are 5 lines it should create an event-log error. The problem is that the last part doesn't work correctly.
My question is: can somebody fix it, or help me fix it? Here is the code:
strDirectory = "Z:\text2"
strFile = "\foutmelding.txt"
strText = "De connectie is verbroken"
strWebsite = "www.helmichbeens.com"

If PingSite(strWebsite) Then WScript.Quit 'Website is pingable - no further action required

Set objFSO = CreateObject("Scripting.FileSystemObject")
RecordSingleEvent

If EventCount >= 5 Then
    objFSO.DeleteFile strDirectory & strFile
    Set WshShell = WScript.CreateObject("WScript.Shell")
    strCommand = "eventcreate /T Error /ID 100 /L Scripts /D " & _
                 Chr(34) & "Test event." & Chr(34)
    WshShell.Run strCommand
End If

'------------------------------------
'Record a single event in a text file
'------------------------------------
Sub RecordSingleEvent
    If Not objFSO.FolderExists(strDirectory) Then objFSO.CreateFolder(strDirectory)
    Set objTextFile = objFSO.OpenTextFile(strDirectory & strFile, 8, True)
    objTextFile.WriteLine(Now & strText)
    objTextFile.Close
End Sub

'----------------
'Ping my web site
'----------------
Function PingSite( myWebsite )
    Set objHTTP = CreateObject( "WinHttp.WinHttpRequest.5.1" )
    objHTTP.Open "GET", "http://" & myWebsite & "/", False
    objHTTP.SetRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MyApp 1.0; Windows NT 5.1)"
    On Error Resume Next
    objHTTP.Send
    PingSite = (objHTTP.Status = 200)
    On Error Goto 0
End Function

'-----------------------------------------------
'Counts the number of lines inside the text file
'-----------------------------------------------
Function EventCount()
    strData = objFSO.OpenTextFile(strDirectory & strFile, ForReading).ReadAll
    arrLines = Split(strData, vbCrLf)
    EventCount = UBound(arrLines)
    Set objFSO = Nothing
End Function
That's the code; you can copy it to try it yourself. Thank you for your time and interest.
Greets, helmich
It doesn't work because the EventCount function sets objFSO = Nothing, so
If EventCount >= 5 Then
    objFSO.DeleteFile strDirectory & strFile
fails.
Use the LogEvent method of the Shell object instead:
If EventCount >= 5 Then
    objFSO.DeleteFile strDirectory & strFile
    Set WshShell = WScript.CreateObject("WScript.Shell")
    Call WshShell.LogEvent(1, "Test Event")
End If
You don't need to run a separate command.
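To make that concrete, here is a sketch of an EventCount that leaves objFSO alone (and copes with the log file not existing yet), reusing the question's globals; it is only illustrative, not a tested drop-in:
'Sketch of a fixed EventCount: it no longer destroys objFSO,
'and it returns 0 if the log file has not been created yet.
Function EventCount()
    Dim strData, arrLines
    Const ForReading = 1
    If Not objFSO.FileExists(strDirectory & strFile) Then
        EventCount = 0
        Exit Function
    End If
    strData = objFSO.OpenTextFile(strDirectory & strFile, ForReading).ReadAll
    arrLines = Split(strData, vbCrLf)
    EventCount = UBound(arrLines)    'each WriteLine ends with vbCrLf, so UBound equals the line count
End Function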
That's not the problem. The problem is this:
Windows Script Host gives an error:
Line: 41
Char: 2
Error (translated): the data required for this operation is not yet available
Code: 80070057
Source: WinHttp.WinHttpRequest
That's the problem, and I do not know how to fix it. It has something to do with the script not being able to read the lines in the text file, and then not executing the create-event command.
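For what it's worth, that WinHttp error usually means objHTTP.Send itself failed (no connection, which is exactly the case being tested for), the failure was swallowed by On Error Resume Next, and reading objHTTP.Status then fails because the request never completed. A sketch of a more defensive PingSite, under that assumption:
'Sketch: treat a failed Send as "site not reachable" instead of
'letting the later .Status call raise "data not yet available".
Function PingSite(myWebsite)
    Dim objHTTP
    Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
    objHTTP.Open "GET", "http://" & myWebsite & "/", False
    objHTTP.SetRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MyApp 1.0; Windows NT 5.1)"
    On Error Resume Next
    objHTTP.Send
    If Err.Number <> 0 Then
        PingSite = False              'Send itself failed: no connection
    Else
        PingSite = (objHTTP.Status = 200)
    End If
    On Error Goto 0
End Function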

Parse and import unstructured text files into Microsoft Access (file has potential delimiters)

I have a bunch of text files that I need to import into MS Access (thousands) - can use 2007 or 2010. The text files have categories that are identified in square brackets and have relevant data between the categories - for example:
[Location]Tenessee[Location][Model]042200[Model][PartNo]113342A69447B6[PartNo].
I need to capture both the categories and the data between them and import them into Access - the categories to one table, the data to another. There are hundreds of these categories in a single file and the text file has no structure - they are all run together as in the example above. The categories in the brackets are the only clear delimiters.
Through research on the web I have come up with a script for VBS (I am not locked into VBS, willing to use VBA or another method), but when I run it, I am getting a VBS info window with nothing displaying in it. Any advice or guidance would be most gratefully appreciated (I do not tend to use VBS and VBA) and I thank you.
The Script:
Const ForReading = 1
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Users\testGuy\Documents\dmc_db_test\DMC-TEST-A-00-00-00-00A-022A-D_000 - Copy01.txt", ForReading)
strContents = objFile.ReadAll
objFile.Close
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "\[.{0,}\]"
Set colMatches = objRegEx.Execute(strContents)
If colMatches.Count > 0 Then
    For Each strMatch In colMatches
        strMatches = strMatches & strMatch.Value
    Next
End If
strMatches = Replace(strMatches, "]", vbCrlf)
strMatches = Replace(strMatches, "[", "")
Wscript.Echo strMatches
Regular expressions are wonderful things, but in your case it looks like they might be overkill. The following code uses plain old InStr() to find the [Tags] and parses the file(s) out to a single CSV file. That is, for input files
testfile1.txt:
[Location]Tennessee[Location][Model]042200[Model][PartNo]113342A69447B6[PartNo]
[Location]Mississippi[Location][Model]042200[Model][SerialNo]3212333222355[SerialNo]
and testfile2.txt:
[Location]Missouri[Location][Model]042200[Model][PartNo]AAABBBCCC111222333[PartNo]
...the code will write the following output file...
"FileName","LineNumber","ItemNumber","FieldName","FieldValue"
"testfile1.txt",1,1,"Location","Tennessee"
"testfile1.txt",1,2,"Model","042200"
"testfile1.txt",1,3,"PartNo","113342A69447B6"
"testfile1.txt",2,1,"Location","Mississippi"
"testfile1.txt",2,2,"Model","042200"
"testfile1.txt",2,3,"SerialNo","3212333222355"
"testfile2.txt",1,1,"Location","Missouri"
"testfile2.txt",1,2,"Model","042200"
"testfile2.txt",1,3,"PartNo","AAABBBCCC111222333"
...which you can then import into Access (or whatever) and proceed from there. This is VBA code, but it could easily be tweaked to run as a VBScript.
Sub ParseSomeFiles()
    'Requires a reference to Microsoft Scripting Runtime (FileSystemObject, TextStream)
    Const InFolder = "C:\__tmp\parse\in\"
    Const OutFile = "C:\__tmp\parse\out.csv"
    Dim fso As FileSystemObject, f As File, tsIn As TextStream, tsOut As TextStream
    Dim s As String, Lines As Long, Items As Long, i As Long
    Set fso = New FileSystemObject
    Set tsOut = fso.CreateTextFile(OutFile, True)
    tsOut.WriteLine """FileName"",""LineNumber"",""ItemNumber"",""FieldName"",""FieldValue"""
    For Each f In fso.GetFolder(InFolder).Files
        Debug.Print "Parsing """ & f.Name & """..."
        Set tsIn = f.OpenAsTextStream(ForReading)
        Lines = 0
        Do While Not tsIn.AtEndOfStream
            s = Trim(tsIn.ReadLine)
            Lines = Lines + 1
            Items = 0
            Do While Len(s) > 0
                Items = Items + 1
                tsOut.Write """" & f.Name & """," & Lines & "," & Items
                i = InStr(1, s, "]", vbBinaryCompare)
                ' write out FieldName
                tsOut.Write ",""" & Replace(Mid(s, 2, i - 2), """", """""", 1, -1, vbBinaryCompare) & """"
                s = Mid(s, i + 1)
                i = InStr(1, s, "[", vbBinaryCompare)
                ' write out FieldValue
                tsOut.Write ",""" & Replace(Mid(s, 1, i - 1), """", """""", 1, -1, vbBinaryCompare) & """"
                s = Mid(s, i)
                i = InStr(1, s, "]", vbBinaryCompare)
                ' (no need to write out ending FieldName tag)
                s = Mid(s, i + 1)
                tsOut.WriteLine
            Loop
        Loop
        tsIn.Close
        Set tsIn = Nothing
    Next
    Set f = Nothing
    tsOut.Close
    Set tsOut = Nothing
    Set fso = Nothing
    Debug.Print "Done."
End Sub
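Once the CSV exists, pulling it into Access can be a one-liner. A sketch, assuming a target table name of tblParsedFields (that name, and keeping the default import behaviour rather than a saved import specification, are my assumptions):
'Sketch: import the generated CSV into an Access table; the CSV has a header row.
Sub ImportParsedCsv()
    DoCmd.TransferText acImportDelim, , "tblParsedFields", "C:\__tmp\parse\out.csv", True
End Sub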

html parsing of cricinfo scorecards

Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel
As an example the current Australian Big Bash 2011/12 scorecards are available from
Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE or using XMLHTTP and then regular expressions) to scrape data from websites, e.g.
Extract values from HTML TD and Tr
In that same question a comment was posted suggesting HTML parsing - which I hadn't come across before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags
Query
While I could write a regex to parse the cricket data below, I would like advice on how I could efficiently retrieve these results with HTML parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
the date/name of the match
Team 1 name
the output should dump up to 11 records for Team 1 (blank records where players haven't batted, i.e. "Did Not Bat")
Team 2 name
the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
Nirvana for me would be a solution that I could deploy using VBA or VBScript so I could fully automate my analysis, but I presume I will have to use a separate tool for the HTML parse.
Sample Site links and Data to be Extracted
There are 2 techniques that I use for "VBA". I will describe them one by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many, I will even cover the obvious. Please feel free to skip whatever parts you already know.
1) Using FireFox / Firebug Addon / Fiddler
FireFox : http://en.wikipedia.org/wiki/Firefox
Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29
Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29
Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scraped.
I use Fiddler only when I am using XMLHTTP. It helps me see the exact info being passed when you click on a button. Because of the increase in the number of bots which scrape sites, most sites now, to prevent automatic scraping, capture your mouse coordinates and pass that information, and Fiddler actually helps you debug the info that is being passed. I will not get into much detail here as this info can be used maliciously.
Now let's take a simple example on how to scrape the URL posted in your question
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file then we can definitely work with the data to do our analysis. Here is sample code which will dump that table in Sheet1.
Before we proceed, I would recommend closing all Excel instances and starting a fresh one.
Launch the VBA editor and insert a Userform. Place a command button and a WebBrowser control. Your Userform might look like this:
Paste this code in the Userform code area
Option Explicit

'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub CommandButton1_Click()
    Dim URL As String
    Dim oSheet As Worksheet
    Set oSheet = Sheets("Sheet1")
    URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
    PopulateDataSheets oSheet, URL
    MsgBox "Data Scrapped. Please check " & oSheet.Name
End Sub

Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim insertRow As Long, Row As Long, col As Long

    On Error GoTo whoa
    WebBrowser1.navigate URL
    WaitForWBReady
    Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
    With wsk
        .Cells.Clear
        insertRow = 0
        For Row = 0 To tbl.Rows.Length - 1
            Set tr = tbl.Rows(Row)
            If Trim(tr.innerText) <> "" Then
                If tr.Cells.Length > 2 Then
                    If tr.Cells(1).innerText <> "Total" Then
                        insertRow = insertRow + 1
                        For col = 0 To tr.Cells.Length - 1
                            .Cells(insertRow, col + 1) = tr.Cells(col).innerText
                        Next
                    End If
                End If
            End If
        Next
    End With
whoa:
    Unload Me
End Sub

Private Sub Wait(ByVal nSec As Long)
    nSec = nSec + Timer
    While Timer < nSec
        DoEvents
        Sleep 100
    Wend
End Sub

Private Sub WaitForWBReady()
    Wait 1
    While WebBrowser1.ReadyState <> 4
        Wait 3
    Wend
End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007, so I will take that as an example to scrape the above-mentioned link.
Navigate to Sheet2. Now go to the Data tab and click on the "From Web" button on the extreme right. See snapshot.
Enter the URL in the "New Web Query" window and click on "Go".
Once the page has loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import".
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
    With ActiveSheet.QueryTables.Add(Connection:= _
        "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
        , Destination:=Range("$A$1"))
        .Name = "524915"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlSpecifiedTables
        .WebFormatting = xlWebFormattingNone
        .WebTables = """inningsBat1"""
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub
Hope this helps. Let me know if you still have some queries.
Sid
For anyone else interested in this, I ended up using the code below, based on Siddhart Rout's earlier answer:
XMLHttp was significantly quicker than automating IE
the code generates a CSV file for each series to be downloaded (held in the X variable)
the code dumps each match to a regular 29-row range (regardless of how many players batted) to facilitate easier analysis later on
Public Sub PopulateDataSheets_XML()
    '~~> Requires a reference to Microsoft HTML Object Library
    Dim URL As String
    Dim ws As Worksheet
    Dim lngRow As Long
    Dim lngRecords As Long
    Dim lngWrite As Long
    Dim lngSpare As Long
    Dim lngInnings As Long
    Dim lngRow1 As Long
    Dim X(1 To 15, 1 To 4) As String
    Dim objFSO As Object
    Dim objTF As Object
    Dim xmlHttp As Object
    Dim htmldoc As HTMLDocument
    Dim htmlbody As htmlbody
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim strInnings As String

    s = Timer()
    Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
    Set objFSO = CreateObject("scripting.filesystemobject")

    X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
    X(1, 2) = 501198
    X(1, 3) = 501271
    X(1, 4) = "indian-premier-league-2011"
    X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
    X(2, 2) = 524915
    X(2, 3) = 524945
    X(2, 4) = "big-bash-league-2011"
    X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
    X(3, 2) = 461028
    X(3, 3) = 461047
    X(3, 4) = "big-bash-league-2010"

    Set htmldoc = New HTMLDocument
    Set htmlbody = htmldoc.body

    For lngRow = 1 To UBound(X, 1)
        If Len(X(lngRow, 1)) = 0 Then Exit For
        Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
        For lngRecords = X(lngRow, 2) To X(lngRow, 3)
            URL = X(lngRow, 1) & lngRecords & ".html"
            xmlHttp.Open "GET", URL
            xmlHttp.send
            Do While xmlHttp.Status <> 200
                DoEvents
            Loop
            htmlbody.innerHTML = xmlHttp.responseText
            objTF.writeline X(lngRow, 1) & lngRecords & ".html"
            For lngInnings = 1 To 2
                strInnings = "Innings " & lngInnings
                objTF.writeline strInnings
                Set tbl = Nothing
                On Error Resume Next
                Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
                On Error GoTo 0
                If Not tbl Is Nothing Then
                    lngWrite = 0
                    For lngRow1 = 0 To tbl.Rows.Length - 1
                        Set tr = tbl.Rows(lngRow1)
                        If Trim(tr.innerText) <> vbNewLine Then
                            If tr.Cells.Length > 2 Then
                                If tr.Cells(1).innerText <> "Extras" Then
                                    If Len(tr.Cells(1).innerText) > 0 Then
                                        objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
                                        lngWrite = lngWrite + 1
                                    End If
                                Else
                                    objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
                                    lngWrite = lngWrite + 1
                                    Exit For
                                End If
                            End If
                        End If
                    Next
                    For lngSpare = 12 To lngWrite Step -1
                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
                    Next
                Else
                    For lngSpare = 1 To 13
                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
                    Next
                End If
            Next
        Next
    Next
    'Call ConsolidateSheets
End Sub
RegEx is not a complete solution for parsing HTML because HTML is not guaranteed to be regular.
You should use the HtmlAgilityPack to query the HTML. This will allow you to use CSS selectors to query the HTML, similar to how you do it with jQuery.
As quite a few people may see this, I thought I would use it as a chance to demonstrate a few features I rarely see used in VBA web-scraping: deleteRow, querySelector and use of the clipboard to write out a table (complete with formatting and hyperlinks) to a sheet, based on the table.outerHTML.
deleteRow is used to remove the unwanted rows. querySelector is used to apply faster CSS selectors to match on nodes. Modern browsers/HTML parsers are optimized for CSS, and class selectors (which I use) are the second fastest selector type (after id).
Use of CSS selectors and understanding HTMLTable methods/properties will allow for much greater flexibility in your web-scraping endeavours. Understanding the use of the clipboard means a simple copy-paste method for transferring a table to Excel.
Execution could easily be tied to a button push and the URL read in from a cell, as shown in the sketch after the code below.
VBA:
Option Explicit

Public Sub test()
    WriteOutTable "https://www.espncricinfo.com/series/8044/scorecard/524935/hobart-hurricanes-vs-melbourne-stars-big-bash-league-2011-12"
End Sub

Public Sub WriteOutTable(ByVal url As String)
    'required VBE (Alt+F11) > Tools > References > Microsoft HTML Object Library ; Microsoft XML, v6 (your version may vary)
    Dim hTable As MSHTML.HTMLTable, clipboard As Object
    Dim xhr As MSXML2.xmlhttp60, html As MSHTML.htmlDocument
    Dim rowCount As Long, i As Long    'needed under Option Explicit

    Set xhr = New MSXML2.xmlhttp60
    Set html = New MSHTML.htmlDocument

    With xhr
        .Open "GET", url, False
        .Send
        html.body.innerHTML = .responseText
    End With

    Set hTable = html.querySelector(".batsman")
    rowCount = hTable.Rows.Length - 1

    For i = rowCount To 0 Step -1
        Select Case True
            Case i = rowCount Or i = rowCount - 1 Or InStr(hTable.Rows(i).outerHTML, "wicket-details") > 0
                hTable.deleteRow i
        End Select
    Next

    Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
    clipboard.SetText hTable.outerHTML
    clipboard.PutInClipboard
    ActiveSheet.Cells(1, 1).PasteSpecial
End Sub

Escaping non-ASCII characters (or how to remove the BOM?)

I need to create an ANSI text file from an Access recordset that outputs to JSON and YAML. I can write the file, but the output is coming out with the original characters, and I need to escape them. For example, an umlaut-O (ö) should be "\u00f6".
I thought encoding the file as UTF-8 would work, but it doesn't. However, having looked at the file encoding again, if the file is written as "UTF-8 without BOM" then everything works.
Does anyone know how to either
a) Write text out as UTF-8 without BOM, or
b) Write in ANSI but escaping the non-ASCII characters?
Public Sub testoutput()
    Set db = CurrentDb()
    str_filename = "anothertest.json"
    MyFile = CurrentProject.Path & "\" & str_filename
    str_temp = "Hello world here is an ö"
    fnum = FreeFile
    Open MyFile For Output As fnum
    Print #fnum, str_temp
    Close #fnum
End Sub
OK, I found some example code on how to remove the BOM. I would have thought it would be possible to do this more elegantly when actually writing the text in the first place. Never mind. The following code removes the BOM.
(This was originally posted by Simon Pedersen at http://www.imagemagick.org/discourse-server/viewtopic.php?f=8&t=12705)
' Removes the Byte Order Mark - BOM from a text file with UTF-8 encoding
' The BOM defines that the file was stored with an UTF-8 encoding.
Public Function RemoveBOM(filePath)
    ' Create a reader and a writer
    Dim writer, reader, fileSize
    Set writer = CreateObject("Adodb.Stream")
    Set reader = CreateObject("Adodb.Stream")
    ' Load from the text file we just wrote
    reader.Open
    reader.LoadFromFile filePath
    ' Copy all data from reader to writer, except the BOM
    writer.Mode = 3
    writer.Type = 1
    writer.Open
    reader.Position = 5
    reader.CopyTo writer, -1
    ' Overwrite file
    writer.SaveToFile filePath, 2
    ' Return file name
    RemoveBOM = filePath
    ' Kill objects
    Set writer = Nothing
    Set reader = Nothing
End Function
It might be useful for someone else.
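For option (b) in the original question, escaping the non-ASCII characters instead of fighting the BOM, a small helper along these lines may also do the job. This is a sketch only (the function name is mine); it escapes everything above &H7F as \uXXXX:
'Sketch: escape any character above the ASCII range as \uXXXX,
'so the file can safely be written as plain ANSI.
Public Function EscapeNonAscii(ByVal s As String) As String
    Dim i As Long, code As Long, out As String
    For i = 1 To Len(s)
        code = AscW(Mid$(s, i, 1))
        If code < 0 Then code = code + 65536    'AscW returns a signed Integer-style value
        If code > 127 Then
            out = out & "\u" & LCase$(Right$("000" & Hex$(code), 4))
        Else
            out = out & Mid$(s, i, 1)
        End If
    Next i
    EscapeNonAscii = out
End Function
With that, Print #fnum, EscapeNonAscii(str_temp) would write "Hello world here is an \u00f6".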
Late to the game here, but I can't be the only coder who got fed up with my SQL imports being broken by text files with a Byte Order Marker. There are very few Stack Overflow questions that touch on the problem - this is one of the closest - so I'm posting an overlapping answer here.
I say 'overlapping' because the code below is solving a slightly different problem to yours - the primary purpose is writing a Schema file for a folder with a heterogeneous collection of files - but the BOM-handling segment is clearly marked.
The key functionality is that we iterate through all the '.csv' files in a folder, and we test each file with a quick nibble of the first four bytes: we only strip out the Byte Order Marker if we see one.
After that, we're working in low-level file-handling code straight out of primordial C. We have to, all the way down to using byte arrays, because everything else that you do in VBA will deposit Byte Order Markers embedded in the structure of a string variable.
So, without further adodb, here's the code:
BOM-Disposal code for text files in a schema.ini file:
Public Sub SetSchema(strFolder As String)
On Error Resume Next
    ' Write a Schema.ini file to the data folder.
    ' This is necessary if we do not have the registry privileges to set the
    ' correct 'ImportMixedTypes=Text' registry value, which overrides IMEX=1
    ' The code also checks for ANSI or UTF-8 and UTF-16 files, and applies a
    ' usable setting for CharacterSet ( UNICODE|ANSI ) with a horrible hack.
    ' OEM codepage-defined text is not supported: further coding is required
    ' ...And we strip out Byte Order Markers, if we see them - the OLEDB SQL
    ' provider for textfiles can't deal with a BOM in a UTF-16 or UTF-8 file
    ' Not implemented: handling tab-delimited files or other delimiters. The
    ' code assumes a header row with columns, specifies 'scan all rows', and
    ' imposes 'read the column as text' if the data types are mixed.
    Dim strSchema As String
    Dim strFile As String
    Dim hndFile As Long
    Dim arrFile() As Byte
    Dim arrBytes(0 To 4) As Byte

    If Right(strFolder, 1) <> "\" Then strFolder = strFolder & "\"

    ' Dir() is an iterator function when you call it with a wildcard:
    strFile = VBA.FileSystem.Dir(strFolder & "*.csv")

    Do While Len(strFile) > 0
        hndFile = FreeFile
        Open strFolder & strFile For Binary As #hndFile
        Get #hndFile, , arrBytes
        Close #hndFile

        strSchema = strSchema & "[" & strFile & "]" & vbCrLf
        strSchema = strSchema & "Format=CSVDelimited" & vbCrLf
        strSchema = strSchema & "ImportMixedTypes=Text" & vbCrLf
        strSchema = strSchema & "MaxScanRows=0" & vbCrLf
        If arrBytes(2) = 0 Or arrBytes(3) = 0 Then ' this is a hack
            strSchema = strSchema & "CharacterSet=UNICODE" & vbCrLf
        Else
            strSchema = strSchema & "CharacterSet=ANSI" & vbCrLf
        End If
        strSchema = strSchema & "ColNameHeader = True" & vbCrLf
        strSchema = strSchema & vbCrLf

        ' BOM disposal - Byte order marks confuse OLEDB text drivers:
        If arrBytes(0) = &HFE And arrBytes(1) = &HFF _
        Or arrBytes(0) = &HFF And arrBytes(1) = &HFE Then
            hndFile = FreeFile
            Open strFolder & strFile For Binary As #hndFile
            ReDim arrFile(0 To LOF(hndFile) - 1)
            Get #hndFile, , arrFile
            Close #hndFile
            BigReplace arrFile, arrBytes(0) & arrBytes(1), ""
            hndFile = FreeFile
            Open strFolder & strFile For Binary As #hndFile
            Put #hndFile, , arrFile
            Close #hndFile
            Erase arrFile
        ElseIf arrBytes(0) = &HEF And arrBytes(1) = &HBB And arrBytes(2) = &HBF Then
            hndFile = FreeFile
            Open strFolder & strFile For Binary As #hndFile
            ReDim arrFile(0 To LOF(hndFile) - 1)
            Get #hndFile, , arrFile
            Close #hndFile
            BigReplace arrFile, arrBytes(0) & arrBytes(1) & arrBytes(2), ""
            hndFile = FreeFile
            Open strFolder & strFile For Binary As #hndFile
            Put #hndFile, , arrFile
            Close #hndFile
            Erase arrFile
        End If

        strFile = ""
        strFile = Dir
    Loop

    If Len(strSchema) > 0 Then
        strFile = strFolder & "Schema.ini"
        hndFile = FreeFile
        Open strFile For Binary As #hndFile
        Put #hndFile, , strSchema
        Close #hndFile
    End If
End Sub

Public Sub BigReplace(ByRef arrBytes() As Byte, ByRef SearchFor As String, ByRef ReplaceWith As String)
On Error Resume Next
    Dim varSplit As Variant
    varSplit = Split(arrBytes, SearchFor)
    arrBytes = Join$(varSplit, ReplaceWith)
    Erase varSplit
End Sub
The code's easier to understand if you know that a Byte Array can be assigned to a VBA.String, and vice versa. The BigReplace() function is a hack that sidesteps some of VBA's inefficient string-handling, especially allocation: you'll find that large files cause serious memory and performance problems if you do it any other way.
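If the Byte-array/String duality is unfamiliar, this tiny demo shows the assignment working both ways (remember a VBA String is UTF-16, so two bytes per character):
'Demo: a VBA String and a Byte array can be assigned to each other directly.
Sub ByteStringDemo()
    Dim s As String, b() As Byte
    b = "AB"                                'b now holds 4 bytes: 65, 0, 66, 0 (UTF-16LE)
    Debug.Print UBound(b) - LBound(b) + 1   'prints 4
    s = b                                   'back to a String
    Debug.Print s                           'prints AB
End Sub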