Convert html to plain text in VBA - html

I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.
I can write a script to convert html to plain text in PHP so if you can't think of a solution in VBA then maybe you can sugest how I might pass the cells data to a website and retrieve the data back.

Set a reference to "Microsoft HTML object library".
Function HtmlToText(sHTML) As String
Dim oDoc As HTMLDocument
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = sHTML
HtmlToText = oDoc.body.innerText
End Function
Tim

A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.
Function StripTags(ByVal html As String) As String
Dim text As String
Dim accumulating As Boolean
Dim n As Integer
Dim c As String
text = ""
accumulating = True
n = 1
Do While n <= Len(html)
c = Mid(html, n, 1)
If c = "<" Then
accumulating = False
ElseIf c = ">" Then
accumulating = True
Else
If accumulating Then
text = text & c
End If
End If
n = n + 1
Loop
StripTags = text
End Function
This can leave lots of extraneous whitespace, but it will help in removing the tags.

Tim's solution was great, worked liked a charm.
I´d like to contribute: Use this code to add the "Microsoft HTML Object Library" in runtime:
Set ID = ThisWorkbook.VBProject.References
ID.AddFromGuid "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}", 2, 5
It worked on Windows XP and Windows 7.

Tim's answer is excellent. However, a minor adjustment can be added to avoid one foreseeable error response.
Function HtmlToText(sHTML) As String
Dim oDoc As HTMLDocument
If IsNull(sHTML) Then
HtmlToText = ""
Exit Function
End-If
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = sHTML
HtmlToText = oDoc.body.innerText
End Function

Yes! I managed to solve my problem as well. Thanks everybody/
In my case, I had this sort of input:
<p>Lorem ipsum dolor sit amet.</p>
<p>Ut enim ad minim veniam.</p>
<p>Duis aute irure dolor in reprehenderit.</p>
And I did not want the result to be all jammed together without breaklines.
So I first splitted my input for every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of html (very sweet answer btw).
In addition I concatenated each cleaned 'paragraph' with this breakline character Crh(10) for VBA/Excel.
The final code is:
Public Function HtmlToText(ByVal sHTML As String) As String
Dim oDoc As HTMLDocument
Dim result As String
Dim paragraphs() As String
If IsNull(sHTML) Then
HtmlToText = ""
Exit Function
End If
result = ""
paragraphs = Split(sHTML, "<p>")
For Each paragraph In paragraphs
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = paragraph
result = result & Chr(10) & Chr(10) & oDoc.body.innerText
Next paragraph
HtmlToText = result
End Function

Here's a variation of Tim's and Gardoglee's solution that does not require setting a reference to "Microsoft HTML object library". This method is known as Late Binding and will also work in vbscript.
Function HtmlToText(sHTML) As String
Dim oDoc As Object ' As HTMLDocument
If IsNull(sHTML) Then
HtmlToText = ""
Exit Function
End If
Set oDoc = CreateObject("HTMLFILE")
oDoc.body.innerHTML = sHTML
HtmlToText = oDoc.body.innerText
End Function
Note that if you are using VBA in Access 2007 or greater, there is an Application.PlainText() method built-in that does the same thing as the code above.

Related

Retrieving a "Var" values from an HTML file

I have an html file that contains many "var"s in a section delimited by "<!--";
<!--
var g_stickyTableHeadersScrollVersion=1;... ;var g_priceListInfo={...,"arrProducts":[{"name":"...","type":"...","arrVariants":[{"name":"...","priceGroup":"..."},{"name":"...","priceGroup":"..."},...,{"name":"...","defaultSlabSize":[...,...],"priceGroup":"..."}],{"name":"...","price":...,"isSlabPricing":1}]}...}
-->
I'm at loss as to how to get the arrProducts array of g_priceListInfo variable values
After many (really a lot of) different attempts I thought I could use querySelector method of HTMLDocument as follows:
Dim url As String
url = "C:\Users\...\myHTMLFile.html"
Dim oFSO As FileSystemObject
Dim oFS As Object, sText As String
Set oFSO = New FileSystemObject ' CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(url)
Do Until oFS.AtEndOfStream
sText = oFS.ReadAll()
Loop
Dim doc As HTMLDocument
Set doc = CreateObject("htmlfile")
doc.body.innerHTML = sText
Dim ele As Object
Set ele = doc.querySelector("g_priceListInfo.arrProducts.name")
but, provided that is the right path, I couldn't find the correct syntax to make it work
thanks in advance for any help
EDIT: adding the relevant html page code view snapshots
EDIT 19/08/2022:
I finally made it by means of a brute force string manipulation
Then I found the no-ScriptControl & no-GitHub JSon parser solution at this link, which gave me the same results of my brute force method
I'd point everybody with the same need as this one of mine to that solution

How to find/extract an HTML "font" element with attribute size="+1" using Excel VBA

I want to extract a U.S. Patent title from a url like
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874
(Update: as pointed out the comments, the patent title is not labeled "Title;" however, it consistently appears by itself above "Abstract" on the web page.) In most cases it is in the 7th child element of "body" or the 3rd "font" element in the document, but occasionally a notice at the top of the page to "** Please see images for: ( Certificate of Correction ) **" or "( Reexamination Certificate )" messes up both methods of extraction by inserting one additional child of "body" and three additional "font" elements before you get to the title.
However, the title seems to be consistently the first "font" element with the attribute "size" having a value of "+1". Unfortunately other elements have size="-1", including the aforementioned elements that are not always present, so it has to be specifically with that attribute and value. I have searched but can't figure out how to get elements by attribute and value. Here is my code:
Function Test_UpdateTitle(url As String)
Dim title As String
Dim pageSource As String
Dim xml_obj As XMLHTTP60
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", url, False
xml_obj.send
pageSource = xml_obj.responseText
Set xml_obj = Nothing
Dim html_doc As HTMLDocument
Set html_doc = CreateObject("HTMLFile")
html_doc.body.innerHTML = pageSource
Dim fontElement As IHTMLElement
'Methods 1 and 2 fail in cases of a certificate of correction or reexamination certificate
'Method 1
' Dim body As IHTMLElement
' Set body = html_doc.getElementsByTagName("body").Item(0)
' Set fontElement = body.Children(6)
'Method 2
' Set fontElement = html_doc.getElementsByTagName("font").Item(3)
'Method 3
Dim n As Integer
For n = 3 To html_doc.getElementsByTagName("font").Length - 1
Set fontElement = html_doc.getElementsByTagName("font").Item(n)
If InStr(fontElement.innerText, "Please see") = 0 And _
InStr(fontElement.innerText, "( Certificate of Correction )") = 0 And _
InStr(fontElement.innerText, "( Reexamination Certificate )") = 0 And _
InStr(fontElement.innerText, " **") = 0 Then
Test_UpdateTitle = fontElement.innerText
Exit Function
End If
Next n
End Function
I should add that the " **" is not working to skip the the last element <b> **</b> and I am getting " **" as the title where there is a notice to please see images. Is asterisk a wildcard character in this context?
You can try this. As long as its the first font tag with the size attribute and a value of "+1" this should work. I only tested with 3 different pages but they all returned the correct results.
Function Test_UpdateTitle(url)
title = "Title Not Found!"
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", url, False
xml_obj.send
pageSource = xml_obj.responseText
Set xml_obj = Nothing
Set document = CreateObject("HTMLFile")
document.write pageSource
For i = 0 To document.getElementsByTagName("font").length - 1
If document.getElementsByTagName("font")(i).size = "+1" Then
title = document.getElementsByTagName("font")(i).innerText
Exit For
End If
Next
Test_UpdateTitle = title
End Function
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=fight.TI.&OS=TTL/fight&RS=TTL/fight")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=14&f=G&l=50&d=PTXT&p=1&S1=search&OS=search&RS=search")
This answer is somewhat incomplete because my Excel won't do these lines:
Dim xml_obj As XMLHTTP60
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
But I think this might be a preferred approach.
Instead of using USPTO's site, what about using Google's?
Hit this URL: http://www.google.com/patents/US6293874
Note that the patent number is apparent in that URL.
Then, in your function, just pull that tag named invention-title.
Set titleElement = html_doc.getElementsByTagName("invention-title").Item(0)
title = titleElement.innerText
MsgBox(title)
If you check the source on that page, there's only one of those.
If you're open to this alternative approach, it would be relatively easy to parse patent numbers from the URLs you have, and I think the extraction of invention-title would be much more reliable.
See if this answer is working as intended. Make sure you have references to the following libraries in your workbook:
Microsoft XML, v6.0
Microsoft HTML Object Library
If you are not sure how to add these to Excel just give a read to this link Link to reference adding
Option Explicit
Sub Test()
Debug.Print Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
End Sub
Function Test_UpdateTitle(ByVal strURL As String) As String
Dim oHTTP As MSXML2.XMLHTTP60
Dim oDoc As MSHTML.HTMLDocument
Dim oFontTags As Variant
Dim oFontTag As HTMLFontElement
Dim strInnerText As String
Dim strSize As String
' Create the http object and send it.
Set oHTTP = New MSXML2.XMLHTTP60
oHTTP.Open "GET", strURL, False
oHTTP.send
' Make sure that get the a reponse back
If oHTTP.Status = 200 Then
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = oHTTP.responseText
Set oFontTags = oDoc.getElementsByTagName("font")
' Go through all the tags.
For Each oFontTag In oFontTags
'Get the inner text and size of each tag.
strInnerText = oFontTag.innerText
strSize = oFontTag.getAttributeNode("size").Value
'Compare to make sure you have what's needed
If InStr(strInnertText, "Please see") = 0 And _
InStr(strInnertText, "( Certificate of Correction )") = 0 And _
InStr(strInnertText, "( Reexamination Certificate )") = 0 And _
InStr(strInnertText, " **") = 0 Then
If strSize = "+1" Then
Test_UpdateTitle = strInnerText
Exit Function
End If
End If
Next oFontTag
End If
End Function
I hope this helps. :)
In short:
Debug.Print html_doc.querySelector("font[size=+1]").innerText
tl;dr;
① CSS selector:
No need for any of the long winded methods. You have stated the styling pattern. Use a CSS selector to grab it.
font[size=+1]
The reads as font tag with attribute size who value is +1. You may need font[size='+1'] when using VBA to try both.
② CSS query:
③ VBA:
As it is the first match you want you can use the querySelector method of document to apply the selector and retrieve a single element.
html_doc.querySelector("font[size=+1]")
You may need to add a reference to HTML Object Library and use an early bound call of Dim html_doc As HTMLDocument to access the method. The late bound method may expose the querySelector method but if the interface doesn't then use early binding.

Extract variable from HTML source

I'm trying to figure out a way to extract information from an HTML source using a Visual Basic based application I made in Visual Studio 2010. What I'd like to do is have the system load a webpage (based on an order number) in the background and search for a value that is assigned to an HTML tag and return that value. For example, in the href string below:
href="#" class="lnk11blue" onClick="parent.gotoPage('/oe/cllctrshp_.html?dlvyId=26130700&hdrId=7205902&lineId=21188936&ordLnTyp=FEL SVC LINE','dWnd')">26130700
I want the tool to return the 26130700, either after the "dlvyId=" or at the end of the href. The issue I have is the dlvyId changes with every order, as probably does the hdrId and lineId values, so is there a way to have my program read the value after "dlvyId=" after it locates this string? Or is there a way for the program to read the text after the greater than carat after locating the string?
I'll mess around and see what I can find out (and hopefully post code of some attempts), but any ideas/help in the meantime would be greatly appreciated.
Edit: Thanks to Steve, I've got the function to search a string. However, now I'm having trouble loading the page source. I tried this code below but it doesn't seems to work:
Dim objHttp as object
Dim strURL As String
Dim strText As String
objHttp = CreateObject("MSXML2.ServerXMLHTTP")
strURL = "http://companywebprd.tc.company.com/oe/cllctrord_.html?order_nbr=" &
RMA_Number.Text & "&customer_id=&po_nbr=&ord_date=&ord_cond=0&ord_kind=0&serial_nbr=&item_id=
&i_ord_type=&instance=&svcChk=1&custSiteUseId=0&custSiteUseCd=0"
objHttp.Open("GET", strURL, False)
objHttp.Send("")
strText = objHttp.responsetext
Advice? I'll keep searching around as well
This should get you started down the right path...
Sub test()
Const mystring = "href='#' class='lnk11blue' onClick='parent.gotoPage('/oe/cllctrshp_.html?dlvyId=26130700&hdrId=7205902&lineId=21188936&ordLnTyp=FEL SVC LINE','dWnd')'>26130700'"
If InStr(1, mystring, "dlvyId=") <> 0 Then
For i = InStr(1, mystring, "dlvyId=") To Len(mystring)
If Mid(mystring, i, 1) = "&" Then
Exit For
End If
Next i
MsgBox Mid(mystring, InStr(1, mystring, "dlvyId="), (i - InStr(1, mystring, "dlvyId=")))
End If
End Sub

Read Local HTML File into String With VBA

This feels like it should be simple. I have a .HTML file stored on my computer, and I'd like to read the entire file into a string. When I try the super straightforward
Dim FileAsString as string
Open "C:\Myfile.HTML" for input as #1
Input #1, FileAsString
Close #1
debug.print FileAsString
I don't get the whole file. I only get the first few lines (I know the immediate window cuts off, but that's not the issue. I'm definitely not getting the whole file into my string.) I also tried using an alternative method using the file system object, and got similar results, only this time with lots of weird characters and question marks thrown in. This makes me think it's probably some kind of encoding issue. (Although frankly, I don't fully understand what that means. I know there are different encoding formats and that this can cause issues with string parsing, but that's about it.)
So more generally, here's what I'd really like to know: How can I use vba to open a file of any extension (that can be viewed in a text editor) and length (that's doesn't exceed VBA's string limit), and be sure that whatever characters I would see in a basic text editor are what gets read into a string? (If that can't be (easily) done, I'd certainly appreciate being pointed towards a method that's likely to work with .html files) Thanks so much for your help
EDIT:
Here's an example of what happens when I use the suggested method. Specifically
Dim oFSO As Object
Dim oFS As Object, sText As String
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(Path)
Do Until oFS.AtEndOfStream
sText = oFS.ReadAll()
Loop
FileToString = sText
Set oFSO = Nothing
Set oFS = Nothing
End Function
I'll show you both the beginning (via a message box) and the end (via the immediate window) because both are weird in different ways. In both cases I'll compare it to a screen capture of the html source displayed in chrome:
Beginning:
End:
This is one method
Option Explicit
Sub test()
Dim oFSO As Object
Dim oFS As Object, sText As String
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv")
Do Until oFS.AtEndOfStream
' sText = oFS.ReadLine 'read line by line
sText = oFS.ReadAll()
Debug.Print sText
Loop
End Sub
EDIT:
Try changing the following line to one of the following 3 lines and see if it makes any difference
http://msdn.microsoft.com/en-us/library/aa265347(v=vs.60).aspx
Set FS = FSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv", 1, 0)
Set FS = FSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv", 1, 1)
Set FS = FSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv", 1, 2)
EDIT2:
Does this code work for you?
Function ExecuteWebRequest(ByVal url As String) As String
Dim oXHTTP As Object
Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
oXHTTP.Open "GET", url, False
oXHTTP.send
ExecuteWebRequest = oXHTTP.responseText
Set oXHTTP = Nothing
End Function
Function OutputText(ByVal outputstring As String)
MyFile = ThisWorkbook.Path & "\temp.html"
'set and open file for output
fnum = FreeFile()
Open MyFile For Output As fnum
'use Print when you want the string without quotation marks
Print #fnum, outputstring
Close #fnum
End Function
Sub test()
Dim oFSO As Object
Dim oFS As Object, sText As String
Dim Uri As String, HTML As String
Uri = "http://www.forrent.com/results.php?search_type=citystate&page_type_id=city&seed=859049165&main_field=12345&ssradius=-1&min_price=%240&max_price=No+Limit&sbeds=99&sbaths=99&search-submit=Submit"
HTML = ExecuteWebRequest(Uri)
OutputText (HTML)
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(ThisWorkbook.Path & "\temp.html")
Do Until oFS.AtEndOfStream
' sText = oFS.ReadLine 'read line by line
sText = oFS.ReadAll()
Debug.Print sText
Loop
End Sub
Okay so I finally managed to figure this out. The VBA file system object can only read asciiII files, and I had saved mine as unicode. Sometimes, as in my case, saving an asciiII file can cause errors. You can get around this, however, by converting the file to binary, and then back to a string. The details are explained here http://bytes.com/topic/asp-classic/answers/521362-write-xmlhttp-result-text-file.
A bit late to answer but I did this exact thing today (works perfectly):
Sub modify_local_html_file()
Dim url As String
Dim html As Object
Dim fill_a As Object
url = "C:\Myfile.HTML"
Dim oFSO As Object
Dim oFS As Object, sText As String
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(url)
Do Until oFS.AtEndOfStream
sText = oFS.ReadAll()
Debug.Print sText
Loop
Set html = CreateObject("htmlfile")
html.body.innerHTML = sText
oFS.Close
Set oFS = Nothing
'# grab some element #'
Set fill_a = html.getElementById("val_a")
MsgBox fill_a.innerText
'# change its inner text #'
fill_a.innerText = "20%"
MsgBox fill_a.innerText
'# open file this time to write to #'
Set oFS = oFSO.OpenTextFile(url, 2)
'# write it modified html #'
oFS.write html.body.innerHTML
oFS.Close
Set oFSO = Nothing
Set oFS = Nothing
End Sub

Replace Module Text in MS Access using VBA

How do I do a search and replace of text within a module in Access from another module in access? I could not find this on Google.
FYI, I figured out how to delete a module programatically:
Call DoCmd.DeleteObject(acModule, modBase64)
I assume you mean how to do this programatically (otherwise it's just ctrl-h). Unless this is being done in the context of a VBE Add-In, it is rarely (if ever) a good idea. Self modifying code is often flagged by AV software an although access will let you do it, it's not really robust enough to handle it, and can lead to corruption problems etc. In addition, if you go with self modifying code you are preventing yourself from ever being able to use an MDE or even a project password. In other words, you will never be able to protect your code. It might be better if you let us know what problem you are trying to solve with self modifying code and see if a more reliable solution could be found.
After a lot of searching I found this code:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Function to Search for a String in a Code Module. It will return True if it is found and
'False if it is not. It has an optional parameter (NewString) that will allow you to
'replace the found text with the NewString. If NewString is not included in the call
'to the function, the function will only find the string not replace it.
'
'Created by Joe Kendall 02/07/2003
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Public Function SearchOrReplace(ByVal ModuleName As String, ByVal StringToFind As String, _
Optional ByVal NewString, Optional ByVal FindWholeWord = False, _
Optional ByVal MatchCase = False, Optional ByVal PatternSearch = False) As Boolean
Dim mdl As Module
Dim lSLine As Long
Dim lELine As Long
Dim lSCol As Long
Dim lECol As Long
Dim sLine As String
Dim lLineLen As Long
Dim lBefore As Long
Dim lAfter As Long
Dim sLeft As String
Dim sRight As String
Dim sNewLine As String
Set mdl = Modules(ModuleName)
If mdl.Find(StringToFind, lSLine, lSCol, lELine, lECol, FindWholeWord, _
MatchCase, PatternSearch) = True Then
If IsMissing(NewString) = False Then
' Store text of line containing string.
sLine = mdl.Lines(lSLine, Abs(lELine - lSLine) + 1)
' Determine length of line.
lLineLen = Len(sLine)
' Determine number of characters preceding search text.
lBefore = lSCol - 1
' Determine number of characters following search text.
lAfter = lLineLen - CInt(lECol - 1)
' Store characters to left of search text.
sLeft = Left$(sLine, lBefore)
' Store characters to right of search text.
sRight = Right$(sLine, lAfter)
' Construct string with replacement text.
sNewLine = sLeft & NewString & sRight
' Replace original line.
mdl.ReplaceLine lSLine, sNewLine
End If
SearchOrReplace = True
Else
SearchOrReplace = False
End If
Set mdl = Nothing
End Function
Check out the VBA object browser for the Access library. Under the Module object you can search the Module text as well as make replacements. Here is an simple example:
In Module1
Sub MyFirstSub()
MsgBox "This is a test"
End Sub
In Module2
Sub ChangeTextSub()
Dim i As Integer
With Application.Modules("Module1")
For i = 1 To .CountOfLines
If InStr(.Lines(i, 1), "This is a Test") > 0 Then
.ReplaceLine i, "Msgbox ""It worked!"""
End If
Next i
End With
End Sub
After running ChangeTextSub, MyFirstSub should read
Sub MyFirstSub()
MsgBox "It worked!"
End Sub
It's a pretty simple search but hopefully that can get you going.
additional for the function (looping through all the lines)
Public Function ReplaceWithLine(modulename As String, StringToFind As String, NewString As String)
Dim mdl As Module
Set mdl = Modules(modulename)
For x = 0 To mdl.CountOfLines
Call SearchOrReplace(modulename, StringToFind, NewString)
Next x
Set mdl = Nothing
End Function
Enjoy ^^