Extract text from HTML with substring methods

I want to extract text from HTML.
I'm already getting the HTML source with WebRequest.
How can I extract text like in the following example?
class="btn btn-success btn-lg" href="I want to get this link, which changes every time" rel="nofollow noopener">Click</a><
Can I do it using substring methods like StartsWith and EndsWith?
Thanks

Using String.IndexOf I found a solution. I was struggling a bit with the doubled quotes ("") needed inside the HTML string, but this now does what it is supposed to do.
Dim allInputText As String = RichTextBox1.Text
Dim textAfter As String = """ rel=""nofollow noopener"
Dim textBefore As String = "class=""btn btn-success btn-lg"" href="""
Dim startPosition As Integer = allInputText.IndexOf(textBefore)
'If text before was not found, stop here (assumes this code runs inside a Sub, e.g. a button click handler)
If startPosition < 0 Then
Exit Sub
End If
'Move the start position to the end of the text before, rather than the beginning.
startPosition += textBefore.Length
'Find the first occurrence of text after the desired link
Dim endPosition As Integer = allInputText.IndexOf(textAfter, startPosition)
'If text after was not found, stop here
If endPosition < 0 Then
Exit Sub
End If
'Get the string found between the start and end positions
Dim textFound As String = allInputText.Substring(startPosition, endPosition - startPosition)
TextBox4.Text = textFound
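The same "text before / text after" idea can be wrapped in a reusable function. The sketch below is written in VBA (the dialect used in most of the related threads on this page) rather than VB.NET, so it uses InStr and Mid$ in place of IndexOf and Substring. The name TextBetween and its parameter names are hypothetical, chosen for illustration.

```vba
' Hypothetical helper: returns the text found between textBefore and textAfter,
' or an empty string when either marker is missing.
Function TextBetween(ByVal source As String, ByVal textBefore As String, _
                     ByVal textAfter As String) As String
    Dim startPos As Long
    Dim endPos As Long

    TextBetween = ""

    ' InStr is 1-based and returns 0 when nothing is found
    startPos = InStr(1, source, textBefore)
    If startPos = 0 Then Exit Function

    ' Skip past the marker itself
    startPos = startPos + Len(textBefore)

    ' Search for the closing marker only after the opening one
    endPos = InStr(startPos, source, textAfter)
    If endPos = 0 Then Exit Function

    TextBetween = Mid$(source, startPos, endPos - startPos)
End Function
```

Returning an empty string for "not found" keeps the caller simple, at the cost of not distinguishing a missing marker from a genuinely empty attribute value.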

Related

Parsing HTML to recreate tables in a Word Document using VBA

Is there a way of taking the HTML code for a table and printing out the same table in a Word document using VBA (VBA should be able to parse the HTML code block for a table)?
It is possible to take the contents of the table and copy them into a new table created in Word; however, is it possible to recreate a table using the HTML code and VBA?
For any of this, where can one begin to research?
EDIT:
Thanks to R3uK: here is the first portion of the VBA script, which reads a line of HTML code from a file and uses R3uK's code to print it to the Excel worksheet:
Private Sub button1_Click()
Dim the_string As String
the_string = Trim(ImportTextFile("path\to\file.txt"))
' still working on removing new line characters
Call PrintHTML_Table(the_string)
End Sub
Public Function ImportTextFile(strFile As String) As String
' http://mrspreadsheets.com/1/post/2013/09/vba-code-snippet-22-read-entire-text-file-into-string-variable.html
Open strFile For Input As #1
ImportTextFile = Input$(LOF(1), 1)
Close #1
End Function
' Insert R3uK's portion of the code here
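For the newline removal that the comment in button1_Click mentions as still pending, one possible sketch is a small helper applied to the_string before it is passed to PrintHTML_Table. The name RemoveLineBreaks is hypothetical.

```vba
' Hypothetical helper: strips Windows, old Mac and Unix line endings.
Function RemoveLineBreaks(ByVal text As String) As String
    ' Replace the two-character sequence first so no stray characters remain
    text = Replace(text, vbCrLf, "")
    text = Replace(text, vbCr, "")
    text = Replace(text, vbLf, "")
    RemoveLineBreaks = text
End Function
```

It could be used as the_string = RemoveLineBreaks(Trim(ImportTextFile("path\to\file.txt"))).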
This could be a good place to start. You will only need to check the content afterwards to see if there is any problem, and then copy it to Word.
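Sub PrintHTML_Table(ByVal StrTable As String)
' Must be a String array: a Variant array cannot receive the String() that SplitTo2DArray returns
Dim TA() As String
Dim Table_String As String
Dim i As Long, j As Long
Table_String = " " & StrTable & " "
TA = SplitTo2DArray(Table_String, "</tr>", "</td>")
' SplitTo2DArray fills TA(column, row), so i walks the columns and j walks the rows
For i = LBound(TA, 1) To UBound(TA, 1)
For j = LBound(TA, 2) To UBound(TA, 2)
ActiveSheet.Cells(j + 1, i + 1) = Trim(Replace(Replace(TA(i, j), "<td>", ""), "<tr>", ""))
Next j
Next i
End Sub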
Public Function SplitTo2DArray(ByRef StringToSplit As String, ByRef RowSep As String, ByRef ColSep As String) As String()
Dim Rows As Variant
Dim rowNb As Long
Dim Columns() As Variant
Dim i As Long
Dim maxlineNb As Long
Dim lineNb As Long
Dim asCells() As String
Dim j As Long
' Split up the table value by rows, get the number of rows, and dim a new array of Variants.
Rows = Split(StringToSplit, RowSep)
rowNb = UBound(Rows)
ReDim Columns(0 To rowNb)
' Iterate through each row, and split it into columns. Find the maximum number of columns.
maxlineNb = 0
For i = 0 To rowNb
Columns(i) = Split(Rows(i), ColSep)
lineNb = UBound(Columns(i))
If lineNb > maxlineNb Then
maxlineNb = lineNb
End If
Next i
' Create a 2D string array to contain the data in <Columns>.
ReDim asCells(0 To maxlineNb, 0 To rowNb)
' Copy all the data from Columns() to asCells().
For i = 0 To rowNb
For j = 0 To UBound(Columns(i))
asCells(j, i) = Columns(i)(j)
Next j
Next i
SplitTo2DArray = asCells()
End Function
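Since the original question asks for a table in Word rather than Excel, the array returned by SplitTo2DArray could also be written straight into a Word table. The following is only a sketch under several assumptions: it runs in Word's own VBA editor, the table is inserted at the current cursor position, and the array uses the (column, row) layout produced above. The name TableArrayToWord is hypothetical.

```vba
' Sketch only: writes a 2D string array laid out as tableData(column, row)
' into a new Word table at the current selection.
Sub TableArrayToWord(ByRef tableData() As String)
    Dim tbl As Table
    Dim r As Long, c As Long
    Dim rowCount As Long, colCount As Long
    Dim cellText As String

    colCount = UBound(tableData, 1) - LBound(tableData, 1) + 1
    rowCount = UBound(tableData, 2) - LBound(tableData, 2) + 1

    ' Create an empty table of the right size
    Set tbl = ActiveDocument.Tables.Add(Selection.Range, rowCount, colCount)

    ' Word table cells are addressed 1-based as Cell(row, column)
    For r = 1 To rowCount
        For c = 1 To colCount
            cellText = tableData(LBound(tableData, 1) + c - 1, LBound(tableData, 2) + r - 1)
            ' Remove the leftover opening tags, as PrintHTML_Table does
            cellText = Replace(Replace(cellText, "<td>", ""), "<tr>", "")
            tbl.Cell(r, c).Range.Text = Trim(cellText)
        Next c
    Next r
End Sub
```

Because Split leaves an empty element after the last separator, the resulting table will likely contain an empty trailing row and column that can be deleted afterwards.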

Read Local HTML File into String With VBA

This feels like it should be simple. I have a .HTML file stored on my computer, and I'd like to read the entire file into a string. When I try the super straightforward
Dim FileAsString as string
Open "C:\Myfile.HTML" for input as #1
Input #1, FileAsString
Close #1
debug.print FileAsString
I don't get the whole file. I only get the first few lines (I know the immediate window cuts off, but that's not the issue. I'm definitely not getting the whole file into my string.) I also tried using an alternative method using the file system object, and got similar results, only this time with lots of weird characters and question marks thrown in. This makes me think it's probably some kind of encoding issue. (Although frankly, I don't fully understand what that means. I know there are different encoding formats and that this can cause issues with string parsing, but that's about it.)
So more generally, here's what I'd really like to know: How can I use VBA to open a file of any extension (that can be viewed in a text editor) and length (that doesn't exceed VBA's string limit), and be sure that whatever characters I would see in a basic text editor are what gets read into a string? (If that can't be (easily) done, I'd certainly appreciate being pointed towards a method that's likely to work with .html files.) Thanks so much for your help.
EDIT:
Here's an example of what happens when I use the suggested method. Specifically
Function FileToString(ByVal Path As String) As String
Dim oFSO As Object
Dim oFS As Object, sText As String
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(Path)
Do Until oFS.AtEndOfStream
sText = oFS.ReadAll()
Loop
FileToString = sText
Set oFSO = Nothing
Set oFS = Nothing
End Function
I'll show you both the beginning (via a message box) and the end (via the immediate window) because both are weird in different ways. In both cases I'll compare it to a screen capture of the html source displayed in chrome:
Beginning: (screenshot not preserved)
End: (screenshot not preserved)
This is one method
Option Explicit
Sub test()
Dim oFSO As Object
Dim oFS As Object, sText As String
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv")
Do Until oFS.AtEndOfStream
' sText = oFS.ReadLine 'read line by line
sText = oFS.ReadAll()
Debug.Print sText
Loop
End Sub
EDIT:
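Try changing the OpenTextFile line to one of the following three lines and see if it makes any difference. The fourth argument selects the file format (0 = ASCII, -1 = Unicode, -2 = system default); the third argument only controls whether a missing file is created:
http://msdn.microsoft.com/en-us/library/aa265347(v=vs.60).aspx
Set oFS = oFSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv", 1, False, 0)
Set oFS = oFSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv", 1, False, -1)
Set oFS = oFSO.OpenTextFile("C:\Users\osknows\Desktop\import-store.csv", 1, False, -2)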
EDIT2:
Does this code work for you?
Function ExecuteWebRequest(ByVal url As String) As String
Dim oXHTTP As Object
Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
oXHTTP.Open "GET", url, False
oXHTTP.send
ExecuteWebRequest = oXHTTP.responseText
Set oXHTTP = Nothing
End Function
Function OutputText(ByVal outputstring As String)
MyFile = ThisWorkbook.Path & "\temp.html"
'set and open file for output
fnum = FreeFile()
Open MyFile For Output As fnum
'use Print when you want the string without quotation marks
Print #fnum, outputstring
Close #fnum
End Function
Sub test()
Dim oFSO As Object
Dim oFS As Object, sText As String
Dim Uri As String, HTML As String
Uri = "http://www.forrent.com/results.php?search_type=citystate&page_type_id=city&seed=859049165&main_field=12345&ssradius=-1&min_price=%240&max_price=No+Limit&sbeds=99&sbaths=99&search-submit=Submit"
HTML = ExecuteWebRequest(Uri)
OutputText (HTML)
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(ThisWorkbook.Path & "\temp.html")
Do Until oFS.AtEndOfStream
' sText = oFS.ReadLine 'read line by line
sText = oFS.ReadAll()
Debug.Print sText
Loop
End Sub
Okay, so I finally managed to figure this out. By default the VBA FileSystemObject reads files as ASCII, and I had saved mine as Unicode. Sometimes, as in my case, saving the file as ASCII instead can cause errors. You can get around this, however, by reading the file as binary and then converting it back to a string. The details are explained here: http://bytes.com/topic/asp-classic/answers/521362-write-xmlhttp-result-text-file.
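Another way to read a file with an explicit encoding, which is not necessarily the approach taken in the linked post, is the ADODB.Stream object. The sketch below assumes the file is UTF-8 unless told otherwise; ReadTextFile and its parameter names are hypothetical.

```vba
' Sketch: read a whole text file using a named character set.
' Pass "unicode" for UTF-16 files, or another charset name known to the system.
Function ReadTextFile(ByVal filePath As String, _
                      Optional ByVal charsetName As String = "utf-8") As String
    Dim stream As Object

    Set stream = CreateObject("ADODB.Stream")
    stream.Type = 2                  ' 2 = adTypeText
    stream.Charset = charsetName     ' must be set before reading the text
    stream.Open
    stream.LoadFromFile filePath
    ReadTextFile = stream.ReadText   ' with no argument, reads to the end of the stream
    stream.Close
    Set stream = Nothing
End Function
```

Late binding through CreateObject avoids having to set a reference to the ActiveX Data Objects library, at the cost of losing IntelliSense for the stream members.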
A bit late to answer but I did this exact thing today (works perfectly):
Sub modify_local_html_file()
Dim url As String
Dim html As Object
Dim fill_a As Object
url = "C:\Myfile.HTML"
Dim oFSO As Object
Dim oFS As Object, sText As String
Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oFS = oFSO.OpenTextFile(url)
Do Until oFS.AtEndOfStream
sText = oFS.ReadAll()
Debug.Print sText
Loop
Set html = CreateObject("htmlfile")
html.body.innerHTML = sText
oFS.Close
Set oFS = Nothing
'# grab some element #'
Set fill_a = html.getElementById("val_a")
MsgBox fill_a.innerText
'# change its inner text #'
fill_a.innerText = "20%"
MsgBox fill_a.innerText
'# open file this time to write to #'
Set oFS = oFSO.OpenTextFile(url, 2)
'# write it modified html #'
oFS.write html.body.innerHTML
oFS.Close
Set oFSO = Nothing
Set oFS = Nothing
End Sub

How to extract text content from tags in .NET?

I'm trying to code a vb.net function to extract specific text content from tags; I wrote this function
Public Function GetTagContent(ByRef instance_handler As String, ByRef start_tag As String, ByRef end_tag As String) As String
Dim s As String = ""
Dim content() As String = instance_handler.Split(start_tag)
If content.Count > 1 Then
Dim parts() As String = content(1).Split(end_tag)
If parts.Count > 0 Then
s = parts(0)
End If
End If
Return s
End Function
But it doesn't work, for example with the following debug code
Dim testString As String = "<body>my example <div style=""margin-top:20px""> text to extract </div> <br /> another line.</body>"
txtOutput.Text = testString.GetTagContent("<div style=""margin-top:20px"">", "</div>")
I get only the string "body>my example" instead of "text to extract".
Can anyone help me? Thanks in advance.
I wrote a new routine and the following code works; however, I would like to know whether there is a better-performing way to write it:
Dim s As New StringBuilder()
Dim i As Integer = instance_handler.IndexOf(start_tag, 0)
If i < 0 Then
Return ""
Else
i = i + start_tag.Length
End If
Dim j As Integer = instance_handler.IndexOf(end_tag, i)
If j < 0 Then
s.Append(instance_handler.Substring(i))
Else
s.Append(instance_handler.Substring(i, j - i))
End If
Return s.ToString
XPath is one way of accomplishing this task. I'm sure others will suggest LINQ. Here's an example using XPath:
Dim testString As String = "<body>my example <div style=""margin-top:20px""> text to extract </div> <br /> another line.</body>"
Dim doc As XmlDocument = New XmlDocument()
doc.LoadXml(testString)
MessageBox.Show(doc.SelectSingleNode("/body/div").InnerText)
Obviously, a more complex document may require a more complex xpath than simply "/body/div", but it's still pretty simple.
If you need to get a list of multiple elements that match the path, you can use doc.SelectNodes.

Convert html to plain text in VBA

I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.
I can write a script to convert HTML to plain text in PHP, so if you can't think of a solution in VBA then maybe you can suggest how I might pass the cells' data to a website and retrieve the data back.
Set a reference to "Microsoft HTML object library".
Function HtmlToText(sHTML) As String
Dim oDoc As HTMLDocument
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = sHTML
HtmlToText = oDoc.body.innerText
End Function
Tim
A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.
Function StripTags(ByVal html As String) As String
Dim text As String
Dim accumulating As Boolean
Dim n As Integer
Dim c As String
text = ""
accumulating = True
n = 1
Do While n <= Len(html)
c = Mid(html, n, 1)
If c = "<" Then
accumulating = False
ElseIf c = ">" Then
accumulating = True
Else
If accumulating Then
text = text & c
End If
End If
n = n + 1
Loop
StripTags = text
End Function
This can leave lots of extraneous whitespace, but it will help in removing the tags.
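If the leftover whitespace from StripTags is a problem, a second pass can collapse it. This is a minimal sketch; CollapseSpaces is a hypothetical name, and it treats tabs and line breaks as ordinary spaces, which may not suit every document.

```vba
' Hypothetical helper: turns every run of whitespace into a single space.
Function CollapseSpaces(ByVal text As String) As String
    ' Normalise tabs and line breaks to plain spaces first
    text = Replace(text, vbTab, " ")
    text = Replace(text, vbCr, " ")
    text = Replace(text, vbLf, " ")

    ' Keep halving runs of spaces until no double space is left
    Do While InStr(text, "  ") > 0
        text = Replace(text, "  ", " ")
    Loop

    CollapseSpaces = Trim$(text)
End Function
```

It would be used as CollapseSpaces(StripTags(html)).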
Tim's solution was great, worked like a charm.
I'd like to contribute: use this code to add the "Microsoft HTML Object Library" reference at run time:
Set ID = ThisWorkbook.VBProject.References
ID.AddFromGuid "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}", 2, 5
It worked on Windows XP and Windows 7.
Tim's answer is excellent. However, a minor adjustment can be added to avoid one foreseeable error response.
Function HtmlToText(sHTML) As String
Dim oDoc As HTMLDocument
If IsNull(sHTML) Then
HtmlToText = ""
Exit Function
End If
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = sHTML
HtmlToText = oDoc.body.innerText
End Function
Yes! I managed to solve my problem as well. Thanks, everybody.
In my case, I had this sort of input:
<p>Lorem ipsum dolor sit amet.</p>
<p>Ut enim ad minim veniam.</p>
<p>Duis aute irure dolor in reprehenderit.</p>
And I did not want the result to be all jammed together without line breaks.
So I first split my input at every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of the HTML (very sweet answer btw).
In addition, I concatenated each cleaned 'paragraph' with the line break character Chr(10) for VBA/Excel.
The final code is:
' sHTML is a Variant so that a Null value can actually reach the IsNull check
Public Function HtmlToText(ByVal sHTML As Variant) As String
Dim oDoc As HTMLDocument
Dim result As String
Dim paragraphs() As String
' For Each over an array needs a Variant loop variable
Dim paragraph As Variant
If IsNull(sHTML) Then
HtmlToText = ""
Exit Function
End If
result = ""
paragraphs = Split(sHTML, "<p>")
For Each paragraph In paragraphs
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = paragraph
result = result & Chr(10) & Chr(10) & oDoc.body.innerText
Next paragraph
HtmlToText = result
End Function
Here's a variation of Tim's and Gardoglee's solution that does not require setting a reference to the "Microsoft HTML Object Library". This technique is known as late binding and will also work in VBScript.
Function HtmlToText(sHTML) As String
Dim oDoc As Object ' As HTMLDocument
If IsNull(sHTML) Then
HtmlToText = ""
Exit Function
End If
Set oDoc = CreateObject("HTMLFILE")
oDoc.body.innerHTML = sHTML
HtmlToText = oDoc.body.innerText
End Function
Note that if you are using VBA in Access 2007 or greater, there is an Application.PlainText() method built-in that does the same thing as the code above.
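A minimal sketch of the Access-only alternative, assuming it runs inside Access 2007 or later (DemoPlainText and the sample string are made up for illustration):

```vba
' Sketch: Application.PlainText is only available in Access VBA,
' not in Excel or Word.
Sub DemoPlainText()
    Dim richText As String

    richText = "<div>Hello <b>world</b></div>"

    ' Prints the text with the markup removed
    Debug.Print Application.PlainText(richText)
End Sub
```

PlainText is designed for the rich text that Access memo fields store, so its handling of arbitrary web page markup may differ from the HTMLDocument approach above.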

Replace Module Text in MS Access using VBA

How do I do a search and replace of text within a module in Access from another module in Access? I could not find this on Google.
FYI, I figured out how to delete a module programmatically (the module name is passed as a string):
Call DoCmd.DeleteObject(acModule, "modBase64")
I assume you mean how to do this programmatically (otherwise it's just Ctrl+H). Unless this is being done in the context of a VBE add-in, it is rarely (if ever) a good idea. Self-modifying code is often flagged by AV software, and although Access will let you do it, it's not really robust enough to handle it, which can lead to corruption problems etc. In addition, if you go with self-modifying code you are preventing yourself from ever being able to use an MDE or even a project password. In other words, you will never be able to protect your code. It might be better if you let us know what problem you are trying to solve with self-modifying code and see if a more reliable solution could be found.
After a lot of searching I found this code:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Function to Search for a String in a Code Module. It will return True if it is found and
'False if it is not. It has an optional parameter (NewString) that will allow you to
'replace the found text with the NewString. If NewString is not included in the call
'to the function, the function will only find the string not replace it.
'
'Created by Joe Kendall 02/07/2003
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Public Function SearchOrReplace(ByVal ModuleName As String, ByVal StringToFind As String, _
Optional ByVal NewString, Optional ByVal FindWholeWord = False, _
Optional ByVal MatchCase = False, Optional ByVal PatternSearch = False) As Boolean
Dim mdl As Module
Dim lSLine As Long
Dim lELine As Long
Dim lSCol As Long
Dim lECol As Long
Dim sLine As String
Dim lLineLen As Long
Dim lBefore As Long
Dim lAfter As Long
Dim sLeft As String
Dim sRight As String
Dim sNewLine As String
Set mdl = Modules(ModuleName)
If mdl.Find(StringToFind, lSLine, lSCol, lELine, lECol, FindWholeWord, _
MatchCase, PatternSearch) = True Then
If IsMissing(NewString) = False Then
' Store text of line containing string.
sLine = mdl.Lines(lSLine, Abs(lELine - lSLine) + 1)
' Determine length of line.
lLineLen = Len(sLine)
' Determine number of characters preceding search text.
lBefore = lSCol - 1
' Determine number of characters following search text.
lAfter = lLineLen - CInt(lECol - 1)
' Store characters to left of search text.
sLeft = Left$(sLine, lBefore)
' Store characters to right of search text.
sRight = Right$(sLine, lAfter)
' Construct string with replacement text.
sNewLine = sLeft & NewString & sRight
' Replace original line.
mdl.ReplaceLine lSLine, sNewLine
End If
SearchOrReplace = True
Else
SearchOrReplace = False
End If
Set mdl = Nothing
End Function
Check out the VBA object browser for the Access library. Under the Module object you can search the module text as well as make replacements. Here is a simple example:
In Module1
Sub MyFirstSub()
MsgBox "This is a test"
End Sub
In Module2
Sub ChangeTextSub()
Dim i As Integer
With Application.Modules("Module1")
For i = 1 To .CountOfLines
If InStr(.Lines(i, 1), "This is a test") > 0 Then
.ReplaceLine i, "Msgbox ""It worked!"""
End If
Next i
End With
End Sub
After running ChangeTextSub, MyFirstSub should read
Sub MyFirstSub()
MsgBox "It worked!"
End Sub
It's a pretty simple search but hopefully that can get you going.
An addition to the SearchOrReplace function above (looping through all the lines):
Public Function ReplaceWithLine(modulename As String, StringToFind As String, NewString As String)
Dim mdl As Module
Dim x As Long
Set mdl = Modules(modulename)
' Call SearchOrReplace once per line so that every occurrence gets a chance to be replaced
For x = 1 To mdl.CountOfLines
Call SearchOrReplace(modulename, StringToFind, NewString)
Next x
Set mdl = Nothing
End Function
Enjoy ^^
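A more direct variant, sketched here as an alternative rather than as a replacement for the code above, walks the module once and rewrites each matching line with Replace. It uses only the Module members already shown in this thread (CountOfLines, Lines, ReplaceLine); the name ReplaceInModule is hypothetical, and it assumes the target module is open and that newText contains no line breaks.

```vba
' Sketch: replaces findText with newText on every line of an open module
' and returns the number of lines that were changed.
Public Function ReplaceInModule(ByVal moduleName As String, ByVal findText As String, _
                                ByVal newText As String) As Long
    Dim mdl As Module
    Dim i As Long
    Dim lineText As String

    Set mdl = Modules(moduleName)

    ' Module lines are numbered from 1
    For i = 1 To mdl.CountOfLines
        lineText = mdl.Lines(i, 1)
        If InStr(lineText, findText) > 0 Then
            mdl.ReplaceLine i, Replace(lineText, findText, newText)
            ReplaceInModule = ReplaceInModule + 1
        End If
    Next i

    Set mdl = Nothing
End Function
```

Because each line is visited exactly once, this does not loop indefinitely when newText itself contains findText. The earlier warnings about self-modifying code still apply.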