Removing a portion of HTML within an HTML page

I am trying to remove some tags, along with their content, while loading a page, so that those tags are never sent to the client.
I was doing this with a string search, which does not work well for larger data sets.
string startTag = "<section>"+Environment.NewLine+
" <div id=\"nonPrintable123\">";
var startIndex = htmlString.IndexOf(startTag);
var html = htmlString.Substring(0, startIndex) + "</div></form> </body></html>";
Is there any way I could use Regex to remove or replace a whole div, including its children, with an empty string?
The data within <Section> {data} </Section>
should be replaced with an empty string or suppressed in some other way.

Using String.Replace has worked for me in the past.
https://learn.microsoft.com/en-us/dotnet/api/system.string.replace?view=netframework-4.7.2
startString = startString.Replace("<div>HTML you want to replace</div>", "")

I did it with the following piece of VB.NET code:
Private Sub removehtml()
Dim str As String = " <div id=nonPrintable123> <!--# Start --> hjhjhty iuh hwjkednjkb dvhv xcaisfdchascjk bkasj df kh <!--End #-->"
Dim sindex As Integer = 0
Dim eindex As Integer = 0
sindex = str.IndexOf("<!--#")
eindex = str.IndexOf("#-->")
Dim substr As String = String.Empty
substr = str.Substring(sindex, (eindex - sindex) + 4)
str = str.Replace(substr, String.Empty)
End Sub
This way I removed all the unneeded data from the given string.
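As for the Regex route asked about in the question, a minimal sketch could look like the following. It assumes the sections to drop are not nested inside one another and that the markup is regular enough for a pattern to be safe; the module and function names (RemoveSectionDemo, RemoveSections) are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RemoveSectionDemo
    ' Removes every <section>...</section> block, including the tags themselves.
    ' Singleline lets "." match newlines; the lazy ".*?" stops at the first closing tag.
    Function RemoveSections(ByVal html As String) As String
        Return Regex.Replace(html, "<section\b[^>]*>.*?</section>", String.Empty,
                             RegexOptions.Singleline Or RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim page As String = "<body><p>keep</p><section>" & Environment.NewLine &
                             " <div id=""nonPrintable123"">drop</div></section></body>"
        Console.WriteLine(RemoveSections(page)) ' prints "<body><p>keep</p></body>"
    End Sub
End Module
```

Because the match is lazy, a section nested inside another section would be cut at the inner closing tag; for markup like that, an HTML parser is the safer choice.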

Related

Is there a better way to extract HTML code using Visual Basic

I'm trying to extract some text from HTML code here; I only want the final string to say
'Entity B'. Is there a better way to do this than what I have done here?
This is also the format of many entries, so Entity B won't always be Entity B, and the same goes for Entity C.
SMethod = "<b>Entity B<br/>Entity C</b>"
SMethod = SMethod.Replace("</b>", "</c>")
SMethod = SMethod.Replace("<br/>", "</b><c>")
SMethod = "<a>" & SMethod & "</a>"
Dim ShippingMethod As XDocument = XDocument.Parse(SMethod)
SMethod = ShippingMethod.Element("a").Element("b").Value.Trim
I'm not 100% clear what your end game is, but as for the first sentence of your question where you want to take a string with HTML code in it and remove all the code, this function will remove any tag enclosed in <>:
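Public Function RemoveHTML(ByVal input As String) As String
    Dim tagStart As Integer = InStr(input, "<")
    While tagStart > 0
        ' Search for the closing bracket only after the opening one.
        Dim tagEnd As Integer = InStr(tagStart, input, ">")
        ' An unclosed "<" would otherwise loop forever; leave the rest untouched.
        If tagEnd = 0 Then Exit While
        input = Left(input, tagStart - 1) & Right(input, Len(input) - tagEnd)
        tagStart = InStr(input, "<")
    End While
    Return input
End Function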
And if you're also trying to trim off anything after the <br/> tag, this will do that:
Public Function OneEntity(ByVal input As String) As String
If InStr(input, "<br/>") > 0 Then
Dim parts() As String = Split(input, "<br/>")
Return RemoveHTML(parts(0))
Else
Return RemoveHTML(input)
End If
End Function
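A Regex-based alternative is sketched below. It assumes every entry has the shape shown in the question (a <b> element whose first item is followed by a <br/>); FirstEntityDemo and FirstEntity are illustrative names, not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstEntityDemo
    ' Returns the text between <b> and the first <br/>.
    ' If that shape is not found, falls back to stripping all tags from the input.
    Function FirstEntity(ByVal input As String) As String
        Dim m As Match = Regex.Match(input, "<b>\s*(.*?)\s*<br\s*/?>", RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Regex.Replace(input, "<[^>]*>", String.Empty).Trim()
    End Function

    Sub Main()
        Console.WriteLine(FirstEntity("<b>Entity B<br/>Entity C</b>")) ' prints "Entity B"
    End Sub
End Module
```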

Get First Vid From Youtube VB.NET

I'm trying to get the first YouTube link from YouTube or Google, but I can't get it to work. Can someone please help me out?
Dim m As New Regex("<a href=""/watch?v=.*""")
Dim request2 As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create("https://www.youtube.com/results?search_query=" + ListBox1.SelectedItem + " " + ListBox2.SelectedItem)
Dim responseyoutube As System.Net.HttpWebResponse = request2.GetResponse
TextBox2.Text = (request2.Address.ToString)
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(responseyoutube.GetResponseStream())
Dim rssourcecodey As String = sr.ReadToEnd
Dim matches As MatchCollection = m.Matches(rssourcecodey)
TextBox1.Text = rssourcecodey
For Each itemcode2 As Match In matches
youtube = itemcode2.Value.Split("=").GetValue(1)
ListBox2.Items.Add(youtube)
Next
? is a regex metacharacter that makes the preceding token optional (after * or + it makes that quantifier lazy instead). So you need to escape the ? in order to match a literal ? character.
Dim m As New Regex("<a href=""/watch[?]v=.*""")
OR
Dim m As New Regex("<a href=""/watch\?v=.*""")
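Two details are worth spelling out here. VB.NET string literals do not treat the backslash as an escape character, so a single \? is what reaches the regex engine. Also, the greedy .* runs to the last quote on the line, so a negated character class is a tighter way to stop at the end of the link. The sketch below shows both; the sample markup is hypothetical and assumes the page source contains anchors of this shape, and WatchLinkDemo and FirstVideoId are illustrative names.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WatchLinkDemo
    ' "\?" matches a literal question mark.
    ' The group captures the video id up to the next quote or ampersand.
    Function FirstVideoId(ByVal html As String) As String
        Dim m As Match = Regex.Match(html, "<a href=""/watch\?v=([^""&]+)")
        Return If(m.Success, m.Groups(1).Value, String.Empty)
    End Function

    Sub Main()
        ' Hypothetical markup for illustration only, not real YouTube output.
        Dim sample As String = "<a href=""/watch?v=abc123"" class=""x""><a href=""/watch?v=def456"">"
        Console.WriteLine(FirstVideoId(sample)) ' prints "abc123"
    End Sub
End Module
```

Regex.Match returns only the first match, which fits the goal of taking the first link; Regex.Matches would return all of them.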

Parsing HTML to recreate tables in a Word Document using VBA

Is there a way of taking the HTML code for a table and printing out the same table in a Word document using VBA (VBA should be able to parse the HTML code block for a table)?
It is possible to take the contents of the table and copy them into a new table created in Word; however, is it possible to recreate the table directly from the HTML code using VBA?
For any of this, where can one begin to research?
EDIT:
Thanks to R3uK: here is the first portion of the VBA script, which reads a line of HTML code from a file and uses R3uK's code to print it to the Excel worksheet:
Private Sub button1_Click()
Dim the_string As String
the_string = Trim(ImportTextFile("path\to\file.txt"))
' still working on removing new line characters
Call PrintHTML_Table(the_string)
End Sub
Public Function ImportTextFile(strFile As String) As String
' http://mrspreadsheets.com/1/post/2013/09/vba-code-snippet-22-read-entire-text-file-into-string-variable.html
Open strFile For Input As #1
ImportTextFile = Input$(LOF(1), 1)
Close #1
End Function
' Insert R3uK's portion of the code here
This could be a good place to start; afterwards you only need to check the content for problems and then copy it to Word.
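Sub PrintHTML_Table(ByVal StrTable As String)
    ' SplitTo2DArray returns a String array, so TA must be declared as String too.
    Dim TA() As String
    Dim Table_String As String
    Dim i As Long, j As Long
    Table_String = " " & StrTable & " "
    TA = SplitTo2DArray(Table_String, "</tr>", "</td>")
    For i = LBound(TA, 1) To UBound(TA, 1)
        For j = LBound(TA, 2) To UBound(TA, 2)
            ' TA is indexed (column, row), so the row index j goes first in Cells.
            ActiveSheet.Cells(j + 1, i + 1) = Trim(Replace(Replace(TA(i, j), "<td>", ""), "<tr>", ""))
        Next j
    Next i
End Sub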
Public Function SplitTo2DArray(ByRef StringToSplit As String, ByRef RowSep As String, ByRef ColSep As String) As String()
Dim Rows As Variant
Dim rowNb As Long
Dim Columns() As Variant
Dim i As Long
Dim maxlineNb As Long
Dim lineNb As Long
Dim asCells() As String
Dim j As Long
' Split up the table value by rows, get the number of rows, and dim a new array of Variants.
Rows = Split(StringToSplit, RowSep)
rowNb = UBound(Rows)
ReDim Columns(0 To rowNb)
' Iterate through each row, and split it into columns. Find the maximum number of columns.
maxlineNb = 0
For i = 0 To rowNb
Columns(i) = Split(Rows(i), ColSep)
lineNb = UBound(Columns(i))
If lineNb > maxlineNb Then
maxlineNb = lineNb
End If
Next i
' Create a 2D string array to contain the data in <Columns>.
ReDim asCells(0 To maxlineNb, 0 To rowNb)
' Copy all the data from Columns() to asCells().
For i = 0 To rowNb
For j = 0 To UBound(Columns(i))
asCells(j, i) = Columns(i)(j)
Next j
Next i
SplitTo2DArray = asCells()
End Function

Extract specific html string from html source code(website) in vb.net

I have the full HTML source code of the website, and I want to extract the data between a specific pair of div tags.
Here is my code:
Dim request As WebRequest = WebRequest.Create("https://www.crowdsurge.com/store/index.php?storeid=1056&menu=detail&eventid=41815")
Using response As WebResponse = request.GetResponse()
Using reader As New StreamReader(response.GetResponseStream())
html = reader.ReadToEnd()
End Using
End Using
Dim pattern1 As String = "<div class = ""ei_value ei_date"">(.*)"
Dim m As Match = Regex.Match(html, pattern1)
If m.Success Then
MsgBox(m.Groups(1).Value)
End If
An easier approach for parsing HTML (especially from a source that you don't control) is to use the HTML Agility Pack, which would allow you to do something a little like:
Dim req As WebRequest = WebRequest.Create("https://www.crowdsurge.com/store/index.php?storeid=1056&menu=detail&eventid=41815")
Dim doc As New HtmlDocument()
Using res As WebResponse = req.GetResponse()
doc.Load(res.GetResponseStream())
End Using
Dim nodes = doc.DocumentNode.SelectNodes("//div[@class='ei_value ei_date']")
If nodes IsNot Nothing Then
For Each node In nodes
MsgBox(node.InnerText)
Next
End If
(I've assumed Option Infer)
Try this:
Dim pattern1 As String = "<div class\s*=\s*""ei_value ei_date"">(.*?)</div>"
or
Dim pattern1 As String = "<div class=""ei_value ei_date"">(.*?)</div>"
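Note that by default "." does not match a newline, so a pattern like these only finds the div when its content sits on one line. The sketch below, with a made-up sample string and illustrative names (DivContentDemo, DivContent), adds RegexOptions.Singleline for that case. It assumes the div contains no nested div.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DivContentDemo
    ' Returns the trimmed inner text of the first matching div, or "" if none is found.
    Function DivContent(ByVal html As String) As String
        Dim pattern As String = "<div\s+class\s*=\s*""ei_value ei_date""[^>]*>(.*?)</div>"
        ' Singleline lets "." cross line breaks inside the div.
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        Return If(m.Success, m.Groups(1).Value.Trim(), String.Empty)
    End Function

    Sub Main()
        ' Made-up sample markup; the real page content is not reproduced here.
        Dim sample As String = "<div class=""ei_value ei_date"">" & Environment.NewLine &
                               "  sample date text  " & Environment.NewLine & "</div>"
        Console.WriteLine(DivContent(sample)) ' prints "sample date text"
    End Sub
End Module
```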

How to extract text content from tags in .NET?

I'm trying to code a vb.net function to extract specific text content from tags; I wrote this function
Public Function GetTagContent(ByRef instance_handler As String, ByRef start_tag As String, ByRef end_tag As String) As String
Dim s As String = ""
Dim content() As String = instance_handler.Split(start_tag)
If content.Count > 1 Then
Dim parts() As String = content(1).Split(end_tag)
If parts.Count > 0 Then
s = parts(0)
End If
End If
Return s
End Function
But it doesn't work, for example with the following debug code
Dim testString As String = "<body>my example <div style=""margin-top:20px""> text to extract </div> <br /> another line.</body>"
txtOutput.Text = testString.GetTagContent("<div style=""margin-top:20px"">", "</div>")
I get only the string "body>my example" instead of "text to extract".
Can anyone help me? Thanks in advance.
I wrote a new routine and the following code works; however, I would like to know whether there is a better-performing way to do it:
Dim s As New StringBuilder()
Dim i As Integer = instance_handler.IndexOf(start_tag, 0)
If i < 0 Then
Return ""
Else
i = i + start_tag.Length
End If
Dim j As Integer = instance_handler.IndexOf(end_tag, i)
If j < 0 Then
s.Append(instance_handler.Substring(i))
Else
s.Append(instance_handler.Substring(i, j - i))
End If
Return s.ToString
XPath is one way of accomplishing this task. I'm sure others will suggest LINQ. Here's an example using XPath:
Dim testString As String = "<body>my example <div style=""margin-top:20px""> text to extract </div> <br /> another line.</body>"
Dim doc As XmlDocument = New XmlDocument()
doc.LoadXml(testString)
MessageBox.Show(doc.SelectSingleNode("/body/div").InnerText)
Obviously, a more complex document may require a more complex xpath than simply "/body/div", but it's still pretty simple.
If you need to get a list of multiple elements that match the path, you can use doc.SelectNodes.
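If the Split-based approach from the question is preferred, the result reported there is consistent with Split using only the first character of the tag as the separator. Passing the separator explicitly as a String array avoids that. The following is a sketch under that assumption, written as a plain function rather than an extension method; TagContentDemo is an illustrative name.

```vbnet
Imports System

Module TagContentDemo
    ' Returns the text between the first startTag and the next endTag.
    ' Returns "" when startTag is missing; returns the remainder when endTag is missing.
    Function GetTagContent(ByVal source As String, ByVal startTag As String, ByVal endTag As String) As String
        ' Split on the whole tag, into at most two pieces.
        Dim afterStart() As String = source.Split(New String() {startTag}, 2, StringSplitOptions.None)
        If afterStart.Length < 2 Then Return String.Empty
        Return afterStart(1).Split(New String() {endTag}, 2, StringSplitOptions.None)(0)
    End Function

    Sub Main()
        Dim testString As String = "<body>my example <div style=""margin-top:20px""> text to extract </div> <br /> another line.</body>"
        Console.WriteLine(GetTagContent(testString, "<div style=""margin-top:20px"">", "</div>").Trim()) ' prints "text to extract"
    End Sub
End Module
```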