Add newline in HTML source code using HTML Agility Pack - html

I am modifying an HTML file using the HTML Agility Pack.
Here is an example on an HTML file containing tables:
Dim document As New HtmlDocument
Dim tables As Array
document.Load(path_html)
Dim div1 As HtmlNode = HtmlNode.CreateNode("<div></div>")
Dim div2 As HtmlNode = HtmlNode.CreateNode("<div></div>")
tables = document.DocumentNode.Descendants("table").ToArray()
For Each tr As HtmlNode In tables.Descendants("tr").ToArray
tr.AppendChild(div1)
tr.AppendChild(div2)
Next
document.save(path_html)
And here is the result in the HTML file:
<div></div><div></div>
What I would like is:
<div></div>
<div></div>
I think this should be implemented by default as it makes my HTML file unclear.
I saw this question (which is my exact issue) here but the answer is not working for me (maybe because of VB.NET and the answer is C#).
Can anyone help?

Haven't written any vb.net in a long time, so first tried this in C#:
var document = new HtmlDocument();
var div = HtmlNode.CreateNode("<div></div>");
var newline = HtmlNode.CreateNode("\r\n");
div.AppendChild(newline);
for (int i = 0; i < 2; ++i)
{
div.AppendChild(HtmlNode.CreateNode("<div></div>"));
div.AppendChild(newline);
}
document.DocumentNode.AppendChild(div);
Console.WriteLine(document.DocumentNode.WriteTo());
Works great - the output:
<div>
<div></div>
<div></div>
</div>
Then thought, "no way....it can't be" - note the commented lines:
Dim document = New HtmlDocument()
Dim div = HtmlNode.CreateNode("<div></div>")
' this writes the literal string...
Dim newline = HtmlNode.CreateNode("\r\n")
' this works!
' Dim newline = HtmlNode.CreateNode(Environment.NewLine)
div.AppendChild(newline)
For i = 1 To 2
div.AppendChild(HtmlNode.CreateNode("<div></div>"))
div.AppendChild(newline)
Next
document.DocumentNode.AppendChild(div)
Console.WriteLine(document.DocumentNode.WriteTo())
Unfortunately it is so, and that is probably why the question you linked to was never marked as answered. VB.NET string literals do not process backslash escapes, so the output is:
<div>\r\n<div></div>\r\n<div></div>\r\n</div>
Finally, instead of the "\r\n" string I tried Environment.NewLine, which does work and outputs:
<div>
<div></div>
<div></div>
</div>
Works either way in C#.
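The difference comes down to how each language reads the string literal: C# treats "\r\n" as an escape sequence, while VB.NET has no backslash escapes and keeps all four characters verbatim (hence Environment.NewLine, or the vbCrLf constant). Here is a minimal sketch of the same distinction, written in Python only because raw strings make the two cases easy to compare side by side. The wrap helper and variable names are illustrative and not part of the HTML Agility Pack.

```python
# Raw string: backslashes are kept verbatim, which mirrors what the VB.NET
# literal "\r\n" contains (four characters: \ r \ n).
literal = r"\r\n"

# Escaped string: two control characters (CR, LF), which mirrors the C#
# literal "\r\n" and Environment.NewLine on Windows.
newline = "\r\n"


def wrap(separator: str) -> str:
    """Build the nested <div> markup from the answer with the given separator."""
    inner = separator.join(["<div></div>", "<div></div>"])
    return f"<div>{separator}{inner}{separator}</div>"


print(wrap(literal))  # one line, with visible backslash sequences
print(wrap(newline))  # four lines
```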

Based on this answer you would need to add in a node that represents a Carriage Return (\r) and a Line Feed (\n):
Dim newLineNode As HtmlNode = HtmlNode.CreateNode("\r\n")
Based on your comment:
I tried this but it adds '\r\n' in my HTML, it's not going back to line.
You've already tried this and instead it prints the string literal "\r\n". I too have managed to replicate this issue.
Instead, look at using a <br> tag, which is a line break in the rendered page:
Dim newLineNode As HtmlNode = HtmlNode.CreateNode("<br>")
Based on your example code, your code would look something like this:
Dim newLineNode As HtmlNode = HtmlNode.CreateNode("<br>")
For Each tr As HtmlNode In tables.Descendants("tr").ToArray
tr.AppendChild(div1)
tr.AppendChild(newLineNode)
tr.AppendChild(div2)
Next
However tables.Descendants("tr").ToArray did provide a compile error for me. As that's out of the scope of this question and you haven't raised it as an issue I'll make an assumption that it works for you.

Related

How to Remove a specific Img tag from string

Using visual basic
I have a string that contains HTML inside of it. There may be many img tags inside of it, but there is an img tag with a specific alt attribute that I want to remove.
How do I remove the entire img tag from the string if it contains 'badImage' as the alt attribute? I still want to keep any other img tags that may be inside the string.
Dim myString As String = "<html><body><span>some text here..</span><img src='#' alt='goodImage'/><span>more text...</span><img src='#' alt='badImage'/></body></html>"
I have the following code so far, but it removes ALL img tags from the string, whereas I only want to remove the img tag with the 'badImage' alt attribute. Is this possible?
Dim imgRegex As New Regex("<img[^>]*>", RegexOptions.IgnoreCase)
myString = imgRegex.Replace(myString, "")
Please answer in VB.Net. Thanks for any assistance!
Assuming the HTML source is well-formed XML, you can load it into an XmlDocument and remove the unwanted elements, identifying them by their "alt" attribute.
A short demonstration:
Imports System.Linq
Imports System.Xml

Function ClearBadImgTags(source As String) As String
Dim xDoc As New XmlDocument
Try
xDoc.LoadXml(source)
' Materialize the matches with ToList() before removing anything:
' GetElementsByTagName returns a live list, so a lazy query would be
' re-evaluated (and shrink) after every removal.
Dim badImgs = (From img In xDoc.GetElementsByTagName("img").Cast(Of XmlElement)()
Where img.GetAttribute("alt") = "badImage").ToList()
For Each img As XmlElement In badImgs
img.ParentNode.RemoveChild(img)
Next
Return xDoc.OuterXml
Catch ex As Exception
' Bad XML, or something else went wrong.
End Try
Return ""
End Function
Then:
Dim myString As String = "<html><body><span>some text here..</span><img src='#' alt='goodImage'/><span>more text...</span><img src='#' alt='badImage'/></body></html>"
Dim newString As String = ClearBadImgTags(myString)
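For comparison, the same idea (parse the markup, then remove only the elements whose alt attribute matches) can be sketched with Python's standard xml.etree.ElementTree. This is a sketch under the same assumption as the answer, namely that the input is well-formed XML. The function name and the bad_alt parameter are made up for illustration. ElementTree stores the text that follows an element in that element's tail, so the sketch moves the tail before removing the node.

```python
import xml.etree.ElementTree as ET


def clear_bad_img_tags(source: str, bad_alt: str = "badImage") -> str:
    """Remove <img> elements whose alt attribute equals bad_alt."""
    root = ET.fromstring(source)  # raises ET.ParseError if not well-formed

    # Collect (parent, child) pairs first so the tree is not modified
    # while it is being iterated.
    doomed = [
        (parent, child)
        for parent in root.iter()
        for child in list(parent)
        if child.tag == "img" and child.get("alt") == bad_alt
    ]

    for parent, child in doomed:
        if child.tail:
            # Keep the text that followed the removed element.
            index = list(parent).index(child)
            if index > 0:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + child.tail
            else:
                parent.text = (parent.text or "") + child.tail
        parent.remove(child)

    return ET.tostring(root, encoding="unicode")


sample = ("<html><body><span>some text here..</span><img src='#' alt='goodImage'/>"
          "<span>more text...</span><img src='#' alt='badImage'/></body></html>")
print(clear_bad_img_tags(sample))
```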

Loop Through HTML Elements and Nodes

I'm working on an HTML page highlighter project but ran into problems when a search term matches an HTML tag name, an attribute name, or a class/ID name; e.g. if the search terms are "media OR class OR content", my find and replace does this:
<link href="/css/DocHighlighter.css" <span style='background-color:yellow;font-weight:bold;'>media</span>="all" rel="stylesheet" type="text/css">
<div <span style='background-color:yellow;font-weight:bold;'>class</span>="container">
I'm using Lucene for highlighting and my current code (sort of):
InputStreamReader xmlReader = new InputStreamReader(xmlConn.getInputStream(), "UTF-8");
Highlighter hl = null;
if (searchTerms != null && !searchTerms.isEmpty()) {
QueryScorer qryScore = new QueryScorer(qp.parse(searchTerms));
hl = new Highlighter(new SimpleHTMLFormatter(hlStart, hlEnd), qryScore);
}
if (xmlReader != null) {
BufferedReader br = new BufferedReader(xmlReader);
String inputLine;
while ((inputLine = br.readLine()) != null) {
String tmp = inputLine.trim();
StringReader strReader = new StringReader(tmp);
HTMLStripCharFilter htm = new HTMLStripCharFilter(strReader.markSupported() ? strReader : new BufferedReader(strReader));
// Only highlight when search terms were supplied.
String tHL = (hl == null ? null : hl.getBestFragment(analyzer, "", htm));
tmp = (tHL == null ? tmp : tHL);
xmlDoc += tmp;
}
br.close();
}
As you can see (if you understand Lucene highlighting), this does an indiscriminate find/replace. Since my document will be HTML and the search terms are dictated by users, there is no way for me to parse on certain elements or tags. Also, since the find/replace basically loops and appends the HTML to a string (the return type of the method), I have to keep all HTML tags and values in place and in order. I've tried using Jsoup to loop through the page, but it handles the HTML tag as one big result. I also tried TagSoup to remove the broken HTML caused by the problem, but it doesn't work correctly. Does anyone know how to loop through the elements and nodes (data values) of HTML?
I've been having the most luck with this
StringBuilder sb = new StringBuilder();
sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>");
Document doc = Jsoup.parse(txt.getResult());
Elements elements = doc.getAllElements();
for (Element e : elements) {
if (!(e.tagName().equalsIgnoreCase("#root"))) {
sb.append("<" + e.tagName() + e.attributes() + ">" + e.ownText() + "\n");
}// end if
}// end for
return sb;
The one snag I still get is the nesting isn't always "repaired" properly but still semi close. I'm working more on this.
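One way to avoid touching tag and attribute names at all is to walk the document with an event-based parser, copy tags through verbatim, and apply the search-term replacement only to text nodes. The following is a minimal sketch of that idea using Python's standard html.parser. It is not the Lucene or Jsoup API. The class name, the highlight function, and the use of <mark> as the highlight wrapper are all assumptions made for illustration, and terms is assumed to be a non-empty list.

```python
import re
from html.parser import HTMLParser


class TextOnlyHighlighter(HTMLParser):
    """Re-emit HTML unchanged except for search terms found in text nodes."""

    def __init__(self, terms):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
        self.out = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.raw_depth += 1
        # Original tag text, so attribute names and values stay untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.raw_depth:
            self.out.append(data)  # leave script/style content alone
        else:
            self.out.append(self.pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def highlight(markup, terms):
    parser = TextOnlyHighlighter(terms)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(highlight('<div class="container">class and media</div>', ["media", "class"]))
```

Because tags are copied through as written, the nesting of the original document is preserved rather than rebuilt. A term that spans two text nodes would not be matched by this sketch.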

Inserting line-break into XML so that it appears after XSL rendering in VB.NET

I have a System.Xml.XmlDocument object which is rendered onto a web page using XSL. I want to insert a line break inside certain nodes in the XML object, so that when the XML is rendered using XSLT there is an actual line break there. My code to do this looks like this:
Dim parentNodes As System.Xml.XmlNodeList = objOutput.SelectNodes("//PARENT")
Dim currentParentValue As String = String.Empty
Dim resultParent As String = String.Empty
For Each par As System.Xml.XmlNode In parentNodes
currentParentValue = par.InnerText
Dim parArray As String() = currentParentValue.Split(";")
If parArray.Length > 2 Then
resultParent = String.Empty
Dim parCounter As Integer = 0
For Each Parent As String In parArray
parCounter = parCounter + 1
resultParent = resultParent + Parent + "; "
If (parCounter Mod 2) = 0 Then
resultParent = resultParent + "&#xA;"
End If
Next
End If
par.InnerText = resultParent
Next
And in XSL:
<td width="50%" nowrap="nowrap">
<xsl:value-of select="STUDENT_DETAILS/PARENT"/>
</td>
However, it looks like XmlDocument automatically escapes the line-break entity, so it just appears as text on the page. Can anyone tell me how to fix this?
If you change
<td width="50%" nowrap="nowrap">
<xsl:value-of select="STUDENT_DETAILS/PARENT"/>
</td>
to
<td width="50%" nowrap="nowrap">
<pre>
<xsl:value-of select="STUDENT_DETAILS/PARENT"/>
</pre>
</td>
the browser will render line breaks.
You can simply append "<br/>" after your nodes; that will insert a line break between the two nodes.
Your problem revolves around this line:
resultParent = resultParent + "&#xA;"
Now, you are probably trying to output your XML like this:
<PARENT>George Aaron&#xA;Susan Lee Aaron&#xA;Richard Elliot Aaron&#xA;</PARENT>
However, this escaped &#xA; entity is only relevant if the document has yet to be parsed. If it were a text document that subsequently gets read and parsed into an XML document, then the entities would be handled as expected. But you are working with an XML document that has already been parsed. Therefore, when you do resultParent = resultParent + "&#xA;" it is actually going to insert a string of five characters into an existing text node, and because & is a special character, it gets escaped.
Now, what you can simply do is this...
resultParent = resultParent + chr(10)
But ultimately this will prove fruitless because HTML doesn't recognise line-break characters, so you would have to write your XSLT to replace the line break with a <br /> element.
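The escaping behaviour described above is not specific to XmlDocument. Any DOM-style API treats text assigned to a node as literal characters, so a character reference such as &#xA; is only decoded when markup is parsed. Here is a small sketch with Python's standard xml.etree.ElementTree, used purely as a stand-in for XmlDocument; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

parent = ET.Element("PARENT")

# Assigning the five characters "&#xA;" to an already-parsed node stores them
# literally, so the serializer escapes the ampersand.
parent.text = "George Aaron&#xA;Susan Lee Aaron"
as_entity_text = ET.tostring(parent, encoding="unicode")

# Assigning a real line-feed character (the equivalent of Chr(10)) stores an
# actual line break in the text node.
parent.text = "George Aaron\nSusan Lee Aaron"
as_line_feed = ET.tostring(parent, encoding="unicode")

# The reference is decoded only when markup is parsed.
parsed_text = ET.fromstring("<PARENT>George Aaron&#xA;Susan Lee Aaron</PARENT>").text

print(as_entity_text)
print(repr(as_line_feed))
print(repr(parsed_text))
```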
If you wanted to do this in your VB code though, you could create new br elements yourself, and insert them
For Each par As System.Xml.XmlNode In parentNodes
currentParentValue = par.InnerText
par.InnerText = String.Empty
Dim parArray As String() = currentParentValue.Split(";")
For Each Parent As String In parArray
If Parent.Length > 0 Then
Dim person As XmlText = objOutput.CreateTextNode(Parent)
par.AppendChild(person)
par.AppendChild(objOutput.CreateElement("br"))
End If
Next
Next
So, this takes the PARENT node, clears it down, then adds a text node, and new br element for each parent. The output would then be like so, which would be much easier to output as HTML using XSLT
<PARENT>George Aaron<br />Susan Lee Aaron<br />Richard Elliot Aaron<br /></PARENT>
(It shouldn't be too hard to add the br after every second parent if required).
However, it may not necessarily be a good idea to put "presentational" information in an XML file. Suppose you later had to transform the XML into a different format? An alternative approach would be to separate each parent into its own element.
For Each par As System.Xml.XmlNode In parentNodes
currentParentValue = par.InnerText
par.InnerText = String.Empty
Dim parArray As String() = currentParentValue.Split(";")
For Each Parent As String In parArray
If Parent.Length > 0 Then
Dim person As XmlElement = objOutput.CreateElement("PERSON")
person.InnerText = Parent.Trim()
par.AppendChild(person)
End If
Next
Next
This would output something like this..
<PARENT>
<PERSON>George Aaron</PERSON>
<PERSON>Susan Lee Aaron</PERSON>
<PERSON>Richard Elliot Aaron</PERSON>
<PERSON>Albert Smith</PERSON>
</PARENT>
Displaying this as HTML would also be straight-forward
Hint: To display in groups of two, your XSLT may look something like this....
<xsl:for-each select="PERSON[position() mod 2 = 1]">
<xsl:value-of select="." />;
<xsl:value-of select="following-sibling::PERSON[1]" />
<br />
</xsl:for-each>
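The restructuring suggested in this answer (split the semicolon-separated text into one PERSON element per name, then present the names two per line) can be sketched outside of VB.NET and XSLT as well. Below is a minimal illustration using Python's standard xml.etree.ElementTree. The function names split_parents and rows_of_two are hypothetical, and the pairing step only mimics what the XSLT hint does with position() mod 2.

```python
import xml.etree.ElementTree as ET


def split_parents(parent):
    """Replace 'A; B; C' text with one <PERSON> child per non-empty name."""
    names = [n.strip() for n in (parent.text or "").split(";") if n.strip()]
    parent.text = None
    for name in names:
        ET.SubElement(parent, "PERSON").text = name
    return parent


def rows_of_two(parent):
    """Group the PERSON names into display rows of at most two names."""
    people = [p.text for p in parent.findall("PERSON")]
    return ["; ".join(people[i:i + 2]) for i in range(0, len(people), 2)]


node = ET.fromstring("<PARENT>George Aaron; Susan Lee Aaron; Richard Elliot Aaron;</PARENT>")
split_parents(node)
print(ET.tostring(node, encoding="unicode"))
print(rows_of_two(node))
```

Keeping the names as separate elements leaves the line-break decision to whatever renders the data, which is the point the answer makes about presentational information.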

iTextSharp HTML to PDF preserving spaces

I am using the FreeTextBox.dll to get user input, and storing that information in HTML format in the database. A sample of the user's input is below:
                                                                     133 Peachtree St NE                                                                     Atlanta,  GA 30303                                                                     404-652-7777                                                                      Cindy Cooley                                                                     www.somecompany.com                                                                     Product Stewardship Mgr                                                                     9/9/2011Deidre's Company123 Test StAtlanta, GA 30303Test test.  
I want the HTMLWorker to preserve the white space the user enters, but it strips it out. Is there a way to preserve the user's white space? Below is an example of how I am creating my PDF document.
Public Shared Sub CreatePreviewPDF(ByVal vsHTML As String, ByVal vsFileName As String)
Dim output As New MemoryStream()
Dim oDocument As New Document(PageSize.LETTER)
Dim writer As PdfWriter = PdfWriter.GetInstance(oDocument, output)
Dim oFont As New Font(Font.FontFamily.TIMES_ROMAN, 8, Font.NORMAL, BaseColor.BLACK)
Using output
Using writer
Using oDocument
oDocument.Open()
Using sr As New StringReader(vsHTML)
Using worker As New html.simpleparser.HTMLWorker(oDocument)
worker.StartDocument()
worker.SetInsidePRE(True)
worker.Parse(sr)
worker.EndDocument()
worker.Close()
oDocument.Close()
End Using
End Using
HttpContext.Current.Response.ContentType = "application/pdf"
HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("attachment;filename={0}.pdf", vsFileName))
HttpContext.Current.Response.BinaryWrite(output.ToArray())
HttpContext.Current.Response.End()
End Using
End Using
output.Close()
End Using
End Sub
There's a glitch in iText and iTextSharp but you can fix it pretty easily if you don't mind downloading the source and recompiling it. You need to make a change to two files. Any changes I've made are commented inline in the code. Line numbers are based on the 5.1.2.0 code rev 240
The first is in iTextSharp.text.html.HtmlUtilities.cs. Look for the function EliminateWhiteSpace at line 249 and change it to:
public static String EliminateWhiteSpace(String content) {
// multiple spaces are reduced to one,
// newlines are treated as spaces,
// tabs, carriage returns are ignored.
StringBuilder buf = new StringBuilder();
int len = content.Length;
char character;
bool newline = false;
bool space = false;//Detect whether we have written at least one space already
for (int i = 0; i < len; i++) {
switch (character = content[i]) {
case ' ':
if (!newline && !space) {//If we are not at a new line AND ALSO did not just append a space
buf.Append(character);
space = true; //flag that we just wrote a space
}
break;
case '\n':
if (i > 0) {
newline = true;
buf.Append(' ');
}
break;
case '\r':
break;
case '\t':
break;
default:
newline = false;
space = false; //reset flag
buf.Append(character);
break;
}
}
return buf.ToString();
}
The second change is in iTextSharp.text.xml.simpleparser.SimpleXMLParser.cs. In the function Go at line 185 change line 248 to:
if (html /*&& nowhite*/) {//removed the nowhite check from here because that should be handled by the HTML parser later, not the XML parser
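To make the control flow of the patched EliminateWhiteSpace easier to follow, here is a line-for-line port of its logic to Python. It is only an illustration of what the C# above does (runs of spaces collapse to one, a newline becomes a single space, tabs and carriage returns are dropped). It is not part of iTextSharp, and the function name is made up.

```python
def eliminate_white_space(content: str) -> str:
    """Python port of the patched C# EliminateWhiteSpace shown above."""
    buf = []
    newline = False  # True right after a newline was turned into a space
    space = False    # True right after a space was written
    for i, ch in enumerate(content):
        if ch == " ":
            # Write a space only if we are not at a new line and did not
            # just write one.
            if not newline and not space:
                buf.append(ch)
                space = True
        elif ch == "\n":
            if i > 0:
                newline = True
                buf.append(" ")
        elif ch in "\r\t":
            pass  # carriage returns and tabs are ignored
        else:
            newline = False
            space = False
            buf.append(ch)
    return "".join(buf)


print(repr(eliminate_white_space("133   Peachtree\n   St\tNE\r")))
```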
Thanks for the help everyone. I was able to find a small work around by doing the following:
vsHTML.Replace(" ", "&nbsp;").Replace(Chr(9), "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;").Replace(Chr(160), "&nbsp;").Replace(vbCrLf, "<br />")
The first replace turns each space into &nbsp;, the second turns Chr(9) (a tab) into five &nbsp; entities, and the third turns Chr(160) (a non-breaking space character) into &nbsp;.
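The workaround amounts to converting whitespace characters into explicit markup before the HTML reaches the converter. Here is a rough sketch of that chain of replacements in Python, assuming (as the description suggests) that spaces and non-breaking spaces become &nbsp;, tabs become five &nbsp; entities, and CRLF becomes <br />. The function name is hypothetical. Note that a blanket replacement like this also rewrites spaces inside tags such as <a href="...">, so it is only safe on input where that does not matter.

```python
def preserve_whitespace(html_text: str) -> str:
    """Turn whitespace characters into markup that survives HTML parsing."""
    return (
        html_text
        .replace(" ", "&nbsp;")          # ordinary space
        .replace("\t", "&nbsp;" * 5)     # tab, Chr(9)
        .replace("\u00a0", "&nbsp;")     # non-breaking space, Chr(160)
        .replace("\r\n", "<br />")       # vbCrLf
    )


print(preserve_whitespace("Atlanta,  GA\t30303\r\n404-652-7777"))
```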
I would recommend using wkhtmltopdf instead of iText. wkhtmltopdf will output the html exactly as rendered by webkit (Google Chrome, Safari) instead of iText's conversion. It is just a binary that you can call. That being said, I might check the html to ensure that there are paragraphs and/or line breaks in the user input. They might be stripped out before the conversion.

replace keyword within html string

I am looking for a way to replace keywords within an HTML string with a variable. At the moment I am using the following example.
returnString = Replace(message, "[CustomerName]", customerName, CompareMethod.Text)
The above works fine if the formatting tags wrap the whole keyword.
eg.
<b>[CustomerName]</b>
However if the formatting of the keyword is split throughout the word, the string is not found and thus not replaced.
e.g.
<b>[Customer</b>Name]
The formatting of the string is out of my control and isn't foolproof. With this in mind what is the best approach to find a keyword within a html string?
Try using a regular expression. You can build and test your expressions here; I used this and it works well.
http://regex-test.com/validate/javascript/js_match
Use the textContent property instead of innerHTML if you're using JavaScript to access the content. That strips all tags from the content, giving you back a clean text representation of the customer's name.
For example, if the content looks like this:
<div id="name">
<b>[Customer</b>Name]
</div>
Then reading its textContent property gives:
var name = document.getElementById("name").textContent.trim();
// sets name to "[CustomerName]" without the tags
which should be easy to process. Do a regex search now if you need to.
Edit: Since you're doing this processing on the server side, process the XML recursively and collect the text elements of each node. Since I'm not big on VB.Net, here's some pseudocode:
getNodeText(node) {
text = ""
for each node.children as child {
if child.type == TextNode {
text += child.text
}
else {
text += getNodeText(child);
}
}
return text
}
myXml = xml.load(<html>);
print getNodeText(myXml);
And then replace or whatever there is to be done!
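The pseudocode above can be turned into something runnable with Python's standard html.parser, which reports text nodes one at a time and so makes the recursion unnecessary. This is only a sketch of the "collect the text, then search it" idea. The TextCollector class and get_node_text function are names made up here, and replacing the keyword in the flattened text discards the original formatting tags.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of a document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def get_node_text(markup: str) -> str:
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


text = get_node_text('<div id="name"><b>[Customer</b>Name]</div>')
print(text)
print(text.replace("[CustomerName]", "Alice"))
```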
I have found what I believe is a solution to this issue. Well in my scenario it is working.
The html input has been tweaked to place each custom field or keyword within a div with a set id. I have looped through all of the elements within the html string using mshtml and have set the inner text to the correct value when a match is found.
e.g.
Function ReplaceDetails(ByVal message As String, ByVal customerName As String) As String
Dim returnString As String = String.Empty
Dim doc As IHTMLDocument2 = New HTMLDocument
doc.write(message)
doc.close()
For Each el As IHTMLElement In doc.body.all
If (el.id = "Date") Then
el.innerText = Now.ToShortDateString
End If
If (el.id = "CustomerName") Then
el.innerText = customerName
End If
Next
returnString = doc.body.innerHTML
Return returnString
End Function
Thanks for all of the input. I'm glad to have a solution to the problem.
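The id-based approach described in this answer (wrap every placeholder in an element with a known id, then overwrite that element's inner text) does not depend on mshtml. Here is a minimal sketch of the same idea with Python's standard xml.etree.ElementTree, assuming the template is well-formed markup. The function name fill_by_id and the sample values are illustrative only.

```python
import xml.etree.ElementTree as ET


def fill_by_id(markup: str, values: dict) -> str:
    """Overwrite the content of every element whose id is a key in values."""
    root = ET.fromstring(markup)

    # Collect the targets first so the tree is not modified during iteration.
    targets = [el for el in root.iter() if el.get("id") in values]

    for el in targets:
        # Drop any nested formatting tags, then set the plain text,
        # similar to assigning innerText.
        for child in list(el):
            el.remove(child)
        el.text = values[el.get("id")]

    return ET.tostring(root, encoding="unicode")


template = ("<div><div id='CustomerName'><b>[Customer</b>Name]</div>"
            "<div id='Date'>[Date]</div></div>")
print(fill_by_id(template, {"CustomerName": "Alice", "Date": "2011-09-09"}))
```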