Reading embedded HTML in an XML document with QXmlStreamReader

Using QXmlStreamReader, I would like to have XML files with rich text formatting, using HTML tags. For instance, in this file, it would be nice to have access to <em> and other HTML tags for formatting text. (And with Qt I can put HTML anywhere, in a QLabel or something.)
<?xml version="1.0" encoding="UTF-8"?>
<course name="Introductory Course">
<course-description>Welcome to the <em>basic course</em>.</course-description>
</course>
If I use QXmlStreamReader::readElementText(QXmlStreamReader::IncludeChildElements) when at the start element of <course-description>, I get the text inside <course-description> stripped of the tags, for example Welcome to the basic course.
Of course, I would like to do this without having to account for every single HTML tag in my code.

What I ended up doing is creating a method that I can use in places where I would otherwise call QXmlStreamReader::readElementText. In the XML file, I mark a tag with the XHTML namespace:
<?xml version="1.0" encoding="UTF-8"?>
<course name="Introductory Course">
<course-description xmlns="http://www.w3.org/1999/xhtml">Welcome to the <em>basic course</em>.</course-description>
</course>
Then whenever I read a tag with QXmlStreamReader, I can call readHtml. If the element has the XHTML namespace, it reads and returns all the elements until it reaches the closing element. (This implies that an element with the same name as the namespace-containing element (<course-description> above), cannot be included in the HTML code.)
QString MyClass::readHtml(QXmlStreamReader &xml)
{
    // Elements outside the XHTML namespace are read as plain text.
    if( xml.namespaceUri().toString() != "http://www.w3.org/1999/xhtml" )
    {
        return xml.readElementText(QXmlStreamReader::IncludeChildElements);
    }
    // Copy tokens into the writer until an element with this name is reached again.
    QString terminatingElement = xml.name().toString();
    QString html;
    QXmlStreamWriter writer(&html);
    do
    {
        xml.readNext();
        switch( xml.tokenType() )
        {
        case QXmlStreamReader::StartElement:
            writer.writeStartElement(xml.name().toString());
            writer.writeAttributes(xml.attributes());
            break;
        case QXmlStreamReader::EndElement:
            writer.writeEndElement();
            break;
        case QXmlStreamReader::Characters:
            writer.writeCharacters(xml.text().toString());
            break;
        // a more thorough approach would handle these; enumerating them removes a compiler warning
        case QXmlStreamReader::NoToken:
        case QXmlStreamReader::Invalid:
        case QXmlStreamReader::StartDocument:
        case QXmlStreamReader::EndDocument:
        case QXmlStreamReader::Comment:
        case QXmlStreamReader::DTD:
        case QXmlStreamReader::EntityReference:
        case QXmlStreamReader::ProcessingInstruction:
            break;
        }
    }
    while (!xml.atEnd() && xml.name() != terminatingElement );
    return html;
}
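For comparison only, the same goal (keeping the child markup instead of flattening it to text) can be sketched with Python's standard xml.etree.ElementTree. This is not Qt code, and inner_markup is a hypothetical helper name. The sketch assumes the element has no default namespace (as in the first XML file above); with the XHTML namespace, ElementTree would add prefixes to the serialized tags.

```python
import xml.etree.ElementTree as ET


def inner_markup(element):
    """Return the content of `element` with child tags preserved."""
    # Text before the first child element, if any.
    parts = [element.text or ""]
    for child in element:
        # ET.tostring serializes the child and the text that follows it (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = (
    '<course name="Introductory Course">'
    "<course-description>Welcome to the <em>basic course</em>.</course-description>"
    "</course>"
)
root = ET.fromstring(doc)
print(inner_markup(root.find("course-description")))
```

Because the whole tree is parsed first, a nested element with the same name as the outer one is not a problem here, unlike in the streaming loop above.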

Related

How can I render a tr with soy templates?

I've got this soy template
{template .myRowTemplate}
<tr><td>Hello</td></tr>
{/template}
and I want to do something like
var myTable = goog.dom.createElement("table");
goog.dom.appendChild(myTable, goog.soy.renderAsFragment(mytemplates.myRowTemplate));
goog.dom.appendChild(myTable, goog.soy.renderAsFragment(mytemplates.myRowTemplate));
But that causes
Uncaught goog.asserts.AssertionError
Assertion failed: This template starts with a <tr>,
which cannot be a child of a <div>, as required by soy internals.
Consider using goog.soy.renderElement instead.
Template output: <tr><td>Hello</td></tr>
What's the best way to do this?

Why it fails
Right, the documentation of renderAsFragment is a bit confusing; it reads:
Renders a Soy template into a single node or a document fragment. If the rendered HTML string represents a single node, then that node is returned
However, the (simplified) implementation of renderAsFragment is:
var output = template(opt_templateData);
var html = goog.soy.ensureTemplateOutputHtml_(output);
goog.soy.assertFirstTagValid_(html); // This is your failure
var safeHtml = output.toSafeHtml();
return dom.safeHtmlToNode(safeHtml);
So why do the Closure authors assert that the first tag is not <tr>?
That's because, internally, safeHtmlToNode places safeHtml in a temporary div, and the HTML parser does not accept a <tr> as a direct child of a <div>. It then decides whether to return a document fragment holding all of the div's children (general case) or the only child (if the rendered HTML represents a single node). Once again simplified, the code of safeHtmlToNode is:
var tempDiv = goog.dom.createElement_(doc, goog.dom.TagName.DIV);
goog.dom.safe.setInnerHtml(tempDiv, html);
if (tempDiv.childNodes.length == 1) {
return tempDiv.removeChild(tempDiv.firstChild);
} else {
var fragment = doc.createDocumentFragment();
while (tempDiv.firstChild) {
fragment.appendChild(tempDiv.firstChild);
}
return fragment;
}
renderAsElement won't work either
I'm not sure why you are asking for fragments specifically, but unfortunately goog.soy.renderAsElement() will behave the same way, because it also uses a temporary div to render the DOM.
renderElement cannot loop
The error message suggests goog.soy.renderElement, but that will only work if your table has one row, since it replaces the element's content and doesn't append child nodes.
Recommended approach
So usually, we do the for loop in the template:
{template .myTable}
<table>
{foreach $person in $data.persons}
<tr><td>Hello {$person.name}</td></tr>
{/foreach}
</table>
{/template}
Of course, we can keep the simple template you have for one row and call it from the larger template.

Allow using some html tags in MVC 4

How can I allow the client to use HTML tags in MVC 4?
I would like to save records to the database and, when they are displayed in the view, allow only some HTML tags (<b>, <i>, <img>); all other tags must be shown as text.
My Controller:
[ValidateInput(false)]
[HttpPost]
public ActionResult Rep(String a)
{
var dbreader = new DataBaseReader();
var text = Request["report_text"];
dbreader.SendReport(text, uid, secret).ToString();
...
}
My View:
@{
    var dbreader = new DataBaseReader();
    var reports = dbreader.GetReports();
    foreach (var report in reports)
    {
        <div class="report_content">@Html.Raw(report.content)</div>
        ...
    }
}

You can replace all < characters with the HTML entity:
tags = tags.Replace("<", "&lt;");
Now, restore only the allowed tags:
tags = tags
    .Replace("&lt;b>", "<b>")
    .Replace("&lt;/b>", "</b>")
    .Replace("&lt;i>", "<i>")
    .Replace("&lt;/i>", "</i>")
    .Replace("&lt;img ", "<img ");
And render it to the page using @Html.Raw(tags)
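The escape-then-re-enable idea in this answer can be sketched in a few lines. This is a minimal illustration in Python rather than C#, and allow_basic_tags and ALLOWED are hypothetical names. The sketch deliberately re-enables only attribute-free tags: restoring "<img " as the answer does would also let through attributes such as onerror, so img is left out here.

```python
import html

# Tags that are re-enabled after escaping (assumption: attribute-free tags only).
ALLOWED = ("b", "i")


def allow_basic_tags(text: str) -> str:
    """Escape all markup, then restore the plain opening/closing allowed tags."""
    # quote=False escapes &, < and > but leaves quotes untouched.
    escaped = html.escape(text, quote=False)
    for tag in ALLOWED:
        escaped = escaped.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        escaped = escaped.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return escaped


print(allow_basic_tags("<b>hi</b> <script>x</script>"))
```

Anything that is not an exact match for an allowed tag stays escaped and is shown as text.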

If you want a property of your view model object to accept HTML text, use AllowHtmlAttribute
[AllowHtml]
public string UserComment{ get; set; }
and before binding to the view
model.UserComment=model.UserComment.Replace("<othertagstart/end>",""); //hard

Turn off validation for report_text (1) and write a custom HTML encoder (2):
Step 1:
Request.Unvalidated().Form["report_text"]
You don't need to turn off validation for the entire controller action.
Step 2:
Write a custom HTML encoder that converts all tags except b, i and img to entities (e.g. <script> becomes &lt;script&gt;), since you are customizing the default behaviour of request validation and HTML tag filtering. Also consider safeguarding yourself against SQL injection attacks by checking the SQL parameters passed to stored procedures, functions, etc.

You may want to check out BBCode (see BBCode on Wikipedia). This way you have some control over what is allowed and what is not, and you can prevent illegal usage.
This would work like this:
A user submits something like 'the meeting will now be on [b]monday![/b]'
Before saving it to your database you remove all real html tags ('< ... >') to avoid the use of illegal tags or code injection, but leave the pseudo tags as they are.
When viewed you convert only the allowed pseudo html tags into real html
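The BBCode steps above can be sketched as follows. This is a rough Python illustration, not a full BBCode parser: bbcode_to_html is a hypothetical name, only [b] and [i] are supported, and real HTML is escaped rather than removed, which is a swapped-in variation on the answer's "remove all real html tags" step.

```python
import html
import re

# Matches [b], [/b], [i] and [/i]; group 1 is the optional slash, group 2 the tag name.
_BBCODE = re.compile(r"\[(/?)(b|i)\]")


def bbcode_to_html(text: str) -> str:
    """Neutralize real HTML, then turn the allowed pseudo tags into real tags."""
    safe = html.escape(text)
    return _BBCODE.sub(r"<\1\2>", safe)


print(bbcode_to_html("the meeting will now be on [b]monday![/b]"))
```

The sketch does not check that pseudo tags are balanced; a stricter version would pair opening and closing tags before converting them.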

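I found a solution to my problem:
// assumes html has already been HTML-encoded, so tags appear as &lt;...&gt;
html = Regex.Replace(html, "&lt;b&gt;(.*?)&lt;/b&gt;", "<b>$1</b>");
html = Regex.Replace(html, "&lt;i&gt;(.*?)&lt;/i&gt;", "<i>$1</i>");
html = Regex.Replace(html, "&lt;img(?:.*?)src=&quot;(.*?)&quot;(?:.*?)/&gt;", "<img src=\"$1\"/>");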

Can use of HtmlAgilityPack be modified to only extract main part of HTML document?

I have some .NET code that ingests HTML files and extracts text from them. I am using HtmlAgilityPack to do the extraction. Previously I wanted to extract most of the text that was there, so it worked fine. Now the requirements have changed and I need to extract text only from the main body of the document. So suppose I scraped HTML from a news webpage: I just want the text of the article, not the ads, titles of other (albeit related) articles, headers/footers, etc.
Is it possible to modify my calls to HtmlAgilityPack to extract only the main text? Or is there an alternative way to do the extraction?
Here's the current block of code that gets text from HTML:
using System.IO;
using HtmlAgilityPack;
public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}
public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            // don't output comments
            break;
        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;
        case HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;
            // get text
            html = ((HtmlTextNode) node).Text;
            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;
            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;
        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }
            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}
private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
    {
        ConvertTo(subnode, outText);
    }
}
So, ideally, what I want is to let HtmlAgilityPack determine which parts of the input HTML constitute the "main" text block and extract only those elements. I do not know what the structure of the input HTML will be, but I do know that it will vary a lot (before, it was a lot more static).

code tag and pre css in html not functioning properly

In HTML I am using the code tag together with the CSS shown below:
<style type="text/css">
code { white-space: pre; }
</style>
<code>
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</code>
When I use the code tag as shown above, part of the code is missing when the page is viewed. The output is:
public static ArrayList getFiles(File[] files){
ArrayList _files = new ArrayList();
for (int i=0; i fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
In the first method some part is missing, and the second method does not appear at all. What is the problem and how do I fix it?

You have these <File> inside your <code> tag; you need to convert the < and > characters to the &lt; and &gt; HTML entities:
<code>
public static ArrayList&lt;File&gt; getFiles(File[] files){
ArrayList&lt;File&gt; _files = new ArrayList&lt;File&gt;();
for (int i=0; i&lt;files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList&lt;File&gt; fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</code>
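Escaping by hand is error-prone for longer listings, so the conversion is usually automated. As a minimal sketch, Python's standard html.escape does it; code_to_html is a hypothetical helper name and wrapping the result in a <code> element is an assumption based on the markup in the question.

```python
import html


def code_to_html(source: str) -> str:
    """Escape &, < and > so source code can sit inside a <code> element."""
    # quote=False leaves quotes alone; they are harmless in element content.
    return "<code>" + html.escape(source, quote=False) + "</code>"


print(code_to_html("ArrayList<File> fs = getFiles(files);"))
```

The ampersand is escaped as well, which matters for code containing operators such as &&.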

As already identified by Mr. Alien, you have characters being interpreted as markup inside your <code> block.
As an alternative to escaping lots of characters, providing your code does not include the string </script, you can exploit the parsing and (non)execution behaviour of the <script> element like this:
<code>
<script type="text/x-code">
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</script>
</code>
with this CSS:
script[type=text\/x-code] {
display: block;
white-space: pre;
line-height: 20px;
margin-top: -20px;
}
See JSfiddle: http://jsfiddle.net/fZuPm/3/
Update: In the comments, RoToRa raises some interesting points about the "correctness" of this approach, and I thank RoToRa for them.
Using a type attribute to stop the contents of a script tag from being executed as JavaScript is a well understood technique, and although the list of type names that cause script to be executed varies from browser to browser, finding one that won't cause execution is not hard.
More interesting is the question of the semantics. It is my view that the semantics of the script element are essentially inert, like a div or span element, while RoToRa's view is that it affects the semantics of the content. Looking at the specs, it is not easy to resolve. HTML 4.01 says very little about the semantics of the script element, concentrating solely on its functionality.
The HTML5 spec is not much better, but it does say "The element does not represent content for the user.". I don't know what to make of that. Saying what an element doesn't do is not very helpful. If it implies that its contents are semantically "hidden" from the user, such that its contents are not semantically part of the contents of the containing code element, then this technique should not be used.
If, however, it means that no new semantics are introduced by the script element, then there doesn't appear to be any problem.
I can't find any evidence of a script element being semantically required to contain script, as RoToRa suggests, and while it might be considered common-sense to infer that, that's not how HTML semantics works.
In many ways, this approach is really about trying to find a way to do validly what the XMP element does in browsers anyway, but is not valid. XMP was very nearly made valid in HTML5 but just missed out. The editor described it as a tough call. Using the script element like this meets that requirement, but it seems nevertheless to be controversial. If you are uncomfortable with whatever semantics you feel are being applied in this approach, I would suggest that you don't use it.

Word having single quotes search from xml file using jquery issue

Hi, I need to parse an XML file using jQuery. I created the read and display functionality, but it does not work when a word contains a single quote.
My XML looks like this:
<container>
<data name="Google" definition="A search engine"/>
<data name=" Mozilla's " definition="A web browser"/>
</container>
Using my jQuery code I can read the definition of Google, but I can't read Mozilla's definition because of the single quote. This is my jQuery code:
var displayDefinition = function(obj){
$.get("definitions.xml", function(data){
xml_data1.find("data[name^='"+obj.innerHTML+"']").each(function(k, v){
right=''+ $(this).attr("Defination") + '';
}
}
$(".result").append(right);
}
Does anybody know a solution for this? Please help me.
Thanks

jQuery deals with single quotes very well. The structure of your function looks really wild, though. I changed it a bit, assuming you want a function that displays the definition for a given name: http://jsfiddle.net/rkw79/VQxZ2/
function display(id) {
$('container').find('data[name="' +id.trim()+ '"]').each(function() {
var right = $(this).attr("definition");
$(".result").html(right);
});
}
Note, you have to make sure your 'name' attribute does not begin or end with spaces; and just trim the string that the user passes in.
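The point about trimming can be illustrated outside jQuery. The following is a small Python sketch, assuming the XML from the question; find_definition is a hypothetical name. It compares attribute values directly instead of building a selector string, so the apostrophe needs no special handling, and it strips whitespace on both sides of the comparison.

```python
import xml.etree.ElementTree as ET

XML = """<container>
  <data name="Google" definition="A search engine"/>
  <data name=" Mozilla's " definition="A web browser"/>
</container>"""


def find_definition(xml_text, name):
    """Return the definition whose trimmed name equals the trimmed `name`."""
    root = ET.fromstring(xml_text)
    wanted = name.strip()
    for node in root.iter("data"):
        # Strip the stored value too, since the file pads it with spaces.
        if node.get("name", "").strip() == wanted:
            return node.get("definition")
    return None


print(find_definition(XML, "Mozilla's"))
```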
