How can I transform some HTML fragment into XHTML using Groovy?

I have an input String containing some HTML fragment like the following example:
I would have never thought that <b>those infamous tags</b>,
born in the <abbr title="Don't like that acronym">SGML</abbr> realm,
would make their way into the web of objects that we now experience.
Obviously, the real one is far more complex (including links, images, divs, and so on), and I would like to write a method with the following prototype:
String toXHTML(String html) {
// What do I have to write here ?
}

Without a description of the input format, it will probably be some html-like stuff.
Parsing such a mess gets ugly quickly. But it looks like someone else did a good job already:
#!/usr/bin/env groovy
@Grapes(
    @Grab(group='jtidy', module='jtidy', version='4aug2000r7-dev')
)
import org.w3c.tidy.*
def tidy = new Tidy()
tidy.parse(System.in, System.out)
Use the force, Riduidel.

Check out this: http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/
It might be something you are looking for.
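For readers who want to see what such a conversion involves, here is a minimal sketch in Python (not Groovy, and not what JTidy does internally) built only on the standard library's html.parser. The names XhtmlWriter and to_xhtml are made up for this illustration. It assumes a small fragment: comments and doctypes are dropped, and script/style content is not treated specially.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have content; XHTML writes them self-closed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XhtmlWriter(HTMLParser):
    """Hypothetical helper: re-emits parsed HTML as well-formed XHTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []   # pieces of the output document
        self.open = []  # stack of currently open element names

    def _attrs(self, attrs):
        # XHTML requires every attribute to carry a quoted value;
        # a bare attribute such as "disabled" becomes disabled="disabled".
        return "".join(
            ' %s="%s"' % (k, escape(k if v is None else v, quote=True))
            for k, v in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, self._attrs(attrs)))
        else:
            self.out.append("<%s%s>" % (tag, self._attrs(attrs)))
            self.open.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.out.append("<%s%s />" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray closing tag: drop it
        # Close any elements left open inside the one being closed.
        while self.open:
            name = self.open.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; escape again.
        self.out.append(escape(data, quote=False))


def to_xhtml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    while writer.open:  # close whatever is still open at the end
        writer.out.append("</%s>" % writer.open.pop())
    return "".join(writer.out)
```

Calling to_xhtml('<p>a <i>b</p>') returns '<p>a <i>b</i></p>'. For real-world input a dedicated library such as JTidy is still the safer choice, since this sketch knows nothing about HTML's implicit-closing rules (for example a new <li> ending the previous one).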

Related

How can I strip HTML tags from a string in the model before I get to the view

Trying to determine how to strip the HTML tags from a string in Ruby. I need this to be done in the model before I get to the view. So using:
ActionView::Helpers::SanitizeHelper#strip_tags()
won't work. I was looking into using Nokogiri, but can't figure out how to do it.
If I have a string:
description = '<a href="...">google</a>'
I need it to be converted to plain text without including HTML tags so it would just come out as "google".
Right now I have the following which will take care of HTML entities:
def simple_description
simple_description = Nokogiri::HTML.parse(self.description)
simple_description.text
end
You can call the sanitizer directly like this:
Rails::Html::FullSanitizer.new.sanitize('<b>bold</b>')
# => "bold"
There are also other sanitizer classes that may be useful: FullSanitizer, LinkSanitizer, Sanitizer, WhiteListSanitizer.
Nokogiri is a great choice if you don't own the HTML generator and you want to reduce your maintenance load:
require 'nokogiri'
description = '<a href="...">google</a>'
Nokogiri::HTML::DocumentFragment.parse(description).at('a').text
# => "google"
The good thing about a parser, compared with patterns, is that it keeps working when the tags or the format of the document change, whereas patterns get tripped up by those changes.
While using a parser is a little slower, it more than makes up for that in ease of use and reduced maintenance.
The code above breaks down to:
Nokogiri::HTML(description).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"...\">google</a></body></html>\n"
Rather than let Nokogiri add the normal HTML headers, I told it to parse only that one node into a document fragment:
Nokogiri::HTML::DocumentFragment.parse(description).to_html
# => "<a href=\"...\">google</a>"
at finds the first occurrence of that node:
Nokogiri::HTML::DocumentFragment.parse(description).at('a').to_html
# => "<a href=\"...\">google</a>"
text finds the text in the node.
Maybe you could use a regular expression in Ruby, like the following:
des = '<a href="...">google</a>'
p des[/<.*>(.*)\<\/.*>/,1]
The result will be "google".
Regular expressions are powerful, and you can customize this one to fit your needs.
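To illustrate the parser-versus-pattern point outside Ruby, here is a small Python sketch that uses only the standard library. TextExtractor and strip_tags are names invented for this example; the idea is simply to keep the text nodes and ignore every tag, which also decodes entities along the way.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags('<a href="#">google</a>'))  # prints: google
```

Unlike the regular expression above, this keeps working when the string has several tags, nested tags, or no tags at all.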

How to (re)create JSON records from HTML data attributes

This might be something quite trivial, but I have very little experience in the field so it might as well turn out quite complicated.
Basically, I'm trying to render a bunch of HTML data attributes into their JSON equivalents.
For example I have the following HTML markup (I've cut it since it's too long):
<span class="play-queue play-queue-med-small" data-json=
"{&quot;id&quot;:4276028,
&quot;selected&quot;:false,
&quot;type&quot;:&quot;track&quot;,
&quot;sku&quot;:&quot;track-4276028&quot;,
&quot;name&quot;:&quot;This Is What It Feels Like feat. Trevor Guthrie&quot;,
&quot;trackNumber&quot;:2,
&quot;active&quot;:true,
&quot;mixName&quot;:&quot;W&amp;W Remix&quot;,
&quot;title&quot;:&quot;This Is What It Feels Like feat. Trevor Guthrie (W&amp;W Remix)&quot;,
&quot;slug&quot;:&quot;this-is-what-it-feels-like-feat-trevor-guthrie-w-and-w-remix&quot;,
&quot;isrc&quot;:&quot;NLF711303293&quot;,
&quot;releaseDate&quot;:&quot;2013-04-05&quot;,
&quot;publishDate&quot;:&quot;2013-04-05&quot;,
&quot;sampleUrl&quot;:&quot;
From this I want to render the data-json attribute into a JSON record looking like:
{
"key1":"value1",
"key2":"value2",
...
"keyN":"valueN"
}
Is there any way to do this? Maybe a method (either in C# or in Java), or some workaround?
Thank you a million in advance!
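On the client side, the browser already decodes the HTML entities when it parses the attribute, so the raw attribute value can be handed straight to JSON.parse (unescape only handles %xx URL escapes, not HTML entities):
JSON.parse($('span.play-queue').attr('data-json'))
jQuery's .data('json') goes one step further and returns the parsed object directly when the attribute holds valid JSON.
On the server, the C# equivalent for decoding the entities is HttpUtility.HtmlDecode.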
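The same two steps (decode the HTML entities, then parse the JSON) can be sketched in Python with only the standard library. The function name decode_data_json and the shortened sample value are made up for this illustration; the sample reuses a few fields from the markup in the question.

```python
import html
import json


def decode_data_json(raw_attribute):
    """Hypothetical helper: entity-encoded attribute text -> Python dict."""
    decoded = html.unescape(raw_attribute)  # &quot; -> ", &amp; -> &
    return json.loads(decoded)


# Shortened sample in the shape of the data-json attribute from the question.
raw = ('{&quot;id&quot;:4276028,&quot;type&quot;:&quot;track&quot;,'
       '&quot;mixName&quot;:&quot;W&amp;W Remix&quot;,&quot;active&quot;:true}')

record = decode_data_json(raw)
print(record["id"])       # prints: 4276028
print(record["mixName"])  # prints: W&W Remix
```

If the page is read with an HTML parser that already decodes attribute values (Jsoup in Java or HtmlAgilityPack in C#, for instance), the unescape step is not needed and the attribute value can go straight to the JSON parser.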

Good way to store formatted text in DB to output later

I write news for my website and format it like this:
[h1]News[/h1]
[red]Happy New Year[/red]
[white]Happy New Year[/white]
The news is stored as-is in the MySQL DB.
Then, when my website requests it, a function converts every code into HTML:
[h1][/h1] = <h1></h1>
[red][/red] = <font color=red></font>
I have not been happy with this method for a long time, and now such tags are obsolete in HTML5.
Instead of using <font> I should move the styling to CSS.
I'm very beginner with PHP, MySQL, CSS, HTML...really, but I'm trying and learning.
So, what I need is the best solution for this matter.
I was thinking to create a CSS rule like:
span.news-red { color: red; }
span.news-white { color: white; }
And then use them in the markup for red text, etc.
Is this an effective solution or just a palliative?
Thank you.
EDIT
I have these two functions that convert the format of my text so it can be output for the visitor.
1st - Converts [white-text][/white-text] into <font color=white></font>
$string = preg_replace("/\[white-text\](\S+?)\[\/white-text\]/si","<font color=white>\\1</font>", $string);
2nd - Converts [url][/url] into <a href=""></a>
$string = preg_replace("/\[url\](\S+?)\[\/url\]/si","<a href=\"\\1\">\\1</a>", $string);
Problems:
WHITE-TEXT - It only changes the color of one-word phrases.
URL - It works fine, but I would like to be able to write anything in the readable part of the URL.
In general, you want a small set of shared text styles, each named for why it is used rather than for how it looks. If I were you, I would store names in the DB that describe what the text is. Then, if you decide that red is just a horrible choice of color, you can change it very easily just by editing the CSS.
Not knowing why you chose to make something red, I can't give you much more of an answer than this: use a CSS class name that relates to why you chose red, rather than to the color itself.
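As for the two problems listed in the edit: \S+? matches only non-whitespace characters, which is why phrases with spaces are never converted. A lazy .+? with the "dot matches newline" flag accepts any text. Below is a rough Python sketch of the whole conversion (the same patterns should carry over to PHP's preg_replace). The class names come from the question; render_news, COLOR_CLASSES, and the [url=...]label[/url] form are assumptions made for this example.

```python
import re
from html import escape

# Hypothetical mapping from BBCode-style tags to the CSS classes
# proposed in the question.
COLOR_CLASSES = {"red": "news-red", "white": "news-white"}

FLAGS = re.IGNORECASE | re.DOTALL  # DOTALL lets .+? span spaces and newlines


def render_news(text):
    # Escape raw HTML first; square-bracket codes are left untouched.
    out = escape(text, quote=False)
    out = re.sub(r"\[h1\](.+?)\[/h1\]", r"<h1>\1</h1>", out, flags=FLAGS)
    for tag, css in COLOR_CLASSES.items():
        pattern = r"\[%s\](.+?)\[/%s\]" % (tag, tag)
        out = re.sub(pattern, r'<span class="%s">\1</span>' % css, out,
                     flags=FLAGS)
    # Assumed syntax: [url=http://target]any label[/url]; http(s) only.
    out = re.sub(r'\[url=(https?://[^\]\s"]+)\](.+?)\[/url\]',
                 r'<a href="\1">\2</a>', out, flags=FLAGS)
    return out


print(render_news("[red]Happy New Year[/red]"))
# prints: <span class="news-red">Happy New Year</span>
```

Because the stored text only contains the codes, changing a color later means editing one CSS rule, with no change to the database content.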

Extracting the first formatted line from some RTF/HTML text

OK, I painted myself into a corner on this one and haven't decided the way out yet.
My web application hosts a series of documents written by users, and edited with the CLEditor editor via PrimeFaces. The documents can be any size and have any formatting the user chooses.
What I want to do is treat the first line of the document as a title, so that when I create a listing of those documents I show only the title, then the user can click on that table row to see the whole document. I show the title with
<h:outputText value="#{backBean.doc}" escape="false" />
What I did is pull out the substring of the document up to, but not including, the first occurrence of the br tag. That works unless the user applies formatting that spans past that point. The resulting string then has unclosed HTML tags (usually div or span), and when they are output without escaping they interfere with or even blank out the rest of the page.
So I am looking for an easy solution to fix the HTML fragment. I would rather not import a huge library such as JTidy because it pulls in all sorts of dependencies I don't have right now like a DOM parser, etc. Can anyone suggest a cheaper yet robust solution? Is there any way to clean this up on the client side?
I'd suggest Jsoup.
To parse the HTML and get its <body> content, it's a matter of this one-liner:
String htmlBody = Jsoup.parse(userInput).body().html();
By the way, since you seem to intend to redisplay user-controlled HTML unescaped, I strongly recommend whitelisting it to prevent XSS (jsoup 1.14.2 and later name this class Safelist instead of Whitelist). E.g.
String safeHtmlBody = Jsoup.clean(htmlBody, Whitelist.basic());
This way you can safely redisplay it without worrying about an XSS attack hole:
<h:outputText value="#{bean.safeHtmlBody}" escape="false" />
See also:
What are the pros and cons of the leading Java HTML parsers?
How to implement a possibility for user to post some html-formatted data in a safe way?
CSRF, XSS and SQL Injection attack prevention in JSF
You should be escaping the partial contents of the document somehow, otherwise users can upload documents containing HTML/JavaScript code that will compromise your site. As you can see, even simple formatting can break it. One solution could be to remove all tags (via regex, string replace, etc) and then escape the title.
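If the goal is only to repair the truncated fragment, the core of the fix is small: remember which elements are still open at the cut point and append their closing tags. Here is a minimal sketch of that idea in Python using only the standard library (the same stack logic ports to Java). OpenTagTracker and first_line are names invented for this example, and the sketch does no sanitizing, so the XSS concerns above still apply.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Hypothetical helper: tracks which elements are still open."""

    def __init__(self):
        super().__init__()
        self.open = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closed element: nothing stays open

    def handle_endtag(self, tag):
        if tag in self.open:
            # Drop the closed element and anything left open inside it.
            while self.open.pop() != tag:
                pass


def first_line(html):
    # Cut at the first <br>, <br/> or <br /> (case-insensitive).
    match = re.search(r"<br\s*/?>", html, flags=re.IGNORECASE)
    head = html[:match.start()] if match else html
    tracker = OpenTagTracker()
    tracker.feed(head)
    tracker.close()
    # Close the still-open elements, innermost first.
    closing = "".join("</%s>" % tag for tag in reversed(tracker.open))
    return head + closing


print(first_line('<div><span style="color:red">Title<br>rest</span></div>'))
# prints: <div><span style="color:red">Title</span></div>
```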
I figured out the JTidy way of doing it. This seems very heavy-handed to me, but I'm going with it until something better is suggested. It might also be useful to someone else in this situation:
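import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.NodeList;

public class TitleRTF {
    // DOTALL lets the captured body content span several lines.
    private static final Pattern pTidy = Pattern.compile("<body>(.*)</body>", Pattern.DOTALL);
    public TitleRTF() {}

    public static String getTitle(String rtfSource) {
        org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
        tidy.setQuiet(true);
        ByteArrayInputStream bais = new ByteArrayInputStream(rtfSource.getBytes());
        org.w3c.dom.Document doc = tidy.parseDOM(new BufferedInputStream(bais), null);
        try {
            Transformer tr = TransformerFactory.newInstance().newTransformer();
            StreamResult result = new StreamResult(new StringWriter());
            NodeList list = doc.getElementsByTagName("body");
            if (list.getLength() > 0) {
                // Serialize the repaired <body> element, then strip the wrapper.
                DOMSource source = new DOMSource(list.item(0));
                tr.transform(source, result);
                String text = result.getWriter().toString();
                Matcher m = pTidy.matcher(text);
                if (m.find()) return m.group(1);
            }
        } catch (TransformerException ex) {
            // Fall through to the "not parsable" result below.
        }
        return "(not parsable)";
    }
}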
One thing that needs to be added to this is a way of keeping JTidy from logging what it sees as HTML errors. The setQuiet(true) doesn't seem to do it.

Is there an easy way to strip HTML from a QString in Qt?

I have a QString with some HTML in it... is there an easy way to strip the HTML from it? I basically want just the actual text content.
<i>Test:</i><img src="blah.png" /><br> A test case
Would become:
Test: A test case
I'm curious to know if Qt has a string function or utility for this.
QString s = "<i>Test:</i><img src=\"blah.png\" /><br> A test case";
s.remove(QRegExp("<[^>]*>"));
// s == "Test: A test case"
If you don't care about performance that much then QTextDocument does a pretty good job of converting HTML to plain text.
QTextDocument doc;
doc.setHtml( htmlString );
return doc.toPlainText();
I know this question is old, but I was looking for a quick and dirty way to handle incorrect HTML. The XML parser wasn't giving good results.
You may try to iterate through the string using the QXmlStreamReader class and extract all the text (if your HTML string is guaranteed to be well-formed XML).
Something like this:
QXmlStreamReader xml(htmlString);
QString textString;
while (!xml.atEnd()) {
if ( xml.readNext() == QXmlStreamReader::Characters ) {
textString += xml.text();
}
}
but I'm not sure this is 100% valid usage of the QXmlStreamReader API, since I used it quite a long time ago and may have forgotten something.
The fact that some HTML is not quite valid XML makes it harder to get this right.
If it's valid XML (or not too badly formatted), I think QXmlStreamReader + QXmlStreamEntityResolver might not be a bad idea.
Sample code in: https://github.com/ycheng/misccode/blob/master/qt_html_parse/utils.cpp
(This could be a comment, but I don't have permission to comment yet.)
This answer is for those who read this post later and use Qt 5 or later. Simply escape the HTML characters using the built-in function, as below (note that this escapes the tags rather than removing them).
QString str="<h1>some heading </h1>"; // a string containing HTML tags.
QString esc=str.toHtmlEscaped(); // esc contains the HTML-escaped string.