Extracting the first formatted line from some RTF/HTML text

OK, I painted myself into a corner on this one and haven't decided the way out yet.
My web application hosts a series of documents written by users, and edited with the CLEditor editor via PrimeFaces. The documents can be any size and have any formatting the user chooses.
What I want to do is treat the first line of the document as a title, so that when I create a listing of those documents I show only the title; the user can then click on that table row to see the whole document. I show the title with
<h:outputText value="#{backBean.doc}" escape="false" />
What I did was pull out the substring of the document up to, but not including, the first occurrence of the br tag. That works unless the user applies formatting that spans past that point. The resulting string then has unclosed HTML tags (usually div or span), and when it is output without escaping, those tags interfere with or even blank out the rest of the page.
So I am looking for an easy solution to fix the HTML fragment. I would rather not import a huge library such as JTidy, because it pulls in all sorts of dependencies I don't have right now (a DOM parser, etc.). Can anyone suggest a cheaper yet robust solution? Is there any way to clean this up on the client side?

I'd suggest Jsoup.
To parse the HTML and get its <body> content, it's a matter of this one-liner:
String htmlBody = Jsoup.parse(userInput).body().html();
By the way, since you seem to intend to redisplay user-controlled HTML unescaped, I strongly recommend whitelisting it to prevent XSS. E.g.
String safeHtmlBody = Jsoup.clean(htmlBody, Whitelist.basic());
This way you can safely redisplay it without worrying about an XSS attack hole:
<h:outputText value="#{bean.safeHtmlBody}" escape="false" />
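To tie this back to the title use case, here is a minimal sketch of pulling only the first block out of the document and cleaning it; the class name and the choice of "first child element of the body" as the title are my own assumptions, not part of the original answer:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class DocTitle {
    // Returns a safe HTML snippet to show as the document's title in the listing.
    static String firstLineAsTitle(String userInput) {
        Document doc = Jsoup.parse(userInput);
        Element body = doc.body();
        // Assumption: treat the first child element of <body> as the "first line";
        // adjust this selection to however your documents are actually structured.
        Element first = body.children().first();
        String fragment = (first != null) ? first.outerHtml() : body.html();
        // Whitelist the fragment so it can be redisplayed with escape="false".
        return Jsoup.clean(fragment, Whitelist.basic());
    }
}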
See also:
What are the pros and cons of the leading Java HTML parsers?
How to implement a possibility for user to post some html-formatted data in a safe way?
CSRF, XSS and SQL Injection attack prevention in JSF

You should be escaping the partial contents of the document somehow; otherwise users can upload documents containing HTML/JavaScript code that will compromise your site. As you can see, even simple formatting can break it. One solution could be to remove all tags (via regex, string replace, etc.) and then escape the title.
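For the title listing specifically, a minimal sketch of that idea in Java (the class and method names are made up for illustration): strip the tags and let <h:outputText> escape the remainder normally instead of using escape="false".
public class TitleText {
    // Crude tag strip; good enough for a short title, not a general HTML parser.
    static String plainTitle(String firstLineHtml) {
        return firstLineHtml.replaceAll("<[^>]*>", "").trim();
    }
}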

I figured out the JTidy way of doing it. This seems very heavy-handed to me, but I'm going with it until something better is suggested. Also, if someone else is in this situation, it might be useful:
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.NodeList;

public class TitleRTF {

    // DOTALL so the captured body content may span multiple lines of serialized output.
    private static final Pattern pTidy = Pattern.compile("<body>(.*)</body>", Pattern.DOTALL);

    public TitleRTF() {}

    public static String getTitle(String rtfSource) {
        org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
        tidy.setQuiet(true);
        ByteArrayInputStream bais = new ByteArrayInputStream(rtfSource.getBytes());
        // Let JTidy repair the fragment into a well-formed document.
        org.w3c.dom.Document doc = tidy.parseDOM(new BufferedInputStream(bais), null);
        try {
            Transformer tr = TransformerFactory.newInstance().newTransformer();
            StreamResult result = new StreamResult(new StringWriter());
            NodeList list = doc.getElementsByTagName("body");
            if (list.getLength() > 0) {
                // Serialize only the <body> element, then strip the wrapper tags.
                DOMSource source = new DOMSource(list.item(0));
                tr.transform(source, result);
                String text = result.getWriter().toString();
                Matcher m = pTidy.matcher(text);
                if (m.find()) return m.group(1);
            }
        } catch (TransformerException ex) { }
        return "(not parsable)";
    }
}
One thing that needs to be added to this is a way of keeping JTidy from logging what it sees as HTML errors. The setQuiet(true) doesn't seem to do it.
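Two JTidy setters that might help here (untested on my side, so treat both the method names and their effect as assumptions), added right after the setQuiet(true) call:
// Assumption: these Tidy setters suppress warnings and redirect the error stream.
tidy.setShowWarnings(false);
tidy.setErrout(new java.io.PrintWriter(new java.io.StringWriter()));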

Related

Typing is messy if I use html with QTextEdit

I'm trying to change the attributes of single words, such as font and color. QTextEdit allows me to set the text as HTML via setHtml(htmlText), but after setting the QString as HTML, typing becomes messy: I can't type spaces or hit Enter, and sometimes words are written backwards.
void MainWindow::on_textEdit_textChanged()
{
    QString plainText = ui->textEdit->toPlainText();
    QString htmlText = "<font color='red'>" + plainText + "</font>";
    // Disconnect first so that setHtml() does not re-trigger this slot.
    disconnect(ui->textEdit, SIGNAL(textChanged()), this, SLOT(on_textEdit_textChanged()));
    ui->textEdit->setHtml(htmlText);
    // Restore a cursor position, since setHtml() resets it.
    QTextCursor cursor(ui->textEdit->textCursor());
    cursor.movePosition(QTextCursor::EndOfWord);
    ui->textEdit->setTextCursor(cursor);
    connect(ui->textEdit, SIGNAL(textChanged()), this, SLOT(on_textEdit_textChanged()));
}
The color is set correctly, but typing is inconsistent. I'm not an expert in HTML. Any suggestions?
HTML is a transfer representation for the syntax tree of the document. You need to be modifying one or the other, otherwise you'll face the fallout from interactions between the two. Choose one and stick to it.
Since you're using the QTextDocument interface, you should be making all changes using that interface. There's no need to deal with HTML directly then. To change attributes of a chunk of text, select the text, then manipulate it via the cursor API.

Trouble with html encoding in Google Apps Script

I need to convert HTML entity characters to their Unicode versions. For example, when I have &amp;, I would like just &. Is there a special function for this, or do I have to use replace() for each HTML entity <--> Unicode character pair?
Thanks in advance.
Even though there's no DOM in Apps Script, you can parse out HTML and get the plain text this way:
function getTextFromHtml(html) {
  return getTextFromNode(Xml.parse(html, true).getElement());
}
function getTextFromNode(x) {
  switch (x.toString()) {
    case 'XmlText': return x.toXmlString();
    case 'XmlElement': return x.getNodes().map(getTextFromNode).join('');
    default: return '';
  }
}
calling
getTextFromHtml("hello <div>foo</div>& world <br /><div>bar</div>!");
will return
"hello foo& world bar!".
To explain, Xml.parse with the second param as "true" parses the document as an HTML page. We then walk the document (which will be patched up with missing HTML and BODY elements, etc. and turned into a valid XHTML page), turning text nodes into text and expanding all other nodes.
In JavaScript (I assume that's what you're using), there's no built-in function, but you can assign the content to an HTML tag and then read the text back out. Here's an example with jQuery:
function htmlDecode(value) {
  return $('<div/>').html(value).text();
}
Note that the tag does not need to actually be attached to the DOM. This just creates a new tag, reads out its contents, and then throws it away. You can accomplish something very similar in vanilla Javascript with just a few extra lines.

How can I transform some HTML fragment into XHTML using groovy?

I have an input String containing some HTML fragment like the following example
I would have never thought that <b>those infamous tags</b>,
born in the <abbr title="Don't like that acronym">SGML</abbr> realm,
would make their way into the web of objects that we now experience.
Obviously, the real one is far more complex (including links, images, divs, and so on), and I would like to write a method with the following prototype:
String toXHTML(String html) {
    // What do I have to write here?
}
Without a precise description of the input format, I assume it will be some HTML-like stuff. Parsing such a mess gets ugly quickly, but it looks like someone else has already done a good job:
#!/usr/bin/env groovy
@Grapes(
    @Grab(group='jtidy', module='jtidy', version='4aug2000r7-dev')
)
import org.w3c.tidy.*
def tidy = new Tidy()
tidy.parse(System.in, System.out)
Use the force, Riduidel.
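If pulling in JTidy is not an option, a similar conversion can be sketched in plain Java with Jsoup (callable from Groovy as-is); Jsoup, the class name and the XML output settings shown here are my own assumptions, not part of the original answer:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class XhtmlConverter {
    // Parse the fragment leniently, then emit XML-conformant (XHTML-style) markup.
    static String toXHTML(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings()
           .syntax(Document.OutputSettings.Syntax.xml)
           .escapeMode(Entities.EscapeMode.xhtml);
        return doc.body().html();
    }
}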
Check out this: http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/
It might be what you are looking for.

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's HTML content off the web, and extract everything that isn't contained in a tag; in other words, the textual content of the page, as seen by a visitor to that page. That includes 'masking' out everything encapsulated in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML; it's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables, etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert it back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that shows the active regex strings along with a test engine that lets you run the parsing on any online HTML page, giving you parse times and extracted results (for link, URL and text portions individually, as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler: replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags, and even supposing a document doesn't contain any nested tags, regex still requires that every tag be properly closed.
If you are using PHP, for simplicity I strongly recommend using the DOM (Document Object Model) to parse/extract HTML documents. A DOM library usually exists in every programming language.
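In Java, for instance, the DOM route is essentially a one-liner with Jsoup (a sketch; Jsoup and the class name are assumptions on my part, not something this answer mentions):
import org.jsoup.Jsoup;

public class ExtractText {
    // Parse the markup into a document tree and return only the visible text.
    static String textOnly(String html) {
        return Jsoup.parse(html).text();
    }
}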
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
For your information: instead of regex, with jQuery it's possible to extract just the text from HTML markup. For that you can use the following pattern (where htmlMarkup holds the markup string):
$("<div/>").html(htmlMarkup).text()
You can refer to this JSFiddle.

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either an HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course); it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that will work on just any HTML file. This is because the parser would have to remember which tags are open at any given point in the stream, and regexes aren't good at that.
Use a SAX parser with some counters, or use a stack with push/pop operations to keep your state. Think about how you would code this game to see what I mean about HTML tag depth: http://en.wikipedia.org/wiki/Tower_of_Hanoi
As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.
If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most platforms (COM+, .NET, PHP, etc.).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's too complex for a regular expression. Your problem is equivalent to testing an arithmetic expression for proper use of brackets, which requires at least a pushdown automaton to succeed.
In your case, you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression) and store the result in a list. Then iterate through the node list, pushing every opening tag onto a stack. Whenever you encounter a closing tag, check that the topmost stack entry is an opening tag of the same type; otherwise you have found the HTML syntax error you are looking for.
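A minimal Java sketch of that stack-based check (the tag regex is deliberately rough, ignores void elements like <br>, and the class name is made up; it only reports whether anything was left open or closed out of order):
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBalanceCheck {
    // Rough tag tokenizer: group 1 = "/" for closing tags, group 2 = tag name,
    // group 3 = "/" for self-closing tags such as <br/>.
    private static final Pattern TAG =
            Pattern.compile("<\\s*(/?)\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    static boolean isBalanced(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (!m.group(1).isEmpty()) {                // closing tag
                if (open.isEmpty() || !open.pop().equals(name)) return false;
            } else if (m.group(3).isEmpty()) {          // opening tag (skip self-closing)
                open.push(name);
            }
        }
        return open.isEmpty();                          // anything still open is an error
    }
}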
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ , which matches a "<" followed by characters that are not "/", up to the end of the line.
You can use a regex to identify all the HTML begin/end elements and then enumerate them with a Stack: push new elements and pop them when the matching closing tag arrives. Try this in C#:
public static bool ValidateHtmlTags(string html)
{
    string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
    Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
    var stack = new Stack<Tuple<string, string>>();
    bool valid = true;
    foreach (Match match in regex.Matches(html))
    {
        string element = match.Value;
        string beginTag = match.Groups[2].Value;
        string endTag = match.Groups[4].Value;
        if (beginTag == "")
        {
            // A closing tag with nothing left on the stack is itself an error.
            if (stack.Count == 0)
            {
                valid = false;
                break;
            }
            string previousTag = stack.Peek().Item1;
            if (previousTag == endTag)
                stack.Pop();
            else
            {
                valid = false;
                break;
            }
        }
        else if (!element.EndsWith("/>"))
        {
            // Write more informative message here if desired
            string message = string.Format("Char({0})", match.Index);
            stack.Push(new Tuple<string, string>(beginTag, message));
        }
    }
    if (stack.Count > 0)
        valid = false;
    // Alternative: return stack.Peek().Item2 for a more informative message
    return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html