Replace characters with HTML entities in Java [duplicate] - html

I want to replace certain characters with their respective HTML entities in an HTML response inside a filter. Characters include <, >, &. I can't use replaceAll() as it will replace all characters, even those that are part of HTML tags.
What is the best approach for doing so?

From Java you may try Apache Commons Lang (legacy v2) StringEscapeUtils.escapeHtml(). Or with commons-lang3: StringEscapeUtils.escapeHtml4().
Please note this also converts à to &agrave; and such.

If you're using a technology such as JSTL, you can simply print out the value using <c:out value="${myObject.property}"/> and it will be automatically escaped.
The attribute escapeXml is true by default.
escapeXml - Determines whether characters <,>,&,'," in the resulting
string should be converted to their corresponding character entity
codes. Default value is true.
http://docs.oracle.com/javaee/5/jstl/1.1/docs/tlddocs/

When developing in the Spring ecosystem, one can use the HtmlUtils.htmlEscape() method.
For full apidocs, visit https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/util/HtmlUtils.html

Since most solutions reference a deprecated Apache class, here's one I've adapted from https://stackoverflow.com/a/16947646/3196753.
import org.apache.commons.lang3.StringUtils;

public class StringUtilities {
    // Characters to escape and, at the same index, their replacements
    public static final String[] HTML_ENTITIES = {"&", "<", ">", "\"", "'", "/"};
    public static final String[] HTML_REPLACED = {"&amp;", "&lt;", "&gt;", "&quot;", "&apos;", "&sol;"};

    public static String escapeHtmlEntities(String text) {
        return StringUtils.replaceEach(text, HTML_ENTITIES, HTML_REPLACED);
    }
}
Note: This is not a comprehensive solution (it's not context-aware -- may be too aggressive) but I needed a quick, effective solution.
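If pulling in a library is not an option, the same idea can be sketched with the JDK alone. The class name HtmlEscaper and the method escapeHtml are hypothetical names chosen for illustration, and like the version above this is not context-aware: it assumes the value ends up in HTML text or in a quoted attribute value.

```java
public final class HtmlEscaper {
    private HtmlEscaper() {}

    /** Escapes the five characters that are significant in HTML text and quoted attribute values. */
    public static String escapeHtml(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        // A single pass over the input, so an '&' produced by an earlier replacement is never escaped twice.
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // numeric form, also understood by HTML 4 parsers
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>Tom & Jerry's</b>"));
    }
}
```

Escaping has to happen on the individual values before they are merged into the markup; running it over a finished HTML response would escape the tags as well, which is the problem described in the question.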

Related

Regular expression to remove HTML tags from a string [duplicate]

Is there an expression which will get the value between two HTML tags?
Given this:
<td class="played">0</td>
I am looking for an expression which will return 0, stripping the <td> tags.
You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.
The following examples are Java, but the regex will be similar -- if not identical -- for other languages.
String target = someString.replaceAll("<[^>]*>", "");
Assuming your non-html does not contain any < or > and that your input string is correctly structured.
If you know they're a specific tag -- for example you know the text contains only <td> tags, you could do something like this:
String target = someString.replaceAll("(?i)</?td[^>]*>", "");
Edit:
Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.
For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.
In a situation where multiple tags are expected, we could do something like:
String target = someString.replaceAll("(?i)</?td[^>]*>", " ").replaceAll("\\s+", " ").trim();
This replaces the HTML with a single space, then collapses whitespace, and then trims any on the ends.
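Put together as a small runnable sketch (TagStripper and stripTags are hypothetical names), with the same assumptions as above: the input is well-formed and the text itself contains no < or >.

```java
import java.util.regex.Pattern;

public final class TagStripper {
    // Anything from '<' up to the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private TagStripper() {}

    /** Replaces every tag with a space, collapses runs of whitespace, and trims the ends. */
    public static String stripTags(String html) {
        String spaced = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<td class=\"played\">0</td>"));
        System.out.println(stripTags("<td>Something</td><td>Another Thing</td>"));
    }
}
```

Precompiling the patterns avoids recompiling them on every call, which is what String.replaceAll does internally.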
A trivial approach would be to replace
<[^>]*>
with nothing. But depending on how ill-structured your input is that may well fail.
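You could do it with jsoup (http://jsoup.org/):
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(yourText, whitelist);
In jsoup 1.14.2 and later, Whitelist has been renamed to Safelist, so use Safelist.none() there.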

HTML Codes .. Names Vs Numbers

This might seem like a stupid question but I have always wondered. What would be the advantage of using HTML code names versus HTML code numbers. Is there a right place or a wrong place for each version?
By HTML codes I am referring to this..
http://ascii.cl/htmlcodes.htm
I know for validation purposes codes should be used, for example using &amp; or &#38; versus using &. However, I don't know when it would be right to use &amp; over &#38; ... or does it simply make no difference?
It makes no difference. The reason why, for example, &amp; was created was to make it easier for coders to remember and make code easier to read.
It just comes down to, one is easier for us (humans) to read.
Some terminology: A code like "&amp;" is properly called a character entity reference; a code like "&#38;" is a numeric character reference.
Together, we can refer to them all as "HTML entities." For a given code point, there is sometimes a character entity reference, but there is always a numeric character reference, which can be formed from the Unicode code point of the character. For instance, ℛ (U+211B) has the numeric character reference "&#x211B;".
Generally it's the ASCII characters that have character entity references, but not always.
Character entity references are usually easier to read, but in a particular context a set of numeric character references might be clearer, for instance if you were writing a regular expression to match a certain block of Unicode characters.
When you say "for validation purposes codes should be used," I think you have in mind the rule that a bare ampersand is not valid HTML. That's specific to this character.
Update
An example where you have to use the numeric character reference: HTML 4 has no character entity reference for the single quote character, "'" ("&apos;" comes from XML and was only added to HTML in HTML5). A piece of JavaScript to scrub quote characters out of a string has to use the numeric character reference, "&#39;".
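Since a numeric character reference is just the code point written in decimal or hexadecimal, it can be generated mechanically. A minimal sketch in Java; the class and method names (NumericCharRef, decimal, hex, encodeNonAscii) are made up for this example.

```java
import java.util.Locale;

public final class NumericCharRef {
    private NumericCharRef() {}

    /** Decimal form, e.g. code point 38 ('&') becomes "&#38;". */
    public static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Hexadecimal form, e.g. U+211B becomes "&#x211B;". */
    public static String hex(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT) + ";";
    }

    /** Leaves ASCII untouched and replaces every other code point with its decimal reference. */
    public static String encodeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append(decimal(cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decimal('&'));
        System.out.println(hex(0x211B));
        System.out.println(encodeNonAscii("caf\u00e9"));
    }
}
```

Iterating over code points rather than chars keeps characters outside the Basic Multilingual Plane intact as a single reference.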
Using names is always preferable, as it is always more readable. Consider the following pieces of identical code:
$location = "Faith Life Church";
$city = "Sarasota";
/*...*/
foreach ($stmt as $row) {
foreach ($row as $variable => $value) {
$variable = strtolower($variable);
$$variable = $value;
}
}
And
$v073124 = "Faith Life Church";
$v915431 = "Sarasota";
/*...*/
foreach ($v3245 as $v9825) {
foreach ($v9825 as $v85423 => $v8245631) {
$v85423 = strtolower($v85423);
$$v85423 = $v8245631;
}
}
Which would you consider more readable?
It's your choice. Numeric codes may be easier for someone to remember than names.
But I think HTML5 requires you to use names, where they exist, as a standard. I'm really not sure about this.

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either a HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course); it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that will work on just any HTML file. This is because the parser would have to remember which tags are open at any given point in the stream, and regexes aren't good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.
If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most languages (COM+, .NET, PHP, etc.).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's too complex for a regular expression. Your problem is equivalent to testing an arithmetic expression for proper use of brackets, which needs at least a pushdown automaton to succeed.
In your case you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression). Store the result in a list. Then you can iterate through the node list and push every opening tag onto the stack. If you encounter a closing tag in your node list, you must check that the topmost stack entry is an opening tag of the same type. Otherwise you have found the HTML syntax error you were looking for.
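A sketch of that stack approach in Java, under a few assumptions: the class name UnclosedTagFinder, the "name@offset" output format, and the short list of void elements are all made up for illustration and are not exhaustive. It also takes one liberty: when a closing tag does not match the top of the stack, it looks further down and reports the tags it skips as unclosed, instead of stopping at the first error.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class UnclosedTagFinder {
    // Group 1: "/" for a closing tag; group 2: tag name; group 3: "/" for a self-closing tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*?(/?)>");
    // Elements that never take a closing tag (deliberately incomplete list).
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    private UnclosedTagFinder() {}

    /** Returns the opening tags that are never closed, as "name@offset" strings. */
    public static List<String> findUnclosed(String html) {
        Deque<String> names = new ArrayDeque<>();    // stack of open tag names
        Deque<Integer> offsets = new ArrayDeque<>(); // parallel stack of their positions
        List<String> unclosed = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty() || VOID.contains(name);
            if (closing) {
                if (names.contains(name)) {
                    // Everything opened after the matching tag was left unclosed.
                    while (!names.peek().equals(name)) {
                        unclosed.add(names.pop() + "@" + offsets.pop());
                    }
                    names.pop();
                    offsets.pop();
                }
                // A closing tag with no matching opener is ignored in this sketch.
            } else if (!selfClosing) {
                names.push(name);
                offsets.push(m.start());
            }
        }
        // Whatever is still on the stack at the end was never closed.
        while (!names.isEmpty()) {
            unclosed.add(names.pop() + "@" + offsets.pop());
        }
        return unclosed;
    }

    public static void main(String[] args) {
        System.out.println(findUnclosed("<i><b>test<i>ing</i>"));
    }
}
```

Note that a stack pairs a closing tag with the nearest open tag of the same name, so for the input in the question it pairs </i> with the second <i> and reports <b> and the first <i> as unclosed.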
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use RegEx to identify all the html begin/end elements, and then enumerate with a Stack, Push new elements, and Pop the closing tags. Try this in C# -
// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public static bool ValidateHtmlTags(string html)
{
    // Group 2 captures an opening tag name, group 4 a closing tag name
    string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
    Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
    var stack = new Stack<Tuple<string, string>>();
    bool valid = true;
    foreach (Match match in regex.Matches(html))
    {
        string element = match.Value;
        string beginTag = match.Groups[2].Value;
        string endTag = match.Groups[4].Value;
        if (beginTag == "")
        {
            // Closing tag: it must match the most recently opened tag.
            // Checking Count first avoids an exception when nothing is open.
            if (stack.Count > 0 &&
                string.Equals(stack.Peek().Item1, endTag, StringComparison.OrdinalIgnoreCase))
                stack.Pop();
            else
            {
                valid = false;
                break;
            }
        }
        else if (!element.EndsWith("/>"))
        {
            // Write more informative message here if desired
            string message = string.Format("Char({0})", match.Index);
            stack.Push(new Tuple<string, string>(beginTag, message));
        }
    }
    if (stack.Count > 0)
        valid = false;
    // Alternative return stack.Peek().Item2 for more informative message
    return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Qt Regex matches HTML Tag InnerText

I have a html file with one <pre>...</pre> tag. What regex is necessary to match all content within the pre's?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
list.count() is always 0
HTML is not a regular language; you should not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.
DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one
I did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = clipBoardData.mid(begin, end - begin);
The result includes the <pre> tags, but I found out that's even better ;)
I have to agree with the others. Drupal 6.x and older use regex to do a lot of work on the HTML data. It quickly breaks if you create pages of 64 KB or more. So using a DOM, or just indexOf() as you've done, is a better and much faster solution.
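Now, for those interested in knowing more about regex: Qt's QRegularExpression (Qt 5 and later) uses a Perl-compatible implementation, which means you can use the lazy operator. Your regex would become:
(<pre>.*?</pre>)+
to get each one of the <pre> blocks in your code (although if you have only one, then the question mark and the plus are not required). Note that no delimiters at the start and end of the regular expression are required here.
The older QRegExp class does not support lazy quantifiers such as .*?; there you call setMinimal(true) to make every quantifier in the pattern non-greedy:
QRegExp re("(<pre>.*</pre>)", Qt::CaseInsensitive);
re.setMinimal(true);
re.indexIn(html_input);
QStringList list = re.capturedTexts();
Now list should hold the first <pre> block (capturedTexts() returns the whole match followed by each captured group). To collect further blocks, loop with indexIn(html_input, pos) as in the question.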

Regex to Parse Hyperlinks and Descriptions

C#: What is a good Regex to parse hyperlinks and their description?
Please consider case insensitivity, white-space and use of single quotes (instead of double quotes) around the HREF tag.
Please also consider obtaining hyperlinks which have other tags within the <a> tags such as <b> and <i>.
As long as there are no nested tags (and no line breaks), the following variant works well:
<a\s+href=(?:"([^"]+)"|'([^']+)').*?>(.*?)</a>
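The question asks for C#, but the pattern carries over to other engines unchanged. As a rough sketch, here it is in java.util.regex with case-insensitive matching; LinkExtractor and extractLinks are hypothetical names, and the same limits apply (href must be the first attribute, no nested <a> tags, no line breaks inside a link).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class LinkExtractor {
    // Group 1: double-quoted URL; group 2: single-quoted URL; group 3: link description.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=(?:\"([^\"]+)\"|'([^']+)').*?>(.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    private LinkExtractor() {}

    /** Returns one {url, description} pair per hyperlink found in the input. */
    public static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Exactly one of the two URL groups participates in any given match.
            String url = m.group(1) != null ? m.group(1) : m.group(2);
            links.add(new String[] {url, m.group(3)});
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"pages/index.php\">Text</a> <A HREF='x.html'><b>Other</b></A>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Inner tags such as <b> end up in the description as-is; strip them in a second step if only the text is wanted.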
As soon as nested tags come into play, regular expressions are unfit for parsing. However, you can still use them by applying more advanced features of modern interpreters (depending on your regex machine). E.g. .NET regular expressions use a stack; I found this:
(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)
Source: http://weblogs.asp.net/scottcate/archive/2004/12/13/281955.aspx
See this example from StackOverflow: Regular expression for parsing links from a webpage?
Using The HTML Agility Pack you can parse the html, and extract details using the semantics of the HTML, instead of a broken regex.
I found this but apparently these guys had some problems with it.
Edit: (It works!)
I have now done my own testing and found that it works, I don't know C# so I can't give you a C# answer but I do know PHP and here's the matches array I got back from running it on this:
Text
array(3) { [0]=> string(52) "Text" [1]=> string(15) "pages/index.php" [2]=> string(4) "Text" }
I have a regex that handles most cases, though I believe it does match HTML within a multiline comment.
It's written using the .NET syntax, but should be easily translatable.
Just going to throw this snippet out there now that I have it working. This is a less greedy version of one suggested earlier. The original wouldn't work if the input had multiple hyperlinks. The code below will allow you to loop through all the hyperlinks:
static Regex rHref = new Regex(@"<a.*?href=[""'](?<url>[^""^']+[.]*?)[""'].*?>(?<keywords>[^<]+[.]*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
public void ParseHyperlinks(string html)
{
    MatchCollection mcHref = rHref.Matches(html);
    foreach (Match m in mcHref)
        AddKeywordLink(m.Groups["keywords"].Value, m.Groups["url"].Value);
}
Here is a regular expression that will match the balanced tags.
(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)