Trouble with html encoding in Google Apps Script - google-apps-script

I need to convert HTML entities to their Unicode characters. For example, when I have &amp;, I would like just &. Is there a built-in function for this, or do I have to call replace() for each entity/character pair?
Thanks in advance.

Even though there's no DOM in Apps Script, you can parse out HTML and get the plain text this way:
function getTextFromHtml(html) {
return getTextFromNode(Xml.parse(html, true).getElement());
}
function getTextFromNode(x) {
switch(x.toString()) {
case 'XmlText': return x.toXmlString();
case 'XmlElement': return x.getNodes().map(getTextFromNode).join('');
default: return '';
}
}
calling
getTextFromHtml("hello <div>foo</div>&amp; world <br /><div>bar</div>!");
will return
"hello foo& world bar!".
To explain, Xml.parse with the second param as "true" parses the document as an HTML page. We then walk the document (which will be patched up with missing HTML and BODY elements, etc. and turned into a valid XHTML page), turning text nodes into text and expanding all other nodes.

In Javascript, (I assume that's what you're using), there's no builtin function, but you can assign the content to an html tag and then read the text out. Here's an example with jQuery:
function htmlDecode(value){
return $('<div/>').html(value).text();
}
Note that the tag does not need to actually be attached to the DOM. This just creates a new tag, reads out its contents, and then throws it away. You can accomplish something very similar in vanilla Javascript with just a few extra lines.
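Since Apps Script has no DOM, the jQuery trick above is not available there. A minimal DOM-free sketch follows, assuming only a handful of named entities matter. The names NAMED_ENTITIES and decodeEntities are hypothetical helpers, and the entity table is deliberately incomplete:

```javascript
// Hypothetical helper: a small, hand-picked entity table (not the full HTML list).
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00A0" };

function decodeEntities(text) {
  // One pass over the input, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, function (match, body) {
    if (body.charAt(0) === "#") {
      const isHex = body.charAt(1).toLowerCase() === "x";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched rather than throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

For example, decodeEntities("Tom &amp; Jerry") gives "Tom & Jerry". Any entity missing from the table is left as written, so extend the table for the entities your data actually contains.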

Related

Why does the browser automatically unescape html tag attribute values?

Below I have an HTML tag, and use JavaScript to extract the value of the widget attribute. This code will alert <test> instead of &lt;test&gt;, so the browser automatically unescapes attribute values:
alert(document.getElementById("hau").attributes[1].value)
<div id="hau" widget="&lt;test&gt;"></div>
My questions are:
Can this behavior be prevented in any way, besides doing a double escape of the attribute contents? (It would look like this: &amp;lt;test&amp;gt;)
Does anyone know why the browser behaves like this? Is there any place in the HTML specs that this behavior is mentioned explicitly?
1) It can be done without doing a double escape
Looks like yours is closer to htmlEncode().
If you don't mind using jQuery
alert(htmlEncode($('#hau').attr('widget')))
function htmlEncode(value){
//create an in-memory div, set its inner text (which jQuery automatically encodes)
//then grab the encoded contents back out. The div never exists on the page.
return $('<div/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="hau" widget="&lt;test&gt;"></div>
If you're interested in a pure vanilla js solution
alert(htmlEncode(document.getElementById("hau").attributes[1].value))
function htmlEncode( html ) {
return document.createElement( 'a' ).appendChild(
document.createTextNode( html ) ).parentNode.innerHTML;
};
<div id="hau" widget="&lt;test&gt;"></div>
2) Why does the browser behave like this?
This behaviour is what makes certain things possible, such as putting double quotes inside a pre-filled input field, as shown below. Without it, the only way to insert a " would be to type the character itself, which would in turn need some other escape character such as \
<input type='text' value="&quot;You &apos;should&apos; see the double quotes here&quot;" />
The browser unescapes the attribute value as soon as it parses the document (mentioned here). One of the reasons might be that it would otherwise be impossible to include, for example, double quotes in your attribute value (well, technically it would if you put the value in single quotes instead, but then you wouldn't be able to include single quotes in the value).
That said, the behavior cannot be prevented. If you really need the value with its HTML entities intact, you can simply turn the special characters back into their codes (I recommend Underscore's escape for such a task).

Make all lowercase letters capital in font file

I have a .ttf font file I created. I got all the capital characters, but not the lowercase. Is there a tool or easy way I could make all the lowercase letters the same as capital?
Example: The font should display "hello" as "HELLO"
If that is not possible is there a way I can do this with HTML/CSS?
Sure, you can add CSS rule:
body {
text-transform: uppercase;
}
@DominatorX According to your answer, you can try something like this:
var allDomElems = $('body *'),
helper;
allDomElems.each(function () {
helper = $(this).text();
helper = helper.toUpperCase();
if($(this).children().length === 0) {
$(this).text(helper);
}
});
This doesn't work in all cases, so you'll have to debug the script.
I am unsure how to convert cases in HTML, CSS, or font files. There is a method in JavaScript that can convert strings though.
("string").toUpperCase(); // takes no arguments and always converts the whole string
A similar method is used for "lowercasing":
("string").toLowerCase();
You could input the HTML content into JavaScript to convert the cases, then document.write them back out onto the page.

Extracting the first formatted line from some RTF/HTML text

OK, I painted myself into a corner on this one and haven't decided the way out yet.
My web application hosts a series of documents written by users, and edited with the CLEditor editor via PrimeFaces. The documents can be any size and have any formatting the user chooses.
What I want to do is treat the first line of the document as a title, so that when I create a listing of those documents I show only the title, then the user can click on that table row to see the whole document. I show the title with
<h:outputText value="#{backBean.doc}" escape="false" />
What I did is pull the substring of the document out up until but not including the first occurrence of the br tag. That works unless the user applies formatting that spans past that. The resulting string has unclosed HTML tags (usually div or span), and when they are output without escaping they interfere with or even blank out the rest of the page.
So I am looking for an easy solution to fix the HTML fragment. I would rather not import a huge library such as JTidy because it pulls in all sorts of dependencies I don't have right now like a DOM parser, etc. Can anyone suggest a cheaper yet robust solution? Is there any way to clean this up on the client side?
I'd suggest Jsoup.
To parse the HTML and get its <body> content, it's a matter of this oneliner:
String htmlBody = Jsoup.parse(userInput).body().html();
By the way, since you seem to intend to redisplay user-controlled HTML unescaped, I strongly recommend whitelisting it to prevent XSS (jsoup 1.14.2 and later name this class Safelist instead of Whitelist). E.g.
String safeHtmlBody = Jsoup.clean(htmlBody, Whitelist.basic());
This way you can safely redisplay it without worrying about a XSS attack hole:
<h:outputText value="#{bean.safeHtmlBody}" escape="false" />
See also:
What are the pros and cons of the leading Java HTML parsers?
How to implement a possibility for user to post some html-formatted data in a safe way?
CSRF, XSS and SQL Injection attack prevention in JSF
You should be escaping the partial contents of the document somehow, otherwise users can upload documents containing HTML/JavaScript code that will compromise your site. As you can see, even simple formatting can break it. One solution could be to remove all tags (via regex, string replace, etc) and then escape the title.
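A rough sketch of that "strip the tags, then escape" idea, written in JavaScript for brevity. titleFromHtml is a hypothetical helper, and the tag-stripping regex is a heuristic rather than a real HTML parser:

```javascript
// Hypothetical helper: returns the first line of an HTML fragment as safe plain text.
function titleFromHtml(html) {
  // Keep only what precedes the first <br> (any casing, optional slash).
  const firstLine = html.split(/<br\s*\/?>/i)[0];
  // Drop anything that looks like a tag, including unclosed formatting tags.
  const text = firstLine.replace(/<[^>]*>/g, "");
  // Escape what is left so it can be emitted as HTML text.
  const escapes = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return text.replace(/[&<>"']/g, function (ch) {
    return escapes[ch];
  }).trim();
}
```

All formatting is lost, and entities already present in the document (such as &nbsp;) would be shown literally after escaping, so decode them first if that matters.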
I figured out the JTidy way of doing it. This seems very heavy-handed to me but I'm going with it until something better is suggested. Also if someone else is in this situation it might be useful:
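import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.NodeList;

public class TitleRTF {
    // DOTALL lets the body content span several lines.
    private static final Pattern pTidy = Pattern.compile("<body>(.*)</body>", Pattern.DOTALL);

    public TitleRTF() {}

    public static String getTitle(String rtfSource) {
        org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
        tidy.setQuiet(true);
        ByteArrayInputStream bais = new ByteArrayInputStream(rtfSource.getBytes());
        // JTidy repairs the fragment into a complete, well-formed document.
        org.w3c.dom.Document doc = tidy.parseDOM(new BufferedInputStream(bais), null);
        try {
            Transformer tr = TransformerFactory.newInstance().newTransformer();
            StreamResult result = new StreamResult(new StringWriter());
            NodeList list = doc.getElementsByTagName("body");
            if (list.getLength() > 0) {
                // Serialize the body element and keep only its inner markup.
                DOMSource source = new DOMSource(list.item(0));
                tr.transform(source, result);
                String text = result.getWriter().toString();
                Matcher m = pTidy.matcher(text);
                if (m.find()) return m.group(1);
            }
        } catch (TransformerException ex) {
            // Fall through to the placeholder below.
        }
        return "(not parsable)";
    }
}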
One thing that needs to be added to this is a way of keeping JTidy from logging what it sees as HTML errors. The setQuiet(true) doesn't seem to do it.

How to stop an html TEXTAREA from decoding html entities

I have a strange problem:
In the database, I have a literal ampersand lt semicolon:
&lt;div
Whenever it's printed into an HTML textarea tag, the source code of the page shows the &gt; as >.
How do I stop this decoding?
You can't stop entities being decoded in a textarea since the content of a textarea is not (unlike a script or style element) intrinsic CDATA, even though error recovery may sometimes give the impression that it is.
The definition of the textarea element is:
<!ELEMENT TEXTAREA - - (#PCDATA) -- multi-line text field -->
i.e. it contains PCDATA which is described as:
Document text (indicated by the SGML construct "#PCDATA"). Text may contain character references. Recall that these begin with & and end with a semicolon (e.g., Herg&eacute;'s adventures of Tintin contains the character entity reference for the e acute character).
This means that when you type (the invalid HTML of) "start of tag" (<) the browser corrects it to "less than sign" (&lt;) but when you type "start of entity" (&), which is allowed, no error correction takes place.
You need to write what you mean. If you want to include some HTML as data then you must convert any character with special meaning to its respective character reference.
If the data is:
&lt;div
Then the HTML must be:
<textarea>&amp;lt;div</textarea>
You can use the standard functions for converting this (e.g. PHP's htmlspecialchars or Perl's HTML::Entities module).
NB 1: If you were using XHTML[2] (and really using it, it doesn't count if you serve it as text/html) then you could use an explicit CDATA block:
<textarea><![CDATA[&lt;div]]></textarea>
NB 2: Or if browsers implemented HTML 4 correctly
OK, but the question is: why does it decode them anyway? Suppose I type &lt; and save the textarea. It is saved as &lt; but displayed as <, and saving it again stores < in the database. Why does the textarea decode it?
The server sends (to the browser) data encoded as HTML.
The browser sends (to the server) data encoded as application/x-www-form-urlencoded (or multipart/form-data).
Since the browser is not sending the data as HTML, the characters are not represented as HTML entities.
If you take the data received from the client and then put it into an HTML document, then you must encode it as HTML first.
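The two encodings can be seen side by side in a small sketch. It assumes a runtime with the standard URLSearchParams class (browsers and Node have it). The field name content and the helper toHtmlText are hypothetical:

```javascript
// The text the user sees in the textarea and wants stored unchanged.
const stored = "&lt;div";

// What a form submission carries: percent-encoding, not HTML entities.
const submitted = new URLSearchParams({ content: stored }).toString();
// submitted is "content=%26lt%3Bdiv"

// The server decodes the form data and gets the original text back.
const received = new URLSearchParams(submitted).get("content");

// Hypothetical helper: encode text for use as HTML element content.
function toHtmlText(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Before the text goes back into a page it must be HTML-encoded again.
const html = "<textarea>" + toHtmlText(received) + "</textarea>";
// html is "<textarea>&amp;lt;div</textarea>"
```

Skipping the last step is what makes the browser show < instead of &lt; on the next page load.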
In PHP, this can be done using htmlentities(). Example below.
<?php
$content = "This string contains the TM symbol: &trade;";
print "<textarea>". htmlentities($content) ."</textarea>";
?>
Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "&trade;".
http://php.net/manual/en/function.htmlentities.php
You have to be sure that this is rendered to the browser:
<textarea name="somename">&amp;lt;div</textarea>
Essentially, this means that the & in &lt; has to be HTML-encoded to &amp;. How to do it will depend on the technologies you're using.
UPDATE: Think about it like this. If you want to display <div> inside a textarea, you'll have to encode <> because otherwise, <div> would be a normal HTML element to the browser:
<textarea name="somename">&lt;div&gt;</textarea>
Having said this, if you want to display &lt;div&gt; inside a textarea, you'll have to encode & again, because the browser decodes HTML entities when rendering HTML. It has nothing to do with your database.
You can serve your DB-content from a separate page and then place it in the textarea using a Javascript (jQuery) Ajax-call:
request = $.ajax
({
type: "GET",
url: "url-with-the-troubled-content.php",
success: function(data)
{
document.getElementById('id-of-text-area').value = data;
}
});
Explained at
http://www.endtask.net/how-to-prevent-a-textarea-element-from-decoding-html-entities/
I had the same problem and I just made two replacements on the text from the database before putting it into the textarea:
myString = Replace(myString, "&", "&amp;")
myString = Replace(myString, "<", "&lt;")
Replacement no. 1 tricks the textarea into showing the codes.
Replacement no. 2: without it you cannot show the text "</textarea>" inside the textarea (it would end the textarea tag).
(ASP / VBScript code above; translate it to the replace method of your language of choice.)
I found an alternative solution for reading and working with in-browser, simply read the element's text() using jQuery, it returns the characters as display characters and allows me to write from a textarea to a div's innerHTML using the property via html()...
With only JS and HTML...
...to answer the actual question, with a bare-minimal example:
<textarea id=myta></textarea>
<script id=mytext type=text/plain>
&trade;
</script>
<script> myta.value = mytext.innerText; </script>
Explanation:
Script tags do not render HTML or entities. Text stored in a script tag stays unadulterated; the problem is that the browser will try to execute it as JavaScript. So we use an empty textarea and store the text in a script tag (here, the first one).
To prevent execution, we change the MIME type to text/plain instead of its default, which is text/javascript.
Then to populate the textarea, we copy the script tag's content to it (here done in the second script tag).
The only caveats I have found with this are you have to use JavaScript and you cannot include script tags directly in it.

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either a HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course); it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that will work on just any HTML file. This is because the parser has to remember which tags are open at any given point in the stream, and regexes aren't good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.
If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most languages (COM+, .NET, PHP, etc.).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's too complex for a regular expression. Your problem is equivalent to checking that the brackets in an arithmetic expression are balanced, which needs at least a pushdown automaton.
In your case you should split the HTML code into opening tags, closing tags, and text nodes (e.g. with a regular expression) and store the result in a list. Then you can iterate through the node list and push every opening tag onto a stack. When you encounter a closing tag in the node list, check that the topmost stack entry is an opening tag of the same type, and pop it if so. Otherwise you have found the HTML syntax error you were looking for.
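The stack procedure described above can be sketched as follows. checkTags and VOID_TAGS are hypothetical names, the tag regex is only a rough tokenizer (it ignores comments, scripts, and ">" inside attribute values), and the list of void elements is an assumption about which tags never take a closing tag:

```javascript
// Assumed set of elements that never have a closing tag.
const VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr", "img",
                           "input", "link", "meta", "source", "track", "wbr"]);

function checkTags(html) {
  // Group 1: "/" for a closing tag; group 2: tag name; group 3: "/" if self-closing.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  const open = [];        // stack of tags that are currently open
  const mismatched = [];  // closing tags that did not match the top of the stack
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    const name = m[2].toLowerCase();
    if (m[1] !== "/") {
      // Opening tag: push it unless it can never be closed.
      if (m[3] !== "/" && !VOID_TAGS.has(name)) {
        open.push({ name: name, index: m.index });
      }
    } else if (open.length > 0 && open[open.length - 1].name === name) {
      open.pop();
    } else {
      mismatched.push({ name: name, index: m.index });
    }
  }
  // Whatever is still on the stack was never closed.
  return { unclosed: open, mismatched: mismatched };
}
```

Each closing tag is paired with the nearest open tag, so for the question's input <i><b>test<i>ing</i> the </i> closes the inner <i>, and the sketch reports the outer <i> and the <b> as unclosed.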
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use RegEx to identify all the html begin/end elements, and then enumerate with a Stack, Push new elements, and Pop the closing tags. Try this in C# -
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static bool ValidateHtmlTags(string html)
{
    string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
    Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
    // Each entry holds an open tag name and a message locating it.
    var stack = new Stack<Tuple<string, string>>();
    bool valid = true;
    foreach (Match match in regex.Matches(html))
    {
        string element = match.Value;
        string beginTag = match.Groups[2].Value;
        string endTag = match.Groups[4].Value;
        if (beginTag == "")
        {
            // Closing tag: it must match the most recently opened tag.
            // Checking Count first keeps Peek() from throwing on an empty stack.
            if (stack.Count > 0 &&
                string.Equals(stack.Peek().Item1, endTag, StringComparison.OrdinalIgnoreCase))
                stack.Pop();
            else
            {
                valid = false;
                break;
            }
        }
        else if (!element.EndsWith("/>"))
        {
            // Write more informative message here if desired
            string message = string.Format("Char({0})", match.Index);
            stack.Push(new Tuple<string, string>(beginTag, message));
        }
    }
    // Anything still on the stack was never closed.
    if (stack.Count > 0)
        valid = false;
    // Alternative return stack.Peek().Item2 for more informative message
    return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html