Best HTML encoder for Delphi?

Seems like my data is getting corrupted when using HTTPapp.HTMLEncode( string ): String;
HTMLEncode( 'Jo&hn D<oe' ); // returns 'Jo&am'
This is not correct, and it is corrupting my data. Does anyone have suggestions for VCL components that work better, short of spending my time encoding all of these cases by hand?
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Update
After understanding more about HTML, I have found there is no need to encode the other characters referenced in my link. You would only need to know about the four HTML reserved characters being
&,<,>,"
The issue with the VCL HTTPApp.HTMLEncode( ) function comes from the buffer size and the switch to Unicode as the default string type in Delphi 2009/2010. This can be fixed the way that @mason says below, or by calling WideFormatBuf( ) instead of the FormatBuf( ) that is currently in use.

Replacing the <, >, &, and " characters in a string is trivial. You could thus easily write your own routine for this. (And if your HTML page is UTF-8, there is absolutely no reason to encode any other characters, such as U+222B (the integral sign).)
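As a rough illustration of how small such a routine is, here is a sketch in JavaScript rather than Delphi. The name escapeHtml is made up for this example; the same four replacements should carry over to a chain of StringReplace calls. The ampersand is handled first so that entities produced by the later steps are not escaped a second time.

```javascript
// Hypothetical helper: escapes only the four HTML-reserved characters.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")   // must run first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

console.log(escapeHtml('Jo&hn D<oe')); // Jo&amp;hn D&lt;oe
```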
But if you wish to stick to the Delphi RTL, then you can have a look at HTTPUtil.HTMLEscape, which has exactly the same signature as HTTPApp.HTMLEncode.
Or, have a look at this SO question.

You're probably using Delphi 2009 or 2010. It looks to me like they forgot to update HTMLEncode for Unicode. It's passing the wrong buffer lengths to FormatBuf.
The HTMLEncode routine is basically right, aside from that, and it's pretty short. You could probably just make your own copy. Everywhere it calls FormatBuf, it passes 5 parameters. The second and fourth are integer values. Double both of them in each call (there are only four calls), and then it will work.
Also, you ought to open a QC report on this so it will get fixed.

Small hint: do not convert the single quote (') to &apos;. Some browsers do not understand it, because &apos; is defined for XML and XHTML but is not a valid HTML 4 entity.
For details, see: "The Curse of &apos;" and "XHTML and '"
(Both Delphi units mentioned do not convert single quotes).

Related

Text encoding problems in JSON.stringified() object

I have an index.html which sends text to a PHP script. The PHP script sends it on by POST (curl) to a Node.js server, inserted in a JSON message (UTF-8 encoded).
//Node.js server file (app.js) -- gets the json and shows it in a <script> to save it in client JS
render(index, {json:{string:"mystring"}})
//Template to render (index.ejs)
var data = <%=JSON.stringify(json)%>;
So that I can pass those variables in the JSON to data. The JSON is way bigger than shown here; I wrote only the part which triggers the bug: the string contained here causes an "Invalid character" JS error. What should I do? Which encoding/decoding/escaping should I use?
I have UTF-8 everywhere, and all my other strings work, even with German or Arabic characters. In this particular case, it is the "mystring" below which breaks the app:
If I remove the characters in the red circles, it works.
Here is the string as it is in the JSON I receive:
"Otto\nTheater-, Konzert- und Gpb\n\u2028\u2028Rhoasse\u00dfe 20\u2028\n51065 K\u00f6ln\n\nTelefon: 0000-000000-0\u2028\nTelefax: 0000-000000\n\nE-Mail: address@mail.com\u2028"
Because it is user-entered text, I must handle this kind of character. I don't have access to the PHP part of the code, only to the Node.js and client JS. How can I find and remove/convert those characters in JS?
<%- JSON.stringify(data).replace(/[\u0000\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g, "\\n") %>;
I ended up replacing the invalid Unicode characters (which are valid in JSON but not in JS code) with line breaks. This solves the problem.
JSON is commonly thought to be a subset of JavaScript, but it isn't quite. Due to an unfortunate oversight, the raw characters U+2028 and U+2029 are permitted in JSON string literals, but (before ES2019) not in JavaScript string literals. In older JavaScript engines they are interpreted as newlines, so having one in a string literal is a syntax error.
Consequently this:
var data = <%=JSON.stringify(json)%>;
isn't safe. You can make it so by manually replacing them with string-literal-escaped versions:
JSON.stringify(json).replace(/\u2028/g, '\\u2028').replace(/\u2029/g, '\\u2029')
Typically it's best to avoid this kind of problem, and keep code and data strictly separated, by dropping the JSON data into an HTML data- attribute. It can then be read out of the DOM from the client-side script and passed through JSON.parse. Then the only kind of escaping you have to worry about is normal HTML-escaping, which hopefully your templating language does by default.
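A minimal sketch of that data- attribute approach follows. The helper names escapeAttr and renderDataDiv and the element id payload are invented for this example. In an EJS template the `<%= %>` tag should already HTML-escape its output, so the manual escaping here only stands in for what the templating language would do.

```javascript
// Hypothetical helpers; only standard JSON and String methods are used.
function escapeAttr(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Server side: embed the JSON as data, not as code.
function renderDataDiv(value) {
  return '<div id="payload" data-json="' + escapeAttr(JSON.stringify(value)) + '"></div>';
}

const html = renderDataDiv({ string: 'line\u2028separator "quoted"' });
console.log(html);

// Client side (browser), assuming the element id used above:
//   const data = JSON.parse(document.getElementById("payload").dataset.json);
```

Because the payload never appears inside a script block, a raw U+2028 in it is harmless: the browser decodes the attribute and JSON.parse accepts the character.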
The other characters in your answer are actually okay for JS string literals, except for the control characters, which JSON also escapes.
It may well make sense to remove some of these characters anyway, as an input filtering step. It's unusual and almost always undesirable to have cruft like U+2028 in your data. You could consider filtering out the characters unsuitable for use in markup which include U+2028/9 and other bad things like bidi overrides that can mess up your page rendering.
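Such an input filter could look roughly like the sketch below. The function name cleanInput and the exact character ranges are assumptions for illustration: it turns the two separator characters into ordinary newlines and drops the bidi embedding, override, and isolate controls. Which characters to filter is a policy decision for your application.

```javascript
// Hypothetical filter; the chosen ranges are an assumed policy, not a standard.
function cleanInput(text) {
  return text
    // U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR become plain newlines
    .replace(/[\u2028\u2029]/g, "\n")
    // U+202A-U+202E (bidi embeddings/overrides) and U+2066-U+2069 (bidi isolates) are dropped
    .replace(/[\u202a-\u202e\u2066-\u2069]/g, "");
}

console.log(JSON.stringify(cleanInput("K\u00f6ln\u2028\u202eTelefon"))); // "Köln\nTelefon"
```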

Is it possible to escape &amp; &apos; present in database?

The data retrieved from the database contains &amp; or &apos;. How do I unescape these and show them as & or ' without using the gsub method?
If you can't stop the data from being inserted like that, then there is code here to create a function in MySQL that you can use in your query in order to return the decoded data.
Or from within Ruby, not using a replace strategy, take a look at how-do-i-encode-decode-html-entities-in-ruby.
First of all, an escape sequence belongs to string handling only, not to HTML or XML, where one speaks of masking. You can escape a string for reasons of concatenation, for example. HTML entities are specific entities which are substituted in turn to mask a special character. It is absolutely wrong to save strings still containing HTML entities in a database table. The masked string has to be unmasked first, after you retrieve it from the POST data. The exception is when you deliberately want to store HTML entities in a dedicated table, e.g. for programming reasons. A text file would do better for that (or try dBase 2), or simply search the web for a page with an entity listing.
The second point is that XML is - for the realization of better reading of your own code (in general), thought to be a personally defined markup-language. That is why any non-std-tags within that specification, have to be defined by your own. (It was strange to read about regular entities as "XML-entities", like in the case of "&apos(;)", explained on this entity-page: http://www.madore.org/~david/computers/unicode/htmlent.html)
Std-XML-tags (not entities) are mainly important in aspects of finalizing your html-code to better fit to ongoing programming languages later on, but in my opinion the mentioned ones are still html-entities!
This can and should be performed at the view level, i.e. the front end, since it's an HTML entity.
Assuming you use jQuery, you can do this to make &apos; appear as ' in the HTML:
$('<div/>').html('&apos;').text()
You can find respective entity values in the link above
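If jQuery or a DOM is not available, a small lookup table covers the common cases. This is only a sketch: decodeBasicEntities and its table are made up for this example and handle just the five entities listed, not numeric references or the full HTML entity set.

```javascript
// Hypothetical helper: decodes only the entities listed in the table.
const ENTITIES = { amp: "&", apos: "'", quot: '"', lt: "<", gt: ">" };

function decodeBasicEntities(text) {
  // Single pass, so "&amp;apos;" becomes "&apos;" and is not decoded twice.
  return text.replace(/&(amp|apos|quot|lt|gt);/g, (match, name) => ENTITIES[name]);
}

console.log(decodeBasicEntities("Tom &amp; Jerry&apos;s")); // Tom & Jerry's
```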

How can I populate a query string variable to a text box which contains &,\ and $ in it

I have a variable like say A= drug & medicare $12/$15.
I need to assign it to a text box, but only 'drug' is posted to the server. The rest of the data gets truncated.
this.textbox.text= request.querystring["A"].tostring();
The following is not valid for a="foo&bar$12":
http://example.com?a=foo&bar$12
The & symbol is a reserved character; it separates query string variables. You will need to percent-encode the value before sending it to that page.
Also & is a reserved character in HTML/XML. I suggest reading up on percent encoding and html encoding.
I believe you have problems with HTML entities. You need to read up on HTML escaping in your tool of choice. & cannot stand on its own in HTML, since it begins an entity sequence; it needs to be replaced with &amp;. Without specifying at least which toolchain you're using (as per @Richard's comment), we can't really suggest the best way to do it.
EDIT: Now that I reread your question, it seems A is not a variable but a query parameter :) Reading comprehension fail. Anyway, in this case a similar problem exists: & is not a valid character for a query parameter, and it needs URL escaping. Again, how exactly to do it is in the documentation for your toolchain, but in essence & will need to be replaced by %26. Plus sign is also not permitted (or rather it has another meaning); others are tolerated (but there are nicer ways to write them).
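For illustration only, here is what percent-encoding does to the value from the question, sketched in JavaScript rather than the asker's toolchain. The URL http://example.com/ is a placeholder; encodeURIComponent and the URL class are standard.

```javascript
const value = "drug & medicare $12/$15";

// Unencoded: the & ends the A parameter early, so only the part before it survives.
const broken = new URL("http://example.com/?A=" + value);
console.log(JSON.stringify(broken.searchParams.get("A")));

// Percent-encoded: the whole value travels as one parameter.
const url = "http://example.com/?A=" + encodeURIComponent(value);
console.log(url); // http://example.com/?A=drug%20%26%20medicare%20%2412%2F%2415

// Reading it back with the standard URL API recovers the full value.
const received = new URL(url).searchParams.get("A");
console.log(received === value); // true
```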
That looks more or less like ASP.NET pseudocode, so I'm going to diagnose your problem as the query string needing to be URL encoded. Key/value pairs in the query string are separated by an ampersand (&), and ASP.NET (along with other web platforms) automatically parses out the key/value pairs for you.
In this case, the ampersand terminates the value of the "A=..." key/value pair. The problem will be solved if you can URL encode the link that brings the user into your page. If actually using ASP.NET, you can use the HttpUtility.UrlEncode() method (or Server.UrlEncode(), shown here) for that:
string myValue = Server.UrlEncode("drug & medicare $12/$15");
You'll end up with a query string along these lines instead: A=drug+%26+medicare+%2412%2f%2415 (UrlEncode writes spaces as +, which decodes the same way as %20 in a query string)

What's the deal with the named HTML entity for $ (&dollar;)?

I have a fairly simple question. There is a named HTML entity in most references for the dollar sign, and it is what you would expect it to be; &dollar;.
But in other references, this is missing, and they tell you only the numeric entity is available (&#36;).
As I remember, the named entity didn't exist for a long time because the $ is part of the standard ASCII set. And due to this earlier/older versions of IE and other browsers don't support this entity.
So what's the deal with this currently? I am looking for what the support for the named entity is and why this wasn't supported in the first place...
Here's a reference to all the currency symbols where strangely enough only the dollar doesn't have a named entity.
Here is a small example of what I am talking about when you use a dollar + int. And yes, I know that in this simple example I could have just escaped the dollar sign with a slash but believe me when I say that making it an entity when I save the string is the sanest solution in my case.
Regardless of my example, I am still curious what the support for the &dollar; entity is.
The official list of entities doesn't list it, so I'd file it under “some browsers may have had support for it, don't rely on it, though.”
Generally, entities were needed to represent non-ASCII characters when the document character set was limited by ASCII. Nowadays with UTF-8 the most frequent character set on the web I think we can finally move past named entities and just use the characters directly.
The only sane solution is to use preg_quote() when using input for regular expressions. Otherwise you need to use HTML entities for . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - too.

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-
except-"-or-\-or-
control-character"
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data (all code points can be represented in four or fewer bytes in all Unicode transformation formats; whereas encoding them all makes them six or twelve bytes).
It is more likely that you have a text transcoding bug somewhere in your code, and escaping everything into the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u'+('0000'+c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you are creating a Reader from an InputStream, you need to specify the character encoding, or java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.
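The effect of reading with the wrong charset can be reproduced in a few lines. This sketch assumes Node.js (for Buffer) and uses its "latin1" decoder as a stand-in for CP-1252; the two agree for the bytes involved here. It is the same kind of mismatch a Java Reader created without an explicit Charset could introduce.

```javascript
// Assumes Node.js: Buffer is not available in browsers.
const json = JSON.stringify("15\u00b0C");     // the degree sign is left unescaped
const bytes = Buffer.from(json, "utf8");      // what actually goes over the wire

console.log(bytes.toString("utf8"));   // "15°C"
console.log(bytes.toString("latin1")); // "15Â°C"
```

The degree sign occupies two bytes in UTF-8 (0xC2 0xB0), and a single-byte decoder shows each byte as its own character, which is where the stray Â comes from.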