convert html entities to unicode(utf-8) strings in c? [duplicate] - html

Closed 10 years ago.
Possible Duplicate:
How to decode HTML Entities in C?
This question is very similar to that one, but I need to do the same thing in C, not Python. Here are some examples of what the function should do:
input      output
&lt;       <
&gt;       >
&auml;     ä
&#x00DF;   ß
The function should have the signature char *html2str(char *html) or similar. I'm not reading byte by byte from a stream.
Is there a library function I can use?

There isn't a standard library function to do the job. There must be a large number of implementations available in the Open Source world; just about any program that has to deal with HTML will have one.
There are two aspects to the problem:
Finding the HTML entities in the source string.
Inserting the appropriate replacement text in its place.
Since the shortest possible entity is '&x;' (but, AFAIK, they all use at least 2 characters between the ampersand and the semi-colon), you will always be shortening the string since the longest possible UTF-8 character representation is 4 bytes. Hence, it is possible to edit in situ safely.
There's an illustration of HTML entity decoding in 'The Practice of Programming' by Kernighan and Pike, though it is done somewhat 'in passing'. They use a tokenizer to recognize the entity, and a sorted table of entity names plus the replacement value so that they can use a binary search to identify the replacements. This is only needed for the non-algorithmic entity names. For entities encoded as '&#223;', you use an algorithmic technique to decode them.
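A minimal sketch of that approach in C, not the book's code: a sorted table searched with bsearch for named entities, plus arithmetic decoding of numeric ones, editing the buffer in place. The function name html_decode_inplace and the six-entry table are hypothetical; a real decoder needs the full entity list, and the in-place trick assumes every replacement is no longer than the entity text it replaces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BODY 8   /* longest text accepted between '&' and ';' */

struct entity { const char *name; const char *utf8; };

/* Hypothetical, deliberately tiny table. Must stay sorted by name
   (strcmp order) because it is searched with bsearch. */
static const struct entity entities[] = {
    { "amp",   "&" },
    { "auml",  "\xC3\xA4" },   /* U+00E4 */
    { "gt",    ">" },
    { "lt",    "<" },
    { "quot",  "\"" },
    { "szlig", "\xC3\x9F" },   /* U+00DF */
};

static int cmp_entity(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct entity *)elem)->name);
}

/* Parse "223" or "xDF" (the text after "&#"). Returns 1 on success. */
static int parse_codepoint(const char *digits, unsigned long *cp)
{
    unsigned long value = 0, base = 10;

    if (*digits == 'x' || *digits == 'X') { base = 16; digits++; }
    if (*digits == '\0') return 0;
    for (; *digits != '\0'; digits++) {
        unsigned long d;
        if (*digits >= '0' && *digits <= '9') d = (unsigned long)(*digits - '0');
        else if (base == 16 && *digits >= 'a' && *digits <= 'f') d = (unsigned long)(*digits - 'a' + 10);
        else if (base == 16 && *digits >= 'A' && *digits <= 'F') d = (unsigned long)(*digits - 'A' + 10);
        else return 0;
        value = value * base + d;
        if (value > 0x10FFFF) return 0;            /* beyond Unicode */
    }
    if (value == 0 || (value >= 0xD800 && value <= 0xDFFF)) return 0;
    *cp = value;
    return 1;
}

/* Write cp as UTF-8 to out; returns the number of bytes written (1..4). */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode entities in s in place; unrecognised entities are left untouched. */
char *html_decode_inplace(char *s)
{
    char *in = s, *out = s;

    while (*in != '\0') {
        const char *semi = (*in == '&') ? strchr(in + 1, ';') : NULL;
        size_t len = semi ? (size_t)(semi - in - 1) : 0;

        if (len >= 1 && len <= MAX_BODY) {
            char body[MAX_BODY + 1];
            unsigned long cp;
            const struct entity *e;

            memcpy(body, in + 1, len);   /* copy first: out may overwrite it */
            body[len] = '\0';
            if (body[0] == '#' && parse_codepoint(body + 1, &cp)) {
                out += utf8_encode(cp, out);
                in += len + 2;
                continue;
            }
            e = bsearch(body, entities, sizeof entities / sizeof entities[0],
                        sizeof entities[0], cmp_entity);
            if (e != NULL) {
                size_t n = strlen(e->utf8);
                memcpy(out, e->utf8, n);
                out += n;
                in += len + 2;
                continue;
            }
        }
        *out++ = *in++;   /* ordinary character or unrecognised entity */
    }
    *out = '\0';
    return s;
}
```

The buffer must be writable (an array or heap memory, not a string literal). The write pointer never overtakes the read pointer because each replacement here is at most as long as the entity it replaces.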

This sounds like a job for flex. Granted, flex is usually stream-based, but you can change that using the flex function yy_scan_string (or its relatives). For details, see The flex Manual: Scanning Strings.
Flex's basic Unicode support is pretty bad, but if you don't mind coding in the bytes by hand, it could be a workaround. There are probably other tools that can do what you want, as well.

Related

Is it possible to escape &amp; &apos; present in database?

The data retrieved from database has &amp; or &apos;. How do I escape and show as & or ' without using gsub method?
If you can't stop the data from being inserted like that, then there is code here to create a function in MySQL that you can use in your query in order to return the decoded data.
Or from within Ruby, not using a replace strategy, take a look at how-do-i-encode-decode-html-entities-in-ruby.
First of all, an escape sequence belongs to string handling only, not to HTML or XML, where the term is masking. You might escape a string for concatenation, for example. HTML entities are specific entities that stand in for special characters in order to mask them. It is absolutely wrong to save strings that still contain HTML entities in a database table. The masked string has to be unmasked first, after you retrieve it from the POST request. The exception is when you deliberately want to store HTML entities in a dedicated table, e.g. for programming reasons. A text file would serve better for that (try dBase 2), or simply search the web for a page with an entity listing.
The second point is that XML is meant, for the sake of making your own code more readable, to be a personally defined markup language. That is why any non-standard tags within that specification have to be defined by you. (It was strange to read about regular entities being called "XML entities", as in the case of "&apos(;)", explained on this entity page: http://www.madore.org/~david/computers/unicode/htmlent.html)
Standard XML tags (not entities) mainly matter for finalizing your HTML code so that it fits better with later processing in programming languages, but in my opinion the ones mentioned are still HTML entities!
This can and should be performed on the view level, i.e. the front end, since it's an HTML entity.
Assuming you use jQuery, you can do this to make &apos; appear as ' in the HTML:
$('<div/>').html('&apos;').text()
You can find respective entity values in the link above

What's the deal with the named HTML entity for $ (&dollar;)

I have a fairly simple question. There is a named HTML entity in most references for the dollar sign, and it is what you would expect it to be; &dollar;.
But in other references it is missing, and they tell you that only the numeric entity is available (&#36;).
As I remember, the named entity didn't exist for a long time because the $ is part of the standard ASCII set. Because of this, older versions of IE and other browsers don't support this entity.
So what's the deal with this currently? I am looking for what the support for the named entity is and why this wasn't supported in the first place...
Here's a reference to all the currency symbols where strangely enough only the dollar doesn't have a named entity.
Here is a small example of what I am talking about when you use a dollar + int. And yes, I know that in this simple example I could have just escaped the dollar sign with a slash but believe me when I say that making it an entity when I save the string is the sanest solution in my case.
Regardless of my example, I am still curious what the support for the &dollar; entity is.
The official list of entities doesn't list it, so I'd file it under “some browsers may have had support for it, don't rely on it, though.”
Generally, entities were needed to represent non-ASCII characters when the document character set was limited to ASCII. Nowadays, with UTF-8 being the most frequent character set on the web, I think we can finally move past named entities and just use the characters directly.
The only sane solution is to use preg_quote() when using input for regular expressions. Otherwise you need to use HTML entities for . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - too.

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-
except-"-or-\-or-
control-character"
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is
similar to conventions used in the C
family of programming languages. A
string begins and ends with quotation
marks. All Unicode characters may be
placed within the quotation marks
except for the characters that must be
escaped: quotation mark, reverse
solidus, and the control characters
(U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data (all code points can be represented in four or fewer bytes in all Unicode transformation formats; whereas encoding them all makes them six or twelve bytes).
It is more likely that you have a text transcoding bug somewhere in your code, and escaping everything outside the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
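To make the size argument concrete, here is a rough sketch in C that rewrites UTF-8 JSON text so that every non-ASCII code point becomes a \uXXXX escape (a surrogate pair for code points above U+FFFF). The names utf8_decode and json_escape_non_ascii are made up for illustration, and the sketch assumes well-formed UTF-8 input that is already valid JSON text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at p (assumed well-formed); store the code
   point in *cp and return the number of bytes consumed. */
static size_t utf8_decode(const unsigned char *p, unsigned long *cp)
{
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((p[0] & 0x1FUL) << 6) | (p[1] & 0x3FUL);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((p[0] & 0x0FUL) << 12) | ((p[1] & 0x3FUL) << 6) | (p[2] & 0x3FUL);
        return 3;
    }
    *cp = ((p[0] & 0x07UL) << 18) | ((p[1] & 0x3FUL) << 12)
        | ((p[2] & 0x3FUL) << 6) | (p[3] & 0x3FUL);
    return 4;
}

/* Copy src to dst, escaping every non-ASCII code point as \uXXXX.
   dst must hold at least 3 * strlen(src) + 1 bytes.
   Returns the length of the result. */
size_t json_escape_non_ascii(const char *src, char *dst)
{
    const unsigned char *p = (const unsigned char *)src;
    char *out = dst;

    while (*p != '\0') {
        unsigned long cp;
        p += utf8_decode(p, &cp);
        if (cp < 0x80) {
            *out++ = (char)cp;                          /* ASCII: unchanged */
        } else if (cp < 0x10000) {
            out += sprintf(out, "\\u%04lx", cp);        /* six bytes */
        } else {
            cp -= 0x10000;                              /* surrogate pair: twelve bytes */
            out += sprintf(out, "\\u%04lx\\u%04lx",
                           0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```

With this sketch, the five-byte UTF-8 string 15°C grows to the nine-byte 15\u00b0C, and a single four-byte character outside the Basic Multilingual Plane grows to twelve bytes.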
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you are creating a Reader from an InputStream, you need to specify the character encoding, or java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.

Best HTML encoder for Delphi?

Seems like my data is getting corrupted when using HTTPapp.HTMLEncode( string ): String;
HTMLEncode( 'Jo&hn D<oe' ); // returns 'Jo&am'
This is not correct, and is corrupting my data. Does anyone have suggestions for VCL components that work better? Other than spending my time encoding all the cases
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Update
After understanding more about HTML, I have found there is no need to encode the other characters referenced in my link. You would only need to know about the four HTML reserved characters being
&,<,>,"
The issue with the VCL HTTPApp.HTMLEncode( ) function comes from the buffer sizes it uses and from Unicode becoming the default string type in Delphi 2009/2010. It can be fixed the way that @mason describes below, or with a call to WideFormatBuf( ) instead of the FormatBuf( ) that is currently in use.
Replacing the <, >, &, and " characters in a string is trivial. You could thus easily write your own routine for this. (And if your HTML page is UTF-8, there is absolutely no reason to encode any other characters, such as U+222B (the integral sign).)
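As a language-neutral illustration of how small such a routine is, here is a sketch in C rather than Delphi. The function name html_escape is hypothetical; it returns a newly allocated string that the caller must free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of src with & < > " replaced by entities,
   or NULL if allocation fails. The caller frees the result. */
char *html_escape(const char *src)
{
    /* Worst case: every character becomes "&quot;" (6 bytes). */
    char *dst = malloc(strlen(src) * 6 + 1);
    char *out = dst;

    if (dst == NULL)
        return NULL;

    for (; *src != '\0'; src++) {
        const char *rep = NULL;

        switch (*src) {
        case '&': rep = "&amp;";  break;
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '"': rep = "&quot;"; break;
        }
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(out, rep, n);
            out += n;
        } else {
            *out++ = *src;   /* everything else, including UTF-8 bytes, passes through */
        }
    }
    *out = '\0';
    return dst;
}
```

Applied to the string from the question, Jo&hn D<oe, this yields Jo&amp;hn D&lt;oe.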
But if you wish to stick to the Delphi RTL, then you can have a look at HTTPUtil.HTMLEscape, which has exactly the same signature as HTTPApp.HTMLEncode.
Or, have a look at this SO question.
You're probably using Delphi 2009 or 2010. It looks to me like they forgot to update HTMLEncode for Unicode. It's passing the wrong buffer lengths to FormatBuf.
The HTMLEncode routine is basically right, aside from that, and it's pretty short. You could probably just make your own copy. Everywhere it calls FormatBuf, it gives 5 parameters. The second and fourth are integer values. Double both of them in each call (there are only four calls), and then it will work.
Also, you ought to open a QC report on this so it will get fixed.
Small hint: do not convert the single quote (') to &apos;. Some browsers do not understand it, because &apos; is an XML entity that is not defined in HTML 4.
For details, see: "The Curse of &apos;" and "XHTML and &apos;"
(Both Delphi units mentioned do not convert single quotes).

Is there some functionality in/for Delphi that converts a string with html named and numbered entities to unicode text?

I read data from a MySQL database that is filled by PHP scripts. All special characters are converted to named or numbered HTML entities (for example &amp; or &#286;).
I know of no way to convert these characters back to the original ones in Delphi as unicode strings. Did anyone ever find or even create such a function? This would be very helpful to me. Thanks!
Marc
In Delphi 2007 there is a unit called HTTPApp.pas (in [Delphi Folder]\Source\Win32\Internet) that has the functions HTMLEncode and HTMLDecode. They might be worth a look.