PHP4: Json_encode method which accepts multi byte chars - json

in my company we have a webservice zu send data from very old projects to pretty new ones. The old projects run PHP4.4 which has natively no json_encode method. So we used the PEAR class Service_JSON instead. http://www.abeautifulsite.net/using-json-encode-and-json-decode-in-php4/
Today, I found out, that this class can not deal with multi byte chars because it extensively uses ord() in order to get charcodes from the string and replace the chars. There is no mb_ord() implementation, not even in newer PHP versions. It also uses $string{$index} to access the char at a index, I'm not completely sure if this supports multi byte chars.
//Excerpt from encode() method
// STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT
$ascii = '';
$strlen_var = $this->strlen8($var);
/*
* Iterate over every character in the string,
* escaping with a slash or encoding to UTF-8 where necessary
*/
for ($c = 0; $c < $strlen_var; ++$c) {
$ord_var_c = ord($var{$c});
//Here comes a switch which replaces chars according o their hex code and writes them to $ascii
we call
$Service_Json = new Service_JSON();
$data = $Service_Json->encode('Marktplatz, Hauptstraße, Endingen');
echo $data; //prints "Marktplatz, Hauptstra\u00dfe, Endinge". The n is missing
We solved this problem by setting up another webservice which receives serialised arrays and returns a json_encoded string. This service runs on a modern mahine, so it uses PHP5.4. But this "solutions is pretty awkward and I should look for a better one. Does anyone have an idea?
Problem description
German umlauts are replaced properly. BUT then the string is cut of at the end because ord returns the wrong chars. . mb_strlen() does not change anything, it gives the same length as strlen in this case.
Input string was "Marktplatz, Hauptstraße, Endingen", the n at the end was cut off. The ß was correctly encoded to \u00df. For every Umlaut it cuts of one more char at the end.
It's also possible the reason is our old database encoding, but the replacement itself works correctly so I guess it's the ord() method.

A colleague found out that
mb_strlen($var, 'ASCII');
solves the problem. We had an older lib version in use which used simple mb_strlen. This fix seems to do the same as your mb_convert_encoding();
Problem is solved now. Thank you very much for your help!

Related

Why is "Warning: Implicit string type conversion from AnsiString to UnicodeString" here while both are Strings?

Here I get an warning Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
....
{$mode DelphiUnicode}
{$H+}
....
Function THeader.ToHtml(Constref input: String): String;
Begin
Result := Format('<h%d>%s</h%d>', [FLevel, Chunk(input), FLevel]); // <--- HERE !
End;
My project settings include -MDelphiUnicode. My Lazarus version is 2.2.2.
As I understand it means that if Chunk() returns symbols outside of ASCII (Unicode), then the Result will be problematic. Right? What to do with this warning? Sure, I can cast the Format() result to String. But why is it required? I see that Format's prototype is:
// somewhere in the sysstrh.inc ...
Function Format (Const Fmt : String; const Args : Array of const) : String;
so it already returns a String (which is magically UnicodeString in my case, as I think). What is the problem actually here? And how to work in the correct way with such library functions like Format() (for instance, GetOptionValue() of TCustomApplication)?
ps. I read FreePascal Wiki about Unicode and String types, but I still cannot understand the reason of this warning :)
There are multiple reasons to do so.
The exact codepage of ansistring is under control of the RTL, which can query the OS for it, without the compiler knowing the details. In Lazarus applications this is generally set to utf8, but the compiler doesn't know that.
So calling a ansistring format() could corrupt strings, and repeated conversions are of course also not ideal for a performance.
delphiunicode is a work in progress, and I would not recommend using it (yet) out of habit, only if you really know what you are doing (and by that I mean knowing the state of it in FPC, not that it works in Delphi)
The original plan was to migrate to unicodestring fully, but since Windows now allows UTF8 as native 1-byte codepage (see thick in application tab of project options), the progress on that migration is glacial.
In short, consider arranging your code as much as possible so that string type doesn't matter, and then use utf8 ansistrings in Lazarus for unicode.
Or ignore the warnings, or disable them with some -vn parameter that allows you to disable specific hints/warnings

MySQL failing on LEFT() with String 8863 characters long?

Yet another fun and unexplained issue with MySQL. Code works perfectly fine with all other shorter strings (and has been for months), but when I try to following code on a String that's 8863 in length (designed to simply remove a comma as the last character), it just does nothing. No error or anything. Length is 8863 both before and after the execution (and note the RIGHT check works fine so the LEFT executes, it just fails to remove the last comma). As mentioned, ONLY happens with a very long string.
Anyone know what crazy limitations in MySQL I might be dealing with?
DECLARE var_sql_insert_1 text;
IF (RIGHT(var_sql_insert_1, 1) = ',') THEN
SET var_sql_insert_1 = LEFT(var_sql_insert_1, LENGTH(var_sql_insert_1) - 1);
END IF;
So the issue is I was using LENGTH which was returning the length in BYTES vs. CHAR_LENGTH which returns the length in characters. Sadly, with all the other languages I've used, the default LENGTH value was character and they BYTE_LENGTH was specifically designed to be byte. For MySQL it appears the reverse is true. Doesn't make much sense for a system that's mainly used to store and manipulate TEXT rather than byte data...
Since MySQl 8 where introduced function REGEXP_REPLACE you can use next solution:
SET var_sql_insert_1 = REGEXP_REPLACE(var_sql_insert_1, ',$', '');
The pattern ',$' mean last comma before end of line $
Look the example

Converting an "HTML entity" emoticon code in UTF16 (in c++)

I'm currently writing my own DrawTextEx() function that supports emoticons. Using this function, a callback is called every time an emoticon is found in the text, giving the opportunity to caller to replace the text segment containing the emoticon by an image. For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image while the text is drawn. Actually this function works fine.
Now I want to implement an image library on the caller side. I receive a text segment like 0x3DD8 0x00DE in my callback function, and my idea is to use this code as key in a map containing all the Unicode combinations, every one linked with a structure containing the image to draw. I found a good package on the http://emojione.com/developers/ website. All the packages available on this site contain several file names, that is an hexadecimal code. So I can iterate through the files contained in the package, and create my map in an automatic way.
However I found that these codes are part of another standard, and are in fact a set of items named "HTML entity", apparently used in the web development, as it can be seen on the http://graphemica.com/%F0%9F%98%80 website. So, to be able to use these files, I need a solution to convert the HTML entity values contained in their names into an UTF16 code. For example, in the case of the above mentioned smiling face, I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF16 code.
A brute force approach may consist to write a map that converts these codes, by adding each of them in my code, one by one. But as the Unicode standard contains, in the most optimist scenario, more than 1800 combinations for the emoticons, I want to know it there is an existing solution, such as a known API or function, that I may use to do the job. Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)
Regards
For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image
The character U+1F600 Grinning Face 😀 is represented by the UTF-16 code unit sequence 0xD83D, 0xDE00.
(Graphemica swapping the order of the bytes for each code unit is super misleading; ignore that.)
I found that these codes are part of another standard, and are in fact a set of items named "HTML entity", apparently used in the web development
HTML has nothing to do with it. They're plain Unicode characters—just ones outside the Basic Multilingual Plane, above U+FFFF, which is why it takes more than one UTF-16 code unit to represent them.
HTML numeric character references like 😀 (often incorrectly referred to as entities) are a way of referring to characters by code point number, but the escape string is only effective in an HTML (or XML) document, and we're not in one of those.
So:
I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF16 code.
sounds more like:
I need to convert representations of U+1F600 Grinning Face: from the code point number 0x1F600 to the UTF-16 code unit sequence 0xD83D, 0xDE00
Which in C# would be:
string face = Char.ConvertFromUtf32(0x1F619); // "😀" aka "\uD83D\uDE00"
or in the other direction:
int codepoint = Char.ConvertToUtf32("\uD83D\uDE00", 0); // 0x1F619
(the name ‘UTF-32’ is poorly-chosen here; we are talking about an integer code point number, not a sequence of four-bytes-per-character.)
Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)
In C++ things are more annoying; there's not (that I can think of) anything that directly converts between code points and UTF-16 code units. You could use various encoding functions/libraries to convert between UTF-32-encoded byte sequences and UTF-16 code units, but that can end up more faff than just writing the conversion logic yourself. eg in most basic form for a single character:
std::wstring fromCodePoint(int codePoint) {
if (codePoint < 0x10000) {
return std::wstring(1, (wchar_t)codePoint);
}
wchar_t codeUnits[2] = {
0xD800 + ((codePoint - 0x10000) >> 10),
0xDC00 + ((codePoint - 0x10000) & 0x3FF)
};
return std::wstring(codeUnits, 2);
}
This is assuming the wchar_t type is based on UTF-16 code units, same as C#'s string type is. On Windows this is probably true. Elsewhere it is probably not, but on platforms where wchar_t is based on code points, you can just pull each code point out of the string as a character with no further processing.
(Optimisation and error handling left as an exercise for the reader.)
I'm using the RAD Studio compiler, and fortunately it provides an implementation for the ConvertFromUtf32 and ConvertToUtf32 functions mentioned by bobince. I tested them and they do exactly what I needed.
For those that doesn't use the Embarcadero products, the fromCodePoint() implementation provided by bobince works also well. For information, here is also the ConvertFromUtf32() function as implemented in RAD Studio, and translated into C++
std::wstring ConvertFromUtf32(unsigned c)
{
const unsigned unicodeLastChar = 1114111;
const wchar_t minHighSurrogate = 0xD800;
const wchar_t minLowSurrogate = 0xDC00;
const wchar_t maxLowSurrogate = 0xDFFF;
// is UTF32 value out of bounds?
if (c > unicodeLastChar || (c >= minHighSurrogate && c <= maxLowSurrogate))
throw "Argument out of range - invalid UTF32 value";
std::wstring result;
// is UTF32 value a 16 bit value that can fit inside a wchar_t?
if (c < 0x10000)
result = wchar_t(c);
else
{
// do divide in 2 chars
c -= 0x10000;
// convert code point value to UTF16 string
result = wchar_t((c / 0x400) + minHighSurrogate);
result += wchar_t((c % 0x400) + minLowSurrogate);
}
return result;
}
Thanks to bobince for his response, which pointed me in the right direction and helped me to solve this problem.
Regards

AS3 Multiple warnings saying ExternalInterface escapes strings using JSON conventions

Flash CC, Target: Flash Player 17.
First frame code:
ExternalInterface.call("test", "\\");
Test movie gives console warning:
WARNING: For content targeting Flash Player version 14 or higher, ExternalInterface escapes strings using JSON conventions. To maintain compatibility, content published to earlier Flash Player versions continues to use the legacy escaping behavior.
How to get rid of this warning?
UPDATE:
var a:Object = {test:"\\"};
ExternalInterface.call("console.log", a);
This code works correct, browser console displays:
Object {test: "\"}
but why I'm still receiving this warning?
You can "get rid of this warning" by encoding your strings:
ExternalInterface.call("test", encodeURIComponent("\\"));
JS:
function test(encoded) {
var decoded = decodeURIComponent(encoded);
}
EDIT: To be clear, the warning only shows up for strings that contain certain escaped characters, like slashes. You do not need to encode other data types, like boolean, number, integer, etc.
This warning isn't really a problem, though, it's just telling you that certain characters you are using are escaped differently than they used to be.
Reply to your Update
This code works correct, but why I'm still receiving this warning?
Yes, like I said, it's expected to work without extra encoding. The warning is not telling you there's a problem, it's simply warning you that certain characters are escaped in a way that older player targets escape differently (or not at all). If that doesn't mean anything to you, then the warning doesn't mean anything to you either.
Your object example can be encoded this way:
var a:Object = {test:"\\"};
ExternalInterface.call("logFromFlash", encodeURIComponent(JSON.stringify(a)));
JS:
function logFromFlash(encoded) {
var object = JSON.parse(decodeURIComponent(encoded));
console.log(object); // Object {test: "\"}
}
Or you could encode just the specific string properties that might contain slashes, if you have that foreknowledge.
Again, is all this encoding and decoding worth doing just to hide a harmless warning? Not in my opinion, but it's up to you.

Scala Play 2.4.x handling extended characters through anorm (MySQL) to Java Mail

I was under the impression that UTF-8 was the answer to everything :0
Problem: Using Play's idiomatic form handling to go from a web page (basic HTML Text Area Input field) to a MySQL database through the Anorm abstraction layer (so all properly escaped) and then reading the database to gather that data and create an email using the JavaMail API's to send HTML email with alternate characters (accented characters like é for example. (I'd post more but I suspect we might get strange artifacts here as well -- I'll try that in a comment below perhaps)
I can use a moderate set of characters and create a TEXT email (edited via Atom and placed into the stream directly at the code level) and it comes through as an email with all the characters I've chosen in tact.
I have not yet systematically worked through the characters I was just using a relatively random sampling as an initial test.
I place the same set of characters into a text field and try to save them to the database and I can only save about 1 in 5 or less of them.
The errors look like this:
SQLException: Incorrect string value: '\xC4\x93\x0D\x0A\x0D\x0A...' for column 'content' at row 1
I suspect I'm about to learn a ton of new information about either Play and/or UTF-8 or HTML or some part of the chain where this is going off the rails.
My question then is this: Is there an idiomatic Play example of how to handle UTF-8 end to end through Anorm and into Java Mail?
(I think I kinda expected it to be "built-in" but then I expected a LOT more to be baked into the core product as well...)
I want/need both a TEXT and and HTML path for the email portion. (I can write BOTH and they work fine -- the problem is moving alternate characters though the channels as indicated above).
I'm currently seeing if this might be an answer:
https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/
However presently hitting this roadblock...
How to turn off specific Implicit's in Scala that prevent code from compiling due to overloaded methods?
This appears to be a hopeful candidate -- I am researching it now end to end.
import org.apache.commons.lang3._
def htmlEncode(input: String) = htmlEncode_sb(input).toString
def htmlEncode_sb(input: String, stringBuilder: StringBuilder = new StringBuilder()) = {
stringBuilder.synchronized {
for ((c, i) <- input.zipWithIndex) {
if (CharUtils.isAscii(c)) {
// Encode common HTML equivalent characters
stringBuilder.append(StringEscapeUtils.escapeHtml4(c.toString()))
} else {
// Why isn't this done in escapeHtml4()?
stringBuilder.append(s"""&#${Character.codePointAt(input, i)};""")
}
}
stringBuilder
}
}
In order to get it to work inside Play you'll need this in your build.sbt file
"org.apache.commons" % "commons-lang3" % "3.4",
This blog post lead me to write that code: https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/
Update: Confirmed that it does work end to end.
Web Page Input as TextArea inside a Form saved to MySQL database escaped by Anorm, reread from database and displayed inside a TextArea on a web page with extended characters (visually) appearing precisely as input.
You'll need to call #Html(htmlContentString) inside the Twirl template to re-render this as the original HTML but the browser (Safari 8.0.7) displayed exactly what I gave it after a round trip to and from the database.
One caveat -- it creates machine readable HTML not human readable HTML. It would be nice if it didn't encode angle brackets and such so it looks more like HTML that we expect. I'm sure a pattern match block will be added next to exclude just that :)