Twitter Special Characters and Emoticons UTF-8 JSON parsing

I am connecting to the Twitter Stream through a Node.js (JavaScript) server, parsing Tweets, and then storing them in a CouchDB instance. The issue is that when I try to write to CouchDB, I get this error about 40% of the time:
{ error: 'bad_request', reason: 'invalid UTF-8 JSON' }
When I compare the Tweets that are stored successfully with the ones that are not, the difference seems to be the presence of special characters in the Tweet or user description: emoticons (hearts, smiley faces, etc.), Asian-language characters, and so on.
How can I correctly parse and store these Tweets in CouchDB? I think it's a CouchDB issue, since when I log the data to my console using Node.js I see the emoticons.

It turns out the issue was with the setting of Content-Length in Node.js when sending the PUT request. I was calculating the length of the string before encoding it, but Content-Length must be the number of bytes in the encoded body. Once Node encoded the special characters as UTF-8 they took up more than one byte each, so the header was too small and the body was cut off.
Lesson learned: Be careful when calculating the length of a JSON payload, especially one with special characters. Count bytes, not characters.
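The mismatch described above can be reproduced in a few lines. This is a minimal sketch in Python rather than Node.js, with a made-up `tweet` payload; Python counts code points where JavaScript counts UTF-16 units, but the point is the same: the character count and the UTF-8 byte count diverge as soon as non-ASCII characters appear.

```python
import json

# Hypothetical tweet payload containing a heart and an emoji.
tweet = {"text": "I ♥ CouchDB 😀", "lang": "en"}

# Serialize without escaping non-ASCII characters, as a UTF-8 API would send it.
body = json.dumps(tweet, ensure_ascii=False)

char_count = len(body)      # what a naive string-length check reports
payload = body.encode("utf-8")
byte_count = len(payload)   # what Content-Length must actually be

headers = {
    "Content-Type": "application/json",
    "Content-Length": str(byte_count),  # byte length, not character length
}

print(char_count, byte_count)
```

In Node.js the byte count of a string can be obtained with `Buffer.byteLength(body, 'utf8')` instead of `body.length`.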

Related

RAD Server Delphi - using SaveToStream and LoadFromStream does not work because of umlauts after JSON conversion

I am trying to exchange data between a RAD Server IIS package and a Delphi client using EMSEndpoint.
What I am trying looks simple to me, but I can't get it to work.
In the package there is a TFDConnection pointing to an MS SQL Server. A TFDQuery is connected to that connection.
With this code I create the JSON response (server side):
var lStream: TStringStream := TStringStream.create;
FDQuery.SaveToStream(lStream,sfJSON);
AResponse.Body.SetStream(lStream,'application/json' ,True);
With this code I try to load the dataset into a TFDMemTable (client side):
lstrstream: TStringStream := TStringStream.create(EMSBackendEndpoint.Response.Content);
aMemtable.LoadFromStream(lstrstream, sfJSON);
The MemTable reports: [FireDac][Stan]-719 invalid JSON storage format
How can that be? I know where the problem is: there are ä, ö and ü characters in my stream. But when I load it from one component into the other it should work, shouldn't it?
Any suggestions for what I can try? What I have tried so far:
- Loading the JSON on the client through UTF8toUnicode. That lets me load the MemTable, but letters like ö, ä and ü go missing.
- Applying UTF8toUnicode on the server side and the reverse on the client side. That produces JSON the MemTable cannot read.
- Loading the JSON into a JSONString and formatting it locally before loading it into the MemTable. That also produces unreadable JSON, because the array and object characters get escaped as well.
JSON is most commonly exchanged using UTF-8, but by default TStringStream does not use UTF-8 on Windows, only on Posix systems. Try using TStringStream.Create(..., TEncoding.UTF8) to force UTF-8.
This assumes that FDQuery.SaveToStream() saves using UTF-8, and aMemtable.LoadFromStream() loads using UTF-8, otherwise you will still have an encoding mismatch.
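To see why a stream written with the Windows default encoding can be rejected by a reader that expects UTF-8, here is a small Python sketch (not Delphi). It assumes the default ANSI code page is Windows-1252, which is typical for Western European locales but not guaranteed; `reads_as_utf8` is a hypothetical helper for illustration.

```python
text = "äöü"

# One byte per letter in the ANSI code page, two bytes per letter in UTF-8.
ansi_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

def reads_as_utf8(data):
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(len(ansi_bytes), reads_as_utf8(ansi_bytes))
print(len(utf8_bytes), reads_as_utf8(utf8_bytes))
```

The ANSI bytes for the umlauts are not valid UTF-8 at all, which is consistent with a strict JSON loader refusing the stream rather than showing wrong letters.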

JSON special chars when running code from console

I am writing an automation tool that mostly sends requests to a server and receives JSON responses. When I run my code directly from IntelliJ, I get a proper response. But when I run my program from the console, special Spanish or French characters are displayed incorrectly.
For example:
We’ve
My code:
RestResponse restResponse = restRequest.sendRequest();
JSONObject jsonResponse = restResponse.getResponseJson();
What might be causing this, and how can I get the foreign-language characters to appear as they should?
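The error might be caused by the character encoding: IntelliJ probably already has an encoding configured, but the console does not.
So you can set the console output encoding to UTF-8. In .NET this is:
Console.OutputEncoding = Encoding.UTF8;
The question's code is Java, where the comparable step is to wrap System.out in a UTF-8 PrintStream, e.g. System.setOut(new PrintStream(System.out, true, "UTF-8")), or to start the JVM with -Dfile.encoding=UTF-8.
You can find some hints here: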
- How to get a UTF-8 JSON
- Encode String to UTF-8
- System.out character encoding

In Python 3, how do I choose the encoding so that I can read the page correctly?

I am trying to use the following command to get the page's source code
requests.get("website").text
But I get an error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xe6' in position 356: illegal multibyte sequence
Then I tried encoding the page text as UTF-8:
requests.get("website").text.encode('utf-8')
But then everything other than the English text turns into the following form:
\xe6°\xb8\xe4\xb9\x85\xe6\x8f\x90\xe4\xbe\x9b\xe5\x85\x8dè\xb4\xb9VPN\xe5\xb8\x90\xe5\x8f·\xe5\x92\x8c\xe5\x85\x8dè\xb4\xb
What can I do?
Thank you for your help
You can check which encoding requests will use through the Response.encoding attribute; the raw, undecoded bytes are available as Response.content.
Whenever you access requests.Response.text, the response object uses requests.Response.encoding to decode those bytes.
This may not always be the correct encoding, so you sometimes have to set Response.encoding manually after looking the charset up in the content, since pages that are not UTF-8 usually declare their encoding there (this is from experience; I'm not sure it is standard behavior).
See the requests documentation on response content for more.
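Looking the charset up in the content can be sketched with the standard library alone. The helper names `declared_charset` and `decode_page` are hypothetical, and the regular expression only covers the common `<meta ... charset=...>` forms; it is not a full HTML parser.

```python
import re

# Matches <meta charset="gbk"> and <meta ... content="text/html; charset=gbk">.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(raw, default="utf-8"):
    """Return the charset declared in the first 2 KB of the page, or the default."""
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default

def decode_page(raw):
    """Decode raw page bytes using the declared charset."""
    return raw.decode(declared_charset(raw), errors="replace")

# Stand-in for Response.content: a GBK-encoded page that declares its charset.
page = '<html><head><meta charset="gbk"></head><body>永久</body></html>'.encode("gbk")
print(declared_charset(page))
print(decode_page(page))
```

With requests, the same idea would be `resp.encoding = declared_charset(resp.content)` before reading `resp.text`. The escaped bytes shown in the question (`\xe6\xb0\xb8\xe4\xb9\x85`) are the UTF-8 encoding of 永久, so that page does appear to be UTF-8.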

getting percent ('%') characters and unicode codepoints in HTTP response

I'm connecting to an HTTP API through a simple GET request.
The expected response is a string representing JSON, which may contain Hebrew (Unicode) characters, but I get something like this (only the beginning is pasted):
%u007b%u0020%u0022%u0053%u0074%u0061%u0074%u0075%u0073...
The result is the same whether I use Ajax or the browser navigation bar directly.
The only place I get the expected JSON string is in the Firefox console, by logging the response object, selecting it, and viewing the responseText property.
I can also replace the percent characters with backslashes, put the result in a Unicode parser and get the correct string.
Does anybody have any idea what is going on?
The response appears to be encoded with the deprecated JavaScript function escape(), which yields the %uXXXX encoding. If that is the case, then the service should instead use encodeURIComponent() or encodeURI().
Your current workaround of manual un-encoding is the right way to go until the service is updated.
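The manual un-encoding can be sketched as a single substitution. This is a Python illustration of what JavaScript's `unescape()` does, with a hypothetical function name `unescape_js`; it handles `%uXXXX` and `%XX` sequences, and it does not recombine surrogate pairs for characters outside the Basic Multilingual Plane.

```python
import re

# %uXXXX is a UTF-16 code unit; %XX is a code point below 256.
ESCAPED = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")

def unescape_js(text):
    """Decode the output of JavaScript's escape() back into plain text."""
    return ESCAPED.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

# The beginning of the response from the question.
sample = "%u007b%u0020%u0022%u0053%u0074%u0061%u0074%u0075%u0073"
print(unescape_js(sample))

# Hebrew characters decode the same way.
print(unescape_js("%u05E9%u05DC%u05D5%u05DD"))
```

The decoded sample starts with an opening brace and the key name `Status`, which matches the expectation that the payload is JSON.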

AFNetworking received non-English character: how to convert?

I am getting JSON response from some web server, say the server returns:
"kən.grætju'leiʃən"
I use AFNetworking and JSONKit, but what I've received is:
"æm'biʃən"
I'm not sure whether it's AFNetworking's problem or JSONKit's problem, but either way, how do I parse and convert the string so that it looks the same as it does on the server?
Thanks
The server may be returning characters encoded in a way that violates the official JSON spec. If those characters are encoded as JSON Unicode escape sequences (like \u1234), then JSONKit and NSJSONSerialization should both handle them fine.
If you can't change the server, you can work around the issue by URL-decoding the string - see https://stackoverflow.com/a/10691541/1445366 for some code to handle it. But if your server isn't following the correct specs, you're likely to run into other issues.
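For comparison, this is what spec-compliant escaping looks like. The sketch uses Python's `json` module rather than JSONKit or NSJSONSerialization, and the sample string is the one from the question; any conforming parser should round-trip it the same way.

```python
import json

original = "kən.grætju'leiʃən"

# ensure_ascii=True (the default) writes every non-ASCII character as \uXXXX.
escaped = json.dumps(original)
print(escaped)

# A conforming parser turns the escapes back into the original characters.
decoded = json.loads(escaped)
print(decoded)
```

Because the escaped form is pure ASCII, it survives transport even when the client and server disagree about the byte encoding.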