I'm following the demo code from the phpsqlgeocode.html article.
In the db, I inserted some Chinese addresses, which are UTF-8 encoded. I found that after urlencoding the Chinese address, the resulting URL is wrong. Like this one:
http://maps.google.com.tw/maps/geo?output=csv&key=ABQIAAAAfG3KxFZXjEslq8VNxMBpKRR08snBovzCxLQZ9DWwpnzxH-ROPxSAS9Q36m-6OOy0qlwTL6Ht9qp87w&q=%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F
Then it outputs 200,5,59.3266963,18.2733433. (I can't query this through PHP, so I tested it through the browser instead.)
The address is actually located in Taichung, Taiwan, but the result points to Sweden, in Europe. Yet when I paste the Chinese address itself (such as 台中市西屯區智惠街131巷56號58號60號) into the URL, the result turns out fine!
How do I make sure it sends out the original Chinese
address? How do I avoid urlencode()? I found that removing urlencode() doesn't change anything.
(I've changed the MAPS_HOST from maps.google.com to maps.google.com.tw.)
(I'm sure my key is right, and geocoding English addresses works fine.)
q=%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F
decodes to:
?????????132?
so something has corrupted the string already before URL-encoding. This could happen if you try to convert the Chinese characters into an encoding that doesn't support Chinese characters, such as Latin-1.
You need to ensure that you're using UTF-8 consistently throughout your application. In particular, you will need to ensure the tables in the database are stored using a UTF-8 character set and collation; MySQL's default is otherwise latin1. You'll also want to ensure your connection to the database uses UTF-8 by calling mysql_set_charset('utf8') (note that MySQL spells the charset name utf8, without the hyphen).
(I am guessing from your question that you're using PHP.)
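For what it's worth, here is a minimal sketch of that advice (the database, table, and column names are made up, and I'm assuming the old mysql_* API from the article's era):

$db = mysql_connect('localhost', 'user', 'password');
mysql_select_db('geocode_demo', $db);
// Make the connection itself speak UTF-8; without this, MySQL may transcode
// results to latin1 and every Chinese character degrades to '?'.
mysql_set_charset('utf8', $db);

$row = mysql_fetch_assoc(mysql_query("SELECT address FROM markers WHERE id = 1"));
// urlencode() works on bytes, so it is safe once the bytes really are UTF-8.
$url = 'http://maps.google.com.tw/maps/geo?output=csv&key=YOUR_KEY'
     . '&q=' . urlencode($row['address']);
echo $url; // should contain %E5%8F%B0-style sequences, not %3F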
Related
I'm working on an application that imports data from a CSV file. I am told that the data in the CSV file comes from SAP, which I am totally unfamiliar with.
My client indicates that there is an issue. One column of data in the CSV file contains postal addresses. Sometimes, the system doesn't see a valid address. Here is a slightly fictionalized example:
1234 MAIN ST A&#C HOUSTON
As you can see, there is a street number, a street name, and a city, all in capital letters. There is no state or zip code specified. In the CSV file, all addresses are assumed to be in the same state.
Normally, when there is text between the street name and the city, it is an apartment number or letter. In the above example, we get errors when we try to use the address with other services, such as Google geolocation. One suggested fix is to simply strip out these special characters, but I believe there must be a better way.
I want to know what this A&#C means. It looks like some sort of escape sequence, but it isn't in a format I'm familiar with. Please tell me what this strange character sequence means.
I'm not totally sure, but I doubt there's a "canonical" escape sequence that looks like this. In the ABAP environment, # is used to replace non-printable characters. It might be that the data was improperly sanitized when imported into the SAP system in the first place, and when writing to the output file, some non-printable character was replaced by #. Another explanation might be that one of the fields contained a non-ASCII Unicode character and the export program failed to convert it to the selected target codepage. It's hard to tell without examining the actual source dataset. Of course, it might also be some programming error or a weird custom field separator...
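Until the source of the corruption is found, you could at least flag the suspect rows instead of geocoding garbage. A rough sketch in PHP (the file name and column position are assumptions; it treats '#' and any non-ASCII byte as red flags, since this feed is supposed to be plain uppercase ASCII):

$fh = fopen('sap_export.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $address = $row[1]; // assumed position of the postal address column
    // '#' is ABAP's stand-in for non-printable characters, so its presence
    // (or any byte outside ASCII) marks the row for manual review.
    if (strpos($address, '#') !== false || !mb_check_encoding($address, 'ASCII')) {
        error_log("Suspect address, skipping geocode: $address");
        continue;
    }
    // ... hand $address to the geocoding service ...
}
fclose($fh);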
I'm trying to store text (with emoji) from an iPhone client app in a MySQL database with Erlang (into a varchar column).
I used to do it with a socket-connection server written in C++ with mysqlpp, and it worked great. (It is the exact same database, so I can assume the issue is not coming from the database.)
However, I decided to move everything to Erlang for scalability reasons, and since then I have been unable to store and retrieve emojis correctly.
I'm using emysql to communicate with my database.
When I'm storing, I'm sending this list to the database:
[240,159,152,130]
When I'm retrieving, here is what I get:
<<195,176,194,159,194,152,194,130>>
There are some similarities, obviously: we can see 159, 152 and 130 on both lines, but no 240. And I do not know where 195, 176 and 194 come from.
I thought about changing the emysql encoding when creating the connection pool:
emysql:add_pool(my_db, 3, "login", "password", "db.mydomain.com", 3306, "MyTable", utf8)
But I can't seem to find the proper atom for UTF-32 encoding. (The interesting thing is that I never set any encoding with C++ and mysqlpp; it worked out of the box.)
I have run some tests:
storing from C++, retrieving from C++ (Works fine)
storing from Erlang, retrieving from Erlang (Does not work)
storing from Erlang, retrieving from C++ (Does not work)
storing from C++, retrieving from Erlang (Does not work)
One more piece of information: I'm using prepared statements in Erlang, while I'm not in C++.
Any help would be appreciated.
As requested, here is the query for storing data:
UPDATE Table SET c=? WHERE id=?
Quite simple really...
It is all about UTF-8 encoding. In Erlang, a list of characters, in your case [240,159,152,130], is normally not encoded; the elements are just Unicode code points. When you retrieved the data you got a binary containing the UTF-8 encoded bytes of those characters: every value above 127 encodes to two bytes (240 becomes 195,176; 159 becomes 194,159; and so on), which is exactly the pattern you see. Exactly where this extra encoding occurred I don't know. From the Erlang shell:
10> Bin = <<195,176,194,159,194,152,194,130>>.
<<195,176,194,159,194,152,194,130>>
11> <<M/utf8,N/utf8,O/utf8,P/utf8,R/binary>> = Bin.
<<195,176,194,159,194,152,194,130>>
12> [M,N,O,P].
[240,159,152,130]
Handling Unicode in Erlang is pretty simple: characters in lists are usually plain Unicode code points and are very rarely encoded, while storing them in binaries means you have to encode them in some way, since binaries are just arrays of bytes. The default encoding is UTF-8. The unicode module has functions for converting between Unicode lists and binaries.
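Incidentally, your original list [240,159,152,130] is itself the UTF-8 byte sequence of one emoji, U+1F602 (code point 128514); somewhere those four bytes were treated as four separate code points and encoded again, which is where the 195/194 pairs come from. Continuing in the shell, the unicode module shows the proper round trip (output formatting may vary between releases):

13> unicode:characters_to_binary([128514]).
<<240,159,152,130>>
14> unicode:characters_to_list(<<240,159,152,130>>, utf8).
[128514]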
I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the other ones, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I retrieve the written text with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then, to insert it in my database, I do:
text= str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It works for one of my text areas, but for the other ones weird characters end up in my db instead.
Any hint on this is appreciated!
RESOLVED:
Sorry for the disturbance; apparently some fields in my database weren't up to date, and this was somehow interfering with the encoding.
You are doing a lot of encoding, decoding, and reencoding which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings) and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like SQLAlchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="utf8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a unicode value. I don't know PyQt, so check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string, probably encoded in UTF-8, so you can decode it with .decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQt to MySQL.
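A minimal sketch of the whole path under those assumptions (Python 2 with MySQLdb and PyQt4; the connection parameters, table, and widget names are invented):

import MySQLdb

# charset='utf8' implies use_unicode: the driver encodes unicode parameters
# to UTF-8 on the way in and decodes results to unicode on the way out.
db = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                     db='mydb', charset='utf8')

# toPlainText() returns a QString in PyQt4; unicode() converts it.
text = unicode(self.ui.text_area.toPlainText())

cur = db.cursor()
# Let the driver handle quoting and encoding; no manual .encode()/.decode().
cur.execute("UPDATE articles SET body = %s WHERE id = %s", (text, 1))
db.commit()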
I have a comment section and submission form that any of my members can submit to.
If a member posts in English, I receive an email update and the comment is posted with no problem. But if they use a language other than English, for example Thai, then all the words (such as สวัสดี) appear as ??????.
I don't know why. I went to check my php.ini file and the unicode/encoding settings are set to UTF-8, the MySQL collation is set to UTF-8 as well, and I made sure the meta tags in the .html/.php files are set to UTF-8 too, but the problem remains.
Any suggestions on what else I missed configuring?
Make sure you are using multibyte-safe string functions, or you might be losing your UTF-8 encoding.
From the PHP mbstring manual:
While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (a byte is made up of eight bits; each bit can contain only two distinct values, one or zero, so a byte can only represent 256 unique values, two to the power of eight). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.

mbstring provides multibyte-specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience.
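To make that concrete, here is a tiny sketch with a Thai string (each Thai character occupies three bytes in UTF-8):

$s = 'สวัสดี';                       // 6 characters, 18 bytes in UTF-8
echo strlen($s);                    // 18: counts bytes
echo mb_strlen($s, 'UTF-8');        // 6:  counts characters
echo substr($s, 0, 4);              // cuts mid-character: corrupted output
echo mb_substr($s, 0, 4, 'UTF-8');  // 'สวัส': splits on character boundaries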
I just found out what was causing the problem:
in php.ini, the line mbstring.internal_encoding was set to something else, so I set it to UTF-8 and then, like magic, everything worked!
I'm developing an application for native-language learning. I need to store some characters such as 'ẽũ'. My database is set to the utf8 charset with a default collation, as is the table affected by these characters.
The problem is when I try to add a row using a regular SQL insert:
INSERT INTO text(spanish,guarani) VALUES('text','ẽũ');
This throws a warning:
Warning Code : 1366 Incorrect string value: '\xE1\xBA\xBD\xC5\xA9' for column 'guarani' at row 1
And the result is "??" where those characters should be.
Question: are these characters not covered by the UTF-8 charset? Which one do I need?
Note: Same problem with latin-1
Thanks.
QUICK!!! Read http://www.joelonsoftware.com/articles/Unicode.html
It is required reading.
Once you have read that, you should ask yourself:
What encoding is the connection using?
Which locale is the collation using (if applicable)?
What encoding is the SQL statement in?
What encoding are the string literals in?
What encoding is the HTML form presented in?
As the other answer says, you really should read and understand the basics of Unicode.
It's not difficult (you can grasp it in a day), it's required knowledge for almost every programmer (and certainly for you), it's knowledge that doesn't go stale, and it will make your life simpler and happier.
These characters are not covered by the UTF-8 charset?
UTF-8 is a Unicode charset, and Unicode covers (practically) every character. MySQL's 'utf8' encoding, on the other hand, is not true UTF-8: it leaves some characters out (those outside the BMP). But that is not your problem here.
http://www.fileformat.info/info/unicode/char/1ebd/index.htm
http://www.fileformat.info/info/unicode/char/169/index.htm
You can see there that your two characters are valid Unicode and inside the BMP (hence even MySQL's crippled 'utf8' should support them), and you can even see their UTF-8 encodings. As you can see, \xE1\xBA\xBD\xC5\xA9 is exactly right, so the problem seems to be elsewhere. Are you sure your DB is utf8?
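A quick way to verify from the MySQL client (using the table name from your question):

-- Which character sets are actually in play?
SHOW VARIABLES LIKE 'character_set%';
-- Does the table (and the guarani column) really say CHARSET=utf8?
SHOW CREATE TABLE text;
-- Force the connection to utf8 before retrying the insert.
SET NAMES utf8;
INSERT INTO text (spanish, guarani) VALUES ('text', 'ẽũ');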