I have a comment section with a submission form that any of my members can use. If a member posts in English, I receive an email update and the comment is posted with no problem. But if they post in a language other than English, Thai for example, every word (e.g. สวัสดี) appears as ??????.
I don't know why. I checked my php.ini file and the encoding is set to UTF-8, the MySQL collation is set to UTF-8, and I made sure the meta tags in the .html/.php files are set to UTF-8 as well, but the problem remains.
Any suggestions on what else I missed in the configuration?
Make sure you are using multibyte-safe string functions, or you might be losing your UTF-8 encoding.
From the PHP mbstring manual:
While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (a byte is made up of eight bits; each bit can contain only two distinct values, one or zero, so a byte can only represent 256 unique values, two to the power of eight). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.

mbstring provides multibyte-specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience.
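The corruption the manual describes is easy to reproduce. A minimal sketch comparing a byte-oriented function with its multibyte counterpart, using the Thai greeting from the question:

```php
<?php
// "สวัสดี" is 6 characters but 18 bytes in UTF-8 (3 bytes per Thai character).
$greeting = "สวัสดี";

// Byte-oriented substr() slices through the middle of a character
// and returns a single broken byte, not "ส":
var_dump(substr($greeting, 0, 1));

// mb_substr() counts characters, so the result is a valid string "ส":
var_dump(mb_substr($greeting, 0, 1, "UTF-8"));

// strlen() counts bytes (18) while mb_strlen() counts characters (6):
var_dump(strlen($greeting));
var_dump(mb_strlen($greeting, "UTF-8"));
```

The same byte/character distinction is why a non-multibyte-aware trim or split can quietly destroy Thai text.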
I just found out what was causing the problem: in php.ini, the line mbstring.internal_encoding was set to something else. I set it to UTF-8 and, like magic, everything now works!
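For reference, the relevant php.ini lines look something like this (a sketch; note that mbstring.internal_encoding was later deprecated in PHP 5.6+ in favor of default_charset):

```ini
; php.ini -- make mbstring assume UTF-8 internally
mbstring.internal_encoding = UTF-8
; on PHP 5.6 and later, the preferred setting is:
default_charset = "UTF-8"
```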
Related
I am making a table of users where I will store all their info: username, password, etc. My question is: is it better to store usernames in VARCHAR in a utf-8 encoded table, or in CHAR? I am asking because CHAR is only 1 byte and utf-8 encodes some characters with up to 3 bytes, and I do not know whether I might lose data. Is it even possible to use CHAR in that case, or do I have to use VARCHAR?
In general, the rule is to use CHAR under the following circumstances:
You have short codes that are all the same length (think state abbreviations).
Sometimes when you have short codes that might differ in length, but you can count the characters on one hand.
When the powers-that-be say you have to use CHAR.
When you want to demonstrate how padding with spaces at the end of the string causes unexpected behavior.
In other cases, use VARCHAR. In practice, users of the database don't expect a bunch of spaces at the end of strings.
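On the data-loss worry: in MySQL, the length of both CHAR(n) and VARCHAR(n) is measured in characters, not bytes, so a utf8 column never truncates mid-character; the engine reserves up to three bytes per character behind the scenes. A sketch (table and column names are made up):

```sql
-- Both columns hold up to the stated number of *characters*,
-- regardless of how many bytes each character needs in utf8.
CREATE TABLE users (
    country_code CHAR(2)     CHARACTER SET utf8 NOT NULL,  -- fixed-length code
    username     VARCHAR(32) CHARACTER SET utf8 NOT NULL   -- variable-length text
);
```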
Can anyone explain the differences between scan and binary scan, and between format and binary format? I am getting confused by the binary commands.
To understand the difference between command sets manipulating binary and string data you have to understand the distinction between these two kinds of data.
In Tcl, as in many (most?) high-level languages, strings are rather abstract — that is, they are described in pretty high-level terms. Particularly in Tcl, strings are defined to have the following properties:
They contain characters from the Unicode repertoire.
The Tcl runtime provides the set of standard commands to operate on strings — such as indexing, searching, appending to, extracting a substring etc.
Note that many things are left out from this definition:
The encoding in which these Unicode characters are stored.
How exactly they are stored (NUL-terminated arrays? linked lists of unsigned longs? something else?).
(To put it into a more interesting perspective, Tcl is able to transparently change the underlying representations of strings it manages — between UTF-8 and UTF-16 encoded sequences. But here we're talking about the reference Tcl implementation, and other implementations (such as Jacl for instance) are free to do something else completely.)
The same approach is used to manipulate all the other kinds of data in the Tcl interpreter. Say, integer numbers are stored using native platform "integers" (roughly "as in C") but they are transparently upgraded into arbitrary sized integers if an arithmetic operation is about to overflow the platform-sized result.
So long as you don't leave the comfortable world of the Tcl interpreter, this is all you should know about the data types it manages. But now there's the outside world. In it, abstract concepts which are Tcl strings do not exist. Say, if you need to communicate to some other program over a network socket or by means of using a file or whatever other kind of media, you have to get down to the level of exact layouts of raw bytes which are described by "wire protocols" and file formats or whatever applies to your case. This is where "binaries" come into play: they allow you to precisely specify how the data is laid out so that it's ready to be transferred to the outside world or be consumed from it — binary format makes these "binaries" and binary scan reads them.
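A minimal sketch of that round trip (the values packed here are arbitrary):

```tcl
# Lay out a 16-bit big-endian integer followed by two single bytes...
set wire [binary format Scc 258 1 2]

# ...then read the same layout back out of the byte string:
binary scan $wire Scc num b1 b2
puts "$num $b1 $b2"    ;# prints "258 1 2"
```

The format string ("Scc") is what ties the two commands together: it is the precise byte-level layout that a wire protocol or file format would specify.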
Note that certain Tcl commands for working with the outside world are "smart by default" — for instance, the open command which opens files by default assumes they are textual and are encoded in the default system encoding (which is deduced, broadly speaking, from the environment). You can then use the chan configure (or fconfigure — in older versions of Tcl) command to either change this encoding or completely inhibit conversions by specifying that the channel is in "binary mode". The same applies to EOL conversions.
Note also that there are specialized packages for Tcl that effectively hide the complexities of working with a particular wire/file format. To present one example, the tdom package works with XML; when you manipulate XML using this package, you're not concerned with how exactly XML must be represented when, say, saved to a file — you just work with tdom's objects, native Tcl strings etc.
The docs are pretty good and contain examples:
scan: http://www.tcl.tk/man/tcl8.6/TclCmd/scan.htm
format: http://www.tcl.tk/man/tcl8.6/TclCmd/format.htm
binary scan: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M42
binary format: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M16
Maybe you could ask a more specific question?
The format command assembles strings of characters, the binary format command assembles strings of bytes. The scan and binary scan commands do the reverse, extracting information from character strings and byte strings respectively.
Note that Tcl happens to map byte strings neatly onto character strings where the characters are in the range \u0000–\u00FF, and there are other operations for getting information into and out of binary strings that are sometimes relevant. Most notably, encoding convertto and encoding convertfrom: encoding convertto formats a string as a sequence of bytes that represent that string in a given encoding (an operation which can lose information) and encoding convertfrom goes in the opposite direction.
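For example (a sketch; the string is arbitrary):

```tcl
# A character string -> the bytes of its UTF-8 encoding:
set bytes [encoding convertto utf-8 "héllo"]
puts [string length $bytes]   ;# 6: "é" became two bytes

# ...and the bytes back into a character string:
set str [encoding convertfrom utf-8 $bytes]
puts [string length $str]     ;# 5 characters again
```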
So what encoding are Tcl's strings really in? Well, none really. Or many. The logical level works with character sequences exclusively, and the implementation will actually move things back and forth (mostly between a variant of UTF-8 and UCS-2, though with optimisations for handling byte strings via arrays of unsigned char) as necessary. While this is not always perfectly efficient, most code never notices what's going on due to the type-caching used.
If you have Tcl 8.6, you can peek behind the covers to observe the types with an unsupported command:
# Output is human-readable; experiment to see what it says for you
puts [tcl::unsupported::representation $MyString]
Don't use this to base functional decisions on; Tcl is very happy to mutate types out from under your feet. But it can help when finding out why your code is unexpectedly slow. (Note also that types attach to values, and not to variables.)
I have MS Access 2010 forms linking to a mySQL5 (utf8) database.
I have the following data stored in a varchar field:
"Jaros&#322;aw Kot"
MS Access just displays this raw, as opposed to converting it to:
Jarosław Kot
Can anyone offer assistance?
Thanks Paul
The notation &#322; is a character reference in SGML, HTML, and XML. There is in general no reason to expect any software to treat it as anything but a literal of six characters “&”, “#”, etc., unless the software is interpreting the data as SGML, HTML, or XML.
So if you have data stored so that &#322; should be interpreted as a character reference, then you should convert the data, statically or dynamically. The specifics depend on what the actual data is: for example, do all the constructs use decimal notation (not hexadecimal), and is it certain that all numbers are to be interpreted as Unicode numbers for characters?
If I understand you correctly, you can use the Replace function:
Replace("Jaros&#322;aw Kot", "&#322;", "ł")
Assuming that your MySQL database character set is effectively set to UTF-8 and that all inserts and updates are UTF-8 compatible (I do not know that much about MySQL, but SQL Server has some specific syntax rules for UTF-8-compliant data...), you could then convert the available HTML data to plain UTF-8 data.
You will not have any problem finding a conversion table (here, for example), and if you are lucky you might even find a ready-made conversion function...
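If scripting the conversion outside Access is an option, any language with an HTML entity decoder will do. A sketch in Python, using the name from the question:

```python
from html import unescape

# Numeric character references such as &#322; decode to their
# corresponding Unicode characters.
stored = "Jaros&#322;aw Kot"
print(unescape(stored))  # Jarosław Kot
```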
I have a weird encoding problem going from my PyQt app to my MySQL database.
I mean weird in the sense that it works in one case and not the others, even though I seem to be doing exactly the same thing in all of them.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I get the written text with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then, to insert it in my database, I do:
text= str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It works for one of my text areas, but for the others it puts weird characters in my db instead.
Any hint appreciated on this !
RESOLVED:
Sorry for the disturbance; apparently some fields in my database weren't up to date, and this was somehow interfering with the encoding.
You are doing a lot of encoding, decoding, and reencoding which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings) and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like Sqlalchemy, you probably don't need to do anything. If you use MySQLdb directly make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a unicode value. I don't know PyQt. Check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string, probably encoded in UTF-8, so you can decode it with result.decode('utf8'), which will give you a Unicode object.
Once your code is dealing only with Unicode objects and no more encoded byte strings, you don't need to do any encoding or decoding at all. Just pass the strings directly from PyQt to MySQL.
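The whole pipeline then reduces to at most one decode at the boundary. A sketch (the MySQLdb lines are illustrative and commented out; the table and column names are made up):

```python
# If the toolkit hands you an encoded byte string, decode it exactly once:
raw = u"héllo".encode("utf-8")   # stand-in for a UTF-8 byte string from PyQt
text = raw.decode("utf-8")       # now a proper Unicode string again

# From here, pass `text` straight to a Unicode-aware connection, e.g.:
#   conn = MySQLdb.connect(..., charset="utf8")
#   cursor.execute("INSERT INTO comments (body) VALUES (%s)", (text,))
assert text == u"héllo"
```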
I'm developing an application for native-language learning. I need to store some characters such as 'ẽũ'. My database is set to the utf-8 charset with the default collation, as is the table affected by these characters.
The problem is when I try to add a row using a regular SQL insert:
INSERT INTO text(spanish,guarani) VALUES('text','ẽũ');
This throws a warning:
Warning Code : 1366 Incorrect string value: '\xE1\xBA\xBD\xC5\xA9' for column 'guarani' at row 1
And the result is "??" where those characters should be.
Question: Are these characters not covered by the UTF-8 charset? If not, which one do I need?
Note: Same problem with latin-1
Thanks.
QUICK!!! Read http://www.joelonsoftware.com/articles/Unicode.html
It is required reading.
Once you have read that, you should ask yourself:
What encoding is the connection using?
What locale is the collation using? (If applicable.)
What encoding is the SQL statement in?
What encoding are the string literals in?
What encoding is the HTML form presented in?
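The first item on that list is the one most often wrong. In MySQL, the connection encoding can be set explicitly at the start of a session (a sketch):

```sql
-- Tell MySQL that this connection sends and expects UTF-8:
SET NAMES 'utf8';
```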
As the other answer says, you really should read and understand the basics of Unicode.
It's not difficult (you can grasp it in a day), it's required knowledge for almost every programmer (and certainly for you), it's non-ephemeral knowledge, and it will make your life simpler and happier.
These characters are not covered by the UTF-8 charset?
UTF-8 is a Unicode encoding, and Unicode covers (practically) every character. MySQL's 'utf8' encoding, on the other hand, is not true UTF-8: it leaves some characters out (those outside the BMP). But that is not your problem here.
http://www.fileformat.info/info/unicode/char/1ebd/index.htm
http://www.fileformat.info/info/unicode/char/169/index.htm
You can see there that your two characters are valid Unicode and inside the BMP (hence MySQL's crippled 'utf8' should support them), and you can even see their UTF-8 encodings. As you see, \xE1\xBA\xBD\xC5\xA9 is exactly right. So the problem seems to be elsewhere. Are you sure your connection to the DB is utf8?
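You can verify those bytes yourself; for instance, with a couple of lines of Python:

```python
# ẽ is U+1EBD and ũ is U+0169; their UTF-8 encodings are exactly the
# bytes MySQL complained about in the warning from the question.
s = u"\u1ebd\u0169"            # 'ẽũ'
encoded = s.encode("utf-8")
assert encoded == b"\xe1\xba\xbd\xc5\xa9"
```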