Because my database is about postage stamps there are a lot of half symbols in the descriptions. They type into mysql with no problem but when I get the results of a query all the half symbols have been replaced by a question mark.
Is there a special code I should be using when I input the descriptions instead of using the half symbol? Otherwise, is there another solution, like changing the character set? I'm using utf-8 at the moment.
It seems like you are facing a character encoding issue. You should use UTF-8 everywhere:
Make sure the column that contains text data is encoded as utf8_...
Check that when you get information from database, you keep this encoding. You can force it by sending SET NAMES utf8; before any request to MySQL.
Check that when you display this information the encoding is UTF-8 (in a webpage, that means <meta charset='utf-8'> in <head>).
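As a rough illustration of why every link in the chain has to agree, here is a small sketch in Python rather than MySQL. The sample description is made up, and the three outcomes are only an analogy for what the column, the connection and the page each do with the same bytes:

```python
# Hypothetical stamp description containing the half symbol (U+00BD).
description = "1\u00bdd green"

# Stored as UTF-8, the half symbol takes two bytes: C2 BD.
stored = description.encode("utf-8")

# Read back with the same encoding: the text is intact.
same_encoding = stored.decode("utf-8")

# Read back as Latin-1 by mistake: a stray character appears before the symbol.
wrong_encoding = stored.decode("latin-1")

# Forced into a character set that has no half symbol: it becomes "?".
no_such_char = description.encode("ascii", errors="replace").decode("ascii")

print(same_encoding)
print(repr(wrong_encoding))
print(no_such_char)
```

A question mark in the output usually points at the last case: somewhere between the table and the screen the text was converted to a character set that cannot represent the symbol.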
I am pulling data from database into R. I use the following commands to do this:
drv = dbDriver("MySQL")
con = dbConnect(drv,<credentials>)
dbSendQuery(con,"SET character_set_results = utf8;")
<code to pull data>
The data is stored in UTF-8 encoding in the database. I pull a dataframe with a column containing words. Once I pull the data, I convert the encoding to ASCII//TRANSLIT using iconv(x,"UTF-8","ASCII//TRANSLIT"). Everything works fine, except that for a few words I see an extra character appearing after I change the encoding. For example, when you look in the database you see abc, and when you import it you get abc. But once you change the encoding to ASCII//TRANSLIT, this word changes to abc?. I used https://www.branah.com/unicode-converter to check the encoding. I copied the word abc into the first box, named "unicode text", and I see abc⬠in the box named "utf-8 text". What are these special characters, and how can I use them in a regex to filter them out?
SET character_set_results = utf8 is probably not sufficient. Change to SET NAMES utf8mb4.
What do you mean by "pull the data"? Is it put into a database table? If so, please provide SHOW CREATE TABLE.
To investigate strange characters, do SELECT HEX(...) ... to see what is actually there. From that, we might be able to deduce what happened.
It looks like ⬠is part of the Mojibake for one of these.
⬀,⬁,⬂,⬃,⬄,⬅,⬆,⬇,⬈,⬉,⬊,⬋,⬌,⬍,⬎,⬏,⬐,⬑,⬒,⬓,⬔,⬕,⬖,⬗,⬘,⬙,⬚,⬛,⬜,⬝,⬞,⬟,⬠,⬡,⬢,⬣,⬤,⬥,⬦,⬧,⬨,⬩,⬪,⬫,⬬,⬭,⬮,⬯,⬰,⬱,⬲,⬳,⬴,⬵,⬶,⬷,⬸,⬹,⬺,⬻,⬼,⬽,⬾,⬿
â¬, when treated as latin1, is hex E2AC
⬀ when treated as UTF-8 (utf8mb4), is hex E2AC80
⬁ is hex E2AC81,
etc
The causes of Mojibake are discussed here.
Instead of trying to filter them out, you should fix the code to preserve them.
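The hex values above can be checked outside MySQL as well. The following is only a sketch in Python; the helper names are made up, and the trailing A0 byte (a non-breaking space) is an assumption about what followed the two visible Mojibake characters in the question:

```python
def hex_of(text, encoding):
    """Return the bytes of text in the given encoding as uppercase hex."""
    return text.encode(encoding).hex().upper()


def undo_mojibake(garbled):
    """Reverse one round of 'UTF-8 bytes read as Latin-1'.

    This only helps if that is really what happened to the text.
    """
    return garbled.encode("latin-1").decode("utf-8")


# The two visible Mojibake characters, taken as Latin-1.
print(hex_of("\u00e2\u00ac", "latin-1"))
# The first two symbols of the list above, taken as UTF-8.
print(hex_of("\u2b00", "utf-8"))
print(hex_of("\u2b01", "utf-8"))
# Assumed full sequence: the two visible characters plus a non-breaking space.
print(undo_mojibake("abc\u00e2\u00ac\u00a0") == "abc\u2b20")
```

If the last line prints True for your own data, the stored bytes are fine and only the interpretation on the way out is wrong.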
I'm trying to load data from a MySQL DB from a varchar(35) / utf8_swedish_ci field through TBS (tinybutstrong) and PHP using the example (MySQL data merge). My issue is that data loads fine if only ASCII characters are in the fields, but as soon as I add a single Scandinavian special character like ö or ä, the field's contents vanish entirely while the other fields in the row display correctly.
My understanding is that the latest versions of TBS automatically use UTF-8 coding (I have 3.9.0 for PHP 5), so I assumed it would work out of the box. To be safe, I even added the coding to the template like so:
$TBS->LoadTemplate('mysql.html','UTF-8');
but to no avail.
Could someone please advise what is causing this?
For UTF-8 to be processed correctly, every element of the chain must be UTF-8.
You have to ensure that your template is UTF-8 : check the entered text and the HTML element <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
You have to ensure that all your PHP scripts are UTF-8 and not Ansi.
You also have to ensure that your MySQL connection is set to receive UTF-8 queries and to return UTF-8 item data. This can be done for example by querying the SQL : SET NAMES 'UTF8'
I am currently working on a site that connects to a DB and fetches information. Some of this information has special characters because it is in Polish. For example, in the database I have ę and I get e printed on my web page. I already added the meta tag
<meta charset="ISO-8859-2">
but it doesn't work. It only works if I write &#281;, which is not practical and needs a lot of work. My question is whether somebody has done this: get the character, like ę, and print it just like that?
Thanks.
Make sure that:
the data really is in ISO-8859-2
the data isn't being corrupted by the configuration of the database
the HTTP headers aren't claiming the data is encoded a different way
whatever you are using to pull the data out of the database isn't transcoding it
You should also ditch ISO-8859-2 (as it is very legacy) and move to UTF-8.
Use a Unicode entity. &#xxxx; where xxxx is the Unicode value for the character.
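To find the number for a given character, something along these lines works. This is a sketch in Python; numeric_reference is a made-up helper name, not part of any library:

```python
import html


def numeric_reference(char, hexadecimal=False):
    """Build an HTML numeric character reference for a single character."""
    code_point = ord(char)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


e_ogonek = "\u0119"  # the Polish letter from the question

print(numeric_reference(e_ogonek))        # decimal form
print(numeric_reference(e_ogonek, True))  # hexadecimal form
# html.unescape turns the reference back into the character.
print(html.unescape(numeric_reference(e_ogonek)) == e_ogonek)
```

This sidesteps the page encoding entirely, because the reference itself is plain ASCII, but it does not fix whatever is mangling the data on the way out of the database.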
I have a multilingual site and I am having a problem inserting Chinese meta tags. These are transformed into question marks.
Is there a way how I can achieve this?
Many thanks
--EDIT--
The table storing the SEF Urls is in the latin1_swedish_ci character set. How can I change this single table to utf8_general_ci without breaking the URLs?
Many thanks!
Make sure that:
The character encoding you are using includes those characters (UTF-8 is safe)
Your editor is configured to use that character encoding
Your database (if these details are stored in one) is configured to use that encoding
Your webserver is configured to output a charset parameter on the Content-type header (and it uses the correct encoding)
Your browser is not configured to ignore the specified encoding
Use numeric character references.
EDIT
wiki numeric character references
Convert Chinese characters to Unicode
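For a whole string, the conversion does not have to be done by hand. As a sketch, Python's built-in xmlcharrefreplace error handler replaces every character that does not fit in ASCII with its numeric reference; the helper name and the sample keywords are invented for the example:

```python
import html


def to_references(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical meta keywords; the last two characters are Chinese.
keywords = "news, \u4e2d\u6587"
encoded = to_references(keywords)

print(encoded)
# A browser (or html.unescape) reads the references back as the characters.
print(html.unescape(encoded) == keywords)
```

The result is pure ASCII, so it survives a latin1 table or a mislabelled page, at the cost of being unreadable in the source.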
Are you retrieving the data from a database?
If so, ensure that your connection character set is also set to utf-8.
In MySQL for example you would need to issue this query before any other:
SET NAMES 'utf8';
It could be that you need to encode the Chinese characters to HTML entities, or specify a character set.
Have you checked the character set in your document headers? I usually use UTF-8 to display Chinese characters.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you're using a program like dreamweaver, make sure your files are actually being SAVED in the correct character set as well. We had a problem where characters in a dreamweaver file were coming through as ???? because the editor itself was set to iso-8859-1
Maybe your Browser - or more exactly, the font you selected to display the page - doesn't support chinese characters. What system and browser is this on?
I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have unicode characters such as \xae and \u2122 (registered symbol and trademark symbol respectively).
For example, suppose I have a row with a field this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the its-there-after-you-cut-and-paste behavior is that the data is in the database but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"BingÂ® BlangÂ™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right place, but both are preceded by the unicode character \xc2, which is an A with a circumflex on top.
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the default encoding on the table was latin1 before it was switched to utf8, then the encoding was switched (via alter table..) then would the encoding have actually been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text has U+99 as the character after "Blang": when you set MySQL to output UTF8, you see "Â™", which is the UTF8 sequence for U+99 (bytes C2 99) displayed on a terminal that is interpreting the byte stream as Windows-1252.
U+99 probably isn't what you wanted: In Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, that 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
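The byte arithmetic behind this can be reproduced in a few lines. This is only a sketch in Python, using the cp1252 codec as a stand-in for the terminal:

```python
# U+0099 is the control character that apparently ended up in the column.
stored = "\u0099"
utf8_bytes = stored.encode("utf-8")   # what is sent on a utf8 connection
shown = utf8_bytes.decode("cp1252")   # what a Windows-1252 terminal draws

# The real trademark sign, for comparison, has a different UTF-8 encoding.
trademark_bytes = "\u2122".encode("utf-8")

print(utf8_bytes.hex(), repr(shown))
print(trademark_bytes.hex())
# In Windows-1252 the single byte 0x99 is the trademark sign.
print(b"\x99".decode("cp1252") == "\u2122")
```

The two-character result of the first print (A with circumflex followed by the trademark sign) matches what the question reports after switching the client to utf8.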
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that on the command line, as you've found, or by executing the statement SET NAMES utf8; in your database handle before doing anything else (SET NAMES takes a character set name, not a collation name such as utf8_general_ci). Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but sets all the required character set variables (there are three!) at once.
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8, is properly marked as such (via the MIME Type, and the <meta> tag.) Be sure also, that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or similar libraries to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero-expanding every byte to 16 bits and then claiming this is UTF-16; that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know if it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to only do insertions this way: reading back data from the data with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another... But it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
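To make the Latin1 versus Windows-1252 point concrete, here is a sketch of converting raw input with a proper codec. It is written in Python; to_utf8 and the sample bytes are invented for the example:

```python
def to_utf8(raw, source_encoding):
    """Convert raw input bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical form input: "Blang" followed by the single byte 0x99.
raw = b"Blang\x99"

# Declared as Windows-1252, byte 0x99 becomes the trademark sign U+2122.
as_cp1252 = to_utf8(raw, "cp1252")

# Declared as Latin-1 (the same as zero-expanding each byte),
# it becomes the control character U+0099 instead.
as_latin1 = to_utf8(raw, "latin-1")

print(as_cp1252.hex())
print(as_latin1.hex())
```

The second result is exactly the bad data described above: valid UTF-8, but for the wrong character.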
If, when you pull the data out with Python and display it in a web page, you see the string "Â™", then that is an indication that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. Probably it is just defaulting to Latin1, which as noted above will really be Windows-1252.
Nonetheless, even if you fix the display, note that the data base has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data, by reading all the data, and replacing any characters in the range of U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally -- then this data is, alas, just junk.
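A cleanup pass of that kind might look like the following sketch. It is Python, the function name is made up, and it assumes the stray characters really did start life as Windows-1252 bytes:

```python
def repair_c1_controls(text):
    """Replace U+0080..U+009F with the characters Windows-1252 assigns to them."""
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (such as 0x81) are unassigned in Windows-1252;
                # leave those characters untouched.
                pass
        repaired.append(ch)
    return "".join(repaired)


damaged = "Bing\u00ae Blang\u0099 Blaow"
print(repair_c1_controls(damaged))
```

Characters outside the U+0080 to U+009F range, including the registered sign, pass through unchanged, so running the function twice is harmless.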
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column is "defaulted", nor does it keep it in sync with the table's default.)
I think it has to do with the settings of the MySQL connection in your Python code.
Try setting conn.character_set_name or something like that; it depends on the MySQL connection library you are using.
In the case of MySQLdb it should be something like this:
import types  # stands in for the Python 2-only "new" module
def character_set_name(*args, **kwargs):
    return 'utf-8'
conn.character_set_name = types.MethodType(character_set_name, conn)
Could it be that some of the columns have an explicitly different character set than the table default?
something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci