I have worked with Postgres and Perl for a decade.
One of my oldest still-operating applications, a dictionary of government addresses and departmental responsibilities, has trouble handling query terms containing accented characters, for example köln. In other words, whenever a query term contains an accented character (mainly umlauts), zero results are returned.
I should mention that this behavior only occurs when the application runs against Postgres. If I switch to MySQL 5 (same data), the same queries work correctly.
Trying to track down the cause of this problem, I have checked the following:
Postgres database is UTF-8 (using the command show server_encoding;)
Postgres client encoding is also UTF8 (using show client_encoding;)
If I use the Postgres monitor (psql) and execute the same SQL query the application issues, with accented characters in the query term, I get correct results
The Perl application itself handles UTF-8: the HTML header is set correctly and the output displays correctly, not garbled
All Perl code files, scripts, .pm package files and templates are UTF-8 encoded (I verified that with file --mime perl_file_name)
I fiddled with the database connection, setting $self->{dbh}->{pg_enable_utf8} = 1; and/or $self->{dbh}->do("SET CLIENT_ENCODING TO 'UTF8';"); and/or $self->{dbh}->do("SET NAMES 'UTF8';"); with no change (see the sketch after this list)
I've updated the DBD::Pg module to version 3.6.2, no change.
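For reference, a minimal sketch of the connect/query code involved (database, table and column names are placeholders, not the real app): if pg_enable_utf8 is 1, DBD::Pg expects decoded Perl character strings as bind values, so the CGI input would need an explicit decode first; my suspicion is that the raw bytes from the browser never get decoded.

use strict;
use warnings;
use DBI;
use CGI;
use Encode qw(decode);

my $q    = CGI->new;
# The browser sends raw UTF-8 bytes; turn them into Perl characters first
my $term = decode('UTF-8', scalar($q->param('q')) // '');

my $dbh = DBI->connect(
    'dbi:Pg:dbname=addresses', 'user', 'password',
    { RaiseError => 1, pg_enable_utf8 => 1 },
);

# With decoded input and pg_enable_utf8 => 1, a term like 'köln' should match
my $rows = $dbh->selectall_arrayref(
    'SELECT department FROM addresses WHERE city = ?',
    undef, $term,
);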
So I am pretty much out of ideas about what else to check or try to get Postgres fully working. As mentioned above, the same application works flawlessly with MySQL as the database.
Two years ago the application was changed to handle UTF-8 data. I did not make the changes myself, but as far as I can see in the code (compared with the code in my Git repo), it's just the HTML UTF-8 header print "Content-type: text/html; charset=utf-8\n\n"; and a few unrelated template parts. Perhaps that change is the origin of all these problems, but I don't know what specifically to adjust for Postgres.
The current Perl version is 5.22.1, using Apache/2.2.22 (Ubuntu). The vhost configuration is simple:
AddHandler cgi-script .cgi .pl
ScriptAlias /...abs-path-to-app.../cgi-bin/
<Directory "/...abs-path-to-app.../cgi-bin/">
AllowOverride None
Options +Indexes +ExecCGI +MultiViews +SymLinksIfOwnerMatch
<IfVersion < 2.4>
Allow from all
</IfVersion>
<IfVersion >= 2.4>
Require all granted
</IfVersion>
</Directory>
Postgres is version 9.1.24.
Edit:
Collate and Ctype are set to en_US.UTF-8, and Encoding is set to UTF-8 for the database in question.
Looking into the tables, all character varying columns use the pg_catalog."default" collation. Executing show lc_collate; shows the already mentioned en_US.UTF-8.
Edit2:
Setting the DBD::Pg flag pg_enable_utf8 to 0 seems to work, and I get the expected results. Using any other value, for example -1 or 1, does not work. I set the flag (once again) right after the database connect. I still have to verify this, as I do not really understand what's going on yet.
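For reference, the flag's three documented values in DBD::Pg 3.x, with my reading of why 0 helps here:

$dbh->{pg_enable_utf8} = -1;  # default: decode results when client_encoding is UTF8
$dbh->{pg_enable_utf8} =  1;  # always flag returned strings as UTF-8 characters
$dbh->{pg_enable_utf8} =  0;  # never decode; hand the raw bytes back unchanged

# My guess: the rest of the application handles strings as raw UTF-8 bytes
# throughout, so undecoded result bytes line up with the undecoded CGI input
# and nothing gets decoded or encoded twice.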
Related
We have a website that deals with Chinese characters; it was hosted on AWS.
There I could save Chinese characters in the database without any problem.
Now we have moved to Google Cloud, and I am facing an issue saving Chinese characters in the database.
They display as ä¸€åœ°å…©æª¢.
I am following all the rules, like "column should be utf8-unicode-ci" and "database connection as utf8".
It is working fine on localhost.
Any idea what the problem could be?
Thanks.
If the data (column) in the database holds the same UTF-8-encoded data in both cases, and the code/platform that handles the data in the web page is the same (i.e. not Python 2 vs. Python 3, for example), the difference might be the current locale setting: either on the Google server (environment variables), in the SQL client (UTF-8 settings), or in the PHP settings.
Let's start with the SQL client:
Try running the PHP function
mysqli_character_set_name($link)
to get the current connection encoding. If it is not UTF-8, set it with
mysqli_set_charset($link, 'utf8')
If this is not enough, make sure the HTML side matches by setting the charset in the META tag to UTF-8
charset=utf-8
and enforce it in PHP with
declare(encoding='utf8')
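Since much of this thread is Perl, here is the same idea via Perl's DBD::mysql for comparison (credentials are placeholders; mysql_enable_utf8 is the documented attribute for a UTF-8 connection):

use DBI;

my $dbh = DBI->connect(
    'dbi:mysql:database=site;host=localhost', 'user', 'password',
    { RaiseError => 1, mysql_enable_utf8 => 1 },  # UTF-8 connection charset
);

# Equivalent manual form, mirroring the mysqli call above:
$dbh->do("SET NAMES utf8");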
Looks like you have latin1 somewhere in the processing.
ä¸€åœ°å…©æª¢ is "Mojibake" for 一地兩檢
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored
Some Chinese characters take 4 bytes in UTF-8, not just 3, so I recommend you use utf8mb4, not plain utf8.
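If the column or table is still declared as plain utf8, the conversion is a single statement; the table name and collation below are only an example, and the connection charset should be switched to match:

# Example only: make room for 4-byte characters, then match the connection
$dbh->do("ALTER TABLE articles CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
$dbh->do("SET NAMES utf8mb4");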
I am using the terminal on a Chromebook to ssh into a remote server. When I run a MySQL (5.6) select query, sometimes one of the fields returns nonsense unicode (when the field should contain an email address) and changes the MySQL prompt from:
mysql>
to
└≤⎽─┌>
and whatever text I type is rendered as weird unicode. The problem persists even after I exit MySQL.
One of the values in your database happened to have the byte sequence 0x1B, 0x28, 0x30 (ESC ( 0) in it. When you ran the query, MySQL printed this byte sequence directly to your console. You can reproduce the effect from Python 2:
>>> print '\x1B\x28\x30'
Consoles use control characters (in particular 0x1B, ESC) as a way to let applications control aspects of the console other than pure text, such as colours and cursor movements. This behaviour is inherited from the old dumb-terminal devices they pretend to be (which is why they are also known as terminal emulators), along with some weirder tricks we probably don't need any more. One of those is to switch permanently between different character sets (which we would call encodings now, although this long predates Unicode).
One of those alternative character sets is the DEC Special Graphics Character Set which it looks like you have here. In this character set the byte 0x6D, usually used in ASCII for m, comes out as the graphical character └.
You could in principle reset your terminal to normal ASCII by printing the byte sequence 0x1B, 0x28, 0x42 (ESC ( B), but this tends to be a pain to arrange when your console is displaying rubbish.
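If you can still get a clean shell prompt, emitting that sequence yourself is a one-liner (shown here via Perl; running the standard reset command achieves much the same):

perl -e 'print "\x1b\x28\x42"'   # ESC ( B: switch G0 back to US-ASCII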
There are potentially other ways your console can become confused; it is not, in general, safe to print arbitrary binary data to the console. There even used to be nastier things you could do with the console by faking keyboard input, which made this a security problem, but today it's just an annoyance.
However, one wouldn't normally expect control codes in an e-mail address field. I suggest that the application using the database should validate the input it receives, dropping or blocking all control codes (other than newlines where necessary).
As a quick hack to clean this field for the specific case of the ESC character, you could do something like:
UPDATE things SET email=REPLACE(email, CHAR(0x1B), '');
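And on the input side, a minimal Perl sketch of the validation suggested above (the variable name is illustrative): strip every C0 control character except newline before the value is stored.

# Remove control characters 0x00-0x1F (keeping \x0A, newline) and DEL (0x7F)
$email =~ s/[\x00-\x09\x0B-\x1F\x7F]//g;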
NOTE: I need to rewrite this question, as I didn't know what I was talking about previously. I am still confused about encoding.
I have some markup like this in a yml file:
"the priestess\xE2\x80\x99s"
"what about \xE2\x80\x9D"
The above are UTF-8 byte sequences written out as escaped literals.
I load the yml file using File.open(SITE_PATH) without specifying any encoding. Then I put the markup into a column in a MySQL table. The column's collation is latin1_swedish_ci. (I am not sure why latin1_swedish_ci, but it is always there.)
I open up phpMyAdmin and find something like priestess’s
Does anyone have any idea what I should do? (I am using Ruby 2.1.)
Is there any good, straightforward way to connect to a MySQL database using MySQL's normal command-line client while connected via PuTTY, and get it to render UTF-8 fields that include non-Western characters properly?
My tables use UTF-8 encoding, and in normal use the values come from an Android app and are displayed by an Android app. The problem is that occasionally Things Go Wrong, and when they do it's almost impossible for me to figure out what's going on, because MySQL's command-line client forcibly casts UTF-8 values to (what appears to be) ISO-8859-1 (i.e. quasi-random gibberish on screen). For what it's worth, Toad for MySQL (both free and beta) seems to mangle UTF-8 output the same way.
On a semi-related note, my favorite monospaced font is Andale Mono (yeah, I really like the forcibly-disambiguated 0/O and 1/l characters). I'm pretty sure it doesn't include CJK characters. Is there any (free) utility that can rip the lower 127 or 256 characters from one TrueType font (like Andale Mono) and create a new TrueType font based on some CJK TrueType font, replacing its lower 127 or 256 characters with the glyphs ripped from Andale Mono?
First, make sure that your console encoding is set to UTF-8.
In PuTTY you do this by setting the charset dropdown under "Window" > "Translation" to UTF-8.
Second, MySQL distinguishes between the data charset and the connection charset.
When your data is UTF-8 encoded but your connection charset is set to e.g. ISO-8859-1, MySQL will automatically convert the output.
The easiest way to set the charsets permanently is to update your client my.cnf with the following:
[client]
default-character-set=utf8
Detailed information about the connection charset can be found here:
http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html
When using the MySQL API functions (e.g. from a PHP client) you can set the connection charset by sending the query
SET NAMES utf8
Various implementations of the MySQL API also support setting the charset directly.
e.g. http://www.php.net/manual/en/mysqli.set-charset.php
I've found a Perl regexp that can check whether a string is valid UTF-8 (the regexp is from the W3C site).
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
But I'm not sure how to port it to MySQL, as it seems that MySQL doesn't support the hex representation of characters; see this question.
Any thoughts on how to port the regexp to MySQL?
Or maybe you know another way to check whether a string is valid UTF-8?
UPDATE:
I need this check to work inside MySQL, as I have to run it on the server to repair broken tables. I can't pass the data through a script, as the database is around 1 TB.
I've managed to repair my database using a test that only works if your data can be represented in a one-byte encoding; in my case that was latin1.
I used the fact that MySQL changes characters that can't be represented in the target encoding to '?' when converting to latin1.
Here is what the check looks like:
SELECT (
    CONVERT(
        CONVERT(potentially_broken_column USING latin1)
        USING utf8)
    != potentially_broken_column
) AS INVALID ....
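If you also want a list of the affected rows first, the same round-trip test can be driven from Perl while the comparison still runs entirely inside MySQL, so only the ids of broken rows travel over the wire (assuming an existing DBI handle $dbh; table and column names are placeholders):

my $bad_ids = $dbh->selectcol_arrayref(q{
    SELECT id
    FROM   articles
    WHERE  CONVERT(CONVERT(body USING latin1) USING utf8) <> body
});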
If you are in control of both the input and output sides of this DB, then you should be able to verify that your data is UTF-8 on whichever side you like and implement constraints as necessary. If you are dealing with a system where you don't control the input side, then you are going to have to check the data after you pull it out, and possibly convert it in your language of choice (Perl, it sounds like).
The database is a really good storage facility, but it should not be pressed into doing your application's work. I think this is one spot where you should just let MySQL hold the data until you need to do something further with it.
If you want to continue on the path you are on then check out this MySQL Manual Page: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
Regex syntax is normally VERY similar between languages (in fact I can almost always copy between JavaScript, PHP, and Perl with only minor adjustments for their wrapping functions), so if that is a working regex you should be able to port it easily.
GL!
EDIT: Look at this Stack Overflow question; you might want to use stored procedures, given that you cannot use scripting to handle the data: Regular expressions in stored procedures
With stored procedures you can loop through the data and do a lot of handling without ever leaving MySQL. That second article will refer you right back to the one I listed, though, so I think you need to first prove out your regex and get it working, then look into stored procedures.