SQLAlchemy/MySQL binary blob is being utf-8 encoded?

I'm using SQLAlchemy and MySQL, with a files table to store files. That table is defined as follows:
mysql> show full columns in files;
+---------+--------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| Field   | Type         | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
+---------+--------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| id      | varchar(32)  | utf8_general_ci | NO   | PRI | NULL    |       | select,insert,update,references |         |
| created | datetime     | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| updated | datetime     | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| content | mediumblob   | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| name    | varchar(500) | utf8_general_ci | YES  |     | NULL    |       | select,insert,update,references |         |
+---------+--------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
The content column of type MEDIUMBLOB is where the files are stored. In SQLAlchemy that column is declared as:
__maxsize__ = 12582912 # 12MiB
content = Column(LargeBinary(length=__maxsize__))
I am not quite sure about the difference between SQLAlchemy's BINARY type and LargeBinary type, or the difference between MySQL's VARBINARY type and BLOB type, and I am not quite sure whether that matters here.
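For reference, a minimal sketch of how the whole model can be declared (the model name and the non-content columns are my assumptions based on the table above): LargeBinary(length=n) is rendered as BLOB(n) by SQLAlchemy's MySQL dialect, and MySQL then silently promotes that to the smallest blob type that can hold n bytes, MEDIUMBLOB here; BINARY/VARBINARY are the fixed-/variable-length binary counterparts of CHAR/VARCHAR and are meant for short byte strings, which likely does not matter for this warning.

from sqlalchemy import Column, DateTime, LargeBinary, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class File(Base):  # model name assumed
    __tablename__ = 'files'
    __maxsize__ = 12582912  # 12 MiB

    id = Column(String(32), primary_key=True)
    created = Column(DateTime)
    updated = Column(DateTime)
    name = Column(String(500))
    # Rendered as BLOB(12582912), which MySQL promotes to MEDIUMBLOB.
    content = Column(LargeBinary(length=__maxsize__))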
Question: Whenever I store an actual binary file in that table, i.e. a Python bytes or b'' object, I get the following warning:
.../python3.4/site-packages/sqlalchemy/engine/default.py:451: Warning: Invalid utf8 character string: 'BCB121'
cursor.execute(statement, parameters)
I don't want to just ignore the warning, but it seems that the files are intact. How do I handle this warning gracefully, and how can I fix its cause?
Side note: This question seems to be related, and it seems to be a MySQL bug that it tries to convert all incoming data to UTF-8 (this answer).

Turns out that this was a driver issue. Apparently the default MySQL driver stumbles over utf8 support under Python 3. Installing cymysql into the virtual Python environment resolved the problem, and the warnings disappeared.
The fix: Find out if MySQL connects through socket or port (see here), and then modify the connection string accordingly. In my case using a socket connection:
mysql+cymysql://user:pwd@localhost/database?unix_socket=/var/run/mysqld/mysqld.sock
Use the port argument otherwise.
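For example, with SQLAlchemy the engine is then created like this (user, password and database names are placeholders):

from sqlalchemy import create_engine

# Socket connection (adjust the socket path to your MySQL setup):
engine = create_engine(
    'mysql+cymysql://user:pwd@localhost/database'
    '?unix_socket=/var/run/mysqld/mysqld.sock'
)

# TCP connection, using the port instead:
engine = create_engine('mysql+cymysql://user:pwd@localhost:3306/database')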
Edit: While the above fixed the encoding issue, it gave rise to another one: blob size. Due to a bug in CyMySQL, blobs larger than 8 MB fail to commit. Switching to PyMySQL fixed that problem, although it seems to have a similar issue with large blobs.

Not sure, but your problem might have the same roots as the one I had several years ago in Python 2.7: https://stackoverflow.com/a/9535736/68998. In short, MySQL's interface does not let you be certain whether you are working with a true binary string or with text in a binary collation (used because of the lack of a case-sensitive utf8 collation). Therefore, a MySQL binding has the following options:
- return all string fields as binary strings, and leave the decoding to you
- decode only the fields that do not have a binary flag (so much fun when some of the fields are unicode and others are str)
- have an option to force decoding to unicode for all string fields, even true binary ones
My guess is that in your case, the third option is enabled somewhere in the underlying MySQL binding, and the first suspect is your connection string (connection params).
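For illustration, with PyMySQL the decoding behaviour hangs off the connection parameters. A minimal sketch with placeholder credentials; whether binary-flagged fields are also decoded varies by driver:

import pymysql

# use_unicode=True asks the driver to decode string fields to unicode;
# some bindings apply this even to fields flagged as binary (option 3).
# Leaving it off returns raw byte strings (option 1).
conn = pymysql.connect(
    host='localhost',
    user='user',
    password='pwd',
    database='database',
    charset='utf8',
    use_unicode=True,
)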

Related

Insert Japanese characters into latin1_swedish_ci collated mysql table column

Japanese characters are getting replaced by '???'. I am not allowed to change the collation for the table/column. How can I insert these values?
MariaDB [company]> show full columns from test_table_latin1;
+-------+-------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type        | Collation         | Null | Key | Default | Extra | Privileges                      | Comment |
+-------+-------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| id    | int(5)      | NULL              | YES  |     | NULL    |       | select,insert,update,references |         |
| data  | varchar(20) | latin1_swedish_ci | YES  |     | NULL    |       | select,insert,update,references |         |
+-------+-------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
2 rows in set (0.00 sec)
MariaDB [company]> insert into test_table_latin1 values (4,'Was sent 検索キーワード - 自然');
Query OK, 1 row affected, 1 warning (0.00 sec)
MariaDB [company]> select * from test_table_latin1 where id=4;
+------+----------------------+
| id   | data                 |
+------+----------------------+
|    4 | Was sent ??????? - ? |
+------+----------------------+
1 row in set (0.00 sec)
"Japanese data is already there"
It can't be, or if it is, it is garbled beyond recognition. For one thing, the DB throws a warning if you try INSERT INTO test_table_latin1 (data) VALUES ('キーワード'); with "Incorrect string value: '\xE3\x82\xAD\xE3\x83\xBC...' for column 'data'".
Same if you force it (CONVERT('キーワード' USING latin1)), you get the question marks as it does the best it can with an impossible request. It tried to warn you when you were doing it accidentally, but now that you're doing it explicitly it will comply, and just mark the problem spots with '?'. The data is lost, the Japanese is no longer there, and there's nothing you can do to convert ????? to キーワード.
The best of the horrible options is pretending all is well: INSERT INTO test_table_latin1 (data) VALUES (CONVERT('キーワード' USING binary)), which gets you キーワード. Total garbage, but garbage that can be converted back to the original: SELECT CONVERT(CONVERT(data USING binary) USING utf8) FROM test_table_latin1; should give you 'キーワード'. The problem is, this only works when there is no actual Swedish text in the column, because either you encode the characters above 0x7f as if they were Unicode (which they are not), or, if you avoid them, you break the UTF-8 and won't be able to convert back. So it's again a very bad option.
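To see why that round trip works, here is the same idea expressed in Python (illustrative only): latin1 can decode every possible byte, so UTF-8 bytes survive being stored and fetched as latin1, even though they display as garbage in between.

original = 'キーワード'

# Store: reinterpret the UTF-8 bytes as latin1 'characters' (mojibake).
mojibake = original.encode('utf-8').decode('latin1')

# Fetch: reverse the reinterpretation to recover the original text.
restored = mojibake.encode('latin1').decode('utf-8')
assert restored == original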
Finally, you could make your own way of signifying "treat this part differently", like "Was sent [[Base64:UTF8:5qSc57Si44Kt44O844Ov44O844OJ]] - [[Base64:UTF8:6Ieq54S2]]" and decode it on the client.
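A sketch of that marker scheme in Python (the marker format follows the example above; the helper names are made up):

import base64

def encode_marker(text):
    # Wrap non-latin1 text in an ASCII-safe Base64 marker.
    payload = base64.b64encode(text.encode('utf-8')).decode('ascii')
    return '[[Base64:UTF8:' + payload + ']]'

def decode_marker(marker):
    payload = marker[len('[[Base64:UTF8:'):-len(']]')]
    return base64.b64decode(payload).decode('utf-8')

stored = 'Was sent %s - %s' % (encode_marker('検索キーワード'),
                               encode_marker('自然'))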
All of these are bad, bad alternatives to the single correct one: make the column Unicode. I understand that you might be unable to do so (company policy, legacy, compatibility, whatever), but that doesn't change the fact that anything else is no longer suited to the multicultural world we live in.

Ejabberd: Migrating Mnesia "passwd" table to MySQL "user" table

I have an (old) ejabberd instance that still uses 'internal' as its authentication method. I installed a shiny new server (including MySQL) and am planning to migrate to it ASAP. I would like to avoid using Mnesia as the authentication DB from then on.
Since my users' passwords are still stored in the Mnesia-database, I need to import them into the (new) MySQL DB on the new server. I succeeded in dumping the 'passwd' table and it is filled with entries like this one:
{passwd,{<<"flowie">>,<<"server.com">>},
{scram,<<"pHHeHwc5yaarPAshse7Ijuygtre=">>,
<<"4Qiv9ygiMLlzeZXUG6Bpyhygtgr=">>,
<<"dylctQFXYGXemMii1Pswe==">>,4096}}
To be able to correctly import these entries into the MySQL DB I need to figure out which field corresponds to which in the MySQL 'users' table:
+----------------+--------------+------+-----+-------------------+-------+
| Field          | Type         | Null | Key | Default           | Extra |
+----------------+--------------+------+-----+-------------------+-------+
| username       | varchar(191) | NO   | PRI | NULL              |       |
| password       | text         | NO   |     | NULL              |       |
| serverkey      | varchar(64)  | NO   |     |                   |       |
| salt           | varchar(64)  | NO   |     |                   |       |
| iterationcount | int(11)      | NO   |     | 0                 |       |
| created_at     | timestamp    | NO   |     | CURRENT_TIMESTAMP |       |
+----------------+--------------+------+-----+-------------------+-------+
6 rows in set (0.00 sec)
I obviously know what the 'username' field is (and I think I can guess what the 'iterationcount' would be), but I want to make sure I get the others in the right order.
In one phrase: in what order are the 'password', 'serverkey' and 'salt' fields stored in an ejabberd Mnesia DB? Where can I find info about this? In the code, perhaps?
Note for the aspiring hackers among you: I did change the values, using a random character generator ;)
I configured ejabberd 18.03 with the option
auth_password_format: scram
and created an account. Its authentication information is stored like this in Mnesia:
{passwd,{<<"user1">>,<<"localhost">>},
{scram,<<"Eu9adR8M5NPIBoVKK917UKJQTtE=">>,
<<"0mRs0DKWvb8C0/fcVmTRP2elKOA=">>,
<<"UclT113AyXYlUAZgv3q0vA==">>,4096}}
Later I exported Mnesia to a SQL file using the command:
ejabberdctl export2sql localhost /tmp/localhost.sql
and the resulting file contains this line:
INSERT INTO users(username, password, serverkey, salt, iterationcount)
VALUES ('user1',
'Eu9adR8M5NPIBoVKK917UKJQTtE=',
'0mRs0DKWvb8C0/fcVmTRP2elKOA=',
'UclT113AyXYlUAZgv3q0vA==', 4096);
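Comparing the Mnesia record with the generated INSERT answers the question: inside the {scram, ...} tuple the order is password (the SCRAM stored key), then serverkey, then salt, then iterationcount. The export2sql command is the easy route, but if you ever need to convert a raw dump yourself, here is a rough Python sketch (the regex is an assumption based on the record shape shown above; the Base64 values contain no quotes, so naive interpolation is safe here):

import re

RECORD = re.compile(
    r'\{passwd,\{<<"(?P<user>[^"]+)">>,<<"[^"]+">>\},\s*'
    r'\{scram,<<"(?P<stored_key>[^"]+)">>,\s*'
    r'<<"(?P<server_key>[^"]+)">>,\s*'
    r'<<"(?P<salt>[^"]+)">>,(?P<iterations>\d+)\}\}'
)

def to_insert(record_text):
    m = RECORD.search(record_text)
    return ("INSERT INTO users(username, password, serverkey, salt, "
            "iterationcount) VALUES ('%(user)s', '%(stored_key)s', "
            "'%(server_key)s', '%(salt)s', %(iterations)s);" % m.groupdict())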

PuTTY outputs weird stuff when selecting in MySQL

I've encountered a strange problem when using PuTTY to run the following MySQL command: select * from gts_camera
The output is extremely weird: PuTTY prints loads of "PuTTYPuTTYPuTTY..." over and over.
Maybe it's because of the table attribute set:
mysql> describe gts_kamera;
+---------+----------+------+-----+-------------------+----------------+
| Field   | Type     | Null | Key | Default           | Extra          |
+---------+----------+------+-----+-------------------+----------------+
| id      | int(11)  | NO   | PRI | NULL              | auto_increment |
| datum   | datetime | YES  |     | CURRENT_TIMESTAMP |                |
| picture | longblob | YES  |     | NULL              |                |
+---------+----------+------+-----+-------------------+----------------+
This table stores some big pictures and their date of creation.
(The weird ASCII characters you can see on top of the picture are the content.)
Does anybody know why PuTTY outputs such strange stuff, and how to solve/clean this? As it is, I can't type any other commands afterwards and have to reopen the session.
Sincerely,
Michael.
The reason this happens is the content of the file (you have a column defined as longblob). It may contain byte sequences that PuTTY does not understand, so the terminal breaks the way you are seeing. There is a PuTTY configuration option that may help, though.
You can also simply not select every column in that table (at least not the *blob ones):
select id, datum from gts_camera;
Or, if you still want to see it, use the MySQL function HEX:
select id, datum, HEX(picture) as pic from gts_camera;
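Alternatively, fetch the blob programmatically rather than letting the terminal render it. A sketch with PyMySQL (credentials and the file extension are assumptions):

import pymysql

conn = pymysql.connect(host='localhost', user='user',
                       password='pwd', database='database')
with conn.cursor() as cur:
    cur.execute('SELECT id, datum, picture FROM gts_camera')
    for row_id, datum, picture in cur.fetchall():
        # Write each picture to a file instead of printing raw bytes.
        with open('picture_%d.jpg' % row_id, 'wb') as f:
            f.write(picture)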

MySQL Screwed Up Output

I'm seeing some very strange output from MySQL, and I don't know whether it's my console or my data that's causing this. Here are some screenshots:
Any ideas?
Edit:
mysql> describe transformed_step_a1_sfdc_lead_history;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| old_value         | varchar(255) | YES  |     | NULL    |       |
| new_value         | varchar(255) | YES  |     | NULL    |       |
+-------------------+--------------+------+-----+---------+-------+
Max
To verify whether there are any control characters, you can use the -s option; see http://dev.mysql.com/doc/refman/5.5/en/mysql-command-options.html#option_mysql_raw
It's impossible to tell exactly what the problem is from your screenshots, but the text in your database contains control characters. The usual culprit is CR, which moves the cursor back to the beginning of the line and starts overwriting the text already there.
If you have programmatic access to your database, you can dump the values with control characters rendered as printables, so that you can see what is actually in there.
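For example, a sketch with PyMySQL (table and column names taken from the question; credentials assumed):

import pymysql

conn = pymysql.connect(host='localhost', user='user',
                       password='pwd', database='database')
with conn.cursor() as cur:
    cur.execute('SELECT old_value, new_value '
                'FROM transformed_step_a1_sfdc_lead_history')
    for old_value, new_value in cur.fetchall():
        # repr() renders control characters such as '\r' as escape
        # sequences, so you can see exactly what is stored.
        print(repr(old_value), repr(new_value))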

viewing mysql blob with putty

I am saving a serialized object to a MySQL database blob.
After inserting some test objects and then trying to view the table, I am presented with lots of garbage and "PuTTYPuTTY" several times.
I believe this has something to do with character encoding and the blob containing strange characters.
I just want to check whether this is going to cause problems with my database, or whether it is just a problem with PuTTY displaying the data.
Description of the QuizTable:
+-------------+-------------+-------------------+------+-----+---------+----------------+---------------------------------+-------------------------------------------------------------------------------------------------------------------+
| Field       | Type        | Collation         | Null | Key | Default | Extra          | Privileges                      | Comment                                                                                                             |
+-------------+-------------+-------------------+------+-----+---------+----------------+---------------------------------+-------------------------------------------------------------------------------------------------------------------+
| classId     | varchar(20) | latin1_swedish_ci | NO   |     | NULL    |                | select,insert,update,references | FK related to the ClassTable. This way each Class in the ClassTable is associated with its quiz in the QuizTable.  |
| quizId      | int(11)     | NULL              | NO   | PRI | NULL    | auto_increment | select,insert,update,references | This is the quiz number associated with the quiz.                                                                   |
| quizObject  | blob        | NULL              | NO   |     | NULL    |                | select,insert,update,references | This is the actual quiz object.                                                                                     |
| quizEnabled | tinyint(1)  | NULL              | NO   |     | NULL    |                | select,insert,update,references |                                                                                                                     |
+-------------+-------------+-------------------+------+-----+---------+----------------+---------------------------------+-------------------------------------------------------------------------------------------------------------------+
What i see when i try to view the table contents:
select * from QuizTable;
questionTextq ~ xp sq ~ w
t q1a1t q1a2xt 1t q1sq ~ sq ~ w
t q2a1t q2a2t q2a3xt 2t q2xt test3 | 1 |
+-------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
3 rows in set (0.00 sec)
I believe you can use the HEX function on blobs as well as strings. You can run a query like this:
SELECT HEX(quizObject) FROM QuizTable WHERE ...
PuTTY is reacting to what it thinks are terminal control sequences in your output stream. These sequences allow the remote host to change something about the local terminal without redrawing the entire screen, such as setting the title, positioning the cursor, clearing the screen, and so on.
It just so happens that when something encoded like this is 'displayed', a lot of the binary data ends up sending these sequences.
You'll get this reaction catting binary files as well.
A blob completely ignores any character-encoding settings you have; it's really intended for storing binary objects like images or zip files.
If this field will only ever contain text, I'd suggest using a text field instead.