How to edit invalid UTF-8 strings in a MySQL database

I have some UTF-8 strings in my database, stored as varbinary. (It's a MediaWiki database, but I don't think that matters.) I found that some strings are malformed, so I ran:
SELECT log_comment,
       CONVERT(log_comment USING utf8) AS COMMENT
FROM `logging`
WHERE log_id = %somevalue%
The output table in phpMyAdmin looks like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| d093d09ed0a1d0a220d0a020d098d0a1d09e2fd09cd0add09a20393239342d39332e20c2abd098d0bdd184d0bed180d0bcd0b0d186d0b8d0bed0bdd0bdd0b0d18f20d182d0b5d185d0bdd0bed0bbd0bed0b3d0b8d18f2e2e2e |NULL |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
What I need is to make this string readable, or to upload a new string with correct data. But this is a varbinary field, so how can I manage the data inside it?
UPD:
I found that phpMyAdmin had appended 2e2e2e (three dots) to the end of each value because the values were too long to display. The original binary data, if anybody is interested, is:
d09fd0a02035302e312e3031392d3230303020d09ed181d0bdd0bed0b2d0bdd18bd0b520d0bfd0bed0bbd0bed0b6d0b5d0bdd0b8d18f20d0b5d0b4d0b8d0bdd0bed0b920d181d0b8d181d182d0b5d0bcd18b20d0bad0bbd0b0d181d181d0b8d184d0b8d0bad0b0d186d0b8d0b820d0b820d0bad0bed0b4d0b8d180d0bed0b2d0b0d0bdd0b8d18f20d182d0b5d185d0bdd0b8d0bad0be2dd18dd0bad0bed0bdd0bed0bcd0b8d187d0b5d181d0bad0bed0b920d0b820d181d0bed186d0b8d0b0d0bbd18cd0bdd0bed0b920d0b8d0bdd184d0bed180d0bcd0b0d186d0b8d0b820d0b820d183d0bdd0b8d184d0b8d186d0b8d180d0bed0b2d0b0d0bdd0bdd18bd1
Anyway, those strings contain non-UTF-8 bytes at the end, as can be seen from
SELECT log_comment,CAST(log_comment AS CHAR CHARACTER SET utf8) AS COMMENT
FROM `logging`
WHERE log_id = %somevalue%
because the last symbol is � (for me it appears as a black rhombus with a white question mark in it), and the last 20-30 characters are missing.

As Joni said in a comment:
"The length of the text is exactly 255 bytes, which is the limit of a
MySQL tinytext/tinyblob field, and also often used by programmers as
the size for varchar/varbinary. It looks like your original data has
been clipped. The last D1 in your original data starts a new UTF-8
character, but the second byte is missing; that's why the last
character is broken in the converted text."
In the MediaWiki database, the log_comment field of the logging table should store the titles of the pages that were altered. Some of them turned out to be longer than 255 bytes, so they were clipped when logged. That confused me; I thought there was some kind of database error and that I should just alter those strings by adding the missing symbols. Now I see that is hardly possible, so I will just gather the necessary information from other fields.
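
To see why the last character breaks, here is a minimal Python sketch (Python is used only to inspect the bytes outside MySQL; the helper name `trim_incomplete_utf8` is hypothetical). It takes the last seven bytes of the original data quoted in the question and shows that the trailing D1 is a lead byte with no continuation byte.

```python
def trim_incomplete_utf8(data: bytes) -> bytes:
    """Drop a trailing, incomplete UTF-8 sequence left by byte-level clipping.

    Hypothetical helper: it only repairs damage at the very end of the data.
    """
    # An incomplete sequence is at most 3 bytes long, so try cutting 0..3 bytes.
    for cut in range(len(data), max(len(data) - 4, -1), -1):
        try:
            data[:cut].decode("utf-8")
            return data[:cut]
        except UnicodeDecodeError:
            continue
    return data  # the problem is not confined to the tail; leave it untouched


# Last seven bytes of the clipped log_comment value from the question.
tail = bytes.fromhex("d0bdd0bdd18bd1")

print(tail.decode("utf-8", errors="replace"))      # ends with U+FFFD
print(trim_incomplete_utf8(tail).decode("utf-8"))  # decodes cleanly
```

Trimming only makes the value valid UTF-8 again; the clipped characters themselves cannot be recovered from this field.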

Try this:
SELECT log_comment,
       CONVERT(log_comment, CHAR) AS COMMENT  -- CONVERT(expr, type) accepts CHAR, not VARCHAR
FROM `logging`
WHERE log_id = %somevalue%

Related

Special characters show as 'BLOB' when typing SELECT CHAR(128,129,130,131,132,133,134,135,136,137);

I'm using MySQL 8.0.31 and learning with the Sakila dataset. I tried typing
SELECT CHAR(128,129,130,131,132,133,134,135,136,137); but the result shows BLOB instead of characters.
[image: screenshot of the query result]
I also checked the default character set, and it is 'utf8mb4'.
I don't see a lot of answers and I'm a beginner. Please help.
Edit:
I am expecting this result:
[image: screenshot of the expected result]
This is taken from the book Learning SQL by Alan B.
From the book:
The following examples show the location of the accented characters along with other special characters, such as currency symbols:
mysql> SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
result: Çüéâäàåçêë
"BLOB" is a datatype used in databases to contain binary data (that is, not representable as text).
Without a USING clause, CHAR() returns a binary string, and the bytes 128 to 137 on their own are not valid UTF-8, so the client does not know how to print them and just reports binary content.
The example in the book presumably assumes a single-byte character set in which these byte values are accented letters. ASCII itself only defines values 0 to 127, so name a character set that covers them explicitly, for example cp850:
SELECT CHAR(128,129,130,131,132,133,134,135,136,137 USING cp850);
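
As a cross-check outside MySQL, the following Python sketch decodes the same ten byte values under a few encodings. It assumes the book's output came from a DOS-style code page such as cp850 (an assumption; the excerpt does not say which one), and it suggests that the bytes are not valid ASCII or UTF-8 on their own. The helper name `try_decode` is hypothetical.

```python
raw = bytes(range(128, 138))  # the byte values passed to CHAR(...)


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short note if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return f"<not valid {encoding}>"


for encoding in ("ascii", "utf-8", "cp850"):
    print(encoding, "->", try_decode(raw, encoding))
```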

I have a text file that contains non-English words and I need to put it into MySQL. How can I?

I have a text file that contains non-English words and I need to put it into MySQL. How can I do that?
203851 ኣብ
70351 ናይ
56687 ካብ
46018 እቲ
41928 ምስ
40221 ከም
38702 ድማ
29739 ናብ
28806 እዩ
23066 ከኣ
21459 ግን
21013 እዚ
20638 ሓደ
If by that you mean how to insert a non-ASCII string, the answer is: the same way you would insert any other string. Just be careful, when you create the column, to select an encoding that supports it. For example utf8mb4, which is UTF-8 with up to 4 bytes per character; MySQL's legacy utf8 only uses up to 3, which is a long story. That should do the trick for almost anything you want to put there.
Then: INSERT INTO tableName (columnName) VALUES ('yourString');
Important note: convert the string to the encoding you use before inserting it (again, UTF-8 is your best choice). If you skip that but follow the advice in the next paragraph, you are still fine; if you do neither, you will end up with garbled data.
Be careful with data processing: unfortunately, MySQL has many points of failure regarding encoding. There is one encoding for the database, one for the table, one for the column, one for saving data, and even one for the connection, so try to use the same encoding for all of them to avoid surprises.
You have a text file with 2 columns of data? Are they separated by tabs or what?
Use LOAD DATA INFILE with suitable parameters.
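
Before loading, it can help to normalise the file. Here is a minimal Python sketch, assuming each line is a count followed by whitespace and a word (as in the sample) and that the text is already UTF-8; the function name `to_tab_separated` is hypothetical.

```python
def to_tab_separated(text: str) -> str:
    """Rewrite 'count word' lines as 'count<TAB>word' lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        count, word = line.split(maxsplit=1)
        rows.append(f"{int(count)}\t{word.strip()}")
    return "".join(f"{row}\n" for row in rows)


sample = "203851 ኣብ\n70351 ናይ\n"
print(to_tab_separated(sample), end="")

# Each Ethiopic letter takes 3 bytes in UTF-8.
print(len("ኣብ".encode("utf-8")))
```

A file in this shape could then be loaded with LOAD DATA INFILE using a CHARACTER SET utf8mb4 clause and FIELDS TERMINATED BY '\t', provided the target columns use a matching character set.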

MySQL strange characters replace with <BR

I inherited a MySQL table (MyISAM, utf8_general_ci collation) that has a strange character that looks like this in phpMyAdmin: •
I assume this is a bullet point of some type?
When rendered on an HTML page it looks like this: �
How do I replace this value with <BR><LI> so I can turn it into a line break with a properly formatted list item?
I've tried a standard UPDATE query, but it does not replace these values. I assume I need to escape them somehow?
Query attempted:
UPDATE `FL_Regs` SET `Remarks` = "<BR><LI>" WHERE `Remarks` = "•"
You did not show your query, so I'm only guessing.
If you're having a hard time with your client encoding characters for you (I imagine you may be using phpMyAdmin, which involves a lot of steps between your browser and the actual server), you can try giving the string to search for as a sequence of bytes.
It happens that • is how the UTF-8 bytes of • appear when they are read as Windows-1252. • is U+2022, a character named "BULLET" in Unicode, which is encoded as e2 80 a2 in UTF-8. So you can use X'E280A2' instead of the literal character in your query.
Typically:
> select X'E280A2';
+-----------+
| X'E280A2' |
+-----------+
| •         |
+-----------+
If you want to better understand what's happening, you can use the HEX() function, first perhaps to check what MySQL is receiving when you send a bullet:
SELECT HEX('•');
Typically I get E280A2, which is, as seen above, the UTF-8 encoding of the BULLET character.
Then check what's actually stored in your table:
SELECT HEX(your_column) FROM your_table;
Try to limit the search to a single row to make the output almost readable.

MySQL does not identify character '?' on select

I have a table in MySQL with the following columns:
id name email address borningDate
I have a form in an HTML page that submits this data to a servlet, which is responsible for saving it to the database. Due to charset issues (already fixed), a row was saved like this when I tried to store letters with accents:
19 ? ? ? 2015-03-01
and now I want to delete this row.
Yes, doing this:
DELETE FROM table WHERE id=19;
works fine. My didactic question is: why, if I try something like this:
DELETE FROM table WHERE name='?';
does it return 0 rows affected, as if it can't see ? as a valid character?
Try doing
SELECT id, HEX(name), HEX(email), HEX(address), borningDate FROM table
This will tell you what's actually in the database. It probably isn't actually ASCII question marks. The question marks are probably substitution characters applied when MySQL tries to convert the column's character set to the connection's character set.
To manage this more specifically, do SHOW CREATE TABLE table and look for the character set being used for the text columns. This probably shows up at the end of the table definition as something like DEFAULT CHARSET utf8 or some such thing. But it might be specified in the column definition.
Once you know the character set, issue the command SET NAMES charset, for example, SET NAMES utf8. Then reissue your commands and see if you get better results than the ? substitution character. That assumes, of course, that the client program you are using can handle the character set mentioned.
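
The difference between a stored question mark and a substituted one is easy to see at the byte level. The following Python sketch is only an analogy for what HEX() reveals: Python's `errors="replace"` substitutes `?` for unencodable characters, which resembles (but is not) MySQL's own conversion. The sample name and the helper `to_charset` are hypothetical.

```python
def to_charset(text: str, encoding: str) -> bytes:
    """Encode text, replacing characters the encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace")


name = "José"
print(to_charset(name, "ascii").hex())    # the é becomes 3f, a literal '?'
print(to_charset(name, "latin-1").hex())  # the é is stored as e9
print(to_charset(name, "utf-8").hex())    # the é is stored as c3a9
```

If the HEX() output for the row shows 3F, the column really contains question marks; any other bytes mean the `?` only appears during conversion for display.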

MySQL Query to Identify bad characters?

We have some tables that were set up with the Latin character set instead of UTF-8, which allowed bad characters to be entered into the tables. The usual culprit is people copying and pasting from Word or Outlook, which copies those nasty hidden characters...
Is there any query we can use to identify these characters so we can clean them?
Thanks,
I assume that your connection character set was set to UTF8 when you filled the data in.
MySQL replaces unconvertible characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, a question mark at the beginning of a word is a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
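
The same heuristic can be tried on sample strings outside the database. This is a minimal Python sketch; the function name `looks_damaged` and the sample strings are hypothetical, and the pattern only approximates `[[:alnum:]]` with ASCII letters and digits.

```python
import re

# Same idea as RLIKE '\\?[[:alnum:]]': a '?' immediately followed by a letter or digit.
SUSPICIOUS = re.compile(r"\?[0-9A-Za-z]")


def looks_damaged(value: str) -> bool:
    """Return True if the value contains a '?' directly followed by a letter or digit."""
    return SUSPICIOUS.search(value) is not None


# Cyrillic cannot be represented in latin1, so every letter is replaced with '?'.
print("тест".encode("latin-1", errors="replace"))
print(looks_damaged("Is this fine? Yes."))  # legitimate question mark
print(looks_damaged("na?ve caf?"))          # '?' inside a word
```

Like the SQL version, this is only a starting point: a value that consists entirely of question marks does not match the pattern and has to be found another way.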
You're probably noticing something like this 'bug'. The 'bad characters' are most likely bytes in the 0x80-0x9F range (e.g. \x80), which Windows-1252 uses for characters such as curly quotes. You might be able to identify them using a query like
SELECT bar FROM foo WHERE LOCATE(UNHEX('80'), bar) != 0
From that linked bug, they recommend using type BLOB to store text from windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
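
To check a value for all such bytes rather than one at a time, a scan over the raw bytes can help. This is a minimal Python sketch that assumes the column content has been fetched as raw bytes and that the text was pasted from a Windows application; the sample value and the function name `find_high_control_bytes` are hypothetical.

```python
def find_high_control_bytes(data: bytes) -> list[int]:
    """Return the offsets of bytes in the 0x80-0x9F range."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Assumed sample: text pasted from Word, stored byte-for-byte in a latin1 column.
stored = b"\x93quoted\x94 and a dash \x96"

print(find_high_control_bytes(stored))
print(stored.decode("cp1252"))  # curly quotes and an en dash
```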
Take a look at this Q/A (it's all about your client encoding aka SET NAMES )