MySQL strange characters replace with <BR - mysql

I inherited a MySQL table (MyISAM, utf8_general_ci collation) that has a strange character which looks like this in phpMyAdmin: •
I assume this is a bullet point of some type?
When rendered on a HTML page it looks like this: �
How do I replace this value with a <BR><LI> so I can turn it into a line break with a properly formatted list item?
I've tried a standard UPDATE query but it does not replace these values? I assume I need to escape them somehow?
Query attempted:
UPDATE `FL_Regs` SET `Remarks` = "<BR><LI>" WHERE `Remarks` = "•"

You did not show your query, so I'm only guessing.
If your client is mangling the character encoding for you (I imagine you are using phpMyAdmin, which involves a lot of steps between your browser and the actual server), you can try giving the search string as a sequence of bytes.
It happens that • is U+2022, a character named "BULLET" in Unicode, which is encoded as e2 80 a2 in UTF-8. So you can use X'E280A2' instead of '•' in your query.
Typically:
> select X'E280A2';
+-----------+
| X'E280A2' |
+-----------+
| • |
+-----------+
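As a rough cross-check outside MySQL, the byte values above can be reproduced in Python (an assumption on my part; the thread itself only uses SQL). The variable names are illustrative.

```python
# U+2022 BULLET and its UTF-8 byte sequence, as named in the answer.
bullet = "\u2022"
utf8_bytes = bullet.encode("utf-8")

# Upper-case hex digits, i.e. what goes between the quotes of X'...'.
hex_form = utf8_bytes.hex().upper()

# The same three bytes misread as Windows-1252 give the familiar mojibake.
misread = utf8_bytes.decode("cp1252")

print(hex_form)        # E280A2
print(ascii(misread))  # '\xe2\u20ac\xa2'
```

With those bytes in hand, something like REPLACE(Remarks, X'E280A2', '<BR><LI>') inside the UPDATE would be one way to substitute the bullet wherever it occurs in the column, rather than only in rows where the whole value equals it. Try it on a copy of the table first.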
If you want to better understand what's happening, you can use the HEX() function, first perhaps to check what MySQL receives when you send a bullet:
SELECT HEX('•');
Typically I get E280A2, which, as seen previously, is the UTF-8 encoding of the BULLET character.
And to see what's actually stored in your table:
SELECT HEX(your_column) FROM your_table;
Try to limit the search to a single row to keep the output readable.

Related

mysql full text search strange behaviour with some words [duplicate]

I have a few columns with a fulltext index and I'm testing some search strings. My db contains car components, so my searches could be, for example, "Engine 1.6". The problem is that when I use a string containing a period (like 1.6), the query returns no results.
Here are my variables:
+--------------------------+----------------+
| ft_boolean_syntax | + -><()~*:""&| |
+--------------------------+----------------+
| ft_max_word_len | 84 |
+--------------------------+----------------+
| ft_min_word_len | 4 |
+--------------------------+----------------+
| ft_query_expansion_limit | 20 |
+--------------------------+----------------+
| ft_stopword_file | (built-in) |
+--------------------------+----------------+
I don't know why but even if the ft_min_word_len is 4, a search like "Engine 24V" works. The query for matching is like this:
WHERE MATCH(sdescr,udescr) AGAINST ('+engine +1.6' IN BOOLEAN MODE)
I spent the last day figuring out this issue. The reason this happens is that, by default, MySQL/MariaDB collations treat spaces (" "), periods ("."), and commas (",") as punctuation. Long story short, collations "weight" characters to determine how to filter or sort them, and the full-text parser treats the punctuation characters above as word delimiters rather than as part of a word.
To solve this, we need MySQL/MariaDB to treat those characters as word characters rather than as punctuation.
We are presented with three solutions in the MySQL documentation. The first one requires changing the source code and recompiling, which isn't a very viable option for me. The second and third options are good and aren't too hard to follow.
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.13.1, “Character Definition Arrays”.
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation. For general information about adding collations, see Section 10.14, “Adding a Collation to a Character Set”. For an example specific to full-text indexing, see Section 12.10.7, “Adding a User-Defined Collation for Full-Text Indexing”.
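To make the "character type table" idea concrete, here is a toy model in Python. It is not MySQL's parser: split_words and its extra_word_chars parameter are hypothetical names, and the rule "letters, digits and underscore are word characters" is a simplifying assumption. It only shows why declaring '.' a word character keeps "1.6" together as a single token.

```python
import re

def split_words(text, extra_word_chars=""):
    """Toy tokenizer: a run of 'word characters' forms one token.

    extra_word_chars plays the role of the edited character type table:
    any character listed there is treated as a letter.
    """
    word_class = "A-Za-z0-9_" + re.escape(extra_word_chars)
    return re.findall("[" + word_class + "]+", text)

# Default table: the period separates words, so "1.6" falls apart.
print(split_words("Engine 1.6"))       # ['Engine', '1', '6']

# Period declared a word character: "1.6" survives as one token.
print(split_words("Engine 1.6", "."))  # ['Engine', '1.6']
```

Minimum word length (ft_min_word_len) is a separate setting and is deliberately not modelled here.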
First things first:
We need to know which character we're trying to fix. Take a look at the link below and find the hex equivalent of the character you're trying to fix. In my case, it was 2E, the period.
https://www.eso.org/~ndelmott/ascii.html
Now, we need to find the collation files in the database server.
SSH into your server.
Log in to your MySQL/MariaDB: mysql -u root -p
Run SHOW VARIABLES LIKE 'character_sets_dir';
The result should be a table whose value is a directory path. I was using Docker, so mine came back as usr/share/mysql/charsets.
At this point, I opened a second terminal, but this isn't necessary.
Back in the server, outside of the MySQL/MariaDB command line:
Navigate to the directory path the previous query returned. You'll find an Index.xml as well as other XML files.
Follow the first step in the MySQL Documentation
NOTE: Before continuing the second step, open latin1.xml and look closely at the <map> nested in <lower> and <upper>. Find the HEX equivalent character to the one you want to fix, in my case, 2E. We can then map the correct spot in the <map> nested inside <ctype>.
Continue to the second step in the MySQL Documentation
After the changes, restart your server.
Assign the User-defined Collation to our database/table/column.
All we need to do is assign our collation to our database, table, or column. In my case, I just needed to assign it to two columns, so I ran the following command:
ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;
Here are some links that might be helpful:
https://mariadb.com/kb/en/setting-character-sets-and-collations/
https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html
This should solve your problem if you don't have any existing data in the table.
If you do have existing data and you try to run the query above, you might get an error similar to the one below:
SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x80\x93 fr...' for column.
The issue here is that the existing data contains multi-byte UTF-8 sequences (\xE2\x80\x93 is the three-byte UTF-8 encoding of an en dash) that could not be converted directly to latin1, a single-byte character set. To solve this, we need to convert our data to binary first, and then to latin1. For more info, check out this link.
Run the following query in the mysql/mariadb command line:
UPDATE table_name SET fulltext_column = CONVERT(CAST(CONVERT(fulltext_column USING utf8) AS BINARY) USING latin1);
You'll need to convert the values of every column which are causing the issue. In my case, it was just one.
Then follow it with:
ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;
We are done. We can now search a term with our character, and our database engine will match against it.
InnoDB solves this problem; MyISAM still has this feature/behaviour.
MyISAM works with words like "Node.js" but not with words like "ASP.NET"
The working here
UPDATED: Later I found I might be wrong. MyISAM works with "Node.js" because MyISAM requires words of at least four characters, while InnoDB requires at least three.
I found a link here with the below explanation:
Note: Some words are ignored in full-text searches.
The minimum word length for full-text searches is as follows:
Three characters for InnoDB search indexes.
Four characters for MyISAM search indexes.
Stop words are very common words, such as 'on', 'the' or 'it', that appear in almost every document. These types of words are ignored during searching.
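The minimum-length rule quoted above can be illustrated with a small Python sketch. This is a simplified model, not either engine's real tokenizer: indexable_words is a hypothetical helper, and splitting on anything other than letters, digits and underscore is an assumption. The limits 4 and 3 are the defaults from the note.

```python
import re

def indexable_words(text, min_len):
    """Split on non-word characters, then drop tokens shorter than min_len."""
    words = re.findall(r"[A-Za-z0-9_]+", text)
    return [w for w in words if len(w) >= min_len]

# MyISAM-style limit of four characters.
print(indexable_words("Node.js", 4))   # ['Node']
print(indexable_words("ASP.NET", 4))   # []

# InnoDB-style limit of three characters.
print(indexable_words("ASP.NET", 3))   # ['ASP', 'NET']
```

Under this model "Node.js" is findable with a four-character limit only because "Node" happens to be long enough, which matches the explanation in the update.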

How to clean data with special characters in MySQL

How can one clean data that looks like this RÃ©ation, lâ€™Oreal to look like this R'action and L'Oreal respectively in MySQL?
That looks like an example of "double encoding". It is where the right hand was talking utf8, but the left hand was listening for latin1. See "Trouble with UTF-8 characters; what I see is not what I stored", and see also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .
RÃ©ation -> Réation after undoing the double-encoding.
Yet you say R'action -- I wonder if you were typing é as e' or 'e??
I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)
First, we need to verify that it is actually an ordinary double-encoding.
SELECT col, HEX(col) FROM ... WHERE ...
should give you this for the hex for Réation:
52 E9 6174696F6E -- latin1 encoding
52 C3A9 6174696F6E -- utf8 encoding
52 C383 C2A9 6174696F6E -- double encoding
(Ignore the spacing.)
If you got the third of those proceed with my Answer. If you get anything else, STOP! -- the problem is more complex than I thought.
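The three hex forms listed above can be reproduced outside the database. The following Python sketch is only an illustration (Python is my choice here, not part of the answer); it encodes the word once as latin1, once as UTF-8, and once after a simulated utf8-read-as-latin1 mix-up.

```python
word = "R\u00e9ation"  # "Reation" with e-acute as the second letter

# Single-byte latin1 encoding.
latin1_hex = word.encode("latin-1").hex().upper()

# Proper UTF-8 encoding.
utf8_hex = word.encode("utf-8").hex().upper()

# Double encoding: UTF-8 bytes misread as latin1, then stored as UTF-8 again.
misread = word.encode("utf-8").decode("latin-1")
double_hex = misread.encode("utf-8").hex().upper()

print(latin1_hex)  # 52E96174696F6E
print(utf8_hex)    # 52C3A96174696F6E
print(double_hex)  # 52C383C2A96174696F6E
```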
Now, see if the double-encoding fix will fix it (before fixing it):
SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
BINARY(CONVERT(col USING latin1)) USING utf8mb4)
USING latin1)) USING utf8mb4)
FROM tbl;
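For experimenting on exported values before touching the table, the same round trip can be sketched in Python. undo_double_encoding is a hypothetical helper, and cp1252 is used on the assumption that the "latin1" side behaves like Windows-1252, as MySQL's latin1 does. Values that are not double-encoded are returned unchanged.

```python
def undo_double_encoding(text, legacy_codec="cp1252"):
    """Reverse one round of 'UTF-8 bytes read as a legacy single-byte codec'.

    If the text does not round-trip cleanly it is assumed not to be
    double-encoded and is returned as-is.
    """
    try:
        return text.encode(legacy_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Double-encoded input is repaired.
print(ascii(undo_double_encoding("R\u00c3\u00a9ation")))         # 'R\xe9ation'
print(ascii(undo_double_encoding("l\u00e2\u20ac\u2122Oreal")))   # 'l\u2019Oreal'

# Already-correct input is left alone.
print(ascii(undo_double_encoding("R\u00e9ation")))               # 'R\xe9ation'
```

This only handles a single, clean round of double encoding; anything messier still needs the hex inspection described above.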
You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!
Your case is: CHARACTER SET latin1, but have utf8/utf8mb4 bytes in it; leave bytes alone while fixing charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
Do that for each column in each table with the problem.
(In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)
from Comment
CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'Î²' (double-encoded Greek beta)
CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'ÃŽÂ²'
My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess goes beyond simple "double encoding".
If possible, start over, being sure that some test data gets stored correctly before adding more data. If the data is bad, do not try to fix the data; back off and start over again. See the "best practice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.

MYSQL/Coldfusion replace registration symbol not working

I'd like to make all registration symbols superscript by wrapping them with a <sup> HTML tag. So, I can do this in SQL no problem:
SELECT s.id,
Replace(s.name,'®','<sup>®</sup>') AS name
FROM staff s
WHERE name LIKE '%®%'
Result:
id | name
1 | Name1 CFP<sup>®</sup>, CDFA
2 | Jeffrey test CFP<sup>®</sup>
3 | Matthew hello CFP<sup>®</sup> CFA
But when I run it in ColdFusion from a cfquery tag, it looks as if the ® character is interpreted as Â®.
<cfquery name="getStaff" dataSource="#this.dsn#">
SELECT s.id,
Replace(s.name,'®','<sup>®</sup>') AS name
FROM staff s
WHERE 1=1
<cfif isDefined("arguments.permalink")>
AND s.permalink=<cfqueryparam value="#arguments.permalink#" />
</cfif>
</cfquery>
Is there a better way to approach this? I originally did this in Coldfusion using <cfset getStaff.name = Replace(getStaff.name,Chr(174),'<sup>®</sup>') />, which worked fine until I switched to Mustache templating.
I'd definitely prefer to use the CHAR() function if I could figure out what numeric character ® is in Mysql. (Note, using utf8_general_ci on this and all DB tables) I tried CHAR(174) in Mysql, but it won't work because (as far as I can tell) Mysql isn't using the same character set - SELECT CHAR(174) returns a blob.
UPDATE:
I'd definitely prefer to use the CHAR() function if I could figure out
what numeric character ® is in Mysql. (Note, using utf8_general_ci on
this and all DB tables) I tried CHAR(174) in Mysql, but it won't work
because (as far as I can tell) Mysql isn't using the same character
set - SELECT CHAR(174) returns a blob.
As mentioned in the comments, it sounds like the default charset for your database is utf8. So presumably it failed because the decimal 174 is not the correct way to represent the registered sign in utf8. That symbol requires two bytes. Using the proper hex or decimal value for your default charset (ie utf8) it works as expected:
Hex: CHAR(0xC2AE)
Decimal: CHAR(194,174)
Though it would be better to specify the charset explicitly with USING:
Hex: CHAR(0xC2AE USING utf8)
Decimal: CHAR(194,174 USING utf8)
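Where the numbers 194, 174 and 0xC2AE come from can be checked with a few lines of Python (used here purely as a calculator; it is not part of the ColdFusion/MySQL setup in the question).

```python
reg = "\u00ae"  # REGISTERED SIGN

# The Unicode code point is 174, which is why Chr(174) works in ColdFusion.
code_point = ord(reg)

# In UTF-8 the character takes two bytes: 0xC2 0xAE, i.e. 194 and 174.
utf8_bytes = reg.encode("utf-8")
print(code_point)          # 174
print(list(utf8_bytes))    # [194, 174]
print(utf8_bytes.hex())    # c2ae

# The single byte 0xAE on its own is not valid UTF-8.
try:
    b"\xae".decode("utf-8")
    lone_byte_valid = True
except UnicodeDecodeError:
    lone_byte_valid = False
print(lone_byte_valid)     # False

# Reading the two UTF-8 bytes as latin1 gives a stray A-circumflex in front.
mojibake = utf8_bytes.decode("latin-1")
print(ascii(mojibake))     # '\xc2\xae'
```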
Is the symbol hard-coded into the .cfm script? If so, it is probably an issue with the character encoding of the script. When interpreting literal characters within the file, the page encoding is what matters. Try:
Adding <cfprocessingdirective pageEncoding="utf-8"> to the top of the script.
Note: For CFC's, the cfprocessingdirective tag must follow the cfcomponent tag
If the default charset for your database is utf8, try using the CF equivalent function, i.e. #chr(174)#. However, IMO it is better to use the MySQL CHAR() function instead.
Side note about cfqueryparam, it is a good practice to always specify a cfsqltype. If omitted, it defaults to CF_SQL_CHAR, which may force implicit conversion and cause wrong/unintended results in some cases (numbers, dates, etcetera). Even for string values it is a good idea to specify the type, as there may be slight differences with how CHAR and VARCHAR types are treated on the database side.
It is possible to do something like ColdFusion's Chr() in SQL:
<cfquery name="getStaff" dataSource="#this.dsn#">
SELECT s.id,
REPLACE(s.name, CHAR(174), '<sup>®</sup>') AS name
FROM staff s
WHERE 1=1
<cfif isDefined("arguments.permalink")>
AND s.permalink=<cfqueryparam value="#arguments.permalink#" />
</cfif>
</cfquery>
For MySQL:
See: http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_char
For SQL Server:
See: https://msdn.microsoft.com/en-us/library/ms187323.aspx

How to edit invalid UTF-8 strings in mysql database

I have some UTF-8 strings in my database, stored as varbinary. (It's a MediaWiki database, but I don't think that's important.) I found that some strings are not in good shape. When I run
SELECT log_comment, CONVERT( log_comment
USING utf8 ) AS
COMMENT
FROM `logging`
WHERE log_id = %somevalue%
I get an output table in phpMyAdmin like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| d093d09ed0a1d0a220d0a020d098d0a1d09e2fd09cd0add09a20393239342d39332e20c2abd098d0bdd184d0bed180d0bcd0b0d186d0b8d0bed0bdd0bdd0b0d18f20d182d0b5d185d0bdd0bed0bbd0bed0b3d0b8d18f2e2e2e |NULL |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
What I need is to make this string readable, or upload a new string with correct data. But this is a varbinary field; how can I manage the data inside it?
UPD:
I found that phpMyAdmin automatically appended 2e2e2e (three dots) to the end of each line, because the lines were too long to show. The original binary data, if anybody is interested, is
d09fd0a02035302e312e3031392d3230303020d09ed181d0bdd0bed0b2d0bdd18bd0b520d0bfd0bed0bbd0bed0b6d0b5d0bdd0b8d18f20d0b5d0b4d0b8d0bdd0bed0b920d181d0b8d181d182d0b5d0bcd18b20d0bad0bbd0b0d181d181d0b8d184d0b8d0bad0b0d186d0b8d0b820d0b820d0bad0bed0b4d0b8d180d0bed0b2d0b0d0bdd0b8d18f20d182d0b5d185d0bdd0b8d0bad0be2dd18dd0bad0bed0bdd0bed0bcd0b8d187d0b5d181d0bad0bed0b920d0b820d181d0bed186d0b8d0b0d0bbd18cd0bdd0bed0b920d0b8d0bdd184d0bed180d0bcd0b0d186d0b8d0b820d0b820d183d0bdd0b8d184d0b8d186d0b8d180d0bed0b2d0b0d0bdd0bdd18bd1
Anyway, those strings contain invalid UTF-8 at the end of the line, as it seems from
SELECT log_comment,CAST(log_comment AS CHAR CHARACTER SET utf8) AS COMMENT
FROM `logging`
WHERE log_id = %somevalue%
because the last symbol is � (for me it appears as a black rhombus with a white question mark in it), and the last 20-30 characters are missing
As it was said in Joni's comment,
"The length of the text is exactly 255 bytes, which is the limit of a
MySQL tinytext/tinyblob field, and also often used by programmers as
the size for varchar/varbinary. It looks like your original data has
been clipped. The last D1 in your original data starts a new UTF-8
character, but the second byte is missing; that's why the last
character is broken in the converted text."
In the MediaWiki DB, the field [log_comment] of the table [logging] stores the titles of pages that were altered. Some of them turned out to be longer than 255 bytes, so they were clipped when logged. That confused me; I thought there was some kind of database error and that I should alter those strings by adding the missing symbols. Now I see that is hardly possible, so I can just gather the necessary information from other fields.
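If the clipped values only need to be displayed, one option is to drop the half-character left at the end before decoding. The Python sketch below is an assumption about how one might do that on exported bytes; trim_incomplete_utf8 is a hypothetical helper, and the sample bytes are the last few bytes of the hex dump quoted in the question.

```python
def trim_incomplete_utf8(data: bytes) -> bytes:
    """Drop a trailing UTF-8 sequence that was cut off mid-character."""
    # The lead byte of the final sequence is at most 4 bytes from the end.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1
        # Fewer bytes present than the lead byte announces: cut them off.
        return data[:-back] if back < needed else data
    return data

# Tail of the clipped value: three complete characters plus a lone D1.
tail = bytes.fromhex("d0bdd0bdd18bd1")
trimmed = trim_incomplete_utf8(tail)
print(trimmed.hex())                   # d0bdd0bdd18b
print(ascii(trimmed.decode("utf-8")))  # '\u043d\u043d\u044b'
```

This does not recover the clipped text; it only makes what is left decodable.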
try this:
SELECT log_comment,
CONVERT(log_comment, CHAR) AS COMMENT
FROM `logging`
WHERE log_id = %somevalue%

MySQL Query to Identify bad characters?

We have some tables that were set with the Latin character set instead of UTF-8, which allowed bad characters to be entered into the tables. The usual culprit is people copying and pasting from Word or Outlook, which copies those nasty hidden characters...
Is there any query we can use to identify these characters to clean them?
Thanks,
I assume that your connection character set was set to UTF8 when you filled the data in.
MySQL replaces unconvertible characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, the question marks in the beginning of a word are a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
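The same replacement behaviour and the "question mark at the start of a word" heuristic can be tried out in Python. This is only a sketch: looks_damaged is a hypothetical name, and [0-9A-Za-z] is used as an approximation of the POSIX [[:alnum:]] class in the query above.

```python
import re

# Encoding Cyrillic text to latin1 with replacement yields one '?' per character.
converted = "\u0442\u0435\u0441\u0442".encode("latin-1", errors="replace")
print(converted)  # b'????'

# A '?' immediately followed by a letter or digit is suspicious.
SUSPECT = re.compile(r"\?[0-9A-Za-z]")

def looks_damaged(value: str) -> bool:
    """Heuristic only: flags a '?' that is glued to the front of a word."""
    return bool(SUSPECT.search(value))

print(looks_damaged("caf?s and ?clairs"))  # True
print(looks_damaged("Really? Yes."))       # False
```

Like the RLIKE version, this gives both false positives and false negatives, so treat the matches as candidates for review.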
You're probably noticing something like this 'bug'. The 'bad characters' are most likely bytes in the 0x80-0x9F range (e.g. \x80), which Windows-1252 uses for smart quotes and dashes. You might be able to identify them using a query like
SELECT bar FROM foo WHERE LOCATE(UNHEX('80'), bar) != 0
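To see which of those high bytes a value contains, and what they would mean if the text really came from a Windows program, a small scan over the raw bytes can help. The Python helper below is hypothetical, and the 0x80-0x9F range is an assumption about where Word/Outlook punctuation usually ends up.

```python
def find_c1_range_bytes(raw: bytes):
    """List (offset, byte value, Windows-1252 reading) for bytes 0x80-0x9F."""
    hits = []
    for offset, value in enumerate(raw):
        if 0x80 <= value <= 0x9F:
            # Bytes that Windows-1252 leaves undefined decode to U+FFFD here.
            reading = bytes([value]).decode("cp1252", errors="replace")
            hits.append((offset, value, reading))
    return hits

# Curly quotes pasted from Word, stored as single Windows-1252 bytes.
sample = b"\x93quoted\x94 text"
hits = find_c1_range_bytes(sample)
print([(offset, hex(value), ascii(reading)) for offset, value, reading in hits])
# [(0, '0x93', "'\\u201c'"), (7, '0x94', "'\\u201d'")]
```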
From that linked bug, they recommend using type BLOB to store text from windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
Take a look at this Q/A (it's all about your client encoding aka SET NAMES )