Non-Standard Characters Cause Line to Appear in Hexadecimal - mysql

I have recently moved from a classic Linux server (using phpMyAdmin 3.3.5) to a cPanel server (using phpMyAdmin 4.0.10) and, after multiple attempts, was able to import my MediaWiki database. However, I noticed that any value containing non-English characters (Japanese letters or other symbols) appears entirely in hexadecimal. I have tried pasting the non-hex equivalent over a hex value, but the result is 0 rows changed and the value reverts to hexadecimal.
Example of working names on the previous server, seen with values "KrzesłaKonferencyjne" and "Zgłoszenie szkody warta".
Example of non-working names on the current server, seen with values "4b727a6573c582614b6f6e666572656e63796a6e65" and "5a67c5826f737a656e696520737a6b6f6479207761727461".
What can be done to fix this?
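As a quick check, the hex values quoted above can be decoded by hand. The sketch below is in Python and assumes the hex digits are simply the UTF-8 bytes of the original names. The helper name decode_hex_utf8 is made up for illustration. If the decoded text matches the old names, the data itself survived the import and only the display differs.

```python
def decode_hex_utf8(hex_text: str) -> str:
    """Interpret a string of hex digits as UTF-8 encoded bytes and return the text."""
    return bytes.fromhex(hex_text).decode("utf-8")


# The two values quoted in the question.
for value in (
    "4b727a6573c582614b6f6e666572656e63796a6e65",
    "5a67c5826f737a656e696520737a6b6f6479207761727461",
):
    print(decode_hex_utf8(value))
```

Both values decode to the names seen on the previous server, with c582 being the two-byte UTF-8 encoding of ł.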

Related

Copy Pasting and Apostrophes in MySQL Command Line

I'm using the MySQL Command Line Tool on MacOS using the command: mysql -u root -p;
To avoid repetition and to save time, I tried using TextEdit (MacOS's Notepad counterpart) to type queries and then copy them into the MySQL command prompt.
Here I noticed a problem when I was copy-pasting syntactically correct queries.
e.g. select * from club where COACHNAME=‘coolname’;
I finally found that the problem was with the apostrophes.
In any text editor, the apostrophes are tilted in different directions: ‘See the left and right tilted apostrophes on both sides’
The terminal and MySQL (not sure whose default it is) do not do this.
There, the same apostrophe is used for both the start and the end: 'Notice the un-tilted and uniform apostrophes on both sides'
Hence when I was copy pasting: select * from club where COACHNAME=‘coolname’;
I had to change the tilted apostrophes to the uniform, built-in kind, i.e.
select * from club where COACHNAME='coolname';
You can even notice that the apostrophes on Stack Overflow by default are the uniform kind,
whereas those in TextEdit and Notes (and maybe also in Notepad, if somebody could confirm that) are the tilted kind.
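For anyone who wants to clean a query before pasting it, here is a minimal Python sketch. It assumes the tilted characters are the Unicode curly quotes U+2018, U+2019, U+201C and U+201D. The names SMART_QUOTES and straighten_quotes are made up for illustration.

```python
# Map the curly ("smart") quote code points to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(query: str) -> str:
    """Return the query with curly quotes replaced by straight ones."""
    return query.translate(SMART_QUOTES)


print(straighten_quotes("select * from club where COACHNAME=\u2018coolname\u2019;"))
```

Note that this also rewrites curly quotes inside string values, so it is only suitable when the data itself should not contain them.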

How to store a Superscript value manually in MySQL using phpMyAdmin

I want to manually enter some formulas in my database through the phpMyAdmin GUI. Please note that I am NOT USING any PHP script to store the result. I just want to enter it MANUALLY. The formulas use subscripts and superscripts (x² and x₂). If I try to copy and paste a formula, it shows x2 instead of x² and x₂.
Current Setting of the column in which I want to enter the data is utf8_unicode_ci.
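Unicode contains code points for superscript numerals.
Superscript 0 and 4 through 9 are at U+2070 and U+2074 - U+2079; superscript 1, 2 and 3 are older characters at U+00B9, U+00B2 and U+00B3. Subscript digits are at U+2080 - U+2089. Not every font has glyphs for these characters, but many do. See this for example: http://www.fileformat.info/info/unicode/char/2070/index.htm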
If you can figure out how to type these Unicode characters into your web browser, you'll be able to get phpMyAdmin to insert them into your utf8 varchar() columns. The typing of unicode characters is browser specific.
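If typing the characters is awkward, one option is to generate them elsewhere and paste the result into phpMyAdmin. The Python sketch below is only an illustration. The helper names superscript and subscript are made up. It uses U+00B9, U+00B2 and U+00B3 for superscript 1, 2 and 3, since those three sit outside the U+2070 block.

```python
# Superscript digits 0-9: note that 1, 2 and 3 live in the Latin-1 Supplement block.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
# Subscript digits 0-9 are contiguous: U+2080 through U+2089.
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "".join(chr(0x2080 + d) for d in range(10)),
)


def superscript(digits: str) -> str:
    """Return the digits as Unicode superscript characters."""
    return digits.translate(SUPERSCRIPT_DIGITS)


def subscript(digits: str) -> str:
    """Return the digits as Unicode subscript characters."""
    return digits.translate(SUBSCRIPT_DIGITS)


print("x" + superscript("2"), "x" + subscript("2"))
```

All of these characters take at most three bytes in UTF-8, so they fit in a utf8_unicode_ci column.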

Character Encoding error when copying double quotes from word or other source

I am using JSP servlets with a MySQL database. I have an input field "Introduction". When a user copies and pastes a paragraph from Word, the " character (double quote) is stored as ? in my table, but only when the text is copied from Word or some other source. Also, if a user copies two paragraphs with blank lines in between, a buggy character ends up in my SQL table, and the JS that loads the introduction in my JSP page fails. I have also attached a screenshot of this. How can I resolve this?
Microsoft, in its infinite wisdom, decided to have non-standard double quotes -- a left version and a right version. But that should be fixable, since those quotes do exist somewhere in the huge world of utf8 characters.
However, the data from your 'copy' was probably not copied in utf8 encoding. Since it is unclear how that is being done, we can't give you complete details on fixing it.
The "best" plan is to establish "utf8" at all stages of data/client/server/database/table/column/etc.
The quick-and-dirty fix is to replace the funny quotes with ascii quotes.
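One way such a ? can arise is sketched below in Python. This is an assumption, since the question does not show how the connection or the columns are configured. The sketch supposes that some stage converts the text to Latin-1, which has no curly quotes, and substitutes a question mark for anything it cannot represent.

```python
# Text as pasted from Word: curly double quotes U+201C and U+201D around a word.
word_text = "\u201cIntroduction\u201d"

# A stage that only speaks Latin-1 has no code for the curly quotes,
# so a lossy conversion replaces each of them with "?".
as_latin1 = word_text.encode("latin-1", errors="replace")

# With UTF-8 at every stage the quotes survive a round trip unchanged.
as_utf8 = word_text.encode("utf-8")

print(as_latin1)
print(as_utf8.decode("utf-8") == word_text)
```

This is why establishing UTF-8 at every stage is the more complete fix: the curly quotes are then never forced through an encoding that cannot hold them.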

MSSQL to MySQL migration - char encoding issues with UCS-2 surrogate pairs, how can I remove these from MSSQL database?

I have been tasked with migrating a Microsoft SQL Server 2005 database to MySQL 5.6 (these are both database servers running locally) and would really appreciate some help.
-MSSQL source database has latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.
-MySQL target database wants the character set UTF-8
I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the MSSQL database.
The migration toolkit copytable program did not provide a very useful error message: "Error during charset conversion of wstring: No error". It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking at the binary data for the field and comparing byte pairs (little endian) with data that was being migrated successfully).
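To see which character the pair DBC0 DC83 stands for, the standard UTF-16 surrogate arithmetic can be applied. The Python sketch below is only an illustration. The helper name combine_surrogates is made up, and the byte order is taken from the little-endian layout described above.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xDBC0, 0xDC83)

# The same pair as it would appear in the little-endian binary data.
raw = bytes([0xC0, 0xDB, 0x83, 0xDC])
decoded = raw.decode("utf-16-le")

print(hex(code_point))
print(len(decoded), decoded.encode("utf-8").hex())
```

The pair is a single code point above U+FFFF, and it needs four bytes in UTF-8.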
When this surrogate pair was removed from the MSSQL database the row was migrated successfully to MySQL.
Here is the problem:
I have tried to search for these characters in a test MSSQL table (this chartest table is just various test strings in an nvarchar field) to prepare a replacement script and keep getting strange results... I must be doing something incorrectly.
Searching for
SELECT * FROM chartest WHERE text LIKE NCHAR(0xdc83)
Will return any surrogate pair character (whether or not it uses DC83), but obviously, only if it is the only character (or part of the pair) in that field. This isn't a big deal since I would like to remove any instance of these anyway (I don't like to remove data like this but I think we can afford it).
Searching for
SELECT * FROM chartest WHERE text LIKE '%' + (NCHAR(0xdc83))+ '%'
Will return every row! Regardless of whether it even has a unicode character present in the field let alone the DC83 character. Is there a better way to find and replace these characters? Or something else I should try?
I have also tried setting the target database, table, and field character set to UCS-2, but it does not seem to make a difference.
I should also mention that this migration is using live data (~50GB database!) while one of the sites that feeds it is taken offline so any solutions to this need to have a quick running time...
I would appreciate any suggestions very much! Please let me know if there is any information I have left out.
I had this error, and now I have discovered the source of the problem. I had a hard time finding out, so maybe this will be useful to someone, even though I realize, my problem and workaround may not be spot on matching op's original trouble.
I am migrating data from MSSQL to MySQL, and the content being migrated is html-content from Sitecore CMS (target CMS is Drupal, btw).
I've found that I get this error when the conversion hits records that contain Instagram embeds. Instagram embeds work by copying the embedded post data into the embed code (instead of loading it asynchronously; even the image is included as base64 CSS), and young people nowadays tend to put a lot of emojis in their image descriptions (using their iPhones with the emoji keyboard). Emojis are encoded as 4-byte characters in UTF-8, but MySQL utf8 only allows for 3-byte encoded Unicode characters.
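The byte counts are easy to check outside the database. The Python sketch below is only an illustration. The helper names utf8_length and fits_mysql_utf8 are made up, and the second one assumes the 3-byte limit corresponds to code points up to U+FFFF.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character is at most U+FFFF, i.e. at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(utf8_length("\u0142"), utf8_length("\u20ac"), utf8_length("\U0001F600"))
print(fits_mysql_utf8("Krzes\u0142a"), fits_mysql_utf8("nice \U0001F600"))
```

A check like fits_mysql_utf8 could be run over the source rows beforehand to find the records that would fail.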
My initial error from running wbcopytables.exe (which is the non-GUI way of doing Migration Wizard in MySQL Workbench) was the
Error during charset conversion of wstring: No error
but upgrading MySQL Workbench to recent version (from 5.something to 6.x) makes the error a bit more descriptive, hinting table and column (alas, not row):
ERROR: Could not successfully convert UCS-2 string to UTF-8 in table
[MyDatabase].[dbo].[MyTable] (column MyColumn).
Original string: ...
Anyway - a solution *could* be to use utf8mb4 which would allow for the emoji's. Read more here.
But it looks like, it's a bad idea to do this in e.g. my case with Drupal.
So - the solution I ended up with was simply to strip these characters in my migrate-script. There is no point in keeping these for users of the site in question, since they are being displayed as rectangles on the webpage anyway. Since you can't search-and-replace with regex in SQL Server, I processed the data using a DAL and c# .NET, and I found the help here (thanks a ton, Jon Skeet) - turns out there is a regex-pattern for matching one half of a surrogate pair in UTF-16. See below (and use the pattern in another language if needed).
var noUcs2SurrogatePairsString = Regex.Replace(stringWithUcs2SurrogatePairs, @"\p{Cs}", string.Empty);
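For a migrate-script written in Python, a rough equivalent is sketched below. It is not the code used above, and the names NON_BMP and strip_non_bmp are made up. Python strings are sequences of code points rather than UTF-16 units, so instead of matching surrogate halves it removes every character above U+FFFF.

```python
import re

# Characters outside the Basic Multilingual Plane. These are the ones that
# UTF-16 has to represent as surrogate pairs.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text: str) -> str:
    """Remove every character that would need a surrogate pair in UTF-16."""
    return NON_BMP.sub("", text)


print(strip_non_bmp("nice \U0001F600 photo"))
```

Characters inside the Basic Multilingual Plane, including accented and CJK characters, are left untouched.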
I had a very similar problem today and found that it was caused by empty strings. I replaced them with NULLs (or a value representing no data) and the migration worked fine.
I solved it by editing the "import data script.cmd": where it reads columns "As NVARCHAR", I replaced those with plain "VARCHAR".
Note: My table columns were already of VARCHAR type, so for some reason the migration script improperly cast them to the Unicode (NVARCHAR) type.
This issue has now been resolved. I used user Remus Rusanu's suggestion here for finding the rows with these surrogate pair characters using CHARINDEX and have decided to use SUBSTRING to exclude the troublesome characters like so:
UPDATE test
SET a = SUBSTRING(a, 1, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 - 1) -- string before the unwanted character
+ SUBSTRING(a, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 +1, LEN(a) ) -- string after the unwanted character
WHERE CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000))) % 2 = 1 -- only odd numbered charindexes (to signify match at beginning of byte pair character)

Removing strange characters from MySQL data

Somewhere along the way, between all the imports and exports I have done, a lot of the text on a blog I run is full of weird accented A characters.
When I export the data using mysqldump and load it into a text editor with the intention of using search-and-replace to clear out the bad characters, searching just matches every "a" character.
Does anyone know any way I can successfully hunt down these characters and get rid of them, either directly in MySQL or by using mysqldump and then reimporting the content?
This is an encoding problem; the Â is part of a non-breaking space (HTML entity &nbsp;) stored as UTF-8 but being displayed as Latin1.
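The mismatch can be reproduced outside MySQL. The short Python sketch below is only an illustration of the mechanism and does not depend on any particular table. It encodes a non-breaking space as UTF-8 and then reads those bytes back as Latin-1.

```python
nbsp = "\u00a0"  # non-breaking space

# In UTF-8 the non-breaking space is the two bytes C2 A0.
utf8_bytes = nbsp.encode("utf-8")

# Read as Latin-1, each byte becomes its own character:
# C2 is the accented A, and A0 is a Latin-1 non-breaking space.
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong step recovers the original character.
recovered = misread.encode("latin-1").decode("utf-8")

print(utf8_bytes.hex(), repr(misread), recovered == nbsp)
```

So each stray accented A is followed by an invisible non-breaking space that came from the same original character.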
You might try something like this... first we check to make sure the matching is working:
SELECT * FROM some_table WHERE some_field LIKE BINARY '%Â%'
This should return any rows in some_table where some_field has a bad character. Assuming that works properly and you find the rows you're looking for, try this:
UPDATE some_table SET some_field = REPLACE( some_field, BINARY 'Â', '' )
And that should remove those characters (based on the page you linked, you don't really want an nbsp there as you would end up with three spaces in a row between sentences etc, you should only have one).
If it doesn't work then you'll need to look at the encoding and collation being used.
EDIT: Just added BINARY to the strings; this should hopefully make it work regardless of encoding.
The accepted answer did not work for me.
From here http://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/ I have found that the binary code for the Â character is c2a0 (by converting the column to VARBINARY and looking at what it turns into).
Then here http://www.oneminuteinfo.com/2013/11/mysql-replace-non-ascii-characters.html found the actual solution to remove (replace) it:
update entry set english_translation = unhex(replace(hex(english_translation),'C2A0','20')) where entry_id = 4008;
The query above replaces it with a space; a normal trim can then be applied, or you can simply replace it with '' instead.
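The same replacement can be done on an exported dump in Python. The sketch below is only an illustration, and the helper name replace_nbsp_bytes is made up. It works on whole bytes, whereas a search over a string of hex digits can in principle also match across a byte boundary (for example, the bytes 1C 2A 05 print as 1C2A05, which contains C2A0).

```python
def replace_nbsp_bytes(raw: bytes) -> bytes:
    """Replace the UTF-8 non-breaking space (C2 A0) with an ASCII space (20)."""
    return raw.replace(b"\xc2\xa0", b"\x20")


original = "one.\u00a0 Two".encode("utf-8")
print(replace_nbsp_bytes(original).decode("utf-8"))

# Bytes whose hex digits merely contain "C2A0" across a boundary are untouched.
print(replace_nbsp_bytes(bytes([0x1C, 0x2A, 0x05])).hex())
```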
I have had this problem and it is annoying, but solvable. As well as Â you may find you have a whole load of characters showing up in your data like these:
“
This is connected to encoding changes in the database, but so long as you do not have any of these characters in your database that you want to keep (e.g. if you are actually using a Euro symbol) then you can strip them out with a few MySQL commands as previously suggested.
In my case I had this problem with a WordPress database that I had inherited, and I found a useful set of pre-formed queries that work for WordPress here http://digwp.com/2011/07/clean-up-weird-characters-in-database/
It's also worth noting that one of the causes of the problem in the first place is opening a database in a text editor which might change the encoding in some way. So if you can possibly manipulate the database using MySQL only and not a text editor this will reduce the risk of causing further trouble.
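As an alternative to stripping, sequences like the one shown above can sometimes be converted back to the character they started as. The Python sketch below assumes the text was UTF-8 that got read as Windows-1252 exactly once. The helper name repair_mojibake is made up, and anything that does not fit the assumption is returned unchanged.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been read as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or not damaged at all), so leave it alone.
        return text


damaged = "\u00e2\u20ac\u0153"  # the three characters shown above
print(repair_mojibake(damaged) == "\u201c")
```

This does not cover every case: some damaged sequences contain bytes that Windows-1252 leaves undefined, and those fall through to the unchanged branch. Test it on a copy of the data first.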