How to fix double-encoded UTF8 characters (in an utf-8 table)

A previous LOAD DATA INFILE was run under the assumption that the CSV file was latin1-encoded. During this import, each multibyte character was interpreted as two single-byte characters and then encoded as UTF-8 (again).
This double-encoding created anomalies like Ã± instead of ñ.
How to correct these strings?

The following MySQL function will return the correct utf8 string after double-encoding:
CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)
It can be used with an UPDATE statement to correct the fields:
UPDATE tablename SET
field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8);
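To see why the nested CONVERT works, the round trip can be reproduced outside the database. The following is a minimal Python sketch, not part of the original answer; it assumes Python's latin-1 codec as a stand-in for MySQL's latin1 (which is closer to cp1252), and the function names are made up for illustration.

```python
def double_encode(text: str) -> str:
    """Simulate the faulty import: UTF-8 bytes misread as latin-1."""
    return text.encode("utf-8").decode("latin-1")


def fix_double_encoded(text: str) -> str:
    """Undo one round of double-encoding (mirrors the nested CONVERT)."""
    # CONVERT(field USING latin1) + CAST(... AS BINARY): recover the raw bytes
    raw = text.encode("latin-1")
    # CONVERT(... USING utf8): read those bytes as the UTF-8 they always were
    return raw.decode("utf-8")


broken = double_encode("ñ")
print(broken)                      # the mojibake form
print(fix_double_encoded(broken))  # back to the original character
```

The innermost step turns each mojibake character back into a single byte, which is what makes the final UTF-8 decode see the original byte sequence.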

The above answer worked for some of my data, but resulted in a lot of NULL columns after running. My thought is if the conversion wasn't successful it returns null. To avoid that, I added a small check.
UPDATE
tbl
SET
col =
CASE
WHEN CONVERT(CAST(CONVERT(col USING latin1) AS BINARY) USING utf8) IS NULL THEN col
ELSE CONVERT(CAST(CONVERT(col USING latin1) AS BINARY) USING utf8)
END
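One plausible explanation for the NULL results is that those rows were never double-encoded: once such a string is turned back into latin1 bytes, the bytes are usually not valid UTF-8, so the last conversion fails. A rough Python sketch of the same guard follows; the function name is hypothetical and latin-1 again stands in for MySQL's latin1.

```python
def fix_or_keep(text: str) -> str:
    """Return the repaired string, or the input unchanged if repair fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin-1, or the bytes are not valid UTF-8:
        # the value was most likely stored correctly, so leave it alone.
        return text


rows = ["EspaÃ±a", "España", "plain ascii"]
print([fix_or_keep(r) for r in rows])
```

Only the first row is changed; the already-correct row and the ASCII row pass through untouched, which is the behaviour the CASE expression aims for.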

It is important to use "utf8mb4" instead of "utf8" here: with "utf8", MySQL can strip out all the data after a character it does not recognize. So the safer method is:
UPDATE tablename SET
field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8mb4);
Be careful with this.
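The difference matters because MySQL's "utf8" character set stores at most three bytes per character, while "utf8mb4" allows four. A small Python check (a sketch with a made-up helper name) shows which strings need the larger character set:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes four bytes in UTF-8."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


print(needs_utf8mb4("ñ"))   # False
print(needs_utf8mb4("😀"))  # True
```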

I ran into this issue too; here is a solution for Oracle:
update tablename t set t.colname = convert(t.colname, 'WE8ISO8859P1', 'UTF8') where t.colname like '%Ã%'
And another one for Java:
public static String fixDoubleEncoded(String text) {
final Pattern pattern = Pattern.compile("^.*Ã[^0-9a-zA-Z\\ \t].*$");
try {
while (pattern.matcher(text).matches())
text = new String(text.getBytes("iso-8859-1"), "utf-8");
}
catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
return text;
}

Related

Unescape diacritics in \u0 format (JSON) in MS SQL (SQL Server)

I'm getting a JSON file, which I load into an Azure SQL database. This JSON is the direct output of an API, so there is nothing I can do with it before loading it into the DB.
In that file, all Polish diacritics are escaped in "C/C++/Java source code" form (based on: http://www.fileformat.info/info/unicode/char/0142/index.htm).
So for example:
ł is \u0142
I was trying to find some method to convert (unescape) those to proper Polish letters.
In the worst case, I can write a function that replaces every combination:
Replace(Replace(string,'\u0142',N'ł'),'\u0144',N'ń')
And so on, making one big, terrible function...
I was looking for a ready-made function like the ones that exist for URL decoding, which has been answered here on Stack Overflow in many topics, and here: https://www.codeproject.com/Articles/1005508/URL-Decode-in-T-SQL
Using this solution would be possible, but I cannot figure out the cast/convert with the proper collation and types to get the result I'm looking for.
So if anyone knows of or has a function that unescapes \u sequences in a string, that would be great, but I can manage to write something on my own if I get the conversion right. For example, I tried:
select convert(nvarchar(1), convert(varbinary, 0x0142, 1))
I assumed that changing \u to 0x would be the answer, but it gives a Chinese character. So this is the wrong direction...
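One likely reason for the unexpected character is byte order: SQL Server stores nvarchar data as little-endian UTF-16, so the bytes 0x01 0x42 are read as code point U+4201 rather than U+0142. The Python sketch below only illustrates that byte-order effect; it is not T-SQL.

```python
raw = bytes.fromhex("0142")

as_little_endian = raw.decode("utf-16-le")  # how nvarchar reads the bytes
as_big_endian = raw.decode("utf-16-be")     # the order the \u escape implies

print(hex(ord(as_little_endian)))  # 0x4201, a CJK ideograph
print(as_big_endian)               # ł
```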
Edit:
After googling more, I found exactly the same question here on Stack Overflow from @Pasetchnik: Json escape unicode in SQL Server
It looks like this is the best solution available in MS SQL.
The only thing I needed to change was to use NVARCHAR instead of the VARCHAR used in the linked solution:
CREATE FUNCTION dbo.Json_Unicode_Decode(@escapedString NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @pos INT = 0,
            @char NVARCHAR(1),
            @escapeLen TINYINT = 2,   -- length of the '\u' prefix
            @hexDigits TINYINT = 4    -- number of hex digits after '\u'
    SET @pos = CHARINDEX('\u', @escapedString, @pos)
    WHILE @pos > 0
    BEGIN
        -- convert the four hex digits to a code point, then to a character
        SET @char = NCHAR(CONVERT(varbinary(8), '0x' + SUBSTRING(@escapedString, @pos + @escapeLen, @hexDigits), 1))
        -- replace the six-character escape with the decoded character
        SET @escapedString = STUFF(@escapedString, @pos, @escapeLen + @hexDigits, @char)
        SET @pos = CHARINDEX('\u', @escapedString, @pos)
    END
    RETURN @escapedString
END
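For comparison, the same loop can be written in a few lines outside the database. This is a Python sketch under the assumption that only four-digit \uXXXX escapes need decoding; surrogate pairs and other JSON escapes such as \n are deliberately left alone, and the function name is illustrative.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def json_unicode_decode(escaped: str) -> str:
    """Replace every \\uXXXX escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


print(json_unicode_decode(r"\u0142 \u0144\u0142"))  # ł ńł
```

If the input is a complete JSON document rather than a fragment, a real JSON parser would handle these escapes as part of parsing.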

Instead of nested REPLACE you could use:
DECLARE @string NVARCHAR(MAX)= N'\u0142 \u0144\u0142';
SELECT @string = REPLACE(@string,u, ch)
FROM (VALUES ('\u0142',N'ł'),('\u0144', N'ń')) s(u, ch);
SELECT @string;

SQL - Incorrect string value: '\xEF\xBF\xBD' [duplicate]

I have a table that needs to handle various characters, including Ø, ® etc.
I have set my table to utf-8 as the default collation and all columns use the table default; however, when I try to insert these characters I get the error: Incorrect string value: '\xEF\xBF\xBD' for column 'buyerName' at row 1
My connection string is defined as
string mySqlConn = "server="+server+";user="+username+";database="+database+";port="+port+";password="+password+";charset=utf8;";
I am at a loss as to why I am still seeing errors. Have I missed anything with either the .net connector, or with my MySQL setup?
--Edit--
My (new) C# insert statement looks like:
MySqlCommand insert = new MySqlCommand( "INSERT INTO fulfilled_Shipments_Data " +
"(amazonOrderId,merchantOrderId,shipmentId,shipmentItemId,"+
"amazonOrderItemId,merchantOrderItemId,purchaseDate,"+ ...
VALUES (@amazonOrderId,@merchantOrderId,@shipmentId,@shipmentItemId,"+
"@amazonOrderItemId,@merchantOrderItemId,@purchaseDate,"+
"paymentsDate,shipmentDate,reportingDate,buyerEmail,buyerName,"+ ...
insert.Parameters.AddWithValue("@amazonorderId",lines[0]);
insert.Parameters.AddWithValue("@merchantOrderId",lines[1]);
insert.Parameters.AddWithValue("@shipmentId",lines[2]);
insert.Parameters.AddWithValue("@shipmentItemId",lines[3]);
insert.Parameters.AddWithValue("@amazonOrderItemId",lines[4]);
insert.Parameters.AddWithValue("@merchantOrderItemId",lines[5]);
insert.Parameters.AddWithValue("@purchaseDate",lines[6]);
insert.Parameters.AddWithValue("@paymentsDate",lines[7]);
insert.ExecuteNonQuery();
Assuming that this is the correct way to use parametrized statements, it is still giving an error
"Incorrect string value: '\xEF\xBF\xBD' for column 'buyerName' at row 1"
Any other ideas?

\xEF\xBF\xBD is the UTF-8 encoding of the Unicode character U+FFFD. This is a special character, also known as the "replacement character". A quote from the Wikipedia page about the special Unicode characters:
The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table. It is used to indicate problems when a system is not able to decode a stream of data to a correct symbol. It is most commonly seen when a font does not contain a character, but is also seen when the data is invalid and does not match any character:
So it looks like your data source contains corrupted data. It is also possible that you are reading the data using the wrong encoding. Where do the lines come from?
If you can't fix the data, and your input does contain invalid characters, you could just remove the replacement characters:
lines[n] = lines[n].Replace("\uFFFD", "");
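The bytes in the error message can be checked directly, and the same experiment shows how the replacement character typically gets into the data in the first place: a decoder hits bytes it cannot interpret and substitutes U+FFFD. A short Python sketch (the sample bytes are invented for illustration):

```python
# U+FFFD encodes to exactly the bytes quoted in the MySQL error.
print("\ufffd".encode("utf-8"))  # b'\xef\xbf\xbd'

# 0xD8 is "Ø" in latin-1, but on its own it is not valid UTF-8.
latin1_bytes = "Ø".encode("latin-1")
decoded = latin1_bytes.decode("utf-8", errors="replace")
print(decoded == "\ufffd")  # True

# Stripping the replacement character, as suggested above.
cleaned = ("J" + decoded + "rgen").replace("\ufffd", "")
print(cleaned)  # Jrgen
```

Note that stripping loses the original letter for good, which is why reading the source with the right encoding is the better fix when it is possible.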

Mattmanser is right: never write a SQL query by concatenating the parameters directly into the query. An example of a parametrized query is:
string lastname = "Doe";
double height = 6.1;
DateTime birthDate = new DateTime(1978,4,18);
var connection = new MySqlConnection(connStr);
try
{
    connection.Open();
    var command = new MySqlCommand(
        "SELECT * FROM tblPerson WHERE LastName = @Name AND Height > @Height AND BirthDate < @BirthDate", connection);
    command.Parameters.AddWithValue("@Name", lastname);
    command.Parameters.AddWithValue("@Height", height);
    command.Parameters.AddWithValue("@BirthDate", birthDate);
    MySqlDataReader reader = command.ExecuteReader();
    ...
}
finally
{
    connection.Close();
}
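The same pattern in a self-contained form, for readers who want to run it: the Python sketch below uses the standard-library sqlite3 module as a stand-in for MySQL (so the placeholder syntax is :name rather than @Name), and the table and values are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblPerson (LastName TEXT, Height REAL, BirthDate TEXT)")
conn.execute(
    "INSERT INTO tblPerson VALUES (:name, :height, :birth)",
    {"name": "Døe", "height": 6.2, "birth": "1970-01-01"},
)

# Values travel separately from the SQL text, so no quoting or escaping is needed.
rows = conn.execute(
    "SELECT LastName FROM tblPerson "
    "WHERE LastName = :name AND Height > :height AND BirthDate < :birth",
    {"name": "Døe", "height": 6.1, "birth": "1978-04-18"},
).fetchall()
conn.close()

print(rows)  # [('Døe',)]
```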

To those who have a similar problem using PHP, try the function utf8_encode($string). It worked for me. Note that utf8_encode assumes the input is ISO-8859-1, and it is deprecated as of PHP 8.2.

I had the same problem when my website's encoding was UTF-8 and I tried to submit a CP-1250 string in a form (the example was taken from a listdir listing).
I think you must send the string in the same encoding as the website.

Extract first character of each word in MySQL using a RegEx

In my MySQL database I have a column of strings in UTF-8 format from which I want to extract the first character of each word using a RegEx. For example:
Assuming a RegEx which ONLY extracts the following characters:
ਹਮਜਰਣਚਕਨਖਲਨ
And given the following string:
ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥
The only characters extracted would be:
ਹਰਚਨਲਨ
I know the following steps would be required to solve this problem:
Break the string into individual words (substrings) by using space as the delimiter
For each word extract the first letter (substring of a substring) if it matches what is in the regex of valid characters
I have looked at all the similar questions/answers on SO and none have been able to solve my problem thus far.
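The two steps listed above can be sketched in Python to pin down the expected behaviour before porting it to SQL. This is only an illustration of the logic, not a MySQL solution; the function name is made up, and the allowed set is the one from the question.

```python
def first_letters(text: str, allowed: str) -> str:
    """Concatenate the first character of each word if it is in `allowed`."""
    allowed_set = set(allowed)
    # split() breaks on whitespace; w[0] is the first code point,
    # so vowel signs attached to the first letter are ignored.
    return "".join(w[0] for w in text.split() if w[0] in allowed_set)


allowed = "ਹਮਜਰਣਚਕਨਖਲਨ"
line = "ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥"
print(first_letters(line, allowed))  # ਹਰਚਨਲਨ
```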

I really don't know MySQL's regex syntax and restrictions (I have never used it), but you can add a leading space before the string and match with something simple like this: " ([ਮਜਰਣਚਕਨਖਲਨ]{1})"
So, if you concatenate the matched groups you will get the string "ਰਚਨਲਨ" (only "ਹ" is not matched, because it is not in the character set of this sample pattern).
In C# it may look like this (working sample):
namespace TestRegex
{
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Windows.Forms;
    class Program
    {
        static void Main(string[] args)
        {
            // leading space (to match the first word too)
            // + sample string
            var sample = " ";
            sample += "ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥";
            // Regex pattern that will match a space and,
            // if the next character is in the set, add it to "match group 1"
            var pattern = " ([ਮਜਰਣਚਕਨਖਲਨ]{1})";
            // select every "match group 1" from the matches
            var result = from Match m in Regex.Matches(sample, pattern)
                         select m.Groups[1];
            // concatenate the groups into one string and
            // show it in a message box to the user, for example
            MessageBox.Show(string.Concat(result));
        }
    }
}
In most non-query languages it will look almost the same. For example, in PHP you need to call preg_match_all and, in a foreach loop, append "$match[i][1]" (every "match group 1") from every match to the end of one single string.
Well, pretty simple. But not for MySQL...

I finally achieved this with the help of a programmer friend of mine. I directly pasted the following piece of code into the SQL section of my database in PhpMyAdmin:
delimiter $$
drop function if exists `initials`$$
CREATE FUNCTION `initials`(str text, expr text) RETURNS text CHARSET utf8
begin
declare result text default '';
declare buffer text default '';
declare i int default 1;
if(str is null) then
return null;
end if;
set buffer = trim(str);
while i <= length(buffer) do
if substr(buffer, i, 1) regexp expr then
set result = concat( result, substr( buffer, i, 1 ));
set i = i + 1;
while i <= length( buffer ) and substr(buffer, i, 1) regexp expr do
set i = i + 1;
end while;
while i <= length( buffer ) and substr(buffer, i, 1) not regexp expr do
set i = i + 1;
end while;
else
set i = i + 1;
end if;
end while;
return result;
end$$
drop function if exists `acronym`$$
CREATE FUNCTION `acronym`(str text) RETURNS text CHARSET utf8
begin
declare result text default '';
set result = initials( str, '[ੴਓੳਅੲਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼]' );
return result;
end$$
delimiter ;
UPDATE scriptures SET search = acronym(scripture)
Just to explain the last line:
scriptures is the table I want to update
search is a new empty column I created inside the table to store the result
scripture is an existing column inside the scriptures table with all the strings I want to extract from
acronym is the function previously declared which is looking to match the first letter of each word with a character from the RegEx [ੴਓੳਅੲਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼]
So this final line of the code will go through each row of the column scripture, apply the function acronym to it and store the result in the new search column.
Perfect! Exactly what I was looking for :)

Illegal mix of collations for operation 'like' while searching with Ignited-Datatables

I have successfully implemented Ignited-Datatables. However, when I search the database by typing "non-latin" characters like "İ,ş,ğ,..", I get:
POST http://vproject.dev/module/user/ign_listing 500 (Internal Server Error)
Details are:
Illegal mix of collations for operation 'like' while searching
... (u.id_user LIKE '%Ä°%' OR u.first_name LIKE '%Ä°%' OR u.last_name LIKE '%Ä°%' OR ue.email LIKE '%Ä°%' OR u.last_login LIKE '%Ä°%' ) ...
%Ä°% part changes according to the non-latin character you typed.
Any idea for solving this?

I figured out the problem. It seems it is the DATETIME fields that cause it.
.. ue.last_login LIKE '%ayşenur%'
gives the error Illegal mix of collations for operation 'like'. When I remove the DATETIME fields from the LIKE conditions, there is no error any more. I hope this helps.

Try the following:
u.id_user LIKE '%Ä°%' OR ... OR ... '%Ä°%' COLLATE utf8_bin
Refer to MySQL Unicode Character Sets
Also you can refer to MySQL _bin and binary Collations for more information on utf8_bin:
Nonbinary strings (as stored in the CHAR, VARCHAR, and TEXT data
types) have a character set and collation. A given character set can
have several collations, each of which defines a particular sorting
and comparison order for the characters in the set. One of these is
the binary collation for the character set, indicated by a _bin suffix
in the collation name. For example, latin1 and utf8 have binary
collations named latin1_bin and utf8_bin.

The question is a little bit old, but I finally found a solution: change "LIKE " to "LIKE binary ".

I was having the same problem with the DataTables search in ssp.class.php.
I solved it by converting the column to UTF8, like this:
CONVERT(`user_datetime` USING utf8)
The fix in ssp.class.php:
$globalSearch[] = "CONVERT(`".$column['db']."` USING utf8) LIKE ".$binding;
My final code was:
static function filter ( $request, $columns, &$bindings )
{
$globalSearch = array();
$columnSearch = array();
$dtColumns = self::pluck( $columns, 'dt' );
if ( isset($request['search']) && $request['search']['value'] != '' ) {
$str = $request['search']['value'];
for ( $i=0, $ien=count($request['columns']) ; $i<$ien ; $i++ ) {
$requestColumn = $request['columns'][$i];
$columnIdx = array_search( $requestColumn['data'], $dtColumns );
$column = $columns[ $columnIdx ];
if ( $requestColumn['searchable'] == 'true' ) {
$binding = self::bind( $bindings, '%'.$str.'%', PDO::PARAM_STR );
$globalSearch[] = "CONVERT(`".$column['db']."` USING utf8) LIKE ".$binding;
}
}
}

I know this is far too late, but here is my workaround:
SELECT * FROM (SELECT DATE_FORMAT(some_date,'%d/%m/%Y') AS some_date FROM some_table)tb1
WHERE some_date LIKE '% $some_variable %'
A datetime/date column gives the error Illegal mix of collations for operation 'like'. By converting it inside a derived table, the original column type is replaced with a varchar type.
Also, make sure to convert any such column before it goes into the derived table, to make the matching easier.

I met a similar error when LIKE was applied to the DateTime column.
So now, instead of using simple date_col LIKE '2019%' I use CAST(date_col AS CHAR) LIKE '2019%'.
The solution was found on the official MySQL bugs website.

MYSQL case sensitive search (using hibernate) for utf8

I have a Login table that has the utf8 charset and a utf8 collation. When I check a user name and retrieve other information for that specific user name, the HQL query gives me the same result for lowercase and uppercase.
What should I do to make my HQL query case-sensitive?
I use MySQL 5 and Java Hibernate.
This is my query:
return queryManager.executeQueryUniqueResult("select b.login from BranchEntity b where b.userName = ?", username);

The easiest way is to change your column's definition to use a case-sensitive collation like utf8_bin.
Details are here

Add a class:
public class CollateDialect extends MySQLDialect {
    @Override
    public String getTableTypeString() {
        return " COLLATE utf8_bin";
    }
}
and use it:
cfg.setProperty("hibernate.dialect", "org.kriyak.hbm.CollateDialect");
This will make all queries case-sensitive, which makes much more sense.
