MySQL strict string comparison the semantically correct way? - mysql

The **compare-on-binary way** is **NOT semantically-correct**
For example, when you do a strict string comparison between strings in different encodings, the flaw in the compare-on-binary approach shows up. The following test case illustrates why:
In this case, I want to replace the string '北京 ' (with a trailing space) in the field 城市 with the string '北京111', but keep the string '北京' unchanged, so I wrote the following SQL:
SELECT CASE WHEN BINARY `城市` = BINARY '北京 ' THEN '北京111' ELSE `城市` END
FROM `中文测试表1`
GROUP BY BINARY CASE WHEN BINARY `城市` = BINARY '北京 ' THEN '北京111' ELSE `城市` END
The underlying table definition and data (the session encoding is set to 'utf8mb4'):
CREATE TABLE `中文测试表1` (
`城市` varchar(50) CHARACTER SET gbk DEFAULT NULL,
`销量` int(11) DEFAULT NULL
) ENGINE=InnoDB;
INSERT INTO `中文测试表1` VALUES ('杭州', '111');
INSERT INTO `中文测试表1` VALUES ('北京', '345');
INSERT INTO `中文测试表1` VALUES ('北京 ', '123');
What actually happened is that the string '北京 ' was not replaced by '北京111' and stayed as it was in the result set.
The reason is that the string literal '北京 ' is encoded using utf8mb4 (decided by the session), while the value '北京 ' from the field 城市 is encoded using gbk (decided by the table definition). When both are converted to binary they do not match byte for byte, even though the two strings are semantically equal character by character (no matter which underlying encoding is used).
So, what is the semantically correct way to compare strings strictly in MySQL?

See the TRIM() function for removing spaces from the start/end of strings.
Converting between gbk and utf8mb4 leaves you at the mercy of the conversion tables; you may (or may not) get the desired transliteration.
'北京' is HEX E58C97 E4BAAC for utf8/utf8mb4
'北京 ' is HEX E58C97 E4BAAC 20 for utf8/utf8mb4 -- as found in the query
'北京' is HEX B1B1 BEA9 for gbk
'北京 ' is HEX B1B1 BEA9 20 for gbk -- as found in the table
When you say SELECT ... BINARY '北京 ' ..., the encoding for the string is based on the connection, not the column encoding. So it is utf8mb4.
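The byte-level mismatch is easy to reproduce outside MySQL. The following is a minimal Python sketch, not MySQL's own conversion code; it assumes Python's built-in gbk and utf-8 codecs agree with MySQL's gbk and utf8mb4 for these two characters, and the variable names are only illustrative:

```python
# The literal as typed in the query, with its trailing space.
literal = "北京 "

# Bytes as the connection would send them (utf8mb4) and as the column stores them (gbk).
utf8_bytes = literal.encode("utf-8")
gbk_bytes = literal.encode("gbk")

print(utf8_bytes.hex(" ").upper())  # E5 8C 97 E4 BA AC 20
print(gbk_bytes.hex(" ").upper())   # B1 B1 BE A9 20

# A BINARY-style comparison looks at raw bytes, so it reports a mismatch.
binary_equal = utf8_bytes == gbk_bytes

# Converting both sides to one character set first compares characters, not bytes.
semantic_equal = utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk")

print(binary_equal, semantic_equal)  # False True
```

The second comparison is the Python analogue of converting the literal to the column's character set before comparing.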
Instead of ... WHEN BINARY 城市 = BINARY '北京 ' THEN ..., do one of these:
Plan A, Let the conversions happen automatically: ... WHEN 城市 = '北京 ' THEN ...
Plan B, Explicitly convert: ... WHEN 城市 = CONVERT('北京 ' USING gbk) THEN ...
Plan C, Use HEX: ... WHEN HEX(城市) = HEX(CONVERT('北京 ' USING gbk)) THEN ...
Plan D, closer to your attempt: ... WHEN BINARY 城市 = BINARY(CONVERT('北京 ' USING gbk)) THEN ...
There are other ways, using COLLATE utf8_bin, COLLATE gbk_bin, etc.

Related

Update and replace unquoted JSON strings

I have the following table in my database:
Type | Name
-------------------------------------------------
INT(10) UNSIGNED | id
LONGTEXT | settings
The settings column holds JSON strings such as the following:
'[
{"value":"1","label":"user_type"},
{"value":"2","label":"email_type"}
]'
I have some corrupt data that doesn't correspond to the required format as the requirements have now changed.
'[
{"value": 8,"label":"should_receive_notifications"},
]'
Notice how the value is unquoted compared to the first example which is how I need them.
Is there a way I can do a find and replace on all JSON strings within the settings column to update all unquoted values in the JSON string and wrap them in quotes?
You may use the next procedure:
CREATE PROCEDURE quote_value(max_amount INT)
BEGIN
REPEAT
UPDATE test
SET settings = JSON_REPLACE(settings, CONCAT('$[', max_amount, '].value'), CAST(JSON_UNQUOTE(JSON_EXTRACT(settings, CONCAT('$[', max_amount, '].value'))) AS CHAR));
SET max_amount = max_amount - 1;
UNTIL max_amount < 0 END REPEAT;
END
The max_amount parameter defines how many objects in the array are updated (do not forget that array elements are counted from zero), so set it to the maximum number of objects per array.
https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=166f43d44e57b62da034bd9530713beb
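If the rows can be processed client-side instead, the same fix can be sketched in Python. This is only an illustration under stated assumptions: quote_values is a hypothetical helper name, and the stored text is assumed to be well-formed JSON (a string with a trailing comma, like the corrupt sample in the question, would have to be repaired before it parses):

```python
import json

def quote_values(settings_json: str) -> str:
    """Return the JSON array with every non-string "value" turned into a string."""
    items = json.loads(settings_json)
    for item in items:
        # Only touch entries that have a "value" key holding a non-string.
        if "value" in item and not isinstance(item["value"], str):
            item["value"] = str(item["value"])
    # Compact separators keep the output free of extra spaces.
    return json.dumps(items, separators=(",", ":"))

print(quote_values('[{"value": 8,"label":"should_receive_notifications"}]'))
```

Unlike the procedure, this needs no upper bound on the array length, at the cost of reading and writing each row from the application.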
This assumes there are no spaces between the characters in the JSON string. It is simple, but the data needs to be verified first.
update tablename
set settings = replace(settings, '"value\":' , '"value":\"')
where settings not like '%"value":"%'
update tablename
set settings = replace(settings, ',"' , '","')
where settings not like '%","%'

Teradata Masking - Retain all characters at position 1,4,8,12,16 .... in a string and mask remaining characters with 'X'

I have a requirement where I need to mask all but characters in position 1,4,8,12,16.. for a variable length string with 'X'
For example:
Input string - 'John Doe'
Output String - 'JXXn XXe'
SPACE between the two strings must be retained.
Kindly help or reach out for more details if required.
I think maybe an external function would be best here, but if that's too much to bite off, you can get crafty with strtok_split_to_table, xml_agg and regexp_replace to rip the string apart, replace out characters using your criteria, and stitch it back together:
WITH cte AS (SELECT REGEXP_REPLACE('this is a test of this functionality', '(.)', '\1,') AS fullname FROM Sys_Calendar.calendar WHERE calendar_date = CURRENT_DATE)
SELECT
REGEXP_REPLACE(REGEXP_REPLACE((XMLAGG(tokenout ORDER BY tokennum) (VARCHAR(200))), '(.) (.)', '\1\2') , '(.) (.)', '\1\2')
FROM
(
SELECT
tokennum,
outkey,
CASE WHEN tokennum = 1 OR tokennum mod 4 = 0 OR token = ' ' THEN token ELSE 'X' END AS tokenout
FROM TABLE (strtok_split_to_table(cte.fullname, cte.fullname, ',')
RETURNS (outkey VARCHAR(200), tokennum integer, token VARCHAR(200) CHARACTER SET UNICODE)) AS d
) stringshred
GROUP BY outkey
This won't be fast on a large data set, but it might suffice depending on how much data you have to process.
Breaking this down:
WITH cte AS (SELECT REGEXP_REPLACE('this is a test of this functionality', '(.)', '\1,') AS fullname FROM Sys_Calendar.calendar WHERE calendar_date = CURRENT_DATE)
This CTE is just adding a comma between every character of our incoming string using that regexp_replace function. Your name will come out like J,o,h,n, ,D,o,e. You can ignore the sys_calendar part, I just put that in so it would spit out exactly 1 record for testing.
SELECT
tokennum,
outkey,
CASE WHEN tokennum = 1 OR tokennum mod 4 = 0 OR token = ' ' THEN token ELSE 'X' END AS tokenout
FROM TABLE (strtok_split_to_table(cte.fullname, cte.fullname, ',')
RETURNS (outkey VARCHAR(200), tokennum integer, token VARCHAR(200) CHARACTER SET UNICODE)) AS d
This subquery is the important bit. Here we create a record for every character in your incoming name. strtok_split_to_table is doing the work here splitting that incoming name by comma (which we added in the CTE)
The CASE expression applies your criteria: it keeps the character when it is record 1, a multiple of 4, or a space, and swaps in 'X' everywhere else.
SELECT
REGEXP_REPLACE(REGEXP_REPLACE((XMLAGG(tokenout ORDER BY tokennum) (VARCHAR(200))), '(.) (.)', '\1\2') , '(.) (.)', '\1\2')
Finally we use XMLAGG to combine the many records back into one string in a single record. Because XMLAGG adds a space in between each character we have to hit it a couple of times with regexp_replace to flip those spaces back to nothing.
So... it's ugly, but it does the job.
The code above spits out:
tXXs XX X XeXX oX XhXX fXXXtXXXaXXXy
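To sanity-check that output without a Teradata system, the rule from the CASE expression can be replayed in a few lines of Python. This is a sketch of the rule only, not of the strtok_split_to_table/XMLAGG mechanics, and mask_keep_positions is a hypothetical name:

```python
def mask_keep_positions(text: str, mask: str = "X") -> str:
    """Keep position 1, every multiple of 4, and spaces; mask everything else."""
    out = []
    # enumerate from 1 so positions match tokennum in the SQL.
    for pos, ch in enumerate(text, start=1):
        keep = pos == 1 or pos % 4 == 0 or ch == " "
        out.append(ch if keep else mask)
    return "".join(out)

print(mask_keep_positions("this is a test of this functionality"))
# tXXs XX X XeXX oX XhXX fXXXtXXXaXXXy
```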
I couldn't think of a solution, but then @JNevill inspired me with his idea to add a comma to each character :-)
SELECT
RegExp_Replace(
RegExp_Replace(
RegExp_Replace(inputString, '(.)(.)?(.)?(.)?', '(\1(\2[\3(\4', 2)
,'(\([^ ])', 'X')
,'(\(|\[)')
,'this is a test of this functionality' AS inputString
tXXs XX X XeXX oX XhXX fXXXtXXXaXXXy
The 1st RegExp_Replace starts at the 2nd character (keep the 1st character as-is) and processes groups of (up to) 4 characters adding either a ( (characters #1,#2,#4, to be replaced by X unless it's a space) or [ (character #3, no replacement), which results in :
t(h(i[s( (i(s[ (a( (t[e(s(t( [o(f( (t[h(i(s( [f(u(n(c[t(i(o(n[a(l(i(t[y(
Of course this assumes that neither character exists in your input data; otherwise you have to choose different ones.
The 2nd RegExp_Replace replaces the ( and the following character with X unless it's a space, which results in:
tXX[s( XX[ X( X[eXX( [oX( X[hXX( [fXXX[tXXX[aXXX[y(
Now there are some (& [ left which are removed by the 3rd RegExp_Replace.
As I still consider myself a beginner in Regular Expressions, there will be better solutions :-)
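The three passes can be replayed with Python's re module to inspect each intermediate string. This is a sketch under assumptions: Python's regex flavor stands in for Teradata's, the "start at the 2nd character" argument is emulated by slicing, and mask_via_regex is a hypothetical name:

```python
import re

def mask_via_regex(text: str) -> list:
    """Replay the three RegExp_Replace steps and return every intermediate string."""
    # Step 1: keep character 1 as-is, then tag groups of up to four characters.
    step1 = text[:1] + re.sub(r"(.)(.)?(.)?(.)?", r"(\1(\2[\3(\4", text[1:])
    # Step 2: "(" followed by a non-space character becomes X.
    step2 = re.sub(r"\([^ ]", "X", step1)
    # Step 3: drop the leftover marker characters.
    step3 = re.sub(r"[(\[]", "", step2)
    return [step1, step2, step3]

for step in mask_via_regex("this is a test of this functionality"):
    print(step)
```

In this Python replay, an input whose last group holds exactly one character leaves "([" adjacent after step 1, which step 2 also turns into an X; whether Teradata behaves the same way is something I have not verified.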
Edit:
In older Teradata versions not all parameters were optional, then you might have to add values for those:
RegExp_Replace(
RegExp_Replace(
RegExp_Replace(inputString, '(.)(.)?(.)?(.)?', '(\1(\2[\3(\4', 2, 0, 'c')
,'(\([^ ])', 'X', 1, 0, 'c')
,'(\(|\[)', '', 1, 0, 'c')

invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd

I have been importing some data from MySQL to Postgres. The plan should have been simple: manually re-create the tables with their equivalent data types, devise a way to output as CSV, transfer the data over, and copy it into Postgres. Done.
mysql -u whatever -p whatever -d the_database
SELECT * INTO OUTFILE '/tmp/the_table.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' FROM the_table;
send and import to postgres
psql -etcetc -d other_database
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' );
It had been too long; I had forgotten that '0000-00-00' was a thing...
So first of all I had to come up with some way of addressing weird data types, preferably at the MySQL end, and so wrote this script for the 20 or so tables I planned to import, to address any incompatibilities and list out the columns accordingly:
with a as (
select
'the_table'::text as tblname,
'public'::text as schname
), b as (
select array_to_string( array_agg( x.column_name ), ',' ) as the_cols from (
select
case
when udt_name = 'timestamp'
then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
when udt_name = 'date'
then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
else column_name::text
end as column_name
from information_schema.columns, a
where table_schema = a.schname
and table_name = a.tblname
order by ordinal_position
) x
)
select 'SELECT '|| b.the_cols ||' INTO OUTFILE ''/tmp/'|| a.tblname ||'.csv'' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"'' ESCAPED BY ''\\'' FROM '|| a.tblname ||';' from a,b;
Generate CSV, ok. Transfer across, ok - Once over...
BEGIN;
ALTER TABLE the_table SET( autovacuum_enabled = false, toast.autovacuum_enabled = false );
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' ); -- '
ALTER TABLE the_table SET( autovacuum_enabled = true, toast.autovacuum_enabled = true );
COMMIT;
and it was all going well, until I came across this message:
ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd
CONTEXT: COPY new_table, line 12345678
A second table also encountered the same error; however, every other one imported successfully.
All columns and tables in the MySQL db were set to utf8. The first offending table, which contains messages, was along the lines of
CREATE TABLE whatever(
col1 int(11) NOT NULL AUTO_INCREMENT,
col2 date,
col3 int(11),
col4 int(11),
col5 int(11),
col6 int(11),
col7 varchar(64),
PRIMARY KEY(col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
So presumably the data should be UTF-8... right? To make sure there were no major errors, I edited my.cnf to set the encoding everywhere I could think of:
[character sets]
default-character-set=utf8
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
I altered my initial "query generating query" case statement to convert columns for the sake of converting
case
when udt_name = 'timestamp'
then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
when udt_name = 'date'
then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
when udt_name = 'text'
then 'CONVERT('|| column_name::text || ' USING utf8)'
else column_name::text
end as column_name
and still no luck. After googling "0xed 0xa0 0xbd" I am still none the wiser, character sets are not really my thing.
I even opened the 3 gig csv file to the line it mentioned and there didn't appear to be anything out of place, looking with a hex editor I could not see those byte values (edit: maybe I didn't look hard enough) so I am starting to run out of ideas. Am I missing something really simple, and worryingly, is it possible that some of the other tables may have been more "silently" corrupted too?
The MySQL version is 5.5.44 on a ubuntu 14.04 operating system and the Postgres is 9.4
Without any further things to try I went for the simplest solution, just alter the files
iconv -f utf-8 -t utf-8 -c the_file.csv > the_file_iconv.csv
The new files differed from the originals by about 100 bytes, so there must have been invalid bytes in there somewhere that I could not see. They imported "properly", so I suppose that is good; however, it would be nice to know whether there is some way to enforce proper encoding when creating the files, rather than discovering the problem on import.
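For reference, the bytes 0xED 0xA0 0xBD are the three-byte encoding of U+D83D, a UTF-16 high surrogate, which strict UTF-8 decoders reject. A minimal Python sketch for locating such bytes before the import is shown here; first_invalid_utf8_offset and the sample bytes are hypothetical, and the sketch only finds the problem, it does not decide how to repair it:

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that strict UTF-8 rejects, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

# Hypothetical CSV fragment containing the byte sequence from the error message.
sample = b"ok \xed\xa0\xbd end"
print(first_invalid_utf8_offset(sample))

# The "surrogatepass" handler decodes the sequence so the code point can be inspected.
code_point = ord(b"\xed\xa0\xbd".decode("utf-8", "surrogatepass"))
print(hex(code_point))
```

For a multi-gigabyte file, the same function could be applied line by line while reading the file in binary mode, which also yields the line number of each offending row.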

Removing white spaces in sql field

I have a table with three fields (ID, machine_name, carpark_id). Two of the machine names contain trailing spaces: (311A__) and (311B__), where _ stands for a space. How do I select and insert rows whose machine name has spaces in it?
sql_local = """SELECT id FROM customer_1.pay_machines WHERE machine_name="%s" """ % machine
sql_local = """INSERT INTO customer_1.pay_machines (machine_name, carpark_id) VALUES ("%s", 0)""" % machine
sql_local = """SELECT id FROM customer_1.pay_machines WHERE machine_name="%s" """ % machine
sql_local = """INSERT INTO customer_1.pay_and_display (plate, machine_id, ticket_datetime, expiry_datetime, ticket_name, ticket_price) VALUES ("%s", "%s", "%s", "%s", "%s", "%s") """ % (plate, machineId, entryDatetime, expiryDatetime, ticketName, ticketPrice)
I know what problem you are having!
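MySQL strips trailing spaces from CHAR values when they are retrieved, and with the default PAD SPACE collations it ignores trailing spaces when comparing CHAR or VARCHAR values with =, so 'a ' and 'a' are treated as equal.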
https://dev.mysql.com/doc/refman/5.0/en/char.html
"For VARCHAR columns, trailing spaces in excess of the column length are truncated prior to insertion and a warning is generated, regardless of the SQL mode in use. For CHAR columns, truncation of excess trailing spaces from inserted values is performed silently regardless of the SQL mode."
You can try using a BLOB, which will not ignore whitespace.
If you want to continue to use CHAR or VARCHAR fields, you can use LIKE 'String ' to match the trailing whitespace; WHERE col = 'String ' will not distinguish it from 'String'.
Filtering machine with space: where machine like '% %';
Insert: What is the problem? Just put the name with the space between the single quotes.
I notice you are using double quotes instead of single quote, the usual string delimiter in MySQL.

SQL: search/replace but only the first time a value appears in record

I have html content in the post_content column.
I want to search and replace A with B but only the first time A appears in the record as it may appear more than once.
The below query would obviously replace all instances of A with B
UPDATE wp_posts SET post_content = REPLACE (post_content, 'A', 'B');
This should actually be what you want in MySQL:
UPDATE wp_posts
SET post_content = CONCAT(REPLACE(LEFT(post_content, INSTR(post_content, 'A')), 'A', 'B'), SUBSTRING(post_content, INSTR(post_content, 'A') + 1));
It's slightly more complicated than my earlier answer. You need to find the first instance of the 'A' (using the INSTR function), then use LEFT in combination with REPLACE to replace just that instance, then use SUBSTRING and INSTR to find that same 'A' you're replacing and CONCAT the remainder onto the previous string.
See my test below:
SET @string = 'this is A string with A replace and An Answer';
SELECT @string as actual_string
, CONCAT(REPLACE(LEFT(@string, INSTR(@string, 'A')), 'A', 'B'), SUBSTRING(@string, INSTR(@string, 'A') + 1)) as new_string;
Produces:
actual_string new_string
--------------------------------------------- ---------------------------------------------
this is A string with A replace and An Answer this is B string with A replace and An Answer
Alternatively, you could use the functions LOCATE(), INSERT() and CHAR_LENGTH() like this:
INSERT(originalvalue, LOCATE('A', originalvalue), CHAR_LENGTH('A'), 'B')
Full query:
UPDATE wp_posts
SET post_content = INSERT(post_content, LOCATE('A', post_content), CHAR_LENGTH('A'), 'B');
With reference to https://dba.stackexchange.com/a/43919/200937 here is another solution:
UPDATE wp_posts
SET post_content = CONCAT( LEFT(post_content , INSTR(post_content , 'A') -1),
'B',
SUBSTRING(post_content, INSTR(post_content , 'A') +1))
WHERE INSTR(post_content , 'A') > 0;
If you are replacing a different string, e.g. testing, change the +1 above to the length of that string. LENGTH() can be used for this (it counts bytes, so CHAR_LENGTH() is the safer choice for multi-byte text). Leave the -1 untouched.
Example: Replace "testing" with "whatever":
UPDATE wp_posts
SET post_content = CONCAT( LEFT(post_content , INSTR(post_content , 'testing') -1),
'whatever',
SUBSTRING(post_content, INSTR(post_content , 'testing') + LENGTH("testing")))
WHERE INSTR(post_content , 'testing') > 0;
By the way, it is helpful to see how many rows will be affected:
SELECT COUNT(*)
FROM wp_posts
WHERE INSTR(post_content, 'A') > 0;
If you are using an Oracle DB, you should be able to write something like :
UPDATE wp_posts SET post_content = regexp_replace(post_content,'A','B',1,1)
See here for more information: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions130.htm
Note: you really should be careful with post_content from a security standpoint, since it appears to be user input.
Greg Reda's solution did not work for me on strings longer than 1 character because of how the REPLACE() was written (only replacing the first character of the string to be replaced). Here is a solution that I believe is more complete and covers every use case of the problem when defined as How do I replace the first occurrence of "String A" with "String B" in "String C"?
CONCAT(LEFT(buycraft, INSTR(buycraft, 'blah') - 1), '', SUBSTRING(buycraft FROM INSTR(buycraft, 'blah') + CHAR_LENGTH('blah')))
This assumes that you are sure that the entry ALREADY CONTAINS THE STRING TO BE REPLACED! If you try replacing 'dog' with 'cat' in the string 'pupper', it will give you 'catpper', which is not what you want. Here is a query that handles that by first checking to see if the string to be replaced exists in the full string:
IF(INSTR(buycraft, 'blah') <> 0, CONCAT(LEFT(buycraft, INSTR(buycraft, 'blah') - 1), '', SUBSTRING(buycraft FROM INSTR(buycraft, 'blah') + CHAR_LENGTH('blah'))), buycraft)
The specific use case here is replacing the first instance of 'blah' inside column 'buycraft' with an empty string ''. I think a pretty intuitive and natural solution:
Find the index of the first occurrence of the string that is to be replaced.
Get everything to the left of that, not including the index itself (thus '-1').
Concatenate that with whatever you are replacing the original string with.
Calculate the ending index of the part of the string that is being replaced. This is easily done by finding the index of the first occurrence again, and adding the length of the replaced string. This will give you the index of the first char after the original string
Concatenate the substring starting at the ending index of the string
An example walkthrough of replacing "pupper" in "lil_puppers_yay" with 'dog':
Index of 'pupper' is 5.
Get left of 5-1 = 4. So indexes 1-4, which is 'lil_'
Concatenate 'dog' for 'lil_dog'
Calculate the ending index. Start index is 5, and 5 + length of 'pupper' = 11. Note that index 11 refers to 's'.
Concatenate the substring starting at the ending index, which is 's_yay', to get 'lil_dogs_yay'.
All done!
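The walkthrough above can be mirrored step by step in Python to check the index arithmetic. This is a sketch rather than a replacement for the SQL; replace_first is a hypothetical name, and 1-based positions are emulated on purpose so the numbers line up with INSTR:

```python
def replace_first(text: str, old: str, new: str) -> str:
    """Replace only the first occurrence of old, using 1-based positions like SQL."""
    index = text.find(old) + 1          # 1-based like INSTR; 0 means "not found"
    if index == 0:
        return text                     # same guard as the IF(INSTR(...) <> 0, ...) form
    left = text[: index - 1]            # everything before the match
    end_index = index + len(old)        # 1-based position of the first char after the match
    right = text[end_index - 1 :]       # the rest of the string
    return left + new + right

print(replace_first("lil_puppers_yay", "pupper", "dog"))  # lil_dogs_yay
print(replace_first("pupper", "dog", "cat"))              # pupper
```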
Note: SQL has 1-indexed strings (as an SQL beginner, I didn't know this before I figured this problem out). Also, SQL LEFT and SUBSTRING seem to work with invalid indexes the ideal way (adjusting it to either the beginning or end of the string), which is super convenient for a beginner SQLer like me :P
Another Note: I'm a total beginner at SQL and this is pretty much the hardest query I've ever written, so there may be some inefficiencies. It gets the job done accurately though.
I made the following little function and got it:
CREATE DEFINER=`virtueyes_adm1`@`%` FUNCTION `replace_first`(
`p_text` TEXT,
`p_old_text` TEXT,
`p_new_text` TEXT
)
RETURNS text CHARSET latin1
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
COMMENT 'troca a primeira ocorrencia apenas no texto'
BEGIN
SET @str = p_text;
SET @STR2 = p_old_text;
SET @STR3 = p_new_text;
SET @retorno = '';
SELECT CONCAT(SUBSTRING(@STR, 1 , (INSTR(@STR, @STR2)-1 ))
,@str3
,SUBSTRING(@STR, (INSTR(@str, @str2)-1 )+LENGTH(@str2)+1 , LENGTH(@STR)))
INTO @retorno;
RETURN @retorno;
END
Years have passed since this question was asked, and MySQL 8 has introduced REGEXP_REPLACE:
REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])
Replaces occurrences in the string expr that match the regular
expression specified by the pattern pat with the replacement string
repl, and returns the resulting string. If expr, pat, or repl is NULL,
the return value is NULL.
REGEXP_REPLACE() takes these optional arguments:
pos: The position in expr at which to start the search. If omitted, the default is 1.
occurrence: Which occurrence of a match to replace. If omitted, the default is 0 (which means “replace all occurrences”).
match_type: A string that specifies how to perform matching. The meaning is as described for REGEXP_LIKE().
So, assuming you can use regular expressions in your case:
UPDATE wp_posts SET post_content = REGEXP_REPLACE (post_content, 'A', 'B', 1, 1);
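For readers who want to try the "first occurrence only" behaviour without a MySQL 8 server, Python's re.sub offers a rough analogue through its count argument. This is only a sketch: Python's regex flavor is not the one MySQL uses, and count limits how many matches are replaced rather than selecting the Nth one, so the two only coincide for the first occurrence:

```python
import re

text = "this is A string with A replace and An Answer"

# count=1 stops after the first match, which corresponds to pos = 1, occurrence = 1.
result = re.sub("A", "B", text, count=1)
print(result)  # this is B string with A replace and An Answer
```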
Unfortunately for those of us on MariaDB, its REGEXP_REPLACE flavor is missing the occurrence parameter. Here's a regex-aware version of Andriy M's solution, conveniently stored as a reusable function as suggested by Luciano Seibel:
DELIMITER //
DROP FUNCTION IF EXISTS replace_first //
CREATE FUNCTION `replace_first`(
`i` TEXT,
`s` TEXT,
`r` TEXT
)
RETURNS text CHARSET utf8mb4
BEGIN
SELECT REGEXP_INSTR(i, s) INTO @pos;
IF @pos = 0 THEN RETURN i; END IF;
RETURN INSERT(i, @pos, CHAR_LENGTH(REGEXP_SUBSTR(i, s)), r);
END;
//
DELIMITER ;
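It's simpler, provided the string to replace sits at the very start of the value (anything before the match is discarded):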
UPDATE table_name SET column_name = CONCAT('A',SUBSTRING(column_name, INSTR(column_name, 'B') + LENGTH('A')));
For MYSQL version pre-5.6 and 8.0, I've used this pattern to fix my issue, it's a bit gross, but I hope it helps some of you guys:
SET @string = 'I love shop it is a terrific shop, I love eveything about it';
SET @shop_code = 'shop';
SET @shop_date = CONCAT(@shop_code, '__', DATE_FORMAT(NOW(), '%Y_%m_%d__%Hh%im%ss'));
SET @part1 = SUBSTRING_INDEX(@string, @shop_code, 1);
SET @shop_nb = ROUND( (LENGTH(@string) - LENGTH(REPLACE(@string, @shop_code,''))) / LENGTH(@shop_code) );
SET @part2 = SUBSTRING_INDEX(@string, @shop_code, -@shop_nb);
SET @string = CONCAT(@part1, @shop_date, @part2);
SELECT @string;
To keep the sample of gjreda a bit more simple use this:
UPDATE wp_posts
SET post_content =
CONCAT(
REPLACE(LEFT(post_content, 1), 'A', 'B'),
SUBSTRING(post_content, 2)
)
WHERE post_content LIKE 'A%';