I have a column with both English and Chinese text.
Example: The hills have eyes. 隔山有眼
Expected results: The hills have eyes.
How can I extract the English text from that string using SQL?
Thanks for help.
A quick-and-dirty way simply converts the string to ASCII and removes the '?' -- which is the representation of the other characters:
select replace(convert(t.str using ascii), '?', '')
from t;
The only downside is that you lose '?' characters in the original string as well.
Here is a db<>fiddle.
For more control over the replacement, you can use regexp_replace():
select regexp_replace(t.str, '[^a-zA-Z0-9.?, ]', '')
from t;
Unfortunately, I am not aware of a character class for ASCII-only characters.
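To make the difference between the two approaches concrete, here is a minimal sketch in Python rather than SQL. The function names are made up for illustration, and the `\x00-\x7F` range is just one way of spelling "ASCII only"; whether your MySQL version accepts an equivalent range in regexp_replace() is something you would need to check.

```python
import re

def strip_via_ascii(text: str) -> str:
    # Mirrors CONVERT(... USING ascii) followed by REPLACE: every non-ASCII
    # character becomes '?', then all '?' are removed, including genuine ones.
    return text.encode("ascii", errors="replace").decode("ascii").replace("?", "")

def strip_non_ascii(text: str) -> str:
    # Removes only the characters outside the 7-bit ASCII range.
    return re.sub(r"[^\x00-\x7F]", "", text)

sample = "Who has eyes? 隔山有眼"
print(strip_via_ascii(sample))  # the original '?' is lost as well
print(strip_non_ascii(sample))  # the original '?' survives
```

Note that both variants leave the space that preceded the Chinese text, so a trim may still be wanted afterwards.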
One option is to use a function that returns only the English text.
Additionally, you could make it dual-purpose by adding another parameter that selects English or non-English text by flipping the <127 comparison.
CREATE FUNCTION `EnglishOnly`(String VARCHAR(100))
RETURNS varchar(100)
NO SQL
BEGIN
DECLARE output VARCHAR(100) DEFAULT '';
DECLARE i INTEGER DEFAULT 1;
DECLARE ch varchar(1);
-- CHAR_LENGTH counts characters; LENGTH counts bytes and overshoots on multi-byte text
IF CHAR_LENGTH(string) > 0 THEN
WHILE(i <= CHAR_LENGTH(string)) DO
SET ch=SUBSTRING(string, i, 1);
-- with utf8 text, ASCII() yields a value above 127 for a multi-byte character
IF ASCII(ch)<127 then
set output = CONCAT(output,ch);
END IF;
SET i = i + 1;
END WHILE;
END IF;
RETURN output;
END;
You can then simply use it like so:
select EnglishOnly ("The hills have eyes 隔山有眼that see all.")
Output
The hills have eyes that see all.
Example Fiddle
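The dual-purpose idea mentioned above could look roughly like this. It is a sketch in Python, not the SQL function itself; `filter_by_ascii` and `keep_english` are hypothetical names, and "English" is taken to mean the same thing as in the function: a character code below 127.

```python
def filter_by_ascii(text: str, keep_english: bool = True) -> str:
    # Same test as the SQL function: a code point below 127 counts as "English".
    # Passing keep_english=False flips the comparison and keeps the rest instead.
    return "".join(ch for ch in text if (ord(ch) < 127) == keep_english)

sample = "The hills have eyes 隔山有眼that see all."
print(filter_by_ascii(sample))
print(filter_by_ascii(sample, keep_english=False))
```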
My Webapp (PHP/jQuery/MySQL) has features which enable me to send out nicely formatted html email notifications to my customers based on certain events. The code works nicely and merges data from my Database into form fields although I need to enhance it to be able to provide enriched/localised/reformatted data in some circumstances.
For example:
- Provide date/time values in a user's own timezone
- Provide monetary values formatted to a user's locale
This requires me to do another pass of the email content to detect whether any fields remain unmerged before sending the email off to the user and if so, to format those field values appropriately before sending the email. Therefore what I want to do is extract a list of all delimited fieldnames from a table field value and return that list in comma delimited form.
I can already count how many times a delimiter appears.
I can also find the position of the first delimiter.
It looks like it would be easy to split the values if I were using the same opening and closing delimiters, but because I have many email templates already in use, this isn't currently viable.
I don't have any code for this yet. I'm just trying to avoid writing my own MySQL function to do this, by using existing MySQL functions if they are capable of doing this.
I've tried using various combinations of SUBSTRING, SUBSTRING_INDEX, LOCATE.
So what I need to be able to do is something like this:
SELECT msg_id, values_found_between(msg_content,"<",">") AS comma_delimited_list;
So for example, with source data of...
msg_id | msg_content
-------+------------
1 | The quick brown <fox> jumps over the lazy <dog>
2 | The quick brown fox jumps over the lazy dog
I would get a resulting recordset such as this:
msg_id | comma_delimited_list
-------+------------
1 | fox,dog
Alright, I had a crack and this seems to work well:
CREATE FUNCTION db.`FN_find_values_between`(`in_haystack` VARCHAR(10000), `in_opening_delimiter` VARCHAR(1),`in_closing_delimiter` VARCHAR(1)) RETURNS varchar(1000) CHARSET utf8
BEGIN
DECLARE numFoundOpen INT DEFAULT 0;
DECLARE numFoundClose INT DEFAULT 0;
DECLARE numFoundTarget INT DEFAULT 0;
DECLARE numCurrentIndex INT DEFAULT 0;
DECLARE strOutput VARCHAR(1000) DEFAULT "";
DECLARE numSearchFromPos INT DEFAULT 1;
DECLARE numCurrentCharPosStart INT DEFAULT 1;
DECLARE numCurrentCharPosEnd INT DEFAULT 1;
DECLARE strCurrentFieldname VARCHAR(50) DEFAULT "";
DECLARE numLength INT DEFAULT 0;
SET numFoundOpen=
(SELECT
ROUND ((LENGTH(in_haystack)- LENGTH( REPLACE (in_haystack, in_opening_delimiter, ""))) / LENGTH(in_opening_delimiter)));
SET numFoundClose=
(SELECT
ROUND ((LENGTH(in_haystack)- LENGTH( REPLACE (in_haystack, in_closing_delimiter, ""))) / LENGTH(in_closing_delimiter)));
IF (numFoundOpen=numFoundClose) THEN
SET numFoundTarget=numFoundOpen;
END IF;
WHILE numCurrentIndex < numFoundTarget DO
SET numCurrentIndex=numCurrentIndex+1;
SET numCurrentCharPosStart = LOCATE(in_opening_delimiter, in_haystack, numSearchFromPos);
-- look for the closing delimiter only after the opening one
SET numCurrentCharPosEnd = LOCATE(in_closing_delimiter, in_haystack, numCurrentCharPosStart + 1);
-- length of the text strictly between the two delimiters, so "<fox>" yields "fox"
SET numLength=numCurrentCharPosEnd-numCurrentCharPosStart-1;
SET strCurrentFieldname=SUBSTRING(in_haystack,numCurrentCharPosStart+1,numLength);
SET strOutput=CONCAT(strOutput,strCurrentFieldname,",");
SET strCurrentFieldname="";
SET numSearchFromPos=numCurrentCharPosEnd+1;
END WHILE;
IF (strOutput <> "") THEN
SET strOutput=LEFT(strOutput,LENGTH(strOutput)-1);
END IF;
RETURN strOutput;
END;
As per the code above, I managed to write my own MySQL function to do this.
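For comparison, if the second pass over the email content can happen in application code instead of in MySQL, the same extraction is short with a regular expression. This is only a sketch in Python (the web app in question is PHP), and `values_found_between` is the hypothetical name from the question, not an existing function.

```python
import re

def values_found_between(haystack: str, opening: str, closing: str) -> str:
    # Escape the delimiters so characters such as '[' or '(' are taken literally.
    o, c = re.escape(opening), re.escape(closing)
    # Capture the shortest run of text between an opening and a closing delimiter.
    fields = re.findall(f"{o}(.*?){c}", haystack)
    return ",".join(fields)

print(values_found_between("The quick brown <fox> jumps over the lazy <dog>", "<", ">"))
```

A row without any delimiters gives an empty string, so such rows can be filtered out afterwards.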
The intended result is to store the notes of edits to a field, in another field.
I want the new notes to APPEND to the storage field, and since there is no function that does this, I am attempting to find a way to work this out without adding more layers of code like functions and stored procedures.
/* Before Update Trigger */
DECLARE v_description VARCHAR(255);
DECLARE v_permnotes MEDIUMTEXT;
DECLARE v_oldnote VARCHAR(500);
DECLARE v_now VARCHAR(25);
SET v_now = TRIM(DATE_FORMAT(NOW(), '%Y-%m-%d %k:%i:%s'));
SET v_oldnote = OLD.notes;
IF (NEW.permanent_notes IS NULL) THEN
SET v_permnotes = '';
ELSE
SET v_permnotes = OLD.permanent_notes;
END IF;
SET NEW.permanent_notes = CONCAT_WS(CHAR(10), v_permnotes, v_now,": ", v_description);
I'm aiming to have the results in the permanent field look like this
<datetime value>: Some annotation from the notes field.
<a different datetime>: A new annotation
etc....
What I get from my current trigger:
2018-12-30 17:15:50
:
Test 17: Start from scratch.
2018-12-30 17:35:51
:
Test 18: Used DATE_FORMAT to sxet the time
2018-12-30 17:45:52
:
Test 19. Still doing a carriage return after date and after ':'
I can't figure out why there is a newline after the date, and then again after the ':'.
If I leave out CHAR(10), I get:
Test 17: Start from scratch.
2018-12-30 17:35:51
:
Test 18: Used DATE_FORMAT to sxet the time
2018-12-30 17:45:52
:
Test 19. Still doing a carriage return after date and after ':'Test 20. Still doing a carriage return after date and after ':'
Some fresh/more experienced eyes would be really helpful in debugging this.
Thanks.
I think you should just be using plain CONCAT here:
-- named v_separator because SEPARATOR itself is a reserved word in MySQL
DECLARE v_separator VARCHAR(1);
IF (NEW.permanent_notes IS NULL) THEN
SET v_separator = '';
ELSE
SET v_separator = CHAR(10);
END IF;
-- the rest of your code as is
SET
NEW.permanent_notes = CONCAT(v_permnotes, v_separator, v_now, ": ", v_description);
The logic here is that we conditionally print a newline (CHAR(10)) before each new log line, so long as that line is not the very first. You don't really want CONCAT_WS here: it inserts the separator between every pair of arguments, which is why a newline shows up after the date and again after the ':'. You only want a single newline in between each logging statement.
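The difference is easy to see outside the database. Below is a small Python imitation of the two MySQL functions; `concat_ws` and `concat` here are stand-ins written for this illustration, with None playing the role of NULL.

```python
def concat_ws(separator, *parts):
    # Like MySQL's CONCAT_WS: NULL (None) arguments are skipped and the
    # separator goes between every remaining pair of arguments.
    return separator.join(p for p in parts if p is not None)

def concat(*parts):
    # Like MySQL's CONCAT: plain joining, and NULL if any argument is NULL.
    if any(p is None for p in parts):
        return None
    return "".join(parts)

old_notes = "2018-12-30 17:15:50: Test 17"
now, note = "2018-12-30 17:35:51", "Test 18"

print(repr(concat_ws("\n", old_notes, now, ": ", note)))  # newline after the date and after ': '
print(repr(concat(old_notes, "\n", now, ": ", note)))     # one newline, between the two log lines
```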
I'm getting a JSON file, which I load into an Azure SQL database. This JSON is the direct output of an API, so there is nothing I can do with it before loading it into the DB.
In that file, all Polish diacritics are escaped in the "C/C++/Java source code" form (based on: http://www.fileformat.info/info/unicode/char/0142/index.htm).
So for example:
ł is \u0142
I was trying to find some method to convert (unescape) those to proper Polish letters.
In the worst case, I can write a function which will replace all combinations:
REPLACE(REPLACE(string, '\u0142', N'ł'), '\u0144', N'ń')
And so on, making one big, terrible function...
I was looking for a ready-made function like the ones for URL decoding, which have been answered here on Stack in many topics, and here: https://www.codeproject.com/Articles/1005508/URL-Decode-in-T-SQL
Using this solution would be possible, but I cannot figure out the cast/convert with the proper collation and types to get the result I'm looking for.
So if anyone knows of or has a function that unescapes those \u sequences in a string, that would be great, but I can manage to write something on my own if I get the conversion right. For example I tried:
select convert(nvarchar(1), convert(varbinary, 0x0142, 1))
I assumed that changing \u to 0x would be the answer, but it gives some Chinese characters. So this is the wrong direction...
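A likely explanation for the Chinese character, assuming the two bytes are being read as a little-endian UTF-16 code unit (which is how nvarchar data is stored), can be reproduced in a few lines of Python. This is only an illustration of the byte-order effect, not of SQL Server itself.

```python
raw = bytes.fromhex("0142")            # the two bytes 0x01 0x42

little = raw.decode("utf-16-le")       # bytes swapped: code point U+4201, a CJK character
big = raw.decode("utf-16-be")          # bytes in written order: code point U+0142
from_number = chr(int("0142", 16))     # treating the hex digits as a number

print(hex(ord(little)), hex(ord(big)), from_number)
```

So converting the hex digits to a number first, rather than to raw bytes, gives the expected letter.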
Edit:
After googling more I found exactly the same question here on Stack from @Pasetchnik: Json escape unicode in SQL Server
And it looks like this is the best solution available in MS SQL.
The only thing I needed to change was using NVARCHAR instead of the VARCHAR used in the linked solution:
CREATE FUNCTION dbo.Json_Unicode_Decode(@escapedString nVARCHAR(MAX))
RETURNS nVARCHAR(MAX)
AS
BEGIN
DECLARE @pos INT = 0,
@char nvarCHAR,
@escapeLen TINYINT = 2,
@hexDigits TINYINT = 4
SET @pos = CHARINDEX('\u', @escapedString, @pos)
WHILE @pos > 0
BEGIN
SET @char = NCHAR(CONVERT(varbinary(8), '0x' + SUBSTRING(@escapedString, @pos + @escapeLen, @hexDigits), 1))
SET @escapedString = STUFF(@escapedString, @pos, @escapeLen + @hexDigits, @char)
SET @pos = CHARINDEX('\u', @escapedString, @pos)
END
RETURN @escapedString
END
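If the JSON can be touched by a script before or after loading, the same unescaping is a one-liner with a regular expression. The sketch below is Python, with `json_unicode_decode` as a hypothetical name echoing the T-SQL function; it only handles \u followed by exactly four hex digits and makes no attempt to pair up surrogates or to skip escaped backslashes.

```python
import re

def json_unicode_decode(escaped: str) -> str:
    # Replace each \u followed by exactly four hex digits with the
    # character that has that code point.
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        escaped,
    )

print(json_unicode_decode(r"\u0142 \u0144\u0142"))
```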
Instead of nested REPLACE you could use:
DECLARE @string NVARCHAR(MAX)= N'\u0142 \u0144\u0142';
SELECT @string = REPLACE(@string,u, ch)
FROM (VALUES ('\u0142',N'ł'),('\u0144', N'ń')) s(u, ch);
SELECT @string;
DBFiddle Demo
In my MySQL database I have a column of strings in UTF-8 format for which I want to extract the first character using a RegEx, for example.
Assuming a RegEx which ONLY extracts the following characters:
ਹਮਜਰਣਚਕਨਖਲਨ
And given the following string:
ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥
The only characters extracted would be:
ਹਰਚਨਲਨ
I know the following steps would be required to solve this problem:
- Break the string into individual words (substrings) by using space as the delimiter
- For each word extract the first letter (substring of a substring) if it matches what is in the regex of valid characters
I have looked at all the similar questions/answers on SO and none have been able to solve my problem thus far.
I really don't know MySQL regex syntax and restrictions (I've never used it), but you can add a leading space before the string and match with something simple like this: " ([ਮਜਰਣਚਕਨਖਲਨ]{1})"
So, if you concatenate the matched groups you will have this string: "ਰਚਨਲਨ" (only "ਹ" is not matched, because it is not in the character set of this sample).
In C# it may look like this (working sample):
namespace TestRegex
{
using System.Linq;
using System.Text.RegularExpressions;
using System.Windows.Forms;
class Program
{
static void Main(string[] args)
{
// leading space(to match first word too)
// + sample string
var sample = " ";
sample += "ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥";
// Regex pattern that will match a space, and
// if the next character is in the set, add it to "match group 1"
var pattern = " ([ਮਜਰਣਚਕਨਖਲਨ]{1})";
// select every "match group 1" from matches as array
var result = from Match m in Regex.Matches(sample, pattern)
select m.Groups[1];
// concatenate array content into one string and
// show it in message box to user, for example..
MessageBox.Show(string.Concat(result));
}
}
}
In most non-query languages it will look almost the same. For example, in PHP you would call preg_match_all and, in a foreach loop, append "$match[i][1]" (every "match group 1") from each match to the end of a single string.
Well... pretty simple, but not for MySQL...
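The two steps from the question (split on spaces, keep the first letter of each word if it is allowed) can also be written without a regex at all. Here is a sketch in Python using the full character set from the question; `first_letters` is a made-up name.

```python
def first_letters(text: str, allowed: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # and never returns empty words.
    words = text.split()
    # Keep the first code point of each word if it is in the allowed set.
    return "".join(w[0] for w in words if w[0] in allowed)

allowed = "ਹਮਜਰਣਚਕਨਖਲਨ"
sample = "ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥"
print(first_letters(sample, allowed))
```

This works on code points, so a vowel sign that follows the first consonant is simply left behind.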
I finally achieved this with the help of a programmer friend of mine. I directly pasted the following piece of code into the SQL section of my database in PhpMyAdmin:
delimiter $$
drop function if exists `initials`$$
CREATE FUNCTION `initials`(str text, expr text) RETURNS text CHARSET utf8
begin
declare result text default '';
declare buffer text default '';
declare i int default 1;
if(str is null) then
return null;
end if;
set buffer = trim(str);
while i <= length(buffer) do
if substr(buffer, i, 1) regexp expr then
set result = concat( result, substr( buffer, i, 1 ));
set i = i + 1;
while i <= length( buffer ) and substr(buffer, i, 1) regexp expr do
set i = i + 1;
end while;
while i <= length( buffer ) and substr(buffer, i, 1) not regexp expr do
set i = i + 1;
end while;
else
set i = i + 1;
end if;
end while;
return result;
end$$
drop function if exists `acronym`$$
CREATE FUNCTION `acronym`(str text) RETURNS text CHARSET utf8
begin
declare result text default '';
set result = initials( str, '[ੴਓੳਅੲਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼]' );
return result;
end$$
delimiter ;
UPDATE scriptures SET search = acronym(scripture)
Just to explain the last line:
scriptures is the table I want to update
search is a new empty column I created inside the table to store the result
scripture is an existing column inside the scriptures table with all the strings I want to extract from
acronym is the function previously declared which is looking to match the first letter of each word with a character from the RegEx [ੴਓੳਅੲਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼]
So this final line of the code will go through each row of the column scripture, apply the function acronym to it and store the result in the new search column.
Perfect! Exactly what I was looking for :)
Is there a way to split a long string across multiple lines so that the code is easier to read when viewed on screen or printed?
Perhaps I could be clearer.
I have a stored procedure with lines like
IF ((select post_code REGEXP '^([A-PR-UWYZ][A-HK-Y]{0,1}[0-9]{1,2} [0-9][ABD-HJLNP-UW-Z]{2})|([A-PR-UWYZ][0-9][A-HJKMPR-Y] [0-9][ABD-HJLNP-UW-Z]{2})|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y] [0-9][ABD-HJLNP-UW-Z]{2})$') = 0)
I would like to be able to modify the string so that I can view it within an 80-character width. Does anybody have any ideas on how to do this?
PS: It is the regular expression for UK postcodes.
For example,
-- a very long string in one block
set my_str = 'aaaabbbbcccc';
can be also written as
-- a very long string, as a concatenation of smaller parts
set my_str = 'aaaa' 'bbbb' 'cccc';
or even better
-- a very long string in a readable format
set my_str = 'aaaa'
'bbbb'
'cccc';
Note how the spaces and end of line between the a/b/c parts are not part of the string itself, because of the placement of quotes.
Also note that the string data here is concatenated by the parser, not at query execution time.
Writing something like:
-- this is broken
set my_str = 'aaaa
bbbb
cccc';
produces a different result: the line breaks inside the quotes become part of the string.
See also
http://dev.mysql.com/doc/refman/5.6/en/string-literals.html
Look for "Quoted strings placed next to each other are concatenated to a single string"
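Python happens to join adjacent string literals in the same way, so the idea can be tried on the postcode pattern outside MySQL. This is a sketch only: the pattern is regrouped so that one pair of anchors and one shared back part cover all three alternatives, which is my assumption about the intent rather than a copy of the original expression.

```python
import re

# Adjacent literals are joined into one string; the comments and line
# breaks between them are not part of the pattern.
POSTCODE = (
    r"^("
    r"[A-PR-UWYZ][A-HK-Y]?[0-9]{1,2}"          # e.g. "M1", "SW1"
    r"|[A-PR-UWYZ][0-9][A-HJKMPR-Y]"           # e.g. "W1A"
    r"|[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]"  # e.g. "SW1A"
    r") [0-9][ABD-HJLNP-UW-Z]{2}$"
)

def looks_like_postcode(candidate: str) -> bool:
    return re.match(POSTCODE, candidate) is not None

print(looks_like_postcode("SW1A 1AA"), looks_like_postcode("QQ1 1AA"))
```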
You could split it up into the front and back components of the postcode and then dump the whole lot into a UDF.
This will keep the ugliness in one place and means you'll only have to make changes to one block of code when/if Royal Mail decide to change the format of UK postcodes ;-)
Something like this should do the trick:
DELIMITER $$
CREATE FUNCTION `isValidUKPostcode`(candidate varchar(255)) RETURNS BOOLEAN READS SQL DATA
BEGIN
declare back varchar(3);
declare front varchar(10);
declare v_out boolean;
set back = substr(candidate,-3);
set front = substr(candidate,1,length(candidate)-3);
set v_out = false;
IF (back REGEXP '^[0-9][ABD-HJLNP-UW-Z]{2}$'= 1) THEN
CASE
WHEN front REGEXP '^[A-PR-UWYZ][A-HK-Y]{0,1}[0-9]{1,2} $' = 1 THEN set v_out = true;
WHEN front REGEXP '^[A-PR-UWYZ][0-9][A-HJKMPR-Y] $' = 1 THEN set v_out = true;
WHEN front REGEXP '^[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y] $' = 1 THEN set v_out = true;
-- without an ELSE branch, MySQL raises a "Case not found" error when no WHEN matches
ELSE set v_out = false;
END CASE;
END IF;
return v_out;
END$$
DELIMITER ;
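The front/back split is easy to prototype before committing it to a UDF. Below is a rough Python equivalent, under the same assumption as the function: the last three characters are the back part and the front part keeps its trailing space. The names are hypothetical.

```python
import re

BACK = re.compile(r"[0-9][ABD-HJLNP-UW-Z]{2}")
FRONTS = [
    re.compile(r"[A-PR-UWYZ][A-HK-Y]?[0-9]{1,2} "),
    re.compile(r"[A-PR-UWYZ][0-9][A-HJKMPR-Y] "),
    re.compile(r"[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y] "),
]

def is_valid_uk_postcode(candidate: str) -> bool:
    # The last three characters are the back part; everything before them,
    # including the separating space, is the front part.
    front, back = candidate[:-3], candidate[-3:]
    if not BACK.fullmatch(back):
        return False
    # Valid if the front part matches any one of the three layouts.
    return any(p.fullmatch(front) for p in FRONTS)

print(is_valid_uk_postcode("SW1A 1AA"), is_valid_uk_postcode("M1 1CC"))
```

Keeping the three front patterns in a list means a format change touches one line, which is the same benefit the UDF is after.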