How to crop text between braces - mysql
I have data in MySQL in a single string column with the structure below:
{language *lang_code*}text{language}{language *lang_code*}text{language}
Here is an example:
{language en}text in english{language}{language de}text in german{language}
The ideal output in this case would be:
text in english
We want to disregard the other languages and extract only the first one into a new column. The field is often the product title with its translations, and for us the first one is the most important.
The value in the first braces may differ: here the first language is English, but in another row it might be German, so the language code is dynamic.
Is it possible to extract the text between the first pair of brace tags with an SQL query?
This is really horrible but it works for your simple example -
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language en\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
or
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language de\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
to retrieve the german text.
To retrieve the first text in the string regardless of language you can use -
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language [a-z]{2}\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
Note that this version assumes the language code is always exactly two lowercase letters, i.e. [a-z]{2}.
Here is an example of the above wrapped in a stored function -
DELIMITER $$
CREATE FUNCTION `ExtractLangString`(content TEXT, lang CHAR(8))
RETURNS TEXT
DETERMINISTIC
BEGIN
    -- if lang is not 2 chars in length, or lang is not found, return the first language string
    IF LENGTH(lang) <> 2 OR content NOT LIKE CONCAT('%{language ', lang, '}%') THEN
        SET lang = '[a-z]{2}';
    END IF;
    RETURN SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR(content, CONCAT('\\{language ', lang, '\\}(.*?)\\{language\\}')), '}', -2), '{', 1);
END$$
DELIMITER ;
There is probably a cleaner way of doing it but I cannot think of it right now.
Obviously, the better solution would be to normalise the data that is currently serialised into this column.
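The extraction logic above can also be checked outside the database. The following is a minimal Python sketch, not part of the original answer: the helper name extract_lang_string is hypothetical, and it assumes, like the SQL version, that language codes are exactly two lowercase letters. It mirrors the stored function's fallback of returning the first block when the requested language is missing.

```python
import re

# Same pattern idea as the SQL above: an opening {language xx} tag,
# a lazy match for the text, then the closing {language} tag.
BLOCK = re.compile(r"\{language ([a-z]{2})\}(.*?)\{language\}", re.DOTALL)


def extract_lang_string(content, lang=None):
    """Return the text for `lang`, falling back to the first block.

    Hypothetical helper for illustration; returns None if no block is found.
    """
    blocks = BLOCK.findall(content)  # list of (code, text) tuples
    if not blocks:
        return None
    for code, text in blocks:
        if code == lang:
            return text
    return blocks[0][1]


sample = "{language en}text in english{language}{language de}text in german{language}"
print(extract_lang_string(sample))        # first block, whatever its language
print(extract_lang_string(sample, "de"))  # a specific language
print(extract_lang_string(sample, "fr"))  # missing language falls back to the first block
```

Capturing the text in a group avoids the nested SUBSTRING_INDEX calls, which are only needed in the SQL because REGEXP_SUBSTR returns the whole match rather than a capture group.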
Related
Extracting a substring after finding a different substring
I've been playing around with SUBSTRING, LEFT, RIGHT and CHARINDEX and can't quite get this to work. If this is the value in the column named 'Data' (this is all one line):
{"email":{"RecipientId":"usertest","RecipientEmail":"test@test.com","Subject":"This is a test subject heading","RecipientSubject":"A recipient subject"}}
How do I write a SELECT statement that finds the 'Subject' heading and then gets the data 'This is a test subject'? The Subject value is different for every record, so I can't just look for 'This is a test subject'. The end result of the SELECT should be: This is a test subject
The following query should do what you want:
declare @string varchar(max);
set @string = '{"email":{"RecipientId":"usertest","RecipientEmail":"test@test.com","Subject":"This is a test subject heading","RecipientSubject":"A recipient subject"}}';
select substring(@string, charindex('"Subject":', @string) + 11, charindex('"RecipientSubject"', @string) - charindex('"Subject"', @string) - 13);
The plain and easy-cheesy approach is this:
SELECT SUBSTRING( t.YourString
                 ,A.StartPosition
                 ,CHARINDEX('"', t.YourString, A.StartPosition+1) - A.StartPosition )
FROM @dummyTable t
CROSS APPLY(SELECT CHARINDEX('"Subject":"', t.YourString)+11) A(StartPosition);
I use APPLY to calculate a value and use it like you'd use a variable. The idea is: find the starting point and look for the closing quote from there. But this will break whenever the content includes an (escaped) quote, as in "Subject":"This is \"quoted\" internally".
A more generic approach
JSON support was introduced with v2016. With this (or a higher) version this is really simple. Use this mockup table for testing:
DECLARE @dummyTable TABLE (YourString VARCHAR(1000));
INSERT INTO @dummyTable VALUES('{"email":{"RecipientId":"usertest","RecipientEmail":"test@test.com","Subject":"This is a test subject heading","RecipientSubject":"A recipient subject"}}');
--The OPENJSON method will read this for you:
SELECT JsonContent.*
FROM @dummyTable t
CROSS APPLY OPENJSON(t.YourString,'$.email')
WITH(RecipientId VARCHAR(100)
    ,RecipientEmail VARCHAR(100)
    ,[Subject] VARCHAR(100)
    ,RecipientSubject VARCHAR(100)) JsonContent;
But with a lower version you will need to trick this out. The easiest way is to transform the JSON to attribute-centered XML like this:
<email RecipientId="usertest" RecipientEmail="test@test.com" Subject="This is a test subject heading" RecipientSubject="A recipient subject" />
We can achieve this with some string methods, and I must warn you that there are several pitfalls with forbidden characters and other stuff...
Just try it out:
SELECT Casted.ToXml.value('(/email/@RecipientId)[1]','varchar(1000)') AS RecipientId
      ,Casted.ToXml.value('(/email/@RecipientEmail)[1]','varchar(1000)') AS RecipientEmail
      ,Casted.ToXml.value('(/email/@Subject)[1]','varchar(1000)') AS [Subject]
      ,Casted.ToXml.value('(/email/@RecipientSubject)[1]','varchar(1000)') AS RecipientSubject
      ,Casted.ToXml.query('.') LookHowThisWasTransformed
FROM @dummyTable t
CROSS APPLY
(
    SELECT CAST(CONCAT('<email '
                      ,REPLACE(REPLACE(REPLACE(REPLACE(t.YourString,'{"email":{"',''),'}}',''),'","','" '),'":"',' ="')
                      ,' />') AS XML)
) Casted(ToXml);
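To illustrate why a real JSON parser is more robust than position arithmetic, here is a small Python sketch using the standard json module. It is not part of the answers above; the helper name subject_of is hypothetical, and the sample record is a shortened version of the one in the question. The second input contains the escaped quotes that break the CHARINDEX approach.

```python
import json


def subject_of(data):
    """Parse the JSON text and return the nested Subject value (hypothetical helper)."""
    return json.loads(data)["email"]["Subject"]


# Shortened version of the record from the question.
record = ('{"email":{"RecipientId":"usertest",'
          '"Subject":"This is a test subject heading",'
          '"RecipientSubject":"A recipient subject"}}')

# A value with escaped quotes, which defeats "search for the next quote" logic.
tricky = '{"email":{"Subject":"This is \\"quoted\\" internally"}}'

print(subject_of(record))
print(subject_of(tricky))
```

The parser handles escaping and key order itself, which is the same benefit OPENJSON provides on the SQL Server side.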
Using the trim function to narrow down results set
I need to gather data from two columns and concatenate them so that the result is only the first six characters of the first column and the last six of the second column, separated by ' + '. Some values have been entered with stray spaces at the front or back, so we must also use the trim function, and get rid of all NULLs. I haven't had any issues with the first part, but am struggling to use trim in a way that gives the desired output.
Output needs to look like this:
Input data sample:
The following code returns results, but the output doesn't match, so I know the trim is wrong:
SELECT CONCAT(SUBSTRING(baseball, 1, 6), ' + ', SUBSTRING(football, -6)) AS MYSTRING FROM datenumtest2 WHERE baseball IS NOT NULL AND football IS NOT NULL;
I also tried the following, but got an error message about the parameters being incorrect:
SELECT CONCAT(SUBSTRING(LTRIM(baseball, 1, 6)), ' + ', SUBSTRING(RTRIM(football, -6))) AS MYSTRING FROM datenumtest2 WHERE baseball IS NOT NULL AND football IS NOT NULL;
I'm still new to this site and learning, but I have tried to include as much as I can! If there is other information that I can add to help, please let me know.
You just need to apply TRIM() to the column(s) before using the SUBSTRING() function on them:
SELECT CONCAT(SUBSTRING(TRIM(baseball), 1, 6), ' + ', SUBSTRING(TRIM(football), -6)) AS MYSTRING FROM datenumtest2 WHERE baseball IS NOT NULL AND football IS NOT NULL;
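The order of operations (trim first, then slice) can be sketched in Python as follows. This is only an illustration: the function name and the sample values are hypothetical, since the question's input data was not preserved, and edge cases such as values shorter than six characters may behave differently in SQL.

```python
def first_six_plus_last_six(baseball, football):
    """Trim both values, then join the first six and last six characters.

    Hypothetical helper; None stands in for SQL NULL and is skipped,
    mirroring the WHERE ... IS NOT NULL filter.
    """
    if baseball is None or football is None:
        return None
    b = baseball.strip(" ")   # like TRIM(), remove only leading/trailing spaces
    f = football.strip(" ")
    return b[:6] + " + " + f[-6:]


# Hypothetical sample values with stray spaces around them.
print(first_six_plus_last_six("  abcdefghij ", " 0123456789  "))
```

Slicing before trimming would count the stray spaces as part of the six characters, which is the mismatch the question describes.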
Strip special characters and space of a DB column to compare in rails
I have 4 types of last_name:
"Camp Bell"
"CAMPBELL"
"CampBellJr."
"camp bell jr."
Now, in Rails, when a user is searched for by last name, for example camp bell, I want to show all 4 records. So I tried:
stripped_name = params[:last_name].gsub(/\W/, '') #=> "campbell"
User.where("LOWER(REPLACE(last_name, '/\W/', '')) LIKE ?", "#{stripped_name}%")
This gives me only 2 records, with the following last_name values:
"CAMPBELL"
"CampBellJr."
I guess this is because the MySQL REPLACE does not work with regex. Any ideas?
EDIT: Sorry for the confusion. My idea is to strip out all special characters, including spaces, so I'm trying to use the \W regex. For example, the input can be camp~bell... and it should still fetch the results.
You can check both for the stripped name without spaces and for names that contain both parts separated by a space, like this:
stripped_name = params[:last_name].gsub(/\W/, '')
split_names = params[:last_name].split(" ")
User.where('last_name LIKE ? OR (last_name LIKE ? AND last_name LIKE ?)', "%#{stripped_name}%", "%#{split_names[0]}%", "%#{split_names[1]}%")
The next step would be to search for the complete array of split names, not just the first two.
Here is my solution:
User.where("REPLACE(last_name, ' ', '') ILIKE CONCAT('%', REPLACE(?, ' ', ''), '%')", stripped_name)
ILIKE is like LIKE, but the I stands for case-insensitive. Note that ILIKE is a PostgreSQL operator; MySQL does not have it, and there LIKE is already case-insensitive under the default collations. The ? placeholder must not be wrapped in quotes, because Rails quotes the bound value itself.
Step by step: last_name ILIKE '%campbell%' needs the % on both sides because you want last_name to contain this string, not necessarily at the beginning or the end.
'campbell%' => matches strings that begin with campbell
'%campbell' => matches strings that end with campbell
We need to generate '%campbell%', so we use CONCAT for that. I just use a simple REPLACE, but maybe you should use a regex.
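The matching rule the question asks for (ignore case, spaces and special characters on both sides of the comparison) can be sketched in Python like this. It is an illustration of the comparison logic only, not a database query; the helper names normalize and search are hypothetical, and prefix matching is assumed because the question's LIKE pattern ends with %.

```python
import re


def normalize(name):
    """Drop every non-word character and lowercase the rest.

    Note that \\W keeps underscores, since "_" counts as a word character.
    """
    return re.sub(r"\W", "", name).lower()


def search(term, names):
    """Return the names whose normalized form starts with the normalized term."""
    key = normalize(term)
    return [n for n in names if normalize(n).startswith(key)]


last_names = ["Camp Bell", "CAMPBELL", "CampBellJr.", "camp bell jr."]

print(search("camp bell", last_names))
print(search("camp~bell...", last_names))
```

Both searches return all four names. In a database this usually means normalizing the stored value as well, for example with a regex replace function where the server offers one, or by keeping a precomputed normalized column.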
MySQL query to append key:value to JSON string
My table has a column with a JSON string that has nested objects (so a simple REPLACE function cannot solve this problem) . For example like this: {'name':'bob', 'blob': {'foo':'bar'}, 'age': 12}. What is the easiest query to append a value to the end of the JSON string? So for the example, I want the end result to look like this: {'name':'bob', 'blob': {'foo':'bar'}, 'age': 12, 'gender': 'male'} The solution should be generic enough to work for any JSON values.
What about this?
UPDATE table SET table_field1 = CONCAT(table_field1, ' This will be added.');
EDIT: Personally, I would have done the manipulation in a language like PHP before inserting it; that is much easier. Anyway, is this what you want? It should work provided the JSON being added is in the format {'key':'value'}:
UPDATE table SET col = CONCAT_WS(",", SUBSTRING(col, 1, CHAR_LENGTH(col) - 1), SUBSTRING('newjson', 2));
I think you can use the REPLACE function to achieve this:
UPDATE table SET column = REPLACE(column, '{\'name\':\'bob\', \'blob\': {\'foo\':\'bar\'}, \'age\': 12}', '{\'name\':\'bob\', \'blob\': {\'foo\':\'bar\'}, \'age\': 12, \'gender\': \'male\'}')
Take care to properly escape all quotes inside the JSON.
Following your request about nested JSON, I think you can just remove the last character of the string with the SUBSTRING function and then append whatever you need with CONCAT:
UPDATE table SET column = CONCAT(SUBSTRING(column, 1, CHAR_LENGTH(column) - 1), 'newjsontoappend')
Note that SUBSTRING positions in MySQL start at 1, and the appended text has to supply the separating comma and the closing brace itself.
This is a modification of Jack's answer. It works even when the column value is empty on the first update:
update table set column_name = case when column_name is null or column_name = '' then "{'foo':'bar'}" else CONCAT_WS(",", SUBSTRING(column_name, 1, CHAR_LENGTH(column_name) - 1), SUBSTRING("{'foo':'bar'}", 2)) end
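The brace-splicing idea used in these answers can be sketched in Python as below. This is an illustration of the string manipulation only; the function name append_pair is hypothetical, and it assumes the stored value ends exactly with the closing brace (no trailing whitespace) and that the addition is itself wrapped in braces.

```python
def append_pair(existing, addition):
    """Splice `addition` into `existing` by dropping one brace from each.

    Hypothetical helper mirroring the CONCAT_WS/SUBSTRING approach:
    an empty or missing value is simply replaced by the addition.
    """
    if existing is None or existing == "":
        return addition
    # Drop the final "}" of the stored value and the leading "{" of the addition.
    return existing[:-1] + "," + addition[1:]


doc = "{'name':'bob', 'blob': {'foo':'bar'}, 'age': 12}"
print(append_pair(doc, "{'gender': 'male'}"))
print(append_pair("", "{'foo':'bar'}"))
```

Because only the outermost closing brace is touched, nested objects inside the value do not matter, which is why this works where a plain REPLACE on a fixed substring does not.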
PostgreSQL replace HTML entities function
I've found this very interesting function on the internet:
CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
    SELECT regexp_replace(regexp_replace($1, E'(?x)<[^>]*?(\s alt \s* = \s* ([\'"]) ([^>]*?) \2) [^>]*? >', E'\3'), E'(?x)(< [^>]*? >)', '', 'g')
$$ LANGUAGE SQL;
But it doesn't remove HTML character entities such as &quot;. Is it possible to remove them using regexp_replace?
Yes it is possible to replace HTML or other character entities with the respective characters using a function. First create a character entity table: create table character_entity( name text primary key, ch char(1) unique ); insert into character_entity (ch, name) values (E'\u00C6','AElig'),(E'\u00C1','Aacute'),(E'\u00C2','Acirc'),(E'\u00C0','Agrave'),(E'\u0391','Alpha'),(E'\u00C5','Aring'),(E'\u00C3','Atilde'),(E'\u00C4','Auml'),(E'\u0392','Beta'),(E'\u00C7','Ccedil'), (E'\u03A7','Chi'),(E'\u2021','Dagger'),(E'\u0394','Delta'),(E'\u00D0','ETH'),(E'\u00C9','Eacute'),(E'\u00CA','Ecirc'),(E'\u00C8','Egrave'),(E'\u0395','Epsilon'),(E'\u0397','Eta'),(E'\u00CB','Euml'), (E'\u0393','Gamma'),(E'\u00CD','Iacute'),(E'\u00CE','Icirc'),(E'\u00CC','Igrave'),(E'\u0399','Iota'),(E'\u00CF','Iuml'),(E'\u039A','Kappa'),(E'\u039B','Lambda'),(E'\u039C','Mu'),(E'\u00D1','Ntilde'), (E'\u039D','Nu'),(E'\u0152','OElig'),(E'\u00D3','Oacute'),(E'\u00D4','Ocirc'),(E'\u00D2','Ograve'),(E'\u03A9','Omega'),(E'\u039F','Omicron'),(E'\u00D8','Oslash'),(E'\u00D5','Otilde'),(E'\u00D6','Ouml'), (E'\u03A6','Phi'),(E'\u03A0','Pi'),(E'\u2033','Prime'),(E'\u03A8','Psi'),(E'\u03A1','Rho'),(E'\u0160','Scaron'),(E'\u03A3','Sigma'),(E'\u00DE','THORN'),(E'\u03A4','Tau'),(E'\u0398','Theta'), (E'\u00DA','Uacute'),(E'\u00DB','Ucirc'),(E'\u00D9','Ugrave'),(E'\u03A5','Upsilon'),(E'\u00DC','Uuml'),(E'\u039E','Xi'),(E'\u00DD','Yacute'),(E'\u0178','Yuml'),(E'\u0396','Zeta'),(E'\u00E1','aacute'), (E'\u00E2','acirc'),(E'\u00B4','acute'),(E'\u00E6','aelig'),(E'\u00E0','agrave'),(E'\u2135','alefsym'),(E'\u03B1','alpha'),(E'\u0026','amp'),(E'\u2227','and'),(E'\u2220','ang'),(E'\u00E5','aring'), (E'\u2248','asymp'),(E'\u00E3','atilde'),(E'\u00E4','auml'),(E'\u201E','bdquo'),(E'\u03B2','beta'),(E'\u00A6','brvbar'),(E'\u2022','bull'),(E'\u2229','cap'),(E'\u00E7','ccedil'),(E'\u00B8','cedil'), 
(E'\u00A2','cent'),(E'\u03C7','chi'),(E'\u02C6','circ'),(E'\u2663','clubs'),(E'\u2245','cong'),(E'\u00A9','copy'),(E'\u21B5','crarr'),(E'\u222A','cup'),(E'\u00A4','curren'),(E'\u21D3','dArr'), (E'\u2020','dagger'),(E'\u2193','darr'),(E'\u00B0','deg'),(E'\u03B4','delta'),(E'\u2666','diams'),(E'\u00F7','divide'),(E'\u00E9','eacute'),(E'\u00EA','ecirc'),(E'\u00E8','egrave'),(E'\u2205','empty'), (E'\u2003','emsp'),(E'\u2002','ensp'),(E'\u03B5','epsilon'),(E'\u2261','equiv'),(E'\u03B7','eta'),(E'\u00F0','eth'),(E'\u00EB','euml'),(E'\u20AC','euro'),(E'\u2203','exist'),(E'\u0192','fnof'), (E'\u2200','forall'),(E'\u00BD','frac12'),(E'\u00BC','frac14'),(E'\u00BE','frac34'),(E'\u2044','frasl'),(E'\u03B3','gamma'),(E'\u2265','ge'),(E'\u003E','gt'),(E'\u21D4','hArr'),(E'\u2194','harr'), (E'\u2665','hearts'),(E'\u2026','hellip'),(E'\u00ED','iacute'),(E'\u00EE','icirc'),(E'\u00A1','iexcl'),(E'\u00EC','igrave'),(E'\u2111','image'),(E'\u221E','infin'),(E'\u222B','int'),(E'\u03B9','iota'), (E'\u00BF','iquest'),(E'\u2208','isin'),(E'\u00EF','iuml'),(E'\u03BA','kappa'),(E'\u21D0','lArr'),(E'\u03BB','lambda'),(E'\u2329','lang'),(E'\u00AB','laquo'),(E'\u2190','larr'),(E'\u2308','lceil'), (E'\u201C','ldquo'),(E'\u2264','le'),(E'\u230A','lfloor'),(E'\u2217','lowast'),(E'\u25CA','loz'),(E'\u200E','lrm'),(E'\u2039','lsaquo'),(E'\u2018','lsquo'),(E'\u003C','lt'),(E'\u00AF','macr'), (E'\u2014','mdash'),(E'\u00B5','micro'),(E'\u00B7','middot'),(E'\u2212','minus'),(E'\u03BC','mu'),(E'\u2207','nabla'),(E'\u00A0','nbsp'),(E'\u2013','ndash'),(E'\u2260','ne'),(E'\u220B','ni'), (E'\u00AC','not'),(E'\u2209','notin'),(E'\u2284','nsub'),(E'\u00F1','ntilde'),(E'\u03BD','nu'),(E'\u00F3','oacute'),(E'\u00F4','ocirc'),(E'\u0153','oelig'),(E'\u00F2','ograve'),(E'\u203E','oline'), (E'\u03C9','omega'),(E'\u03BF','omicron'),(E'\u2295','oplus'),(E'\u2228','or'),(E'\u00AA','ordf'),(E'\u00BA','ordm'),(E'\u00F8','oslash'),(E'\u00F5','otilde'),(E'\u2297','otimes'),(E'\u00F6','ouml'), 
(E'\u00B6','para'),(E'\u2202','part'),(E'\u2030','permil'),(E'\u22A5','perp'),(E'\u03C6','phi'),(E'\u03C0','pi'),(E'\u03D6','piv'),(E'\u00B1','plusmn'),(E'\u00A3','pound'),(E'\u2032','prime'), (E'\u220F','prod'),(E'\u221D','prop'),(E'\u03C8','psi'),(E'\u0022','quot'),(E'\u21D2','rArr'),(E'\u221A','radic'),(E'\u232A','rang'),(E'\u00BB','raquo'),(E'\u2192','rarr'),(E'\u2309','rceil'), (E'\u201D','rdquo'),(E'\u211C','real'),(E'\u00AE','reg'),(E'\u230B','rfloor'),(E'\u03C1','rho'),(E'\u200F','rlm'),(E'\u203A','rsaquo'),(E'\u2019','rsquo'),(E'\u201A','sbquo'),(E'\u0161','scaron'), (E'\u22C5','sdot'),(E'\u00A7','sect'),(E'\u00AD','shy'),(E'\u03C3','sigma'),(E'\u03C2','sigmaf'),(E'\u223C','sim'),(E'\u2660','spades'),(E'\u2282','sub'),(E'\u2286','sube'),(E'\u2211','sum'), (E'\u2283','sup'),(E'\u00B9','sup1'),(E'\u00B2','sup2'),(E'\u00B3','sup3'),(E'\u2287','supe'),(E'\u00DF','szlig'),(E'\u03C4','tau'),(E'\u2234','there4'),(E'\u03B8','theta'),(E'\u03D1','thetasym'), (E'\u2009','thinsp'),(E'\u00FE','thorn'),(E'\u02DC','tilde'),(E'\u00D7','times'),(E'\u2122','trade'),(E'\u21D1','uArr'),(E'\u00FA','uacute'),(E'\u2191','uarr'),(E'\u00FB','ucirc'),(E'\u00F9','ugrave'), (E'\u00A8','uml'),(E'\u03D2','upsih'),(E'\u03C5','upsilon'),(E'\u00FC','uuml'),(E'\u2118','weierp'),(E'\u03BE','xi'),(E'\u00FD','yacute'),(E'\u00A5','yen'),(E'\u00FF','yuml'),(E'\u03B6','zeta'), (E'\u200D','zwj'),(E'\u200C','zwnj') ; This is the function: create or replace function entity2char(t text) returns text as $body$ declare r record; begin for r in select distinct ce.ch, ce.name from character_entity ce inner join ( select name[1] "name" from regexp_matches(t, '&([A-Za-z]+?);', 'g') r(name) ) s on ce.name = s.name loop t := replace(t, '&' || r.name || ';', r.ch); end loop; for r in select distinct hex[1] hex, ('x' || repeat('0', 8 - length(hex[1])) || hex[1])::bit(32)::int codepoint from regexp_matches(t, '&#x([0-9a-f]{1,8}?);', 'gi') s(hex) loop t := regexp_replace(t, '&#x' || r.hex || ';', 
chr(r.codepoint), 'gi'); end loop; for r in select distinct chr(codepoint[1]::int) ch, codepoint[1] codepoint from regexp_matches(t, '&#([0-9]{1,10}?);', 'g') s(codepoint) loop t := replace(t, '&#' || r.codepoint || ';', r.ch); end loop; return t; end; $body$ language plpgsql immutable;
Use it like this:
select entity2char('HH&#9632;XXX&AElig;YYY&times;ZZZ&#x25a0;UUU');
    entity2char
--------------------
 HH■XXXÆYYY×ZZZ■UUU
It only works with UTF-8 encoding.
This classic quote may apply here: Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
Regexes are useful, but HTML parsing is not a job they're well suited for. Jeff Atwood explains this well. To strip tags from HTML correctly, some kind of parsing is necessary.
What I would recommend is that you use a more powerful PL like PL/Perl or PL/PythonU to invoke mature and well-tested HTML-stripping libraries. For example, you could use Perl's HTML::Strip via a plperl function that accepts text and returns text.
The quick and dirty way to handle this would be to use another layer of regexp_replace expressions to convert entities. This will rapidly lead you down the path alluded to by Igor, though, and is best avoided by using tools that already exist. For example, if you use HTML::Strip, it'll use HTML::Entities to convert entities for you as part of the process.
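As an illustration of the parser-based route, here is a minimal Python sketch using only the standard library's html.parser module. It is not the HTML::Strip solution named above, just the same idea in another language; the class name TextExtractor and the function name strip_tags are hypothetical, and wrapping this in a PL/Python function is left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content; tags are ignored.

    With convert_charrefs=True the parser decodes character references
    such as &amp; or &quot; before handing text to handle_data.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text of `html_text` with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_tags('<p>Fish &amp; <b>chips</b> &quot;to go&quot;</p>'))
```

The parser takes care of both tag removal and entity conversion in one pass, which is the point the answer makes about HTML::Strip and HTML::Entities.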
I've been using this successfully for a while - thanks for the solution. However, I've just discovered that it doesn't seem to work with HTML entities such as &sup2; (superscript 2 = ²), and I suspect any other entity that has a digit just before the closing ";". I believe the line
from regexp_matches(t, '&([A-Za-z]+?);', 'g') r(name)
should be
from regexp_matches(t, '&([A-Za-z]+[0-9]?);', 'g') r(name)
I've tried this with a few examples and it seems to work.
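The effect of the proposed pattern change can be checked quickly in Python, whose re module treats these particular patterns the same way. The sketch below is an illustration only, and the variable names are hypothetical. It also tries a third variant with [0-9]*, because the entity table above contains names ending in two digits (frac12, frac14, frac34) that a single optional digit would still miss.

```python
import re

original = re.compile(r"&([A-Za-z]+?);")          # pattern from the function
one_digit = re.compile(r"&([A-Za-z]+[0-9]?);")    # fix proposed in this comment
any_digits = re.compile(r"&([A-Za-z]+[0-9]*);")   # hypothetical further variant

text = "x&sup2; &frac12; &amp;"

print(original.findall(text))    # only names without digits
print(one_digit.findall(text))   # also names ending in one digit
print(any_digits.findall(text))  # also names ending in several digits
```

If entities such as frac12 matter, the [0-9]* form appears to be the safer replacement; that is a suggestion tested only against this small sample.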