I want to update about 3,500,000 values in a MySQL table. Each value is a string of the form AB00123012, and I want to remove the leading zeroes after the letters, i.e. get AB123012 (zeroes inside the number should be kept).
The value always has exactly 10 characters.
Since MySQL does not allow replacing by regex, I have used the following function:
DELIMITER $$
CREATE FUNCTION fn_RawRingNumber (rn CHAR(10))
RETURNS CHAR(10) DETERMINISTIC
BEGIN
DECLARE newrn CHAR(10);
DECLARE pos INT(8);
DECLARE letters CHAR(2);
DECLARE nr CHAR(8);
IF (CHAR_LENGTH(rn) = 10) THEN
SET pos = (SELECT POSITION('0' IN rn));
SET letters = (SELECT SUBSTRING_INDEX(rn, '0', 1));
SET nr = (SELECT TRIM(LEADING '0' FROM SUBSTRING(rn,pos)));
SET newrn = (SELECT CONCAT(letters, nr));
ELSE
SET newrn = rn;
END IF;
RETURN newrn;
END$$
DELIMITER ;
While this works, it is rather slow, and I am wondering whether there is a better way to do this.
If you can afford to take your site offline for a few minutes, the fastest way would be to dump, process and re-import. Since the current operation makes queries and inserts on that table pretty slow, you are probably better off with a dump/process/import anyway.
Step 1: dump
SELECT ... INTO OUTFILE is your friend here.
Step 2: process
Use your favourite programming language or, if you are lucky enough to be on Linux, something like sed or even cut. If you need help with the regex, post a comment.
Step 3: re-import
After clearing out the table, do a LOAD DATA INFILE.
These three steps should all be reasonably quick, especially if you have an index on that column.
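The processing step could look roughly like the following in Python. This is only a sketch under stated assumptions: the dump is a tab-separated text file (the default layout of SELECT ... INTO OUTFILE), the value sits in a known column, and the names `strip_leading_zeroes` and `process_dump` are made up for illustration. It mirrors the question's function: only 10-character values are touched, and only the zeroes directly after the non-digit prefix are dropped.

```python
import re

# Non-digit prefix (the letters), followed by the run of zeroes to drop.
_LEADING_ZEROES = re.compile(r"^([^0-9]*)0+")


def strip_leading_zeroes(value: str) -> str:
    """Drop the zeroes that directly follow the non-digit prefix of a 10-character value."""
    if len(value) != 10:
        return value  # like the original function, leave other lengths alone
    return _LEADING_ZEROES.sub(r"\1", value)


def process_dump(src_path: str, dst_path: str, column: int = 0) -> None:
    """Rewrite one tab-separated column of a dump file, line by line (assumed layout)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            fields[column] = strip_leading_zeroes(fields[column])
            dst.write("\t".join(fields) + "\n")
```

Zeroes inside the number survive because the pattern is anchored at the start of the string and stops at the first non-zero digit.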
Try this.
Note: I have not tested this with many rows and don't know how efficient it is.
Also, even if it is fast, before using it please think through all the variations that may occur in your strings. I may have missed some cases; I am not 100% sure.
select case
when INSTR(col, '0') = 2 then concat( substr(col, 1, 1), substr(col, 2) * 1)
when INSTR(col, '0') = 3 and substr(col, 2, 1) not in('1','2','3','4','5','6','7','8','9') then concat( substr(col, 1, 2), substr(col, 3) * 1)
else col
end
from (
select 'AB00123012' as col union all
select 'A010000123' as col union all
select 'A1000000124' as col union all
select 'A0000000124' as col union all
select '.E00086425' as col
) t
I've been asked to create a VIEW off a table that includes a varchar(MAX) column containing a JSON string. Unfortunately, some of the entries contain double quotes that aren't escaped.
Example (invalid in Notes):
{"Eligible":"true","Reason":"","Notes":"Left message for employee to "call me"","EDate":"08/16/2021"}
I don't have access to correct wherever this is being inserted so I just have to work with the data as is.
So in my view I need to find a way to escape those double quotes.
I'm pulling the data like so:
JSON_VALUE(JsonData, '$.Notes') as Notes
However, I get the following error:
JSON text is not properly formatted. Unexpected character '"' is found at position 102.
I can't do a simple replace on the whole field because that would create invalid JSON also.
I tried JSON_MODIFY but run into the problem of getting the notes field to replace itself.
JSON_MODIFY(JsonData, '$.Notes', REPLACE(JSON_VALUE(JsonData, '$.Notes'), '"', '\"'))
Maybe I'm missing something obvious, but I can't figure out how to handle this. Is there a way to escape those double quotes in my query?
So this is incredibly hacky and there are probably several examples that could break it as is, but if you absolutely can't fix your source data output or simply flag bad JSON for manual adjustment, this may be the route you need to take and further flesh out.
Based on your example and a couple extras I have thrown in, with the help of a custom string splitting table valued function that maintains sort order, you can achieve the output as follows:
Query
declare @t table (JsonData nvarchar(max));
insert into @t values('{"Eligible":true,"Reason":"","Notes":"Left message for employee to "call me"","EDate":"08/16/2021","Test": "999","Another Test":"Value with " character"}');
with q as
(
select t.JsonData
,s.rn
,case when right(trim(lag(s.item,1) over (order by s.rn)),1) in('{',':',',')
then '"'
else ''
end -- Do we need a starting double quote?
+ s.item -- Value from the split text
+ case when right(trim(lead(s.item,1) over (order by s.rn)),1) not in('}',':',',')
and right(trim(s.item),1) not in('{','}',':',',')
then '\"'
else ''
end -- Do we need an escaped double quote?
+ case when left(trim(lead(s.item,1) over (order by s.rn)),1) in('}',':',',')
then '"'
else ''
end -- Do we need an ending double quote?
as Quoted
from @t as t
cross apply dbo.fn_StringSplit4k(t.JsonData,'"',null) as s -- By splitting on " characters, we know where they all are even though they are removed, so we can add them back in as required based on the remaining text
)
,j as
(
select JsonData
,string_agg(Quoted,'') within group (order by rn) as JsonFixed
from q
group by JsonData
)
select json_value(JsonFixed, '$.Eligible') as Eligible
,json_value(JsonFixed, '$.Reason') as Reason
,json_value(JsonFixed, '$.Notes') as Notes
,json_value(JsonFixed, '$.EDate') as EDate
,json_value(JsonFixed, '$.Test') as Test
,json_value(JsonFixed, '$."Another Test"') as AnotherTest
from j;
Output
Eligible | Reason | Notes | EDate | Test | AnotherTest
--- | --- | --- | --- | --- | ---
true | | Left message for employee to "call me" | 08/16/2021 | 999 | Value with " character
String Splitter
create function [dbo].[fn_StringSplit4k]
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
I would like to suggest a scalar function along these lines:
CREATE FUNCTION dbo.clearJSon(@v nvarchar(max)) RETURNS nvarchar(max)
AS
BEGIN
DECLARE @i AS int
DECLARE @security int
-- Find a double quote that is neither preceded by one of {:, nor followed by one of ,:}
SET @i=PATINDEX('%[^{:,]"[^,:}]%',@v)
SET @security=0 -- just to prevent an endless loop
WHILE @i>0 and @security<100
BEGIN
-- Replace the offending double quote with a single quote
SET @v = LEFT(@v,@i)+''''+SUBSTRING(@v,@i+2,len(@v))
SET @i=PATINDEX('%[^{:,]"[^,:}]%',@v)
SET @security = @security+1
END
RETURN @v
END
which returns
{"Eligible":"true","Reason":"","Notes":"Left message for employee to 'call me'","EDate":"08/16/2021"} as the result of dbo.clearJSon(JsonData)
I have to admit, though, that the above code would fail if an unescaped quote were followed by one of ,:} or preceded by one of {:,
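Both answers rest on the same heuristic: a double quote is treated as structural when it follows one of `{:,` or precedes one of `:,}`, and anything else is assumed to be a stray quote inside a value. A minimal Python sketch of that idea, escaping the stray quotes rather than replacing them, is shown here for illustration only. The function name `escape_inner_quotes` is hypothetical, the sketch ignores whitespace around the quote, and it shares the limitation described above: a stray quote that happens to sit next to one of those characters is misclassified.

```python
import json

OPENING_CONTEXT = frozenset("{:,")  # a quote right after one of these opens a key or value
CLOSING_CONTEXT = frozenset(":,}")  # a quote right before one of these closes a key or value


def escape_inner_quotes(text: str) -> str:
    """Backslash-escape double quotes that do not look structural (heuristic)."""
    out = []
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        before = text[:i].rstrip()[-1:]    # previous non-space character, '' at the start
        after = text[i + 1:].lstrip()[:1]  # next non-space character, '' at the end
        structural = before in OPENING_CONTEXT or after in CLOSING_CONTEXT
        out.append('"' if structural else '\\"')
    return "".join(out)


bad = ('{"Eligible":"true","Reason":"","Notes":"Left message for employee '
       'to "call me"","EDate":"08/16/2021"}')
fixed = json.loads(escape_inner_quotes(bad))
print(fixed["Notes"])  # Left message for employee to "call me"
```

The slicing makes this quadratic in the string length, which is fine for a sketch but worth rewriting before running it over large columns.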
I'm trying to make only the odd indexes of a string in uppercase (whereas the even indexes to be in lowercase) in MySQL.
For example: StackOverflow -> StAcKoVeRfLoW or hello -> HeLlO.
I found a way to do this by extracting one letter at a time using the mid function, then concatenating based on which index the letter is at:
SET @x='hello';
SELECT @x as Initial,
Concat(ucase(mid(@x,1,1)),lcase(mid(@x,2,1)),ucase(mid(@x,3,1)),lcase(mid(@x,4,1)),ucase(mid(@x,5,1)))
as Final;
However, I'm interested in whether there is a way to simplify this, since it does not scale to longer strings. So basically, is there a way to modify it to something like:
Concat(ucase(mid(@x,odd index,1)),lcase(mid(@x,even index,1)))?
This is probably most simply done in your application, but can be achieved in MySQL. For MySQL 8+ you can use a recursive CTE to extract the individual letters from the string and GROUP_CONCAT to put them back together, changing the case on an alternating basis:
WITH RECURSIVE INITIAL AS (
SELECT 'StackOverflow' AS x
),
CTE AS (
SELECT 1 AS upper, SUBSTRING(x, 1, 1) AS letter, SUBSTRING(x, 2) AS remainder
FROM INITIAL
UNION ALL
SELECT 1 - upper, SUBSTRING(remainder, 1, 1), SUBSTRING(remainder, 2)
FROM CTE
WHERE LENGTH(remainder) > 0
)
SELECT GROUP_CONCAT(CASE WHEN upper THEN UPPER(letter) ELSE LOWER(letter) END SEPARATOR '') AS new
FROM CTE
Output:
StAcKoVeRfLoW
In versions lower than 8, you can use a user-defined function:
DELIMITER //
CREATE FUNCTION AlterCase(initial TEXT)
RETURNS TEXT
DETERMINISTIC
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE l CHAR(1);
DECLARE new TEXT DEFAULT '';
WHILE i <= LENGTH(initial) DO
SET l = SUBSTRING(initial, i, 1);
SET new = CONCAT(new,
CASE WHEN i % 2 = 1 THEN UPPER(l) ELSE LOWER(l) END);
SET i = i + 1;
END WHILE;
RETURN new;
END //
DELIMITER ;
And call it as
SELECT AlterCase('StackOverflow')
Output:
StAcKoVeRfLoW
Note the function will work in MySQL 8+ too.
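For comparison, doing this in the application is short. The following Python sketch is one way it might look; the function name `alternate_case` is made up for illustration, and it assumes the same 1-based convention as the question (first character uppercase).

```python
def alternate_case(text: str) -> str:
    """Uppercase the 1st, 3rd, 5th, ... characters and lowercase the rest."""
    return "".join(
        # enumerate() is 0-based, so index 0 is the first ("odd") position
        ch.upper() if index % 2 == 0 else ch.lower()
        for index, ch in enumerate(text)
    )


print(alternate_case("StackOverflow"))  # StAcKoVeRfLoW
print(alternate_case("hello"))          # HeLlO
```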
I tried to write a SQL-function that generates an unused unique ID in a range between 1000000 and 4294967295. I need numeric values, so UUID() alike is not a solution. It doesn't sound that difficult, but for some reason, the code below does not work when called within an INSERT-statement on a table as value for the primary key (not auto_increment, of course). The statement is like INSERT INTO table (id, content) VALUES ((SELECT getRandomID(0,0)), 'blabla bla');
(Since default values are not allowed in such functions, I shortly submit 0 for each argument and set it in the function to the desired value.)
Called once and separated from INSERT or Python-code, everything is fine. Called several times, something weird happens and not only the whole process but also the server might hang within REPEAT. The process is then not even possible to kill/restart; I have to reboot the machine -.-
It also seems to have only a few random values ready for me, since the same values appear again and again after some calls, although I thought that the inner rand() would be a sufficient seed for the outer rand().
Called from Python, the loop starts to hang after some rounds, although the very first one in my tests always produces a useful, new ID and should therefore quit after the first round. Why? Well, the table is empty, so SELECT COUNT(*)... returns 0, which is the signal for leaving the loop... but it doesn't.
Any ideas?
I'm running MariaDB 10.something on SLES 12.2. Here is the exported source code:
DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `getRandomID`(`rangeStart` BIGINT UNSIGNED, `rangeEnd` BIGINT UNSIGNED) RETURNS bigint(20) unsigned
READS SQL DATA
BEGIN
DECLARE rnd BIGINT unsigned;
DECLARE i BIGINT unsigned;
IF rangeStart is null OR rangeStart < 1 THEN
SET rangeStart = 1000000;
END IF;
IF rangeEnd is null OR rangeEnd < 1 THEN
SET rangeEnd = 4294967295;
END IF;
SET i = 0;
r: REPEAT
SET rnd = FLOOR(rangeStart + RAND(RAND(FLOOR(1 + rand() * 1000000000))*10) * (rangeEnd - rangeStart));
SELECT COUNT(*) INTO i FROM `table` WHERE `id` = rnd;
UNTIL i = 0 END REPEAT r;
RETURN rnd;
END$$
DELIMITER ;
A slight improvement:
SELECT COUNT(*) INTO i FROM `table` WHERE `id` = rnd;
UNTIL i = 0 END REPEAT r;
-->
UNTIL NOT EXISTS( SELECT 1 FROM `table` WHERE id = rnd ) END REPEAT r;
Don't pass any argument to RAND -- that is for establishing a repeatable sequence of random numbers.
mysql> SELECT RAND(123), RAND(123), RAND(), RAND()\G
*************************** 1. row ***************************
RAND(123): 0.9277428611440052
RAND(123): 0.9277428611440052
RAND(): 0.5645420109522921
RAND(): 0.12561983719991504
1 row in set (0.00 sec)
So simplify to
SET rnd = FLOOR(rangeStart + RAND() * (rangeEnd - rangeStart));
If you want to include rangeEnd in the possible outputs, add 1:
SET rnd = FLOOR(rangeStart + RAND() * (rangeEnd - rangeStart + 1));
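To see why the `+ 1` makes the upper bound reachable, here is the same arithmetic as a small Python sketch. It is an illustration only, not the SQL function: `random_unused_id`, the `used_ids` set standing in for the table lookup, and the `max_attempts` cap are all assumptions added for the example. The cap plays the role of a safety valve so that a nearly full range cannot loop forever.

```python
import random


def random_unused_id(used_ids, range_start=1_000_000, range_end=4_294_967_295,
                     rng=random, max_attempts=1000):
    """Pick a random ID in [range_start, range_end] that is not in used_ids."""
    span = range_end - range_start + 1  # the "+ 1" makes range_end reachable
    for _ in range(max_attempts):
        # rng.random() is in [0, 1), so int(...) is in [0, span - 1]
        candidate = range_start + int(rng.random() * span)
        if candidate not in used_ids:   # stands in for the NOT EXISTS check
            return candidate
    raise RuntimeError("no unused ID found within max_attempts")


print(random_unused_id({1, 2}, range_start=1, range_end=3))  # 3, the only free ID
```

Note that no seed is passed anywhere: as with RAND() in SQL, seeding on every call is what makes values repeat.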
In a recent post Sql server rtrim not working for me, suggestions?, I got some good help getting a csv string out of a select query. It's behaving unexpectedly though, and I can't find any similar examples or documentation on it. The query returns 802 records without the coalesce statement, as a normal select. With the coalesce, I'm getting back just 81. I get this same result if I output to text, or output to file. This query returns 800+ rows:
declare @maxDate date = (select MAX(TradeDate) from tblDailyPricingAndVol)
select p.Symbol, ','
from tblDailyPricingAndVol p
where p.Volume > 1000000 and p.Clse <= 40 and p.TradeDate = @maxDate
order by p.Symbol
But when I attempt to concatenate those values, many are missing:
declare @maxDate date = (select MAX(TradeDate) from tblDailyPricingAndVol)
declare @str VARCHAR(MAX)
SELECT @str = COALESCE(@str+',' ,'') + LTRIM(RTRIM((p.Symbol)))
FROM tblDailyPricingAndVol p
WHERE p.Volume > 1000000 and p.Clse <= 40 and p.TradeDate = @maxDate
ORDER by p.Symbol
SELECT @str
This should be working fine, however here is how I would do it:
DECLARE @str VARCHAR(MAX) = '';
SELECT @str += ',' + LTRIM(RTRIM(Symbol))
FROM dbo.tblDailyPricingAndVol
WHERE Volume > 1000000 AND Clse <= 40 AND TradeDate = @maxDate
ORDER by Symbol;
SET @str = STUFF(@str, 1, 1, ''); -- remove the leading comma
To determine whether the string is complete, stop looking at the output in Management Studio. This is always going to be truncated if you exceed the number of characters Management Studio will show. You can run a couple of tests to check the variable without inspecting it in its entirety:
A. Compare the datalength of the individual parts to the datalength of the result.
SELECT SUM(DATALENGTH(LTRIM(RTRIM(Symbol)))) FROM dbo.tblDailyPricingAndVol
WHERE ...
-- concatenation query here
SELECT DATALENGTH(@str);
-- these should be equal or off by one.
B. Compare the end of the variable to the last element in the set.
SELECT TOP 1 Symbol FROM dbo.tblDailyPricingAndVol
WHERE ...
ORDER BY Symbol DESC;
-- concatenation query here
SELECT RIGHT(@str, 20);
-- is the last element in the set represented at the end of the string?
I'd like some help in optimizing the following query:
SELECT DISTINCT TOP (@NumberOfResultsRequested) dbo.FilterRecentSearchesTitles(OriginalSearchTerm) AS SearchTerms
FROM UserSearches
WHERE WebsiteID = @WebsiteID
AND LEN(OriginalSearchTerm) > 20
--AND dbo.FilterRecentSearchesTitles(OriginalSearchTerm) NOT IN (SELECT KeywordUrl FROM PopularSearchesBaseline WHERE WebsiteID = @WebsiteID)
GROUP BY OriginalSearchTerm, GeoID
It runs fine without the line that is commented out. I have an index set on UserSearches.OriginalSearchTerm, WebsiteID, and PopularSearchesBaseline.KeywordUrl, but the query still runs slow with this line in there.
-- UPDATE --
The function used is as follows:
ALTER FUNCTION [dbo].[FilterRecentSearchesTitles]
(
@SearchTerm VARCHAR(512)
)
RETURNS VARCHAR(512)
AS
BEGIN
DECLARE @Ret VARCHAR(512)
SET @Ret = dbo.RegexReplace('[0-9]', '', REPLACE(@SearchTerm, '__s', ''), 1, 1)
SET @Ret = dbo.RegexReplace('\.', '', @Ret, 1, 1)
SET @Ret = dbo.RegexReplace('\s{2,}', ' ', @Ret, 1, 1)
SET @Ret = dbo.RegexReplace('\sv\s', ' ', @Ret, 1, 1)
RETURN(@Ret)
END
Using the Regular Expression Workbench code.
However, as I mentioned - without the line that is currently commented out it runs fine.
Any other suggestions?
I am going to guess that dbo.FilterRecentSearchesTitles(OriginalSearchTerm) is a function. My suggestion would be to see about rewriting it into a table valued function so you can return a table that could be joined on.
Otherwise you are calling that function for each row you are trying to return which is going to cause your problems.
If you cannot rewrite the function, then why not create a stored proc that will only execute it once, similar to this:
SELECT DISTINCT TOP (@NumberOfResultsRequested) dbo.FilterRecentSearchesTitles(OriginalSearchTerm) AS SearchTerms
INTO #temp
FROM UserSearches
WHERE WebsiteID = @WebsiteID
SELECT *
FROM #temp
WHERE SearchTerms NOT IN (SELECT KeywordUrl
FROM PopularSearchesBaseline
WHERE WebsiteID = @WebsiteID)
Then you get your records into a temp table after executing the function once and then you select on the temp table.
I might try to use a persisted computed column in this case:
ALTER TABLE UserSearches ADD FilteredOriginalSearchTerm AS dbo.FilterRecentSearchesTitles(OriginalSearchTerm) PERSISTED
You will probably have to add WITH SCHEMABINDING to your function (and the RegexReplace function) like so:
ALTER FUNCTION [dbo].[FilterRecentSearchesTitles]
(
@SearchTerm VARCHAR(512)
)
RETURNS VARCHAR(512)
WITH SCHEMABINDING -- You will need this so the function is considered deterministic
AS
BEGIN
DECLARE @Ret VARCHAR(512)
SET @Ret = dbo.RegexReplace('[0-9]', '', REPLACE(@SearchTerm, '__s', ''), 1, 1)
SET @Ret = dbo.RegexReplace('\.', '', @Ret, 1, 1)
SET @Ret = dbo.RegexReplace('\s{2,}', ' ', @Ret, 1, 1)
SET @Ret = dbo.RegexReplace('\sv\s', ' ', @Ret, 1, 1)
RETURN(#Ret)
END
This makes your query look like this:
SELECT DISTINCT TOP (@NumberOfResultsRequested) FilteredOriginalSearchTerm AS SearchTerms
FROM UserSearches
WHERE WebsiteID = @WebsiteID
AND LEN(OriginalSearchTerm) > 20
AND FilteredOriginalSearchTerm NOT IN (SELECT KeywordUrl FROM PopularSearchesBaseline WHERE WebsiteID = @WebsiteID)
GROUP BY OriginalSearchTerm, GeoID
Which could potentially be optimized for speed (if necessary) with a join instead of not in, or maybe different indexing (perhaps on the computed column, or some covering indexes). Also, DISTINCT with a GROUP BY is somewhat of a code smell to me, but it could be legit.
Instead of using the function in the SELECT, I modified the INSERT query to include this function. That way, I avoid calling the function for every row when I later want to retrieve the data.