How to improve performance for REGEXP string matching in MySQL? - mysql

Preface:
I've done quite a bit of (re)searching on this, and found the following SO post/answer: https://stackoverflow.com/a/5361490/6095216 which was pretty close to what I'm looking for. The same code, but with somewhat more helpful comments, appears here: http://thenoyes.com/littlenoise/?p=136 .
Problem Description:
I need to split 1 column of MySQL TEXT data into multiple columns, where the original data has this format (N <= 7):
{"field1":"value1","field2":"value2",...,"fieldN":"valueN"}
As you might guess, I only need to extract the values, putting each one into a separate (predefined) column. The problem is that the number and order of the fields is not guaranteed to be the same for all records. Thus, solutions using SUBSTR/LOCATE, etc. don't work, and I need to use regular expressions. Another restriction is that 3rd party libraries such as LIB_MYSQLUDF_PREG (suggested in the answer from my 1st link above) cannot be used.
Solution/Progress so far:
I've modified the code from the above links such that it returns the first/shortest match, left-to-right; otherwise, NULL is returned. I also refactored it a bit and made the identifiers more reader/maintainer-friendly :)
Here's my version:
CREATE FUNCTION REGEXP_EXTRACT_SHORTEST(string TEXT, exp TEXT)
RETURNS TEXT DETERMINISTIC
BEGIN
DECLARE adjustStart, adjustEnd BOOLEAN DEFAULT TRUE;
DECLARE startInd INT DEFAULT 1;
DECLARE endInd, strLen INT;
DECLARE candidate TEXT;
IF string NOT REGEXP exp THEN
RETURN NULL;
END IF;
IF LEFT(exp, 1) = '^' THEN
SET adjustStart = FALSE;
ELSE
SET exp = CONCAT('^', exp);
END IF;
IF RIGHT(exp, 1) = '$' THEN
SET adjustEnd = FALSE;
ELSE
SET exp = CONCAT(exp, '$');
END IF;
SET strLen = LENGTH(string);
StartIndLoop: WHILE (startInd <= strLen) DO
IF adjustEnd THEN
SET endInd = startInd;
ELSE
SET endInd = strLen;
END IF;
EndIndLoop: WHILE (endInd <= strLen) DO
SET candidate = SUBSTRING(string FROM startInd FOR (endInd - startInd + 1));
IF candidate REGEXP exp THEN
RETURN candidate;
END IF;
IF adjustEnd THEN
SET endInd = endInd + 1;
ELSE
LEAVE EndIndLoop;
END IF;
END WHILE EndIndLoop;
IF adjustStart THEN
SET startInd = startInd + 1;
ELSE
LEAVE StartIndLoop;
END IF;
END WHILE StartIndLoop;
RETURN NULL;
END;
I then added a helper function to avoid having to repeat the regex pattern, which, as you can see from above, is the same for all the fields. Here is that function (I left my attempt to use a lookbehind - unsupported in MySQL - as a comment):
CREATE FUNCTION GET_MY_FLD_VAL(inputStr TEXT, fldName TEXT)
RETURNS TEXT DETERMINISTIC
BEGIN
DECLARE valPattern TEXT DEFAULT '"[^"]+"'; /* MySQL doesn't support lookaround :( '(?<=^.{1})"[^"]+"'*/
DECLARE fldNamePat TEXT DEFAULT CONCAT('"', fldName, '":');
DECLARE discardLen INT UNSIGNED DEFAULT LENGTH(fldNamePat) + 2;
DECLARE matchResult TEXT DEFAULT REGEXP_EXTRACT_SHORTEST(inputStr, CONCAT(fldNamePat, valPattern));
RETURN SUBSTRING(matchResult FROM discardLen FOR LENGTH(matchResult) - discardLen);
END;
Currently, all I'm trying to do is a simple SELECT query using the above code. It works correctly, BUT IT. IS. SLOOOOOOOW... There are only 7 fields/columns to split into, max (not all records have all 7)! Limited to 20 records, it takes about 3 minutes - and I have about 40,000 records total (not very much for a database, right?!) :)
And so, finally, we get to the actual question: [how] can the above algorithm/code (pretty much a brute search at this point) be improved SIGNIFICANTLY performance-wise, such that it can be run on the actual database in a reasonable amount of time? I started looking into the major known pattern-matching algorithms, but quickly got lost trying to figure out what would be appropriate here, in large part due to the number of available options and their respective restrictions, conditions for use, etc. Plus, it seems like implementing one of these in SQL just to see if it would help, might be a lot of work.
Note: this is my first post ever(!), so please let me know (nicely) if something is not clear, etc. and I will do my best to fix it. Thanks in advance.

I was able to solve this by parsing the JSON, as suggested by tadman and Matt Raines above. Being new to the concept of JSON, I just didn't realize it could be done this way at all...a little embarrassing, but lesson learned!
Anyway, I used the get_option function in the common_schema framework: https://code.google.com/archive/p/common-schema/ (found through this post, which also demonstrates how to use the function: Parse JSON in MySQL ). As a result, my INSERT query took about 15 minutes to run, vs the 30+ hours it would've taken with the REGEXP solution. Thanks, and until next time! :)

Don't do it in SQL; do it in PHP or some other language that has builtin tools for parsing JSON.
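As a rough illustration of that suggestion, here is a minimal Python sketch that uses the standard json module instead of regular expressions. The column list FIELDS and the function name split_fields are hypothetical (the question only says there are at most 7 fields), and the sketch assumes each row holds a flat JSON object like the one shown in the question.

```python
import json

# Hypothetical target columns; the question does not name the real ones.
FIELDS = ["field1", "field2", "field3"]

def split_fields(raw, fields=FIELDS):
    """Return one value per target column, or None where the key is absent."""
    try:
        record = json.loads(raw)
    except (TypeError, ValueError):
        # NULL or malformed input: leave every column empty.
        return [None] * len(fields)
    if not isinstance(record, dict):
        return [None] * len(fields)
    # Look fields up by name, so their order in the source text is irrelevant.
    return [record.get(name) for name in fields]

print(split_fields('{"field2":"value2","field1":"value1"}'))
```

Because the lookup is by key, the varying number and order of fields stops being a problem; the resulting list can be passed to a parameterized INSERT.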


How to omit html tags in a mysql table attribute while doing a select

I have a table where each row has an attribute containing HTML data like this:
<div className="single_line"><p>New note example</p></div>
I need to omit the HTML tags and extract only the data inside the tags using an SQL query. Any idea how to achieve this? I tried different regular expressions, but they didn't work.
There are two solutions, depending on the MySQL version.
If you are using MySQL 8.0 then you can use REGEXP_REPLACE() directly inside the select statement.
SELECT REGEXP_REPLACE('<div><p>New note example</p></div>', '(<[^>]*>)|(&nbsp;)', '');
If you are using MySQL 5.7 then you have to create a user-defined function in the database to strip HTML tags.
DROP FUNCTION IF EXISTS fn_strip_html_tags;
CREATE FUNCTION fn_strip_html_tags( html_text TEXT ) RETURNS TEXT
BEGIN
DECLARE start,end INT DEFAULT 1;
DECLARE text_without_nbsp TEXT;
LOOP
SET start = LOCATE("<", html_text, start);
IF (!start) THEN RETURN html_text; END IF;
SET end = LOCATE(">", html_text, start);
IF (!end) THEN SET end = start; END IF;
SET text_without_nbsp = REPLACE(html_text, "&nbsp;", " ");
SET html_text = INSERT(text_without_nbsp, start, end - start + 1, "");
END LOOP;
END
For example
SELECT fn_strip_html_tags('<div><p>New note example</p></div>');
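To make the loop in the 5.7 function easier to follow, here is a minimal Python sketch of the same locate-and-cut idea. The name strip_html_tags is hypothetical, and the entity being replaced is assumed to be `&nbsp;`, as the variable name text_without_nbsp suggests. It is a model of the logic, not a replacement for the SQL function.

```python
def strip_html_tags(html_text):
    """Remove every <...> span, mirroring the LOCATE/INSERT loop."""
    # Assumed entity: non-breaking spaces become plain spaces.
    html_text = html_text.replace("&nbsp;", " ")
    start = 0
    while True:
        start = html_text.find("<", start)
        if start == -1:
            return html_text          # no more tags
        end = html_text.find(">", start)
        if end == -1:
            end = start               # unterminated tag: drop only the "<"
        # Cut out the tag; the next search resumes at the same position.
        html_text = html_text[:start] + html_text[end + 1:]

print(strip_html_tags('<div><p>New note example</p></div>'))
```

Like the SQL version, this treats any "<" as the start of a tag, so it is only suitable for markup where "<" never appears in the text itself.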

FFT implementation in Verilog : Error using nested for loop

This post is related to my previous post about FFT:
FFT implemetation in Verilog: Assigning Wire input to Register type array
I want to assign output of first stage to input of second stage of FFT butterfly modules. I have to re-order the output of first stage according to input of second stage. Here is my code to implement the swapping.
always @ (posedge y_ndd[0] or posedge J)
begin
if(J==1'b1)
begin
for (idx=0; idx<N/2; idx=idx+1)
begin
IN[2*idx] <= X[idx*2*X_WDTH+: 2*X_WDTH];
IN[2*idx+1] <= X[(idx+N/2)*2*X_WDTH+: 2*X_WDTH];
end
end
else
begin
level=level+1;
modulecount=0;
for(jj=0;jj<N;jj=jj+(2**(level+1)))
begin
for (jx=jj; jx<jj+(2**level); jx=jx+1)//jj+(2**level)
begin
IN[modulecount] <=OUT[jx];
IN[modulecount+1] <=OUT[jx+(2**level)];
modulecount=modulecount+1;
end
end
end
end
When I synthesize this, it gives two errors:
ERROR:Xst:891 - "Network.v" line 161: For Statement is only supported when the new step evaluation is constant increment or decrement of the loop variable.
ERROR:Xst:2634 - "Network.v" line 161: For loop stop condition should depend on loop variable or be static.
Can't we use a non-constant increment and a non-static stop condition?
If not, how do we handle this?
Any help is appreciated.
Thanks in advance.
Synthesis tools unroll loops in order to synthesize the circuit. Therefore, only loops that iterate a constant number of times, with that number known at compile/elaboration time, are synthesizable.
When the stop value is not known, you can assume a maximum number of iterations and use that as the stop condition. Then add the original stop condition as a conditional statement inside the loop:
for (jx=jj; jx < MAX_LOOP_ITERATION; jx=jx+1)//jj+(2**level)
begin
if (jx<jj+(2**level)) // <---------- Add stop condition here
begin
IN[modulecount] <=OUT[jx];
IN[modulecount+1] <=OUT[jx+(2**level)];
modulecount=modulecount+1;
end
end
If N is not a constant, the outer loop should also be fixed using a similar conditional statement. The increment must be a constant as well: step jj by a fixed amount each time, and use a conditional statement inside the loop so that the body only acts on the iterations where jj is a multiple of 2**(level+1).
Obviously, you need to be careful as a high max number may increase your worst case delay and the minimum clock cycle time.
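The guard technique can be checked outside the synthesis tool. The following Python sketch is only a software model: it compares the indices visited by the original variable-bound inner loop with those visited by the fixed-bound loop plus guard. MAX_LOOP_ITERATION = 8 is an assumed value (it stands in for N), and the function names are hypothetical.

```python
MAX_LOOP_ITERATION = 8  # assumed static upper bound; must be >= jj + 2**level

def variable_bound(jj, level):
    """Original form: the stop condition depends on run-time values."""
    return list(range(jj, jj + 2 ** level))

def fixed_bound(jj, level):
    """Rewritten form: static stop value, original condition as a guard."""
    visited = []
    for jx in range(jj, MAX_LOOP_ITERATION):   # stop value is a constant
        if jx < jj + 2 ** level:               # original stop condition
            visited.append(jx)
    return visited

# Same butterfly offsets as the question, for N = 8.
for level in range(3):
    for jj in range(0, 8, 2 ** (level + 1)):
        assert fixed_bound(jj, level) == variable_bound(jj, level)
print("guarded loop visits the same indices")
```

The two forms agree only while the static bound covers the largest jj + 2**level; in hardware every guarded iteration is still unrolled, which is where the extra area and delay come from.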
// ll, level and K have to be declared.
always @ (posedge y_ndd[0] or posedge J)
begin
if(J==1'b1)
begin
for (idx=0; idx<N/2; idx=idx+1)
begin
IN[2*idx] <= X[idx*2*X_WDTH+: 2*X_WDTH];
IN[2*idx+1] <= X[(idx+N/2)*2*X_WDTH+: 2*X_WDTH];
end
end
else
begin
ll=ll+1;
modulecount=0;
for(level=0;level<K;level=level+1) //K time you need to execute
begin
if(ll==level)
begin
for(jj=0;jj<N;jj=jj+(2**(level+1)))
begin
for (jx=jj; jx<jj+(2**level); jx=jx+1)
begin
IN[modulecount] <=OUT[jx];
IN[modulecount+1] <=OUT[jx+(2**level)];
modulecount=modulecount+1;
end
end
end
end
//ll=ll+1;
end
end
You can try this; it should work. The drawback is that the outer loop will execute K times.

Replace chars in a HTML string - Except Tags

I need to go through an HTML string and replace characters with 0 (zero), except tags, spaces and line breaks. I wrote the code below, but it is too slow. Can someone please help me make it faster?
procedure TForm1.btn1Click(Sender: TObject);
var
Txt: String;
Idx: Integer;
Tag: Boolean;
begin
Tag := False;
Txt := mem1.Text;
For Idx := 1 to Length(Txt) Do // Delphi strings are 1-based
Begin
If (Txt[Idx] = '<') Then
Tag := True Else
If (Txt[Idx] = '>') Then
Begin
Tag := False;
Continue;
end;
If Tag Then Continue;
If (not (Txt[Idx] in [#10, #13, #32])) Then
Txt[Idx] := '0';
end;
mem2.Text := Txt;
end;
The HTML text will never have "<" or ">" outside tags (in the middle of text), so I do not need to worry about this.
Thank you!
That looks pretty straightforward. It's hard to be sure without profiling the code against the data you're using. (Profiling is always a good idea; if you need to optimize Delphi code, try running it through Sampling Profiler first to get an idea of where you're actually spending all your time.) But if I had to make an educated guess, I'd say your bottleneck is this line:
Txt[Idx] := '0';
As part of the compiler's guarantee of safe copy-on-write semantics for the string type, every write to an individual element (character) of a string involves a hidden call to the UniqueString routine. This makes sure that you're not changing a string that something else, somewhere else, holds a reference to.
In this particular case, that's not necessary, because you got the string fresh in the start of this routine and you know it's unique. There's a way around it, if you're careful.
CLEAR AND UNAMBIGUOUS WARNING: Do not do what I'm about to explain without making sure you have a unique string first! The easiest way to accomplish this is to call UniqueString manually. Also, do not do anything during the loop that could assign this string to any other variable. While we're doing this, it's not being treated as a normal string. Failure to heed this warning can cause data corruption.
OK, now that that's been explained, you can use a pointer to access the characters of the string directly, and get around the compiler's safeguards, like so:
procedure TForm1.btn1Click(Sender: TObject);
var
Txt: String;
Idx: Integer;
Tag: Boolean;
current: PChar; //pointer to a character
begin
Tag := False;
Txt := mem1.Text;
UniqueString(txt); //very important
if length(txt) = 0 then
Exit; //If you don't check this, the next line will raise an AV on a blank string
current := @txt[1];
dec(current); //you need to start before element 1, but the compiler won't let you
//assign to element 0
For Idx := 0 to Length(Txt) - 1 Do
Begin
inc(current); //put this at the top of the loop, to handle Continue cases correctly
If (current^ = '<') Then
Tag := True Else
If (current^ = '>') Then
Begin
Tag := False;
Continue;
end;
If Tag Then Continue;
If (not (current^ in [#10, #13, #32])) Then
current^ := '0';
end;
mem2.Text := Txt;
end;
This changes the metaphor. Instead of indexing into the string as an array, we're treating it like a tape, with the pointer as the head, moving forward one character at a time, scanning from beginning to end, and changing the character under it when appropriate. No redundant calls to UniqueString, and no repeatedly calculating offsets, which means this can be a lot faster.
Be very careful when using pointers like this. The compiler's safety checks are there for a good reason, and using pointers steps outside of them. But sometimes, they can really help speed things up in your code. And again, profile before trying anything like this. Make sure that you know what's slowing things down, instead of just thinking you know. If it turns out to be something else that's running slow, don't do this; find a solution to the real problem instead.
Edit: Looks like I was wrong - UniqueString is not the problem. The actual bottleneck seems to be accessing the string by character. Given that my entire answer was irrelevant, I've completely replaced it.
If you use a PChar to avoid recalculating the string offset, while still updating the string via Txt[Idx], the method is much faster (5 seconds down to 0.5 seconds in my test of 1000 runs).
Here's my version:
procedure TForm1.btn1Click(Sender: TObject);
var
Idx: Integer;
Tag: Boolean;
p : PChar;
Txt : string;
begin
Tag := False;
Txt := Mem1.Text;
p := PChar(txt);
Dec(p);
For Idx := 0 to Length(Txt) - 1 Do
Begin
Inc(p);
If (not Tag and (p^ = '<')) Then begin
Tag := True;
Continue;
end
Else If (Tag and (p^ = '>')) Then
Begin
Tag := False;
Continue;
end;
If Tag Then Continue;
If (not (p^ in [#10, #13, #32])) Then begin
Txt[Idx + 1] := '0'; // Idx is 0-based here, but string indexes are 1-based
end;
end;
mem2.Text := Txt;
end;
I did some profiling and came up with this solution.
A test for > #32 instead of [#10,#13,#32] gains some speed (thanks @DavidHeffernan).
A better logic in the loop also gives a bit extra speed.
Accessing the string exclusively with the help of a PChar is more effective.
procedure TransformHTML( var Txt : String);
var
IterCnt : Integer;
PTxt : PChar;
tag : Boolean;
begin
PTxt := PChar(Txt);
Dec(PTxt);
tag := false;
for IterCnt := 0 to Length(Txt)-1 do
begin
Inc(PTxt);
if (PTxt^ = '<') then
tag := true
else
if (PTxt^ = '>') then
tag := false
else
if (not tag) and (PTxt^ > #32) then
PTxt^ := '0';
end;
end;
This solution is about 30% faster than Mason's solution and 2.5 times faster than Blorgbeard's.
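When tuning pointer code like this, it helps to have a slow but obviously correct model to compare outputs against. Here is a minimal Python sketch of the same single-pass state machine; the name transform_html is hypothetical and the sketch says nothing about Delphi performance.

```python
def transform_html(txt):
    """Replace every visible character outside tags with '0'."""
    out = list(txt)            # mutable copy, one slot per character
    in_tag = False
    for i, ch in enumerate(out):
        if ch == "<":
            in_tag = True      # entering a tag: leave it untouched
        elif ch == ">":
            in_tag = False     # leaving the tag
        elif not in_tag and ch > " ":
            out[i] = "0"       # same test as PTxt^ > #32
    return "".join(out)

print(transform_html("<p>Hi there</p>\n"))
```

Feeding the same input to the Delphi routine and to this model is a cheap way to confirm that an optimization did not change behavior.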

How to use goto label in MySQL stored function

I would like to use goto in a MySQL stored function.
How can I do that?
Sample code is:
if (action = 'D') then
if (rowcount > 0) then
DELETE FROM datatable WHERE id = 2;
else
SET p=CONCAT('Can not delete',@b);
goto ret_label;
end if;
end if;
Label: ret_label;
return 0;
There are GOTO cases which can't be implemented in MySQL, like jumping backwards in code (and a good thing, too).
But for something like your example where you want to jump out of everything to a final series of statements, you can create a BEGIN / END block surrounding the code to jump out of:
aBlock:BEGIN
if (action = 'D') then
if (rowcount > 0) then
DELETE FROM datatable WHERE id = 2;
else
SET p=CONCAT('Can not delete',@b);
LEAVE aBlock;
end if;
end if;
END aBlock;
return 0;
Since your code is just some nested IFs, the construct is unnecessary in the given code. But it makes more sense for LOOP/WHILE/REPEAT to avoid multiple RETURN statements from inside a loop and to consolidate final processing (a little like TRY / FINALLY).
There is no GOTO in MySQL Stored Procs. You can refer to this post:
MySQL :: Re: Goto Statement

SQL Server change font in html string

I have strings stored in my database formatted as HTML, and users can change the font size. That's fine, but I need to make a report and the font sizes all need to be the same. So, if I have the following HTML, I want to modify it to have a font size of 10:
<HTML><BODY><DIV STYLE="text-align:Left;font-family:Tahoma;font-style:normal;font-weight:normal;font-size:11;color:#000000;"><DIV><DIV><P><SPAN>This is my text to display.</SPAN></P></DIV></DIV></DIV></BODY></HTML>
I have a user defined function, but apparently, I can't use wildcards in a REPLACE, so it doesn't actually do anything:
ALTER FUNCTION [dbo].[udf_SetFont]
(@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
RETURN REPLACE (@HTMLText, 'font-size:%;', 'font-size:10;')
END
(Of course, it would be even better if I sent the font size as a parameter, so I could change it to whatever.)
How do I modify this to change any string so the font size is 10?
This appears to work, although I've only tried it on one string (which has the font set in 2 places). I started with code that strips ALL html and modified it to only look for and change 'font-size:*'. I suspected there would be issues if the font size is 9 or less (1 character) and I'm changing it to 10 (2 chars), but it seems to work for that too.
ALTER FUNCTION [dbo].[udf_ChangeFont]
(@HTMLText VARCHAR(MAX), @FontSize VARCHAR(2))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('font-size:',@HTMLText)
SET @End = CHARINDEX(';',@HTMLText,CHARINDEX('font-size:',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'font-size:' + @FontSize + ';')
SET @Start = CHARINDEX('font-size:',@HTMLText, @End+2)
SET @End = CHARINDEX(';',@HTMLText,CHARINDEX('font-size:',@HTMLText, @End+2))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
DECLARE @HTML NVarChar(2000) = '
<HTML>
<BODY>
<DIV STYLE="text-align:Left;font-family:Tahoma;font-style:normal;font-weight:normal;font-size:11;color:#000000;">
<DIV>
<DIV>
<P><SPAN>This is my text to display.</SPAN></P>
</DIV>
</DIV>
</DIV>
</BODY>
</HTML>';
DECLARE @X XML = @HTML;
WITH T AS (
SELECT C.value('.', 'VarChar(1000)') StyleAttribute
FROM @X.nodes('//@STYLE') D(C)
)
SELECT *
FROM T
WHERE T.StyleAttribute LIKE '%font-size:%';
From here I'd use a CLR function to split the StyleAttribute column on ;. Then look for the piece(s) that begin with font-size: and split again on :. TryParse the second element of that result and if it isn't 10, replace it. You'd then build up your string to get the value that StyleAttribute should have. From there you can do a REPLACE looking for the original value (from the table above) and substituting the output of the CLR function.
Nasty problem...good luck.
As Yuck said, SQL Server string functions are pretty limited. You'll eventually run into a wall where your best bet is to resort to non-SQL solutions.
If you absolutely need to store HTML with embedded styles as you currently have, but also have the flexibility to revise your data model, you might want to consider adding a second database column to your table. The second column would store the style-free version of the HTML. You could parse out the styling at the application layer. That would make it a lot easier to view the contents in future reports and other scenarios.
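If the rewrite does move to the application layer, the wildcard replacement that REPLACE cannot do becomes a one-line regular expression. Here is a minimal Python sketch; the function name set_font_size and the pattern are assumptions for illustration, and the pattern relies on each font-size value ending at a ';' or at the closing quote of the attribute.

```python
import re

# "font-size:" plus whatever value follows, up to the next ';' or '"'.
FONT_SIZE = re.compile(r'font-size\s*:\s*[^;"]*', re.IGNORECASE)

def set_font_size(html, size=10):
    """Force every font-size declaration in the string to the given size."""
    return FONT_SIZE.sub("font-size:%d" % size, html)

sample = '<DIV STYLE="font-weight:normal;font-size:11;color:#000000;">text</DIV>'
print(set_font_size(sample))
```

The pattern does not distinguish style attributes from body text, so a literal "font-size:" inside the displayed content would be rewritten too; parsing the HTML properly avoids that at the cost of more code.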