How can I tokenize a string in MySQL? - mysql

My project is importing a sizable collection (500K+ rows) of data from flat Excel files, which are manually created by a team of people. The problem is that it all needs to be normalized for client searching. For example, the company field will have multiple spellings of the same company and will include branches, such as "IBM", "IBM Inc." and "IBM Japan". Additionally, I have product names that are alphanumeric, such as "A46-Rhizonme Pentahol", which SOUNDEX alone cannot handle.
I can solve the issue in the long term by having all the data input be through a web form, with an AJAX auto-suggest. Until then however, I still need to deal with the massive collection of existing data. This brings me to what I believe is a good process, based on what I've read here:
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
Steps to create a custom Fuzzy Logic Lookup, and Fuzzy Logic Grouping
tokenize strings into keywords
calculate keyword TF-IDF (term frequency-inverse document frequency)
calculate levenshtein distance between keywords
calculate Soundex on available alpha strings
determine context of keywords
place keywords, based on context, into separate DB tables, such as "Companies", "Products", "Ingredients"
I've been Googling, searching StackOverflow, reading over MySQL.com discussions, etc. about this issue, to attempt to find a prebuilt solution. Any ideas?

So, I gave up and just made a string tokenizing function for mysql. Here's the code:
DELIMITER //
CREATE DEFINER = `root`@`localhost` FUNCTION `strtok`(in_string VARCHAR(255), delims VARCHAR(255), str_replace VARCHAR(255))
RETURNS varchar(255)
DETERMINISTIC
BEGIN
    DECLARE str_len, delim_len, a, b, is_delim INT;
    DECLARE z, y VARBINARY(1);
    DECLARE str_out VARBINARY(256);
    SET str_len = CHAR_LENGTH(in_string), delim_len = CHAR_LENGTH(delims), a = 1, b = 1, is_delim = 0, str_out = '';
    -- get each character
    WHILE a <= str_len DO
        SET z = SUBSTRING(in_string, a, 1);
        -- loop through the delimiters
        WHILE b <= delim_len AND is_delim < 1 DO
            SET y = SUBSTRING(delims, b, 1);
            -- search for each delimiter
            IF z = y THEN
                SET is_delim = 1;
            END IF;
            SET b = b + 1;
        END WHILE;
        IF is_delim = 1 THEN
            SET str_out = CONCAT(str_out, str_replace);
        ELSE
            SET str_out = CONCAT(str_out, z);
        END IF;
        -- reset the delimiter index (SUBSTRING positions start at 1)
        SET b = 1;
        SET is_delim = 0;
        SET a = a + 1;
    END WHILE;
    RETURN str_out;
END //
DELIMITER ;
It's called like this:
strtok("this.is.my.input.string",".,:;"," | ")
and will return
"this | is | my | input | string"
I hope someone else finds this useful. Cheers!
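On MySQL 8.0 or later, the same delimiter normalization can probably be done without a custom function. The following is a minimal sketch, assuming REGEXP_REPLACE() is available (it is not in the 5.x series) and using the same delimiter set as the call above. The second query shows one way to pull out the n-th token with nested SUBSTRING_INDEX() calls; the literal strings are only illustrative input.

```sql
-- Replace every character in the class [.,:;] with ' | ' (MySQL 8.0+).
SELECT REGEXP_REPLACE('this.is.my.input.string', '[.,:;]', ' | ') AS tokens;
-- -> 'this | is | my | input | string'

-- Pick the 3rd token from a '|'-delimited string:
-- the inner call keeps everything up to the 3rd delimiter,
-- the outer call keeps what follows the last remaining delimiter.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('this|is|my|input|string', '|', 3), '|', -1) AS third_token;
-- -> 'my'
```

Neither query produces one row per token; for that, the tokens would still need to be joined against a numbers table or handled in application code.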

You should check out Google Refine (since renamed OpenRefine).
Google Refine is a power tool for working with messy data, cleaning it
up, transforming it from one format into another, extending it with
web services, and linking it to databases like Freebase.

Related

Stored Procedure With Function giving me errors in Oracle

I have a stored procedure and a function in Oracle, and I call the function from the stored procedure. The function CalculateIncomeTax is what is giving me errors. In MSSQL this type of update is possible; I have done it before by calling the function in the stored procedure. From what I have read, the suggested answer is to use a package, because I cannot use a function to update a table from another table. If you have any idea, please tell me. The error I get is:
table string.string is mutating, trigger/function may not see it
Cause: A trigger (or a user defined plsql function that is referenced in this statement) attempted to look at (or modify) a table that was in the middle of being modified by the statement which fired it.
Action: Rewrite the trigger (or function) so it does not read that table.
This is the function:
CREATE OR REPLACE FUNCTION CalculateIncomeTax(periodId NVARCHAR2,
    employeeId NVARCHAR2, taxableIncome NUMBER) RETURN NUMBER
AS
    IncomeTax NUMBER(18,4);
    Taxable NUMBER(18,4);
BEGIN
    SELECT SUM(CASE WHEN (taxableIncome > T.TAX_CUMMULATIVE_AMOUNT)
               THEN (taxableIncome - T.TAX_CUMMULATIVE_AMOUNT) * T.TAX_PERCENTAGE / 100
               ELSE 0.00 END) INTO IncomeTax
    FROM TAX_LAW T JOIN PAY_GROUP P ON P.PAY_FORMULA_ID = T.TAX_FORMULA_ID
    JOIN PAYROLL_MASTER PP ON P.PAY_CODE = PP.PAY_PAY_GROUP_CODE
    WHERE PP.PAY_EMPLOYEE_ID = employeeId AND PP.PAY_PERIOD_CODE = periodId;
    IF IncomeTax IS NULL THEN
        IncomeTax := 0;
    END IF;
    RETURN IncomeTax;
END;
/
This is the stored procedure
CREATE OR REPLACE PROCEDURE PROCESSPAYROLLMASTER (periodcode
VARCHAR2) AS BEGIN
INSERT INTO PAYROLL_MASTER
(
PAY_PAYROLL_ID,PAY_EMPLOYEE_ID ,PAY_EMPLOYEE_NAME,PAY_SALARY_GRADE_CODE
,PAY_SALARY_NOTCH_CODE,PAY_BASIC_SALARY,PAY_TOTAL_ALLOWANCE
,PAY_TOTAL_CASH_BENEFIT,PAY_MEDICAL_BENEFIT,PAY_TOTAL_BENEFIT
,PAY_TOTAL_DEDUCTION,PAY_GROSS_SALARY,PAY_TOTAL_TAXABLE,PAY_INCOME_TAX
,PAY_TAXABLE,PAY_PERIOD_CODE,PAY_BANK_CODE,PAY_BANK_NAME,PAY_BANK_ACCOUNT_NO
,PAY_PAY_GROUP_CODE )
SELECT
1,
E.EMP_ID AS PAY_EMPLOYEE_ID ,
E.EMP_FIRST_NAME || ' ' || E.EMP_LAST_NAME AS PAY_EMPLOYEE_NAME,
E.EMP_RANK_CODE,
'CODE',
(SC.SAL_MINIMUM_AMOUNT+( SN.SAL_SALARY_PERCENTAGE *
SC.SAL_MINIMUM_AMOUNT)/100) AS PAY_BASIC_SALARY,
0,
0,
0,
0,
0,
0,
0,
0,
0,
periodcode,
'BANKCODE',
'BANKNAME',
'BANKNUMBER',
'GENERAL'
FROM EMPLOYEE E
LEFT JOIN SALARY_SCALE SC ON SC.SAL_RANK_CODE = E.EMP_RANK_CODE
LEFT JOIN SALARY_NOTCH SN ON SC.SAL_ID = SN.SAL_SALARYSCALE_ID
WHERE E.EMP_RANK_CODE = SC.SAL_RANK_CODE AND E.EMP_STATUS=2;
CALCULATEALLOWANCE(v_payrollId,periodcode);
CALCULATECASHBENEFITS(v_payrollId,periodcode);
CALCULATEDEDUCTIONS(v_payrollId,periodcode);
-- UPDATE PAYROLL PAY_INCOME_TAX
UPDATE PAYROLL_MASTER PM SET PM.PAY_INCOME_TAX = CalculateIncomeTax(PM.PAY_PERIOD_CODE,PM.PAY_EMPLOYEE_ID,PM.PAY_TOTAL_TAXABLE) WHERE PM.PAY_PAYROLL_ID = v_payrollId;
UPDATE PAYROLL_PROCESS set PAY_CANCELLED = 1 WHERE PAY_PAY_GROUP_CODE='GENERAL' AND PAY_PERIOD_CODE=periodcode
AND PAY_ID<>v_payrollId;
COMMIT;
END ;
/
The function is querying the same table you are updating, which is what the error is reporting. As it happens you are not changing the value of the column you're querying, but Oracle doesn't check to that level - not least because there could be, for instance, a trigger that has less obvious side-effects.
The best solution really would be to not have to update at all, and to calculate and set all the values as part of the original insert, by joining to all the relevant tables. But you are already calling other procedures which are, presumably, updating some of the values you're inserting as zeros, including pay_total_taxable.
Unless you're able to reevaluate those as well, you may be stuck with doing a further update. In that case, you could remove the reference to the payroll_master table from the function and instead pass in the relevant data.
I think this is equivalent, though without the table structures, sample data and knowledge of what the other procedures are doing it's hard to be sure (so this is untested, obviously):
create or replace function calculateincometax (
  p_periodid nvarchar2,
  p_employeeid nvarchar2,
  p_paypaygroupcode payroll_master.pay_pay_group_code%type,
  p_taxableincome number
) return number as
  l_incometax number(18, 4);
begin
  select coalesce(sum(case when p_taxableincome > t.tax_cummulative_amount
      then (p_taxableincome - t.tax_cummulative_amount) * t.tax_percentage / 100
      else 0 end), 0)
  into l_incometax
  from tax_law t
  join pay_group p
  on p.pay_formula_id = t.tax_formula_id
  where p.pay_code = p_paypaygroupcode;
  return l_incometax;
end;
/
and then include the extra argument in your call:
update payroll_master pm
set pm.pay_income_tax = calculateincometax(pm.pay_period_code, pm.pay_employee_id,
pm.pay_pay_group_code, pm.pay_total_taxable)
where pm.pay_payroll_id = v_payrollid;
Although v_payrollid isn't defined in what you've shown, so even that isn't entirely clear.
I've also modified the function argument and local variable names with prefixes to remove potential ambiguity (which you seem to do by removing underscores from the names), removed the unused variable, and added a coalesce() call in place of the separate null check. Those things aren't directly relevant to the approach though.

Add 2 SUM CASE statements as a column update in MySQL

I think I have this almost figured out but after 50+ Google searches, I ask this: How can I add a column to a db that is essentially a sumif function? I've seen many related questions as simple Select statements for just looking at the table in a mini table but I was hoping to actually add a column that would show these totals. I'm taking this and then pulling the data into R for further analysis.
In Excel it works like so with [ ] denoting columns of a table. It is split into 2 areas via the Serial #. The first 6 digits of the serial indicate the "parent" and the later half indicate the "child". One parent can have multiple children, as seen with BSA101 below. What I'm trying to do is sum all the costs that went into making the child (parent + child costs). So the total parent costs, get allocated to both children below.
"Packing" is the last step so this is where I'd want the totals to end up so there are no duplicates.
Example
=IF(LEN([serial])>6,IF([process]="Packing",SUMIF([serial],[@serial],[process_cost])+SUMIF([serial],LEFT([@serial],6),[process_cost]),""),"")
serial     process  process_cost  total_child_cost
BSA101A33  Packing  10            160
BSA101A34  Packing  10            195
BSA101     Cast     50            ""
BSA101     Mold     30            ""
BSA101     Mold     30            ""
BSA101A33  Finish   15            ""
BSA101A34  Finish   25            ""
BSA101A33  Polish   25            ""
BSA101A34  Polish   50            ""
^desired table result above
MySQL attempt (this post helped me: Adding Case Statements):
SQL Fiddle: http://sqlfiddle.com/#!9/b0e58
Here I've added a column in data called total_cost. Right now I'm getting an "Invalid use of group function" error. From what I've read, the fix involves a HAVING clause, but I'm not sure where to place it.
UPDATE data
SET total__child_cost =
    (CASE WHEN length(serial) > 6
               AND process = 'Packing'
          THEN
               IF(serial = serial, sum(process_cost), 0) END)
    +
    (CASE WHEN left(serial, 6) = serial
          THEN sum(process_cost)
     END)
This ended up being the solution.
DELIMITER //
CREATE FUNCTION `getParent1`(inSerialn VARCHAR(20)) RETURNS int(11)
BEGIN
    DECLARE parent VARCHAR(20);
    DECLARE result INT;
    SET parent = LEFT(inSerialn, 6);
    SET result = (SELECT SUM(process_cost) FROM mfng.data WHERE serialn = parent);
    RETURN result;
END //
CREATE FUNCTION `getChild1`(inSerialn VARCHAR(20)) RETURNS int(11)
BEGIN
    DECLARE result INT;
    SET result = (SELECT SUM(process_cost) FROM mfng.data WHERE serialn = inSerialn);
    RETURN result;
END //
DELIMITER ;
UPDATE mfng.data SET total_child_cost =
    (CASE WHEN LENGTH(serialn) > 6 AND pdn_process = 'Packing'
          THEN getChild1(serialn) + getParent1(serialn)
          ELSE 0 END);
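The same totals can probably be computed without stored functions by joining the table to two aggregated copies of itself. This is an untested sketch that assumes the table and column names from the solution above (mfng.data, serialn, pdn_process, process_cost, total_child_cost). The derived tables are there because MySQL does not allow a plain subquery on the table being updated. Unlike the function-based version, rows that are not 'Packing' rows are left untouched instead of being set to 0, and a missing parent counts as 0 instead of producing NULL.

```sql
UPDATE mfng.data AS d
JOIN (
    -- total cost per exact serial (the "child" total)
    SELECT serialn, SUM(process_cost) AS cost
    FROM mfng.data
    GROUP BY serialn
) AS c ON c.serialn = d.serialn
LEFT JOIN (
    -- total cost per serial again, matched on the 6-character parent prefix
    SELECT serialn, SUM(process_cost) AS cost
    FROM mfng.data
    GROUP BY serialn
) AS p ON p.serialn = LEFT(d.serialn, 6)
SET d.total_child_cost = c.cost + COALESCE(p.cost, 0)
WHERE CHAR_LENGTH(d.serialn) > 6
  AND d.pdn_process = 'Packing';
```

With the sample data in the question, BSA101A33 would get 50 (its own steps) plus 110 (the BSA101 steps), which matches the desired 160.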

MySQL user-defined function returns incorrect value when used in a SELECT statement

I ran into a problem when calling a user-defined function in MySQL. The computation is very simple, but I can't grasp where or why it goes wrong. Here's the thing.
So I created this function:
DELIMITER //
CREATE FUNCTION fn_computeLoanAmortization (_empId INT, _typeId INT)
RETURNS DECIMAL(17, 2)
BEGIN
    SET @loanDeduction = 0.00;
    SELECT TotalAmount, PeriodicDeduction, TotalInstallments, DeductionFlag
    INTO @totalAmount, @periodicDeduction, @totalInstallments, @deductionFlag
    FROM loans_table
    WHERE TypeId = _typeId AND EmpId = _empId;
    IF (@deductionFlag = 1) THEN
        SET @remaining = @totalAmount - @totalInstallments;
        IF (@remaining < @periodicDeduction) THEN
            SET @loanDeduction = @remaining;
        ELSE
            SET @loanDeduction = @periodicDeduction;
        END IF;
    END IF;
    RETURN @loanDeduction;
END;//
DELIMITER ;
If I call it like this, it works fine:
SELECT fn_computeLoanAmortization(3, 4)
But if I call it inside a SELECT statement, the result becomes erroneous:
SELECT Id, fn_computeLoanAmortization(Id, 4) AS Amort FROM emp_table
There's only one entry in loans_table, so the above statement should return a value in the Amort column for only one row. Instead, many other rows show the same Amort value as the row with the matching entry, which should not be the case.
Has anyone run into this kind of problem? Or I might have done something wrong on my end. Kindly enlighten me.
Thank you very much.
EDIT:
By erroneous, I meant it like this:
loans_table has one record
EmpId = 1
TypeId = 2
PeriodicDeduction = 100
TotalAmount = 1000
TotalInstallments = 200
DeductionFlag = 1
emp_table has several rows
EmpId = 1
Name = Paolo
EmpId = 2
Name = Nikko
...
EmpId = 5
Name = Ariel
when I query the following statements, I get the correct value:
SELECT fn_computeLoanAmortization(1, 2)
SELECT Id, fn_computeLoanAmortization(Id, 2) AS Amort FROM emp_table WHERE EmpId = 1
But when I query this statement, I get incorrect values:
SELECT Id, fn_computeLoanAmortization(Id, 2) AS Amort FROM emp_table
Resultset would be:
EmpId | Amort
--------------------
1 | 100
2 | 100 (this should be 0, but the query returns 100)
3 | 100 (same error here)
...
5 | 100 (same error here up to the last record)
Inside your function, the variables you use to retrieve the values from loans_table are not local to the function; they are session (user-defined) variables. When the SELECT inside the function does not find any row, those variables still hold the values from the previous execution of the function.
Use real local variables instead. To do that, use variable names without the @ prefix and declare them at the beginning of the function. See this answer for more details.
I suspect the problem is that the variables in the INTO are not re-set when there is no matching row.
Just set them before the INTO:
BEGIN
    SET @loanDeduction = 0.00;
    SET @totalAmount = 0;
    SET @periodicDeduction = 0;
    SET @totalInstallments = 0;
    SET @deductionFlag = 0;
    SELECT TotalAmount, PeriodicDeduction, TotalInstallments, DeductionFlag
    . . .
You might just want to set them to NULL.
Or, switch your logic to use local variables:
SET v_loanDeduction = 0.00;
SET v_totalAmount = 0;
SET v_periodicDeduction = 0;
SET v_totalInstallments = 0;
SET v_deductionFlag = 0;
And so on.
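Putting the local-variable suggestion together, the whole function might look like the sketch below. The DECLARE types are assumptions (the question does not show the loans_table definition) and should be matched to the real columns; the v_ names and the READS SQL DATA characteristic are likewise additions for illustration.

```sql
DELIMITER //
CREATE FUNCTION fn_computeLoanAmortization (_empId INT, _typeId INT)
RETURNS DECIMAL(17, 2)
READS SQL DATA
BEGIN
    -- Local variables are re-initialized on every call, unlike @session variables.
    -- The types below are assumptions; adjust them to the loans_table columns.
    DECLARE v_loanDeduction     DECIMAL(17, 2) DEFAULT 0.00;
    DECLARE v_totalAmount       DECIMAL(17, 2) DEFAULT 0;
    DECLARE v_periodicDeduction DECIMAL(17, 2) DEFAULT 0;
    DECLARE v_totalInstallments DECIMAL(17, 2) DEFAULT 0;
    DECLARE v_deductionFlag     INT DEFAULT 0;
    DECLARE v_remaining         DECIMAL(17, 2) DEFAULT 0;

    -- When no row matches, the variables keep their defaults
    -- (MySQL raises a "no data" warning, not an error).
    SELECT TotalAmount, PeriodicDeduction, TotalInstallments, DeductionFlag
    INTO v_totalAmount, v_periodicDeduction, v_totalInstallments, v_deductionFlag
    FROM loans_table
    WHERE TypeId = _typeId AND EmpId = _empId;

    IF v_deductionFlag = 1 THEN
        SET v_remaining = v_totalAmount - v_totalInstallments;
        IF v_remaining < v_periodicDeduction THEN
            SET v_loanDeduction = v_remaining;
        ELSE
            SET v_loanDeduction = v_periodicDeduction;
        END IF;
    END IF;

    RETURN v_loanDeduction;
END //
DELIMITER ;
```

With the sample data from the question, employee 1 would get 100 (800 remaining is not less than the periodic deduction of 100), and employees without a loan row would get 0.00.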

Conversion of Foxpro code to Set-Based MySQL Query

I'm trying to convert some Visual FoxPro code to a set-based MySQL query. The following is the code segment from FoxPro:
lnFound=0
IF LnFound = 0 .and. rcResult = "ALL" AND PcOpOrIp = "OP"
    SELECT PFile
    LcTag = ORDER()
    SET ORDER TO TAG PtcntlNm
    =SEEK(LcPatientNo)
    SCAN WHILE PtcntlNm = LcPatientNo
        IF GcMResult <= "0"
            GcMResult = "1-7MAT-PTC"
        ENDIF
        IF MONTH(cSRa.Fromdate) = MONTH(pFile.Fromdate) ;
                .AND. pFile.ThruDate >= cSRa.ThruDate
            ** Check From/Thru Date against pFile
            IF (ABS(cSRa.totalchrg) = (pFile.BDeduct+pFile.Deduct+pFile.Coinsur)) .OR. cSRa.Tchrgs = (pFile.BDeduct+pFile.Deduct+pFile.Coinsur) .or. (ABS(cSRa.totalchrg) = pFile.Total .OR. cSRa.Tchrgs = pFile.Total)
                IF lnFound = 0
                    gcRecid = recid
                    gcmResult=rcResult
                ENDIF
                lnFound = lnFound + 1
                gcUNrECID = gcunRecid + IIF(EMPTY(gCUNreCID),Recid,[,]+recid)
            ENDIF
        ENDIF
    ENDSCAN
    SELECT PFile
    SET ORDER TO &LcTag
ENDIF
I have a table named pfile which I'm trying to join with another table named csra. The main aim is to set the record id (gcrecid) based on the conditions of the three nested IF statements. After the gcrecid variable is set, lnfound is set to one, so the condition of the third IF statement is false from the second iteration onwards.
Here is the MySQL statement I came up with. As you can see, I have not been able to convert the code completely or efficiently.
UPDATE csra AS cs
JOIN p051331s AS p ON cs.patientno = p.ptcntlnm
SET cs.recid = p.recid
, cs.mcsult = "ALL"
, cs.lnfound = '"1"'
WHERE cs.provider = '051331'
AND cs.lnfound = "0"
AND cs.RECID IS NULL
AND month(cs.fromdate) = month(p.fromdate)
AND p.thrudate >= cs.ThruDate
AND ABS(cs.totalchrg) = (p.bdeduct+p.deduct+p.coinsur)
OR cs.tchrgs = (p.bdeduct+p.deduct+p.coinsur)
OR ABS(cs.totalchrg) = p.total OR cs.tchrgs = p.total;
Any lead in this regard will be much appreciated, as I've been working on this procedure for a couple of days with no noticeable results.
Judging from this partial VFP code (which is not clear about the variables it uses), there is nothing to convert to a set-based query at all. The corresponding MySQL, MS SQL or any other SQL backend code would simply be "nothing", i.e. this would be equivalent:
-- Hello to MySQL or MS SQL
PS: In your attempt to convert this to an UPDATE, inner joining with csra is wrong. It is not joined in the VFP code; unless there is a relation set on the fields, the csra values are constants (they point to the "current row" values in csra only). You would want to make them parameters, as with the rest of the memory variables (it is not clear from the code which ones are memory variables).
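Independently of whether the join is the right translation, the attempted UPDATE has an operator-precedence issue worth noting: in MySQL, AND binds tighter than OR, so the trailing OR branches are evaluated outside the provider, lnfound and date filters. The sketch below first demonstrates the precedence rule and then shows the WHERE clause with the four amount checks grouped the way the VFP IF groups them. The SET clause is copied unchanged from the question, and the statement is untested against the real tables.

```sql
-- AND binds tighter than OR, so an unparenthesized OR escapes the other filters.
SELECT 0 AND 0 OR 1   AS without_parens,  -- parsed as (0 AND 0) OR 1 -> 1
       0 AND (0 OR 1) AS with_parens;     -- -> 0

-- The question's UPDATE with the amount checks grouped in parentheses.
UPDATE csra AS cs
JOIN p051331s AS p ON cs.patientno = p.ptcntlnm
SET cs.recid = p.recid
  , cs.mcsult = "ALL"
  , cs.lnfound = '"1"'
WHERE cs.provider = '051331'
  AND cs.lnfound = "0"
  AND cs.RECID IS NULL
  AND MONTH(cs.fromdate) = MONTH(p.fromdate)
  AND p.thrudate >= cs.ThruDate
  AND (   ABS(cs.totalchrg) = (p.bdeduct + p.deduct + p.coinsur)
       OR cs.tchrgs         = (p.bdeduct + p.deduct + p.coinsur)
       OR ABS(cs.totalchrg) = p.total
       OR cs.tchrgs         = p.total);
```

Note that the VFP loop keeps only the first matching recid, whereas a joined UPDATE with several matching rows in p051331s will pick one of them without any guaranteed order.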

How to find the first number in a text field using a MySQL query?

I would like to return only the first number in a text stored in a column of a database table.
Users have entered page ranges into a field like 'p.2-5' or 'page 2 to 5' or '2 - 5'.
I am interested in the '2' here.
I tried
SELECT SUBSTR(the_field, LOCATE('2', the_field, 1)) AS 'the_number'
FROM the_table
and it works. But how to get ANY number?
I tried
SELECT SUBSTR(the_field, LOCATE(REGEXP '[0-9], the_field, 1)) AS 'the_number'
FROM the_table
but this time I get an error.
Any ideas?
Just use REGEXP_SUBSTR():
SELECT REGEXP_SUBSTR(`the_field`,'^[0-9]+') AS `the_number` FROM `the_table`;
Notes:
I'm using MySQL Server v8.0.
This pattern assumes that the_field is trimmed. Otherwise, use TRIM() first.
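Because the pattern above is anchored with ^, it only matches when the value starts with a digit. For inputs like those in the question ('p.2-5', 'page 2 to 5'), dropping the anchor should return the first run of digits wherever it occurs. A small sketch on literal strings, assuming MySQL 8.0 or later:

```sql
SELECT REGEXP_SUBSTR('p.2-5', '[0-9]+')       AS a,  -- -> '2'
       REGEXP_SUBSTR('page 2 to 5', '[0-9]+') AS b,  -- -> '2'
       REGEXP_SUBSTR('2 - 5', '[0-9]+')       AS c,  -- -> '2'
       REGEXP_SUBSTR('no digits', '[0-9]+')   AS d;  -- -> NULL (no match)
```

The result is a string; wrap it in CAST(... AS UNSIGNED) if a numeric column is needed.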
REGEXP is not a function in MySQL, but something of an operator. Returns 1 if field matches the regular expression, or 0 if it does not. You cannot use it to figure out a position in a string.
Usage:
mysql> SELECT 'Monty!' REGEXP '.*';
-> 1
As for answer to the question: I don't think there is a simple way to do that using MySQL only. You would be better off processing that field in the code, or extract values before inserting.
For the specific case where the string has the form {number}{string}{number}, there is a simple solution to get the first number. In our case we had values like
1/2,3
4-10
1,2
and we were looking for the first number in each row.
It turned out that for this case one can use the CONVERT function to convert the value into a number. MySQL will return the first number:
select convert(the_field ,SIGNED) as the_first_number from the_table
or, a more hard-core version:
SELECT
    the_field,
    @num := CONVERT(the_field, SIGNED) AS cast_num,
    SUBSTRING(the_field, 1, LOCATE(@num, the_field) + LENGTH(@num) - 1) AS num_part,
    SUBSTRING(the_field, LOCATE(@num, the_field) + LENGTH(@num)) AS txt_part
FROM the_table;
This was originally posted at the source by Eamon Daly.
What does it do?
@num := CONVERT(the_field, SIGNED) AS cast_num # tries to convert the field into a number
SUBSTRING(the_field, 1, LOCATE(@num, the_field) + LENGTH(@num) - 1) # gets the number by using the length and the location of @num in the field
SUBSTRING(the_field, LOCATE(@num, the_field) + LENGTH(@num)) # finds the rest of the string after the number
Some thoughts for future use
It's worth keeping another column that holds the first number, parsed before the row is inserted into the database. This is what we are doing these days.
Edit
I just saw that you have text like p.2-5, which means the above will not work: if the string does not start with a number, CONVERT returns zero.
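The difference between the two kinds of input can be seen directly on literal strings. This is a small illustrative query; in each case MySQL also raises a "Truncated incorrect INTEGER value" warning.

```sql
SELECT CONVERT('4-10', SIGNED)  AS leading_digits,  -- -> 4
       CONVERT('1/2,3', SIGNED) AS leading_digit,   -- -> 1
       CONVERT('p.2-5', SIGNED) AS no_leading;      -- -> 0, the digits after 'p.' are never reached
```

So a result of 0 is ambiguous here: it can mean either a literal 0 or a value that does not start with a digit.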
There's no built-in way that I know of, but here's a MySQL function you can define that will do it (I didn't code for minus signs or non-integers, but those could of course be added).
Once created, you can use it like any other function:
SELECT firstNumber(the_field) from the_table;
Here's the code:
DELIMITER //
CREATE FUNCTION firstNumber(s TEXT)
RETURNS INTEGER
COMMENT 'Returns the first integer found in a string'
DETERMINISTIC
BEGIN
    DECLARE token TEXT DEFAULT '';
    DECLARE len INTEGER DEFAULT 0;
    DECLARE ind INTEGER DEFAULT 0;
    DECLARE thisChar CHAR(1) DEFAULT ' ';
    SET len = CHAR_LENGTH(s);
    SET ind = 1;
    WHILE ind <= len DO
        SET thisChar = SUBSTRING(s, ind, 1);
        IF (ORD(thisChar) >= 48 AND ORD(thisChar) <= 57) THEN
            SET token = CONCAT(token, thisChar);
        ELSEIF token <> '' THEN
            SET ind = len + 1;
        END IF;
        SET ind = ind + 1;
    END WHILE;
    IF token = '' THEN
        RETURN 0;
    END IF;
    RETURN token;
END //
DELIMITER ;