Subsequence in MySQL/CakePHP - mysql

In my mysql table I have a field which is a 4 letter Myers-Briggs personality type. I would like to search through the table and match when the personality type matches the one in the query by having 2 aspects in common. The way I understand this, it is really just finding the longest common subsequence of the two and testing that it is >= 2
Example:
'ISTJ' would match 'INFJ', because the common subsequence 'IJ' has length 2 (>= 2)
and
'ISTJ' would not match 'INFP', because the common subsequence 'I' has length 1 (< 2)
Is there a way to do this in a mysql query? I am using CakePHP for the querying, so if you know how to do this with Cake that would also be helpful.

The Myers-Briggs personality types are positional: each letter position corresponds to one axis. This means that you can compare the two strings character by character.
Here is one method, where you just have to put in the comparison string once:
select t.*
from (select t.*,
(case when substring(t.MyerBriggs, 1, 1) = substring(const.comp, 1, 1)
then 1 else 0
end) as MB1,
(case when substring(t.MyerBriggs, 2, 1) = substring(const.comp, 2, 1)
then 1 else 0
end) as MB2,
(case when substring(t.MyerBriggs, 3, 1) = substring(const.comp, 3, 1)
then 1 else 0
end) as MB3,
(case when substring(t.MyerBriggs, 4, 1) = substring(const.comp, 4, 1)
then 1 else 0
end) as MB4
from t cross join
(select 'INFJ' as comp) const
) t
where (MB1+MB2+MB3+MB4) >= 2
You can actually simplify this in MySQL as:
select t.*
from t cross join
(select 'INFJ' as comp) const
where (if(substring(t.MyerBriggs, 1, 1) = substring(const.comp, 1, 1), 1, 0) +
if(substring(t.MyerBriggs, 2, 1) = substring(const.comp, 2, 1), 1, 0) +
if(substring(t.MyerBriggs, 3, 1) = substring(const.comp, 3, 1), 1, 0) +
if(substring(t.MyerBriggs, 4, 1) = substring(const.comp, 4, 1), 1, 0)
) >= 2
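The positional comparison above can also be sketched outside the database, for example to unit-test the matching rule before putting it into a query. This is only a minimal illustration in Python; the function names `shared_axes` and `is_match` and the default threshold are assumptions made for this example, not part of the original answer.

```python
def shared_axes(a: str, b: str) -> int:
    """Count positions at which two 4-letter type codes agree."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected 4-letter type codes")
    # Compare letter by letter; True counts as 1 when summed.
    return sum(x == y for x, y in zip(a.upper(), b.upper()))


def is_match(a: str, b: str, threshold: int = 2) -> bool:
    """True when the two codes agree on at least `threshold` axes."""
    return shared_axes(a, b) >= threshold


print(is_match("ISTJ", "INFJ"))  # I and J agree
print(is_match("ISTJ", "INFP"))  # only I agrees
```

Summing per-position equality is the same idea as adding the four `if(...)` terms in the query.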

If I understand the Myers-Briggs scheme properly, there are two possibilities for each of the four categorisation axes, and the order of the letters is constant (and therefore carries no meaning).
In this case, you could use four two-state columns like the below, instead of one string:
CREATE TABLE profile (
user_id INT,
EI ENUM ('E', 'I'),
SN ENUM ('S', 'N'),
TF ENUM ('T', 'F'),
JP ENUM ('J', 'P')
);
Profile 'ISTJ' would be inserted like this:
INSERT INTO profile VALUE (1, 'I', 'S', 'T', 'J');
Matching with profile 'INFJ' would look like this:
SELECT * FROM profile WHERE
(EI = 'I') + (SN = 'N') + (TF = 'F') + (JP = 'J') >= 2

Related

MYSQL Calculate average of a specific occurrence in a column

I need to calculate the average number of occurrences of a given value in a column of a dataset. I made a simple example, but my actual query uses about 2 inner joins to reduce the data to around 100k records. I need to perform the following select distinct statement for 10 columns.
My current design forces an inner join for each column. Another constraint is that I need to run it over at least 50-100 rows for each name in this example.
I need to figure out an efficient way to calculate these values without using too many resources while keeping the query fast.
http://sqlfiddle.com/#!9/c2378/3
My expected Result is:
Name | R Avg dir | L Avg dir 1 | L Avg dir 2 | L Avg dir 3
A    | 0         | .5          | .25         | .25
Create table query:
CREATE TABLE demo
(`id` int, `name` varchar(10),`hand` varchar(1), `dir` int)
;
INSERT INTO demo
(`id`, `name`, `hand`, `dir`)
VALUES
(1, 'A', 'L', 1),
(2, 'A', 'L', 1),
(3, 'A', 'L', 2),
(4, 'A', 'L', 3),
(5, 'A', 'R', 3),
(6, 'A', 'R', 3)
;
Example Query:
SELECT distinct name,
COALESCE(( (Select count(id) as 'cd' from demo where hand = 'L' AND dir = 1) /(Select count(id) as 'fd' from demo where hand = 'L')),0) as 'L AVG dir'
FROM
demo
where hand = 'L' AND dir = 1 AND name = 'A'
One option is to use conditional aggregation:
SELECT name,
count(case when hand = 'L' and dir = 1 then 1 end) /
count(case when hand = 'L' then 1 end) L1Avg,
count(case when hand = 'L' and dir = 2 then 1 end) /
count(case when hand = 'L' then 1 end) L2Avg,
count(case when hand = 'L' and dir = 3 then 1 end) /
count(case when hand = 'L' then 1 end) L3Avg,
count(case when hand = 'R' and dir = 3 then 1 end) /
count(case when hand = 'R' then 1 end) RAvg
FROM demo
WHERE name = 'A'
GROUP BY name
Updated Fiddle Demo
Please note, I wasn't 100% sure why you wanted your RAvg to be 0 -- I assumed you meant 100%. If not, you can adjust the above accordingly.
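To make the arithmetic behind the conditional aggregation concrete, here is a rough Python sketch that computes the same ratios from the sample rows in the question. The function name `dir_shares` and the dictionary output shape are assumptions for illustration only.

```python
from collections import Counter

# (id, name, hand, dir), copied from the question's INSERT statement
rows = [
    (1, "A", "L", 1),
    (2, "A", "L", 1),
    (3, "A", "L", 2),
    (4, "A", "L", 3),
    (5, "A", "R", 3),
    (6, "A", "R", 3),
]


def dir_shares(rows, name):
    """Return {(hand, dir): share of that dir among the hand's rows} for one name."""
    per_hand = Counter()      # denominator: rows per hand
    per_hand_dir = Counter()  # numerator: rows per (hand, dir)
    for _id, row_name, hand, direction in rows:
        if row_name != name:
            continue
        per_hand[hand] += 1
        per_hand_dir[(hand, direction)] += 1
    return {key: n / per_hand[key[0]] for key, n in per_hand_dir.items()}


print(dir_shares(rows, "A"))
```

Each `count(case when ... then 1 end)` in the query plays the role of one counter here: the numerator counts rows matching both hand and dir, the denominator counts rows matching the hand alone.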

Extract numeric part of string and get max value in column

I have a table foo that stores codes in format lnnnnn where l is at least one letter and n is numeric value. Both letters or numbers can be of various length, so trying to solve this like mentioned here won't work.
Example:
group | code
=============
1 | a0010
1 | a0012
1 | a0013
2 | bn0014
2 | bn0015
2 | bn0016
3 | u0017
3 | u0018
My task is to get current highest numeric value of this column in desired group, to generate new number (like sequence).
Note that I cannot redesign table and explode string and text parts.
So far I tried:
select
max(code rlike '[0-9]$')
from
foo
where
group = 2
but, sadly, regexp or rlike (synonyms) returns only 0 or 1 (matched or not matched).
One method is a brute force method:
select grp,
max(case when substr(code, 1, 1) between '0' and '9' then code + 0
when substr(code, 2, 1) between '0' and '9' then substr(code, 2) + 0
when substr(code, 3, 1) between '0' and '9' then substr(code, 3) + 0
when substr(code, 4, 1) between '0' and '9' then substr(code, 4) + 0
when substr(code, 5, 1) between '0' and '9' then substr(code, 5) + 0
when substr(code, 6, 1) between '0' and '9' then substr(code, 6) + 0
when substr(code, 7, 1) between '0' and '9' then substr(code, 7) + 0
when substr(code, 8, 1) between '0' and '9' then substr(code, 8) + 0
end)
from foo
group by grp;
If your numeric codes are always four digits long, you can do it like this:
select groupid, max(right(code,4)) as maxcode
from foo
group by groupid
See it here on fiddle: http://sqlfiddle.com/#!2/775b3/2
If all numeric parts start with a 0:
select gp, max(cast(substr(code, instr(code, '0')) as unsigned))
from t
group by gp
See sqlfiddle
If not, for arbitrary numeric parts (that start with any digit):
select gp, max(cast(substr(code, instr(code, n)) as unsigned))
from t
join (select 0 n union select 1 union select 2 union select 3 union select 4 union select 5
union select 6 union select 7 union select 8 union select 9) x
group by gp
See sqlfiddle
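If it helps to check what these queries should return, the same extraction can be sketched in Python. This assumes, as in the question's sample data, that the numeric part is always the trailing run of digits; the function name `max_numeric_suffix` is made up for this example.

```python
import re

# (group, code), copied from the question's example table
rows = [
    (1, "a0010"), (1, "a0012"), (1, "a0013"),
    (2, "bn0014"), (2, "bn0015"), (2, "bn0016"),
    (3, "u0017"), (3, "u0018"),
]


def max_numeric_suffix(rows):
    """Return {group: highest trailing number found in that group's codes}."""
    best = {}
    for group, code in rows:
        match = re.search(r"\d+$", code)  # trailing digits only
        if match is None:
            continue  # code without a numeric part: ignore it
        value = int(match.group())
        best[group] = max(best.get(group, value), value)
    return best


print(max_numeric_suffix(rows))
```

Converting the suffix with `int` corresponds to the `cast(... as unsigned)` in the queries, so leading zeros do not affect the comparison.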

SQL - Select Boolean Results from Table

Well, I couldn't find a good title for this question, sorry about that.
I have a table where I store emails sent to users.
In this table I can tell whether or not the user read the email.
Table structure:
[MAILSEND_ID] (INT),
[ID_USER] (INT),
[MAIL_ID] (INT),
[READ] (BIT)
Data:
;WITH cte AS (
SELECT * FROM (VALUES
(1, 10256, 10, 0),
(1, 10257, 10, 1),
(1, 10258, 10, 1),
(1, 10259, 10, 0),
(2, 10256, 10, 0),
(2, 10257, 10, 0),
(2, 10258, 10, 1),
(2, 10259, 10, 0),
(3, 10256, 10, 1),
(3, 10257, 10, 0),
(3, 10258, 10, 0),
(3, 10259, 10, 0)
) as t(MAILSEND_ID, ID_USER, MAIL_ID, READ)
In this example you can see that I have 4 users and 3 emails sent.
User 10256
1st Email - Not read
2nd Email - Not read
3rd Email - Read
I need a select on this table that takes a [MAIL_ID] and a [NUMBER], where the number represents how many consecutive emails, counting back from the most recent one, the user has not read.
Using the last example:
Given [NUMBER] = 3, [MAIL_ID] = 10
Return the USER_ID 10259 only.
Given [NUMBER] = 2, [MAIL_ID] = 10
Return the USER_ID 10257, 10259.
Given [NUMBER] = 1, [MAIL_ID] = 10
Return the USER_ID 10257, 10258, 10259.
In other words, a user can accumulate a number of unread emails, but if the user read the last email, that user must not be returned by the query.
This is my query today, but only returns the total of emails not read:
select * from (
select
a.[USER_ID],
COUNT(a.[USER_ID]) as tt
from
emailmkt.mailing_history a
where
a.[MAIL_ID] = 58 and
a.[READ]=0
group by
[USER_ID]
) aa where tt > [NUMBER]
So the logic is not right. I want to express this logic in SQL rather than in application code, if possible.
Sorry for any English errors as well.
Thanks in advance.
With the following query you can get the rolling count of unread mail per user, based on the hypothesis that mailsend_id is time-ordered (I changed READ to IsRead because I don't have the ` character on my keyboard)
SELECT ID_USER, Mail_ID
, groupid CURRENT
, @roll := CASE WHEN coalesce(@groupid, '') = groupid
THEN @roll + 1
ELSE 1
END AS roll
, @groupid := groupid OLD
FROM (SELECT mh.ID_USER, mh.Mail_ID
, concat(mh.id_user, mh.mail_id) groupid
FROM mailing_history mh
INNER JOIN (SELECT id_user
, max(CASE isread
WHEN 1 THEN MAILSEND_ID
ELSE 0
END) lastRead
FROM mailing_history
GROUP BY id_user) lr
ON mh.id_user = lr.id_user AND mh.MAILSEND_ID > lr.lastread
ORDER BY id_user, MAILSEND_ID) a
Demo: SQLFiddle
The column roll holds the rolling count of unread mail for the user.
By adding an outer query level you can check the value of roll against NUMBER in a WHERE condition and group_concat the user ids.
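The intended result can be checked against the question's sample data with a small Python sketch. It assumes, like the answer, that MAILSEND_ID reflects send order, and it treats [NUMBER] as a minimum streak length (`>=`), which is what the expected output in the question implies. The function names are hypothetical.

```python
from collections import defaultdict

# (mailsend_id, id_user, mail_id, is_read), copied from the question's data
history = [
    (1, 10256, 10, 0), (1, 10257, 10, 1), (1, 10258, 10, 1), (1, 10259, 10, 0),
    (2, 10256, 10, 0), (2, 10257, 10, 0), (2, 10258, 10, 1), (2, 10259, 10, 0),
    (3, 10256, 10, 1), (3, 10257, 10, 0), (3, 10258, 10, 0), (3, 10259, 10, 0),
]


def trailing_unread(history, mail_id):
    """Return {user: number of unread emails since that user's last read one}."""
    streak = defaultdict(int)
    # Sorting the tuples orders them by mailsend_id first (assumed send order).
    for _send_id, user, mid, is_read in sorted(history):
        if mid != mail_id:
            continue
        # A read email resets the streak; an unread one extends it.
        streak[user] = 0 if is_read else streak[user] + 1
    return dict(streak)


def users_with_streak(history, mail_id, number):
    """Users whose current unread streak is at least `number`."""
    streaks = trailing_unread(history, mail_id)
    return sorted(user for user, n in streaks.items() if n >= number)


print(users_with_streak(history, 10, 2))
```

The reset on a read email is the counterpart of the join condition `mh.MAILSEND_ID > lr.lastread` in the query, which discards everything up to the last read message.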

Count occurrences that differ within a column

I want to be able to select the number of times the data in columns Somedata_A and Somedata_B has changed from the previous row within its column. I've tried using DISTINCT and it works to some degree. {1,2,3,2,1,1} will show 3 when I want it to show 4, because there are 5 different values in sequence (1, 2, 3, 2, 1), which gives 4 changes.
Example:
A,B,C,D,E,F
{1,2,3,2,1,1}
Comparing A to B gives a difference, comparing B to C gives a difference, and so on; comparing E to F gives no difference. All in all that makes 4 differences within a set of 6 values.
I have gotten DISTINCT to work but it does not really do the trick for me. To add more to the question, I'm not really interested in the whole range, only in, let's say, the last 2 days/entries per Title.
Second, I'm concerned about performance. I tried the query below on a real set of data and it got interrupted, probably due to a timeout.
SQL Fiddle
MySQL 5.5.32 Schema Setup:
CREATE TABLE testdata(
Title varchar(10),
Date varchar(10),
Somedata_A int(5),
Somedata_B int(5));
INSERT INTO testdata (Title, Date, Somedata_A, Somedata_B) VALUES
("Alpha", '123', 1, 2),
("Alpha", '234', 2, 2),
("Alpha", '345', 1, 2),
("Alpha", '349', 1, 2),
("Alpha", '456', 1, 2),
("Omega", '123', 1, 1),
("Omega", '234', 2, 2),
("Omega", '345', 3, 3),
("Omega", '349', 4, 3),
("Omega", '456', 5, 4),
("Delta", '123', 1, 1),
("Delta", '234', 2, 2),
("Delta", '345', 1, 3),
("Delta", '349', 2, 3),
("Delta", '456', 1, 4);
Query 1:
SELECT t.Title, (SELECT COUNT(DISTINCT Somedata_A) FROM testdata AS tt WHERE t.Title = tt.Title) AS A,
(SELECT COUNT(DISTINCT Somedata_B) FROM testdata AS tt WHERE t.Title = tt.Title) AS B
FROM testdata AS t
GROUP BY t.Title
Results:
| TITLE | A | B |
|-------|---|---|
| Alpha | 2 | 1 |
| Delta | 2 | 4 |
| Omega | 5 | 4 |
Something like this may work: it uses a variable for row number, joins on an offset of 1 and then counts differences for A and B.
http://sqlfiddle.com/#!2/3bbc8/9/2
set @i = 0;
set @j = 0;
Select
A.Title aTitle,
sum(Case when A.SomeData_A <> B.SomeData_A then 1 else 0 end) AVar,
sum(Case when A.SomeData_B <> B.SomeData_B then 1 else 0 end) BVar
from
(SELECT Title, @i:=@i+1 as ROWID, SomeData_A, SomeData_B
FROM testdata
ORDER BY Title, date desc) as A
INNER JOIN
(SELECT Title, @j:=@j+1 as ROWID, SomeData_A, SomeData_B
FROM testdata
ORDER BY Title, date desc) as B
ON A.RowID= B.RowID + 1
AND A.Title=B.Title
Group by A.Title
This works (see here) (FYI: Your results in the question do not match your data - for instance, for Alpha, ColumnA: it never changes from 1. The answer should be 0)
Hopefully you can adapt this Statement to your actual data model
SELECT t1.title, SUM(t1.Somedata_A<>t2.Somedata_a) as SomeData_A
,SUM(t1.Somedata_b<>t2.Somedata_b) as SomeData_B
FROM testdata AS t1
JOIN testdata AS t2
ON t1.title = t2.title
AND t2.date = DATE_ADD(t1.date, INTERVAL 1 DAY)
GROUP BY t1.title
ORDER BY t1.title;
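As a cross-check for either query, the "compare each row with the previous one" idea can be sketched in Python against the question's sample data. This is an illustration only; the names `count_changes` and `changes_per_title` are assumptions, and rows are ordered by the Date string within each Title.

```python
from itertools import groupby
from operator import itemgetter

# (title, date, somedata_a, somedata_b), copied from the question's schema setup
rows = [
    ("Alpha", "123", 1, 2), ("Alpha", "234", 2, 2), ("Alpha", "345", 1, 2),
    ("Alpha", "349", 1, 2), ("Alpha", "456", 1, 2),
    ("Omega", "123", 1, 1), ("Omega", "234", 2, 2), ("Omega", "345", 3, 3),
    ("Omega", "349", 4, 3), ("Omega", "456", 5, 4),
    ("Delta", "123", 1, 1), ("Delta", "234", 2, 2), ("Delta", "345", 1, 3),
    ("Delta", "349", 2, 3), ("Delta", "456", 1, 4),
]


def count_changes(values):
    """Number of positions where a value differs from its predecessor."""
    return sum(prev != cur for prev, cur in zip(values, values[1:]))


def changes_per_title(rows):
    """Return {title: (changes in Somedata_A, changes in Somedata_B)}."""
    result = {}
    ordered = sorted(rows, key=itemgetter(0, 1))  # by title, then date
    for title, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        result[title] = (
            count_changes([r[2] for r in group]),
            count_changes([r[3] for r in group]),
        )
    return result


print(count_changes([1, 2, 3, 2, 1, 1]))
print(changes_per_title(rows))
```

Pairing each value with its predecessor via `zip(values, values[1:])` mirrors the self-join on an offset of one row.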

How to select a range of rows from a multiple column primary key?

I'm trying to chunk through rows in MySQL 5.5 and to do this I want to select a range between two primary keys (which I can get easily). This is trivial when the primary key is only one column. However, some of the tables I need to chunk through have multiple columns in the primary key, and I haven't figured out how to make this work in a single prepared statement.
Here's an example table with some data:
CREATE TABLE test (
a INT UNSIGNED NOT NULL,
b INT UNSIGNED NOT NULL,
c INT UNSIGNED NOT NULL,
d VARCHAR(255) DEFAULT '', -- various data columns
PRIMARY KEY (a, b, c)
) ENGINE=InnoDB;
INSERT INTO test (a, b, c) VALUES
(1, 1, 1),
(1, 1, 2),
(1, 1, 3),
(1, 2, 1),
(1, 2, 2),
(1, 2, 3),
(1, 3, 1),
(1, 3, 3),
(2, 1, 1),
(2, 1, 2),
(2, 2, 2),
(2, 3, 1),
(2, 3, 3),
(3, 1, 2),
(3, 1, 3),
(3, 2, 1),
(3, 2, 2),
(3, 2, 3),
(3, 3, 1),
(3, 3, 3);
If I had two primary keys like (1, 1, 3) and (3, 2, 1), the following statement would work. a1, b1, and c1 are the values from the first primary key, and a2, b2, and c2 are the values from the second primary key:
SELECT * FROM test WHERE a = a1 AND b = b1 AND c >= c1
UNION
SELECT * FROM test WHERE a = a1 AND b > b1
UNION
SELECT * FROM test WHERE a > a1 AND a < a2
UNION
SELECT * FROM test WHERE a = a2 AND b < b2
UNION
SELECT * FROM test WHERE a = a2 AND b = b2 AND c <= c2
Or
SELECT * FROM test WHERE a = 1 AND b = 1 AND c >= 3
UNION
SELECT * FROM test WHERE a = 1 AND b > 1
UNION
SELECT * FROM test WHERE a > 1 AND a < 3
UNION
SELECT * FROM test WHERE a = 3 AND b < 2
UNION
SELECT * FROM test WHERE a = 3 AND b = 2 AND c <= 1
Which gives
(1, 1, 3),
(1, 2, 1),
(1, 2, 2),
(1, 2, 3),
(1, 3, 1),
(1, 3, 3),
(2, 1, 1),
(2, 1, 2),
(2, 2, 2),
(2, 3, 1),
(2, 3, 3),
(3, 1, 2),
(3, 1, 3),
(3, 2, 1)
But the above fails when the first column is the same, e.g. (1, 2, 2) and (1, 3, 1). In this case, the 2nd and 4th SELECT select too much.
SELECT * FROM test WHERE a = 1 AND b = 2 AND c >= 2
UNION
SELECT * FROM test WHERE a = 1 AND b > 2
UNION
SELECT * FROM test WHERE a > 1 AND a < 1
UNION
SELECT * FROM test WHERE a = 1 AND b < 3
UNION
SELECT * FROM test WHERE a = 1 AND b = 3 AND c <= 1
Which gives
(1, 1, 1), -- erroneously selected from: SELECT * FROM test WHERE a = 1 AND b < 3
(1, 1, 2), -- erroneously selected from: SELECT * FROM test WHERE a = 1 AND b < 3
(1, 1, 3), -- erroneously selected from: SELECT * FROM test WHERE a = 1 AND b < 3
(1, 2, 1), -- erroneously selected from: SELECT * FROM test WHERE a = 1 AND b < 3
(1, 2, 2),
(1, 2, 3),
(1, 3, 1),
(1, 3, 3) -- erroneously selected from: SELECT * FROM test WHERE a = 1 AND b > 2
The desired output is
(1, 2, 2),
(1, 2, 3),
(1, 3, 1)
I would like a single statement that works with all primary key ranges, including identical values for the first and second columns. I also have tables with 4 columns in the primary key, and I'll extend the pattern in that case.
I would like a single statement per table instead of creating queries on the fly because the query will be executed up to a million times as I chunk through the tables. Some of the tables have over 100M rows.
I would rather avoid constructing multiple statements as I have hundreds to write following this pattern, and writing more would be significantly more work. I will do this if it's the only option.
I currently use parametrized queries, and generate the values programmatically from the two primary keys, taking care of required duplicate values (the a1 x3, b1 x2, a2 x3, b2 x2 in the above example) in the application layer. So passing duplicate values for parameters is simple for me to do.
My best guess at this point is duplicate the SELECTs with an additional part of the WHERE clause comparing the values of the columns of the primary keys.
I would use this query to select a range:
SELECT *
FROM test
WHERE (a,b,c) >= (1, 1, 3)
and (a,b,c) <= (3, 2, 1)
Demo: http://www.sqlfiddle.com/#!2/d6cf7b/4
Unfortunately, MySQL is not able to perform range optimization for the above query; see this link: http://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#range-access-single-part
(chapter: 8.2.1.3.4. Range Optimization of Row Constructor Expressions)
They say that starting from version 5.7, MySQL can optimize only queries of the form:
WHERE ( col_1, col_2 ) IN (( 'a', 'b' ), ( 'c', 'd' ));
For these particular bounds, where the first column differs between the two keys, the above query is equivalent to this one:
SELECT *
FROM test
WHERE
a = 1 and b = 1 and c >= 3 -- lowest end
or
a = 3 and b = 2 and c <= 1 -- highest end
or
a = 1 and b > 1
or
a = 3 and b < 2
or
a > 1 and a < 3
;
MySQL might use the range access method for this form of the query; see the link below
(chapter: 8.2.1.3.2. The Range Access Method for Multiple-Part Indexes):
http://dev.mysql.com/doc/refman/5.7/en/range-optimization.html
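Row constructor comparison is lexicographic, which is also how Python compares tuples, so the expected result of a key range can be checked with a short sketch. The second function spells out one possible nested AND/OR expansion of the same comparison that does not depend on the first column differing; it is an illustration with made-up function names, and nothing here says how MySQL would optimize such a predicate.

```python
# (a, b, c) primary keys, copied from the question's INSERT statement
rows = [
    (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2), (1, 2, 3),
    (1, 3, 1), (1, 3, 3), (2, 1, 1), (2, 1, 2), (2, 2, 2), (2, 3, 1),
    (2, 3, 3), (3, 1, 2), (3, 1, 3), (3, 2, 1), (3, 2, 2), (3, 2, 3),
    (3, 3, 1), (3, 3, 3),
]


def key_range(rows, low, high):
    """Rows whose key lies between low and high, compared lexicographically."""
    return [r for r in rows if low <= r <= high]


def in_range_expanded(row, low, high):
    """The same test written with only per-column comparisons."""
    a, b, c = row
    a1, b1, c1 = low
    a2, b2, c2 = high
    at_or_after_low = a > a1 or (a == a1 and (b > b1 or (b == b1 and c >= c1)))
    at_or_before_high = a < a2 or (a == a2 and (b < b2 or (b == b2 and c <= c2)))
    return at_or_after_low and at_or_before_high


# Bounds that share the first column, the case that broke the UNION version.
print(key_range(rows, (1, 2, 2), (1, 3, 1)))
```

Because both bounds are tested independently and combined with AND, the expanded form gives the same rows whether or not the two keys share leading columns.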