Advanced MySQL: Find correlations between poll responses

I've got four MySQL tables:
users (id, name)
polls (id, text)
options (id, poll_id, text)
responses (id, poll_id, option_id, user_id)
Given a particular poll and a particular option, I'd like to generate a table that shows which options from other polls are most strongly correlated.
Suppose this is our data set:
TABLE users:
+------+-------+
| id | name |
+------+-------+
| 1 | Abe |
| 2 | Bob |
| 3 | Che |
| 4 | Den |
+------+-------+
TABLE polls:
+------+-----------------------+
| id | text |
+------+-----------------------+
| 1 | Do you like apples? |
| 2 | What is your gender? |
| 3 | What is your height? |
| 4 | Do you like polls? |
+------+-----------------------+
TABLE options:
+------+----------+---------+
| id | poll_id | text |
+------+----------+---------+
| 1 | 1 | Yes |
| 2 | 1 | No |
| 3 | 2 | Male |
| 4 | 2 | Female |
| 5 | 3 | Short |
| 6 | 3 | Tall |
| 7 | 4 | Yes |
| 8 | 4 | No |
+------+----------+---------+
TABLE responses:
+------+----------+------------+----------+
| id | poll_id | option_id | user_id |
+------+----------+------------+----------+
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 3 |
| 4 | 1 | 2 | 4 |
| 5 | 2 | 3 | 1 |
| 6 | 2 | 3 | 2 |
| 7 | 2 | 3 | 3 |
| 8 | 2 | 4 | 4 |
| 9 | 3 | 5 | 1 |
| 10 | 3 | 6 | 2 |
| 11 | 3 | 5 | 3 |
| 12 | 3 | 6 | 4 |
| 13 | 4 | 7 | 1 |
| 14 | 4 | 7 | 2 |
| 15 | 4 | 7 | 3 |
| 16 | 4 | 7 | 4 |
+------+----------+------------+----------+
Given the poll ID 1 and the option ID 2, the generated table should be something like this:
+----------+------------+-----------------------+
| poll_id | option_id | percent_correlated |
+----------+------------+-----------------------+
| 4 | 7 | 100 |
| 2 | 3 | 66.66 |
| 3 | 6 | 66.66 |
| 2 | 4 | 33.33 |
| 3 | 5 | 33.33 |
| 4 | 8 | 0 |
+----------+------------+-----------------------+
So basically, we're identifying all of the users who responded to poll ID 1 and selected option ID 2, and we're looking through all the other polls to see what percentage of them also selected each other option.

I don't have an instance handy to test this, so can you check whether it gives the proper results:
select
poll_id,
option_id,
((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
n
from
(
select
poll_id,
option_id,
SUM(score) AS sum1,
SUM(score_rev) AS sum2,
SUM(score * score) AS sum1sq,
SUM(score_rev * score_rev) AS sum2sq,
SUM(score * score_rev) AS psum,
COUNT(*) AS n
from
(
select
responses.poll_id,
responses.option_id,
CASE
WHEN user_resp.user_id IS NULL THEN 0
ELSE 1
END AS score,
CASE
WHEN user_resp.user_id IS NULL THEN 1
ELSE 0
END AS score_rev
from responses left outer join
(
select
user_id
from
responses
where
poll_id = 1 and
option_id = 2
)user_resp
ON (user_resp.user_id = responses.user_id)
) temp1
group by
poll_id,
option_id
)components

After a few hours of trial and error, I managed to put together a query that works correctly:
SELECT poll_id AS p_id,
option_id AS o_id,
COUNT(*) AS optCount,
(SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /
(SELECT COUNT(*) FROM response WHERE poll_id = p_id AND user_id IN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2'))
AS percentage
FROM response
INNER JOIN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids
ON response.user_id = user_ids.user_id
WHERE poll_id != '1'
GROUP BY poll_id, option_id
ORDER BY percentage DESC, optCount DESC
Based on tests with a small data set, this query looks to be reasonably fast, but I'd like to modify it so that the "IN" subquery is not repeated three times. Any suggestions?

This seems to give the right results for me:
select poll_stats.poll_id,
option_stats.option_id,
(100 * option_responses / poll_responses) as percent_correlated
from (select response.poll_id,
count(*) as poll_responses
from response selecting_response
join response on response.user_id = selecting_response.user_id
where selecting_response.poll_id = 1 and selecting_response.option_id = 2
group by response.poll_id) poll_stats
join (select options.poll_id,
options.id as option_id,
count(response.id) as option_responses
from options
left join response on response.poll_id = options.poll_id
and response.option_id = options.id
and exists (
select 1 from response selecting_response
where selecting_response.user_id = response.user_id
and selecting_response.poll_id = 1
and selecting_response.option_id = 2)
group by options.poll_id, options.id
) as option_stats
on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc
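As a rough cross-check of the percentages asked for in the question, here is a minimal sketch that loads the sample data and computes them with a common table expression, so the "who picked option 2 in poll 1" subquery is written only once. It runs on SQLite through Python's sqlite3 module purely so that it is self-contained; that choice, the CTE names cohort and cohort_responses, and the table name responses (taken from the schema in the question) are assumptions, and this is not the query from any of the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE options (id INTEGER PRIMARY KEY, poll_id INTEGER, text TEXT);
CREATE TABLE responses (
    id INTEGER PRIMARY KEY, poll_id INTEGER, option_id INTEGER, user_id INTEGER
);
INSERT INTO options (id, poll_id, text) VALUES
    (1, 1, 'Yes'), (2, 1, 'No'), (3, 2, 'Male'), (4, 2, 'Female'),
    (5, 3, 'Short'), (6, 3, 'Tall'), (7, 4, 'Yes'), (8, 4, 'No');
INSERT INTO responses (poll_id, option_id, user_id) VALUES
    (1, 1, 1), (1, 2, 2), (1, 2, 3), (1, 2, 4),
    (2, 3, 1), (2, 3, 2), (2, 3, 3), (2, 4, 4),
    (3, 5, 1), (3, 6, 2), (3, 5, 3), (3, 6, 4),
    (4, 7, 1), (4, 7, 2), (4, 7, 3), (4, 7, 4);
""")

QUERY = """
WITH cohort AS (
    -- users who chose the given option in the given poll
    SELECT user_id FROM responses WHERE poll_id = :poll AND option_id = :opt
),
cohort_responses AS (
    -- everything those users answered in the other polls
    SELECT r.poll_id, r.option_id
    FROM responses r
    JOIN cohort c ON c.user_id = r.user_id
    WHERE r.poll_id <> :poll
)
SELECT o.poll_id,
       o.id AS option_id,
       100.0 * SUM(cr.option_id = o.id) / COUNT(cr.poll_id) AS percent_correlated
FROM options o
LEFT JOIN cohort_responses cr ON cr.poll_id = o.poll_id
WHERE o.poll_id <> :poll
GROUP BY o.poll_id, o.id
ORDER BY percent_correlated DESC, o.poll_id, o.id
"""

rows = conn.execute(QUERY, {"poll": 1, "opt": 2}).fetchall()
for poll_id, option_id, pct in rows:
    print(poll_id, option_id, round(pct, 2))
```

The LEFT JOIN from options keeps options that nobody in the cohort picked, which is how option 8 shows up with 0. The denominator is the number of cohort members who answered that poll, the same choice the answer above makes with poll_responses.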

Related

Get minimum from result with GROUP BY in MySQL

I have a table that stores hierarchy data in MySQL as ancestor/descendant paths. The relations themselves are stable, but any user whose total purchases are below 1000 must be removed from the tree, and the users on the level below take that user's place. My code works fine up to this point: after the GROUP BY, the result contains all the ancestors of each descendant, and COUNT(*) AS level can then count the level of each user. This is the data that my SQL compresses according to the minimum purchase amount for each user:
+-------------+---------------+-------------+
| ancestor_id | descendant_id | path_length |
+-------------+---------------+-------------+
| 1 | 1 | 0 |
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 1 | 4 | 2 |
| 1 | 5 | 3 |
| 1 | 6 | 4 |
| 2 | 2 | 0 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
| 2 | 6 | 3 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |
| 4 | 5 | 1 |
| 4 | 6 | 2 |
| 5 | 5 | 0 |
| 5 | 6 | 1 |
| 6 | 6 | 0 |
+-------------+---------------+-------------+
This is the buy table:
+--------+--------+
| userid | amount |
+--------+--------+
| 2 | 2000 |
| 4 | 6000 |
| 6 | 7000 |
| 1 | 7000 |
+--------+--------+
SQL code:
SELECT a.*
FROM
( SELECT userid
FROM webineh_user_buys
GROUP BY userid
HAVING SUM(amount) >= 1000
) AS buys_d
JOIN
webineh_prefix_nodes_paths AS a
ON a.descendant_id = buys_d.userid
JOIN
(
SELECT userid
FROM webineh_user_buys
GROUP BY userid
HAVING SUM(amount) >= 1000
) AS buys_a on (a.ancestor_id = buys_a.userid )
JOIN
( SELECT descendant_id
, MAX(path_length) path_length
FROM webineh_prefix_nodes_paths
where a.ancestor_id = ancestor_id
GROUP
BY descendant_id
) b
ON b.descendant_id = a.descendant_id
AND b.path_length = a.path_length
GROUP BY a.descendant_id, a.ancestor_id
I need to get the max path_length for ancestors that have bought at least 1000 in total, but the condition where a.ancestor_id = ancestor_id inside the subquery fails with this error:
1054 - Unknown column 'a.ancestor_id' in 'where clause'
I have added an SQL Fiddle demo.
You could use this query:
select m.userid as descendant,
p.ancestor_id,
p.path_length
from (
select b1.userid,
min(case when b2.amount >= 1000
then p.path_length
end) as path_length
from (select userid, sum(amount) amount
from webineh_user_buys
group by userid
having sum(amount) >= 1000
) as b1
left join webineh_prefix_nodes_paths p
on p.descendant_id = b1.userid
and p.path_length > 0
left join (select userid, sum(amount) amount
from webineh_user_buys
group by userid) as b2
on p.ancestor_id = b2.userid
group by b1.userid
) as m
left join webineh_prefix_nodes_paths p
on p.descendant_id = m.userid
and p.path_length = m.path_length
order by m.userid
Output for sample data in the question:
| descendant | ancestor_id | path_length |
|------------|-------------|-------------|
| 1 | (null) | (null) |
| 2 | 1 | 1 |
| 4 | 2 | 1 |
| 6 | 4 | 2 |
SQL fiddle
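To make the idea of "nearest ancestor that has bought at least 1000" easier to check, here is a small sketch that loads the sample data from the question and looks the ancestor up with a correlated subquery. It is a simplified variant, not the query from the answer: it returns only the ancestor, and it runs on SQLite through Python's sqlite3 module so that it is self-contained. The CTE name qualified is an assumption made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webineh_prefix_nodes_paths "
    "(ancestor_id INTEGER, descendant_id INTEGER, path_length INTEGER)"
)
conn.execute("CREATE TABLE webineh_user_buys (userid INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO webineh_prefix_nodes_paths VALUES (?, ?, ?)",
    [
        (1, 1, 0), (1, 2, 1), (1, 3, 1), (1, 4, 2), (1, 5, 3), (1, 6, 4),
        (2, 2, 0), (2, 4, 1), (2, 5, 2), (2, 6, 3),
        (3, 3, 0),
        (4, 4, 0), (4, 5, 1), (4, 6, 2),
        (5, 5, 0), (5, 6, 1),
        (6, 6, 0),
    ],
)
conn.executemany(
    "INSERT INTO webineh_user_buys VALUES (?, ?)",
    [(2, 2000), (4, 6000), (6, 7000), (1, 7000)],
)

QUERY = """
WITH qualified AS (
    -- users whose purchases add up to at least 1000
    SELECT userid
    FROM webineh_user_buys
    GROUP BY userid
    HAVING SUM(amount) >= 1000
)
SELECT q.userid AS descendant,
       (SELECT p.ancestor_id
        FROM webineh_prefix_nodes_paths p
        JOIN qualified a ON a.userid = p.ancestor_id
        WHERE p.descendant_id = q.userid
          AND p.path_length > 0          -- skip the self-reference row
        ORDER BY p.path_length           -- shortest path = nearest ancestor
        LIMIT 1) AS ancestor_id
FROM qualified q
ORDER BY q.userid
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

User 5 has no purchases, so user 6 skips over it and attaches to user 4, which agrees with the output table in the answer.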

MySQL Query to get Similar likes

I am designing a simple architecture where I have a table that stores users and the elements they like, so my table structure is something like this:
+---------+---------+
| user_id | like_id |
+---------+---------+
| 1 | 4 |
| 2 | 2 |
| 4 | 4 |
| 4 | 3 |
| 5 | 4 |
| 6 | 7 |
| 7 | 5 |
| 34 | 6 |
| 3 | 8 |
| 2 | 3 |
| 2 | 5 |
| 1 | 3 |
| 1 | 10 |
| 1 | 12 |
| 2 | 10 |
+---------+---------+
Now, given the id of any user (let's say user_id = 1), I want a query that returns all the other users who have likes in common with user 1.
So the output for user_id = 1 will be:
+---------------------------+------------------------+----------------+
| users_with_common_likes | no_of_common_likes | common_likes |
+---------------------------+------------------------+----------------+
| 4 | 2 | 3,4 |
| 2 | 2 | 3,10 |
| 5 | 1 | 4 |
+---------------------------+------------------------+----------------+
What I have achieved:
I can do this using a subquery as below:
SELECT user_id
FROM `user_likes`
WHERE `like_id`
IN (
SELECT GROUP_CONCAT( `like_id` )
FROM user_likes
WHERE user_id =1
)
AND user_id !=1
LIMIT 0 , 30
However, this query does not return all the users: it misses user_id = 2, which has like_id 3 in common with user_id = 1.
I also can't figure out how to produce the remaining 2 columns.
I also feel that this is not the best way to do this, as the table will contain thousands of rows and it may affect system performance.
I would like to do this with a single MySQL query.
This assumes a primary key on (user_id, like_id):
SELECT y.user_id
, GROUP_CONCAT(y.like_id) likes
, COUNT(*) total
FROM my_table x
JOIN my_table y
ON y.like_id = x.like_id
AND y.user_id <> x.user_id
WHERE x.user_id = 1
GROUP
BY y.user_id;
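The self-join above can be tried against the sample data from the question. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module as a stand-in for MySQL and using the table name user_likes from the question rather than my_table. Since GROUP_CONCAT gives no ordering guarantee here, the like ids are sorted on the Python side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_likes (user_id INTEGER, like_id INTEGER, "
    "PRIMARY KEY (user_id, like_id))"
)
conn.executemany(
    "INSERT INTO user_likes VALUES (?, ?)",
    [
        (1, 4), (2, 2), (4, 4), (4, 3), (5, 4), (6, 7), (7, 5), (34, 6),
        (3, 8), (2, 3), (2, 5), (1, 3), (1, 10), (1, 12), (2, 10),
    ],
)

QUERY = """
SELECT y.user_id,
       GROUP_CONCAT(y.like_id) AS likes,
       COUNT(*) AS total
FROM user_likes x
JOIN user_likes y
  ON y.like_id = x.like_id       -- same liked element
 AND y.user_id <> x.user_id      -- but a different user
WHERE x.user_id = ?
GROUP BY y.user_id
ORDER BY total DESC, y.user_id
"""

rows = conn.execute(QUERY, (1,)).fetchall()

# Map each user to (number of common likes, sorted list of common like ids).
common = {
    user: (total, sorted(int(like) for like in likes.split(",")))
    for user, likes, total in rows
}
print(common)
```

Users 2 and 4 each share two likes with user 1 and user 5 shares one, which matches the expected output in the question.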

group Items by column and order by other column

I have the table below and want to get the latest rating for a client.
Whenever a user updates a rating, count is incremented and a new entry is added to the table. The table looks like this:
-----------------------------------------------------
|_id| name | client_id | user_id | rating | count |
-----------------------------------------------------
|1 | Four | 1 | 1 | 4 | 1 |
|2 | three | 1 | 1 | 3 | 2 |
|3 | two | 1 | 1 | 2 | 3 |
|4 | five | 1 | 1 | 5 | 4 |
|5 | two | 1 | 2 | 2 | 1 |
|6 | three | 1 | 2 | 3 | 2 |
|7 | two | 2 | 1 | 2 | 1 |
|8 | three | 2 | 1 | 3 | 2 |
-----------------------------------------------------
For the rating of client_id 1 I want output like this:
-----------------------------------------------------
|_id| name | client_id | user_id | rating | count |
-----------------------------------------------------
|4 | five | 1 | 1 | 5 | 4 |
|6 | three | 1 | 2 | 3 | 2 |
-----------------------------------------------------
So far I have tried
SELECT * FROM test
where client_id = 1 group by client_id order by count desc;
but I am not getting the expected result. Any help?
You can use a left join on the same table:
select t1.* from test t1
left join test t2 on t1.user_id = t2.user_id
and t1.client_id = t2.client_id
and t1._id < t2._id
where
t2._id is null
and t1.client_id = 1
order by t1.`count` desc;
Using an uncorrelated subquery, you can do:
select t1.* from test t1
join (
select max(_id) as _id,
client_id,
user_id
from test
where client_id = 1
group by client_id,user_id
)t2
on t1._id = t2._id
and t1.client_id = t2.client_id
order by t1.`count` desc;
UPDATE: A comment asked how to join another table into the query above. Here is an example:
mysql> select * from users ;
+------+------+
| _id | name |
+------+------+
| 1 | AAA |
| 2 | BBB |
+------+------+
2 rows in set (0.00 sec)
mysql> select * from test ;
+------+-------+-----------+---------+--------+-------+
| _id | name | client_id | user_id | rating | count |
+------+-------+-----------+---------+--------+-------+
| 1 | four | 1 | 1 | 4 | 1 |
| 2 | three | 1 | 1 | 3 | 2 |
| 3 | two | 1 | 1 | 2 | 3 |
| 4 | five | 1 | 1 | 5 | 4 |
| 5 | two | 1 | 2 | 2 | 1 |
| 6 | three | 1 | 2 | 3 | 2 |
| 7 | two | 2 | 1 | 2 | 1 |
| 8 | three | 2 | 1 | 3 | 2 |
+------+-------+-----------+---------+--------+-------+
select t1.*,u.name from test t1
join users u on u._id = t1.user_id
left join test t2 on t1.user_id = t2.user_id
and t1.client_id = t2.client_id
and t1._id < t2._id
where
t2._id is null
and t1.client_id = 1
order by t1.`count` desc;
Will give you
+------+-------+-----------+---------+--------+-------+------+
| _id | name | client_id | user_id | rating | count | name |
+------+-------+-----------+---------+--------+-------+------+
| 4 | five | 1 | 1 | 5 | 4 | AAA |
| 6 | three | 1 | 2 | 3 | 2 | BBB |
+------+-------+-----------+---------+--------+-------+------+
Note that the join to the users table is an inner join, which requires every user in the test table to be present in the users table.
If some users are missing from the users table, use a left join instead; the columns selected from users will then be null for those rows.
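The left join in this answer works as an anti-join: a row survives only if no later row (one with a larger _id) exists for the same user and client. Here is a minimal sketch of that pattern on the sample data, assuming SQLite through Python's sqlite3 module as a stand-in for MySQL. The column name count is quoted with double quotes in this sketch, where the MySQL answer uses backticks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE test (_id INTEGER PRIMARY KEY, name TEXT, client_id INTEGER, '
    'user_id INTEGER, rating INTEGER, "count" INTEGER)'
)
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "four", 1, 1, 4, 1),
        (2, "three", 1, 1, 3, 2),
        (3, "two", 1, 1, 2, 3),
        (4, "five", 1, 1, 5, 4),
        (5, "two", 1, 2, 2, 1),
        (6, "three", 1, 2, 3, 2),
        (7, "two", 2, 1, 2, 1),
        (8, "three", 2, 1, 3, 2),
    ],
)

QUERY = """
SELECT t1._id, t1.user_id, t1.rating
FROM test t1
LEFT JOIN test t2
       ON t1.user_id = t2.user_id
      AND t1.client_id = t2.client_id
      AND t1._id < t2._id          -- t2 is any later row for the same pair
WHERE t2._id IS NULL               -- keep rows that have no later row
  AND t1.client_id = ?
ORDER BY t1."count" DESC
"""

rows = conn.execute(QUERY, (1,)).fetchall()
print(rows)
```

For client 1 this keeps row 4 (user 1, rating 5) and row 6 (user 2, rating 3), the two rows the question asks for.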
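You may try something like
select client_id, user_id, max(`count`) as latest_count
from test
where client_id = 1
group by client_id, user_id
This returns the highest count for each user; join the result back to test to fetch the matching rating.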
Try this:
SELECT * FROM test t
where client_id = 1
and `count` = (select max(`count`) from test
where client_id = t.client_id and user_id = t.user_id)
order by `count` desc

SQL, difficult fetching data query

Suppose I have such a table:
+-----+---------+-------+
| ID | TIME | DAY |
+-----+---------+-------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 1 | 1 | 2 |
| 2 | 2 | 2 |
| 3 | 3 | 2 |
| 1 | 1 | 3 |
| 2 | 2 | 3 |
| 3 | 3 | 3 |
| 1 | 1 | 4 |
| 2 | 2 | 4 |
| 3 | 3 | 4 |
| 1 | 1 | 5 |
| 2 | 2 | 5 |
| 3 | 3 | 5 |
+-----+---------+-------+
I want to fetch the 2 IDs that have the largest sum of TIME within the last 3 days (that is, DAY values 3 to 5).
So the correct result would be:
+-----+---------+
| ID | SUM |
+-----+---------+
| 3 | 9 |
| 2 | 6 |
+-----+---------+
The original table is much larger and more complex, so I need a generic approach.
Thanks in advance.
And so I just learned that MySQL uses LIMIT instead of TOP...
fiddle
CREATE TABLE tbl (ID INT,tm INT,dy INT);
INSERT INTO tbl (id, tm, dy) VALUES
(1,1,1)
,(2,2,1)
,(3,3,1)
,(1,1,2)
,(1,1,1);
SELECT ID
,SUM(SumTimeForDay) SumTimeFromLastThreeDays
FROM (SELECT ID
,SUM(tm) SumTimeForDay
FROM tbl
WHERE dy > (SELECT MAX(dy) - 3 FROM tbl)
GROUP BY ID, dy) a
GROUP BY id
ORDER BY SUM(SumTimeForDay) DESC
LIMIT 2
select t1.`id`, sum(t1.`time`) as `sum`
from `table` t1
inner join ( select distinct `day` from `table` order by `day` desc limit 3 ) t2
on t2.`day` = t1.`day`
group by t1.`id`
order by sum(t1.`time`) desc
limit 2
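The two answers define "the last 3 days" slightly differently: one keeps rows whose day is greater than the maximum day minus 3, the other keeps the three most recent distinct days. Here is a minimal sketch that runs both ideas on the table from the question, assuming SQLite through Python's sqlite3 module as a stand-in for MySQL and the column names tm and dy from the first answer. On this data, where the days are consecutive, they agree; with gaps in the days they could differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, tm INTEGER, dy INTEGER)")
# Question data: ids 1..3 on each of days 1..5, with TIME equal to the id.
conn.executemany(
    "INSERT INTO tbl (id, tm, dy) VALUES (?, ?, ?)",
    [(i, i, d) for d in range(1, 6) for i in range(1, 4)],
)

# Variant 1: a fixed window relative to the latest day.
by_window = conn.execute("""
SELECT id, SUM(tm) AS total
FROM tbl
WHERE dy > (SELECT MAX(dy) - 3 FROM tbl)
GROUP BY id
ORDER BY total DESC
LIMIT 2
""").fetchall()

# Variant 2: the three most recent distinct days that actually have rows.
by_recent_days = conn.execute("""
SELECT t1.id, SUM(t1.tm) AS total
FROM tbl t1
JOIN (SELECT DISTINCT dy FROM tbl ORDER BY dy DESC LIMIT 3) t2
  ON t2.dy = t1.dy
GROUP BY t1.id
ORDER BY total DESC
LIMIT 2
""").fetchall()

print(by_window, by_recent_days)
```

Both variants return ID 3 with a sum of 9 and ID 2 with a sum of 6, the result the question expects.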

How to count and group query to get proper results?

I have a problem, please see my database:
-------------------
| id | article_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
| 6 | 3 |
| 7 | 3 |
| 8 | 3 |
| 9 | 3 |
| 10 | 3 |
And I want to receive something like this (order by votes, from max to min):
---------------------------
| id | article_id | votes |
---------------------------
| 1 | 3 | 5 |
| 2 | 1 | 3 |
| 3 | 2 | 2 |
Could you please help me write the proper SQL query?
SET @currentRow = 0;
SELECT @currentRow := @currentRow + 1 AS id, t.article_id, t.c AS `votes`
FROM (
SELECT article_id, count(*) as `c`
FROM table_votes
GROUP BY article_id
) t
ORDER BY t.c DESC
Please note that you can't select the table's id column in this context, because the rows are grouped by article_id, so your "expected result" is not strictly possible. I generated a row number instead to get as close to it as I could.
cheers
SELECT article_id, COUNT(article_id) AS votes
FROM votes_table
GROUP BY article_id
ORDER BY votes DESC;
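Here is a minimal sketch of the grouped count on the data from the question, assuming SQLite through Python's sqlite3 module as a stand-in for MySQL and the table name table_votes from the first answer. SQLite has no MySQL-style user variables, so the running id column is produced with Python's enumerate instead; that substitution is an assumption of this sketch, not something either answer does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_votes (id INTEGER PRIMARY KEY, article_id INTEGER)")
conn.executemany(
    "INSERT INTO table_votes (article_id) VALUES (?)",
    [(a,) for a in (1, 1, 1, 2, 2, 3, 3, 3, 3, 3)],
)

counts = conn.execute("""
SELECT article_id, COUNT(*) AS votes
FROM table_votes
GROUP BY article_id
ORDER BY votes DESC
""").fetchall()

# Number the already-sorted rows from 1 to mimic the generated id column.
ranked = [(n, article_id, votes) for n, (article_id, votes) in enumerate(counts, start=1)]
print(ranked)
```

This gives article 3 with 5 votes, article 1 with 3 votes and article 2 with 2 votes, numbered 1 to 3 as in the expected output.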