I'm building a Tinder clone for a study project and I'm trying to do something conceptually very simple, but it seems my query is just too heavy.
Data Structure
I've created this simple fiddle to visualize the database structure.
I've tried to put indexes on user.id, user.gender, user.orientation, match.user1, match.user2 and match.createdAt, with no luck.
Expected result
I want to find the people who have the fewest matches, depending on gender, orientation, lastLogin and calendar date.
Users mustn't be part of more than 4 matches during a 24h window, so I look for users with <= 3 matches during the last 24h.
The values below are hard-coded to make the query easy to edit, and because I haven't taken the time to do that part yet.
A match is composed of 2 users (user1 and user2).
The limit of 4 matches on the same day is a sum of when they appear as user1 and user2.
SELECT total_sum, userId
FROM (
SELECT u.id as userId, u.orientation as userOrientation, u.gender as userGender, m1.sum1, m2.sum2, (m1.sum1 + m2.sum2) AS total_sum
FROM user u
INNER JOIN (
SELECT user1, COUNT(user1) as sum1
FROM `match`
WHERE createdAt > DATE('2017-12-11 00:00:00')
GROUP BY user1
) m1
ON m1.user1 = u.id
INNER JOIN (
SELECT user2, COUNT(user1) as sum2
FROM `match`
WHERE createdAt > DATE('2017-12-11 00:00:00')
GROUP BY user2
) m2
ON m2.user2 = u.id
WHERE u.gender IN ('female')
AND u.orientation IN ('hetero', 'bi')
AND u.lastLogin > 1512873464582
) as total
WHERE total_sum < 4
ORDER BY total_sum ASC
LIMIT 8
The issue
With tiny tables the query takes a few milliseconds, but with medium-sized tables (50k users, 200k matches) it takes ages (170 s).
Optimizing
According to Thorsten Kettner's answer, this is the explain plan of his query when I run it against my test DB, after creating the indexes he advised:
Solution
I've ended up doing something easier.
First I flattened my match table by removing the user2 column. This doubles the size, because one match now becomes two rows, but it lets me do something much simpler and very efficient with the proper indexes.
The first query handles users with no matches and the second one handles users with matches. I no longer put the matchesLimit into the query, since it adds extra work for MySQL; I just check the first result to see whether matchNumber is <= 3.
(SELECT u.id, 0 as nb_match, u.gender, u.orientation
FROM user u
LEFT JOIN match_composition mc
ON (mc.matchedUser = u.id AND mc.createdAt > DATE('2017-12-11 00:00:00'))
WHERE u.lastLogin > 1512931740721
AND u.orientation IN ('bi', 'hetero')
AND u.gender IN ('female')
AND mc.id IS NULL
ORDER BY u.lastLogin DESC)
UNION ALL
(SELECT u.id, count(mc.id) as nb_match, u.gender, u.orientation
FROM match_composition mc
JOIN user u
ON u.id = matchedUser
WHERE mc.createdAt > DATE('2017-12-11 00:00:00')
AND u.lastLogin > 1512931740721
AND u.orientation IN ('bi', 'hetero')
AND u.gender IN ('female')
GROUP BY matchedUser
ORDER BY nb_match ASC
LIMIT 8)
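The flattening step described above can be sketched with Python's sqlite3 as a stand-in for MySQL (table and column names follow the post; the data is invented): each (user1, user2) match becomes two match_composition rows, one per participant, so a single index on (matchedUser, createdAt) covers every lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE "match" (id INTEGER PRIMARY KEY, user1 INTEGER, user2 INTEGER, createdAt TEXT);
CREATE TABLE match_composition (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    match_id INTEGER, matchedUser INTEGER, createdAt TEXT);
CREATE INDEX idx_mc ON match_composition (matchedUser, createdAt);
INSERT INTO "match" VALUES (10, 1, 2, '2017-12-11 10:00:00');
""")
# Flatten: every match row becomes two rows, one per matched user.
cur.execute("""
INSERT INTO match_composition (match_id, matchedUser, createdAt)
SELECT id, user1, createdAt FROM "match"
UNION ALL
SELECT id, user2, createdAt FROM "match"
""")
counts = cur.execute("""
SELECT matchedUser, COUNT(*) AS nb_match
FROM match_composition
GROUP BY matchedUser
ORDER BY matchedUser
""").fetchall()
print(counts)  # [(1, 1), (2, 1)]
```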
Thanks for your help!
A user can be matched as user1 or user2. We can use UNION ALL to get one record per user:
select user1 as userid from match union all select user2 as userid from match;
The complete query:
select
u.id as userid,
coalesce(um.total, 0) as total
from user u
left join
(
select userid, count(*) as total
from
(
select user1 as userid from match where createdat > date '2017-12-11'
union all
select user2 as userid from match where createdat > date '2017-12-11'
) m
group by userid
) um on um.userid = u.id
where u.gender IN ('female')
and u.orientation in ('hetero', 'bi')
and u.lastlogin > 1512873464582
and coalesce(um.total, 0) < 4
order by coalesce(um.total, 0);
You would have the following indexes for this:
create index idx_m1 on match (createdat, user1);
create index idx_m2 on match (createdat, user2);
create index idx_u on user (lastlogin, gender, orientation, id);
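To sanity-check the UNION ALL counting approach, here is a minimal sketch using Python's sqlite3 (schema and data are invented; SQLite stands in for MySQL, so the date cutoff is a plain string comparison):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE user (id INTEGER PRIMARY KEY, gender TEXT, orientation TEXT);
CREATE TABLE "match" (user1 INTEGER, user2 INTEGER, createdAt TEXT);
INSERT INTO user VALUES (1, 'female', 'hetero'), (2, 'female', 'bi'), (3, 'male', 'hetero');
INSERT INTO "match" VALUES
  (1, 3, '2017-12-11 10:00:00'),
  (3, 1, '2017-12-11 11:00:00'),
  (2, 3, '2017-12-10 09:00:00');  -- before the cutoff, ignored
""")
# One row per participation (user1 or user2), counted per user, then
# LEFT JOINed back so users with zero matches still appear with total 0.
rows = cur.execute("""
SELECT u.id AS userid, COALESCE(um.total, 0) AS total
FROM user u
LEFT JOIN (
  SELECT userid, COUNT(*) AS total
  FROM (
    SELECT user1 AS userid FROM "match" WHERE createdAt > '2017-12-11'
    UNION ALL
    SELECT user2 AS userid FROM "match" WHERE createdAt > '2017-12-11'
  ) m
  GROUP BY userid
) um ON um.userid = u.id
WHERE u.gender = 'female'
  AND u.orientation IN ('hetero', 'bi')
  AND COALESCE(um.total, 0) < 4
ORDER BY COALESCE(um.total, 0)
""").fetchall()
print(rows)  # [(2, 0), (1, 2)]
```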
I guess you were right about your SQL skills. This is what I came up with:
SELECT u.id as userId,
u.orientation as userOrientation,
u.gender as userGender,
count(m.user1) total_sum
FROM user u
LEFT JOIN `match` m on (u.id in (m.user1, m.user2)
and m.createdAt > DATE('2017-12-11 00:00:00'))
WHERE u.gender IN ('female')
AND u.orientation IN ('hetero', 'bi')
AND u.lastLogin > 1512873464582
having count(m.user1) < 4
ORDER BY total_sum ASC
LIMIT 8;
Edit: also covered the cases with no matches.
Try indexing the match table columns user1 and user2, and also the user table columns (or column combinations) you use in filters (gender, for example); see what brings better performance.
From what you provide, I would create indexes on:
- match.user1
- match.user2
- match.createdAt
- user.id (unique, and probably a PK)
- user.lastLogin
I would also try replacing COUNT(user1) with COUNT(*), but it probably won't have a big impact.
Indexes on user.gender and user.orientation are probably useless: the efficiency of an index is roughly proportional to the cardinality (number of distinct values) of the underlying column, so an index on a column with only 2-3 distinct values costs more than it helps.
As for the SQL, try the following. I tried to force the filtering on user to be done BEFORE the joins with match, in case the query optimizer does not handle it on its own (I have little experience with non-MS databases):
SELECT total_sum, userId
FROM (SELECT u.id as userId, u.orientation as userOrientation, u.gender as userGender, m1.sum1, m2.sum2, (m1.sum1 + m2.sum2) AS total_sum
FROM (SELECT * FROM user
WHERE gender = 'female'
AND orientation IN ('hetero', 'bi')
AND lastLogin > 1512873464582
) u
INNER JOIN (SELECT user1, COUNT(*) as sum1
FROM `match`
WHERE createdAt > DATE('2017-12-11 00:00:00')
GROUP BY user1
) m1 ON m1.user1 = u.id
INNER JOIN (SELECT user2, COUNT(*) as sum2
FROM `match`
WHERE createdAt > DATE('2017-12-11 00:00:00')
GROUP BY user2
) m2 ON m2.user2 = u.id
) as total
WHERE total_sum < 4
ORDER BY total_sum ASC
LIMIT 8
Related
Inner query:
select up.user_id, up.id as utility_pro_id from utility_pro as up
join utility_pro_zip_code as upz ON upz.utility_pro_id = up.id and upz.zip_code_id=1
where up.available_for_survey=1 and up.user_id not in (select bjr.user_id from book_job_request as bjr where
((1583821800000 between bjr.start_time and bjr.end_time) and (1583825400000 between bjr.start_time and bjr.end_time)))
Divided in two queries:
select up.user_id, up.id as utility_pro_id from utility_pro as up
join utility_pro_zip_code as upz ON upz.utility_pro_id = up.id and upz.zip_code_id=1
Select bjr.user_id as userId from book_job_request as bjr where bjr.user_id in :userIds and (:startTime between bjr.start_time and bjr.end_time) and (:endTime between bjr.start_time and bjr.end_time)
Note:
As per my understanding, when the single query with the inner subquery is executed it will scan all the data of book_job_request, whereas with the two separate queries only the rows with the specified user ids will be checked.
Any better option for this operation, other than these two, would also be appreciated.
I expect that the query is supposed to be more like this:
SELECT up.user_id
, up.id utility_pro_id
FROM utility_pro up
JOIN utility_pro_zip_code upz
ON upz.utility_pro_id = up.id
LEFT
JOIN book_job_request bjr
ON bjr.user_id = up.user_id
AND bjr.end_time >= 1583821800000
AND bjr.start_time <= 1583825400000
WHERE up.available_for_survey = 1
AND upz.zip_code_id = 1
AND bjr.user_id IS NULL
For further help with optimisation (i.e. which indexes to provide), we'd need the SHOW CREATE TABLE statements for all relevant tables as well as the EXPLAIN for the query above.
Another possibility:
SELECT up.user_id , up.id utility_pro_id
FROM utility_pro up
JOIN utility_pro_zip_code upz ON upz.utility_pro_id = up.id
WHERE up.available_for_survey = 1
AND upz.zip_code_id = 1
AND NOT EXISTS( SELECT 1 FROM book_job_request
WHERE user_id = up.user_id
AND end_time >= 1583821800000
AND start_time <= 1583825400000 )
Recommended indexes (for my NOT EXISTS and for Strawberry's LEFT JOIN):
book_job_request: (user_id, start_time, end_time)
upz: (zip_code_id, utility_pro_id)
up: (available_for_survey, user_id, id)
The column order given is important. And, no, the single-column indexes you currently have are not as good.
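For what it's worth, the two anti-join spellings can be checked against each other on toy data. A small sketch with Python's sqlite3 (invented data, names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE utility_pro (id INTEGER PRIMARY KEY, user_id INTEGER, available_for_survey INTEGER);
CREATE TABLE utility_pro_zip_code (utility_pro_id INTEGER, zip_code_id INTEGER);
CREATE TABLE book_job_request (user_id INTEGER, start_time INTEGER, end_time INTEGER);
INSERT INTO utility_pro VALUES (1, 100, 1), (2, 200, 1);
INSERT INTO utility_pro_zip_code VALUES (1, 1), (2, 1);
INSERT INTO book_job_request VALUES (100, 1583820000000, 1583823000000);
""")
# Anti-join spelled as LEFT JOIN ... IS NULL:
left_join = cur.execute("""
SELECT up.user_id, up.id
FROM utility_pro up
JOIN utility_pro_zip_code upz
  ON upz.utility_pro_id = up.id AND upz.zip_code_id = 1
LEFT JOIN book_job_request bjr
  ON bjr.user_id = up.user_id
 AND bjr.end_time >= 1583821800000
 AND bjr.start_time <= 1583825400000
WHERE up.available_for_survey = 1
  AND bjr.user_id IS NULL
""").fetchall()
# Same anti-join spelled as NOT EXISTS:
not_exists = cur.execute("""
SELECT up.user_id, up.id
FROM utility_pro up
JOIN utility_pro_zip_code upz
  ON upz.utility_pro_id = up.id AND upz.zip_code_id = 1
WHERE up.available_for_survey = 1
  AND NOT EXISTS (SELECT 1 FROM book_job_request
                  WHERE user_id = up.user_id
                    AND end_time >= 1583821800000
                    AND start_time <= 1583825400000)
""").fetchall()
# User 100 has an overlapping booking, so only user 200 remains in both.
print(left_join, not_exists)  # [(200, 2)] [(200, 2)]
```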
I need to find and merge records in a table which are related by time. The table records user activity on a website (activity start and end times).
I am trying to merge down to one record any activity within an hour of other activity by the same user. So if the start of one record is 55 minutes after the end of the same user's previous activity, I merge the two into one record.
I've tried various kinds of self join to achieve this, but the results are never perfect.
In two steps, I have tried this:
First, UPDATE the updated_at (activity end) so that all records within an hour of each other share a common updated_at timestamp, the latest of the group.
Then delete all the later records in the group, so that only the earliest record remains, now with the earliest created_at and the latest updated_at.
-- First set a common end-time (updated_at) for all activity by one user with less than an hour between
UPDATE users_activity
SET updated_at = (SELECT a.LatestEnd FROM (SELECT
UA1.id,
MAX(UA2.updated_at) AS LatestEnd
FROM users_activity UA1, users_activity UA2
WHERE
UA1.id <> UA2.id
AND UA1.user_id = UA2.user_id
AND UA1.created_at > DATE_SUB(UA2.updated_at,INTERVAL 1 HOUR)
AND UA1.created_at < UA2.updated_at
) a)
WHERE
users_activity.id IN (SELECT b.id FROM (SELECT
UA1.id
FROM users_activity UA1, users_activity UA2
WHERE
UA1.id <> UA2.id
AND UA1.user_id = UA2.user_id
AND UA1.created_at > DATE_SUB(UA2.updated_at,INTERVAL 1 HOUR)
AND UA1.created_at < UA2.updated_at
) b);
-- next delete all the later records in the group, leaving only the earliest
DELETE FROM users_activity
WHERE
users_activity.id IN (SELECT * FROM (SELECT d.id FROM users_activity d
INNER JOIN
(SELECT
COUNT(CONCAT(user_id,'_',updated_at)) AS Duplicates,
CONCAT(user_id,'_',updated_at) AS UserVisitEnd,
id,
user_id,
MAX(created_at) AS LatestStart
FROM users_activity
GROUP BY UserVisitEnd
HAVING Duplicates > 1) a on a.LatestStart = d.created_at AND a.user_id = d.user_id) as AllDupes);
If the data is like this:
|id |user_id|created_at |updated_at
|5788|1222 |2019-06-06 08:55:28|2019-06-06 09:30:41
|5787|3555 |2019-06-06 08:40:04|2019-06-06 11:07:21
|5786|1222 |2019-06-06 07:11:03|2019-06-06 08:01:29
|5785|7999 |2019-06-05 18:11:03|2019-05-01 18:17:44
|5784|3555 |2019-06-04 16:53:32|2019-06-04 16:58:19
|5783|9222 |2019-04-01 15:21:32|2019-04-01 16:53:32
|5782|1222 |2019-03-29 14:02:09|2019-03-29 15:51:07
|5774|1222 |2019-03-29 13:38:43|2019-03-29 13:50:43
|5773|7999 |2018-09-23 17:38:35|2018-09-23 17:40:35
I should get this result:
|id |user_id|created_at |updated_at
|5787|3555 |2019-06-06 08:40:04|2019-06-06 11:07:21
|5786|1222 |2019-06-06 07:11:03|2019-06-06 09:30:41
|5785|7999 |2019-06-05 18:11:03|2019-05-01 18:17:44
|5784|3555 |2019-06-04 16:53:32|2019-06-04 16:58:19
|5783|9222 |2019-04-01 15:21:32|2019-04-01 16:53:32
|5774|1222 |2019-03-29 13:38:43|2019-03-29 15:51:07
|5773|7999 |2018-09-23 17:38:35|2018-09-23 17:40:35
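The merge rule can be pinned down in plain Python and checked against the sample rows: sessions of the same user merge when the next one starts within an hour of the previous one's end (a sketch of the rule, not the SQL solution itself):

```python
from datetime import datetime, timedelta

rows = [
    (5788, 1222, '2019-06-06 08:55:28', '2019-06-06 09:30:41'),
    (5787, 3555, '2019-06-06 08:40:04', '2019-06-06 11:07:21'),
    (5786, 1222, '2019-06-06 07:11:03', '2019-06-06 08:01:29'),
    (5785, 7999, '2019-06-05 18:11:03', '2019-05-01 18:17:44'),
    (5784, 3555, '2019-06-04 16:53:32', '2019-06-04 16:58:19'),
    (5783, 9222, '2019-04-01 15:21:32', '2019-04-01 16:53:32'),
    (5782, 1222, '2019-03-29 14:02:09', '2019-03-29 15:51:07'),
    (5774, 1222, '2019-03-29 13:38:43', '2019-03-29 13:50:43'),
    (5773, 7999, '2018-09-23 17:38:35', '2018-09-23 17:40:35'),
]

def parse(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

# Walk each user's sessions in start order; fold a session into the
# previous one when the gap from previous end to next start is <= 1 hour.
merged = {}
for rid, uid, start, end in sorted(rows, key=lambda r: (r[1], r[2])):
    start, end = parse(start), parse(end)
    sessions = merged.setdefault(uid, [])
    if sessions and start - sessions[-1][2] <= timedelta(hours=1):
        prev = sessions[-1]
        sessions[-1] = (prev[0], prev[1], max(prev[2], end))  # keep earliest id/start
    else:
        sessions.append((rid, start, end))

result = sorted((rid for s in merged.values() for rid, _, _ in s), reverse=True)
print(result)  # [5787, 5786, 5785, 5784, 5783, 5774, 5773]
```

This reproduces exactly the ids in the expected result above.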
New info: this query gets me results containing the info I need, the ids of sessions to update and merge. But how do I mass-update, when each row's update potentially changes the updates needed on other rows?
SELECT b.id, b.user_id, b.created_at, b.updated_at, b.UpdatedAtOfSessionToMerge, b.IDofSessionToMerge FROM (SELECT
UA1.id,
UA1.user_id,
UA1.created_at,
UA1.updated_at,
UA2.updated_at AS UpdatedAtOfSessionToMerge,
UA2.id AS IDofSessionToMerge
FROM users_activity UA1, users_activity UA2
WHERE
UA1.id <> UA2.id
AND UA1.user_id = UA2.user_id
AND UA1.created_at > DATE_SUB(UA2.updated_at,INTERVAL 1 HOUR)
AND UA1.updated_at < UA2.updated_at
AND UA1.created_at < UA2.updated_at
) b order by b.user_id;
SELECT min(ID) as ID, User_ID, Min(Created_At) Created_At, Max(Updated_At) as Updated_At
FROM Table
GROUP BY User_ID, DATE_FORMAT(Created_At, "%Y%m%d%H");
That would be close, but I'm not sure I'm handling the hour rollup the way you want.
You can group your dates into buckets based on a format parameter. It is also always good for future processing speed to order your data if you can; it makes the query result nicer, too.
SELECT min(ID) as ID, User_ID, Min(Created_At) Created_At, Max(Updated_At) as Updated_At
FROM Table
GROUP BY User_ID, DATE_FORMAT(Created_At, "%Y%m%d%H")
ORDER BY User_ID;
Check the following link for formatting dates in MySQL
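A quick sketch of the hour-bucket grouping using Python's sqlite3, where strftime('%Y%m%d%H', ...) plays the role of MySQL's DATE_FORMAT(Created_At, "%Y%m%d%H") (table name and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE activity (id INTEGER PRIMARY KEY, user_id INTEGER,
                       created_at TEXT, updated_at TEXT);
INSERT INTO activity VALUES
  (1, 7, '2019-06-06 08:05:00', '2019-06-06 08:10:00'),
  (2, 7, '2019-06-06 08:40:00', '2019-06-06 08:59:00'),
  (3, 7, '2019-06-06 09:05:00', '2019-06-06 09:20:00');
""")
# Rows 1 and 2 fall in the same YYYYMMDDHH bucket and collapse to one
# record; row 3 lands in the next hour and stays separate.
rows = cur.execute("""
SELECT MIN(id), user_id, MIN(created_at), MAX(updated_at)
FROM activity
GROUP BY user_id, strftime('%Y%m%d%H', created_at)
ORDER BY MIN(created_at)
""").fetchall()
print(rows)
```

Note the caveat from the answer still applies: this groups by clock hour, not by a sliding one-hour gap between sessions.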
This is a manual solution, good enough for a one-time clean-up of old session data. It uses two self-joins, so there may be a more efficient way to do it.
Step 1: find batches of session records and unify them by giving them all the same end-of-session value (updated_at):
UPDATE users_activity as u1 JOIN (SELECT b.id, b.user_id, b.created_at, b.updated_at, b.UpdatedAtOfSessionToMerge, b.IDofSessionToMerge FROM (SELECT
UA1.id,
UA1.user_id,
UA1.created_at,
UA1.updated_at,
UA2.updated_at AS UpdatedAtOfSessionToMerge,
UA2.id AS IDofSessionToMerge
FROM users_activity UA1, users_activity UA2
WHERE
UA1.id <> UA2.id
AND UA1.user_id = UA2.user_id
AND UA1.created_at > DATE_SUB(UA2.updated_at,INTERVAL 1 HOUR)
AND UA1.updated_at < UA2.updated_at
AND UA1.created_at < UA2.updated_at
) b order by b.user_id) as u2
on u1.id = u2.id
SET u1.updated_at = u2.UpdatedAtOfSessionToMerge;
Repeat this query until no rows are affected
Step 2: delete the unnecessary session records in each unified batch:
DELETE FROM users_activity
WHERE
users_activity.id IN (SELECT * FROM (SELECT d.id FROM users_activity d
INNER JOIN
(SELECT
COUNT(CONCAT(user_id,'_',updated_at)) AS Duplicates,
CONCAT(user_id,'_',updated_at) AS UserVisitEnd,
id,
user_id,
MAX(created_at) AS LatestStart
FROM users_activity
GROUP BY UserVisitEnd
HAVING Duplicates > 1) a on a.LatestStart = d.created_at AND a.user_id = d.user_id) as AllDupes);
Repeat this query until no rows are affected
I have a query which does a count from another table and adds the result as a column. I then need to filter the original select based on that column, but I get an "unknown column" error.
E.g. the following query counts rows from another table within the main query and names the result shares. I need to filter the main query's result set on whether that column is greater than 0, but I get the error: unknown column 'shares'.
select b.name, event_title, ue.event_vis, event_date,
(select count(*) from list_shares
where user_id = 63 and event_id=ue.user_event_id) as shares,
(DAYOFYEAR(ue.event_date) - DAYOFYEAR(CURDATE())) as days
FROM brains b
join user_events ue on b.user_id=ue.user_id
where b.user_id=63 and ((ue.event_vis='Public') OR (shares>0))
and MOD(DAYOFYEAR(ue.event_date) - DAYOFYEAR(CURDATE()) + 365, 365) <= 30
order by days asc
Is there a way to do this?
I would suggest using a derived table to deliver the aggregate value and joining it as you would a "physical" table. Example:
select
b.name,
ue.event_title,
ue.event_vis,
ue.event_date,
tmp.shares,
(DAYOFYEAR(ue.event_date) - DAYOFYEAR(CURDATE())) as days
from
brains b join user_events ue on b.user_id = ue.user_id
left join (
select
ls.user_id,
ls.event_id,
count(*) as shares
from
list_shares ls
group by
ls.user_id,
ls.event_id) tmp on b.user_id = tmp.user_id and ue.user_event_id = tmp.event_id
where
b.user_id = 63
and
((ue.event_vis = 'Public') OR (tmp.shares > 0))
and
MOD(DAYOFYEAR(ue.event_date) - DAYOFYEAR(CURDATE()) + 365, 365) <= 30
order by
days asc
Please note the LEFT JOIN: because you're using the OR operator in your WHERE clause, it seems you also want rows without any shares.
Of course you could also use the same subselect in your WHERE clause, but that would be duplicate code and harder to maintain.
You cannot reference a computed column's alias in the WHERE clause of the same query. Try something like
SELECT x.*
FROM (
/*Your Query*/
) as x
WHERE x.shares > 0
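A minimal sketch of that wrap-and-filter pattern with Python's sqlite3 (invented schema and data): the alias shares is not visible to the inner WHERE, but it is once the query becomes a derived table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE user_events (user_event_id INTEGER PRIMARY KEY, event_title TEXT);
CREATE TABLE list_shares (event_id INTEGER);
INSERT INTO user_events VALUES (1, 'shared event'), (2, 'unshared event');
INSERT INTO list_shares VALUES (1), (1);
""")
# The inner query computes the alias; the outer query may filter on it.
rows = cur.execute("""
SELECT x.* FROM (
  SELECT ue.event_title,
         (SELECT COUNT(*) FROM list_shares ls
           WHERE ls.event_id = ue.user_event_id) AS shares
  FROM user_events ue
) AS x
WHERE x.shares > 0
""").fetchall()
print(rows)  # [('shared event', 2)]
```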
Or you could do something like
select b.name, event_title, ue.event_vis, event_date,
shares.sharesCount as shares,
(DAYOFYEAR(ue.event_date) - DAYOFYEAR(CURDATE())) as days
FROM brains b
join user_events ue on b.user_id=ue.user_id
JOIN (select event_id, count(*) as sharesCount
      from list_shares
      where user_id = 63
      group by event_id) as shares ON shares.event_id = ue.user_event_id
where b.user_id=63 and ((ue.event_vis='Public') OR (shares.sharesCount>0))
and MOD(DAYOFYEAR(ue.event_date) - DAYOFYEAR(CURDATE()) + 365, 365) <= 30
order by days asc
I'm stuck trying to grab statistics from the DB.
My query needs to return the count of users that have at least 1 entry in the connections table.
So I tried:
SELECT DISTINCT (u.id) AS total_has_atleast_one_word FROM users AS u
LEFT JOIN connections AS c
ON u.id = c.user_id
WHERE c.word_id IS NOT null;
This returns the correct user_ids; I get 3 rows with the correct ids, which is all fine.
But when I do count(u.id) it returns 35 when it should be 3. My understanding is that it's counting the non-DISTINCT number of rows. So what should I do?
And as the last part of my question, how do I combine this with my other stat queries?
/*SELECT COUNT(u.id) AS total_users,
sum(u.created < (NOW() - INTERVAL 7 DAY)) as total_seven_day_period,
sum(u.verified = 1) as total_verified,
sum(u.level = 3) as total_passed_tut,
sum(u.avatar IS NOT null) as total_with_avatar,
sum(u.privacy = 0) as total_private,
sum(u.privacy = 2) as total_friends_only,
sum(u.privacy = 3) as total_public,
sum(u.sex = "F") as total_female,
sum(u.sex = "M") as total_male
FROM users AS u;*/
Testing playground: http://www.sqlfiddle.com/#!2/c79a6/63
SELECT COUNT(DISTINCT user_id) FROM connections
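A quick check of why COUNT over the join overcounts while COUNT(DISTINCT user_id) does not, using Python's sqlite3 with toy data (names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE connections (user_id INTEGER, word_id INTEGER);
INSERT INTO users VALUES (1), (2), (3);
INSERT INTO connections VALUES (1, 10), (1, 11), (2, 12);
""")
# The join produces one row per connection, so COUNT(u.id) counts user 1 twice.
joined = cur.execute("""
SELECT COUNT(u.id) FROM users u
JOIN connections c ON u.id = c.user_id
""").fetchone()[0]
# Counting distinct user_ids gives the number of users with >= 1 connection.
distinct = cur.execute(
    "SELECT COUNT(DISTINCT user_id) FROM connections").fetchone()[0]
print(joined, distinct)  # 3 2
```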
For the first part:
/* user ids of those who have ever connected */
Select user_id from connections group by user_id
/* to get those who have connected after a particular date time ... */
Select user_id from connections where connection_time > '2013-11-23' group by user_id
/* join with user table to get user details e.g. .. */
Select u.name, u.address from user u
join (Select user_id from connections group by user_id) c on c.user_id = u.user_id
I made the query below as per my requirements, but it is not working.
Query is:
SELECT *
FROM T_INV_DTL T
LEFT JOIN (
SELECT inv_dtl_id,
Employee_id AS emp_id,
GROUP_CONCAT(DISTINCT Employee_id) AS Employee_id
FROM T_INV_INVESTIGATOR
GROUP BY
inv_dtl_id
)TII
ON T.inv_dtl_id = TII.inv_dtl_id
JOIN T_INVESTIGATION TI
ON T.inv_id = TI.inv_id
LEFT JOIN (
SELECT inv_dtl_id
FROM T_INV_BILL
GROUP BY
inv_dtl_id
)TIB
ON T.inv_dtl_id = TIB.inv_dtl_id
JOIN T_Insurance_company TIC
ON TI.client_id = TIC.ins_cmp_id
WHERE T.Report_dt != '0000-00-00'
AND (
T.inv_dtl_id NOT IN (SELECT inv_dtl_id
FROM T_INV_BILL TIBS
WHERE TIBS.inv_dtl_id NOT IN (SELECT
inv_dtl_id
FROM
T_INV_BILL
WHERE
Bill_submitted_dt =
'0000-00-00'))
)
ORDER BY
Allotment_dt DESC
LIMIT 20
Can anyone tell me the problem, and can you please modify this into a more efficient query? (For instance, if there are more than 100 records and we take a count of them for pagination, it should be faster.)
T_INV_DTL is the main table and it connects to the others. My problem is that each entry in T_INV_DTL can have multiple investigation bills in T_INV_BILL, and Report_dt is in T_INV_DTL. The outcome I need: rows where there is a report date in T_INV_DTL and at least one bill in T_INV_BILL with no bill-submitted date (if every bill has a bill-submitted date, the row is not needed).
While I admittedly don't know what issues you're having (please provide additional info), your query does look like it could be optimized.
Moving your WHERE criteria into the joins should save two of your table scans:
SELECT *
FROM T_INV_DTL T
LEFT JOIN (
SELECT inv_dtl_id,
Employee_id AS emp_id,
GROUP_CONCAT(DISTINCT Employee_id) AS Employee_id
FROM T_INV_INVESTIGATOR
GROUP BY
inv_dtl_id
)TII
ON T.inv_dtl_id = TII.inv_dtl_id
JOIN T_INVESTIGATION TI
ON T.inv_id = TI.inv_id
LEFT JOIN (
SELECT inv_dtl_id
FROM T_INV_BILL
WHERE Bill_submitted_dt != '0000-00-00'
GROUP BY inv_dtl_id
)TIB
ON T.inv_dtl_id = TIB.inv_dtl_id
JOIN T_Insurance_company TIC
ON TI.client_id = TIC.ins_cmp_id
WHERE T.Report_dt != '0000-00-00'
AND TIB.inv_dtl_id IS NULL
ORDER BY
Allotment_dt DESC
LIMIT 20