Sql efficient query from multiple tables - mysql

I've two tables tbl_data and tbl_user_data
Structure of tbl_data
id (int) (primary)
names (varchar)
dept_id (int)
Structure of tbl_user_data:
id (int) (primary)
user_id (int)
names_id (int)
tbl_user_data.names_id is a foreign key referencing tbl_data.id.
I have a situation where I have to pick 25 random entries from tbl_data that have not already been served to a particular user. So I created tbl_user_data, which stores the user_id and names_id (the tbl_data entries already served).
I'm a bit confused about how to query this. Is there any other way to do this efficiently?
Note: tbl_data has more than 5 million entries.
So far I've written this, but it doesn't seem right.
SELECT td.names, td.dept_id
FROM tbl_data AS td
LEFT JOIN tbl_user_data AS tud ON td.id = tud.names_id
WHERE tud.user_id !=2
ORDER BY RAND( ) LIMIT 25

Two things:
First ... you need the LEFT JOIN .... IS NULL pattern to pick out your not-yet-served items. You'll need to mention the user id in the ON clause to get this to work correctly.
SELECT td.names, td.dept_id
FROM tbl_data AS td
LEFT JOIN tbl_user_data AS tud ON td.id = tud.names_id
AND tud.user_id = 2
WHERE tud.id IS NULL
ORDER BY RAND( ) LIMIT 25
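To see why the user id has to go in the ON clause, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs anywhere; the sample rows are invented). It compares the question's original filter with the LEFT JOIN ... IS NULL pattern.

```python
import sqlite3

# SQLite stands in for MySQL; the anti-join pattern itself is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_data (id INTEGER PRIMARY KEY, names TEXT, dept_id INTEGER);
    CREATE TABLE tbl_user_data (id INTEGER PRIMARY KEY, user_id INTEGER, names_id INTEGER);
    INSERT INTO tbl_data VALUES (1, 'a', 10), (2, 'b', 10), (3, 'c', 20), (4, 'd', 20);
    -- user 2 was served items 1 and 3; user 7 was served item 2; nobody got item 4
    INSERT INTO tbl_user_data (user_id, names_id) VALUES (2, 1), (2, 3), (7, 2);
""")

def unserved_ids(user_id):
    """LEFT JOIN ... IS NULL: items with no 'served' row for this user."""
    rows = conn.execute("""
        SELECT td.id
        FROM tbl_data AS td
        LEFT JOIN tbl_user_data AS tud
               ON td.id = tud.names_id AND tud.user_id = ?
        WHERE tud.id IS NULL
        ORDER BY td.id
    """, (user_id,)).fetchall()
    return [r[0] for r in rows]

def question_filter_ids(user_id):
    """The question's filter: keeps only items served to some OTHER user."""
    rows = conn.execute("""
        SELECT td.id
        FROM tbl_data AS td
        LEFT JOIN tbl_user_data AS tud ON td.id = tud.names_id
        WHERE tud.user_id != ?
        ORDER BY td.id
    """, (user_id,)).fetchall()
    return [r[0] for r in rows]

print(unserved_ids(2))         # items never served to user 2
print(question_filter_ids(2))  # misses item 4, which nobody has been served
```

The second function drops item 4 because `NULL != 2` is not true, so never-served rows are filtered out.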
Second, ORDER BY RAND() LIMIT ... is a notoriously poor performer on a large table. It has to select the entire table, then sort it, then discard all except 25 items from it. That's massively wasteful and will never perform decently.
You can make it a little less wasteful by sorting just the id values, then using them to get the other information.
This gets your 25 random ID values.
SELECT td.id
FROM tbl_data AS td
LEFT JOIN tbl_user_data AS tud ON td.id = tud.names_id
AND tud.user_id = 2
WHERE tud.id IS NULL
ORDER BY RAND( )
LIMIT 25
This gets your names and dept_id values.
SELECT a.names, a.dept_id
FROM tbl_data AS a
JOIN (
SELECT td.id
FROM tbl_data AS td
LEFT JOIN tbl_user_data AS tud ON td.id = tud.names_id
AND tud.user_id = 2
WHERE tud.id IS NULL
ORDER BY RAND( )
LIMIT 25
) b ON a.id = b.id
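A runnable sketch of the same "pick random ids first, then join back" query, again with sqlite3 standing in for MySQL (so `RANDOM()` replaces `RAND()`). The 100 sample rows and the extra `a.id` column in the output are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_data (id INTEGER PRIMARY KEY, names TEXT, dept_id INTEGER);
    CREATE TABLE tbl_user_data (id INTEGER PRIMARY KEY, user_id INTEGER, names_id INTEGER);
""")
# 100 invented items; user 2 has already been served items 1..40.
conn.executemany("INSERT INTO tbl_data VALUES (?, ?, ?)",
                 [(i, f"name{i}", i % 5) for i in range(1, 101)])
conn.executemany("INSERT INTO tbl_user_data (user_id, names_id) VALUES (?, ?)",
                 [(2, i) for i in range(1, 41)])

def random_unserved(user_id, n=25):
    """Sort only the id column randomly, then fetch the wide columns by id."""
    return conn.execute("""
        SELECT a.id, a.names, a.dept_id
        FROM tbl_data AS a
        JOIN (
            SELECT td.id
            FROM tbl_data AS td
            LEFT JOIN tbl_user_data AS tud
                   ON td.id = tud.names_id AND tud.user_id = ?
            WHERE tud.id IS NULL
            ORDER BY RANDOM()   -- RAND() in MySQL
            LIMIT ?
        ) AS b ON a.id = b.id
    """, (user_id, n)).fetchall()

batch = random_unserved(2)
print(len(batch), sorted(row[0] for row in batch)[:5])
```

The inner query still sorts every unserved id, so this only reduces the width of what is sorted, not the number of rows.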
But, it's still wasteful. You may want to build a randomized version of this tbl_data table, and then use it sequentially. You could re-randomize it once a day, with something like this.
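-- Rebuild the shuffled copy from scratch.
DROP TABLE IF EXISTS tbl_data_random;
CREATE TABLE tbl_data_random AS
SELECT *
FROM tbl_data
ORDER BY RAND();
-- Add an AUTO_INCREMENT column if you need a stable sequence to read from.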
That way you don't do the sort over and over again, just to discard the results. Instead, you randomize once in a while.
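One way the "randomize once, read sequentially" idea could look, sketched with sqlite3 in place of MySQL. The `pos` column, the `rebuild_random_table` and `next_batch` helpers, and the idea of remembering a per-user `last_pos` are all assumptions for illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_data (id INTEGER PRIMARY KEY, names TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO tbl_data VALUES (?, ?, ?)",
                 [(i, f"name{i}", i % 5) for i in range(1, 101)])

def rebuild_random_table(conn):
    """Shuffle once; pos (hypothetical column) records the shuffled order."""
    conn.executescript("""
        DROP TABLE IF EXISTS tbl_data_random;
        CREATE TABLE tbl_data_random (
            pos INTEGER PRIMARY KEY AUTOINCREMENT,
            id INTEGER, names TEXT, dept_id INTEGER
        );
        INSERT INTO tbl_data_random (id, names, dept_id)
        SELECT id, names, dept_id FROM tbl_data ORDER BY RANDOM();  -- RAND() in MySQL
    """)

def next_batch(conn, last_pos, size=25):
    """Read the next slice by primary key; no sorting of the whole table."""
    return conn.execute("""
        SELECT pos, id, names, dept_id
        FROM tbl_data_random
        WHERE pos > ?
        ORDER BY pos
        LIMIT ?
    """, (last_pos, size)).fetchall()

rebuild_random_table(conn)
first = next_batch(conn, 0)
second = next_batch(conn, first[-1][0])  # continue after the last pos seen
print(len(first), len(second))
```

With this layout each user only needs one stored number (the last `pos` served), at the cost that every user walks the same shuffled order until the table is rebuilt.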

As you're not selecting anything from tbl_user_data, you can use NOT EXISTS instead:
SELECT td.names, td.dept_id
FROM tbl_data AS td
where not exists (
select 1
from tbl_user_data AS tud
where td.id = tud.names_id
and tud.user_id = 2
)
ORDER BY RAND( ) LIMIT 25
An index on tbl_user_data(names_id, user_id) will help; tbl_data(id) is already indexed as the primary key.
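A small check of the subquery form, using sqlite3 as a stand-in for MySQL and invented sample rows. It shows that "not yet served to user 2" needs NOT EXISTS with `user_id = 2`, whereas EXISTS with `user_id != 2` answers a different question (served to somebody else).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_data (id INTEGER PRIMARY KEY, names TEXT, dept_id INTEGER);
    CREATE TABLE tbl_user_data (id INTEGER PRIMARY KEY, user_id INTEGER, names_id INTEGER);
    CREATE INDEX idx_tud_names_user ON tbl_user_data (names_id, user_id);
    INSERT INTO tbl_data VALUES (1, 'a', 10), (2, 'b', 10), (3, 'c', 20), (4, 'd', 20);
    -- user 2 got items 1 and 3; user 7 got items 1 and 2; nobody got item 4
    INSERT INTO tbl_user_data (user_id, names_id) VALUES (2, 1), (2, 3), (7, 1), (7, 2);
""")

NOT_EXISTS_SQL = """
    SELECT td.id
    FROM tbl_data AS td
    WHERE NOT EXISTS (
        SELECT 1 FROM tbl_user_data AS tud
        WHERE tud.names_id = td.id AND tud.user_id = ?
    )
    ORDER BY td.id
"""

EXISTS_OTHER_SQL = """
    SELECT td.id
    FROM tbl_data AS td
    WHERE EXISTS (
        SELECT 1 FROM tbl_user_data AS tud
        WHERE tud.names_id = td.id AND tud.user_id != ?
    )
    ORDER BY td.id
"""

not_served = [r[0] for r in conn.execute(NOT_EXISTS_SQL, (2,))]
served_to_others = [r[0] for r in conn.execute(EXISTS_OTHER_SQL, (2,))]
print(not_served)        # what the question asks for
print(served_to_others)  # includes item 1, which user 2 has already seen
```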

Create an index on names_id and user_id. Why is user_id varchar?
If it needs to be varchar and the values are very long, create a prefix index on user_id.
You can use EXPLAIN to see which index your query uses.

Related

Conditioning a column that's being selected by subquery

I have a following query
select distinct `orders`.`id`, `order_status`,
(SELECT `setting_id` FROM `settings` WHERE `settings`.`order_id` = `orders`.`id`) AS `setting_id`,
from `orders` order by orders.id desc limit 100 OFFSET 0
which works just fine, the column "setting_id" is being fetched as it should but when I add another WHERE
and `setting_id` = 2
it cannot find it and outputs Unknown column 'setting_id' in 'where clause'. Why?
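I would change your query to a JOIN rather than selecting the column with a subquery. The subquery is slower because it has to run for every row. (The error itself occurs because a column alias defined in the SELECT list cannot be referenced in the WHERE clause.) Now, if you only want rows with setting_id = 2, just add that to the JOIN condition: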
select distinct
o.id,
o.order_status,
s.setting_id
from
orders o
join settings s
on o.id = s.order_id
AND s.setting_id = 2
order by
o.id desc
limit
100 OFFSET 0
(it was also failing because you had an extra comma after your select setting column).
You could also reverse the join by starting with your settings table, so you are only concerned with the rows that have a setting_id of 2, and then find the matching orders. I would ensure that your settings table has an index on ( setting_id, order_id ) to optimize the query.
select distinct
s.order_id id,
o.order_status,
s.setting_id
from
settings s
JOIN orders o
on s.order_id = o.id
where
s.setting_id = 2
order by
s.order_id desc
limit
100 OFFSET 0
With the index suggested above, the matching settings rows can be found directly from the index, so this should be very fast.
Another consideration for an apparently large table is to limit how many days back you query orders to find your 100. Go back 1 month? 2? 15 days? How large is your orders table that it drags for 10 seconds? That may be a better choice for you.
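A quick sketch of the settings-first join, run against sqlite3 instead of MySQL with a few invented rows. The index name `idx_settings_setting_order` is hypothetical; only its column order follows the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_status TEXT);
    CREATE TABLE settings (order_id INTEGER, setting_id INTEGER);
    CREATE INDEX idx_settings_setting_order ON settings (setting_id, order_id);
    INSERT INTO orders VALUES (1, 'new'), (2, 'paid'), (3, 'shipped'), (4, 'new');
    INSERT INTO settings VALUES (1, 2), (2, 5), (3, 2);
""")

def orders_with_setting(setting_id, limit=100, offset=0):
    """Start from settings, filter on setting_id, then look up the order."""
    return conn.execute("""
        SELECT DISTINCT s.order_id AS id, o.order_status, s.setting_id
        FROM settings AS s
        JOIN orders AS o ON s.order_id = o.id
        WHERE s.setting_id = ?
        ORDER BY s.order_id DESC
        LIMIT ? OFFSET ?
    """, (setting_id, limit, offset)).fetchall()

rows = orders_with_setting(2)
print(rows)
```

Order 4 has no settings row and order 2 has a different setting, so neither appears in the result.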

MySQL: combination of LEFT JOIN and ORDER BY is slow

There are two tables: posts (~5,000,000 rows) and relations (~8,000 rows).
posts columns:
-------------------------------------------------
| id | source_id | content | date (int) |
-------------------------------------------------
relations columns:
---------------------------
| source_id | user_id |
---------------------------
I wrote a MySQL query for getting 10 most recent rows from posts which are related to a specific user:
SELECT p.id, p.content
FROM posts AS p
LEFT JOIN relations AS r
ON r.source_id = p.source_id
WHERE r.user_id = 1
ORDER BY p.date DESC
LIMIT 10
However, it takes ~30 seconds to execute it.
I already have indexes at relations for (source_id, user_id), (user_id) and for (source_id), (date), (date, source_id) at posts.
EXPLAIN results:
How can I optimize the query?
Your WHERE clause renders your outer join a mere inner join (because in an outer-joined pseudo record user_id will always be null, never 1).
If you really want this to be an outer join then it is completely superfluous, because every record in posts either has or has not a match in relations of course. Your query would then be
select id, content
from posts
order by `date` desc limit 10;
If you don't want this to be an outer join really, but want a match in relations, then we are talking about existence in a table, an EXISTS or IN clause hence:
select id, content
from posts
where source_id in
(
select source_id
from relations
where user_id = 1
)
order by `date` desc
limit 10;
There should be an index on relations(user_id, source_id) - in this order, so we can select user_id 1 first and get an array of all desired source_id which we then look up.
Of course you also need an index on posts(source_id) which you probably have already, as source_id is an ID. You can even speed things up with a composite index posts(source_id, date, id, content), so the table itself doesn't have to be read anymore - all the information needed is in the index already.
UPDATE: Here is the related EXISTS query:
select id, content
from posts p
where exists
(
select *
from relations r
where r.user_id = 1
and r.source_id = p.source_id
)
order by `date` desc
limit 10;
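The claim that the WHERE clause turns the outer join into an inner join can be checked on a toy data set. This is a sketch with sqlite3 standing in for MySQL and invented rows; the index names are hypothetical, and the column orders follow the suggestions in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, source_id INTEGER, content TEXT, date INTEGER);
    CREATE TABLE relations (source_id INTEGER, user_id INTEGER);
    CREATE INDEX idx_relations_user_source ON relations (user_id, source_id);
    CREATE INDEX idx_posts_source_date ON posts (source_id, date);
    INSERT INTO posts VALUES
        (1, 10, 'p1', 100), (2, 20, 'p2', 200), (3, 10, 'p3', 300),
        (4, 30, 'p4', 400), (5, 20, 'p5', 500);
    -- user 1 follows sources 10 and 30; user 2 follows source 20
    INSERT INTO relations VALUES (10, 1), (20, 2), (30, 1);
""")

LEFT_JOIN_SQL = """
    SELECT p.id FROM posts AS p
    LEFT JOIN relations AS r ON r.source_id = p.source_id
    WHERE r.user_id = ?
    ORDER BY p.date DESC LIMIT 10
"""
IN_SQL = """
    SELECT id FROM posts
    WHERE source_id IN (SELECT source_id FROM relations WHERE user_id = ?)
    ORDER BY date DESC LIMIT 10
"""
EXISTS_SQL = """
    SELECT id FROM posts AS p
    WHERE EXISTS (
        SELECT 1 FROM relations AS r
        WHERE r.user_id = ? AND r.source_id = p.source_id
    )
    ORDER BY date DESC LIMIT 10
"""

def ids(sql, user_id):
    return [row[0] for row in conn.execute(sql, (user_id,))]

results = {name: ids(sql, 1) for name, sql in
           [("left_join", LEFT_JOIN_SQL), ("in", IN_SQL), ("exists", EXISTS_SQL)]}
print(results)
```

All three forms return the same posts here; the join form would repeat a post if relations held duplicate (source_id, user_id) pairs, which IN and EXISTS avoid.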
You could put an index on the date column of the posts table; I believe that will help the ORDER BY speed.
You could also try reducing the number of results before ordering with some additional WHERE conditions. For example, if you know that there will likely be ten records with the correct user_id today, you could limit the date to just today (or N days back, depending on your actual data).
Try This
SELECT p.id, p.content FROM posts AS p
WHERE p.source_id IN (SELECT source_id FROM relations WHERE user_id = 1)
ORDER BY p.date DESC
LIMIT 10
I'd consider the following :-
Firstly, you only want the 10 most recent rows from posts which are related to a user. So, an INNER JOIN should do just fine.
SELECT p.id, p.content
FROM posts AS p
JOIN relations AS r
ON r.source_id = p.source_id
WHERE r.user_id = 1
ORDER BY p.date DESC
LIMIT 10
The LEFT JOIN is only needed if you also want to fetch the records that do not have a relations mapping. Doing the LEFT JOIN can result in a full table scan of the left table, which as per your info contains ~5,000,000 rows. This could be the root cause of your slow query.
For further optimisation, consider moving the WHERE clause into the ON clause.
SELECT p.id, p.content
FROM posts AS p
JOIN relations AS r
ON (r.source_id = p.source_id AND r.user_id = 1)
ORDER BY p.date DESC
LIMIT 10
I would try with a composite index on relations :
INDEX source_user (user_id,source_id)
and change the query to this :
SELECT p.id, p.content
FROM posts AS p
INNER JOIN relations AS r
ON ( r.user_id = 1 AND r.source_id = p.source_id )
ORDER BY p.date DESC
LIMIT 10

SQL select ... in (select...) taking long time

I have a table of items that users have bought, within that table there is a category identifier. So, I want to show users other items from the same table, with the same categories they have already bought from.
The query I'm trying is taking over 22 seconds to run and the main items table is not even 3000 lines... Why so inefficient? Should I index, if so which columns?
Here's the query:
select * from items
where category in (
select category from items
where ((user_id = '63') AND (category <> '0'))
group by category
)
order by dateadded desc limit 20
Here is a query. Also be sure to add an index on category, user_id, and dateadded.
select i1.*
from items i1
inner join
(select distinct
category
from items
where ((user_id = '63') AND (category <> '0'))
) i2 on (i1.Category=i2.Category)
order by i1.dateadded desc limit 20
Appropriate places to put an index if necessary would be on dateadded, user_id and/or category
Try using self join for better performance as:
select i1.* from items i1 JOIN items i2 on i1.category= i2.category
where i2.user_id = '63' AND i2.category <> '0'
group by i2.category
order by i1.dateadded desc limit 20
A join is often much faster than a nested subquery, especially on older MySQL versions.
EDIT: Try without group by as :
select i1.* from items i1 JOIN items i2 on i1.category= i2.category
where i2.user_id = '63' AND i2.category <> '0'
order by i1.dateadded desc limit 20
If you index on category and possibly dateadded, it should speed things up a bit.
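To compare the rewrites in this thread, here is a sketch on invented data with sqlite3 standing in for MySQL (index names are hypothetical). It runs the derived-table join with DISTINCT and the plain self join without GROUP BY, and shows that the plain self join can repeat rows when the user owns several items in one category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER,
                        category TEXT, dateadded INTEGER);
    CREATE INDEX idx_items_user_category ON items (user_id, category);
    CREATE INDEX idx_items_category_date ON items (category, dateadded);
    INSERT INTO items VALUES
        (1, 63, 'books', 1), (2, 63, 'music', 2), (3, 7, 'books', 3),
        (4, 7, 'games', 4),  (5, 8, 'music', 5),  (6, 63, '0', 6),
        (7, 9, '0', 7),      (8, 63, 'books', 8);
""")

DISTINCT_JOIN_SQL = """
    SELECT i1.id
    FROM items AS i1
    JOIN (SELECT DISTINCT category
          FROM items
          WHERE user_id = ? AND category <> '0') AS i2
      ON i1.category = i2.category
    ORDER BY i1.dateadded DESC LIMIT 20
"""
PLAIN_SELF_JOIN_SQL = """
    SELECT i1.id
    FROM items AS i1
    JOIN items AS i2 ON i1.category = i2.category
    WHERE i2.user_id = ? AND i2.category <> '0'
    ORDER BY i1.dateadded DESC LIMIT 20
"""

distinct_ids = [r[0] for r in conn.execute(DISTINCT_JOIN_SQL, (63,))]
plain_ids = [r[0] for r in conn.execute(PLAIN_SELF_JOIN_SQL, (63,))]
print(distinct_ids)  # one row per matching item
print(plain_ids)     # 'books' items appear twice: user 63 owns two books
```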

speed up mysql query with multiple subqueries

I'm wondering if there's a way to speed up a mysql query which is ordered by multiple subqueries.
On a music related site users can like different things like artists, songs, albums etc. These "likes" are all stored in the same table. Now I want to show a list of artists ordered by the number of "likes" by the users friends and all users. I want to show all artists, also those who have no likes at all.
I have the following query:
SELECT `artists`.*,
-- friend likes
(SELECT COUNT(*)
FROM `likes`
WHERE like_type = 'artist'
AND like_id = artists.id
AND user_id IN (1,2,3,4, etc) -- ids of friends
GROUP BY like_id
) AS `friend_likes`,
-- all likes
(SELECT COUNT(*)
FROM `likes`
WHERE like_type = 'artist'
AND like_id = artists.id
GROUP BY like_id
) AS `all_likes`
FROM artists
ORDER BY
friend_likes DESC,
all_likes DESC,
artists.name ASC
The query takes ± 1.5 seconds on an artist table with 2000 rows. I'm afraid that this will take longer and longer as the table gets bigger. I tried using JOINs but can't seem to get this working because the subqueries contain WHERE clauses.
Any ideas in the right direction would be greatly appreciated!
Try using JOINs instead of subqueries:
SELECT
a.*, -- do you really need all this?
count(l.user_id) AS all_likes,
coalesce(sum(l.user_id IN (1, 2, 3, 4)), 0) AS friend_likes
FROM artists a
LEFT JOIN likes l
ON l.like_type = 'artist' AND l.like_id = a.id
GROUP BY a.id
ORDER BY
friend_likes DESC,
all_likes DESC,
a.name ASC;
If this doesn't make the query faster, try adding indexes, or consider selecting fewer columns.
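A runnable sketch of the single-pass LEFT JOIN with conditional aggregation, using sqlite3 in place of MySQL. The sample rows and the `ranked_artists` helper are invented for illustration; COALESCE is there so artists with no likes get 0 rather than NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE likes (user_id INTEGER, like_type TEXT, like_id INTEGER);
    CREATE INDEX idx_likes_type_id ON likes (like_type, like_id);
    INSERT INTO artists VALUES (1, 'Abba'), (2, 'Beck'), (3, 'Cher');
    INSERT INTO likes VALUES
        (1, 'artist', 1), (9, 'artist', 1), (8, 'artist', 1),
        (1, 'artist', 2), (2, 'artist', 2),
        (1, 'song', 3);   -- a song like; must not count for artist 3
""")

def ranked_artists(friend_ids):
    """One pass over the join: count all likes and friend likes per artist."""
    placeholders = ",".join("?" * len(friend_ids))
    sql = f"""
        SELECT a.name,
               COUNT(l.user_id) AS all_likes,
               COALESCE(SUM(l.user_id IN ({placeholders})), 0) AS friend_likes
        FROM artists AS a
        LEFT JOIN likes AS l
               ON l.like_type = 'artist' AND l.like_id = a.id
        GROUP BY a.id
        ORDER BY friend_likes DESC, all_likes DESC, a.name ASC
    """
    return conn.execute(sql, list(friend_ids)).fetchall()

ranking = ranked_artists([1, 2])
print(ranking)
```

Because the `like_type` condition lives in the ON clause rather than in WHERE, artists without any artist likes still show up with zero counts.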
You need to break this down a bit to see where the time goes. You're absolutely right that 1.5 sec on 2000 rows won't scale well. I suspect you need to look at indexes and foreign-key relationships. Look at each count/group-by query individually to get them tuned as best you can then recombine.
Try rolling everything up into one query using an inline IF(), so you go through the table/join ONCE. Note that the inner join below leaves out artists that have no likes at all.
SELECT STRAIGHT_JOIN
artists.*
, LikeCounts.AllCount
, LikeCounts.FriendLikeCount
FROM
(SELECT
like_id
, count(*) AllCount
, sum( If( user_id in ( 1, 2, 3, 4 ), 1, 0 ) ) as FriendLikeCount
FROM
likes
WHERE
like_type = 'artist'
GROUP BY
like_id ) LikeCounts
JOIN artists ON LikeCounts.like_id = artists.id
ORDER BY
LikeCounts.FriendLikeCount DESC
, LikeCounts.AllCount DESC
, artists.name ASC

mysql union limit problem

I want a paging script to work properly, but the situation is a bit complex. I need to pick data from the union of two SQL queries (see the query below). I have a table book and a table bookvisit. What I want is to show all books for a particular category in order of popularity. I get the data for all books with at least one visit by joining book and bookvisit, and then union it with all books that have no visits.
Everything works fine, but when I try to do paging, I need to limit it like (0,10) (10,10) (20,10) (30,10), correct? If I have 9 books in bookvisit for that category and 3761 books without any visit for that category (a total of 3770 books), it should list 377 pages with 10 books on each page. But it does not show any data for some pages, because it tries to show books with limit 3760,10 and hence finds no records for the second query in the union. Maybe I haven't explained the situation clearly, but if you think about it a bit, you will get my point.
SELECT * FROM (
SELECT * FROM (
SELECT viewcount, b.isbn, booktitle, stock_status, price, description FROM book AS b
INNER JOIN bookvisit AS bv ON b.isbn = bv.isbn WHERE b.price <> 0 AND hcategoryid = '25'
ORDER BY viewcount DESC
LIMIT 10, 10
) AS t1
UNION
SELECT * FROM
(
SELECT viewcount, b.isbn, booktitle, stock_status, price, description FROM book AS b
LEFT JOIN bookvisit AS bv ON b.isbn = bv.isbn WHERE b.price <> 0 AND hcategoryid = '25'
AND viewcount IS NULL
ORDER BY viewcount DESC
LIMIT 10, 10
) AS t2
)
AS qry
ORDER BY viewcount DESC
LIMIT 10
Do not use LIMIT on the separate queries. Use LIMIT only at the end. You want to get the whole result set from the 2 queries and then show only the 10 results that you need, no matter whether this is LIMIT 0, 10 or LIMIT 3760, 10.
SELECT * FROM (
SELECT * FROM (
SELECT viewcount, b.isbn, booktitle, stock_status, price, description FROM book AS b
INNER JOIN bookvisit AS bv ON b.isbn = bv.isbn WHERE b.price <> 0 AND hcategoryid = '25'
ORDER BY viewcount DESC
) AS t1
UNION
SELECT * FROM
(
SELECT viewcount, b.isbn, booktitle, stock_status, price, description FROM book AS b
LEFT JOIN bookvisit AS bv ON b.isbn = bv.isbn WHERE b.price <> 0 AND hcategoryid = '25'
AND viewcount IS NULL
ORDER BY viewcount DESC
) AS t2
)
AS qry
ORDER BY viewcount DESC
LIMIT 10, 10
An old one, but still relevant.
Basically, performance-wise, you have to use LIMIT on each query involved in the UNION. If you know there will be no duplicates between the result sets, you should also consider using UNION ALL, again for performance. Then if you need, let's say, LIMIT 100, 20, you limit each query to 120 rows (OFFSET + LIMIT). You always fetch up to twice as many records as you need, but not all of them. Each inner query needs its own ORDER BY for its LIMIT to pick the right rows.
SELECT [fields] FROM
(
(SELECT [fields] FROM ... LIMIT 10)
UNION ALL
(SELECT [fields] FROM ... LIMIT 10)
) query
LIMIT 0, 10
5th page
SELECT [fields] FROM
(
(SELECT [fields] FROM ... LIMIT 50)
UNION ALL
(SELECT [fields] FROM ... LIMIT 50)
) query
LIMIT 40, 10
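The per-branch limit of OFFSET + LIMIT can be sketched as follows, with sqlite3 standing in for MySQL. The simplified book and bookvisit tables, the sample data, and the `fetch_page` helper are assumptions for illustration; SQLite needs each limited branch wrapped in a subquery where MySQL accepts parentheses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (isbn TEXT PRIMARY KEY, booktitle TEXT);
    CREATE TABLE bookvisit (isbn TEXT, viewcount INTEGER);
""")
# 35 invented books; only the first 9 have visits.
conn.executemany("INSERT INTO book VALUES (?, ?)",
                 [(f"isbn{i:03d}", f"title {i}") for i in range(1, 36)])
conn.executemany("INSERT INTO bookvisit VALUES (?, ?)",
                 [(f"isbn{i:03d}", i * 10) for i in range(1, 10)])

def fetch_page(page, per_page=10):
    """Each branch fetches at most offset + per_page rows; the outer query pages."""
    offset = (page - 1) * per_page
    inner_limit = offset + per_page
    return conn.execute("""
        SELECT isbn, viewcount FROM (
            SELECT * FROM (
                SELECT b.isbn AS isbn, bv.viewcount AS viewcount
                FROM book AS b
                JOIN bookvisit AS bv ON b.isbn = bv.isbn
                ORDER BY bv.viewcount DESC, b.isbn
                LIMIT ?
            )
            UNION ALL
            SELECT * FROM (
                SELECT b.isbn AS isbn, NULL AS viewcount
                FROM book AS b
                LEFT JOIN bookvisit AS bv ON b.isbn = bv.isbn
                WHERE bv.isbn IS NULL
                ORDER BY b.isbn
                LIMIT ?
            )
        ) AS q
        ORDER BY viewcount IS NULL, viewcount DESC, isbn
        LIMIT ? OFFSET ?
    """, (inner_limit, inner_limit, per_page, offset)).fetchall()

page1 = fetch_page(1)
page4 = fetch_page(4)
print(page1[0], page1[-1], len(page4))
```

The inner limits grow with the page number, so deep pages still read many rows; the approach only avoids materializing the full union on early pages.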
A decade after this question was asked, I can offer a solution, one that perhaps seems obvious to anyone familiar with views: instead of attempting a nested select statement to combine the two tables, use CREATE VIEW (or CREATE OR REPLACE VIEW) to combine the two tables into a view. The speed performance may be poor, as the tables will have to be combined for every page access and may have to be recombined for every pagination, depending on how your code is arranged, but it will work.
If you run into SQL user permissions issues that you and your sysadmin cannot solve, my best advice is to create a new user with full permissions, assign the new user to the table, and use the new user to create the views. That was the only thing that worked for me.