Rails: Performance issue with joining of records - mysql

I have the following setup with ActiveRecord and MySQL:
User has many groups through memberships
Group has many users through memberships
There is also an index by group_id and user_id described in schema.rb:
add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree
3 different queries:
User.where(id: Membership.uniq.pluck(:user_id))
(3.8ms) SELECT DISTINCT memberships.user_id FROM memberships
User Load (11.0ms) SELECT users.* FROM users WHERE users.id IN (1, 2...)
User.where(id: Membership.uniq.select(:user_id))
User Load (15.2ms) SELECT users.* FROM users WHERE users.id IN (SELECT DISTINCT memberships.user_id FROM memberships)
User.uniq.joins(:memberships)
User Load (135.1ms) SELECT DISTINCT users.* FROM users INNER JOIN memberships ON memberships.user_id = users.id
What is the best approach for doing this? Why the query with join is much slower?

The first query is bad because it sucks all of the user ids into a Ruby array and then sends them back to the database. If you have a lot of users, that's a huge array and a huge amount of bandwidth, plus 2 roundtrips to the database instead of one. Furthermore, the database has no way to efficiently handle that huge array.
The second and third approaches are both efficient database-driven solutions (one is a subquery, and one is a join), but you need to have the proper index. You need an index on the memberships table on user_id.
add_index :memberships, :user_id
The index that you already have, would only be helpful if you wanted to find all of the users that belong to a particular group.
Update:
If you have a lot of columns and data in your users table, the DISTINCT users.* in the 3rd query is going to be fairly slow because MySQL has to compare a lot of data in order to ensure uniqueness.
To be clear: this is not intrinsic slowness with JOIN, it's slowness with DISTINCT. For example: Here is a way to avoid the DISTINCT and still use a JOIN:
SELECT users.* FROM users
INNER JOIN (SELECT DISTINCT memberships.user_id FROM memberships) AS user_ids
ON user_ids.user_id = users.id;
Given all of that, in this case, I believe the 2nd query is going to be the best approach for you. The 2nd query should be even faster than reported in your original results if you add the above index. Please retry the second approach, if you haven't done so yet since adding the index.
Although the 1st query has some slowness issues of its own, from your comment, it's clear that it is still faster than the 3rd query (at least, for your particular dataset). The trade-offs of these approaches is going to depend on your particular dataset in regards to how many users you have and how many memberships you have. Generally speaking, I believe the 1st approach is still the worst even if it ends up being faster.
Also, please note that the index I'm recommending is particularly designed for the three queries you listed in your question. If you have other kinds of queries against these tables, you may be better served by additional indexes, or possibly multi-column indexes, as #tata mentioned in his/her answer.

The query with join is slow because it loads all columns from database despite of the fact that rails don't preload them this way. If you need preloading then you should use includes (or similar) instead. But includes will be even slower because it will construct objects for all associations. Also you should know that
User.where.not(id: Membership.uniq.select(:user_id)) will return empty set in case when there is at least one membership with user_id equal to nil while the query with pluck will return the correct relation.

Below is more efficient solution:
User.exists?(id: Membership.uniq.pluck(:user_id))
join will fetch all the columns from membership table , so it will take more time while in the other queries. Here, you are only fetching rhe user_id from memberships. Calling distinct from users will slow down the query.

I think that you have a problem with the declaration of your indexes.
you declared an index as:
add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree
If your primary key was ["user_id","group_id"] - you were good to go, but....
Making this in rails is not so trivial.
Therefore in order to query the data with JOIN with Users table - you need to have 2 indexes:
add_index "memberships", ["user_id", "group_id" ]
This is because of the way MySQL handles indexes (they are treated as concatenated strings)
You can read more about it here Multiple-Column Indexes
There are also other techniques to make it faster dependant on all your cases, but the suggested one is the simple one with ActiveRecord
Furthermore - I don't think that you need the .uniq here as the result should be unique anyway because of the terms on the table.
Adding .uniq can make the MySQL to perform unnecessary sorting with filesort and usually it will also put a temporary table on disk.
You can run the command generated by rails directly on the mysql to check it with EXPLAIN
EXPLAIN <your command goes here>

#bublik42 and #user3409950 if I have to chose the Production environment Query then I would go for the First one:
User.where(id: Membership.uniq.pluck(:user_id))
Reason: Because it will use sql DISTINCT keyword to filter out the database result and then SELECT only 'user_id' column from the databse and return those values in a array form([1,2,3..]).
Database level filtration of result is always faster than Active record query object.
For your second query:
User.where(id: Membership.uniq.select(:user_id))
It is same query as with the 'pluck' but with 'select' it will make a active record relation object with single field 'user_id'. In this query it has a overhead of building the active record object as: ([#<Membership user_id: 1>, #<Membership user_id: 2>, ... ], which was not the case for the first query. Though I haven't done any real bench marking for both, but the results are obvious with the steps followed by the queries.
Third case is expensive here because with 'Join' function It will fetch all the columns from memberships table and it will take more time to process the filtration of the result in comparison to other queries.
Thank you

SELECT DISTINCT users.*
FROM users
INNER JOIN memberships
ON memberships.user_id = users.id
is slower because it is performed something like this:
Go through all of one table, collecting stuff as it goes.
for each entry from step 1, reach into the other table.
put that stuff into a tmp table
dedup (DISTINCT) that table to deliver the results
If there are 1000 users and each has 100 memberships, then the table in step 3 will have 100000 rows, even though the answer will have only 1000 rows.
This is a "semi-join" and only checks that the user has at least one membership; it is much more efficient:
SELECT users.*
FROM users -- no DISTINCT needed
WHERE EXISTS
( SELECT *
FROM memberships ON memberships.user_id = users.id
)
If you don't really need that check, then this would be still faster:
SELECT users.*
FROM users
If Rails can't generate these queries, then grumble at it.

Here is a great example, demonstrating Include VS Join :
http://railscasts.com/episodes/181-include-vs-joins
Please try with includes. I'm damn sure. It will take comparatively less time.
User.uniq.includes(:memberships)

Related

How to make a faster query when joining multiple huge tables?

I have 3 tables. All 3 tables have approximately 2 million rows. Everyday 10,000-100,000 new entries are entered. It takes approximately 10 seconds to finish the sql statement below. Is there a way to make this sql statement faster?
SELECT customers.name
FROM customers
INNER JOIN hotels ON hotels.cus_id = customers.cus_id
INNER JOIN bookings ON bookings.book_id = customers.book_id
WHERE customers.gender = 0 AND
customers.cus_id = 3
LIMIT 25 OFFSET 1;
Of course this statement works fine, but its slow. Is there a better way to write this code?
All database servers have a form of an optimization engine that is going to determine how best to grab the data you want. With a simple query such as the select you showed, there isn't going to be any way to greatly improve performance within the SQL. As others have said sub-queries won't helps as that will get optimized into the same plan as joins.
Reduce the number of columns, add indexes, beef up the server if that's an option.
Consider caching. I'm not a mysql expert but found this article interesting and worth a skim. https://www.percona.com/blog/2011/04/04/mysql-caching-methods-and-tips/
Look at the section on summary tables and consider if that would be appropriate. Does pulling every hotel, customer, and booking need to be up-to-the-minute or would inserting this into a summary table once an hour be fine?
A subquery don't help but a proper index can improve the performance so be sure you have proper index
create index idx1 on customers(gender , cus_id,book_id, name )
create index idex2 on hotels(cus_id)
create index idex3 on hotels(book_id)
I find it a bit hard to believe that this is related to a real problem. As written, I would expect this to return the same customer name over and over.
I would recommend the following indexes:
customers(cus_id, gender, book_id, name)
hotels(cus_id)
bookings(book_id)
It is really weird that bookings are not to a hotel.
First, these indexes cover the query, so the data pages don't need to be accessed. The logic is to start with the where clause and use those columns first. Then add additional columns from the on and select clauses.
Only one column is used for hotels and bookings, so those indexes are trivial.
The use of OFFSET without ORDER BY is quite suspicious. The result set is in indeterminate order anyway, so there is no reason to skip the nominally "first" value.

What is a "point-in-select" in MySQL?

I was given this query to update a report, and it was taking a long time to run on my computer.
select
c.category_type, t.categoryid, t.date, t.clicks
from transactions t
join category c
on c.category_id = t.categoryid
I asked the DBA if there were any issues with the query, and the DBA optimized the query in this manner:
select
(select category_type
from category c where c.category_id = t.categoryid) category_type,
categoryid,
date, clicks
from transactions t
He described the first subquery as a "point-in-select". I have never heard of this before. Can someone explain this concept?
I want to note that the two queries are not the same, unless the following is true:
transactions.categoryid is always present in category.
category has no duplicate values of category_id.
In practice, these would be true (in most databases). The first query should be using a left join version for closer equivalence:
select c.category_type, t.categoryid, t.date, t.clicks
from transactions t left join
category c
on c.category_id = t.categoryid;
Still not exactly the same, but more similar.
Finally, both versions should make use of an index on category(category_id), and I would expect the performance to be very similar in MySQL.
Your DBA's query is not the same, as others noted, and afaik nonstandard SQL. Yours is much preferable just for its simplicity alone.
It's usually not advantageous to re-write queries for performance. It can help sometimes, but the DBMS is supposed to execute logically equivalent queries equivalently. Failure to do so is a flaw in the query planner.
Performance issues are often a function of physical design. In your case, I would look for indexes on the category and transactions tables that contain categoryid as first column. If neither exist, your join is O(mn) because the category table must be scanned for each transaction row.
Not being a MySQL user, I can only advise you to get query planner output and look for indexing opportunities.

Get Relation of Followed or Public User

I am working on a social app. I have users that can have private accounts. Users can also follow each other. What is the fastest way using ActiveRecord or pure SQL to fetch all the records of a has_many on a User that either belong to someone I am following or belong to a public user. In pseudo code:
User.get_all_posts_for_users_being_followed_by(me) + User.get_all_posts_for_public_users
I have this:
SELECT `posts`.*
FROM `posts`
WHERE ( user_id IN (SELECT id
FROM users
WHERE visibility = 'all'
UNION
SELECT followable_id
FROM follows
WHERE followable_type = "User"
AND follower_type = "User"
AND follower_id = 4
AND follows.status = 1) )
But I was hoping there might be a faster way to handle that, or a way to do it with Rails query methods.
You can perform your clear query with ActiveRecord, but I recommend to use this pure version for_now, because it's very easy to modificate it now. You need pay attention on this:
The query might be faster, if you add indexes
add_index :users, :visibility, :name => 'visibility_ix'
Selecting all columns with the * wildcard will cause the query's meaning and behavior to change if the table's schema changes, and might cause the query to retrieve too much data.
IN() and NOT IN() subqueries are poorly optimized. MySQL executes the subquery as a dependent subquery for each row in the outer query. This is a frequent cause of serious performance problems in MySQL 5.5 and older versions. The query probably should be rewritten as a JOIN or a LEFT OUTER JOIN, respectively.

Why is this simple SQL query causing major lag on a simple 6k record table?

So I have 2 tables, one called user, and one called user_favorite. user_favorite stores an itemId and userId, for storing the items that the user has favorited. I'm simply trying to locate the users who don't have a record in user_favorite, so I can find those users who haven't favorited anything yet.
For testing purposes, I have 6001 records in user and 6001 in user_favorite, so there's just one record who doesn't have any favorites.
Here's my query:
SELECT u.* FROM user u
JOIN user_favorite fav ON u.id != fav.userId
ORDER BY id DESC
Here the id in the last statement is not ambigious, it refers to the id from the user table. I have a PK index on u.id and an index on fav.userId.
When I run this query, my computer just becomes unresponsive and completely freezes, with no output ever being given. I have 2gb RAM, not a great computer, but I think it should be able to handle a query like this with 6k records easily.
Both tables are in MyISAM, could that be the issue? Would switching to INNODB fix it?
Let's first discuss what your query (as written) is doing. Because of the != in the on-clause, you are joining every user record with every one of the other user's favorites. So your query is going to produce something like 36M rows. This is not going to give you the answer that you want. And it explains why your computer is unhappy.
How should you write the query? There are three main patterns you can use. I think this is a pretty good explanation: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/ and discusses performance specifically in the context of mysql. And it shows you how to look at and read an execution plan, which is critical to optimizing queries.
change your query to something like this:
select * from User
where not exists (select * from user_favorite where User.id = user_favorite.userId)
let me know how it goes
A join on A != B means that every record of A is joined with every record of B in which the id's aren't equal.
In other words, instead of producing 6000 rows, you're producing approximately 36 million (6000 * 6001) rows of output, which all have to be collected, then sorted...

Which is more efficient in mysql, a big join or multiple queries of single table?

I have a mysql database like this
Post – 500,000 rows (Postid,Userid)
Photo – 200,000 rows (Photoid,Postid)
About 50,000 posts have photos, average 4 each, most posts do not have photos.
I need to get a feed of all posts with photos for a userid, average 50 posts each.
Which approach would be more efficient?
1: Big Join
select *
from post
left join photo on post.postid=photo.postid
where post.userid=123
2: Multiple queries
select * from post where userid=123
while (loop through rows) {
select * from photo where postid=row[postid]
}
I've not tested this, but I very much suspect (at an almost cellular level) that a join would be vastly, vastly faster - what you're attempting is pretty much the reason why joins exist after all.
Additionally, there would be considerably less overhead in terms of scripting language <-> MySQL communications, etc. but I suspect that's somewhat of a mute factor.
The JOIN is always faster with proper indexing (as mentioned before) but several smaller queries may be more easily cached, provided of course that you are using the query cache. The more tables a query contains the greater the chances of more frequent invalidations.
As long as the parsing and optimization procedure, I believe MySQL maintains its own statistics internally and this usually happens once. What you are losing when executing multiple queries is the roundtrip time and the client buffering lag, which is small if the resultset is relatively small in size.
A join will be much faster.
Each separate query will need to be parsed, optimized and executed which takes quite long.
Just don't forget to create the following indexes:
post (userid)
photo (postid)
With proper indexing on the postid columns, the join should be superior.
There's also the possibility of a sub-query:
SELECT * FROM photo WHERE postid IN (SELECT postid FROM post WHERE userid = 123);
I'd start with optimizing your queries, e.g. select * from post where userid=123 is obviously not needed as you only use row[postid] in your loop, so don't select * if you want to split the query.Then I'd run a couple of tests which ones faster but JOINing just two tables is usually the fastest (don't forget to create an index where needed).
If you're planning to make your "big query" very big (by joining more tables), things can get very slow and you may need to split your query. I once joined seven tables which took the query to run 30 seconds. Splitting the query made in run in a fraction of a second.
I'm not sure about this but there is another option. It might be much slower or faster depending upon indexes used.
In your case, something like:
select t1.postid FROM (select postid from post where userid = 23) AS t1 JOIN photo ON t1.postid = photo.postid
If the number of rows in table t1 is going to be small compared to table post there might be a chance for considerable performance improvement. But I haven't tested it yet.
SELECT * FROM photo, post
WHERE post.userid = 123 AND photo.postid = post.postid;
If you only want posts with photos, construct your query starting with the photo table as your base table. Note, you will get the post info repeated with each result row.
If you didn't want to return all of the post info with each row, an alternative would be to
SELECT DISTINCT postid from photo, post where post.userid = 123;
Then foreach postid, you could
SELECT * from photo WHERE postid = $inpostid;