Get Relation of Followed or Public User - mysql

I am working on a social app. I have users that can have private accounts. Users can also follow each other. What is the fastest way using ActiveRecord or pure SQL to fetch all the records of a has_many on a User that either belong to someone I am following or belong to a public user. In pseudo code:
User.get_all_posts_for_users_being_followed_by(me) + User.get_all_posts_for_public_users
I have this:
SELECT `posts`.*
FROM `posts`
WHERE ( user_id IN (SELECT id
FROM users
WHERE visibility = 'all'
UNION
SELECT followable_id
FROM follows
WHERE followable_type = "User"
AND follower_type = "User"
AND follower_id = 4
AND follows.status = 1) )
But I was hoping there might be a faster way to handle that, or a way to do it with Rails query methods.

You can express this query with ActiveRecord, but I recommend keeping the pure SQL version for now, because it's much easier to modify. A few things to pay attention to:
The query might be faster if you add indexes:
add_index :users, :visibility, :name => 'visibility_ix'
Selecting all columns with the * wildcard will cause the query's meaning and behavior to change if the table's schema changes, and might cause the query to retrieve too much data.
IN() and NOT IN() subqueries are poorly optimized. MySQL executes the subquery as a dependent subquery for each row in the outer query. This is a frequent cause of serious performance problems in MySQL 5.5 and older versions. The query probably should be rewritten as a JOIN or a LEFT OUTER JOIN, respectively.
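To illustrate the suggested rewrite, here is a sketch of the IN (... UNION ...) predicate expressed as joins instead. It runs against SQLite via Python's sqlite3 as a stand-in for MySQL; the tiny schema and sample data are invented for the demo:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, visibility TEXT);
CREATE TABLE follows (followable_id INTEGER, followable_type TEXT,
                      follower_id INTEGER, follower_type TEXT, status INTEGER);
CREATE TABLE posts   (id INTEGER PRIMARY KEY, user_id INTEGER);

INSERT INTO users   VALUES (1, 'all'), (2, 'private'), (3, 'private');
-- user 4 follows user 2 (approved); user 3 is private and unfollowed
INSERT INTO follows VALUES (2, 'User', 4, 'User', 1);
INSERT INTO posts   VALUES (10, 1), (11, 2), (12, 3);
""")

# Same result set as the IN (... UNION ...) version, but expressed as a
# join on users plus a LEFT JOIN on follows, so the optimizer is not
# forced into a dependent subquery.
rows = con.execute("""
SELECT DISTINCT posts.id
FROM posts
JOIN users ON users.id = posts.user_id
LEFT JOIN follows
       ON follows.followable_id   = users.id
      AND follows.followable_type = 'User'
      AND follows.follower_type   = 'User'
      AND follows.follower_id     = 4
      AND follows.status          = 1
WHERE users.visibility = 'all'
   OR follows.follower_id IS NOT NULL
ORDER BY posts.id
""").fetchall()
print(rows)  # posts by the public user 1 and the followed user 2
```

Note that MySQL 5.6+ can often transform the IN subquery into a semi-join on its own, so compare both forms with EXPLAIN before committing to one.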

Related

Rails: Performance issue with joining of records

I have the following setup with ActiveRecord and MySQL:
User has many groups through memberships
Group has many users through memberships
There is also an index by group_id and user_id described in schema.rb:
add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree
3 different queries:
User.where(id: Membership.uniq.pluck(:user_id))
(3.8ms) SELECT DISTINCT memberships.user_id FROM memberships
User Load (11.0ms) SELECT users.* FROM users WHERE users.id IN (1, 2...)
User.where(id: Membership.uniq.select(:user_id))
User Load (15.2ms) SELECT users.* FROM users WHERE users.id IN (SELECT DISTINCT memberships.user_id FROM memberships)
User.uniq.joins(:memberships)
User Load (135.1ms) SELECT DISTINCT users.* FROM users INNER JOIN memberships ON memberships.user_id = users.id
What is the best approach for doing this? Why the query with join is much slower?
The first query is bad because it sucks all of the user ids into a Ruby array and then sends them back to the database. If you have a lot of users, that's a huge array and a huge amount of bandwidth, plus 2 roundtrips to the database instead of one. Furthermore, the database has no way to efficiently handle that huge array.
The second and third approaches are both efficient database-driven solutions (one is a subquery, and one is a join), but you need to have the proper index. You need an index on the memberships table on user_id.
add_index :memberships, :user_id
The index that you already have would only be helpful if you wanted to find all of the users that belong to a particular group.
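A quick way to see the difference the extra index makes is to compare query plans before and after adding it. This sketch uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 as a stand-in for MySQL's EXPLAIN (the exact plan wording varies between SQLite versions; the data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE memberships (group_id INTEGER, user_id INTEGER);
CREATE INDEX uugj_index ON memberships (group_id, user_id);
INSERT INTO memberships VALUES (1, 10), (1, 11), (2, 10);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's rough analogue of MySQL's EXPLAIN
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT user_id FROM memberships WHERE user_id = 10"

# Only (group_id, user_id) exists: user_id is not the leading column,
# so the engine falls back to scanning the whole index.
before = plan(q)
con.execute("CREATE INDEX index_memberships_on_user_id "
            "ON memberships (user_id)")
# Now a direct SEARCH on the user_id index is possible.
after = plan(q)

print(before)
print(after)
```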
Update:
If you have a lot of columns and data in your users table, the DISTINCT users.* in the 3rd query is going to be fairly slow because MySQL has to compare a lot of data in order to ensure uniqueness.
To be clear: this is not intrinsic slowness with JOIN, it's slowness with DISTINCT. For example: Here is a way to avoid the DISTINCT and still use a JOIN:
SELECT users.* FROM users
INNER JOIN (SELECT DISTINCT memberships.user_id FROM memberships) AS user_ids
ON user_ids.user_id = users.id;
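Here is a small runnable check that the derived-table form returns the same rows as the DISTINCT users.* join; it uses SQLite via Python's sqlite3 with invented sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE memberships (group_id INTEGER, user_id INTEGER);
INSERT INTO users VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO memberships VALUES (1, 1), (2, 1), (1, 2);  -- user 3 has none
""")

# DISTINCT over wide user rows: dedup compares every column
distinct_join = con.execute("""
    SELECT DISTINCT users.* FROM users
    JOIN memberships ON memberships.user_id = users.id
    ORDER BY users.id""").fetchall()

# DISTINCT over the narrow id list only, then join back to users
derived = con.execute("""
    SELECT users.* FROM users
    JOIN (SELECT DISTINCT user_id FROM memberships) AS user_ids
      ON user_ids.user_id = users.id
    ORDER BY users.id""").fetchall()

print(distinct_join == derived)  # same rows; dedup happens on the id list
```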
Given all of that, in this case, I believe the 2nd query is going to be the best approach for you. The 2nd query should be even faster than reported in your original results if you add the above index. Please retry the second approach, if you haven't done so yet since adding the index.
Although the 1st query has some slowness issues of its own, from your comment it's clear that it is still faster than the 3rd query (at least for your particular dataset). The trade-offs of these approaches are going to depend on your particular dataset: how many users you have and how many memberships you have. Generally speaking, I believe the 1st approach is still the worst even if it ends up being faster.
Also, please note that the index I'm recommending is particularly designed for the three queries you listed in your question. If you have other kinds of queries against these tables, you may be better served by additional indexes, or possibly multi-column indexes, as @tata mentioned in his/her answer.
The query with the join is slow because it loads all columns from the database, even though Rails doesn't preload the associations this way. If you need preloading, use includes (or similar) instead, but includes will be even slower because it constructs objects for all the associations. Also, you should know that
User.where.not(id: Membership.uniq.select(:user_id)) will return an empty set whenever there is at least one membership with a user_id of nil, while the query with pluck will return the correct relation.
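The NULL gotcha is easy to reproduce. This sketch (SQLite via Python's sqlite3, invented data) mimics the two variants described above: the select-style version pushes the subquery into NOT IN, while the pluck-style version filters out the nil on the client side first:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE memberships (user_id INTEGER);
INSERT INTO users VALUES (1), (2), (3);
INSERT INTO memberships VALUES (1), (NULL);  -- one orphaned row
""")

# NOT IN against a set containing NULL: every comparison evaluates to
# UNKNOWN, so no row qualifies and the result is empty.
not_in = con.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT user_id FROM memberships)"
).fetchall()

# pluck-style: collect the ids first and drop the NULL client-side
ids = [r[0] for r in con.execute(
    "SELECT DISTINCT user_id FROM memberships") if r[0] is not None]
plucked = con.execute(
    "SELECT id FROM users WHERE id NOT IN (%s)" % ",".join(map(str, ids))
).fetchall()

print(not_in)   # []
print(plucked)  # [(2,), (3,)]
```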
Below is a more efficient solution if you only need to check existence (note that exists? returns a boolean, not a relation):
User.exists?(id: Membership.uniq.pluck(:user_id))
join will fetch all the columns from the memberships table, so it will take more time than the other queries, where you are only fetching the user_id from memberships. Calling distinct on users will also slow down the query.
I think that you have a problem with the declaration of your indexes.
You declared an index as:
add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree
If your primary key were ["user_id", "group_id"] you would be good to go, but making that happen in Rails is not so trivial.
Therefore in order to query the data with JOIN with Users table - you need to have 2 indexes:
add_index "memberships", ["user_id", "group_id" ]
This is because of the way MySQL handles indexes (they are treated as concatenated strings)
You can read more about it here Multiple-Column Indexes
There are also other techniques to make it faster depending on your use cases, but the suggested one is the simple ActiveRecord approach.
Furthermore, I don't think that you need the .uniq here, as the result should be unique anyway because of the constraints on the table.
Adding .uniq can make MySQL perform an unnecessary filesort, and it will usually also put a temporary table on disk.
You can run the command generated by Rails directly in MySQL to check it with EXPLAIN:
EXPLAIN <your command goes here>
@bublik42 and @user3409950: if I had to choose a query for production, I would go for the first one:
User.where(id: Membership.uniq.pluck(:user_id))
Reason: it uses the SQL DISTINCT keyword to filter the result in the database, then SELECTs only the user_id column from the database and returns those values as an array ([1, 2, 3, ...]).
Database-level filtering of the result is always faster than building ActiveRecord query objects.
For your second query:
User.where(id: Membership.uniq.select(:user_id))
It is the same query as with pluck, but with select it builds an ActiveRecord relation with the single field user_id. This carries the overhead of building the ActiveRecord objects ([#<Membership user_id: 1>, #<Membership user_id: 2>, ...]), which was not the case for the first query. I haven't done any real benchmarking of the two, but the difference follows from the steps each query performs.
The third case is expensive because the join fetches all the columns from the memberships table, and filtering that result takes more time than in the other queries.
Thank you
SELECT DISTINCT users.*
FROM users
INNER JOIN memberships
ON memberships.user_id = users.id
is slower because it is performed something like this:
Go through all of one table, collecting stuff as it goes.
for each entry from step 1, reach into the other table.
put that stuff into a tmp table
dedup (DISTINCT) that table to deliver the results
If there are 1000 users and each has 100 memberships, then the table in step 3 will have 100000 rows, even though the answer will have only 1000 rows.
This is a "semi-join" and only checks that the user has at least one membership; it is much more efficient:
SELECT users.*
FROM users -- no DISTINCT needed
WHERE EXISTS
( SELECT *
FROM memberships
WHERE memberships.user_id = users.id
)
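A runnable comparison of the two shapes (SQLite via Python's sqlite3, invented data) confirms they return the same users, while the semi-join avoids the dedup step entirely:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE memberships (group_id INTEGER, user_id INTEGER);
INSERT INTO users VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO memberships VALUES (1, 1), (2, 1), (3, 1), (1, 2);
""")

# Join + DISTINCT: one intermediate row per membership, then dedup
joined = con.execute("""
    SELECT DISTINCT users.* FROM users
    JOIN memberships ON memberships.user_id = users.id
    ORDER BY users.id""").fetchall()

# Semi-join: stops at the first matching membership per user, no dedup
semi = con.execute("""
    SELECT users.* FROM users
    WHERE EXISTS (SELECT 1 FROM memberships
                  WHERE memberships.user_id = users.id)
    ORDER BY users.id""").fetchall()

print(joined == semi)  # same users either way
```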
If you don't really need that check, then this would be still faster:
SELECT users.*
FROM users
If Rails can't generate these queries, then grumble at it.
Here is a great example, demonstrating Include VS Join :
http://railscasts.com/episodes/181-include-vs-joins
Please try with includes. I'm damn sure it will take comparatively less time.
User.uniq.includes(:memberships)

Which is faster: subquery with filter and then join, or join the queries and then filter, in MySQL

I have two tables: one is a user table, the other is a specialty table.
Fields in user table: userid, username, userLocation
Fields in specialty table: userid, userSpecialty
Now I want to join these two tables. Please let me know which approach is better:
select * from ( select * from user where userLocation = 'value') u
inner join specialty s on u.userid = s.userid;
or
select * from user u inner join specialty s on u.userid = s.userid where userLocation = 'value';
Is it good practice to minimize the number of records wherever we can, or will the SQL optimizer do that automatically?
For best shot at optimal performance, give preference to the pattern in the second query, the query without the inline view.
For earlier versions of MySQL (5.5 and earlier), the first query requires MySQL to run the inline view query and materialize a derived table (u). Once that is done, the outer query runs against the derived table. And that table won't be indexed. For large sets, that can be a significant performance hit. For small sets, the performance impact of a single query isn't noticeable.
With the second query, the optimizer isn't forced to create and populate a derived table, so there's potential for better performance.
The existence of suitable indexes (or the non-existence of indexes) is going to have a much bigger impact on performance. And retrieving all columns, including columns that aren't needed by the query (SELECT *), also has an impact on performance. Specifying a subset of the columns, the expressions that are actually needed, will give better performance, especially if a covering index is available to avoid lookups to the underlying data pages of the table.

Optimizing the SQL Query to reduce execution time

My SQL query with all the filters applied is returning 10 lakhs (one million) records. Getting all the records takes 76.28 seconds, which is not acceptable. How can I optimize my SQL query to take less time?
The Query I am using is :
SELECT cDistName , cTlkName, cGpName, cVlgName ,
cMmbName , dSrvyOn
FROM sspk.villages
LEFT JOIN gps ON nVlgGpID = nGpID
LEFT JOIN TALUKS ON nGpTlkID = nTlkID
left JOIN dists ON nTlkDistID = nDistID
LEFT JOIN HHINFO ON nHLstGpID = nGpID
LEFT JOIN MEMBERS ON nHLstID = nMmbHhiID
LEFT JOIN BNFTSTTS ON nMmbID = nBStsMmbID
LEFT JOIN STATUS ON nBStsSttsID = nSttsID
LEFT JOIN SCHEMES ON nBStsSchID = nSchID
WHERE (
(nMmbGndrID = 1 and nMmbAge between 18 and 60)
or (nMmbGndrID = 2 and nMmbAge between 18 and 55)
)
AND cSttsDesc like 'No, Eligible'
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
GROUP BY cDistName , cTlkName, cGpName, cVlgName ,
DATE_FORMAT(dSrvyOn , '%m-%Y')
I have searched on the forum and elsewhere and used some of the tips given, but it hardly makes any difference. The joins I have used in the above query are all LEFT JOINs on primary and foreign keys. Can anyone suggest how I can modify this SQL to get a shorter execution time?
You are, sir, a very demanding user of MySQL! A million records retrieved from a massively joined result set at the speed you mentioned is 76 microseconds per record. Many would consider this to be acceptable performance. Keep in mind that your client software may be a limiting factor with a result set of that size: it has to consume the enormous result set and do something with it.
That being said, I see a couple of problems.
First, rewrite your query so every column name is qualified by a table name. You'll do this for yourself and the next person who maintains it. You can see at a glance what your WHERE criteria need to do.
Second, consider this search criterion. It requires TWO searches, because of the OR.
WHERE (
(MEMBERS.nMmbGndrID = 1 and MEMBERS.nMmbAge between 18 and 60)
or (MEMBERS.nMmbGndrID = 2 and MEMBERS.nMmbAge between 18 and 55)
)
I'm guessing that these criteria match most of your population -- females 18-60 and males 18-55 (a guess). Can you put the MEMBERS table first in your list of LEFT JOINs? Or can you put a derived column (MEMBERS.working_age = 1 or some such) in your table?
Also try a compound index on (nMmbGndrID,nMmbAge) on MEMBERS to speed this up. It may or may not work.
Third, consider this criterion.
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
You've applied a function to the dSrvyOn column. This defeats the use of an index for that search. Instead, try this.
AND dSrvyOn < '2012-08-01'
This will, if you have an index on dSrvyOn, do a range search on that index. The same remark applies to the DATE_FORMAT() call in your GROUP BY clause.
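The effect is easy to demonstrate. The sketch below uses SQLite via Python's sqlite3, with strftime standing in for MySQL's DATE_FORMAT (and a year-month format that sorts correctly): the function-wrapped predicate forces a scan, the bare range predicate uses the index, and both return the same rows on this invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE surveys (dSrvyOn TEXT);
CREATE INDEX idx_srvy ON surveys (dSrvyOn);
INSERT INTO surveys VALUES ('2012-03-15'), ('2012-07-31'),
                           ('2012-08-01'), ('2013-01-02');
""")

def plan(sql):
    # EXPLAIN QUERY PLAN: SQLite's rough analogue of MySQL's EXPLAIN
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Function-wrapped column: the index stores raw values, not function
# results, so the engine must evaluate the function on every row.
wrapped = ("SELECT dSrvyOn FROM surveys "
           "WHERE strftime('%Y-%m', dSrvyOn) < '2012-08'")
# Bare range predicate on the column itself: sargable, index-friendly.
ranged = "SELECT dSrvyOn FROM surveys WHERE dSrvyOn < '2012-08-01'"

rows_wrapped = con.execute(wrapped).fetchall()
rows_ranged = con.execute(ranged).fetchall()
print(rows_wrapped == rows_ranged)  # same rows either way
print(plan(wrapped))                # SCAN ...
print(plan(ranged))                 # SEARCH ... INDEX ...
```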
Finally, as somebody else mentioned, don't use LIKE to search where = will do. And NEVER use column LIKE '%something%' if you want acceptable performance.
You say yourself that your joins are based on good, unique indexes, so there is little to be optimized. Maybe a few hints:
try to optimize your table layout; maybe you can reduce the number of joins required. That probably brings more performance improvement than anything else.
check your hardware (available memory and so on) and the server configuration.
use MySQL's EXPLAIN feature to find bottlenecks.
maybe you can create an auxiliary table especially for this query, filled by a background process. That way the query itself runs faster, since the work is done beforehand in the background. This usually works when the query retrieves data that does not need to be synchronous with every single change in the database.
check if an RDBMS is really the right type of database. For many purposes graph databases are much more efficient and offer better performance.
Try adding an index to nMmbGndrID, nMmbAge, and cSttsDesc and see if that helps your queries out.
Additionally you can use the "Explain" command before your select statement to give you some hints on what you might do better. See the MySQL Reference for more details on explain.
If the tables used in the joins are rarely hit by update queries, you can probably change the engine type from InnoDB to MyISAM.
SELECT queries in MyISAM can run around 2x faster than in InnoDB, but UPDATE and INSERT queries are much slower in MyISAM.
You can create Views in order to avoid long queries and time.
Your like operator could be holding you up -- full-text search with like is not MySQL's strong point.
Consider setting a fulltext index on cSttsDesc (make sure it is a TEXT field first).
ALTER TABLE articles ADD FULLTEXT(cSttsDesc);
SELECT *
FROM table_name
WHERE MATCH(cSttsDesc) AGAINST('No, Eligible')
Alternatively, you can set a boolean flag instead of cSttsDesc like 'No, Eligible'.
Source: http://devzone.zend.com/26/using-mysql-full-text-searching/
This SQL has many things that are redundant that may not show up in an explain.
If you require a field, it shouldn't be in a table that's in a LEFT JOIN - left join is for when data might be in the joined table, not when it has to be.
If all the required fields are in the same table, that table should be in your first FROM.
If your text search is predictable (not from user input) and relates to a single known ID, use the ID not the text search (props to Patricia for spotting the LIKE bottleneck).
Your query is hard to read because of the lack of table hinting, but there does seem to be a pattern to your field names.
You require nMmbGndrID and nMmbAge to have a value, but these are probably in MEMBERS, which is 5 left joins down. That's a redundancy.
Remember that you can do a simple join like this:
FROM sspk.villages, gps, TALUKS, dists, HHINFO, MEMBERS [...] WHERE [...] nVlgGpID = nGpID
AND nGpTlkID = nTlkID
AND nTlkDistID = nDistID
AND nHLstGpID = nGpID
AND nHLstID = nMmbHhiID
It looks like cSttsDesc comes from STATUS. But if the text 'No, Eligible' matches exactly one nBStsSttsID in BNFTSTTS then find out the value and use that! If it is 7, take out LEFT JOIN STATUS ON nBStsSttsID = nSttsID and replace AND cSttsDesc like 'No, Eligible' with AND nBStsSttsID = '7'. This would see a massive speed improvement.

ActiveRecord/MySQL query to return grouped set of objects

I have a model called a Statement that belongs to a Member. Given an array of members, I want to create a query that will return the most recent statement for each of those members (preferably in one nice clean query).
I thought I might be able to achieve this using group and order - something like:
# @members is already an array
@statements = Statement.all(
  :conditions => ["member_id IN (?)", @members.collect { |m| m.id }],
  :group => :member_id,
  :order => "created_at DESC"
)
But unfortunately the above always returns the oldest statement for each member. I've tried swapping the order option round, but alas it always returns the oldest statement of the group rather than the most recent.
I'm guessing group_by isn't the way to achieve this - so how do I achieve it?
PS - any non Ruby/Rails people reading this, if you know how to achieve this in raw MySQL, then fire away.
In MySQL directly, you need a sub-query that returns the maximum created_at value for each member, which can then be joined back to the Statement table to retrieve the rest of the row.
SELECT *
FROM Statement s
JOIN (SELECT
member_id, MAX(created_at) max_created_at
FROM Statement
GROUP BY member_id
) latest
ON s.member_id = latest.member_id
AND s.created_at = latest.max_created_at
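Here is the same greatest-per-group pattern runnable end to end (SQLite via Python's sqlite3, with an invented Statement table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Statement (id INTEGER PRIMARY KEY,
                        member_id INTEGER, created_at TEXT);
INSERT INTO Statement VALUES
  (1, 100, '2012-01-01'), (2, 100, '2012-06-01'),
  (3, 200, '2012-03-01'), (4, 200, '2012-02-01');
""")

# Sub-query finds each member's max created_at; joining back on both
# columns recovers the full row for that latest statement.
latest = con.execute("""
    SELECT s.id, s.member_id, s.created_at
    FROM Statement s
    JOIN (SELECT member_id, MAX(created_at) AS max_created_at
          FROM Statement
          GROUP BY member_id) latest
      ON s.member_id  = latest.member_id
     AND s.created_at = latest.max_created_at
    ORDER BY s.member_id""").fetchall()

print(latest)  # one row per member: the newest statement
```

One caveat: if two statements for the same member share an identical created_at, both rows come back; tie-break on id as well if that matters.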
If you are using Rails 3 I would recommend taking a look at the new ActiveRecord query syntax. There is an overview at http://guides.rubyonrails.org/active_record_querying.html
I am pretty certain you could do what you are trying to do here without writing any SQL. There is an example in the "Grouping" section on that page which looks similar to what you are trying to do.

which query is better and efficient - mysql

I came across writing the query in different ways, as shown below.
Type-I
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS
INNER
JOIN Countries C
ON JS.CountryID = C.CountryID
INNER
JOIN SalaryBracket S
ON JS.MinSalaryID = S.SalaryID
INNER
JOIN DegreeDisciplines DD
ON JS.DegreeDisciplineID = DD.DisciplineID
INNER
JOIN DegreeType DT
ON JS.DegreeTypeID = DT.DegreeTypeID
WHERE
JS.ShowCV = 'Yes'
Type-II
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS, Countries C, SalaryBracket S, DegreeDisciplines DD
, DegreeType DT
WHERE
JS.CountryID = C.CountryID
AND JS.MinSalaryID = S.SalaryID
AND JS.DegreeDisciplineID = DD.DisciplineID
AND JS.DegreeTypeID = DT.DegreeTypeID
AND JS.ShowCV = 'Yes'
I am using Mysql database
Both work really well, but I am wondering:
which is best practice to use all time for any situation?
Performance wise which is better one?(Say the database as a millions records)
Any advantages of one over the other?
Is there any tool where I can check which is better query?
Thanks in advance
1- It's a no-brainer: use Type I.
2- The Type II joins are also called 'implicit joins', whereas Type I uses 'explicit joins'. With a modern DBMS you will not have any performance problem on a normal query, but I think that with a big, complex multi-join query the DBMS can have trouble with implicit joins. Using only explicit joins can improve your explain plan, so faster results!
3- So performance can be an issue, but maybe most importantly, readability is improved for future maintenance. An explicit join states exactly what you want to join on which fields, whereas with an implicit join you can't tell whether a WHERE condition is a join or a filter. The WHERE clause is for filtering, not for joining!
And a big, big point for explicit joins: outer joins are really annoying with implicit syntax. Multiple joins mixed with outer joins are so hard to read that explicit joins are THE solution.
4- Execution plans are what you need (see the docs).
Some duplicates :
Explicit vs implicit SQL joins
SQL join: where clause vs. on clause
INNER JOIN ON vs WHERE clause
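For inner joins, the two styles really are just syntax. A minimal check (SQLite via Python's sqlite3, with invented tables modeled on the question's schema) shows they return identical rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Jobseekers (JobseekerID INTEGER, CountryID INTEGER,
                         ShowCV TEXT);
CREATE TABLE Countries  (CountryID INTEGER, CountryName TEXT);
INSERT INTO Jobseekers VALUES (1, 10, 'Yes'), (2, 20, 'No');
INSERT INTO Countries  VALUES (10, 'FR'), (20, 'DE');
""")

# Type-I: explicit JOIN ... ON
explicit = con.execute("""
    SELECT JS.JobseekerID, C.CountryName
    FROM Jobseekers JS
    JOIN Countries C ON JS.CountryID = C.CountryID
    WHERE JS.ShowCV = 'Yes'""").fetchall()

# Type-II: implicit join, conditions in WHERE
implicit = con.execute("""
    SELECT JS.JobseekerID, C.CountryName
    FROM Jobseekers JS, Countries C
    WHERE JS.CountryID = C.CountryID
      AND JS.ShowCV = 'Yes'""").fetchall()

print(explicit == implicit)  # identical result sets
```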
In most code I've seen, those queries are written like your Type-II, but I think Type-I is better because of readability (and more logic: a join is a join, so you should write it as a join, although the second one is just another way of writing inner joins).
Performance-wise, there shouldn't be a difference (if there is one, I think Type-I would be a bit faster).
Look at "Explain"-syntax
http://dev.mysql.com/doc/refman/5.1/en/explain.html
My suggestion:
Populate all your tables with a reasonable amount of records, open the MySQL console, and run both SQL commands one by one. You can see the execution time in the console.
For the two queries you mentioned (each with only inner joins) any modern database's query optimizer should produce exactly the same query plan, and thus the same performance.
For MySQL, if you prefix the query with EXPLAIN, it will spit out information about the query plan (instead of running the query). If the information from both queries is the same, then the query plan is the same, and the performance will be identical. From the MySQL Reference Manual:
EXPLAIN returns a row of information for each table used in the SELECT statement. The tables are listed in the output in the order that MySQL would read them while processing the query. MySQL resolves all joins using a nested-loop join method. This means that MySQL reads a row from the first table, and then finds a matching row in the second table, the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table.
When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed by issuing a SHOW WARNINGS statement following the EXPLAIN statement. This information displays how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the application of rewriting and optimization rules, and possibly other notes about the optimization process.
As to which syntax is better? That's up to you, but once you move beyond inner joins to outer joins, you'll need to use the newer syntax, since there's no standard for describing outer joins using the older implicit join syntax.