In Rails 3 with MySQL, suppose I have two models, Customer and Purchase, where obviously Purchase belongs_to Customer. I want to find all the customers with two or more purchases. I can simply say:
Customer.includes(:purchases).all.select{ |c| c.purchases.size >= 2 }
Effectively, though, the line above loads all customers and all their purchases, then does the "select" filtering in Ruby. On a large database I would much rather avoid doing this "select" work in Ruby and have MySQL do the processing, handing me only the list of qualified customers. That is both much faster (MySQL is tuned for this) and significantly reduces traffic from the database.
Unfortunately I am unable to conjure up the code with the building blocks in Rails (where, having, group, etc.) to make this happen. I'm after something along the lines of this pseudo-code:
Customer.joins(:purchases).where("count(purchases) >= 2").all
I will settle for a straight MySQL solution, though I would much prefer to figure this out within the elegant framework of Rails.
No need to install a gem to get this to work (though MetaWhere is cool):
Customer.joins(:purchases).group("customers.id").having("COUNT(purchases.id) >= ?", 2)
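If you're curious what this builds, calling to_sql on the relation shows roughly the query sent to MySQL (a sketch; the exact quoting varies by Rails version):

Customer.joins(:purchases).group("customers.id").having("COUNT(purchases.id) >= ?", 2).to_sql
# => SELECT customers.* FROM customers
#    INNER JOIN purchases ON purchases.customer_id = customers.id
#    GROUP BY customers.id HAVING COUNT(purchases.id) >= 2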
The documentation on this stuff is fairly sparse at this point. I'd look into using MetaWhere if you'll be doing any more queries similar to this one. With MetaWhere, you can do this (or something close to it; I'm not sure the syntax is exactly right):
Customer.includes(:purchases).where(:purchases => {:count.gte => 2})
The beauty of this is that MetaWhere still uses ActiveRecord and Arel to perform the query, so it works with the 'new' Rails 3 way of doing queries.
Additionally, you probably don't want to call .all on the end, as that causes the query to hit the database immediately. Instead, you want lazy loading: don't touch the database until you actually need the data (in the view, or some other method that processes the actual data).
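To illustrate the lazy-loading point, a minimal sketch (the name column is a hypothetical attribute):

customers = Customer.joins(:purchases).group("customers.id").having("COUNT(purchases.id) >= ?", 2)
# no SQL has run yet; the query fires only when the data is actually needed:
customers.each { |c| puts c.name }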
This is a bit more verbose, but if you want customers with a purchase count of 0, or more flexible SQL in general, you can use a LEFT JOIN:
Customer.joins('LEFT JOIN purchases ON purchases.customer_id = customers.id').group('customers.id').having('COUNT(purchases.id) = 0').length
I'm using CakePHP 2.x. When I inspect the SQL dump, I notice that its "automagic" is causing one of my find()s to run several separate SELECT queries (and then presumably merging them all together into a single pretty array of data).
This is normally fine, but I need to run one very large query on a table of 10k rows with several joins, and this is proving too much for the magic to handle: when I try to construct the query through find('all', $conditions), it times out after 300 seconds, but when I write an equivalent query manually with JOINs, it runs very fast.
My theory is that whatever PHP "magic" is required to weave the separate queries together is causing a bottleneck for this one large query.
Is my theory a plausible explanation for what's going on?
Is there a way to tell Cake to just keep it simple and make one big fat SELECT instead of its fancy automagic?
Update: I forgot to mention that I already know about $this->Model->query(). Using it is how I figured out that the slowdown was coming from the PHP magic. It works when we do it this way, but it feels a little clunky to maintain the same query in two different forms. That's why I was hoping CakePHP offered an alternative to the way it builds big queries out of multiple smaller ones.
In cases like this, where you query tables with 10k records, you shouldn't be doing a find('all') without limiting the associations. These are some of the strategies you can apply:
Set recursive to 0 if you don't need related models.
Use the Containable behavior to bring in only the associated models you need.
Apply limits to your query.
Caching is a good friend.
Create and destroy associations on the fly as you need them.
Since you didn't specify the exact problem, these are just general ideas to apply depending on the issue you have.
I have an application that allows users to filter applicants based on a very large set of criteria. The criteria are each represented by boolean columns spanning multiple tables in the database. Instead of using ActiveRecord models, I thought it was best to use pure SQL and put the bulk of the work in the database. In order to do this I have to construct a rather complex SQL query based on the criteria the users selected, and then run it through AR on the database. Is there a better way to do this? I want to maximize performance while also having maintainable, non-brittle code. Any help would be greatly appreciated.
As @hazzit said, it is difficult to answer without more details, but here are my two cents on this. Raw SQL is usually needed to perform complex operations like aggregates, calculations, etc. However, when it comes to search and filtering features, I often find raw SQL overkill and not quite maintainable.
The key question here is: can you break your problem down into multiple independent filters?
If the answer is yes, then you should leverage the power of ActiveRecord and Arel. I often find myself implementing something like this in my model:
scope :a_scope, ->{ where something: true }
scope :another_scope, ->( option ){ where an_option: option }
scope :using_arel, ->{ joins(:assoc).where Assoc.arel_table[:some_field].not_eq "foo" }
# cue a bunch of scopes
def self.search( options = {} )
  relation = scoped  # Rails 3; use `all` on Rails 4+
  relation = relation.a_scope if options[:an_option]
  relation = relation.another_scope( options[:another_option] ) unless options[:flag]
  # add logic as you need it
  relation
end
The beauty of this solution is that you declare a clean interface into which you can directly pour all the params from your checkboxes and fields, and it returns a relation. Breaking the query into multiple reusable scopes helps keep things readable and maintainable; using a search class method ties it all together and allows thorough documentation... And all in all, using Arel helps secure the app against injections.
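For instance, a controller could feed its params straight in (a sketch; Applicant and the option names are placeholders):

filtered = Applicant.search(
  :an_option      => params[:an_option],
  :another_option => params[:another_option],
  :flag           => params[:flag]
)
# still a relation, so it stays chainable and lazy:
filtered.order("created_at DESC").limit(50)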
As a side note, this does not prevent you from using raw SQL, as long as the query can be isolated inside a scope.
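For example (a sketch with a made-up column name, assuming MySQL date functions), the raw fragment stays contained in one place:

scope :flagged_recently, ->{ where("flagged_at > NOW() - INTERVAL 30 DAY") }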
If this method is not suited to your needs, there's another option: use a full-fledged search/filtering solution like Sunspot. This uses another store, separate from your database, that indexes defined parts of your data for easy, performant searching.
It is hard to answer this question fully without knowing more details, but I'll try anyway.
While databases are bad at quite a few things, they are very good at filtering data, especially when it comes to high volumes.
If you do the filtering in Ruby on Rails (or just about any other programming language), the system has to retrieve all of the unfiltered data from the database, which causes tons of disk I/O and network (or interprocess) traffic. It then has to go through all those unfiltered results in memory, which may be quite a burden on RAM and CPU.
If you do the filtering in the database, there is a pretty good chance that most of the records will never actually be retrieved from disk, handed over to RoR, or filtered there. Indexes exist for the very purpose of avoiding expensive operations and speeding things up. (Yes, they also help maintain data integrity.)
To make this work, however, you may need to help the database a bit to do its job efficiently. You will have to create indexes matching your filtering criteria, and you may have to look into performance issues with certain types of queries (how to avoid temporary tables and such). However, it is definitely worth it.
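For example, a minimal migration sketch for such indexes (the table and column names are made up):

class AddFilterIndexes < ActiveRecord::Migration
  def change
    # composite index matching a frequent filter combination
    add_index :applicants, [:qualified, :experienced]
    add_index :applicants, :created_at
  end
end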
That said, there actually are a few types of queries that a given database is not good at doing. Those are few and far between, but they do exist. In those cases, an implementation in RoR might be the better way to go. Even without knowing more about your scenario, I'd say it's a pretty safe bet that your queries are not among them.
I'm developing an online application for education research, where I frequently have the need for very complex SQL queries:
queries usually include 5-20 joins, often joining the same table several times
the SELECT clause often ends up 30-40 lines long, between derived fields/calculations and CASE statements
extra WHERE conditions are added in the PHP, based on the user's permissions & other security settings
the user interface has search & sort controls that add custom conditions to the WHERE / ORDER / HAVING clauses
Currently this app is built on PHP + MySQL + jQuery for the moving parts (it grew out of old Dreamweaver code). Soon we are going to rebuild the application from scratch, with the intent to consolidate, clean up, and be ready for future expansion. While I'm comfortable in PHP, I'm learning bits about Rails and realizing that maybe it would be better to build version 2.0 on a more modern framework instead. But before I can commit to hours of tutorials, I need to know whether the Rails querying system (ActiveRecord?) will meet our query needs.
Here's an example of one query challenge I'm concerned about. A query must select from 3+ "instances" of a table, and get comparable information from each instance:
SELECT p1.name AS my_name, pm.name AS mother_name, pf.name AS father_name
FROM people p1
JOIN people pm ON p1.mother_id = pm.id
JOIN people pf ON p1.father_id = pf.id
# etc. etc. etc.
WHERE p1.age BETWEEN 10 AND 16
# (selects this info for 10-200 people)
Or, a similar example, more representative of our challenges. A "raw data" table joins multiple times to a "coding choices" table, each instance of which in turn has to look up the text associated with a key it stores:
SELECT d.*, c1.coder_name AS name_c1, c2.coder_name AS name_c2, c3.coder_name AS name_c3,
(c1.result + c2.result + c3.result) AS result_combined,
m_c1.selection AS selected_c1, m_c2.selection AS selected_c2, m_c3.selection AS selected_c3
FROM t_data d
LEFT JOIN t_codes c1 ON d.id = c1.data_id AND c1.category = 1
LEFT JOIN t_menu_choice m_c1 ON c1.menu_choice = m_c1.id
LEFT JOIN t_codes c2 ON d.id = c2.data_id AND c2.category = 2
LEFT JOIN t_menu_choice m_c2 ON c2.menu_choice = m_c2.id
LEFT JOIN t_codes c3 ON d.id = c3.data_id AND c3.category = 3
LEFT JOIN t_menu_choice m_c3 ON c3.menu_choice = m_c3.id
WHERE d.date_completed BETWEEN ? AND ?
AND c1.coder_id = ?
These sorts of joins are straightforward to write in pure SQL, and when search filters and other varying elements are needed, a couple of PHP loops can cobble strings together into a complete query. But I haven't seen any Rails / ActiveRecord examples that address this sort of structure. If I need to run every query as pure SQL using find_by_sql(""), then using Rails may not be much of an improvement over sticking with the PHP I know.
My question is: Does ActiveRecord support cases where tables need "nicknames" (aliases), such as in the queries above? Can the primary table have an alias too (in my examples, "p1" or "d")? How much control do I have over which fields appear in the SELECT clause? Can I create aliases for selected fields? Can I do calculations and select derived fields in the SELECT clause? How about CASE statements?
How about setting WHERE conditions that specify the joined table's alias? Can my WHERE clause include things like (using the top example) " WHERE pm.age BETWEEN p1.age AND 65 "?
This sort of complexity isn't just an occasional bizarre query, it's a constant and central feature of the application (as it's currently structured). My concern is not just whether writing these queries is "possible" within Rails & ActiveRecord; it's whether this sort of need is supported by "the Rails way", because I'll need to be writing a lot of these. So I'm trying to decide whether switching to Rails will cause more trouble than it's worth.
Thanks in advance! - if you have similar experiences with big scary queries in Rails, I'd love to hear your story & how it worked out.
The short answer is yes. Rails takes care of a large part of these requirements through its various types of relations, scopes, etc. The most important thing is to model your application properly, so that it supports the types of queries and functionality you are going to need. If something is difficult to explain to a person, it will generally be very hard to do in Rails: it is optimized to handle most "real world" relationships and tasks, so exceptions become somewhat difficult to fit into its conventions, and later become harder to maintain, manage, develop further, and decouple. The bottom line is that Rails can handle the SQL for you (SomeObject.all_active_objects_with_some_quality), give you complete control over the SQL (SomeObject.find_by_sql("SELECT * FROM ..."), connection.execute("UPDATE blah SET something = '' ...")), and everything in between.
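To make that concrete with the self-join example from the question, here is a sketch of what it could look like in ActiveRecord (assuming a Person model on the people table; string-based joins give you full control over aliases, and aliased or derived columns come back as attributes on the result objects):

people = Person.joins("JOIN people pm ON people.mother_id = pm.id")
               .joins("JOIN people pf ON people.father_id = pf.id")
               .select("people.name AS my_name, pm.name AS mother_name, pf.name AS father_name")
               .where("pm.age BETWEEN people.age AND 65")
people.first.mother_name  # aliased fields are plain attributes on the result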
One of the advantages of Rails is that it lets you create prototypes quickly. I would model your concepts and then test the most complex business requirements you have. That will give you a quick idea of what is possible and easy to do, versus the bottlenecks and potential issues you might face in development.
I don't know if this is the right place to ask a question like this, but here goes:
I have an intranet-like Rails 3 application managing about 20k users which are in nested-set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to a category (we call it a Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What a user enters isn't always the same as what they see in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me the sum of all entered values for this category, for this user, for this week/month/year".
The problem is that those stats also need to be summed over the subtree of users under a selected user (so it basically returns the sum of all values for all users under that user, including the user themselves).
This app has been in production for 2 years and is doing its job pretty well... but with more and more users it is getting pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics: one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as current as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot run the intensive SQL queries directly... that would kill the server. So I compute them only once, via a background process, and the frontend just reads the results.
Those SQL queries are hard to optimize, and I'm glad I've moved away from that approach... (Caching is not an option; see below.)
Current app goes like this:
frontend: when a user enters new data, it is saved to a simple MySQL table, like [user_id, pointer_id, date, value], and there is also an insert into the queue.
backend: a calc_daemon process checks the queue for new "recompute requests" every 5 seconds. We pop the requests and determine what else needs to be recomputed along with them (pointers have dependencies... the simplest case is: when you change week stats, we must recompute the month and year stats). It does this recomputation the easy way: we select the data with customized per-pointer SQL generated by their classes.
Those computed results are then written back to MySQL, but to partitioned tables (one table per year). One line in such a table looks like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced the number of records five-fold).
When the frontend needs those results, it does simple sums on the partitioned data, with 2 joins (because of the nested-set conditions).
The problem is that those simple queries with sums, GROUP BY, and the join on the subtree can take around 200ms each... just for a few records... and we need to run a lot of them. I think they are optimized as well as they can be, according to EXPLAIN... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it while using Ruby and Rails? As I see it, if I rewrite it to use Redis, I'll have to run many more queries against it than I do against MySQL, and then perform the sums in Ruby manually... so performance could suffer considerably. I'm not really sure I could even express all the queries I have now in Redis... Loading the users in Rails and then asking "Redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea... but maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set; i.e., I cannot keep one entry in Redis with a list of all child ids for a user (something like children_for_user_10: [1,2,3]), because the tree structure changes frequently. That's also the reason I can't store those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in realtime.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried to do it SOA-style, but it failed because one way or another I end up with XXX megabytes of data in Ruby... especially when generating the reports... and the GC just kills it.) (A side effect is that generating one report blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster, it is an in-memory database, but can you fit all of that data in memory? Iterating over redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events), for example it has a fast INCR command.
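For example, with the redis-rb gem you could maintain running sums at write time instead of recomputing them (a sketch; the key scheme is made up):

require "redis"

redis = Redis.new
# when a user enters a value, bump every aggregate that depends on it
redis.incrby("sum:user:42:pointer:7:week:201215", 10)
redis.incrby("sum:user:42:pointer:7:month:201204", 10)
# reading a sum back is a single O(1) lookup
redis.get("sum:user:42:pointer:7:week:201215").to_i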
I'm guessing that you would get a sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. inline C or Go) to do the recalculation. Are you doing GROUP BY in the recalculation? Is it possible to replace the GROUP BYs with code that orders the result set and then manually checks when the 'group' changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week, and keep variables for the current and previous values of user and week, as well as variables for the sums.
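A minimal Ruby sketch of that order-instead-of-group idea (the data shape is made up; assume the rows arrive already ordered by user and week):

Row = Struct.new(:user_id, :week, :value)
rows = [Row.new(1, 1, 10), Row.new(1, 1, 5), Row.new(1, 2, 7)]

totals = {}
current = nil
sum = 0
rows.each do |row|
  key = [row.user_id, row.week]
  if key != current
    totals[current] = sum unless current.nil?  # flush the finished group
    current = key
    sum = 0
  end
  sum += row.value
end
totals[current] = sum unless current.nil?      # flush the last group
# totals => {[1, 1]=>15, [1, 2]=>7}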
This is assuming the bottleneck is the recalculation; you don't really mention which part is too slow.
I've been developing web applications for a while and I am quite comfortable with MySQL; in fact, as many do, I use some form of SQL almost every day. I like the syntax and have zero problems writing queries or optimizing my tables. I have enjoyed the MySQL API.
The thing that has been bugging me is that Ruby on Rails uses ActiveRecord and migrations, abstracting everything so that you use methods to query the database. I suppose the idea is that you "never have to look at SQL again". Maybe this isn't KISS (keep it simple, stupid), but is the ActiveRecord interface really best? If so, why?
Is development without ever having to write a SQL statement healthy? What if you have to look something up that isn't already available as a Rails method? I know there is a method that lets me run a custom query. I guess what I really want to know is what people think the advantages of using ActiveRecord over MySQL are, and whether anyone feels, like me, that this could be for the Rails community what the calculator was to the math community: some people might forget how to do long division.
You're right that hiding the SQL behind ActiveRecord's layer means people might forget to check the generated SQL. I've been bitten by this myself: missing indexes, inefficient queries, etc.
What ActiveRecord allows is making the easy things easy:
Post.find(1)
vs
SELECT * FROM posts WHERE posts.id = 1
You, the developer, have less to type, and thus fewer chances for error.
Validation is another thing ActiveRecord makes easy. You have to do it anyway, so why not have an easy way to do it, with the repetitive, boring parts abstracted away?
class Post < ActiveRecord::Base
validates_presence_of :title
validates_length_of :title, :maximum => 80
end
vs
if params[:post][:title].blank? then
# complain
elsif params[:post][:title].length > 80 then
# complain again
end
Again, easy to specify, easy to validate. Want more validation? It's a single line to add to an ActiveRecord model. Convoluted code with multiple conditions is always harder to debug and test. Why not make it easy on yourself?
The final thing I really like about ActiveRecord, compared to plain SQL, is callbacks. Callbacks can be emulated with SQL triggers (which are only available in MySQL 5.0 or above), while ActiveRecord has had callbacks since way back (I started on 0.13).
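For example, a before_save callback that lives in plain Ruby rather than in a database trigger (a sketch; the normalization rule is made up):

class Post < ActiveRecord::Base
  before_save :normalize_title

  private

  def normalize_title
    # runs inside the app on every save; no trigger required
    self.title = title.strip if title
  end
end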
To summarize:
ActiveRecord makes the easy things easy;
ActiveRecord removes the boring, repetitive parts;
ActiveRecord does not prevent you from writing your own SQL (usually for performance reasons), and finally;
ActiveRecord is largely portable across most database engines, while SQL itself often is not.
I know in your case you are talking specifically about MySQL, but still. Having the option is nice.
The idea here is that by putting your DB logic inside your Active Records, you're dealing with SQL code in one place, rather than spread all over your application. This makes it easier for the various layers of your application to follow the Single Responsibility Principle (that an object should have only one reason to change).
Here's an article on the Active Record pattern.
Avoiding SQL helps you when you decide to change the database schema. The abstraction is also necessary for all kinds of things, like validation. It doesn't mean you don't get to write SQL: you can always do that if you feel the need. But you don't have to write a 5-line query where all you need is user.save. It's the Rails philosophy to avoid unnecessary code.
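To illustrate (a sketch assuming a User model with a name column):

user = User.new(:name => "Ada")
user.save
# versus hand-writing the INSERT, the quoting, and the timestamps yourself:
# INSERT INTO users (name, created_at, updated_at) VALUES ('Ada', ..., ...)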