MySQL DataMapper adapter is chaining queries instead of joining

I'm using DataMapper (the Ruby gem) as an ORM to a MySQL database. (dm-core 1.1.0, do-mysql-adapter 1.1.0, do_mysql 0.10.6)
I'm writing an application that has two tables: a log of disk usage over time, and a "current usage" table containing foreign keys with the "latest" disk usage for easy reference. The DataMapper classes are Quota and LatestQuota, with a simple schema:
class Quota
  include DataMapper::Resource
  property :unique_id, Serial, :key => true
  property :percentage, Integer
  # ... (more properties)
end

class LatestQuota
  include DataMapper::Resource
  belongs_to :quota, :key => true
end
In my code I want to find all the entries in the LatestQuota table that correspond with a quota with a percentage higher than 95. I'm using the following datamapper query:
quotas = LatestQuota.all(:quota => {:percentage.gte => threshold})
...later...
quotas.select{|q| some_boolean_function?(q)}
Here some_boolean_function? filters out results in a manner that DataMapper can't know about, which is why I need to call Ruby's select().
But it ends up issuing the following SQL queries (as reported by DM's debug output):
SELECT `unique_id` FROM `quota` WHERE `percentage` >= 95
then later:
SELECT `quota_unique_id` FROM `latest_quota`
WHERE `quota_unique_id` IN (52, 78, 82, 232, 313, 320, ... all the unique ids from the above query ...)
This is a ridiculously suboptimal query, so I think I'm doing something wrong. The quota table has millions of records in it (historical data) versus the 15k or so records in latest_quota, and selecting all quota records first and then selecting latest_quota records out of the results is exactly the wrong way to do it.
What I would like it to do is something to the effect of:
SELECT q.* from quota q
INNER JOIN latest_quota lq
ON lq.quota_unique_id=q.unique_id
WHERE q.percentage >= 95;
This takes 0.01 seconds with my current data, instead of the 5 minutes or so it takes DataMapper to run its queries. Is there any way to coerce it to do what I want? Do I have my relations wrong? Am I querying it wrong?

For some reason, nested-Hash-style queries always perform sub-selects. To force INNER JOINs, use String query paths instead.
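A minimal side-by-side sketch of the two forms, reflecting the behavior described above:

# nested-Hash conditions: DataMapper issues two queries (sub-select style)
quotas = LatestQuota.all(:quota => { :percentage.gte => threshold })

# String query-path conditions: DataMapper builds a single INNER JOIN
quotas = LatestQuota.all('quota.percentage.gte' => threshold)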

Related

Order and sort_by difference in Ruby on Rails ActiveRecord

I am trying to sort my data by a timestamp field in my controller; note that the timestamp field may be null or may hold a value. I wrote the following query:
@item = Item.sort_by(&:item_timestamp).reverse
            .paginate(:page => params[:page], :per_page => 5)
But this gives an error when I have items whose item_timestamp field is NULL, whereas the following query works:
@item = Item.order(:item_timestamp).reverse
            .paginate(:page => params[:page], :per_page => 5)
Can anybody tell me the difference between these two queries, and under which conditions to use each one?
Also, I am using order and reverse to get the latest items from the database. Is this the best way, or are there better ways to get the latest data in terms of performance?
.sort_by is a Ruby method from Enumerable that is used to sort arrays (or array-like objects). Using .sort_by causes all the records to be loaded from the database into the server's memory, which can lead to serious performance problems (as well as your issue with nil values).
.order is an ActiveRecord method that adds an ORDER BY clause to the SQL SELECT statement. The database handles sorting the records. This is preferable in 99% of cases.
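A quick sketch of the difference (the SQL shown is illustrative):

# order: the database does the sorting; one SQL query along the lines of
#   SELECT `items`.* FROM `items` ORDER BY `item_timestamp`
Item.order(:item_timestamp)

# sort_by: every record is loaded and instantiated first, then sorted in Ruby
Item.all.to_a.sort_by(&:item_timestamp)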
sort_by is executed in Ruby, so if you have a nil value, things will break in a way similar to this:
[3, nil, 1].sort
#=> ArgumentError: comparison of Fixnum with nil failed
order is executed by your RDBMS, which generally handles NULL values fine. You can even specify where to put the NULLs by adding NULLS FIRST (usually the default) or NULLS LAST to your ORDER BY clause.
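If you do have to sort in Ruby despite possible nils, one common workaround is to substitute a sentinel value; a sketch, assuming item_timestamp is a Time:

# nil timestamps compare as the epoch, so they sort first
items = Item.all.to_a.sort_by { |i| i.item_timestamp || Time.at(0) }.reverse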
You don't need sort_by for that query; it will run very slowly. When you work with the DB you should always use :order. Here is a solution for your problem:
@item = Item.order('item_timestamp DESC NULLS LAST').paginate(:page => params[:page], :per_page => 5)
As was said before me, .order is quicker, and it's enough in most cases, but sometimes you need sort_by, for example if you want to sort by a value on a relation.
If you have a posts table and a view_counters table, where you keep the number of views per article, you can't easily sort your posts by total views with .order.
But with sort_by, you can just do:
posts = @user.posts.joins(:view_counter)
@posts = posts.sort_by { |p| p.total_views }
.sort_by will iterate over each element, fetch the relation's value, and sort by it, all in one line of code.
You can shorten the code further with &:attribute_name, for example:
@posts = posts.sort_by(&:total_views)
Also, for your last question, instead of order plus reverse you can do this:
Item.order(item_timestamp: :desc)
When you use sort_by you bypass ActiveRecord caching and, as pointed out before, you load all the records into RAM.
When writing queries, always think about the SQL world and the memory world as two separate things. It is like having an archive (SQL) and a cart (memory) where you put the files you take out of the archive to use later.
As most people mentioned, the main difference is that sort_by is a Ruby method and order is a Rails ActiveRecord method. However, which one is appropriate varies case by case. For example, sort_by may be appropriate if you have already retrieved the data from the DB and want to sort the loaded data; using order at that point would hit the database again even though you already have the data in memory, as sketched below.
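A sketch of that scenario, with assumed column names; the records are already in memory, so sorting in Ruby avoids a second round-trip to the database:

# one query; to_a loads the records into memory
items = Item.where(active: true).to_a
# no further SQL is issued for the sort
newest_first = items.sort_by(&:item_timestamp).reverse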

Rails 3: What is the best way to update a column in a very large table

I want to update all of a column in a table with over 2.2 million rows where the attribute is set to null. There is a Users table and a Posts table. Even though there is a column for num_posts in User, only about 70,000 users have that number populated; otherwise I have to query the db like so:
@num_posts = @user.posts.count
I want to use a migration to update the attributes and I'm not sure whether or not it's the best way to do it. Here is my migration file:
class UpdateNilPostCountInUsers < ActiveRecord::Migration
  def up
    nil_count = User.select(:id).where("num_posts IS NULL")
    nil_count.each do |user|
      user.update_attribute :num_posts, user.posts.count
    end
  end

  def down
  end
end
In my console, I ran a query on the first 10 rows where num_posts was null, and then used puts for each user.posts.count. The total time was 85.3 ms for 10 rows, an average of 8.53 ms per row. 8.53 ms * 2.2 million rows is about 5.25 hours, and that's without updating any attributes. How do I know if my migration is running as expected? Is there a way to log percent complete to the console? I really don't want to wait 5+ hours to find out it didn't do anything. Much appreciated.
EDIT:
Per Max's comment below, I abandoned the migration route and used find_each to do the work in batches. I solved the problem by adding the following code to the User model, which I successfully ran from the Rails console:
def self.update_post_count
  nil_count = User.select(:id).where("num_posts IS NULL")
  nil_count.find_each { |user|
    user.update_column(:num_posts, user.posts.count) if user.posts
  }
end
Thanks again for the help everyone!
First off, don't use a migration for what is essentially a maintenance task. Migrations should mainly alter the schema of your database. That goes especially for a long-running task like this one, which may fail midway and leave you with a botched migration and a database in an inconsistent state.
Then you need to address the fact that calling user.posts causes an N+1 query; instead, join the posts table and select a count. And without using batches you are likely to exhaust the server's memory quickly.

desc 'Update User post cache counter'
task :update_cache_counter => :environment do
  users = User.joins('LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id"')
              .select('"users"."id", COUNT("posts"."id") AS "p_count"')
              .where('"users"."num_posts" IS NULL')
              .group('"users"."id"')
  puts "Updating user post counts:"
  users.find_each do |user|
    print '.'
    user.update_attribute(:num_posts, user.p_count)
  end
end
You can use update_all with a subquery to do this.
sub_query = 'SELECT COUNT(*) FROM `posts` WHERE `posts`.`user_id` = `users`.`id`'
User.where('num_posts IS NULL').update_all("num_posts = (#{sub_query})")
It will take only seconds instead of hours. And in that case, you won't need to find a way to log progress at all.
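A quick way to verify the result afterwards, as a sketch:

# should be 0 once the update_all has run
User.where('num_posts IS NULL').count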

Pass array in raw MySQL query in Ruby on Rails

So, I have a problem. I have a query which returns ids from one table (say table1), and I have to pass those ids to another query which uses table2. (Writing inner selects or joins is not an option for certain reasons.)
Query:
client = Mysql2::Client.new(:host => "localhost", :username => "", :password => "", :database =>"test")
query1 = %Q{select id from table1 where code='ABC123'}
ids = client.query(query1)
query2 = %Q{select * from table2 where `table2`.`table1_id` IN (#{ids}) and status="rejected"}
table2_data = client.query(query2)
ids is of type Mysql2::Result.
Also, when I do ids.to_a, the resulting array has data something like this: [{"id"=>1}, {"id"=>2}]
I need some feasible way to pass ids to the second query. I tried ids.to_a, but it gives an error due to the brackets [ ]. I have also tried concatenating; say the MySQL result is:
array = ids.to_a # [1,2,3]
id_new = "(" + array.join(',') + ")"
id_new becomes "(1,2,3)", which is a string, and hence IN doesn't work.
Can anyone please suggest how to pass the ids array into the raw MySQL query?
I have banged my head trying to find an answer, but couldn't find an appropriate one.
Edit: I can use ActiveRecord only for query1. If that is the case and ids is an ActiveRecord object, can anyone suggest how to pass it to query2's IN clause, which is supposed to be a raw SQL query?
Edit 2: I can't use ActiveRecord (for query2) or a join because it makes the query heavy and takes a long time (>10 s) to fetch the result (indices are present). So I am using a raw query to optimise it.
When I ran similar queries to try to mimic your problem, I saw that I was getting an array of arrays for ids, like [["1"], ["2"], ["3"]].
If this is also what you're getting then you should call ids.flatten before calling join:
query2 = %Q{select * from table2 where `table2`.`table1_id` IN (#{ids.flatten.join(',')}) and status="rejected"}
array.flatten removes the extra nesting, so:
[[1], [2], [3]].flatten
# => [1,2,3]
[[1], [2], [3]].flatten.join(',')
# => "1,2,3"
EDIT
Since you reported you are receiving a Mysql2::Result object, do this:
ids.to_a.map(&:values).flatten.join(',')
The to_a first converts the Mysql2::Result to an array of hashes that looks like this:
[{"id"=>"1"}, {"id"=>"2"}]
Then using map(&:values) we convert it to an array that looks like this:
[["1"], ["2"]]
This array is similar to the above (before the edit), so running flatten.join(',') converts it to the string you are looking for.
Note that instead of doing map(&:values).flatten you could use the common shortcut flat_map(&:values) which results in the same thing.
Are you sure it doesn't work because it is a string? I think it doesn't work because of the duplicated brackets. Please try this:
array = ids.flat_map(&:values).join(',')
query2 = %Q{select * from table2 where `table2`.`table1_id` IN (#{array}) and status="rejected"}
I suggest using an ORM (object-relational mapper) like the ActiveRecord or Sequel gems, especially because building database queries manually by string concatenation is error prone and leads to vulnerabilities like SQL injection.
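If you must stay with raw mysql2, prepared statements at least keep the values out of the SQL string; a minimal sketch, assuming mysql2 >= 0.4 and ids already flattened to a plain array as shown above:

# one placeholder per id, e.g. "?,?,?"
placeholders = (['?'] * ids.size).join(',')
stmt = client.prepare("select * from table2 where `table2`.`table1_id` IN (#{placeholders}) and status = ?")
table2_data = stmt.execute(*ids, 'rejected')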
If the main reason you posted was to learn how to extract data from an array of hashes, then you can ignore this answer.
However, if you wanted the best way to get the data from the database, I'd suggest you use ActiveRecord to do the donkey work for you:
class Table1 < ActiveRecord::Base
  self.table_name = :table1
  has_many :table2s
end

class Table2 < ActiveRecord::Base
  self.table_name = :table2
  belongs_to :table1
end
table2_data = Table2.joins(:table1).where(table1: {code: 'ABC123'}, status: 'rejected')
A key point is that a SQL join will effectively do the processing of the IDs for you. You could code up the SQL join yourself, but ActiveRecord will do that for you, and it allows you to add the additional conditions so that you can gather the data you want in one query.
You can join the array with commas, like in the following code:
ids = ids.to_a.map{|h| h['id']}
query2 = %Q{select * from table2 where `table2`.`table1_id` IN (#{ids.join(',')}) and status="rejected"}
table2_data = client.query(query2)
It will work fine.

Rails - how to fetch random records from an object?

I am doing something like this:
data = Model.where('something="something"')
random_data = data.rand(100..200)
returns:
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Once I get this random data, I need to iterate through that data, like this:
random_data.each do |rd|
...
I know there's a way to fetch random data in MySQL, but I need to pick random records about 400 times, so I think loading the data once from the database and then picking a random record 400 times is more efficient than running the query 400 times against MySQL.
But how do I get rid of that error?
Thank you in advance
I would add the following scope to the model (depends on the database you are using):
# in model/model.rb
# 'RANDOM()' works with PostgreSQL and SQLite, whereas MySQL uses 'RAND()'
scope :random, -> { order('RAND()') }
Then the following query would load a random number (in the range of 200-400) of objects in one query:
Model.random.limit(rand(200...400))
If you really want to do that in Rails and not in the database, then load all records and use sample:
Model.all.sample(rand(200..400))
But that is likely to be slower (depending on the number of entries in the database), because Rails would load all records from the database and instantiate them, which might take loads of memory.
It really depends on how much effort you want to put into optimizing this, because there's more than one solution. Here are two options.
Something simple is to use ORDER BY RAND() LIMIT 400 to randomly select 400 items.
Alternatively, just select everything under the moon and then use Ruby to randomly pick 400 out of the total result set, e.g.:
data = Model.where(something: 'something').all # all is necessary to exec the query
400.times do
  data.sample # returns a random model
end
I wouldn't recommend the second method, but it should work.
Another way, which is not DB specific, is:
def self.random_record
  self.where('something = ? and id = ?', "something", rand(self.count))
end
The only thing to note here is that two queries are performed: self.count issues one query (SELECT COUNT(*) FROM models), and the other is your actual query to get a random record.
Well, now suppose you want n random records. Then write it like :
def self.random_records(n)
  records = self.count
  rand_ids = Array.new(n) { rand(records) }
  self.where('something = ? and id IN (?)', "something", rand_ids)
end
Use data.sample(rand(100..200))
For more info on why rand is not working, read https://rails.lighthouseapp.com/projects/8994-ruby-on-rails/tickets/4555

ActiveRecord::Base SQL result object types different in MySQL and PostgreSQL

Am I missing something? Same console, same codebase, different database connections; result: different object types returned. If MySQL is used, we get an array of arrays; if PostgreSQL is used, we get an array of hashes.
Example classes
class User < ActiveRecord::Base
  ...
end

class Series < ActiveRecord::Base
  establish_connection postgres_database_hash
  ...
end
With a MySQL connection
> User.connection.execute('SELECT * from users limit 2').to_a
  (211.0ms)  SELECT * from users limit 2
=> [[1, "jmetta", "jmetta@gmail.com"], [2, "johnmetta", "jmetta+test@gmail.com"]]
With a Postgres connection
> Series.connection.execute('SELECT * from series limit 2').to_a
  (107.1ms)  SELECT * from series limit 2
=> [{"id"=>"29", "enr_id"=>"114118", "ent_id"=>"164"}, {"id"=>"30", "enr_id"=>"114110", "ent_id"=>"164"}]
Coda
It seems that, at this level of abstraction where to_a is asked to give a result, the result should be the same.
The thing you're missing is that execute is a very low-level connection to the underlying database driver that is most useful for sending SQL DDL (Data Definition Language) into the database to manually alter tables, add constraints and indexes that AR doesn't understand, etc. If you want to send raw queries into the database and get raw results back, you should use the slightly higher level select_rows instead:
User.connection.select_rows('SELECT * from users limit 2').each do |row|
  # `row` is an Array of Strings here
end
A select_rows call should give you an array-of-arrays-of-strings with any database.
execute will return whatever the underlying driver returns; select_rows will return something consistent.
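If you want hashes on both adapters instead, select_all is the analogous higher-level call; a minimal sketch:

# select_all returns an ActiveRecord::Result; to_a yields an array of
# hashes keyed by column name on both MySQL and PostgreSQL
rows = User.connection.select_all('SELECT * from users limit 2').to_a
rows.each do |row|
  # row is a Hash, e.g. {"id" => 1, ...}, with keys per your schema
end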