Rails - how to fetch random records from an object? - mysql

I am doing something like this:
data = Model.where('something="something"')
random_data = data.rand(100..200)
returns:
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Once I get this random data, I need to iterate through that data, like this:
random_data.each do |rd|
...
I know there's a way to fetch random rows directly in MySQL, but I need to pick random data about 400 times, so I figure that loading the data from the database once and picking random records from it 400 times is more efficient than running the query 400 times against MySQL.
But how do I get rid of that error?
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Thank you in advance

I would add the following scope to the model (depends on the database you are using):
# in app/models/model.rb
# 'RANDOM' works with postgresql and sqlite, whereas mysql uses 'RAND'
scope :random, -> { order('RAND()') }
Then the following query would load a random number (in the range of 200-400) of objects in one query:
Model.random.limit(rand(200..400))
If you really want to do that in Rails and not in the database, then load all records and use sample:
Model.all.sample(rand(200..400))
But expect that to be slower (depending on the number of entries in the database), because Rails would load all records from the database and instantiate them, which might take loads of memory.
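If the scope needs to work across databases, here is a minimal adapter-aware sketch (an assumption on my part, not part of the original answer; it relies on ActiveRecord's connection.adapter_name, and newer Rails versions also want raw order fragments wrapped in Arel.sql):
scope :random, -> {
  # pick the SQL random function for the current adapter
  # (assumption: only MySQL spells it RAND)
  fn = connection.adapter_name =~ /mysql/i ? 'RAND()' : 'RANDOM()'
  order(Arel.sql(fn)) # Arel.sql marks the raw fragment as intentional
}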

It really depends how much effort you want to put into optimizing this, because there's more than one solution. Here are two options.
Something simple is to use ORDER BY RAND() LIMIT 400 to randomly select 400 items.
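In ActiveRecord that could look like the following sketch (Arel.sql is needed on newer Rails versions to mark the raw fragment as intentional):
# let MySQL shuffle and cap the result set server-side
Model.where(something: 'something').order(Arel.sql('RAND()')).limit(400)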
Alternatively, just select everything under the moon and then use Ruby to randomly pick 400 out of the total result set, ex:
data = Model.where(something: 'something').to_a # to_a forces the query to run once
400.times do
  data.sample # returns a random model instance from the loaded array
end
I wouldn't recommend the second method, but it should work.

Another way, which is not DB specific is :
def self.random_record
  # rand(1..count) because ids start at 1; assumes ids are contiguous
  self.where('something = ? and id = ?', 'something', rand(1..self.count)).first
end
The only catch here is that two queries are performed: self.count issues one query (SELECT COUNT(*) FROM models), and the other is your actual query to fetch the random record.
Now suppose you want n random records. Then write it like:
def self.random_records(n)
  total = self.count
  rand_ids = Array.new(n) { rand(1..total) } # may contain duplicates or ids with no row
  self.where('something = ? and id IN (?)', 'something', rand_ids)
end
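A hypothetical usage example (note that rand can produce duplicate ids and ids may have gaps, so the query can return fewer than n rows):
# Model and the 'something' column are placeholders from the snippets above
records = Model.random_records(10)
records.each { |r| puts r.id }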

Use data.sample(rand(100..200))
For more info on why rand is not working here, read this: https://rails.lighthouseapp.com/projects/8994-ruby-on-rails/tickets/4555
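Putting that together with the original snippet, a minimal sketch (assuming the data should be loaded once and sampled in Ruby):
data = Model.where(something: 'something').to_a # one query, loaded once
random_data = data.sample(rand(100..200))       # Array#sample needs a real Array, not a Relation
random_data.each do |rd|
  # ...
end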

Related

Stream a NOT NULL selection of a table?

I'm trying to select the primary key for all rows in a table based on if another column is NULL.
The following code does not do what I want, but it shows what the query would look like as a pure select(); the table is so large that it nearly fills up memory before returning any results.
s = tweets.select().where(tweets.c.coordinates != None)
result = engine.execute(s)
for row in result:
    print(row)
Because the table is so large, I found a streaming solution that works for the session.query() object:
def page_query(q):
    r = True
    offset = 0
    while r:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
So I'm trying to structure the above select() as a query(), but when I do, it returns every row in the table, including ones with coordinates = 'null':
q= session.query(Tweet).filter(Tweet.coordinates.is_not(None))
for i in page_query(q):
    print(f' {i}')
If I instead do
q= session.query(Tweet).filter(Tweet.coordinates.is_not('null'))
for i in page_query(q):
    print(f' {i}')
I get an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "'null'"
LINE 3: WHERE milan_tweets.coordinates IS NOT 'null'
^
(using != appears to give the same results as the built-in .is_not())
So how can I make this selection?
EDIT: Code block at the top does NOT do what I expected originally, my mistake.
Rows are added to the database as Python Nones, and looking in DBeaver shows the values as "null".
You have correctly diagnosed the problem. The query returns e.g. a million rows, and the psycopg2 driver drags all of those result rows over the network, buffering them locally, before returning even a single row up to your app. Why? Because the public API includes a detail where your app could ask "how many rows were in that result?", and the driver must retrieve everything in order to learn that bit of trivia. If you promise not to ask the "how many?" question, you can stream results with this:
import sqlalchemy as sa
engine = sa.create_engine(uri).execution_options(stream_results=True)
Then rows will be delivered up to your app nearly as soon as they become available, rather than being buffered for a long time. This yields a significantly smaller memory footprint for your Python process, as the DB driver layer does not need to malloc() storage sufficient to hold all million result rows.
https://docs.sqlalchemy.org/en/14/core/connections.html#streaming-with-a-fixed-buffer-via-yield-per
cf. test_core_fetchmany_w_streaming

SqlAlchemy table name reflection using an efficient method

I am using the code below to extract table names from a database on a GET call in a Flask app:
session = db.session()
qry = session.query(models.BaseTableModel)
results = session.execute(qry)
table_names = []
for row in results:
    for column, value in row.items():
        # this seems like a bit of a hack
        if column == "tables_table_name":
            table_names.append(value)
print('{0}: '.format(table_names))
Given that tables in the database may be added or deleted regularly, is the code above an efficient and reliable way to get the names of tables in a database?
One obvious optimization is to use row["tables_table_name"] instead of the second loop.
Assuming that BaseTableModel is a table which contains the names of all other tables, you're using the fastest approach to get this data.

Multiple, unknown number of fields passed into a query

Is it possible to create a generic query that would work for different types of documents? For example, I have "cases" and "factories". They have different sets of fields, e.g.:
{
id: 'case_o1',
name: 'Case numero uno',
amount: 40
}
{
id: 'factory_002',
location: 'Venezuela',
workers: 200,
operating: true
}
Is it possible to create a generic query where I would pass the type of an entity (case or factory) and additional parameters and it would filter results based on those?
I could of course use a JavaScript view, but it doesn't allow me to filter by multiple fields. Let's say I want to fetch all factories located in Venezuela with a number of workers between 20 and 55.
I started with this, but then I got stuck:
select * from `mybucket` as entity
where position(meta(entity).id, $entity_type) == 0
How do I pass multiple predicates and have the query to recognize them?
I can of course list fields like this:
where position(meta(entity).id, $entity_type) == 0
and entity.location == 'Venezuela'
and entity.workers > $workers_min
and entity.workers < $workers_max
but then:
1. I'm going to have to create a separate query for each entity.
2. Even then it won't solve my problem: I have no idea how to ignore predicates. What if next time $workers_min and $workers_max are not passed? Does that mean I have to create a query for every single combination of predicates (columns)?
For security reasons I cannot generate free-form queries and pass them to the Couchbase server; all the queries are already stored in the database, and our API just picks them out of a document and executes them.
I think it's possible to create a query that would be "short-circuiting" for args that are undefined (e.g. WHERE $location IS MISSING OR entity.location == $location, or something like that).
Is it possible at all to create a query that would be able to effectively filter and order a dataset based on arbitrary parameters? Or there's no way?
@Agzam: Sorry, I was writing my comment when you posted that. Anyway, what you are asking for is possible by using COALESCE in not-too-complex expressions, but it is a REALLY bad idea, because it will drastically defeat most internal database optimizations, including the use of any existing index. So, unless you are dealing with a relatively small database (and you are sure it will remain approximately the same size), I suggest you try a different approach… This is, in fact, the reason I implemented sqlapi.
If you need all the queries to be stored in the database beforehand, it would probably be much better to sort the given arguments by name and to precalculate and store a query for each possible combination.
You can do it by assigning a default value to the variable when it is not used. For instance, if $location is not used you can set it to -1 as its default value.
Then the where condition would be (extending the same pattern to the other optional parameters, e.g. $workers_min):
WHERE ($location = -1 OR entity.location = $location)
  AND ($workers_min = -1 OR entity.workers > $workers_min)

What's the Mongoid equivalent to COUNT(`column`) in MySQL?

I want to count the number of matching documents for a query using Mongoid, such as:
Chain.where(:updated_at.gte => past_time).count
However, I am worried that what's actually happening here is that Mongoid is selecting and PARSING everything from MongoDB and then returning the count to me. That seems very slow. I want Mongo to directly return a count to me, so that Ruby/Mongoid doesn't have to parse a large number of objects. In MySQL I would do this with COUNT(column), which would spare PHP (for instance) the hassle of parsing/mapping a bunch of rows just to disregard them, since I'm only interested in the number of rows returned.
You're worrying needlessly. If you check the Mongoid docs, you'll see that Criteria#count is a thin wrapper around Moped::Query#count. If you look at how Moped::Query#count works, you'll see this:
def count(limit = false)
  command = { count: collection.name, query: selector }
  command.merge!(skip: operation.skip, limit: operation.limit) if limit
  result = collection.database.command(command)
  result["n"].to_i
end
So Moped::Query#count simply sends a count command down into MongoDB, then MongoDB does the counting and sends the count back to your Ruby code.
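For illustration, a sketch from the calling side (past_time is a hypothetical placeholder value):
past_time = Time.now - 3600
# only the count command and the resulting integer cross the wire;
# no documents are fetched or instantiated in Ruby
Chain.where(:updated_at.gte => past_time).count # => e.g. 42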

How can I find out how much time it takes to query items in a MySQL table?

Our website has a problem: one page takes too long to load. We have found that it has an n*n matrix on that page, and for each item in the matrix it queries three tables in the MySQL database. Every item in that matrix does quite similar queries.
So I wonder whether the large number of MySQL queries is causing the problem, and I want to try to fix it. Here is one of the things I'm unsure about:
1.
m = store.execute('SELECT X FROM TABLE1 WHERE I=1')
result = store.execute('SELECT Y FROM TABLE2 WHERE X in m')
2.
r = store.execute('SELECT X, Y FROM TABLE2')
result = []
for each in r:
    i = store.execute('SELECT I FROM TABLE1 WHERE X=%s', each[0])
    if i[0][0] == 1:
        result.append(each)
There are about 200 items in TABLE1 and more than 400 items in TABLE2. I don't know which part takes the most time, so I can't make a good decision about how to write my SQL statements.
How can I find out how much time some operation takes in MySQL? Thank you!
Rather than installing a bunch of special tools, you could take a dead-simple approach like this (pardon my Ruby):
start = Time.now
# DB query here
puts "Query XYZ took #{Time.now - start} sec"
Hopefully you can translate that to Python. OR... pardon my Ruby again...
QUERY_TIMES = {}

def query(sql)
  start = Time.now
  result = connection.execute(sql)
  elapsed = Time.now - start
  QUERY_TIMES[sql] ||= []
  QUERY_TIMES[sql] << elapsed
  result # return the query result, not the timings array
end
Then run all your queries through this custom method. After doing a test run, you can make it print out the number of times each query was run, and the average/total execution times.
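For example, a report along these lines (a sketch, assuming the QUERY_TIMES hash from the snippet above):
# print per-query call counts plus average and total elapsed time
QUERY_TIMES.each do |sql, times|
  total = times.sum
  puts "#{sql}: #{times.size} runs, avg #{(total / times.size).round(4)}s, total #{total.round(4)}s"
end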
For the future, plan to spend some time learning about "profilers" (if you haven't already). Get a good one for your chosen platform, and spend a little time learning how to use it well.
I use MySQL Workbench for SQL development. It gives response times and can connect remotely to MySQL servers, provided you have permission (which in this case will give you a more accurate reading).
http://www.mysql.com/products/workbench/
Also, as you've realized, it appears you have an SQL statement in a for loop. That can drastically affect performance. You'll want to take a different route to retrieving that data.