I have two Sidekiq jobs. The first loads a feed of articles in JSON and splits it into multiple jobs. It also creates a log and stores a start_time.
class LoadFeed
  include Sidekiq::Worker

  def perform(url)
    log = Log.create! start_time: Time.now, url: url
    articles = load_feed(url) # this one loads the feed

    articles.each do |article|
      ProcessArticle.perform_async(article, log.id)
    end
  end
end
The second job processes an article and updates the end_time field of the previously created log, to find out how long the whole process (loading the feed, splitting it into jobs, processing the articles) took.
class ProcessArticle
  include Sidekiq::Worker

  def perform(data, log_id)
    process(data)
    Log.find(log_id).update_attribute(:end_time, Time.now)
  end
end
But now I have some problems / questions:
1) Log.find(log_id).update_attribute(:end_time, Time.now) isn't atomic, and because of the async behaviour of the jobs, this could lead to incorrect end_time values. Is there a way to do an atomic update of a datetime field in MySQL with the current time?
2) The feed can get pretty long (~800k articles), and updating a value 800k times when you only need the last one seems like a lot of unnecessary work. Any ideas how to find out which job was the last one, and only update the end_time field in that job?
For 1) you could do an update with one less query and let MySQL find the time:
Log.where(id: log_id).update_all('end_time = now()')
For 2), one way to solve this would be to update your end time only if all articles have been processed, for example by having a boolean flag that you can query. This does not reduce the number of queries, but it would certainly have better performance.
if feed.articles.needs_processing.none?
  Log.where(id: log_id).update_all('end_time = now()')
end
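A rough sketch of what that could look like, assuming the articles are persisted with a hypothetical processed boolean column and a needs_processing scope (and that jobs receive an article_id rather than raw data). Note that the none? check itself is not atomic, so two jobs finishing at exactly the same moment could both run the update:

class Article < ActiveRecord::Base
  # Hypothetical flag set once a job has handled the article
  scope :needs_processing, -> { where(processed: false) }
end

class ProcessArticle
  include Sidekiq::Worker

  def perform(article_id, log_id)
    article = Article.find(article_id)
    process(article)
    article.update_attribute(:processed, true)

    # Whichever job sees no remaining work writes the end time
    if Article.needs_processing.none?
      Log.where(id: log_id).update_all('end_time = now()')
    end
  end
end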
This is the problem Sidekiq Pro's Batch feature solves. You create a set of jobs, and it calls your code when they are all complete.
class LoadFeed
  include Sidekiq::Worker

  def on_success(status, options)
    Log.find(options['log_id']).update_attribute(:end_time, Time.now)
  end

  def perform(url)
    log = Log.create! start_time: Time.now, url: url
    articles = load_feed(url) # this one loads the feed

    batch = Sidekiq::Batch.new
    batch.on(:success, self.class, 'log_id' => log.id)
    batch.jobs do
      articles.each do |article|
        ProcessArticle.perform_async(article, log.id)
      end
    end
  end
end
Related
I have additional calculated columns (based on joins) that I want to include in my CSV.
If I compute them individually for every record
csv do
  column(:complicated_calculation) { |r| r.calculate_things }
end
it's going to take a long time to generate with thousands of records.
I need to customize the SELECT query for when my CSV is generated and then use the columns in that query. How do I do that?
"Customizing resource retrieval" in the documentation shows you how to do this without rewriting the whole csv builder: by overriding scoped_collection.
So if you have your query nicely waiting in your model:
class Person < ActiveRecord::Base
  def self.with_calculation
    select("people.*, (mumbo + jumbo * fumbo) "\
           "AS complicated_calculation") # notice we name the attribute here
      .joins("LEFT JOIN games ON person_id = people.id")
      .group("people.id")
  end
end
with your calculation you can do:
controller do
  def scoped_collection
    super.with_calculation
  end
end
and then your CSV will have the attribute for free:
csv do
  column :complicated_calculation
end
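Putting the two pieces together, the resource registration might look like this (a sketch assuming the Person model and with_calculation scope shown above):

ActiveAdmin.register Person do
  controller do
    # Use the calculation-aware scope for index, CSV, etc.
    def scoped_collection
      super.with_calculation
    end
  end

  csv do
    column :complicated_calculation
  end
end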
With Django, I have two related models. Call the first one BaseObject. The second one is called BaseObjectObservation, where every 6 hours or so I create a new BaseObjectObservation that's linked via ForeignKey to a BaseObject and has another field for a particular data point about that object at that time, along with a timestamp.
As you might expect, one thing I'm always interested in is the "latest" BaseObjectObservation for a given BaseObject. The trouble is that there are now lots of observations for each BaseObject, and even with ~500 BaseObjects, loading a page that shows every BaseObject with its latest observation gets very slow.
Any recommendations on how to speed up the retrieval of the latest observation?
Bonus question: I'm also interested in how each object's observation has changed over the last 24 hours. Previously I tried querying for the latest observation and the observation closest to 24 hours ago and calculating the difference; this was too slow as well. Any recommendations here?
You could do something like:
class BaseObject(models.Model):
    pass


class BaseObjectObservation(models.Model):
    base_object = models.ForeignKey(BaseObject, related_name="observations")
    last_modification = models.DateTimeField(auto_now=True)
    latest = models.BooleanField(default=False)

    def save(self, **kwargs):
        if not self.pk:
            # mark new instance as latest
            self.latest = True
            # Update previous observations
            self.base_object.observations.update(latest=False)
        super().save(**kwargs)
Then, if you want to get the latest observations with their base objects, you can do:
BaseObjectObservation.objects.filter(latest=True).select_related('base_object')
The select_related clause will save you 500 queries, because it will fetch the base object along with the observation.
Since you do everything in a single query, performance should be better. However, cleaner solutions may exist that don't need to store a boolean on each instance.
Bonus
For your bonus question, you can probably get some inspiration here:
import datetime

from django.utils import timezone

# The point in time we want to compare against
target = timezone.now() - datetime.timedelta(hours=24)

current_observation = base_object.observations.get(latest=True)

# Closest observation on either side of the 24-hour mark
closest_observation_greater = base_object.observations.filter(creation_date__gt=target).order_by('creation_date').first()
closest_observation_lower = base_object.observations.filter(creation_date__lte=target).order_by('-creation_date').first()

if closest_observation_greater.creation_date - target > target - closest_observation_lower.creation_date:
    return closest_observation_lower
else:
    return closest_observation_greater
However, that's still two queries for each base object. You can probably optimize it further, but you can also reduce
the number of elements you display on each page. Do you really need to display 500 elements on the same page?
I am selecting records from a table named "bookings" that contains over 100,000 records. I am new to SQL and for some reason this is taking many seconds to finish, and even timing out on my production server:
def bookings_in_date_range(division, startdate, enddate)
  sql = "SELECT * FROM bookings WHERE division = '#{division}';"
  bookings = ActiveRecord::Base.connection.execute(sql) # all bookings from this division
  bookingsindaterange = bookings.select { |b| (parsedate(b["date"]) >= parsedate(startdate)) and (parsedate(b["date"]) <= parsedate(enddate)) } # refine to bookings in date range
end

def parsedate(date) # get date from mm/dd/yy format
  d = date.split("/")
  return Date.parse("#{d[2]}-#{d[0]}-#{d[1]}")
end
I also included the function I'm using to re-format the date, however executing the SQL statement appears to be where the process is hanging up, based on my tests.
My goal is to select all "bookings" in a "division" within a specified date range. Existing code works faster for divisions with low numbers of bookings.
EDIT
Otávio's code below seems to speed things up a bit. However, my requirement is to see if a booking falls within a date range (on or after startdate and on or before enddate). I can't figure out how to get this logic into the .where statement, so I am running a loop like this:
bookings_start_thru_end = []
(startdate..enddate).each do |date|
  date_bookings = Booking.where("division = ? AND date = ?", division, date.strftime("%m/%d/%y"))
  date_bookings.each do |b|
    bookings_start_thru_end.push b
  end
end
Also, the issue with crashing was ActiveRecord session store filling up. I was dumping a bunch of data from the report into the session store to save it between requests to avoid doing another database query, but this was killing performance. The database query is still taking 5 seconds or so, but I can live with that.
Use EXPLAIN to see what the query execution plan is:
https://dev.mysql.com/doc/refman/5.6/en/explain.html
https://dev.mysql.com/doc/refman/5.6/en/using-explain.html
Now my guess is that you do not have indexes on the columns you are referencing in your WHERE clause, and that leads to table scans, which cause your query to run very slowly. But that is just my guess, since I do not know your tables.
The indexes will be required whether you are using raw SQL or ActiveRecord (spit).
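For example, a migration along these lines would add a composite index covering the WHERE clause (a sketch: the class name is illustrative, and you may need to pin the migration superclass version on newer Rails):

class AddDivisionAndDateIndexToBookings < ActiveRecord::Migration
  def change
    # One composite index lets MySQL filter by division and then by date
    # without scanning the whole bookings table.
    add_index :bookings, [:division, :date]
  end
end

After running it, EXPLAIN should show the query using the new index instead of a full table scan.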
Whenever possible, you should avoid executing raw SQL in your applications. Prefer the ActiveRecord interfaces; this will not only make your app more secure, it will also build queries in a way that is better optimized.
In your case, refactor your bookings_in_date_range method to use ActiveRecord's .where method:
def bookings_in_date_range(division, enddate, startdate)
  YourModelName.where("division = ? AND enddate = ? AND startdate = ?", division, parsedate(enddate), parsedate(startdate))
end
To look for bookings whose date falls within the range, use
YourModelName.where("division = ? AND date >= ? AND date <= ?", division, parsedate(startdate), parsedate(enddate))
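An equivalent, more idiomatic form passes a Ruby Range, which ActiveRecord turns into a SQL BETWEEN. This is only a sketch: it assumes the model is called Booking and that the date column is a real DATE type; if dates are stored as mm/dd/yy strings (as the parsedate helper suggests), string comparison will not be chronological and the column should be converted first.

# All bookings for the division whose date lies between startdate and enddate
Booking.where(division: division, date: parsedate(startdate)..parsedate(enddate))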
User.select(:name).group(:name).having("count(*) > 1")
This query works fine for selecting records with duplicate user names. But the problem I am facing is when there is a space in the name.
For example:
record1 = "Username"
record2 = "Username "
These are two records with the same name, but the above query treats them as different records because of the trailing space in the second one. So while selecting, I do not get this record.
Any solution using a plain MySQL query or Rails will do.
OR
How can I first strip or trim all the column data in the table using a Rails/MySQL query? Then I can apply the above query.
What I would do here is make sure your data is tidy in the first place.
You could put in a before-validation handler to call strip on usernames. You could do it like so:
# in lib/trimmer.rb
module Trimmer
  # Make a class method available to define space-trimming behavior.
  def self.included base
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Register a before-validation handler for the given fields to
    # trim leading and trailing spaces.
    def trimmed_fields *field_list
      before_validation do |model|
        field_list.each do |field|
          model.send("#{field}=", model.send("#{field}").strip) if model.send("#{field}").respond_to?('strip')
        end
      end
    end
  end
end
Make sure this module is required wherever you require things from lib in your config.
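For example, one common way (a sketch assuming a conventional Rails app layout; the YourApp module name is a placeholder) is to add lib to the autoload paths:

# config/application.rb
module YourApp
  class Application < Rails::Application
    # Make everything under lib/, including lib/trimmer.rb, loadable
    config.autoload_paths += %W(#{config.root}/lib)
  end
end

Alternatively, an explicit require 'trimmer' in an initializer works just as well.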
Now, in any of your models, you can do this (in this example I'm trimming some other fields besides username):
class User < ActiveRecord::Base
  include Trimmer

  trimmed_fields :username, :email, :first_name, :last_name

  ...
So that will fix you going forward. The remaining step is to tidy up your existing data. I would do this in a migration (again, I'm doing some other fields as an example):
tables_and_cols = {"users" => %w(username email first_name last_name), "foos" => %w(bar baz)}

tables_and_cols.each do |table, cols|
  cols.each do |col|
    ActiveRecord::Base.connection.execute("update #{table} set #{col} = trim(#{col})")
  end
end
Now, after doing this trim, you may have some duplicate usernames. You will need to decide how you are going to deal with that, since the records involved are no longer valid. If you haven't publicly deployed this yet, i.e. if you don't have any active real users, then it doesn't matter: you can change the usernames to something else. But if you do have real people using it, you will probably need to change the username for some of them and inform them. This is unavoidable if you want to maintain a situation where people can't have spaces at the start or end of their username.
You can use mysql's string functions:
User.select("lower(trim(name))").group("lower(trim(name))").having("count(*) > 1")
I have an external service that allows me to log users into my website.
To avoid getting kicked out of it for overuse, I use a MySQL table of the following form that caches user accesses:
username (STRING) | last access (TIMESTAMP) | has access? (TINYINT(1) - BOOLEAN)
If the user had access at a given time, I trust that he still has access and don't query the service again for a day, that is:
query_again = !user["last access"].between?(Time.now, 1.day.ago)
This always returns true for some reason. Any help with this logic?
In ranges (which you effectively use here), it is generally expected that the lower value is the start and the higher value is the end. Thus, it should work for you if you just switch the arguments in your between? call:
query_again = !user["last access"].between?(1.day.ago, Time.now)
You can test this yourself easily in IRB:
1.hour.ago.between?(Time.now, 1.day.ago)
# => false
1.hour.ago.between?(1.day.ago, Time.now)
# => true