Ruby on Rails SQL "SELECT" taking a long time - mysql

I am selecting records from a table named "bookings" that contains over 100,000 records. I am new to SQL and for some reason this is taking many seconds to finish, and even timing out on my production server:
def bookings_in_date_range(division, startdate, enddate)
  sql = "SELECT * FROM bookings WHERE division = '#{division}';"
  bookings = ActiveRecord::Base.connection.execute(sql) # all bookings from this division
  bookingsindaterange = bookings.select { |b| (parsedate(b["date"]) >= parsedate(startdate)) and (parsedate(b["date"]) <= parsedate(enddate)) } # refine to bookings in date range
end

def parsedate(date) # get date from mm/dd/yy format
  d = date.split("/")
  return Date.parse("#{d[2]}-#{d[0]}-#{d[1]}")
end
I also included the function I'm using to re-format the date; however, executing the SQL statement appears to be where the process is hanging up, based on my tests.
My goal is to select all "bookings" in a "division" within a specified date range. Existing code works faster for divisions with low numbers of bookings.
EDIT
Otávio's code below seems to speed things up a bit. However, my requirement is to see if a booking falls within a date range (on or after startdate and on or before enddate). I can't figure out how to get this logic into the .where statement, so I am running a loop like this:
bookings_start_thru_end = []
(startdate..enddate).each do |date|
  date_bookings = Booking.where("division = ? AND date = ?", division, date.strftime("%m/%d/%y"))
  date_bookings.each do |b|
    bookings_start_thru_end.push b
  end
end
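For reference, if the date column were a native SQL date (an assumption; the schema above stores mm/dd/yy strings, which is why the loop matches on formatted strings one day at a time), the whole loop could collapse into a single query. A sketch:

bookings_start_thru_end = Booking.where("division = ? AND date >= ? AND date <= ?", division, startdate, enddate)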
Also, the issue with crashing was the ActiveRecord session store filling up. I was dumping a bunch of data from the report into the session store to save it between requests and avoid doing another database query, but this was killing performance. The database query still takes 5 seconds or so, but I can live with that.

Use EXPLAIN to see what the query execution plan is:
https://dev.mysql.com/doc/refman/5.6/en/explain.html
https://dev.mysql.com/doc/refman/5.6/en/using-explain.html
Now my guess is that you do not have indexes on the columns you reference in your WHERE clause, which leads to full table scans and causes your query to run very slowly. But that is just my guess, since I do not know your tables.
The indexes will be required whether you are using raw SQL or ActiveRecord (spit).
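For example, a composite index covering the division filter and the date refinement would let MySQL avoid the full scan. A sketch (column names taken from the question; the index name is arbitrary):

ALTER TABLE bookings ADD INDEX idx_bookings_division_date (division, date);

-- then re-check the plan:
EXPLAIN SELECT * FROM bookings WHERE division = 'some_division';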

Whenever possible, you should avoid executing raw SQL in your applications. Prefer ActiveRecord's query interface; this not only makes your app more secure (parameterized queries protect against SQL injection), it also builds queries in a form the database can optimize.
In your case, refactor your bookings_in_date_range method to use ActiveRecord's .where method. To look for bookings in a range (on or after startdate and on or before enddate), use:

def bookings_in_date_range(division, startdate, enddate)
  YourModelName.where("division = ? AND date >= ? AND date <= ?", division, parsedate(startdate), parsedate(enddate))
end
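If the date column is a real SQL date type (an assumption; the question stores mm/dd/yy strings, which defeats both range comparison and index use), the same query can also be written with a Ruby range, which Rails turns into a BETWEEN:

Booking.where(division: division, date: startdate..enddate)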

Related

How can I pull data from my database using the Django ORM that annotates values for each day?

I have a Django app that is attached to a MySQL database. The database is full of records - several million of them.
My models look like this:
class LAN(models.Model):
    ...

class Record(models.Model):
    start_time = models.DateTimeField(...)
    end_time = models.DateTimeField(...)
    ip_address = models.CharField(...)
    LAN = models.ForeignKey(LAN, related_name="records", ...)
    bytes_downloaded = models.BigIntegerField(...)
    bytes_uploaded = models.BigIntegerField(...)
Each record reflects a window of time, and shows whether a particular IP address on a particular LAN did any downloading or uploading during that window.
What I need to know is this:
Given a beginning date and an end date, give me a table of which DAYS a particular LAN had ANY activity (i.e. has any records).
Ex: Between Jan 1 and Jan 31, tell me which DAYS LAN A had ANY records on them.
Assume that once in a while, a LAN will shut down for days at a time and have no records or any activity on those days.
My Solution:
I can do this the slow way by attaching some methods to my LAN model:
class LAN(models.Model):
    ...

    # Returns True if there are records for the current LAN between 2 given dates
    # Returns False otherwise
    def online(self, start, end):
        criterion1 = Q(start_time__lt=end)
        criterion2 = Q(end_time__gt=start)
        return self.records.filter(criterion1 & criterion2).exists()

    # Returns a list of days that a LAN was online for between 2 given dates
    def list_online_days(self, start, end):
        start_date = timezone.make_aware(timezone.datetime.strptime(start, "%b %d, %Y"))
        end_date = timezone.make_aware(timezone.datetime.strptime(end, "%b %d, %Y"))
        end_date = end_date.replace(hour=23, minute=59, second=59, microsecond=999999)
        days_online = []
        current_date = start_date
        while current_date <= end_date:
            start_of_day = current_date.replace(hour=0, minute=0, second=0, microsecond=0)
            end_of_day = current_date.replace(hour=23, minute=59, second=59, microsecond=999999)
            if self.online(start=start_of_day, end=end_of_day):
                days_online.append(current_date.date())
            current_date += timezone.timedelta(days=1)
        return days_online
At which point, I can run:
lan = LAN.objects.get(id=1) # Or whatever LAN I'm interested in
days_online = lan.list_online_days(start="Jan 1, 2020", end="Jan 31, 2020")
This works, but results in one query being run per day between my start date and end date. In this case, 31 queries (Jan 1, Jan 2, etc.).
This makes it really, really slow for large time periods, as it needs to go through all the records in the database 31 times. Database indexing helps, but it's still slow with enough data in the database.
Is there a way to do a single database query to give me what I need?
I feel like it would look something like this, but I can't quite get it right:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date').annotate(exists=Exists(SOMETHING))
The first part:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date')
Seems to give me what I want - one value per day, but I'm not sure how to annotate the result with an exists field that shows if any records exist on that day.
Note: This is a simplified version of my app - not the exact models and fields, so if certain things could be improved, like not using CharField for the ip_address field, don't focus too much on that
The answer ended up being simpler than I thought, mostly because I already had it.
This:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date').annotate(exists=Exists(Record.objects.filter(pk=OuterRef('pk'))))
Was what I was expecting, but all it does is return exists=True for all days returned, which is accurate, but not overly helpful. This is because any days that had no records on them are already omitted from the results.
That means I can skip the entire annotate section, and just do this:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date')
which already gives me a list of datetime objects for the days when records were present, and skips any days where there weren't.
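Wrapped up as a model method (a sketch; it reuses the overlap filter from the question and assumes the same models and fields):

from django.db.models import Q
from django.db.models.functions import TruncDay

class LAN(models.Model):
    ...

    def days_online(self, start, end):
        # records that overlap the [start, end) window
        overlapping = self.records.filter(Q(start_time__lt=end) & Q(end_time__gt=start))
        # one row per calendar day that had at least one record
        return (overlapping
                .annotate(date=TruncDay('start_time'))
                .values_list('date', flat=True)
                .distinct()
                .order_by('date'))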

Does Django (w/ MySQL) do only 1 table lookup or multiple when counting with CASE

Let's say we have a model of the form:
class Person(models.Model):
    is_gay = models.BooleanField()
    is_tall = models.BooleanField()
    is_nice = models.BooleanField()
    ...
Now let's say we want to know how many people meet different criteria. We can achieve this by counting them:
num_gays = models.Person.objects.filter(is_gay=True).count()
num_tall_and_nice = models.Person.objects.filter(is_tall=True, is_nice=True).count()
Unfortunately, this needs 2 database queries. As you can imagine, as the number of types of people grows (e.g. an enumeration of 25 or 30), this can slow things down quite a lot.
So now we can optimize by instead using aggregations:

aggregations = {}

when = models.When(is_gay=True, then=1)
case = models.Case(when, output_field=models.IntegerField())
sum = models.Sum(case)
aggregations['num_gays'] = sum

when = models.When(is_tall=True, is_nice=True, then=1)
case = models.Case(when, output_field=models.IntegerField())
sum = models.Sum(case)
aggregations['num_tall_and_nice'] = sum

result = models.Person.objects.aggregate(**aggregations)
However, I am curious as to how Django (using MySQL) processes this query.
Does it look at the table only once, and each time it looks at a row, it adds 1 to every single CASE statement that applies. Or does it look at the table N times where N is the number of CASE statements?
No, this will be just one query, because Django's Case/When translates more or less directly to MySQL's CASE/WHEN. When in doubt, you can always find out which queries Django executed using this bit of code:
from django.db import connection
print(connection.queries)
Any query on an RDBMS without a WHERE clause examines the full table; every single row will be looked at, and this query doesn't have a WHERE clause. But as for the number of queries executed: it's exactly one.
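The SQL Django generates looks roughly like this (a sketch, not Django's exact output; the real table name would carry the app prefix):

SELECT
    SUM(CASE WHEN is_gay THEN 1 ELSE NULL END) AS num_gays,
    SUM(CASE WHEN is_tall AND is_nice THEN 1 ELSE NULL END) AS num_tall_and_nice
FROM person;

MySQL walks the table once and evaluates every CASE expression for each row.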

SSIS For Each loop based on records

I want to accomplish a fairly simple task (I'd think).
I have one table with a shiftid (INT), shiftstart (datetime), shiftend (datetime).
I'd like to query that table, then run a query (on an entirely different database) that asks for production (which is calculated in an odd way - requiring three separate queries) using the start and end times, and store that in the original database with the shiftid and a production amount for the shift.
I've tried to do this using a Foreach Loop and a script task that builds a variable that would contain the query, but I'm continually hitting a brick wall there.
Dts.Variables("User::SQLshiftstart").Value = "SELECT value FROM[dbo].[AnalogHistory] WHERE TagName = 'Z_HISTFMZ_P2_0004' AND DateTime = '" & Dts.Variables("User::shiftstart").ToString
I keep getting an error - "Command text was not set for the command object". And googling that error doesn't push me any further down the path.
Help!
Well, I decided to go a different way instead of using a script object to build a variable. I actually created the variable in my SELECT:
SELECT CONCAT(
    'SELECT CAST(value AS DECIMAL(10,4)) AS beg FROM [dbo].[AnalogHistory] WHERE TagName = ''Z_HISTDATA_P1_0007'' AND DateTime = ''',
    DATEADD(hh, -6, shiftstart),
    ''' AND wwTimeZone = ''UTC''')
This way, I avoid having to build an intermediate script object and can directly query based on the variable name in my FOREACH loop.

Sidekiq: Find last job

I have two Sidekiq jobs. The first loads a feed of articles in JSON and splits it into multiple jobs. It also creates a log and stores a start_time.
class LoadFeed
  include Sidekiq::Worker

  def perform(url)
    log = Log.create! start_time: Time.now, url: url
    articles = load_feed(url) # this one loads the feed
    articles.each do |article|
      ProcessArticle.perform_async(article, log.id)
    end
  end
end
The second job processes an article and updates the end_time field of the previously created log, to find out how long the whole process (loading the feed, splitting it into jobs, processing the articles) took.
class ProcessArticle
  include Sidekiq::Worker

  def perform(data, log_id)
    process(data)
    Log.find(log_id).update_attribute(:end_time, Time.now)
  end
end
But now I have some problems / questions:
Log.find(log_id).update_attribute(:end_time, Time.now) isn't atomic, and because of the async behaviour of the jobs, this could lead to incorrect end_time values. Is there a way to do an atomic update of a datetime field in MySQL with the current time?
The feed can get pretty long (~ 800k articles) and updating a value 800k times when you would just need the last one seems like a lot of unnecessary work. Any ideas how to find out which one was the last job, and only update the end_time field in this job?
For 1) you could do an update with one less query and let MySQL find the time:
Log.where(id: log_id).update_all('end_time = now()')
For 2), one way to solve this would be to update your end time only if all articles have been processed, for example by having a boolean you could query. This does not reduce the number of queries, but it would certainly perform better.
if feed.articles.needs_processing.none?
  Log.where(id: log_id).update_all('end_time = now()')
end
This is the problem Sidekiq Pro's Batch feature solves: you create a set of jobs, and it calls your code when they are all complete.
class LoadFeed
  include Sidekiq::Worker

  def on_success(status, options)
    Log.find(options['log_id']).update_attribute(:end_time, Time.now)
  end

  def perform(url)
    log = Log.create! start_time: Time.now, url: url
    articles = load_feed(url) # this one loads the feed

    batch = Sidekiq::Batch.new
    batch.on(:success, self.class, 'log_id' => log.id)
    batch.jobs do
      articles.each do |article|
        ProcessArticle.perform_async(article, log.id)
      end
    end
  end
end

Between two datetimes in MySQL

I am trying to create a reservation system in PHP, and I have a table (field_data_field_dateres) with two fields: field_dateres_value (start date) and field_dateres_value2 (end date). I want to find out whether a conflict occurs between reservations.
The table currently has a record in it, and I am writing a query like this:
SELECT * FROM `field_data_field_dateres` WHERE field_dateres_value>='2014-02-14 20:15:00' and field_dateres_value2<='2014-02-14 20:30:00';
where '2014-02-14 20:15:00' and '2014-02-14 20:30:00' will come from the PHP side. But it's returning an empty record set. Thanks for any help.
Since you want to find overlapping (conflicting) times, the query you want is probably this instead:
SELECT * FROM `field_data_field_dateres`
WHERE field_dateres_value < '2014-02-14 20:30:00'
AND field_dateres_value2 > '2014-02-14 20:15:00';
Note that the end time is compared to your new time slot's start time, and the start time is compared to your new time slot's end time. This will return all time windows in the database that overlap with your new range.
An SQLfiddle to test with.
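As a quick sanity check (hypothetical data; table and column names from the question):

-- an existing reservation from 20:00 to 20:20
INSERT INTO field_data_field_dateres (field_dateres_value, field_dateres_value2)
VALUES ('2014-02-14 20:00:00', '2014-02-14 20:20:00');

-- it overlaps the requested 20:15-20:30 slot, so the conflict query returns it
SELECT * FROM field_data_field_dateres
WHERE field_dateres_value < '2014-02-14 20:30:00'   -- existing start is before the new end
  AND field_dateres_value2 > '2014-02-14 20:15:00'; -- existing end is after the new start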