Speeding up a Django query for "latest" ForeignKey related object - mysql

With Django, I have two related models. Call the first one BaseObject. The second one is called BaseObjectObservation, where every 6 hours or so I create a new BaseObjectObservation that's linked via ForeignKey to a BaseObject and has another field for a particular data point about that object at that time, along with a timestamp.
As you might expect, one thing I'm always interested in is the "latest" BaseObjectObservation for a given BaseObject. The trouble is that there are now lots of observations for each BaseObject, and even with ~500 BaseObjects, loading a page that shows every BaseObject with its latest observation gets very slow.
Any recommendations on how to speed up the retrieval of the latest observation?
Bonus question: I'm also interested in how each object's observation has changed over the last 24 hours. Previously I tried querying for the latest observation and the observation closest to 24 hours ago and calculating the difference; this was too slow as well. Any recommendations here?

You could do something like:
from django.db import models

class BaseObject(models.Model):
    pass

class BaseObjectObservation(models.Model):
    base_object = models.ForeignKey(BaseObject, related_name="observations")
    last_modification = models.DateTimeField(auto_now=True)
    latest = models.BooleanField(default=False)

    def save(self, **kwargs):
        if not self.pk:
            # Mark the new instance as latest and clear the flag
            # on every previous observation for this base object.
            self.latest = True
            self.base_object.observations.update(latest=False)
        super().save(**kwargs)
Then, if you want to get the latest observations with their base objects, you can do:
BaseObjectObservation.objects.filter(latest=True).select_related('base_object')
The select_related clause will save you 500 queries, because it fetches the base object along with the observation.
Since you do everything in a single query, performance should be better. However, cleaner solutions may exist that avoid storing a boolean on each instance.
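For instance, on Django 1.11+ you could drop the boolean entirely and annotate each BaseObject with the pk of its newest observation through a subquery. A sketch, untested:
from django.db.models import OuterRef, Subquery

newest = (BaseObjectObservation.objects
          .filter(base_object=OuterRef('pk'))
          .order_by('-last_modification'))

# One query: every BaseObject, annotated with the pk of its newest observation.
base_objects = BaseObject.objects.annotate(
    latest_observation_pk=Subquery(newest.values('pk')[:1]))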
Bonus
For your bonus question, you can probably get some inspiration here:
import datetime

from django.utils import timezone

# assuming each observation has a creation_date timestamp field
day_ago = timezone.now() - datetime.timedelta(hours=24)

current_observation = base_object.observations.get(latest=True)

# closest observation on each side of the 24-hour mark
closest_after = (base_object.observations
                 .filter(creation_date__gt=day_ago)
                 .order_by('creation_date')
                 .first())
closest_before = (base_object.observations
                  .filter(creation_date__lte=day_ago)
                  .order_by('-creation_date')
                  .first())

# keep whichever observation is nearer to exactly 24 hours ago
if closest_after.creation_date - day_ago > day_ago - closest_before.creation_date:
    return closest_before
else:
    return closest_after
However, that's still two queries for each base object. You can probably optimize it, but you can also reduce the number of elements you display on each page. Do you really need to display 500 elements on the same page?
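If you really do need all of them on one page, one option on Django 1.7+ is to pull every observation from the last 24 hours in a single extra query with a Prefetch object, then compute the change in Python. A rough sketch, reusing day_ago from above and assuming each observation stores its data point in a value field (the field names are illustrative):
from django.db.models import Prefetch

recent = (BaseObjectObservation.objects
          .filter(creation_date__gte=day_ago)
          .order_by('creation_date'))

# Two queries total, no matter how many BaseObjects there are.
base_objects = BaseObject.objects.prefetch_related(
    Prefetch('observations', queryset=recent, to_attr='recent_observations'))

for obj in base_objects:
    if obj.recent_observations:
        oldest, newest = obj.recent_observations[0], obj.recent_observations[-1]
        change_24h = newest.value - oldest.value  # 'value' is the hypothetical data-point field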

Related

How to get records with last dates in Django ORM (MySQL)?

I have models:
class Reference(models.Model):
    name = models.CharField(max_length=50)

class Search(models.Model):
    reference = models.ForeignKey(Reference)
    update_time = models.DateTimeField(auto_now_add=True)
I have an instance of Reference and I need to get all the latest searches for that reference. Right now I'm doing it this way:
from django.db.models import Max

record = Search.objects.filter(reference=reference)\
                       .aggregate(max_date=Max('update_time'))
if record['max_date']:
    update_time = record['max_date']
    searches = reference.search_set.filter(update_time=update_time)
Using two queries instead of one is not a big deal, but what if I need to get the latest searches for each reference on a page? That would be 2 x (count of references) queries, and it would not be good.
I was trying to use this solution https://stackoverflow.com/a/9838438/293962 but it didn't work with a filter by reference.
You probably want to use the latest method.
From the docs, "Returns the latest object in the table, by date, using the field_name provided as the date field."
https://docs.djangoproject.com/en/1.8/ref/models/querysets/#latest
so your query would be
Search.objects.filter(reference=reference).latest('update_time')
I implemented a snippet from someone's gist, but I don't remember the user and no longer have the link.
A bit of context:
I have a model named Medicion that holds the measurement records of a machine; machines are instances of the Equipo model. Besides a foreign key to Equipo, each Medicion also has a foreign key to Odometro, a model that serves as a kind of clock or meter. That's why, when I want to retrieve data (measurements, i.e. instances of Medicion) for a certain machine, I also need to indicate the clock; otherwise I would get a lot of messy, unreadable data.
Here is my implementation:
First I retrieve the last dates:
from django.db.models import Max, Q

ult_fechas_reg = Medicion.objects.values('odometro').annotate(max_fecha=Max('fecha')).order_by()
Then I instantiate a Q object:
mega_statement = Q()  # built up below into one big OR of (odometro AND fecha) pairs
Then, looping over every date retrieved in the queryset (the annotation), I build up the Q statement:
for r in ult_fechas_reg:
    mega_statement |= (Q(odometro__exact=r['odometro']) & Q(fecha=r['max_fecha']))
Finally, I pass this mega statement to the queryset that retrieves the last record of a model filtered by two fields:
resultados = Medicion.objects.filter(mega_statement).filter(
    equipo=equipo,
    odometro__in=lista_odometros).order_by('odometro', 'fecha')
# lista_odometros is a Python list containing pks of another model, don't worry about it.

Query ActiveRecord for records and relation calculations at once

TL;DR? See Edit 2
I've got a little Rails application that has a few different sort of games people can play: it's based around sports, so they can pick the winners of each game every week (model PickEm, attribute correct boolean with nil for unfinished games), and predict the outcome of a specific team's game (model Guess, attribute score with integer, nil for unfinished games). Every User has_many PickEms and Guesses. And I'm trying to display standings (correct/total - total being all non-nil, score/total possible).
What I'm finding is that I can gather the users and their associated records, but in trying to display standings I'm discovering that every single User triggers another query - slow and not sustainable as the user base increases. That's because #user.pick_em_score is pick_ems.where(correct: true).size and #user.guess_score is guesses.where.not(score: nil).sum(:score). So every call to user.pick_em_score runs that query again. I feel like there should be a way to get every User, as well as these specific counts, at once, rather than buffering a whole bunch of needless extra stuff.
What I need:
User record
User.pick_em_score (calculated by counting correct records)
User.pick_ems count where NOT NULL
User.guesses_score (calculated by guesses.sum(:score))
User.guesses count where NOT NULL
Most of the stuff I find on Rails's ActiveRecord helpers, especially related to calculations, is for retrieving only the calculation. It looks like I'll probably need to delve directly into select() etc. But I can't get it working. Can someone point me in the right direction?
Edit
For clarification: I'm aware that I can write this information to the User model, but this is overly restrictive: next season, I'll need to add a new column to the User for that year's results, etc. In addition, this is a third degree of callback updating related models – the Match model already updates related PickEms and Guesses on save. I'm looking for the simplest ActiveRecord query or queries to be able to work with this information, as indicated by the title. Ideally one query that returns the above information, but if it needs to be a few, that's OK.
I used to work directly in MySQL with PHP, but those skills have rusted (in raw MySQL, I imagine, I'd have several sub-select statements to help pull these counts) and I'd also like to be able to use Rails's ActiveRecord helpers and such, and avoid constructing raw SQL as much as possible.
Second Edit:
I seem to have it down to one call that starts to work, but I'm writing a lot of SQL. It's also brittle, IMO, and trying to run with it has failed. It also looks like I'm just pushing the million singular SELECT queries from Rails right into SQL, but that may still be a step up.
User.unscoped.select('users.*',
'(SELECT COUNT(*) FROM pick_ems WHERE pick_ems.user_id = users.id AND pick_ems.correct) AS correct_pick_ems',
'(SELECT COUNT(*) FROM pick_ems WHERE pick_ems.user_id = users.id AND pick_ems.correct IS NOT NULL) AS total_pick_ems',
'(SELECT SUM(guesses.score) FROM guesses WHERE guesses.user_id = users.id AND guesses.score IS NOT NULL) AS guesses_score',
'(SELECT COUNT(*) FROM guesses WHERE guesses.user_id = users.id AND guesses.score IS NOT NULL) AS guesses_count' )
The issue seems to be: is there a way to use Rails, and not raw SQL, to link up users.id that we see there with these subqueries? Or just … a better way to construct this, in general?
In addition, I'm running another set of SELECTs for the WHERE clause, which would hinge on total_pick_ems and guesses_count being > 0; but since I can't use those aliased columns, I have to repeat the SELECT subqueries one more time.
Welcome to AR. It's really only good for simple CRUD-like queries. Once you actually want to query your data in anger, it just doesn't have the capabilities to do the queries you want without resorting to wholesale SQL strings, often abandoning the ability to chain as a result.
That's precisely why I moved to Sequel, as it does have the features to compose queries using a much fuller SQL feature set, including join conditions, window functions, recursive common table expressions, and advanced eager loading. The author is incredibly responsive and the documentation is excellent compared to AR and Arel.
I don't expect you will like this answer, but a time will come when you start to look outside the opinionated components that come with Rails, which I have to say are hardly best of breed. Sequel also sped my application up many times over what I was able to get with AR; it's not just developer happiness, it means fewer servers to run. Yes, it will be a learning curve, but IMO it's better to learn tools that have your back covered.
Joins might work. Something like the below:
User.unscoped.joins(:guesses).joins(:pick_ems).
  where("guesses.score IS NOT NULL").
  select("users.*,
          SUM(guesses.score) AS guesses_score,
          COUNT(guesses.id) AS guesses_count,
          COUNT(CASE WHEN pick_ems.correct = TRUE THEN 1 ELSE NULL END)
            AS correct_pick_ems,
          COUNT(CASE WHEN pick_ems.correct IS NOT NULL THEN 1 ELSE NULL END)
            AS total_pick_ems").
  group("users.id")
If you need this information for a limited number of users at a time, then the above query, or eager loading (User.includes(:guesses, :pick_ems)) with class methods like
def correct_pick_ems
  pick_ems.count(&:correct)
end
would work.
However, if you need this information for all the users most of the time, cached counters within the users table would be more optimal.
What you need is some sort of custom (smart) counter_cache that counts only under certain conditions (e.g. correct is true).
You can achieve this using conditional after_save & after_destroy triggers to build your own custom counter_cache, which could look like this:
class PickEm < ActiveRecord::Base
  belongs_to :user

  after_save    :increment_finished_counter_cache, if: Proc.new { |pick_em| pick_em.correct }
  after_destroy :decrement_finished_counter_cache, if: Proc.new { |pick_em| pick_em.correct }

  private

  def increment_finished_counter_cache
    # update_column does not trigger any validations or callbacks
    self.user.update_column(:finished_games_counter, self.user.finished_games_counter + 1)
  end

  def decrement_finished_counter_cache
    # update_column does not trigger any validations or callbacks
    self.user.update_column(:finished_games_counter, self.user.finished_games_counter - 1)
  end
end
Notes:
Code not tested (only to show the idea)
Some people say it's better to avoid naming custom counters the way Rails names its own (foo_counter_cache)
You should benchmark it, but my hunch is that adding all of that data into a single SELECT isn't going to be much faster than breaking it up into separate SELECTs (I've actually had cases where the latter was faster). By breaking it up, you can also stick to more ActiveRecord and less raw SQL, e.g.:
user_ids_to_pick_em_score = User.joins(:pick_ems).where(pick_ems: {correct: true}).group(:user_id).count
user_ids_to_pick_ems_count = User.joins(:pick_ems).where.not(pick_ems: {correct: nil}).group(:user_id).count
user_ids_to_guesses_score = Hash[User.select("users.id, SUM(guesses.score) AS total_score").joins(:guesses).group(:user_id).map{|u| [u.id, u.total_score]}]
user_ids_to_guesses_count = User.joins(:guesses).where.not(guesses: {score: nil}).group(:user_id).count
Edit: To display them, you could do like so:
<%- User.select(:id, :name).find_each do |u| -%>
Name: <%= u.name %>
Picks Correct: <%= user_ids_to_pick_em_score[u.id] %>/<%= user_ids_to_pick_ems_count[u.id] %>
Total Score: <%= user_ids_to_guesses_score[u.id] %>/<%= user_ids_to_guesses_count[u.id] %>
<%- end -%>

Can I create separate queries for different views?

I'm learning SQLAlchemy and I'm not sure I fully grasp it yet (I'm more used to writing queries by hand, but I like the idea of abstracting the queries and getting objects back). I'm going through the tutorial and trying to apply it to my code, and I ran into this part when defining a model:
def __repr__(self):
    return "<User('%s','%s', '%s')>" % (self.name, self.fullname, self.password)
It's useful because I can just search for a username and get only the info about the user that I want, but is there a way to have multiple views of this kind that I can call? Or am I using it wrong, and should I be writing a specific query to get different data for different views?
Some context for why I'm asking: my site has different templates, and most pages will just need the username and first/last name, but some pages will require things like Twitter or Facebook URLs (also fields in the model).
First of all, __repr__ is not a view; if you have a simple model User with defined columns and you query for a User, all the columns will be loaded from the database, not only those used in __repr__.
Let's take the model Book (from the example referred to later) as a basis:
from sqlalchemy import Binary, Column, Integer, String, Text

class Book(Base):
    __tablename__ = 'books'

    book_id = Column(Integer, primary_key=True)
    title = Column(String(200), nullable=False)
    summary = Column(String(2000))
    excerpt = Column(Text)
    photo = Column(Binary)
The first option for skipping the loading of some columns is to use Deferred Column Loading:
from sqlalchemy.orm import deferred

class Book(Base):
    # ...
    excerpt = deferred(Column(Text))
    photo = deferred(Column(Binary))
In this case when you execute query session.query(Book).get(1), the photo and excerpt columns will not be loaded until accessed from the code, at which point another query against the database will be executed to load the missing data.
But if you know before you query for the Book that you need the column photo immediately, you can still override the deferred behavior with undefer option: query = session.query(Book).options(undefer('photo')).get(1).
Basically, the suggestion here is to defer all the columns you don't always need (in your case: everything except username, password, etc.) and, in each use case (view), override with undefer those you know you need for that particular view. Please also see the group parameter of deferred, so that you can group the attributes by use case (view).
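For example, a sketch of grouped deferral applied to your User model (the column names and the group name 'profile_urls' are just illustrations):
from sqlalchemy.orm import deferred, undefer_group

class User(Base):
    # ...
    # Rarely needed columns, deferred together as one group:
    twitter_url = deferred(Column(String(200)), group='profile_urls')
    facebook_url = deferred(Column(String(200)), group='profile_urls')

# A view that needs the URLs undefers the whole group in one go:
user = session.query(User).options(undefer_group('profile_urls')).get(1)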
Another way would be to query only some columns, but in that case you get a tuple instance instead of the model instance (in your case, User), which is potentially OK for form filling but not so good for model validation: session.query(Book.book_id, Book.title).all()

Django: retrieve distinct QuerySet

I've got the following models in my app. The Addition model is used to govern the many-to-many relationship between the Book model and the Collection model, since I need to include extra fields on the intermediate model.
class Book(models.Model):
    name = models.CharField(max_length=200)
    picture = models.ImageField(upload_to='img', max_length=1000)
    price = models.DecimalField(max_digits=8, decimal_places=2)

class Collection(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=100)
    books = models.ManyToManyField(Book, through='Addition')
    subscribers = models.ManyToManyField(User, related_name='collection_subscriptions', blank=True, null=True)

class Addition(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
    collection = models.ForeignKey(Collection)
    created = models.DateTimeField(auto_now=False, auto_now_add=True)
    updated = models.DateTimeField(auto_now=True, auto_now_add=True)
In my app users can add books to collections that they create (for example fiction, history, etc.). Other users can then follow those collections that they like.
When a user logs into the site, I'd like to display all of the books that have been recently added to the collections that they follow. With each book, I'd also like to display the name of the person who added it, and the name of the collection it's in.
I can get all of the additions as follows...
additions = Addition.objects.filter(collection__subscribers=user).select_related()
... but this results in duplicate books being retrieved and displayed to the user, often side by side.
Is there a way to retrieve a distinct list of books that are in collections the user is following?
I'm using Django 1.3 + MySQL.
Thanks.
UPDATE
I should add that in general I'm not looking for any 'loop through the results and de-duplicate that way' solutions, for a couple of reasons.
There are likely to be tens or even hundreds of thousands of additions (I am also displaying this information on pages that list all new additions added by users), and response time is extremely important.
This solution may become more practical when limiting the initial result set, but it creates problems with pagination, which is also required. Namely, how do you paginate the entire result set while de-duplicating only a small portion of that set? I'm open to any ideas here that may solve this problem.
UPDATE
I should also mention that if the same book gets added by multiple users, I actually don't have a preference for which addition gets used, either the original or the most recent addition would work fine.
How about the following - it's not a pure SQL solution, and it'll cost you an extra database query and some loop time, but it should still perform ok, and it'll give you a lot more control over which additions take precedence over others:
def filter_additions(additions):
    # Use a ValuesQuerySet for performance
    additions_values = additions.values()
    # The following code just eliminates duplicates. You could do
    # something much more powerful/interesting here if you like,
    # e.g. give preference to additions by a user's friends.
    book_pk_registry = {}
    excluded_addition_pks = []
    for addition in additions_values:
        addition_pk = addition['id']
        book_pk = addition['book_id']
        if book_pk not in book_pk_registry:
            book_pk_registry[book_pk] = True
        else:
            excluded_addition_pks.append(addition_pk)
    return additions.exclude(pk__in=excluded_addition_pks)

additions = Addition.objects.filter(collection__subscribers=user)
additions = filter_additions(additions)
If there are likely to be more than a thousand or so books involved, you may want to put a limit on the initial additions query. Passing massive lists of ids over in the exclude isn't such a great idea. Using 'values()' is quite important, because Python can cycle through a basic list of dicts a LOT faster than a queryset and it uses a lot less memory.
Assuming there won't be huge numbers of additions to display, this could easily do the trick:
# duplicated..
additions = Addition.objects.filter(collection__subscribers=user,
                                    created__gt=DATE_LAST_LOGIN).select_related()

# remove duplication
added_books = {}
for addition in additions:
    added_books[addition.book] = True
added_books = added_books.keys()
Going by the description you gave of the problem, performance should not be an issue.
from django.db.models import Min

additions = (Addition.objects
             .filter(collection__subscribers=user)
             .values('book')
             # annotation names must not clash with model field names
             .annotate(adder=Min('user'), in_collection=Min('collection'))
             .order_by())
This query will give you a list of unique books with their users and collections. Books, collections and users will be pks, not objects. But I hope you will store them in a cache, so that won't be a problem.
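If you do need the actual objects somewhere, in_bulk can resolve each set of pks with one extra query, for example:
# One extra query maps the book pks from the values() rows to Book instances.
books = Book.objects.in_bulk([row['book'] for row in additions])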
But for your workload I'd think about denormalization. My query is very heavy, and it isn't easy to cache its results if you have frequent additions. My first approach would be to add a latest_additions field to the Collection model and update it with signals (not adding duplicates). The format of this field is up to you.
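A rough sketch of that signal-based denormalization, assuming latest_additions is a text field holding comma-separated book pks (the field name, format, and size cap are all illustrative):
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Addition)
def track_latest_addition(sender, instance, created, **kwargs):
    if not created:
        return
    collection = instance.collection
    # latest_additions is a hypothetical TextField of comma-separated book pks
    pks = collection.latest_additions.split(',') if collection.latest_additions else []
    if str(instance.book_id) not in pks:
        pks.insert(0, str(instance.book_id))
        collection.latest_additions = ','.join(pks[:50])  # keep the list bounded
        collection.save()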
Sometimes it's OK to drop into SQL, especially when the ORM-only solution is not performant. It's easy to get the non-duplicate Addition row IDs in SQL, and then you can switch back to the ORM to select the data. It's two queries, but will outperform any of the single query solutions I've seen so far.
from operator import itemgetter

from django.db import connection

cursor = connection.cursor()

# Pick one addition id per book; MAX(id) keeps the result deterministic
# (per the update above, it doesn't matter which addition wins)
query = '''SELECT MAX(id) FROM %s GROUP BY book_id''' % Addition._meta.db_table
cursor.execute(query)

# Flatten the results to an id list
addition_ids = map(itemgetter(0), cursor.fetchall())

additions = Addition.objects.filter(
    collection__subscribers=user, id__in=addition_ids).select_related()

Help me optimize an ActiveRecord object with too many attributes

I'm working on an app which ties to a legacy database. The primary model is based on a stupidly large 100+ column table. I don't know too much about the inner workings of ActiveRecord, but it seems to me that any request on this model slows down because it's creating objects with 100+ attributes. Let's call this SlowModel.
Rendering pages with this model sometimes take 17 seconds on my dev computer. Straight up mysql queries only take ~ 0.5 - 1 second.
I've managed to speed up one portion of the app by using a MySQL view that selects a subset of fields (20 or so). We'll call this QuickModel. Using views is OK but isn't the most portable solution.
I will likely continue to try and add this QuickModel into other parts of the site but I was wondering if anyone had other ideas in speeding up the original object. For instance, is there a way to specify in the model what columns activerecord should just ignore and avoid building? Maybe there are specific column types (:text??) that cause bloat in ActiveRecord objects.
Assume that columns have proper indices.
You can specify which columns are returned in the model lookup using the :select option of the ActiveRecord lookup:
SlowModel.all(:select => 'id, col1, col2, col3')
...will load instances of SlowModel with only the specified columns populated.
How about having a completely new QuickModel that sits on its own table... and a QuickModel has_one SlowModel?
You can use SQL to move the most-necessary data into the QuickModel table and only refer to the SlowModel using my_quick_model.slow_model when necessary.
Alternatively, you can add a "select" to the default scope (you can google "rails default scope" for more). By default it'll only fetch the reduced set - but you can ask for all attributes by passing :select => "*" if necessary.
Along the lines of what Winfield is saying, you may want to take a look at using an attribute tracker like SlimScrooge. The tracker attempts to fetch only the data that you're actually using, which reduces overhead; in effect, it automates what Winfield is suggesting.
Example from the Readme:
# 1st request, sql is unchanged but columns accesses are recorded
Brochure Load SlimScrooged 1st time (27.1ms) SELECT * FROM `brochures` WHERE (expires_at IS NULL)
# 2nd request, only fetch columns that were used the first time
Brochure Load SlimScrooged (4.5ms) SELECT `brochures`.expires_at,`brochures`.operator_id,`brochures`.id FROM `brochures` WHERE (expires_at IS NULL)
# 2nd request, later in code we need another column which causes a reload of all remaining columns
Brochure Reload SlimScrooged (0.6ms) `brochures`.name,`brochures`.comment,`brochures`.image_height,`brochures`.id, `brochures`.tel,`brochures`.long_comment,`brochures`.image_name,`brochures`.image_width FROM `brochures` WHERE `brochures`.id IN ('5646','5476','4562','3456','4567','7355')
# 3rd request
Brochure Load SlimScrooged (4.5ms) SELECT `brochures`.expires_at,`brochures`.operator_id,`brochures`.name, `brochures`.id FROM `brochures` WHERE (expires_at IS NULL)