Django: retrieve distinct QuerySet - mysql

I've got the following models in my app. The Addition model is used to govern the many-to-many relationship between the Book model and the Collection model, since I need to include extra fields on the intermediate model.
class Book(models.Model):
name = models.CharField(max_length=200)
picture = models.ImageField(upload_to='img', max_length=1000)
price = models.DecimalField(max_digits=8, decimal_places=2)
class Collection(models.Model):
user = models.ForeignKey(User)
name = models.CharField(max_length=100)
books = models.ManyToManyField(Book, through='Addition')
subscribers = models.ManyToManyField(User, related_name='collection_subscriptions', blank=True, null=True)
class Addition(models.Model):
user = models.ForeignKey(User)
book = models.ForeignKey(Book)
collection = models.ForeignKey(Collection)
created = models.DateTimeField(auto_now=False, auto_now_add=True)
updated = models.DateTimeField(auto_now=True, auto_now_add=True)
In my app users can add books to collections that they create (for example fiction, history, etc.). Other users can then follow those collections that they like.
When a user logs into the site, I'd like to display all of the books that have been recently added to the collections that they follow. With each book, I'd also like to display the name of the person who added it, and the name of the collection it's in.
I can get all of the additions as follows...
additions = Addition.objects.filter(collection__subscribers=user).select_related()
... but this results in duplicate books being retrieved and displayed to the user, often side by side.
If there a way to retrieve a distinct list of books that are in collections the user is following?
I'm using Django 1.3 + MySQL.
Thanks.
UPDATE
I should add that in general I'm not looking for any 'loop through the results and de-duplicate that way' solutions, for a couple of reasons.
There are likely to be tens or even hundreds of thousands of additions (I am also displaying this information on pages that list all new additions added by users), and response time is extremely important.
This solution may become more practical when limiting the initial result set, but it creates problems with pagination, which is also required. Namely how do you paginate the entire result set while also de-duplicating only a small portion of that set. I'm open to any ideas here that may solve this problem.
UPDATE
I should also mention that if the same book gets added by multiple users, I actually don't have a preference for which addition gets used, either the original or the most recent addition would work fine.

How about the following - it's not a pure SQL solution, and it'll cost you an extra database query and some loop time, but it should still perform ok, and it'll give you a lot more control over which additions take precedence over others:
def filter_additions(additions):
# Use a ValuesQuerySet for performance
additions_values = additions.values()
# The following code just eliminates duplicates. You could do
# something much more powerful/interesting here if you like,
# e.g. give preference to additions by a user`s friends
book_pk_registry = {}
excluded_addition_pks = []
for addition in additions_values:
addition_pk = addition['id']
book_pk = addition['book_id']
if book_pk not in book_pk_registry:
book_pk_registry[book_pk] = True
else:
excluded_addition_pks.append(addition_pk)
additions = additions.exclude(pk__in=excluded_addition_pks)
additions = Addition.objects.filter(collection__subscribers=user)
additions = filter_additions(additions)
If there are likely to be more than a thousand or so books involved, you may want to put a limit on the initial additions query. Passing massive lists of ids over in the exclude isn't such a great idea. Using 'values()' is quite important, because Python can cycle through a basic list of dicts a LOT faster than a queryset and it uses a lot less memory.

Assuming there won`t be huge amounts of additions to display, this could easily to the trick:
# duplicated..
additions = Addition.objects.filter(collection__subscribers=user, created__gt=DATE_LAST_LOGIN).select_related()
# remove duplication
added_books = {}
for addition in additions:
added_books[addition.book] = True
added_books = added_books.keys()
By the description you gave of the problem, performance would not be a problem.

additions = Addition.objects.filter(collection__subscribers=user).values('book').annotate(user=Min('user'), collection=Min('collection')).order_by()
This query will give you list of unique books with their users and collections. Books, collections, users will be pk's, not objects. But I hope you will store them in cache so that won't be a problem.
But for your workload I'd think about denormalization. My query is very heavy, and it isn't easy to cache its results if you will have frequent additions. My first approach will be to add latest_additions field to Collection model and to update with signals (not adding duplicates). The format of this field is up to you.

Sometimes it's OK to drop into SQL, especially when the ORM-only solution is not performant. It's easy to get the non-duplicate Addition row IDs in SQL, and then you can switch back to the ORM to select the data. It's two queries, but will outperform any of the single query solutions I've seen so far.
from django.db import connection
from operator import itemgetter
cursor = connection.cursor()
# Select non-duplicate book additions, preferring for most recently updated
query = '''SELECT id, MAX(updated) FROM %s
GROUP BY book_id''' % Addition._meta.db_table
cursor.execute(query)
# Flatten the results to an id list
addition_ids = map(itemgetter(0), cursor.fetchall())
additions = Addition.objects.filter(
collection__subscribers=user, id__in=addition_ids).select_related()

Related

Django GenereicForeignKey v/s custom manual fields performance/optimization

I'm trying to build a typical social networking site. there are two types of objects mainly.
photo
status
a user can like photo and status. (Note that these two are mutually exclusive)
means, We have two table (1) for Image only and other for status only.
now when a user likes an object(it could be a photo or status) how should I store that info.
I want to design a efficient SQL schema for this.
Currently I'm using Genericforeignkey(GFK)
class LikedObject(models.Model):
content_type = models.ForeignKey(ContentType)
object_id = models.PositiveIntegerField()
content_object = GenericForeignKey('content_type', 'object_id')
but yesterday I thought if I can do this without using GFK efficiently?
class LikedObject(models.Model):
OBJECT_TYPE = (
('status', 'Status'),
('image', 'Image/Photo'),
)
user = models.ForeignKey(User, related_name="liked_objects")
obj_id = models.PositiveIntegerField()
obj_type = models.CharField(max_length=63, choices=OBJECT_TYPE)
the only difference I can understand is that I have to make two queries if I want to get all liked_status of a particular user
status_ids = LikedObject.objects.filter(user=user_obj, obj_type='status').values_list('object_id', flat=True)
status_objs = Status.objects.filter(id__in=status_ids)
Am I correct? so What would be the best approach in terms of easy querying/inserting or performance, etc.
You are basically implementing your own Generic Object, only you limit your ContentType to your hard coded OBJECT_TYPE.
If you are only going to access the database as in your example (get all status objects liked by user x), or a couple specific queries, then your own implementation can be a little faster, of course. But obviously, if later you have to add more objects, or do other things, you may find yourself implementing your whole full generic solution. And like they say, why reinvent the wheel.
If you want better performance, and really only have those two Models to worry about, you may just want to have two different Like tables (StatusLike and ImageLike) and use inheritance to share functionality.
class LikedObject(models.Model):
common_field = ...
class Meta:
abstract = True
def some_share_function():
...
class StatusLikeObject(LikedObject):
user = models.ForeignKey(User, related_name="status_liked_objects")
status = models.ForeignKey(Status, related_name="liked_objects")
class ImageLikeObject(LikedObject):
user = models.ForeignKey(User, related_name="image_liked_objects")
image = models.ForeignKey(Image, related_name="liked_objects")
Basically, either you have a lot of Models to worry about, and then you probably want to use the more Django generic object implementation, or you only have two models, and why even bother with a half generic solution. Just use two tables.
In this case, I would check if your data objects Status and Photo may have many common data fields, e.g. Status.user and Photo.user, Status.title and Photo.title, Status.pub_date and Photo.pub_date, Status.text and Photo.caption, etc.
Could you combine them into an Item object maybe? That Item would have a Item.type field, either "photo" or "status"? Then you would only have a single table and a single object type a user can "like". Much simpler at basically no cost.
Edit:
from django.db import models
from django.utils.timezone import now
class Item(models.Model):
data_type = models.SmallIntegerField(
choices=((1, 'Status'), (2, 'Photo')), default=1)
user = models.ForeignKey(User)
title = models.CharField(max_length=100)
pub_date = models.DateTimeField(default=now)
...etc...
class Like(models.Model):
user = models.ForeignKey(User, related_name="liked_objects")
item = models.ForeignKey(Item)

Speeding up a Django query for "latest" ForeignKey related object

With Django, I have two related models. Call the first one BaseObject. The second one is called BaseObjectObservation, where every 6 hours or so I create a new BaseObjectObservation that's linked via ForeignKey to a BaseObject and has another field for a particular data point about that object at that time, along with a timestamp.
As you might expect, one thing I'm always interested in is the "latest" BaseObjectObservation for a given BaseObject. The trouble is that there are now lots of observations for each BaseObject, and even with ~500 BaseObjects, loading a page with all BaseObjects with each one's latest observation gets very slow.
Any recommendations on how to speed up the retrieval of the latest observation?
Bonus question: I'm also interested in how each object's observation has changed over the last 24 hours. Previously I tried querying for the latest observation and the observation closest to 24 hours ago and calculating the difference; this was too slow as well. Any recommendations here?
You could do something like:
class BaseObject(models.Model):
pass
class BaseObjectObservation(models.Model):
base_object = models.ForeignKey(BaseObject, related_name="observations")
last_modification = models.DateTimeField(auto_now=True)
latest = models.BooleanField(default=False)
def save(self, **kwargs)
if not self.pk:
# mark new instance as latest
self.latest = True
# Update previous observations
self.base_object.observations.update(latest=False)
super().save(**kwargs)
Then, if you want to get latest observations with their base object, you can do :
BaseObjectObservation.objects.filter(latest=True).select_related('base_object')
The select_related clause will save you 500 queries, because it will fetch the base object, along the observation.
Since you do everything in a single query, performance should be better. However, some cleanest solutions may exist without needing to store a boolean on each instance.
Bonus
For your bonus question, you can probably get some inspiration here:
import datetime
from django.utils import timezone
24_hours_ago = timezone.now() - datetime.timedelta(hours=24)
current_observation = base_object.observations.get(latest=True)
closest_observation_greater = base_object.observations.filter(creation_date__gt=24_hours_ago).first()
closest_observation_lower = base_object.observations.filter(creation_date__lte=24_hours_ago).first()
if closest_observation_greater - target > target - closest_observation_lower:
return closest_observation_lower
else:
return closest_observation_greater
However, that's still two query for each observation. You can probably optimize it, but you can also reduce
the number of element you display on each page. Do you really need to display 500 elements on the same page ?

Can I create sperate queries for different views?

I'm learning sqlalchemy and not sure if I grasp it fully yet(I'm more used to writing queries by hand but I like the idea of abstracting the queries and getting objects). I'm going through the tutorial and trying to apply it to my code and ran into this part when defining a model:
def __repr__(self):
return "<User('%s','%s', '%s')>" % (self.name, self.fullname, self.password)
Its useful because I can just search for a username and get only the info about the user that I want but is there a way to either have multiple of these type of views that I can call? or am I using it wrong and should be writing a specific query for getting different data for different views?
Some context to why I'm asking my site has different templates, and most pages will just need the usersname, first/last name but some pages will require things like twitter or Facebook urls(also fields in the model).
First of all, __repr__ is not a view, so if you have a simple model User with defined columns, and you query for a User, all the columns will get loaded from the database, and not only those used in __repr__.
Lets take model Book (from the example refered to later) as a basis:
class Book(Base):
book_id = Column(Integer, primary_key=True)
title = Column(String(200), nullable=False)
summary = Column(String(2000))
excerpt = Column(Text)
photo = Column(Binary)
The first option to skip loading some columns is to use Deferred Column Loading:
class Book(Base):
# ...
excerpt = deferred(Column(Text))
photo = deferred(Column(Binary))
In this case when you execute query session.query(Book).get(1), the photo and excerpt columns will not be loaded until accessed from the code, at which point another query against the database will be executed to load the missing data.
But if you know before you query for the Book that you need the column photo immediately, you can still override the deferred behavior with undefer option: query = session.query(Book).options(undefer('photo')).get(1).
Basically, the suggestion here is to defer all the columns (in your case: except username, password etc) and in each use case (view) override with undefer those you know you need for that particular view. Please also see the group parameter of deferred, so that you can group the attributes by use case (view).
Another way would be to query only some columns, but in this case you are getting the tuple instance instead of the model instance (in your case User), so it is potentially OK for form filling, but not so good for model validation: session.query(Book.id, Book.title).all()

sqlalchemy relations and query on relations

Suppose I have 3 tables in sqlalchemy. Users, Roles and UserRoles defined in declarative way. How would one suggest on doing something like this:
user = Users.query.get(1) # get user with id = 1
user_roles = user.roles.query.limit(10).all()
Currently, if I want to get the user roles I have to query any of the 3 tables and perform the joins in order to get the expected results. Calling directly user.roles brings a list of items that I cannot filter or limit so it's not very helpful. Joining things is not very helpful either since I'm trying to make a rest interface with requests such as:
localhost/users/1/roles so just by that query I need to be able to do Users.query.get(1).roles.limit(10) etc etc which really should 'smart up' my rest interface without too much bloated code and if conditionals and without having to know which table to join on what. Users model already has the roles as a relationship property so why can't I simply query on a relationship property like I do with normal models?
Simply use Dynamic Relationship Loaders. Code below verbatim from the documentation linked to above:
class User(Base):
__tablename__ = 'user'
posts = relationship(Post, lazy="dynamic")
jack = session.query(User).get(id)
# filter Jack's blog posts
posts = jack.posts.filter(Post.headline=='this is a post')
# apply array slices
posts = jack.posts[5:20]

Django Combining AND and OR Queries with ManyToMany Field

hoping someone can help me out with this.
I'm trying to figure out whether I can construct a query that will allow me to retrieve items from my db based on a ForeignKey field and a ManyToManyField at the same time. The challenging part is that it will need to filter on multiple ManyToMany objects.
An example will hopefully make this clearer. Here are my (simplified) models:
class Item(models.Model):
name = models.CharField(max_length=200)
brand = models.ForeignKey(User, related_name='brand')
tags = models.ManyToManyField(Tag, blank=True, null=True)
def __unicode__(self):
return self.name
class Meta:
ordering = ['-id']
class Tag(models.Model):
name = models.CharField(max_length=64, unique=True)
def __unicode__(self):
return self.name
I would like to build a query that retrieves items based on two criteria:
Items that were uploaded by users that a user is following (called 'brand' in the model). So for example if a user is following the Paramount user account, I would want all items where brand = Paramount.
Items that match the keywords in saved searches. For example the user could make and save the following search: "80s comedy". In this case I would want all items where the tags include both "80s" and "comedy".
Now I know how to construct the query for each independently. For #1, it's:
items = Item.objects.filter(brand=brand)
And for #2 (based on the docs):
items = Item.objects.filter(tags__name='80s').filter(tags__name='comedy')
My question is: is it possible to construct this as a single query so that I don't have to take the hit of converting each query into a list, joining them, and removing duplicates?
The challenge seems to be that there is no way to use Q objects to construct queries where you need an item's manytomany field (in this case tags) to match multiple values. The following query:
items = Item.objects.filter(Q(tags__name='80s') & Q(tags__name='comedy'))
does NOT work.
(See: https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships)
Thanks in advance for your help on this!
After much research I could not find a way to combine this into a single query, so I ended up converting my QuerySets into lists and combining them.
Django's filters automatically AND. Q objects are only needed if you're trying to add ORs. Also, the __in query filter will help you out alot.
Assuming users have multiple brands they like and you want to return any brand they like, you should use:
`brand__in=brands`
Where brands is the queryset returned by something like someuser.brands.all().
The same can be used for your search parameters:
`tags__name__in=['80s','comedy']`
That will return things tagged with either '80s' or 'comedy'. If you need both (things tagged with both '80s' AND 'comedy'), you'll have to pass each one in a successive filter:
keywords = ['80s','comedy']
for keyword in keywords:
qs = qs.filter(tags__name=keyword)
P.S. related_name values should always specify the opposite relationship. You're going to have logic problems with the way you're doing it currently. For example:
brand = models.ForeignKey(User, related_name='brand')
Means that somebrand.brand.all() will actually return Item objects. It should be:
brand = models.ForeignKey(User, related_name='items')
Then, you can get a brands's items with somebrand.items.all(). Makes much more sense.