What is the best practice for lookup values in SQLAlchemy? - sqlalchemy

I am writing a pretty basic Flask application using Flask-SQLAlchemy for tracking inventory and distribution. I could use some guidance on the best way to handle a lookup table for common values. My database back end will be MySQL, with ElasticSearch for searches.
If I have a common mapping structure where all data going into a specific table, say Vehicle, have a common list of values to look up against for the Vehicle.make column, what would the best way to achieve this be?
My thought for approaching this is one of two ways:
Lookup Table
I could set something up similar to this where I have a relationship, and store the make in VehicleMake. However, if my expected list of makes is low (say 10), this seems unnecessary.
class VehicleMake(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(16))
    cars = relationship('Vehicle', backref='make', lazy='dynamic')
class Vehicle(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
Store as a String
I could just store this as a string on the Vehicle model. But would it be a waste of space to store a common value as a string?
class Vehicle(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
    make = Column(String(16))
My original idea was just to have a dict containing a mapping like this and reference it as needed within the model. I am just not clear how to tie this in when returning the vehicle model.
MAKE_LIST = {
    1: 'Ford',
    2: 'Dodge',
    3: 'Chevrolet'
}
Any feedback is welcome - and if there is documentation that covers this specific scenario I'm happy to read that and answer this question myself. My expected volume is going to be low (40-80 records per week) so it doesn't need to be ridiculously fast, I just want to follow best practices.

The short answer is it depends.
The long answer is that it depends on what you store along with the make of said vehicles and how often you expect to add new types.
If you need to store more than just the name of each make, but also some additional metadata, like the size of the gas tank, the cargo space, or even a sortkey, go for an additional table. The overhead of such a small table is minimal, and if you communicate with the frontend using make ids instead of make names, there is no problem at all with this. Just remember to add an index to vehicle.make_id to make the lookups efficient.
class VehicleMake(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(16))
    cars = relationship('Vehicle', back_populates='make', lazy='dynamic')
class Vehicle(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
    make_id = Column(Integer, ForeignKey('vehicle_make.id'), nullable=False)
    # Both sides must name each other when using back_populates
    make = relationship('VehicleMake', back_populates='cars', innerjoin=True)
Vehicle.query.get(1).make.name # == 'Ford', the make for vehicle 1
Vehicle.query.filter(Vehicle.make_id == 2).all() # all Vehicles with make id 2
Vehicle.query.join(VehicleMake)\
    .filter(VehicleMake.name == 'Ford').all() # all Vehicles with make name 'Ford'
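To see what the lookup-table variant amounts to at the SQL level, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The table and index names are assumptions chosen to mirror the models above, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vehicle_make (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(16) NOT NULL UNIQUE
    );
    CREATE TABLE vehicle (
        id      INTEGER PRIMARY KEY,
        name    VARCHAR(32),
        make_id INTEGER NOT NULL REFERENCES vehicle_make (id)
    );
    -- The index on the foreign key keeps lookups by make cheap.
    CREATE INDEX ix_vehicle_make_id ON vehicle (make_id);
""")
conn.executemany("INSERT INTO vehicle_make (id, name) VALUES (?, ?)",
                 [(1, "Ford"), (2, "Dodge"), (3, "Chevrolet")])
conn.executemany("INSERT INTO vehicle (name, make_id) VALUES (?, ?)",
                 [("F-150", 1), ("Ram", 2), ("Mustang", 1)])

# All vehicles whose make is named 'Ford', resolved through the join.
fords = [row[0] for row in conn.execute(
    "SELECT v.name FROM vehicle v "
    "JOIN vehicle_make m ON m.id = v.make_id "
    "WHERE m.name = ? ORDER BY v.id", ("Ford",))]
print(fords)
```

The frontend would only ever pass the small integer ids around; the name is resolved by the join when it is needed.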
If you don't need to store any of that metadata, then the need for a separate table disappears. However, the general problem with strings is that there is a high risk of spelling errors and capital/lowercase letters screwing up your data consistency. If you don't need to add new makes often, it's a lot better to just use Enums; there are even MySQL-specific ones in SQLAlchemy.
import enum
class VehicleMake(enum.Enum):
    FORD = 1
    DODGE = 2
    CHEVROLET = 3
class Vehicle(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
    make = Column(Enum(VehicleMake), nullable=False)
Vehicle.query.get(1).make.name # == 'FORD', the make for vehicle 1
Vehicle.query.filter(Vehicle.make == VehicleMake(2)).all() # all Vehicles with make id 2
Vehicle.query.filter(Vehicle.make == VehicleMake.FORD).all() # all Vehicles with make name 'Ford'
The main drawback of enums is that they can be hard to extend with new values. At least for Postgres, the dialect-specific version was a lot better at this than the generic SQLAlchemy one, so for MySQL have a look at sqlalchemy.dialects.mysql.ENUM instead. If you want to extend your existing enum, you can always just execute raw SQL in your Flask-Migrate/Alembic migrations.
Finally, the benefit of using strings is that you can always programmatically enforce your data consistency. But this comes at the cost that you have to programmatically enforce your data consistency. If the vehicle make can be changed or inserted by external users, even colleagues, this will get you in trouble unless you're very strict about what enters your database. For example, it might be nice to uppercase all values for easy grouping, since it effectively reduces how much can go wrong. You can do this during writing, or you can add an index on sqlalchemy.func.upper(Vehicle.make) and use hybrid properties to always query the uppercase value.
from sqlalchemy import func
from sqlalchemy.ext.hybrid import hybrid_property
class Vehicle(Model):
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
    _make = Column('make', String(16))
    @hybrid_property
    def make(self):
        # Instance level: uppercase the loaded value in Python
        return self._make.upper()
    @make.expression
    def make(cls):
        # Class level: emit UPPER(make) in the generated SQL
        return func.upper(cls._make)
Vehicle.query.get(1).make # == 'FORD', the make for vehicle 1
Vehicle.query.filter(Vehicle.make == 'FORD').all() # all Vehicles with make name 'FORD'
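The "during writing" option mentioned above can be a small normalisation step applied before a value reaches the database. A minimal sketch in plain Python; `ALLOWED_MAKES` and `normalize_make` are hypothetical names for illustration and not part of SQLAlchemy:

```python
# Hypothetical whitelist; in practice this could live in the app config.
ALLOWED_MAKES = {"FORD", "DODGE", "CHEVROLET"}


def normalize_make(raw):
    """Return the canonical uppercase make, or raise for unknown values."""
    make = raw.strip().upper()
    if make not in ALLOWED_MAKES:
        raise ValueError(f"unknown vehicle make: {raw!r}")
    return make


print(normalize_make("  ford "))
```

In a SQLAlchemy model, a function like this could be called from a method decorated with `sqlalchemy.orm.validates('make')`, so every assignment passes through it.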
Before you make your choice, also think about how you want to present this to your user. If they should be able to add new options themselves, use strings or the separate table. If you want to show a dropdown of possibilities, use the enum or the table. If you have an empty database, it's going to be difficult to collect all string values to display in the frontend without needing to store this as a list somewhere in your Flask environment as well.
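For the dropdown case, the enum itself can serve as the single source of options, so nothing has to be collected from the database. A small standard-library sketch; `make_choices` is an invented helper name, and the (value, label) pair format is an assumption about what the form layer expects:

```python
import enum


class VehicleMake(enum.Enum):
    FORD = 1
    DODGE = 2
    CHEVROLET = 3


def make_choices():
    """Build (value, label) pairs suitable for a select field."""
    return [(member.value, member.name.title()) for member in VehicleMake]


print(make_choices())
```

This works even with an empty database, which is exactly the situation where collecting distinct string values would return nothing.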

Related

Django Serialize to Json from Super class

I'm trying to figure out if there is any efficient way to serialize a queryset from superclass. My models:
class CampaignContact(models.Model):
    campaign = models.ForeignKey(Campaign, related_name="campaign_contacts", null=False)
    schedule_day = models.DateField(null=True)
    schedule_time = models.TimeField(null=True)
    completed = models.BooleanField(default=False, null=False)
class CampaignContactCompany(CampaignContact):
    company = models.ForeignKey(Company, related_name='company_contacts', null=False)
class CampaignContactLead(CampaignContact):
    lead = models.ForeignKey(Lead, related_name='lead_contacts', null=False)
I want to create a JSON document with all campaign contacts, whether they belong to leads or to companies.
Django has a built in serializer documented here but it might not work as well considering how you structured your models:
from django.core import serializers
data = serializers.serialize("json", CampaignContactCompany.objects.all())
I imagine you could run that on both tables and combine the two sets but it would introduce a bit of overhead. You could also create a static to_json method in CampaignContact which takes two query sets from the other two tables and formats/combines them into json.
Maybe you have reason to model your tables as you did but based on observation it looks like you will have 3 tables, one never used and two with only a company and lead field different which is probably not ideal. Typically when relating a record to multiple objects you would simply put the lead and company field on the CampaignContact table and let them be null. To get only company contacts you could query company_contacts = CampaignContact.objects.filter(company__isnull=False).all()
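The "run it on both tables and combine the two sets" idea can be sketched with the standard json module. Here plain dicts stand in for the rows that a call such as `.values()` would return from each child model; the sample values and the `kind` discriminator key are made up for illustration.

```python
import json

# Stand-ins for rows from CampaignContactCompany and CampaignContactLead.
company_rows = [{"id": 1, "completed": False, "company_id": 7}]
lead_rows = [{"id": 2, "completed": True, "lead_id": 3}]


def contacts_to_json(companies, leads):
    """Merge both row sets into one JSON array, tagging each row's origin."""
    combined = [dict(row, kind="company") for row in companies]
    combined += [dict(row, kind="lead") for row in leads]
    return json.dumps(combined, sort_keys=True)


payload = contacts_to_json(company_rows, lead_rows)
print(payload)
```

Date and time fields such as schedule_day are not handled by plain `json.dumps`; they would need something like `default=str` or Django's `DjangoJSONEncoder`.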

Django GenericForeignKey vs. custom manual fields: performance/optimization

I'm trying to build a typical social networking site. There are mainly two types of objects:
photo
status
A user can like a photo or a status. (Note that these two are mutually exclusive.)
That is, we have two tables: one for images only and the other for statuses only.
Now, when a user likes an object (it could be a photo or a status), how should I store that information?
I want to design an efficient SQL schema for this.
Currently I'm using GenericForeignKey (GFK):
class LikedObject(models.Model):
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey('content_type', 'object_id')
But yesterday I wondered whether I could do this efficiently without using a GFK:
class LikedObject(models.Model):
    OBJECT_TYPE = (
        ('status', 'Status'),
        ('image', 'Image/Photo'),
    )
    user = models.ForeignKey(User, related_name="liked_objects")
    obj_id = models.PositiveIntegerField()
    obj_type = models.CharField(max_length=63, choices=OBJECT_TYPE)
The only difference I can see is that I have to make two queries if I want to get all statuses liked by a particular user:
status_ids = LikedObject.objects.filter(user=user_obj, obj_type='status').values_list('obj_id', flat=True)
status_objs = Status.objects.filter(id__in=status_ids)
Am I correct? What would be the best approach in terms of easy querying/inserting, performance, etc.?
You are basically implementing your own Generic Object, only you limit your ContentType to your hard coded OBJECT_TYPE.
If you are only going to access the database as in your example (get all status objects liked by user x), or a couple specific queries, then your own implementation can be a little faster, of course. But obviously, if later you have to add more objects, or do other things, you may find yourself implementing your whole full generic solution. And like they say, why reinvent the wheel.
If you want better performance, and really only have those two Models to worry about, you may just want to have two different Like tables (StatusLike and ImageLike) and use inheritance to share functionality.
class LikedObject(models.Model):
    common_field = ...
    class Meta:
        abstract = True
    def some_share_function(self):
        ...
class StatusLikeObject(LikedObject):
    user = models.ForeignKey(User, related_name="status_liked_objects")
    status = models.ForeignKey(Status, related_name="liked_objects")
class ImageLikeObject(LikedObject):
    user = models.ForeignKey(User, related_name="image_liked_objects")
    image = models.ForeignKey(Image, related_name="liked_objects")
Basically, either you have a lot of Models to worry about, and then you probably want to use the more Django generic object implementation, or you only have two models, and why even bother with a half generic solution. Just use two tables.
In this case, I would check if your data objects Status and Photo may have many common data fields, e.g. Status.user and Photo.user, Status.title and Photo.title, Status.pub_date and Photo.pub_date, Status.text and Photo.caption, etc.
Could you combine them into an Item object maybe? That Item would have a Item.type field, either "photo" or "status"? Then you would only have a single table and a single object type a user can "like". Much simpler at basically no cost.
Edit:
from django.db import models
from django.utils.timezone import now
class Item(models.Model):
    data_type = models.SmallIntegerField(
        choices=((1, 'Status'), (2, 'Photo')), default=1)
    user = models.ForeignKey(User)
    title = models.CharField(max_length=100)
    pub_date = models.DateTimeField(default=now)
    # ...etc...
class Like(models.Model):
    user = models.ForeignKey(User, related_name="liked_objects")
    item = models.ForeignKey(Item)
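With a single Item table, "all statuses liked by user X" becomes one join instead of two queries. A rough sketch of the underlying SQL using Python's sqlite3 module; the table names `item` and `item_like`, the uniqueness constraint, and the sample rows are assumptions for illustration rather than what Django would generate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id        INTEGER PRIMARY KEY,
        data_type INTEGER NOT NULL,   -- 1 = status, 2 = photo
        title     TEXT NOT NULL
    );
    CREATE TABLE item_like (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL REFERENCES item (id),
        UNIQUE (user_id, item_id)     -- a user likes an item at most once
    );
""")
conn.executemany("INSERT INTO item (id, data_type, title) VALUES (?, ?, ?)",
                 [(1, 1, "Hello world"), (2, 2, "Sunset"), (3, 1, "Lunch")])
conn.executemany("INSERT INTO item_like (user_id, item_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3)])

# Statuses (data_type = 1) liked by user 1, fetched with a single join.
liked_statuses = [row[0] for row in conn.execute(
    "SELECT i.title FROM item i "
    "JOIN item_like l ON l.item_id = i.id "
    "WHERE l.user_id = ? AND i.data_type = ? ORDER BY i.id", (1, 1))]
print(liked_statuses)
```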

SQLAlchemy db.session.query() vs model.query

For a simple return all results query should one method be preferred over the other? I can find uses of both online but can't really find anything describing the differences.
db.session.query([my model name]).all()
[my model name].query.all()
I feel that [my model name].query.all() is more descriptive.
It is hard to give a clear answer, as there is a high degree of preference subjectivity in answering this question.
From one perspective, the db.session is desired, because the second approach requires it to be incorporated in your model as an added step - it is not there by default as part of the Base class. For instance:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker
Base = declarative_base()
DBSession = scoped_session(sessionmaker())
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)
session = DBSession()
print(User.query)
That code fails with the following error:
AttributeError: type object 'User' has no attribute 'query'
You need to do something like this:
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)
    query = DBSession.query_property()
However, it could also be argued that just because it is not enabled by default, that doesn't invalidate it as a reasonable way to launch queries. Furthermore, in the flask-sqlalchemy package (which simplifies sqlalchemy integration into the flask web framework) this is already done for you as part of the Model class (doc). Adding the query property to a model can also be seen in the sqlalchemy tutorial (doc):
class User(object):
    query = db_session.query_property()
    # ....
Thus, people could argue either approach.
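What the query property provides is essentially a class-level attribute that hands back a query already bound to the model it is accessed on. The following toy illustration in plain Python shows that idea only; it is not SQLAlchemy's implementation, and `FakeSession`, `FakeQuery` and `QueryProperty` are invented names.

```python
class FakeQuery:
    """Stands in for a query object; it only remembers its target model."""

    def __init__(self, model):
        self.model = model

    def all(self):
        return list(self.model.rows)


class FakeSession:
    def query(self, model):
        return FakeQuery(model)


class QueryProperty:
    """Descriptor: Model.query becomes shorthand for session.query(Model)."""

    def __init__(self, session):
        self.session = session

    def __get__(self, instance, owner):
        return self.session.query(owner)


session = FakeSession()


class User:
    rows = ["alice", "bob"]          # pretend table contents
    query = QueryProperty(session)


# Both spellings reach the same data through the same session.
print(User.query.all() == session.query(User).all())
```

Seen this way, the two styles differ in spelling rather than in what they execute, which is why the choice comes down to readability.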
I personally have a preference for the second method when I am selecting from a single table. For example:
serv = Service.query.join(Supplier, SupplierUsr).filter(SupplierUsr.username == usr).all()
This is because it is of smaller line length and still easily readable.
If I am selecting from more than one table or specifying columns, then I would use the session.query method, as it is extracting information from more than one model.
deliverables = db.session.query(Deliverable.column1, BatchInstance.column2).\
    join(BatchInstance, Service, Supplier, SupplierUser).\
    filter(SupplierUser.username == str(current_user)).\
    order_by(Deliverable.created_time.desc()).all()
That said, a counter argument could be made in always using the session.query method as it makes the code more consistent, and when reading left to right, the reader immediately knows that the sqlalchemy directive they are going to read will be query, before mentally absorbing what tables and columns are involved.
At the end of the day, the answer to your question is subjective and there is no correct answer, and any code readability benefits either way are tiny. The only thing where I see a strong benefit is not to use model query if you are selecting from many tables and instead use the session.query method.

Django: retrieve distinct QuerySet

I've got the following models in my app. The Addition model is used to govern the many-to-many relationship between the Book model and the Collection model, since I need to include extra fields on the intermediate model.
class Book(models.Model):
    name = models.CharField(max_length=200)
    picture = models.ImageField(upload_to='img', max_length=1000)
    price = models.DecimalField(max_digits=8, decimal_places=2)
class Collection(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=100)
    books = models.ManyToManyField(Book, through='Addition')
    subscribers = models.ManyToManyField(User, related_name='collection_subscriptions', blank=True, null=True)
class Addition(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
    collection = models.ForeignKey(Collection)
    created = models.DateTimeField(auto_now=False, auto_now_add=True)
    updated = models.DateTimeField(auto_now=True, auto_now_add=True)
In my app users can add books to collections that they create (for example fiction, history, etc.). Other users can then follow those collections that they like.
When a user logs into the site, I'd like to display all of the books that have been recently added to the collections that they follow. With each book, I'd also like to display the name of the person who added it, and the name of the collection it's in.
I can get all of the additions as follows...
additions = Addition.objects.filter(collection__subscribers=user).select_related()
... but this results in duplicate books being retrieved and displayed to the user, often side by side.
Is there a way to retrieve a distinct list of books that are in collections the user is following?
I'm using Django 1.3 + MySQL.
Thanks.
UPDATE
I should add that in general I'm not looking for any 'loop through the results and de-duplicate that way' solutions, for a couple of reasons.
There are likely to be tens or even hundreds of thousands of additions (I am also displaying this information on pages that list all new additions added by users), and response time is extremely important.
This solution may become more practical when limiting the initial result set, but it creates problems with pagination, which is also required. Namely how do you paginate the entire result set while also de-duplicating only a small portion of that set. I'm open to any ideas here that may solve this problem.
UPDATE
I should also mention that if the same book gets added by multiple users, I actually don't have a preference for which addition gets used, either the original or the most recent addition would work fine.
How about the following - it's not a pure SQL solution, and it'll cost you an extra database query and some loop time, but it should still perform ok, and it'll give you a lot more control over which additions take precedence over others:
def filter_additions(additions):
    # Use a ValuesQuerySet for performance
    additions_values = additions.values()
    # The following code just eliminates duplicates. You could do
    # something much more powerful/interesting here if you like,
    # e.g. give preference to additions by a user's friends
    book_pk_registry = {}
    excluded_addition_pks = []
    for addition in additions_values:
        addition_pk = addition['id']
        book_pk = addition['book_id']
        if book_pk not in book_pk_registry:
            book_pk_registry[book_pk] = True
        else:
            excluded_addition_pks.append(addition_pk)
    # Return the filtered queryset so the caller can use it
    return additions.exclude(pk__in=excluded_addition_pks)
additions = Addition.objects.filter(collection__subscribers=user)
additions = filter_additions(additions)
If there are likely to be more than a thousand or so books involved, you may want to put a limit on the initial additions query. Passing massive lists of ids over in the exclude isn't such a great idea. Using 'values()' is quite important, because Python can cycle through a basic list of dicts a LOT faster than a queryset and it uses a lot less memory.
Assuming there won't be huge amounts of additions to display, this could easily do the trick:
# duplicated..
additions = Addition.objects.filter(collection__subscribers=user, created__gt=DATE_LAST_LOGIN).select_related()
# remove duplication
added_books = {}
for addition in additions:
    added_books[addition.book] = True
added_books = added_books.keys()
By the description you gave of the problem, performance would not be a problem.
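A variant of the loop above keeps the first addition seen for each book instead of only the book, so the user and collection names stay available for display. This is a sketch in plain Python where dicts stand in for Addition rows; the field names follow the models in the question and the sample data is made up.

```python
def first_addition_per_book(additions):
    """Keep only the first addition encountered for each book id."""
    seen = {}
    for addition in additions:
        # setdefault stores the value only if the key is not present yet.
        seen.setdefault(addition["book_id"], addition)
    # Dicts preserve insertion order, so the original ordering is kept.
    return list(seen.values())


rows = [
    {"id": 10, "book_id": 1, "collection": "Fiction"},
    {"id": 11, "book_id": 2, "collection": "History"},
    {"id": 12, "book_id": 1, "collection": "Classics"},
]
unique = first_addition_per_book(rows)
print([row["id"] for row in unique])
```

If the input is ordered newest first, "first seen" means the most recent addition wins.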
additions = Addition.objects.filter(collection__subscribers=user).values('book').annotate(user=Min('user'), collection=Min('collection')).order_by()
This query will give you list of unique books with their users and collections. Books, collections, users will be pk's, not objects. But I hope you will store them in cache so that won't be a problem.
But for your workload I'd think about denormalization. My query is very heavy, and it isn't easy to cache its results if you will have frequent additions. My first approach will be to add latest_additions field to Collection model and to update with signals (not adding duplicates). The format of this field is up to you.
Sometimes it's OK to drop into SQL, especially when the ORM-only solution is not performant. It's easy to get the non-duplicate Addition row IDs in SQL, and then you can switch back to the ORM to select the data. It's two queries, but will outperform any of the single query solutions I've seen so far.
from django.db import connection
cursor = connection.cursor()
# Select non-duplicate book additions, preferring the most recently updated
query = '''SELECT id, MAX(updated) FROM %s
    GROUP BY book_id''' % Addition._meta.db_table
cursor.execute(query)
# Flatten the results to an id list
addition_ids = [row[0] for row in cursor.fetchall()]
additions = Addition.objects.filter(
    collection__subscribers=user, id__in=addition_ids).select_related()
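One caveat with the raw query above: as far as I know, when MySQL accepts a non-aggregated id next to MAX(updated) under GROUP BY book_id, it is free to take that id from any row in the group, not necessarily the newest one. A more portable way to get one row id per book is to aggregate the id itself. A minimal sketch with Python's sqlite3 module; the table name `addition` and the sample rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addition (id INTEGER PRIMARY KEY, "
             "book_id INTEGER NOT NULL, collection_id INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO addition (id, book_id, collection_id) VALUES (?, ?, ?)",
    [(1, 100, 1), (2, 200, 1), (3, 100, 2), (4, 300, 2)])

# One row id per book: the highest id, i.e. the most recently inserted row.
addition_ids = [row[0] for row in conn.execute(
    "SELECT MAX(id) FROM addition GROUP BY book_id ORDER BY MAX(id)")]
print(addition_ids)
```

The resulting id list can then be passed to the ORM with id__in exactly as in the snippet above. This treats "highest id" as "most recent", which matches the question's statement that either the original or the latest addition is acceptable.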

Edit orm object based on query with label fields

Time for more pushing the limits of sqlalchemy. It never ceases to amaze!
Background
I have table for devices, and a table to record physical links between them.
class Device(Base):
    __tablename__ = "device"
    device_id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String(255), nullable=False)
class PhysicalLink(Base):
    __tablename__ = "physical_link"
    physical_links_id = sa.Column(sa.Integer, primary_key=True)
    device_id_1 = sa.Column(sa.types.Integer, sa.ForeignKey(Device.device_id), nullable=False)
    device_port_1 = sa.Column(sa.String(255), nullable=False)
    device_id_2 = sa.Column(sa.types.Integer, sa.ForeignKey(Device.device_id), nullable=False)
    device_port_2 = sa.Column(sa.String(255), nullable=False)
    cable_number = sa.Column(sa.String(255), nullable=False)
When I'm dealing with the physical links for a known device, I don't want to always need if statements to decide whether I should be looking at device_[id|port]_1 or _2, so I did:
physical_links_table = PhysicalLink.__table__
physical_links_ua = union_all(
    select((
        physical_links_table.c.physical_links_id,
        label('this_device_id', physical_links_table.c.device_id_1),
        label('this_device_port', physical_links_table.c.device_port_1),
        label('other_device_id', physical_links_table.c.device_id_2),
        label('other_device_port', physical_links_table.c.device_port_2),
        physical_links_table.c.cable_number,
    ),),
    select((
        physical_links_table.c.physical_links_id,
        label('this_device_id', physical_links_table.c.device_id_2),
        label('this_device_port', physical_links_table.c.device_port_2),
        label('other_device_id', physical_links_table.c.device_id_1),
        label('other_device_port', physical_links_table.c.device_port_1),
        physical_links_table.c.cable_number,
    ),),
).alias('physical_links_ua')
class PhysicalLinksDir(object):
    pass
physical_links_dir_mapper = orm.mapper(PhysicalLinksDir, physical_links_ua)
physical_links_dir_mapper.add_property(
    'this_device', orm.relation(Device, primaryjoin=(PhysicalLinksDir.this_device_id == Device.device_id)))
physical_links_dir_mapper.add_property(
    'other_device', orm.relation(Device, primaryjoin=(PhysicalLinksDir.other_device_id == Device.device_id)))
This allows me to do:
physical_links = (db_session
    .query(PhysicalLinksDir)
    .filter(PhysicalLinksDir.this_device_id == my_device.device_id)
    .options(joinedload('other_device')))
for pl in physical_links:
    print(pl.other_device)
(Did I remember to tell you that I think that sqlalchemy rocks!)
Question
What do I need to do to make it possible to modify PhysicalLinksDir instance attributes, and be able to commit them back to the db?
In general, you will have to be very careful with updating it the way you want, because those PhysicalLinksDir view objects will not always be in sync with the underlying Device and PhysicalLink you might have in the session/database.
I obviously do not know your requirements, but I prefer not to have such inconsistencies when working with my model.
Also, there is a problem with the kind of mapping you have. You would expect to have 2 rows of PhysicalLinksDir for each row of PhysicalLink (one for each side), but if you try it, you will see this is not the case. The reason for this is that the first column (physical_links_id) is considered to be the primary key, so the query object will discard the second row with the same value.
In order to fix it, you need to configure the primary_key manually. Assuming there can be only one connection between two different Devices, the solution below will do the trick. You might need to extend it to include the port as well:
physical_links_dir_mapper = orm.mapper(PhysicalLinksDir, physical_links_ua,
    # note: add this
    primary_key=[physical_links_ua.c.physical_links_id, physical_links_ua.c.this_device_id],
)
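Why the second row disappears can be illustrated without SQLAlchemy: the ORM keeps one object per primary-key identity, much like a dict keyed by that identity. A toy sketch with made-up rows (this mimics the effect only, not the actual query machinery):

```python
rows = [
    # (physical_links_id, this_device_id, other_device_id)
    (1, 10, 20),   # link 1 seen from device 10
    (1, 20, 10),   # the same link seen from device 20
]

# Identity based on physical_links_id only: the second row is treated as
# the same object as the first, so only one entry survives.
by_link_id = {}
for row in rows:
    by_link_id.setdefault(row[0], row)

# Identity based on (physical_links_id, this_device_id): both survive.
by_link_and_device = {}
for row in rows:
    by_link_and_device.setdefault((row[0], row[1]), row)

print(len(by_link_id), len(by_link_and_device))
```

If two devices can be connected by more than one cable, the composite key would need the port as well, as noted above.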
DELETE: Now, to support delete, all you need to do is to add a relationship between your PLD and the actual PhysicalLink and the session.delete(my_PLD); session.commit() will also delete the PhysicalLink it represents:
physical_links_dir_mapper.add_property(
    'physical_link', orm.relation(PhysicalLink, primaryjoin=(
        PhysicalLinksDir.physical_links_id == PhysicalLink.physical_links_id),
        foreign_keys=[PhysicalLinksDir.physical_links_id]
    ))
But in fact, the deletion might work out of the box as the model is soft-linked to the physical_link table.
INSERT: Well, this is easily done with the PhysicalLink object directly, so I would just keep it this way.
UPDATE: You could probably achieve this with Session Events, but the simplest way would be to wrap all the attributes in a @property which delegates the change to the proper object.
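The property-delegation idea can be sketched in plain Python. Here `PhysicalLink` is a simple stand-in for the mapped class, and `DirectedLink` is a hypothetical wrapper name; in the real application the wrapped object would be the session-attached PhysicalLink instance, so writes through the view would be picked up by the normal commit.

```python
class PhysicalLink:
    """Plain stand-in for the mapped PhysicalLink class."""

    def __init__(self, device_id_1, device_port_1, device_id_2, device_port_2):
        self.device_id_1 = device_id_1
        self.device_port_1 = device_port_1
        self.device_id_2 = device_id_2
        self.device_port_2 = device_port_2


class DirectedLink:
    """View of a link from one side; writes go to the real object."""

    def __init__(self, link, side):
        if side not in (1, 2):
            raise ValueError("side must be 1 or 2")
        self._link = link
        self._this, self._other = (1, 2) if side == 1 else (2, 1)

    @property
    def this_device_port(self):
        return getattr(self._link, f"device_port_{self._this}")

    @this_device_port.setter
    def this_device_port(self, value):
        setattr(self._link, f"device_port_{self._this}", value)

    @property
    def other_device_port(self):
        return getattr(self._link, f"device_port_{self._other}")

    @other_device_port.setter
    def other_device_port(self, value):
        setattr(self._link, f"device_port_{self._other}", value)


link = PhysicalLink(10, "eth0", 20, "eth1")
view = DirectedLink(link, side=2)
view.this_device_port = "eth7"     # lands on device_port_2
print(link.device_port_2, view.other_device_port)
```

Only the port attributes are shown; the device ids would follow the same pattern.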
IMPORTANT: I still think that this way of working is not really nice, because the links are not updated automatically and your in-memory UnitOfWork might be inconsistent.
It would also be useful to understand why you think this way of working with your objects would be better. What are the use cases of this app?