How can I create a proper sql statement from this django model? - mysql

I'm trying to insert data retrieved by scraping into the DB created by the following models.
However, I realized that using Django's bulk_create, or the external library based on it, bulk_create_or_update, is likely to make the logic too complex. (I feel that the ORM should be used for simple CRUD and the like.)
So I'm thinking of using raw SQL to save the data, for both maintainability and speed.
I'm not familiar with SQL at all, so I'd like to get some advice.
What SQL would be preferable here?
Each page to be scraped has multiple pieces of information, and there are multiple pages in total. I'd like to scrape all the pages first, add them to a dictionary, and then save them in a batch using sql, but I don't know the best way to do this.
from django.db import models
from django.forms import CharField
# Create your models here.
# province
class Prefecture(models.Model):
    name = models.CharField("都道府県名", max_length=10)
# city
class City(models.Model):
    prefecture = models.ForeignKey(Prefecture, on_delete=models.CASCADE, related_name='city')
    name = models.CharField("市区町村名", max_length=10)
# seller
class Client(models.Model):
    prefecture = models.ForeignKey(Prefecture, on_delete=models.CASCADE, related_name='client', null=True, blank=True)
    city = models.ForeignKey(City, on_delete=models.CASCADE, related_name='client', null=True, blank=True)
    department = models.CharField("部局", max_length=100)
# detail
class Project(models.Model):
    name = models.CharField("案件名", max_length=100)
    serial_no = models.CharField("案件番号", max_length=100, null=True, blank=True)
    client = models.ForeignKey(Client, on_delete=models.CASCADE, related_name='project')
    # etc...
# file
class AttachedFile(models.Model):
    project = models.ForeignKey(Project, on_delete=models.CASCADE, related_name='attach_file')
    name = models.CharField(max_length=100)
    path = models.CharField(max_length=255)
# bid company
class Bidder(models.Model):
    name = models.CharField("入札業者名", max_length=100)
    prefecture = models.ForeignKey(Prefecture, on_delete=models.CASCADE, related_name='bidder', null=True, blank=True)
    city = models.ForeignKey(City, on_delete=models.CASCADE, related_name='bidder', null=True, blank=True)
    # etc...
# result
class BidResult(models.Model):
    project = models.ForeignKey(Project, on_delete=models.CASCADE, related_name='bid_result')
    bidder = models.ForeignKey(Bidder, on_delete=models.CASCADE, related_name='bid_result')

I don't think you will get a drastic performance boost by using raw SQL instead of the ORM. The ORM can also be used for complex operations, and operations such as bulk update and bulk create are not complex and are about as fast as raw SQL. Things may get slow with the ORM when you fetch records into memory and then operate on them, but in your case you are only creating and updating records, which can be done easily with the Django ORM. As for update_or_create, using an external library for it won't hurt your performance, but using raw SQL for marginal speed gains may hurt your code's maintainability since, as you said, you are not very familiar with raw SQL.
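If you still want to see what the raw-SQL route looks like, here is a minimal sketch of a batched parent/child insert. It uses sqlite3 from the standard library as a stand-in for MySQL, and the table names (client, project), the scraped_pages structure, and the save_pages helper are all assumptions made up for illustration, not the tables Django would generate for the models above.

```python
import sqlite3

# Hypothetical scraped data: one dict per page, each holding several projects.
scraped_pages = [
    {"department": "Roads", "projects": [("Bridge repair", "A-001"), ("Tunnel survey", "A-002")]},
    {"department": "Parks", "projects": [("Playground", None)]},
]

def save_pages(conn, pages):
    """Insert every client and its projects inside a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        for page in pages:
            cur = conn.execute(
                "INSERT INTO client (department) VALUES (?)", (page["department"],)
            )
            client_id = cur.lastrowid  # primary key of the row just inserted
            conn.executemany(
                "INSERT INTO project (name, serial_no, client_id) VALUES (?, ?, ?)",
                [(name, serial, client_id) for name, serial in page["projects"]],
            )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, department TEXT NOT NULL);
    CREATE TABLE project (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        serial_no TEXT,
        client_id INTEGER NOT NULL REFERENCES client(id)
    );
""")
save_pages(conn, scraped_pages)
project_count = conn.execute("SELECT COUNT(*) FROM project").fetchone()[0]
client_count = conn.execute("SELECT COUNT(*) FROM client").fetchone()[0]
```

Note how the child rows need the parent's generated id before they can be inserted; that bookkeeping is the part the ORM otherwise handles for you. With the usual MySQL drivers the placeholder is %s rather than ?.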

Related

Django DB Relations

I apologize for the novice question but my head is about to explode.
I am trying to learn Django and wanted to create something practical and that I could use. I settled with a small inventory system.
The problem I am having is figuring out the best way to have relationships between models for ideal db setup.
I have models for the following:
Depreciation Policy
Manufacturer
Customer/Owner
Asset Category (Server, laptop etc)
Asset Model (Macbook Pro, Proliant DL380 Gen 9 etc)
Asset Status (Archived, Deployed, Lost etc)
Asset Fields (Generic fields all assets would have, model(FK), status(FK), purchase date etc.)
Cpu
Server
Network Card
Right now I have both Server & Network Card inheriting Asset Fields.
My goal was to have tables for each type of asset but still have one master table that can reference each asset type's table via a FK so that if I want to display all assets in the database I can just list one table and pull in the relevant information from their related tables instead of having to look through each asset's table.
I want something like:
Asset Table:
id(pk), model(fk), status(fk), serial Number, purchase date, cost, location, server_id/network_card_id(fk)
Server Table:
id(pk), customer, name, memory, cpu(fk), ilo_type, ilo_lic
Asset Model:
class Asset(models.Model):
    # class Meta:
    #     abstract = True
    assetTag = models.CharField(max_length=50, unique=True)
    model = models.ForeignKey(AssetModel, on_delete=models.SET_NULL, null=True)
    status = models.ForeignKey(AssetStatus, on_delete=models.SET_NULL, null=True)
    serialNumber = models.CharField(max_length=50, unique=True)
    purchaseDate = models.DateTimeField('Date order was placed', blank=True)
    purchaseOrder = models.CharField(max_length=25, blank=True)
    cost = models.DecimalField(default=0, max_digits=8, decimal_places=2, blank=True)
    # location = models.CharField(max_length=25, blank=True, null=True)
    def calculate_current_value(self, purchaseDate, cost):
        purchase_price = float(cost)
        months = self.model.depreciationPolicy.months
        if months > 0:
            days = months * 30
            depreciation_per_day = (float(cost) / days)
            days_owned = timezone.now() - purchaseDate
            value_lost = days_owned.days * depreciation_per_day
            current_asset_value = purchase_price - value_lost
            return round(current_asset_value, 2)
    def __str__(self):
        return self.serialNumber
Server Model:
class Server(Asset):
    customer = models.ForeignKey(Customer, on_delete=models.SET_NULL, null=True)
    memory = models.IntegerField(default=0)
    cpu = models.ForeignKey(Cpu, on_delete=models.SET_NULL, null=True)
    ILOType = models.CharField(max_length=50, choices=(('Std', 'Standard'), ('Adv', 'Advanced')))
    ILOUsername = models.CharField(max_length=50, null=True, blank=True)
    ILOPassword = models.CharField(max_length=100, null=True, blank=True)
    ILOLicense = models.CharField(max_length=50, null=True, blank=True)
    def __str__(self):
        return self.serialNumber
Network_Card Model:
class Nic(Asset):
    macAddress = models.CharField(max_length=100, unique=True)
    portCount = models.IntegerField(default=0)
    portType = models.CharField(max_length=50, choices=(('Ethernet', 'Ethernet'), ('Fiber', 'Fiber')))
    def __str__(self):
        return self.name
The first thing I would recommend is for you to just forget about tables. You're not dealing with tables in Django, but with models, which are classes that represent the entities of your system. Models become tables later, but you don't have to concern yourself with them right now.
Second, model your classes carefully. Design a nice diagram representing their relationships. In your scenario, one class will contain references to other classes (a pivot class), so model that.
Also, take a moment to read the documentation. Django is very neatly and thoroughly documented. Read carefully about models and querysets. You will learn everything you need to represent things inside your architecture.
Some hints:
When defining relationship fields, you'll have a few options: ForeignKey(), ManyToManyField(), OneToOneField(), etc. Read about each and every one of the options, and choose the one that represents your reality most accurately.
ManyToManyField() has an interesting behavior: if you don't pass it an intermediate model (the through argument), it will create a table for you to act as a pivot. I prefer to create my middle tables explicitly, as it is more aligned with the Zen of Python.
When you want to return a coherent representation of your data, you'll have to work with querysets in order to enforce the relationships you have built into your models, much in the same way you'd do with a relational database, either by constructing a query, or designing a view.
Here's some nice links for you:
Fields: https://docs.djangoproject.com/en/1.10/ref/models/fields/
Querysets: https://docs.djangoproject.com/en/1.10/topics/db/queries/
The Zen of Python: https://www.python.org/dev/peps/pep-0020/
Further, I strongly recommend you to go take a look at Django Rest Framework as soon as you get a hold on the basic concepts of Django models. Here's the link: http://www.django-rest-framework.org/
And come back to me when you have more specific questions.
Happy coding!
I believe I have solved this issue.
The solution was to build it out as follows. Instead of the Server and Nic classes inheriting from the Asset class, I defined a 1:1 relationship between each of them and the Asset class:
asset = models.OneToOneField(Asset, on_delete=models.CASCADE, primary_key=True)
This allowed for the Asset table to track all assets, and the asset_id(PK) is both a foreign_key and a primary_key in Server & Nic.
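To make the resulting table layout concrete, here is a rough sketch of the shared-primary-key pattern in plain SQL, run through sqlite3 from the standard library. The table and column names are simplified assumptions for illustration, not the exact tables Django generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE asset (
        id INTEGER PRIMARY KEY,
        serial_number TEXT UNIQUE NOT NULL
    );
    -- asset_id is both the primary key and a foreign key to asset
    CREATE TABLE server (
        asset_id INTEGER PRIMARY KEY REFERENCES asset(id) ON DELETE CASCADE,
        memory INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE nic (
        asset_id INTEGER PRIMARY KEY REFERENCES asset(id) ON DELETE CASCADE,
        port_count INTEGER NOT NULL DEFAULT 0
    );
""")
conn.execute("INSERT INTO asset (id, serial_number) VALUES (1, 'SRV-001')")
conn.execute("INSERT INTO asset (id, serial_number) VALUES (2, 'NIC-001')")
conn.execute("INSERT INTO server (asset_id, memory) VALUES (1, 64)")
conn.execute("INSERT INTO nic (asset_id, port_count) VALUES (2, 4)")

# List every asset from the master table and work out which kind it is.
rows = conn.execute("""
    SELECT a.serial_number,
           CASE WHEN s.asset_id IS NOT NULL THEN 'server'
                WHEN n.asset_id IS NOT NULL THEN 'nic' END AS kind
    FROM asset a
    LEFT JOIN server s ON s.asset_id = a.id
    LEFT JOIN nic n ON n.asset_id = a.id
    ORDER BY a.id
""").fetchall()
```

Listing all assets only needs the asset table; the joins are there to pull in type-specific details when you want them.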

Django Serialize to Json from Super class

I'm trying to figure out if there is any efficient way to serialize a queryset from superclass. My models:
class CampaignContact(models.Model):
    campaign = models.ForeignKey(Campaign, related_name="campaign_contacts", null=False)
    schedule_day = models.DateField(null=True)
    schedule_time = models.TimeField(null=True)
    completed = models.BooleanField(default=False, null=False)
class CampaignContactCompany(CampaignContact):
    company = models.ForeignKey(Company, related_name='company_contacts', null=False)
class CampaignContactLead(CampaignContact):
    lead = models.ForeignKey(Lead, related_name='lead_contacts', null=False)
I want to create a JSON document with all campaign contacts, whether they belong to leads or to companies.
Django has a built-in serializer (django.core.serializers), but it might not work well given how you structured your models:
from django.core import serializers
data = serializers.serialize("json", CampaignContactCompany.objects.all())
I imagine you could run that on both tables and combine the two sets but it would introduce a bit of overhead. You could also create a static to_json method in CampaignContact which takes two query sets from the other two tables and formats/combines them into json.
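As a rough illustration of the "serialize both and combine" idea, the sketch below merges two JSON arrays into one. The combine_serialized helper is hypothetical, and the two input strings are hand-written stand-ins for what serializers.serialize("json", ...) would return for each queryset.

```python
import json

def combine_serialized(*json_strings):
    """Merge several JSON arrays into a single JSON array string."""
    combined = []
    for text in json_strings:
        combined.extend(json.loads(text))
    return json.dumps(combined)

# Hand-written stand-ins for the serializer output of each subclass queryset.
company_json = '[{"model": "app.campaigncontactcompany", "pk": 1, "fields": {"company": 7}}]'
lead_json = '[{"model": "app.campaigncontactlead", "pk": 2, "fields": {"lead": 3}}]'

merged = json.loads(combine_serialized(company_json, lead_json))
```

The extra parse and dump step is the overhead mentioned above; it is usually small next to the two database queries.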
Maybe you have a reason to model your tables as you did, but from what I can see you will end up with three tables: one never used on its own, and two that differ only in a company or lead field, which is probably not ideal. Typically, when relating a record to multiple kinds of objects, you would simply put the lead and company fields on the CampaignContact table and let them be null. To get only company contacts you could query company_contacts = CampaignContact.objects.filter(company__isnull=False)

Django GenericForeignKey vs. custom manual fields performance/optimization

I'm trying to build a typical social networking site. There are mainly two types of objects:
photo
status
A user can like a photo or a status. (Note that these two are mutually exclusive.)
That means we have two tables, one for images only and the other for statuses only.
Now, when a user likes an object (it could be a photo or a status), how should I store that information?
I want to design an efficient SQL schema for this.
Currently I'm using GenericForeignKey (GFK):
class LikedObject(models.Model):
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey('content_type', 'object_id')
But yesterday I wondered whether I could do this efficiently without using GFK:
class LikedObject(models.Model):
    OBJECT_TYPE = (
        ('status', 'Status'),
        ('image', 'Image/Photo'),
    )
    user = models.ForeignKey(User, related_name="liked_objects")
    obj_id = models.PositiveIntegerField()
    obj_type = models.CharField(max_length=63, choices=OBJECT_TYPE)
The only difference I can see is that I have to make two queries if I want to get all the statuses liked by a particular user:
status_ids = LikedObject.objects.filter(user=user_obj, obj_type='status').values_list('obj_id', flat=True)
status_objs = Status.objects.filter(id__in=status_ids)
Am I correct? If so, what would be the best approach in terms of easy querying and inserting, performance, etc.?
You are basically implementing your own Generic Object, only you limit your ContentType to your hard coded OBJECT_TYPE.
If you are only going to access the database as in your example (get all status objects liked by user x), or a couple specific queries, then your own implementation can be a little faster, of course. But obviously, if later you have to add more objects, or do other things, you may find yourself implementing your whole full generic solution. And like they say, why reinvent the wheel.
If you want better performance, and really only have those two Models to worry about, you may just want to have two different Like tables (StatusLike and ImageLike) and use inheritance to share functionality.
class LikedObject(models.Model):
    common_field = ...
    class Meta:
        abstract = True
    def some_share_function(self):
        ...
class StatusLikeObject(LikedObject):
    user = models.ForeignKey(User, related_name="status_liked_objects")
    status = models.ForeignKey(Status, related_name="liked_objects")
class ImageLikeObject(LikedObject):
    user = models.ForeignKey(User, related_name="image_liked_objects")
    image = models.ForeignKey(Image, related_name="liked_objects")
Basically, either you have a lot of Models to worry about, and then you probably want to use the more Django generic object implementation, or you only have two models, and why even bother with a half generic solution. Just use two tables.
In this case, I would check if your data objects Status and Photo may have many common data fields, e.g. Status.user and Photo.user, Status.title and Photo.title, Status.pub_date and Photo.pub_date, Status.text and Photo.caption, etc.
Could you combine them into an Item object maybe? That Item would have a Item.type field, either "photo" or "status"? Then you would only have a single table and a single object type a user can "like". Much simpler at basically no cost.
Edit:
from django.db import models
from django.utils.timezone import now
class Item(models.Model):
    data_type = models.SmallIntegerField(
        choices=((1, 'Status'), (2, 'Photo')), default=1)
    user = models.ForeignKey(User)
    title = models.CharField(max_length=100)
    pub_date = models.DateTimeField(default=now)
    # ...etc...
class Like(models.Model):
    user = models.ForeignKey(User, related_name="liked_objects")
    item = models.ForeignKey(Item)
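To show why the single-table layout keeps querying simple, here is a rough sketch in plain SQL via sqlite3 from the standard library. The table names (item, item_like) and the sample rows are assumptions for illustration; the point is only that "all statuses liked by a user" becomes one join instead of two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id INTEGER PRIMARY KEY,
        data_type INTEGER NOT NULL,   -- 1 = status, 2 = photo
        title TEXT NOT NULL
    );
    CREATE TABLE item_like (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL REFERENCES item(id),
        UNIQUE (user_id, item_id)     -- a user can like an item only once
    );
    INSERT INTO item (id, data_type, title) VALUES (1, 1, 'Hello world');
    INSERT INTO item (id, data_type, title) VALUES (2, 2, 'Sunset');
    INSERT INTO item (id, data_type, title) VALUES (3, 1, 'Lunch');
    INSERT INTO item_like (user_id, item_id) VALUES (1, 1);
    INSERT INTO item_like (user_id, item_id) VALUES (1, 2);
    INSERT INTO item_like (user_id, item_id) VALUES (2, 3);
""")

def liked_titles(conn, user_id, data_type):
    """Return titles of items of one type liked by one user, in a single query."""
    rows = conn.execute(
        """
        SELECT i.title
        FROM item_like AS l
        JOIN item AS i ON i.id = l.item_id
        WHERE l.user_id = ? AND i.data_type = ?
        ORDER BY i.id
        """,
        (user_id, data_type),
    ).fetchall()
    return [title for (title,) in rows]

liked_statuses = liked_titles(conn, 1, 1)
liked_photos = liked_titles(conn, 1, 2)
```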

SQLAlchemy db.session.query() vs model.query

For a simple return all results query should one method be preferred over the other? I can find uses of both online but can't really find anything describing the differences.
db.session.query([my model name]).all()
[my model name].query.all()
I feel that [my model name].query.all() is more descriptive.
It is hard to give a clear answer, as there is a high degree of preference subjectivity in answering this question.
From one perspective, the db.session is desired, because the second approach requires it to be incorporated in your model as an added step - it is not there by default as part of the Base class. For instance:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker
Base = declarative_base()
DBSession = scoped_session(sessionmaker())
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)
session = DBSession()
print(User.query)
That code fails with the following error:
AttributeError: type object 'User' has no attribute 'query'
You need to do something like this:
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)
    query = DBSession.query_property()
However, it could also be argued that just because it is not enabled by default, that doesn't invalidate it as a reasonable way to launch queries. Furthermore, in the flask-sqlalchemy package (which simplifies SQLAlchemy integration into the Flask web framework) this is already done for you as part of the Model class. Adding the query property to a model can also be seen in the SQLAlchemy tutorial:
class User(object):
    query = db_session.query_property()
    # ...
Thus, people could argue either approach.
I personally have a preference for the second method when I am selecting from a single table. For example:
serv = Service.query.join(Supplier, SupplierUsr).filter(SupplierUsr.username == usr).all()
This is because it is of smaller line length and still easily readable.
If I am selecting from more than one table or specifying columns, then I would use the session.query method, as it is extracting information from more than one model.
deliverables = db.session.query(Deliverable.column1, BatchInstance.column2).\
join(BatchInstance, Service, Supplier, SupplierUser). \
filter(SupplierUser.username == str(current_user)).\
order_by(Deliverable.created_time.desc()).all()
That said, a counter argument could be made in always using the session.query method as it makes the code more consistent, and when reading left to right, the reader immediately knows that the sqlalchemy directive they are going to read will be query, before mentally absorbing what tables and columns are involved.
At the end of the day, the answer to your question is subjective and there is no correct answer, and any code readability benefits either way are tiny. The only thing where I see a strong benefit is not to use model query if you are selecting from many tables and instead use the session.query method.

Django: retrieve distinct QuerySet

I've got the following models in my app. The Addition model is used to govern the many-to-many relationship between the Book model and the Collection model, since I need to include extra fields on the intermediate model.
class Book(models.Model):
    name = models.CharField(max_length=200)
    picture = models.ImageField(upload_to='img', max_length=1000)
    price = models.DecimalField(max_digits=8, decimal_places=2)
class Collection(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=100)
    books = models.ManyToManyField(Book, through='Addition')
    subscribers = models.ManyToManyField(User, related_name='collection_subscriptions', blank=True, null=True)
class Addition(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
    collection = models.ForeignKey(Collection)
    created = models.DateTimeField(auto_now=False, auto_now_add=True)
    updated = models.DateTimeField(auto_now=True, auto_now_add=True)
In my app users can add books to collections that they create (for example fiction, history, etc.). Other users can then follow those collections that they like.
When a user logs into the site, I'd like to display all of the books that have been recently added to the collections that they follow. With each book, I'd also like to display the name of the person who added it, and the name of the collection it's in.
I can get all of the additions as follows...
additions = Addition.objects.filter(collection__subscribers=user).select_related()
... but this results in duplicate books being retrieved and displayed to the user, often side by side.
Is there a way to retrieve a distinct list of books that are in collections the user is following?
I'm using Django 1.3 + MySQL.
Thanks.
UPDATE
I should add that in general I'm not looking for any 'loop through the results and de-duplicate that way' solutions, for a couple of reasons.
There are likely to be tens or even hundreds of thousands of additions (I am also displaying this information on pages that list all new additions added by users), and response time is extremely important.
This solution may become more practical when limiting the initial result set, but it creates problems with pagination, which is also required. Namely how do you paginate the entire result set while also de-duplicating only a small portion of that set. I'm open to any ideas here that may solve this problem.
UPDATE
I should also mention that if the same book gets added by multiple users, I actually don't have a preference for which addition gets used, either the original or the most recent addition would work fine.
How about the following - it's not a pure SQL solution, and it'll cost you an extra database query and some loop time, but it should still perform ok, and it'll give you a lot more control over which additions take precedence over others:
def filter_additions(additions):
    # Use a ValuesQuerySet for performance
    additions_values = additions.values()
    # The following code just eliminates duplicates. You could do
    # something much more powerful/interesting here if you like,
    # e.g. give preference to additions by a user's friends
    book_pk_registry = {}
    excluded_addition_pks = []
    for addition in additions_values:
        addition_pk = addition['id']
        book_pk = addition['book_id']
        if book_pk not in book_pk_registry:
            book_pk_registry[book_pk] = True
        else:
            excluded_addition_pks.append(addition_pk)
    # Return the filtered queryset so the caller actually receives it
    return additions.exclude(pk__in=excluded_addition_pks)
additions = Addition.objects.filter(collection__subscribers=user)
additions = filter_additions(additions)
If there are likely to be more than a thousand or so books involved, you may want to put a limit on the initial additions query. Passing massive lists of ids over in the exclude isn't such a great idea. Using 'values()' is quite important, because Python can cycle through a basic list of dicts a LOT faster than a queryset and it uses a lot less memory.
Assuming there won't be a huge number of additions to display, this could easily do the trick:
# may contain duplicate books
additions = Addition.objects.filter(collection__subscribers=user, created__gt=DATE_LAST_LOGIN).select_related()
# remove duplication
added_books = {}
for addition in additions:
    added_books[addition.book] = True
added_books = added_books.keys()
By the description you gave of the problem, performance would not be a problem.
additions = Addition.objects.filter(collection__subscribers=user).values('book').annotate(user=Min('user'), collection=Min('collection')).order_by()
This query will give you list of unique books with their users and collections. Books, collections, users will be pk's, not objects. But I hope you will store them in cache so that won't be a problem.
But for your workload I'd think about denormalization. My query is very heavy, and it isn't easy to cache its results if you will have frequent additions. My first approach will be to add latest_additions field to Collection model and to update with signals (not adding duplicates). The format of this field is up to you.
Sometimes it's OK to drop into SQL, especially when the ORM-only solution is not performant. It's easy to get the non-duplicate Addition row IDs in SQL, and then you can switch back to the ORM to select the data. It's two queries, but will outperform any of the single query solutions I've seen so far.
from django.db import connection
from operator import itemgetter
cursor = connection.cursor()
# Select non-duplicate book additions, preferring the most recently updated
query = '''SELECT id, MAX(updated) FROM %s
    GROUP BY book_id''' % Addition._meta.db_table
cursor.execute(query)
# Flatten the results to an id list (list() so it also works on Python 3)
addition_ids = list(map(itemgetter(0), cursor.fetchall()))
additions = Addition.objects.filter(
    collection__subscribers=user, id__in=addition_ids).select_related()
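One caveat with the query above: selecting id next to MAX(updated) while grouping by book_id does not guarantee that the returned id belongs to the most recently updated row, and stricter SQL modes reject such a query outright. A possible deterministic variant joins back to a per-book maximum. The sketch below runs it against sqlite3 from the standard library, with a simplified, made-up addition table and sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addition (
        id INTEGER PRIMARY KEY,
        book_id INTEGER NOT NULL,
        updated TEXT NOT NULL
    );
    INSERT INTO addition (id, book_id, updated) VALUES (1, 10, '2011-01-01 09:00:00');
    INSERT INTO addition (id, book_id, updated) VALUES (2, 10, '2011-01-03 09:00:00');
    INSERT INTO addition (id, book_id, updated) VALUES (3, 11, '2011-01-02 09:00:00');
    INSERT INTO addition (id, book_id, updated) VALUES (4, 10, '2011-01-02 09:00:00');
""")

# Find the latest 'updated' value per book, then join back to pick up
# the id of the row that actually carries that value.
query = """
    SELECT a.id
    FROM addition AS a
    JOIN (
        SELECT book_id, MAX(updated) AS latest
        FROM addition
        GROUP BY book_id
    ) AS newest
      ON newest.book_id = a.book_id AND newest.latest = a.updated
    ORDER BY a.id
"""
addition_ids = [row[0] for row in conn.execute(query)]
```

If two additions of the same book share the exact same updated value, both ids come back, so a tie-breaker would still be needed in that case. To restrict the result to collections the user follows, the same filter would have to go into the inner query as well.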