I am trying to optimize my DB queries (MySQL) in a Django app.
This is the situation:
I need to retrieve some data about sales and stock for some products on a monthly basis. This is the function:
def get_magazzino_month(year, month):
    from magazzino.models import ddt_in_item, omaggi_item, inventario_item
    from corrispettivi.models import corrispettivi_item, corrispettivi
    from fatture.models import fatture_item, fatture, fatture_laboratori_item
    from prodotti.models import prodotti

    qt = 0
    val = 0
    products = prodotti.objects.all()
    invents = inventario_item.objects.all().filter(id_inventario__data__year=year - 1)
    fatture_lab = fatture_laboratori_item.objects.all().order_by("-id_fattura__data")

    for product in products:
        inv_instance = filter_for_product(invents, product)
        if inv_instance:
            qt += inv_instance[0].quantita
        lab_instance = fatture_lab.filter(id_prodotti=product).first()
        prezzo_prodotto = (
            lab_instance.costo_acquisto / lab_instance.quantita
            - (lab_instance.costo_acquisto / lab_instance.quantita) * lab_instance.sconto / 100
        ) if lab_instance else product.costo_acquisto
    return val, qt
The problem is where I need to filter all the data to get only the product I need. It seems that the .filter call makes Django re-query the database, even though all of the data is already there. I tried writing a function to filter it myself, but although the number of queries drops, loading time increases dramatically.
This is the function to filter:
def filter_for_product(array, product):
    result = []
    for instance in array:
        if instance.id_prodotti.id == product.id:
            result.append(instance)
    return result
Has anyone ever dealt with this kind of problem?
You can use prefetch_related() to return a queryset of related objects and Prefetch() to further control the operation.
from django.db.models import Prefetch

products = prodotti.objects.all().prefetch_related(
    Prefetch(
        'product_set',
        queryset=inventario_item.objects.all().filter(id_inventario__data__year=year - 1),
        to_attr='invent'
    )
)
Then you can access each product's invent like products[0].invent
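With the prefetched list attached, the per-product lookup in the original loop can read from memory instead of re-querying; a minimal sketch, assuming the to_attr name used above:

for product in products:
    inv_instance = product.invent  # a plain Python list, no extra query
    if inv_instance:
        qt += inv_instance[0].quantita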
Using select_related() will help optimize your queries.
A good example of what select_related() does and how to use it is available at simpleisbetterthancomplex.
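As a minimal sketch of how it could be applied to one of the querysets in the question (whether it helps depends on which related fields are actually accessed per row):

fatture_lab = (
    fatture_laboratori_item.objects
    .select_related("id_fattura", "id_prodotti")  # joins the FKs in the same query
    .order_by("-id_fattura__data")
)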
I want to fetch data for multiple IDs that are provided from the user interface and stored in a list. How can I retrieve the data for these IDs using the Django ORM?
When I try the following approach, it returns nothing:
def selectassess(request):
    if request.method == 'POST':
        assess = request.POST.getlist("assess")
        quesno = request.POST.getlist("quesno")
        subid = request.POST.getlist("subid")
        print(quesno, subid)
        print(assess)
        max_id = Sub_Topics.objects.all().aggregate(max_id=Max("id"))['max_id']
        print(max_id)
        pk = random.sample(range(1, max_id), 3)
        subss = Sub_Topics.objects.raw("Select * from Sub_Topics where id=%s", (str(pk),))
        context4 = {
            'subss': subss,
        }
        print(pk)
        return render(request, "assessment.html", context)
By applying the approach below I can get the data for only one ID, selected by its index value. But I want to get the data for all the IDs stored in the list, so how can I get my required output using the Django ORM?
def selectassess(request):
    if request.method == 'POST':
        assess = request.POST.getlist("assess")
        quesno = request.POST.getlist("quesno")
        subid = request.POST.getlist("subid")
        print(quesno, subid)
        print("youuuuu")
        print(assess)
        max_id = Sub_Topics.objects.all().aggregate(max_id=Max("id"))['max_id']
        print(max_id)
        pk = random.sample(range(1, max_id), 3)
        sub = Sub_Topics.objects.filter(pk=pk[0]).values('id', 'subtopic')
        context4 = {
            'sub': sub,
        }
        print(pk)
        return render(request, "assessment.html", context4)
You can use filter(pk__in=pk).
Try:
list_of_ids = random.sample(range(1, max_id), 3)
sub = Sub_Topics.objects.filter(id__in=list_of_ids).values('id', 'subtopic')
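Put together with the view from the question, a minimal sketch would look like this (keeping the question's names; note that range's upper bound is exclusive, so max_id + 1 makes the highest id eligible, and sampling from a range assumes the ids have no gaps):

def selectassess(request):
    if request.method == 'POST':
        max_id = Sub_Topics.objects.aggregate(max_id=Max("id"))['max_id']
        list_of_ids = random.sample(range(1, max_id + 1), 3)
        sub = Sub_Topics.objects.filter(id__in=list_of_ids).values('id', 'subtopic')
        return render(request, "assessment.html", {'sub': sub})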
Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset includes all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this particular one isn't available yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    # Keep the first row for each distinct value in column_name.
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
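A usage sketch for the question's setup: sort descending on the update timestamp first, so the first (kept) row per ID is the latest one. The column names here are illustrative, and Table.sort_by needs a reasonably recent pyarrow.

sorted_table = table.sort_by([("updated_at", "descending")])  # hypothetical timestamp column
latest = drop_duplicates(sorted_table, "ticket_id")  # hypothetical ID column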
//end edit
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    # do stuff
I used value_counts to find the unique values and how many of each there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or the last one by using numpy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory intensive by maintaining a single numpy boolean mask while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask); see the sketch after this note.
I don't know how to calculate the speed though...
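A rough sketch of that idea, with names invented for illustration (one boolean mask is built while iterating, and filter() is called once at the end):

import numpy as np
import pyarrow as pa

def keep_first_per_value(table: pa.Table, column_name: str) -> pa.Table:
    values = np.array(table.column(column_name))
    mask = np.full(len(table), False)
    seen = set()
    for i, value in enumerate(values):
        if value not in seen:  # keep only the first occurrence of each value
            seen.add(value)
            mask[i] = True
    return table.filter(mask=mask)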
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't use combine_chunks() is a bug: Arrow apparently cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# Build a global row index across all chunks of the table.
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
# Keep the first (minimum-index) row per ID.
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that duckdb can give better performance on the group by. Changing the last two lines above into the following gives a 2x speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
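A variant sketch that pushes the whole "latest row per ID" step into DuckDB with a window function; the updated_at column is illustrative, and DuckDB resolves tb3 as the in-scope Arrow table via its replacement scan:

import duckdb
duck = duckdb.connect()
latest = duck.execute(
    "SELECT * FROM ("
    "  SELECT *, row_number() OVER (PARTITION BY ID ORDER BY updated_at DESC) AS rn FROM tb3"
    ") t WHERE rn = 1"
).fetch_arrow_table()  # the helper rn column can be dropped afterwards if unwanted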
I have two models like this:
class McqQuestion(models.Model):
    mcq_question_id = models.IntegerField()
    test_id = models.ForeignKey('exam.Test')
    mcq_right_answer = models.IntegerField()

class UserMcqAnswer(models.Model):
    user = models.ForeignKey('exam.UserInfo')
    test_id = models.ForeignKey('exam.Test')
    mcq_question_id = models.ForeignKey('exam.McqQuestion')
    user_answer = models.IntegerField()
I need to match user_answer against mcq_right_answer. I am able to do that by executing the raw query below.
rightAns = UserMcqAnswer.objects.raw(
    'SELECT B.id, COUNT(A.mcq_question_id) AS RightAns '
    'FROM exam_mcqquestion AS A '
    'LEFT JOIN exam_usermcqanswer AS B '
    'ON A.mcq_question_id = B.mcq_question_id_id '
    'WHERE B.test_id_id = %s AND B.user_id = %s AND '
    'A.mcq_right_answer = B.user_answer',
    [test_id, user_id])
1) The problem is that I couldn't pass the result back as a JsonResponse because it raises TypeError: Object of type 'RawQuerySet' is not JSON serializable.
2) Is there any alternative to this raw query using the ORM's objects and filtered querysets?
Django's serialize function's second argument can be any iterator that yields Django model instances.
So, in principle, you can use that raw SQL query that you worked on, using something like this:
query = """SELECT B.id, COUNT(A.mcq_question_id) AS RightAns\
FROM exam_mcqquestion AS A\
LEFT JOIN exam_usermcqanswer AS B\
ON A.mcq_question_id=B.mcq_question_id_id\
WHERE B.test_id_id=%s AND B.user_id=%s AND\
A.mcq_right_answer=B.user_answer"""%(test_id, user_id)
and then getting the json data you'll return, as:
from django.core import serializers
data = serializers.serialize('json', UserMcqAnswer.objects.raw(query, [test_id, user_id]), fields=('some_field_you_want', 'another_field', 'and_some_other_field'))
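The serializer already returns a JSON string, so inside the view you can hand it back directly, for example:

from django.http import HttpResponse
# inside the view:
return HttpResponse(data, content_type="application/json")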
Good luck finding the best way to solve your issue
Edit: small fix, added an import
Using raw queries is not recommended in Django:
When the model query APIs don’t go far enough, you can fall back to writing raw SQL.
In your case the model query API can solve your problem. You can use the following view:
views.py
from django.db.models import BooleanField, Case, F, Value, When
from django.http import JsonResponse

def get_answers(request):
    test = Test.objects.get(name="Test 1")
    answers = UserMcqAnswer.objects.filter(test_id=test, user=request.user).annotate(
        is_correct=Case(
            When(user_answer=F('mcq_question_id__mcq_right_answer'),
                 then=Value(True)),
            default=Value(False),
            output_field=BooleanField())
    ).values()
    return JsonResponse(list(answers), safe=False)
You can also consider Django REST Framework for serializing the QuerySet.
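If, as in the original raw query, only the number of right answers is needed, a filter on the same F() comparison plus count() is enough; a sketch reusing the names above:

from django.db.models import F

right_answers = UserMcqAnswer.objects.filter(
    test_id=test,
    user=request.user,
    user_answer=F('mcq_question_id__mcq_right_answer'),
).count()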
I need to:
1. Run a SELECT query on a MySQL DB and fetch the records.
2. Process the records with a Python script.
I am unsure how to proceed. Is XCom the way to go here? Also, MySqlOperator only executes the query and doesn't fetch the records. Is there any built-in transfer operator I can use? How can I use a MySQL hook here?
You may want to use a PythonOperator that uses the hook to get the data, apply transformations, and ship the (now scored) rows back to some other place.
Can someone explain how to proceed with this?
Refer - http://markmail.org/message/x6nfeo6zhjfeakfe
def do_work():
    mysqlserver = MySqlHook(connection_id)
    sql = "SELECT * from table where col > 100"
    row_count = mysqlserver.get_records(sql, schema='testdb')
    print(row_count[0][0])

callMYSQLHook = PythonOperator(
    task_id='fetch_from_testdb',
    python_callable=do_work,
    dag=dag
)
Is this the correct way to proceed?
Also, how do we use XComs to store the records for the following MySqlOperator?
t = MySqlOperator(
    conn_id='mysql_default',
    task_id='basic_mysql',
    sql="SELECT count(*) from table1 where id > 10",
    dag=dag)
I was really struggling with this for the past 90 minutes; here is a more declarative way for newcomers to follow:
from airflow.hooks.mysql_hook import MySqlHook

def fetch_records():
    request = "SELECT * FROM your_table"
    mysql_hook = MySqlHook(mysql_conn_id='the_connection_name_sourced_from_the_ui',
                           schema='specific_db')
    connection = mysql_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    sources = cursor.fetchall()
    print(sources)

...your DAG() as dag: code

task = PythonOperator(
    task_id='fetch_records',
    python_callable=fetch_records
)
This prints the contents of your DB query to the task logs.
I hope this is of use to someone else.
Sure, just create a hook or operator and call the get_records() method: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/hooks/dbapi.html
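A sketch that combines the two ideas, using the connection id and SQL from the question and the Airflow 1.x import paths used above: a PythonOperator pushes its callable's return value to XCom automatically, so returning get_records() makes the rows available to downstream tasks.

from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

def fetch_count():
    hook = MySqlHook(mysql_conn_id='mysql_default')
    # get_records() returns a list of row tuples; the return value is pushed to XCom.
    return hook.get_records("SELECT count(*) from table1 where id > 10")

fetch_task = PythonOperator(
    task_id='fetch_count',
    python_callable=fetch_count,
    dag=dag
)

# In a downstream task, pull it back with:
# rows = context['ti'].xcom_pull(task_ids='fetch_count')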
Trying to port some SQLAlchemy to Django, and I've got this tricky little bit:
version = Column(
    BIGINT,
    default=literal_column(
        'UNIX_TIMESTAMP() * 1000000 + MICROSECOND(CURRENT_TIMESTAMP)'
    ),
    nullable=False)
What's the best option for porting the literal_column bit to Django? The best idea I've got so far is a function, set as the default, that executes the same raw SQL, but I'm not sure if there's an easier way. My google-fu is failing me here.
Edit: the reason we need a timestamp created by MySQL is that we are measuring how out of date something is (so we need to actually know the time) and we want, for correctness, to have only one time-stamping authority (so that we don't introduce error by using Python functions that look at system times, which could differ across servers).
At present I've got:
def get_current_timestamp():
    cursor = connection.cursor()
    cursor.execute("SELECT UNIX_TIMESTAMP() * 1000000 + MICROSECOND(CURRENT_TIMESTAMP)")
    row = cursor.fetchone()
    return row[0]

version = models.BigIntegerField(default=get_current_timestamp)
which, at this point, sounds like my best/only option.
If you don't care about having a central time authority:
import time

version = models.BigIntegerField(
    default=lambda: int(time.time() * 1000000))
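One caveat if migrations are involved: Django migrations cannot serialize a lambda, so a module-level function with the same logic is the safer default (a small sketch):

import time

def epoch_microseconds():
    # Local clock on the app server, not the database clock.
    return int(time.time() * 1000000)

version = models.BigIntegerField(default=epoch_microseconds)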
To bend the database to your will:
from django.db.models.expressions import ExpressionNode

class NowInt(ExpressionNode):
    """ Pass this in the same manner you would pass Count or F objects """

    def __init__(self):
        super(NowInt, self).__init__(None, None, False)

    def evaluate(self, evaluator, qn, connection):
        return '(UNIX_TIMESTAMP() * 1000000 + MICROSECOND(CURRENT_TIMESTAMP))', []
### Model
version = models.BigIntegerField(default=NowInt())
Because expression nodes are not callables, the expression will be evaluated database-side.