I have three tables
User
Device
Log
I want to filter the logs based on users and devices. I'm using the following query, which iterates over the users and devices in order to get the logs. I feel this will become a performance hit. How can I reduce the number of database hits?
for user_obj in User.objects.all():
    device_qs = Device.objects.filter(user=user_obj)
    if device_qs.exists():
        for device_obj in device_qs:
            log_count = Log.objects.filter(user=user_obj, device=device_obj, created_at__range=(from_date, to_date)).count()
If you only need the log count per user and device (which is what you get from the code you posted), you can get that in just one query:
from django.db.models import Count
logs = (Log.objects
        .filter(created_at__range=(from_date, to_date))
        .values('user', 'device')
        .annotate(log_count=Count('device'))
        )
You can modify the query to include any attributes of the user and device models that you need:
.values('user__last_name', 'device__name') # etc.
You can also order the dataset by appending order_by() at the end to be able to iterate over it in the desired order:
.order_by('user__last_name', '-log_count')
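Each element of logs is then a dict with the user id, device id and count, so the nested loops above collapse into a single iteration. A minimal sketch of how the result might be consumed (the printed fields are just an example):

from django.db.models import Count

logs = (Log.objects
        .filter(created_at__range=(from_date, to_date))
        .values('user', 'device')
        .annotate(log_count=Count('device'))
        .order_by('user', '-log_count'))

for entry in logs:
    # entry is a dict, e.g. {'user': 1, 'device': 3, 'log_count': 42}
    print(entry['user'], entry['device'], entry['log_count'])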
What I would do is create a "proxy model" (really an unmanaged model) that references a view in your MySQL instance.
The view would look like this:
SELECT
    t1.*,
    t2.*,
    t3.*
FROM users t1
RIGHT JOIN device t2 ON t1.id = t2.user_id
RIGHT JOIN log t3 ON t3.device_id = t2.id;
Now to create a proxy model, do this:
class SomeModel(models.Model):
    # all fields from the 3 tables here

    class Meta:
        db_table = 'yourViewNameHere'
        managed = False  # this keeps Django from creating the table
Then run python manage.py makemigrations and python manage.py migrate as usual.
Now, to access the data you need, you would do something like this:
from django.db import connection

sql = "SELECT * FROM your_view WHERE some_date_column > 'foo' AND some_date_column < 'bar' "
with connection.cursor() as cur:
    cur.execute(sql)
    data = cur.fetchall()
print(data)
Note that if you are passing parameters to the raw SQL query, you should always pass them like this to avoid SQL injection:
sql = "SELECT * FROM your_view WHERE some_date_column > %s AND some_date_column < %s"
params = ('foo', 'bar')
with connection.cursor() as cur:
cur.execute(sql, params)
data = cur.fetchall()
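Once the unmanaged model is in place you can also skip the raw cursor entirely and filter through the ORM. A rough sketch, assuming the view exposes a some_date_column field on SomeModel (names taken from the examples above, adjust to your real columns):

# Hypothetical field name from the raw-SQL example above; adjust to the
# columns your view actually exposes.
rows = SomeModel.objects.filter(
    some_date_column__gt=from_date,
    some_date_column__lt=to_date,
)
for row in rows:
    print(row.some_date_column)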
Goal: run checks.yml on all Tables in Database, implicitly / dynamically (not naming 100s of Tables).
Following Soda's quick start, I've completed sections:
Install Soda Core
Connect Soda Core to a data source - configuration.yml
Now I'm following Write a check and run a scan - checks.yml.
Problem
However, the documentation only gives examples for checking one Table each.
4 Checks
Sum of Tables (in Database)
Sum of Columns (across all Tables, in Database)
Sum of Tables' descriptions exist
Sum of Columns' descriptions exist
Queries return a COUNT().
So far, checks.yml:
# checks for MY_DATABASE:
sql_metrics:
  name: num_tables, num_columns
  sum_tables query: |
    SELECT COUNT(*)
    FROM information_schema.tables
    WHERE table_schema = '*';
  sum_columns query: |
    SELECT COUNT(*)
    FROM information_schema.columns
    WHERE table_name = '*';
  sum_tables_descriptions query: |
    -- SQL
  sum_columns_descriptions query: |
    -- SQL
Your checks file should look like
checks for MY_DATABASE:
  - sum_tables > 0:
      sum_tables query: |
        SELECT COUNT(*)
        FROM information_schema.tables
        WHERE table_schema ~~ '%'
  - sum_columns > 0:
      sum_columns query: |
        SELECT COUNT(*)
        FROM information_schema.columns
Your checks must be based on conditions (something to be checked or verified) or filters, which are not yet well covered in the documentation.
~~ means LIKE and % is the wildcard, so the condition matches everything; since matching everything and having no condition give the same result, the WHERE clause is not strictly necessary.
Soda Core 3.0.5
Scan summary:
2/2 checks PASSED:
MY_DATABASE in postgres
sum_tables > 0 [PASSED]
sum_columns > 0 [PASSED]
All is good. No failures. No warnings. No errors.
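For context, that summary comes from running the scan against the configured data source; with Soda Core the command is roughly as follows (the data source name and file paths are whatever you used in your own setup):

soda scan -d MY_DATABASE -c configuration.yml checks.yml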
Or you could dynamically create the checks file with a script like the one below, building a list of datasets (tables) under a 'for each dataset T:' clause:
import psycopg2
import time

# File name with timestamp
filePath = '.'
timestr = time.strftime("%Y-%m-%d-%H%M%S")
fileName = 'checks-' + timestr + '.yml'

conn = None
try:
    conn = psycopg2.connect("dbname=postgres user=postgres password=yourpassword host=localhost")
    cur = conn.cursor()
    # One ' - <table_name>' line per table, ready to be written under 'datasets:'
    cur.execute("SELECT ' - '||table_name||'\n' FROM information_schema.tables WHERE table_schema = 'public' order by 1 limit 5;")
    row = cur.fetchone()
    with open(fileName, 'w') as f:
        line = "for each dataset T:\n datasets:\n"
        f.write(line)
        while row is not None:
            f.write(row[0])
            row = cur.fetchone()
        cur.close()
        line = " checks:\n - row_count > 0"
        f.write(line)
except (Exception, psycopg2.DatabaseError) as error:
    print(error)
finally:
    if conn is not None:
        conn.close()
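For illustration, if the database contained five public tables named table_a through table_e (hypothetical names), the generated checks-<timestamp>.yml would contain roughly:

for each dataset T:
 datasets:
 - table_a
 - table_b
 - table_c
 - table_d
 - table_e
 checks:
 - row_count > 0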
I have a data frame in pyspark like below
df = spark.createDataFrame(
    [
        ('2021-10-01', 'A', 25),
        ('2021-10-02', 'B', 24),
        ('2021-10-03', 'C', 20),
        ('2021-10-04', 'D', 21),
        ('2021-10-05', 'E', 20),
        ('2021-10-06', 'F', 22),
        ('2021-10-07', 'G', 23),
        ('2021-10-08', 'H', 24),
    ],
    ("RUN_DATE", "NAME", "VALUE"))
Now using this data frame I want to update a table in MySQL.
# query to run should be similar to this
update_query = "UPDATE DB.TABLE SET DATE = '2021-10-01', VALUE = 25 WHERE NAME = 'A'"
# mysql_conn is a function which I use to connect to `MySql` from `pyspark` and run queries
# Invoking the function
mysql_conn(host, user_name, password, update_query)
Now when I invoke the mysql_conn function with these parameters, the query runs successfully and the record gets updated in the MySQL table.
Now I want to run the update statement for all the records in the data frame.
For each NAME it has to pick the RUN_DATE and VALUE and replace in update_query and trigger the mysql_conn.
I think we need a for loop, but I'm not sure how to proceed.
Instead of iterating through the dataframe with a for loop, it would be better to distribute the workload across partitions using foreachPartition. Moreover, since you are writing a custom query anyway, instead of executing one query per row it is more efficient to execute a batch operation per partition, which reduces round trips, latency and concurrent connections. E.g.
def update_db(rows):
    temp_table_query = ""
    for row in rows:
        if len(temp_table_query) > 0:
            temp_table_query = temp_table_query + " UNION ALL "
        temp_table_query = temp_table_query + " SELECT '%s' as RUNDATE, '%s' as NAME, %d as VALUE " % (row.RUN_DATE, row.NAME, row.VALUE)
    if not temp_table_query:
        return  # nothing to update for an empty partition
    update_query = """
    UPDATE DBTABLE
    INNER JOIN (
        %s
    ) new_records ON DBTABLE.NAME = new_records.NAME
    SET
        DBTABLE.DATE = new_records.RUNDATE,
        DBTABLE.VALUE = new_records.VALUE
    """ % (temp_table_query)
    mysql_conn(host, user_name, password, update_query)

df.foreachPartition(update_db)
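For the first two rows of the example dataframe, the statement assembled by update_db would look roughly like this:

UPDATE DBTABLE
INNER JOIN (
    SELECT '2021-10-01' as RUNDATE, 'A' as NAME, 25 as VALUE
    UNION ALL
    SELECT '2021-10-02' as RUNDATE, 'B' as NAME, 24 as VALUE
) new_records ON DBTABLE.NAME = new_records.NAME
SET
    DBTABLE.DATE = new_records.RUNDATE,
    DBTABLE.VALUE = new_records.VALUE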
View Demo on how the UPDATE query works
Let me know if this works for you.
I have two tables, User and Posts, and an association table, Followers.
I am creating a subscriber system in my project (a social network), and for the sake of speed I want to write raw SQL instead of this function:
def followed_posts_by_user(self):
    followed_posts = Post.query.join(
        followers, (followers.c.followed_id == Post.user_id)).filter(
            followers.c.follower_id == self.id)
    filtered = Post.query.filter_by(user_id=self.id)
    return followed_posts.union(filtered).order_by(Post.timestamp.desc())
Since this function seems too complex, I wrote this raw SQL instead:
comm = """
SELECT * from posts
INNER JOIN followers
ON followers.followed_id=posts.user_id
WHERE followers.follower_id = '(%s)';
"""
cursor.execute(comm, self.id)
But it doesn't work.
Situation
Working with Python 3.7.2
I have read privilege on a MariaDB table with 5M rows on a server.
I have a local text file with 7K integers, one per line.
The integers represent IDXs of the table.
The IDX column of the table is the primary key. (so I suppose it is automatically indexed?)
Problem
I need to select all the rows whose IDX is in the text file.
My effort
Version 1
Make 7K queries, one for each line in the text file. This makes approximately 130 queries per second, costing about 1 minute to complete.
import pymysql

connection = pymysql.connect(....)

with connection.cursor() as cursor:
    query = (
        "SELECT *"
        " FROM TABLE1"
        " WHERE IDX = %(idx)s;"
    )
    all_selected = {}
    with open("idx_list.txt", "r") as f:
        for idx in f:
            idx = idx.strip()
            if idx:
                idx = int(idx)
                parameters = {"idx": idx}
                cursor.execute(query, parameters)
                result = cursor.fetchall()[0]
                all_selected[idx] = result
Version 2
Select the whole table, iterate over the cursor and cherry-pick rows. The for-loop over .fetchall_unbuffered() covers 30-40K rows per second, and the whole script costs about 3 minutes to complete.
import pymysql

connection = pymysql.connect(....)

with connection.cursor() as cursor:
    query = "SELECT * FROM TABLE1"
    set_of_idx = set()
    with open("idx_list.txt", "r") as f:
        for line in f:
            if line.strip():
                line = int(line.strip())
                set_of_idx.add(line)
    all_selected = {}
    cursor.execute(query)
    for row in cursor.fetchall_unbuffered():
        if row[0] in set_of_idx:
            all_selected[row[0]] = row[1:]
Expected behavior
I need to select faster, because the number of IDXs in the text file will grow as big as 10K-100K in the future.
I consulted other answers including this one, but I can't make use of it since I only have read privilege, which makes it impossible to create another table to join with.
So how can I make the selection faster?
A temporary table implementation would look like:
connection = pymysql.connect(...., local_infile=True)
with connection.cursor() as cursor:
    cursor.execute("CREATE TEMPORARY TABLE R (IDX INT PRIMARY KEY)")
    cursor.execute("LOAD DATA LOCAL INFILE 'idx_list.txt' INTO TABLE R")
    cursor.execute("SELECT TABLE1.* FROM TABLE1 JOIN R USING ( IDX )")
    ..
    cursor.execute("DROP TEMPORARY TABLE R")
Thanks to the hint (or more than a hint) from @danblack, I was able to achieve the desired result with the following query.
query = (
    "SELECT *"
    " FROM TABLE1"
    " INNER JOIN R"
    " ON R.IDX = TABLE1.IDX;"
)
cursor.execute(query)
danblack's SELECT statement didn't work for me, raising an error:
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'IDX' at line 1")
This is probably because of MariaDB's join syntax, so I consulted the MariaDB documentation on joining tables.
And now it selects 7K rows in 0.9 seconds.
Leaving here as an answer just for the sake of completeness, and for future readers.
I'm in the middle of converting an old legacy PHP system to Flask + SQLAlchemy and was wondering how I would construct the following:
I have a model:
class Invoice(db.Model):
    paidtodate = db.Column(DECIMAL(10,2))
    fullinvoiceamount = db.Column(DECIMAL(10,2))
    invoiceamount = db.Column(DECIMAL(10,2))
    invoicetype = db.Column(db.String(10))
    acis_cost = db.Column(DECIMAL(10,2))
The query I need to run is:
SELECT COUNT(*) AS the_count,
       SUM(IF(paidtodate > 0, paidtodate,
              IF(invoicetype = 'CPCN' OR invoicetype = 'CPON' OR invoicetype = 'CBCN' OR
                 invoicetype = 'CBON' OR invoicetype = 'CPUB' OR invoicetype = 'CPGU' OR
                 invoicetype = 'CPSO', invoiceamount, fullinvoiceamount))) AS amount,
       SUM(acis_cost) AS cost,
       (SUM(IF(paidtodate > 0, paidtodate, invoiceamount)) - SUM(acis_cost)) AS profit
FROM tblclientinvoices
Is there an SQLAlchemy-ish way to construct this query? I've tried googling for MySQL IF statements with SQLAlchemy but drew blanks.
Many thanks!
Use func (see the documentation) to generate SQL function expressions:
from sqlalchemy import func, select

qry = select([
    func.count().label("the_count"),
    func.sum(func.IF(
        Invoice.paidtodate > 0,
        Invoice.paidtodate,
        # @note: I prefer using IN instead of multiple OR statements
        func.IF(
            Invoice.invoicetype.in_(
                ("CPCN", "CPON", "CBCN", "CBON", "CPUB", "CPGU", "CPSO")
            ),
            Invoice.invoiceamount,
            Invoice.fullinvoiceamount,
        )
    )).label("amount"),
    func.sum(Invoice.acis_cost).label("Cost"),
    (func.sum(func.IF(
        Invoice.paidtodate > 0,
        Invoice.paidtodate,
        Invoice.invoiceamount,
    )) - func.sum(Invoice.acis_cost)).label("Profit"),
])

rows = session.execute(qry)
for row in rows:
    print(row)
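If the query ever needs to run on a backend other than MySQL, the same conditional logic can be expressed with SQLAlchemy's case() construct instead of the MySQL-specific IF(). A rough sketch of one of the expressions, using the same legacy-style API as above (the label name is just an example):

from sqlalchemy import case, func, select

# Portable equivalent of SUM(IF(paidtodate > 0, paidtodate, invoiceamount))
amount_paid_or_invoiced = func.sum(
    case(
        [(Invoice.paidtodate > 0, Invoice.paidtodate)],
        else_=Invoice.invoiceamount,
    )
).label("amount_paid_or_invoiced")

qry = select([amount_paid_or_invoiced])
rows = session.execute(qry)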