I have ~6 tables where I have to count or sum fields based on matching site_ids and dates. I have the following query, with many subqueries, which takes an extraordinary amount of time to run. I am certain there is an easier, more efficient way, but I am rather new to these more complex queries. I have read about optimizations, specifically using JOIN ... ON, but I am struggling to understand and implement them.
The goal is to speed this up and not bring my small server to its knees when it runs. Any assistance or direction would be VERY much appreciated!
SELECT date(date_added) as dt_date,
site_id as dt_site_id,
(SELECT site_id from branch_mappings bm WHERE mark_id_site = dt.site_id) as site_id,
(SELECT parent_id from branch_mappings bm WHERE mark_id_site = dt.site_id) as main_site_id,
(SELECT corp_owned from branch_mappings bm WHERE mark_id_site = dt.site_id) as corp_owned,
count(id) as dt_calls,
(SELECT count(date_submitted) FROM mark_unbounce ub WHERE date(date_submitted) = dt_date AND ub.site_id = dt.site_id) as ub,
(SELECT count(timestamp) FROM mark_wordpress_contact wp WHERE date(timestamp) = dt_date AND wp.site_id = dt.site_id) as wp,
(SELECT count(added_on) FROM m_shrednations sn WHERE date(added_on) = dt_date AND sn.description = dt.site_id) as sn,
(SELECT sum(users) FROM mark_ga ga WHERE date(ga.date) = dt_date AND channel LIKE 'Organic%' AND ga.site_id = dt.site_id) as ga_organic
FROM mark_dialogtech dt
WHERE site_id is not null
GROUP BY site_name, dt_date
ORDER BY site_name, dt_date;
What you're doing is the equivalent of asking your server to query 7+ different tables every time you run this query. Personally, I use joins and nested queries because I can whittle results down to exactly what I need.
The first 3 subqueries can be replaced with...
SELECT date(date_added) as dt_date,
dt.site_id as dt_site_id,
bm.site_id as site_id,
bm.parent_id as main_site_id,
bm.corp_owned as corp_owned
FROM mark_dialogtech dt
INNER JOIN branch_mappings bm
ON bm.mark_id_site = dt.site_id
I'm not sure why you are running the remaining aggregate subqueries. Is there a business requirement? If so, consider how often this needs to run and when.
If absolutely necessary, add those to the joins like...
FROM mark_dialogtech dt
INNER JOIN
(SELECT site_id, count(date_submitted) as ub_count FROM mark_unbounce GROUP BY site_id) ub
on ub.site_id = dt.site_id
This should limit the results to only those records where the site_id exists in both mark_dialogtech and mark_unbounce (or whatever table you join). In my experience, this method speeds things up.
Still, my concern is the number of aggregations you're performing. If they can be cached to a dashboard and pulled during slow times, that would be best.
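Putting those pieces together, here is a rough sketch of the whole statement with each table pre-aggregated once in a derived query (table and column names are taken from the question; the wp and sn joins would follow the same pattern as ub, so treat this as a starting point rather than a drop-in replacement):
SELECT d.dt_date,
       d.site_id AS dt_site_id,
       bm.site_id,
       bm.parent_id AS main_site_id,
       bm.corp_owned,
       d.dt_calls,
       ub.ub_count AS ub,
       ga.ga_count AS ga_organic
FROM (SELECT date(date_added) AS dt_date, site_id, count(id) AS dt_calls
      FROM mark_dialogtech
      WHERE site_id IS NOT NULL
      GROUP BY dt_date, site_id) d
INNER JOIN branch_mappings bm
        ON bm.mark_id_site = d.site_id
LEFT JOIN (SELECT site_id, date(date_submitted) AS ub_date, count(*) AS ub_count
           FROM mark_unbounce
           GROUP BY site_id, ub_date) ub
       ON ub.site_id = d.site_id AND ub.ub_date = d.dt_date
LEFT JOIN (SELECT site_id, date(`date`) AS ga_date, sum(users) AS ga_count
           FROM mark_ga
           WHERE channel LIKE 'Organic%'
           GROUP BY site_id, ga_date) ga
       ON ga.site_id = d.site_id AND ga.ga_date = d.dt_date
ORDER BY d.dt_date, d.site_id;
This way each side table is scanned once in total instead of once per output row, which is where the original correlated subqueries hurt.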
It's hard to analyze how big your query is (no data examples), but in your case I highly recommend using CTEs (Common Table Expressions). Check this:
https://www.sqlpedia.pl/cte-common-table-expressions/
CTEs do not have a physical representation in tempdb like temporary tables or table variables do. A CTE can be viewed as a temporary, non-materialized view. When MSSQL executes a query and encounters a CTE, it replaces the reference to that CTE with its definition. Therefore, if the CTE data is used several times in a given query, the same code will be executed several times, and MSSQL does not optimize it. So it will work well only for small amounts of data, as in your case.
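To illustrate the pattern on the question's own tables, here is a minimal sketch (MySQL 8.0+ supports the same WITH syntax; the caveat above about referencing a CTE more than once still applies):
WITH ub_daily AS (
    SELECT site_id, date(date_submitted) AS dt, count(*) AS ub_count
    FROM mark_unbounce
    GROUP BY site_id, dt
)
SELECT dt.site_id, date(dt.date_added) AS dt_date, ub_daily.ub_count
FROM mark_dialogtech dt
LEFT JOIN ub_daily
       ON ub_daily.site_id = dt.site_id
      AND ub_daily.dt = date(dt.date_added);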
Appreciate all the responses.
I ended up creating a Python script to run the queries separately and insert the results into the table for the respective KPI, so I scrapped the idea of a single query due to performance. I concatenated each date and site_id to create the id, then leveraged ON DUPLICATE KEY UPDATE with each INSERT statement.
The Python dictionaries look like this, and I simply looped over them. Again, thanks for the help.
SELECT STATEMENTS (Python Dict)
"dt":"SELECT date(date_added) as dt_date, site_id as dt_site, count(site_id) as dt_count FROM mark_dialogtech WHERE site_id is not null GROUP BY dt_date, dt_site ORDER BY dt_date, dt_site;",
"ub":"SELECT date_submitted as ub_date, site_id as ub_site, count(site_id) as ub_count FROM mark_unbounce WHERE site_id is not null GROUP BY ub_date, ub_site;",
"wp":"SELECT date(timestamp) as wp_date, site_id as wp_site, count(site_id) as wp_count FROM mark_wordpress_contact WHERE site_id is not null GROUP BY wp_date, wp_site;",
"sn":"SELECT date(added_on) as sn_date, description as sn_site, count(description) as sn_count FROM m_shrednations WHERE description <> '' GROUP BY sn_date, sn_site;",
"ga":"SELECT date as ga_date, site_id as ga_site, sum(users) as ga_count FROM mark_ga WHERE users is not null GROUP BY ga_date, ga_site;"
INSERT STATEMENTS (Python Dict)
"dt":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, dt_calls, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE dt_Calls={dbdata[3]}, added_on='{dbdata[4]}';",
"ub":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, ub, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE ub={dbdata[3]}, added_on='{dbdata[4]}';",
"wp":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, wp, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE wp={dbdata[3]}, added_on='{dbdata[4]}';",
"sn":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, sn, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE sn={dbdata[3]}, added_on='{dbdata[4]}';",
"ga":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, ga_organic, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE ga_organic={dbdata[3]}, added_on='{dbdata[4]}';",
It would be very difficult to analyze the query without the data. Anyway, try joining the tables and grouping; that should improve the performance.
here is a left join sample
SELECT t1.column_name, t2.column_name
FROM table1 t1
LEFT JOIN table2 t2
ON t1.common_column = t2.common_column;
Check this for more detailed information: https://learnsql.com/blog/how-to-left-join-multiple-tables/
Related
I have two tables with a huge amount of data in them (~1.8 million rows in the main one, ~1.2 million in the secondary one), as follows:
subscriber_table (id, name, email, country, account_status, ...)
subscriber_payment_table (id, subscriber_id, payment_type, payment_credential)
My end goal is a table containing all the users and their payment data (null if none exists), created up to yesterday, and with account_status = 1 (active).
Not all subscribers have a corresponding subscriber_payment row, so an INNER JOIN isn't a viable option, and with a LEFT JOIN, SQL times out my query after 2 hrs of heavy processing.
SELECT
`subscribers`.`id` AS `id`,
`subscribers`.`email` AS `email`,
`subscribers`.`name` AS `name`,
`subscribers`.`geoloc_country` AS `country`,
`subscribers_payment`.`payment_type` AS `paymentType`,
`subscribers_payment`.`payment_credential` AS `paymentCredential`,
`subscribers`.`create_datetime` AS `createdAt`
FROM
`subscribers`
LEFT JOIN
`subscribers_payment` ON (`subscribers_payment`.`subscriberId` = `subscribers`.`id`)
WHERE
`subscribers`.`account_status` = 1
AND DATE_FORMAT(CAST(`subscribers`.`create_datetime` AS DATE), '%Y-%m-%d') < curdate();
As mentioned, this query takes too much time and ends up timing out and not working.
I've also considered having a UNION, between "All the Subscribers" and "Subscribers with Payment".
(
SELECT
`subscribers`.`id` AS `id`,
`subscribers`.`email` AS `email`,
`subscribers`.`name` AS `name`,
`subscribers`.`geoloc_country` AS `country`,
null AS `paymentType`,
null AS `paymentCredential`,
`subscribers`.`create_datetime` AS `createdAt`
FROM
`subscribers`
WHERE
`subscribers`.`account_status` = 1
AND DATE_FORMAT(CAST(`subscribers`.`create_datetime` AS DATE), '%Y-%m-%d') < curdate()
)
UNION
(
SELECT
`subscribers`.`id` AS `id`,
`subscribers`.`email` AS `email`,
`subscribers`.`name` AS `name`,
`subscribers`.`geoloc_country` AS `country`,
`subscribers_payment`.`payment_type` AS `paymentType`,
`subscribers_payment`.`payment_credential` AS `paymentCredential`,
`subscribers`.`create_datetime` AS `createdAt`
FROM
`subscribers`
INNER JOIN
`subscribers_payment` ON (`subscribers_payment`.`subscriberId` = `subscribers`.`id`)
WHERE
`subscribers`.`account_status` = 1
AND DATE_FORMAT(CAST(`subscribers`.`create_datetime` AS DATE), '%Y-%m-%d') < curdate()
)
The problem with this implementation is that I'm getting duplicate records: I'm using a UNION, but it isn't collapsing my results and removing non-distinct rows, because a subscriber with a payment appears once with null paymentType and paymentCredential (from the first SELECT) and once with the real values (from the second).
This query runs in about ~2 mins, so it is more feasible for me. I just need to eliminate the duplicate records... unless there's a wiser option here.
Disclaimer: we're using MyISAM tables, so having foreign keys to speed up the queries is a no-go.
For this query:
SELECT . . .
FROM subscribers s LEFT JOIN
subscribers_payment sp
ON sp.subscriberId = s.id
WHERE s.account_status = 1 AND
s.create_datetime < curdate();
Then, you want an index on subscribers(account_status, create_datetime, id) and on subscribers_payment(subscriberId).
I am guessing that the index on subscribers_payment is missing, which explains the performance problems.
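In DDL terms, the suggested indexes would look something like this (the index names are placeholders):
CREATE INDEX idx_subscribers_status_created_id
    ON subscribers (account_status, create_datetime, id);

CREATE INDEX idx_subscribers_payment_subscriber
    ON subscribers_payment (subscriberId);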
Notes:
Use table aliases -- they make the query easier to write and read.
There should be no need to convert a datetime to a string for comparison purposes.
There is no need to use backticks for all identifiers. They just make the query harder to write and read.
I'm not super experienced, though I do have SOME experience with MySQL. I have a problem I'm trying to solve with a trigger, but it's proving much more complex than I thought, and I would appreciate some advice.
I have two tables. TableA and TableB.
We have customer requests coming in with their data. Each new batch of requests is scraped by a web scraper (this is the only way we can do this, so ignore how odd the process sounds) and dumped into A; it's then supposed to go into B, which gets rid of duplicates, and then we send them an email based on the data. I can't see how the web scraper inserts the data, so that's out.
Because customers submit multiple requests, or the same person has different requests, the data needs to be unique, but not that unique. We want to record each request as a unique request, even if it's from the same customer; some customers share a name, or they come back with a different request.
Therefore I gave table B a composite primary key over name, email, address, and notes. (If I'm right about unique indexes, any matching index would trigger an update, which would be bad if there were two John Smiths, so primary key it is.)
I've tried different ways of doing this, following examples in multiple threads throughout this website, but I've been on this issue for days and I'm losing it! I know I'm doing something wrong, but what?! What I ended up with is this:
TRIGGER ON tableA, AFTER INSERT:
INSERT INTO tableA_to_email
(
customer_name, customer_email, customer_phone, customer_address)
VALUES ((
SELECT
new.customer_name
FROM
tableA
WHERE
customer_name = new.customer_name),
(
SELECT
new.customer_email
FROM
tableA
WHERE
customer_email = new.customer_email),
(
SELECT
new.customer_phone
FROM
tableA
WHERE
customer_phone = new.customer_phone),
(
SELECT
new.customer_address
FROM
tableA
WHERE
customer_address = new.customer_address))
ON DUPLICATE KEY UPDATE
customer_phone = VALUES(customer_phone)
Input into an empty table: INSERT INTO tableA (customer_name, customer_phone, customer_email, customer_address) VALUES ('7', '0', '8', '0');
Output: MySQL error 1242 - Subquery returns more than 1 row
I understand the error, but the input above isn't more than one row? I tried it on an empty table so...
Basically, with your approach each of those subqueries may only return a single value, and apparently one of your selects returned more than one row.
So this:
INSERT INTO tableA_to_email
(
customer_name, customer_email, customer_phone, customer_address)
VALUES ((
SELECT
new.customer_name
FROM
tableA
WHERE
customer_name = new.customer_name LIMIT 1),
(
SELECT
new.customer_email
FROM
tableA
WHERE
customer_email = new.customer_email LIMIT 1),
(
SELECT
new.customer_phone
FROM
tableA
WHERE
customer_phone = new.customer_phone LIMIT 1),
(
SELECT
new.customer_address
FROM
tableA
WHERE
customer_address = new.customer_address LIMIT 1))
ON DUPLICATE KEY UPDATE
customer_phone = VALUES(customer_phone)
Will run without problems because every select returns only 1 row.
But you can simply do:
INSERT INTO tableA_to_email
( customer_name, customer_email, customer_phone, customer_address)
VALUES (
new.customer_name,
new.customer_email,
new.customer_phone,
new.customer_address
)
ON DUPLICATE KEY UPDATE
customer_phone = VALUES(customer_phone)
That works as well.
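For completeness, the simplified version wrapped in full MySQL trigger syntax might look like this (the trigger name is a placeholder):
DELIMITER //

CREATE TRIGGER trg_tableA_after_insert
AFTER INSERT ON tableA
FOR EACH ROW
BEGIN
    -- NEW holds the row that was just inserted into tableA
    INSERT INTO tableA_to_email
        (customer_name, customer_email, customer_phone, customer_address)
    VALUES
        (NEW.customer_name, NEW.customer_email, NEW.customer_phone, NEW.customer_address)
    ON DUPLICATE KEY UPDATE
        customer_phone = VALUES(customer_phone);
END//

DELIMITER ;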
I have a location table in my database which contains location data of all the users of my system.
The table design is something like
id| user_id| longitude| latitude| created_at|
I have an array of users. Now I want to select the latest (sorted by created_at) location of each of these users.
I was able to figure out the SQL query for this:
SELECT * FROM my_table
WHERE (user_id , created_at) IN (
SELECT user_id, MAX(created_at)
FROM my_table
GROUP BY user_id
)
AND user_id IN ('user1', 'user2', ... );
Now, as I am working in Ruby on Rails, I want to translate this SQL query to ActiveRecord. Can anyone please help me with this?
I think this will give the correct result:
MyModel.order(created_at: :desc).group(:user_id).distinct(:user_id)
If you want to generate the exact same query, this will do it:
MyModel.where("(user_id, created_at) IN (SELECT user_id, MAX(created_at) from my_table GROUP BY user_id)")
I think the subquery will probably not scale well with a large data set, but I understand if you just want to get it into Rails and optimize later.
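If the subquery does become a bottleneck, one common alternative (a plain SQL sketch, not the ActiveRecord form) is to join against the grouped maximums instead:
SELECT t.*
FROM my_table t
INNER JOIN (
    SELECT user_id, MAX(created_at) AS max_created_at
    FROM my_table
    GROUP BY user_id
) latest ON latest.user_id = t.user_id
        AND latest.max_created_at = t.created_at
WHERE t.user_id IN ('user1', 'user2');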
How about adding a scope, and getting the same result in a slightly different way:
class UserLocation < ActiveRecord::Base
  def self.latest_per_user
    where("user_locations.created_at = (SELECT MAX(ul2.created_at) FROM user_locations ul2 WHERE ul2.user_id = user_locations.user_id)")
  end
end
Then you just use:
UserLocation.latest_per_user.where(:user_id => ['user1', 'user2'])
... to get the required data set.
I'm not an expert in SQL. I have an SQL statement:
SELECT * FROM articles WHERE article_id IN
(SELECT distinct(content_id) FROM contents_by_cats WHERE cat_id='$cat')
AND permission='true' AND date <= '$now_date_time' ORDER BY date DESC;
Table contents_by_cats has 11000 rows.
Table articles has 2700 rows.
The variables $now_date_time and $cat are PHP variables.
This query takes about 10 seconds to return the values (I think because it has nested SELECT statements), and 10 seconds is a long time.
How can I achieve this in another way (a view or a JOIN)?
I think a JOIN will help me here, but I don't know how to use it properly for the SQL statement I mentioned.
Thanks in advance.
A JOIN is exactly what you are looking for. Try something like this:
SELECT DISTINCT articles.*
FROM articles
JOIN contents_by_cats ON articles.article_id = contents_by_cats.content_id
WHERE contents_by_cats.cat_id='$cat'
AND articles.permission='true'
AND articles.date <= '$now_date_time'
ORDER BY date DESC;
If your query is still not as fast as you would like, then check that you have indexes on articles.article_id, contents_by_cats.content_id, and contents_by_cats.cat_id. Depending on the data, you may want an index on articles.date as well.
Do note that if the $cat and $now_date_time values are coming from a user, then you should really be preparing and binding the query rather than just dumping these values into the query string.
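As an illustration, here is the same query using MySQL's own server-side prepared-statement syntax (in PHP you would normally use your driver's placeholders, e.g. PDO, instead):
PREPARE stmt FROM
    'SELECT DISTINCT articles.*
     FROM articles
     JOIN contents_by_cats ON articles.article_id = contents_by_cats.content_id
     WHERE contents_by_cats.cat_id = ?
       AND articles.permission = ''true''
       AND articles.date <= ?
     ORDER BY articles.date DESC';

SET @cat = 'some_category';   -- placeholder value; would come from the application
SET @now_date_time = NOW();

EXECUTE stmt USING @cat, @now_date_time;
DEALLOCATE PREPARE stmt;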
This is the query we are starting with:
SELECT a.*
FROM articles a
WHERE article_id IN (SELECT distinct(content_id)
FROM contents_by_cats
WHERE cat_id ='$cat'
) AND
permission ='true' AND
date <= '$now_date_time'
ORDER BY date DESC;
Two things will help this query. The first is to rewrite it using exists rather than in and to simplify the subquery:
SELECT a.*
FROM articles a
WHERE EXISTS (SELECT 1
FROM contents_by_cats cbc
WHERE cbc.content_id = a.article_id and cat_id = '$cat'
) AND
permission ='true' AND
date <= '$now_date_time'
ORDER BY date DESC;
Second, you want indexes on both articles and contents_by_cats:
create index idx_articles_3 on articles(permission, date, article_id);
create index idx_contents_by_cats_2 on contents_by_cats(content_id, cat_id);
By the way, instead of $now_date_time, you can just use the now() function in MySQL.
I have this query:
select *
from transaction_batch
where id IN
(
select MAX(id) as id
from transaction_batch
where status_id IN (1,2)
group by status_id
);
The inner query runs very fast (less than 0.1 seconds) and returns two IDs, one for status 1 and one for status 2; the outer query then selects based on the primary key, so it is indexed. EXPLAIN says that it's searching 135k rows ("Using where" only), and I cannot for the life of me figure out why this is so slow.
The inner query is run separately for every row of your table, over and over again.
As there is no reference to the outer query in the inner query, I suggest you split the two queries apart and feed the results of the inner query into the outer one, for example with a derived-table join:
select b.*
from transaction_batch b
inner join (
select max(id) as id
from transaction_batch
where status_id in (1, 2)
group by status_id
) bm on b.id = bm.id
My first post here... sorry about the lack of formatting.
I had a performance problem, shown below:
90 sec: WHERE [Column] LIKE (SELECT [Value] FROM [Table])  -- dynamic, slow
 1 sec: WHERE [Column] LIKE ('A','B','C')                  -- hardcoded, fast
 1 sec: WHERE @CSV LIKE CONCAT('%',[Column],'%')           -- solution, below
I had tried joining rather than subquerying.
I had also tried a hardcoded CTE.
I had lastly tried a temp table.
None of these standard options worked, and I was not willing to go the sp_execute route.
The only solution that worked was:
DECLARE @CSV nvarchar(max) = (SELECT STRING_AGG([Value], ',') FROM [Table]);
-- This yields @CSV = 'A,B,C'
...
WHERE @CSV LIKE CONCAT('%', [Column], '%')
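Pieced together as a runnable T-SQL sketch ([Data] stands in for whatever table [Column] belongs to; note the variable must be @CSV, since a leading # would denote a temp table):
-- Build the pattern list once (STRING_AGG requires SQL Server 2017+).
DECLARE @CSV nvarchar(max) = (SELECT STRING_AGG([Value], ',') FROM [Table]);

SELECT *
FROM [Data]
-- Wrapping both sides in delimiters guards against partial matches
-- (e.g. a [Column] value of 'A' matching inside 'AB'):
WHERE CONCAT(',', @CSV, ',') LIKE CONCAT('%,', [Column], ',%');
The delimiter wrapping is an extra safeguard beyond the original one-liner; the bare @CSV LIKE CONCAT('%',[Column],'%') form from the post works the same way when the values cannot be substrings of one another.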