I have two tables with huge amounts of data (~1.8 million rows in the main one, ~1.2 million in the secondary one), as follows:
subscriber_table (id, name, email, country, account_status, ...)
subscriber_payment_table (id, subscriber_id, payment_type, payment_credential)
My end goal is a table containing all the users and their payment records (NULL if none exist), created up to yesterday, and with account_status = 1 (active).
Not all subscribers have a corresponding subscriber_payment row, so an INNER JOIN isn't a viable option, and a LEFT JOIN has SQL timing out my query after 2 hrs of heavy processing.
SELECT
`subscribers`.`id` AS `id`,
`subscribers`.`email` AS `email`,
`subscribers`.`name` AS `name`,
`subscribers`.`geoloc_country` AS `country`,
`subscribers_payment`.`payment_type` AS `paymentType`,
`subscribers_payment`.`payment_credential` AS `paymentCredential`,
`subscribers`.`create_datetime` AS `createdAt`
FROM
`subscribers`
LEFT JOIN
`subscribers_payment` ON (`subscribers_payment`.`subscriberId` = `subscribers`.`id`)
WHERE
`subscribers`.`account_status` = 1
AND DATE_FORMAT(CAST(`subscribers`.`create_datetime` AS DATE), '%Y-%m-%d') < curdate()
As mentioned, this query takes too long and ends up timing out without completing.
I've also considered having a UNION, between "All the Subscribers" and "Subscribers with Payment".
(
SELECT
`subscribers`.`id` AS `id`,
`subscribers`.`email` AS `email`,
`subscribers`.`name` AS `name`,
`subscribers`.`geoloc_country` AS `country`,
null AS `paymentType`,
null AS `paymentCredential`,
`subscribers`.`create_datetime` AS `createdAt`
FROM
`subscribers`
WHERE
`subscribers`.`account_status` = 1
AND DATE_FORMAT(CAST(`subscribers`.`create_datetime` AS DATE), '%Y-%m-%d') < curdate())
UNION
(
SELECT
`subscribers`.`id` AS `id`,
`subscribers`.`email` AS `email`,
`subscribers`.`name` AS `name`,
`subscribers`.`geoloc_country` AS `country`,
`subscribers_payment`.`payment_type` AS `paymentType`,
`subscribers_payment`.`payment_credential` AS `paymentCredential`,
`subscribers`.`create_datetime` AS `createdAt`
FROM
`subscribers`
INNER JOIN
`subscribers_payment` ON (`subscribers_payment`.`subscriberId` = `subscribers`.`id`)
WHERE
`subscribers`.`account_status` = 1
AND DATE_FORMAT(CAST(`subscribers`.`create_datetime` AS DATE), '%Y-%m-%d') < curdate())
The problem with that implementation is that I'm getting duplicate rows: I'm using a UNION, but it isn't collapsing my results, because the two branches carry different values in the paymentType and paymentCredential columns.
This query runs in about ~2 mins, so it is much more feasible for me. I just need to eliminate the duplicate records, unless there's a wiser option here.
Disclaimer: we're using MyISAM tables, so having foreign keys to speed up the queries is a no-go.
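One way to keep the faster UNION approach while eliminating the duplicates is to make the two branches disjoint: restrict the first branch to subscribers that have no payment row at all, and switch to UNION ALL so MySQL can also skip the deduplication sort. A sketch against the columns used above (untested):
SELECT s.id, s.email, s.name, s.geoloc_country AS country,
       NULL AS paymentType, NULL AS paymentCredential,
       s.create_datetime AS createdAt
FROM subscribers s
WHERE s.account_status = 1
  AND s.create_datetime < CURDATE()
  AND NOT EXISTS (SELECT 1 FROM subscribers_payment sp
                  WHERE sp.subscriberId = s.id)
UNION ALL
SELECT s.id, s.email, s.name, s.geoloc_country,
       sp.payment_type, sp.payment_credential,
       s.create_datetime
FROM subscribers s
INNER JOIN subscribers_payment sp ON sp.subscriberId = s.id
WHERE s.account_status = 1
  AND s.create_datetime < CURDATE();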
For this query:
SELECT . . .
FROM subscribers s LEFT JOIN
subscribers_payment sp
ON sp.subscriberId = s.id
WHERE s.account_status = 1 AND
s.create_datetime < curdate();
Then, you want an index on subscribers(account_status, create_datetime, id) and on subscribers_payment(subscriberId).
I am guessing that the index on subscribers_payment(subscriberId) is missing, which would explain the performance problem.
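In MySQL, those two indexes could be created like so (the index names here are my own):
ALTER TABLE subscribers
  ADD INDEX idx_status_created_id (account_status, create_datetime, id);
ALTER TABLE subscribers_payment
  ADD INDEX idx_subscriber (subscriberId);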
Notes:
Use table aliases -- they make the query easier to write and read.
There should be no need to convert a datetime to a string for comparison purposes.
There is no need to use backticks for all identifiers. They just make the query harder to write and read.
I have ~6 tables where I have to count or sum fields based on matching site_ids and dates. I have the following query, with many subqueries, which takes an extraordinary amount of time to run. I am certain there is an easier, more efficient way, but I am rather new to these more complex queries. I have read about optimizations, specifically using JOIN ... ON, but I'm struggling to understand and implement them.
The goal is to speed this up and not bring my small server to its knees when it runs. Any assistance or direction would be VERY much appreciated!
SELECT date(date_added) as dt_date,
site_id as dt_site_id,
(SELECT site_id from branch_mappings bm WHERE mark_id_site = dt.site_id) as site_id,
(SELECT parent_id from branch_mappings bm WHERE mark_id_site = dt.site_id) as main_site_id,
(SELECT corp_owned from branch_mappings bm WHERE mark_id_site = dt.site_id) as corp_owned,
count(id) as dt_calls,
(SELECT count(date_submitted) FROM mark_unbounce ub WHERE date(date_submitted) = dt_date AND ub.site_id = dt.site_id) as ub,
(SELECT count(timestamp) FROM mark_wordpress_contact wp WHERE date(timestamp) = dt_date AND wp.site_id = dt.site_id) as wp,
(SELECT count(added_on) FROM m_shrednations sn WHERE date(added_on) = dt_date AND sn.description = dt.site_id) as sn,
(SELECT sum(users) FROM mark_ga ga WHERE date(ga.date) = dt_date AND channel LIKE 'Organic%' AND ga.site_id = dt.site_id) as ga_organic
FROM mark_dialogtech dt
WHERE site_id is not null
GROUP BY site_name, dt_date
ORDER BY site_name, dt_date;
What you're doing is the equivalent of asking your server to query 7+ different tables every time you run this query. Personally, I use joins and nested queries because I can whittle down to what I need.
The first 3 subqueries can be replaced with...
SELECT date(date_added) as dt_date,
dt.site_id as dt_site_id,
bm.site_id as site_id,
bm.parent_id as main_site_id,
bm.corp_owned as corp_owned
FROM mark_dialogtech dt
INNER JOIN branch_mappings bm
ON bm.mark_id_site = dt.site_id
I'm not sure why you are running the last 3. Is there a business requirement? If so, consider how often this is to be run and when.
If absolutely necessary, add those to the joins like...
FROM mark_dialogtech dt
INNER JOIN
(SELECT site_id, count(date_submitted) AS ub_count FROM mark_unbounce GROUP BY site_id) ub
on ub.site_id = dt.site_id
This should limit the results to only records where the site_id exists in both the mark_dialogtech and mark_unbounce (or whatever table). From my experience, this method has sped things up.
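Pulled together, pre-aggregating each side before joining might look like this (a sketch, assuming you want one row per site and day; the aliases are mine, and the other per-table counts would join in the same way):
SELECT dt.dt_date, dt.dt_site_id, dt.dt_calls, ub.ub_count
FROM (SELECT date(date_added) AS dt_date, site_id AS dt_site_id,
             count(id) AS dt_calls
      FROM mark_dialogtech
      WHERE site_id IS NOT NULL
      GROUP BY dt_date, dt_site_id) dt
INNER JOIN (SELECT date(date_submitted) AS ub_date, site_id,
                   count(*) AS ub_count
            FROM mark_unbounce
            GROUP BY ub_date, site_id) ub
        ON ub.site_id = dt.dt_site_id
       AND ub.ub_date = dt.dt_date;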
Still, my concern is the number of aggregations you're performing. If they can be cached to a dashboard and pulled during slow times, that would be best.
It's hard to analyze your query without data examples, but in your case I highly recommend using CTEs (Common Table Expressions). Check this:
https://www.sqlpedia.pl/cte-common-table-expressions/
CTEs do not have a physical representation in tempdb the way temporary tables or table variables do. A CTE can be viewed as a temporary, non-materialized view. When MSSQL executes a query and encounters a CTE, it replaces the reference to that CTE with its definition. Therefore, if the CTE data is used several times in a given query, the same code will be executed several times, and MSSQL does not optimize it. So it will work well only for small amounts of data, as in your case.
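For illustration, a minimal CTE over the tables from this question might look like the following (requires MySQL 8.0+ or MariaDB 10.2+; the query shape is an assumption):
WITH daily_calls AS (
    SELECT date(date_added) AS dt_date, site_id, count(id) AS dt_calls
    FROM mark_dialogtech
    WHERE site_id IS NOT NULL
    GROUP BY dt_date, site_id
)
SELECT dc.dt_date, dc.site_id, dc.dt_calls, bm.parent_id
FROM daily_calls dc
INNER JOIN branch_mappings bm ON bm.mark_id_site = dc.site_id;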
Appreciate all the responses.
I ended up creating a Python script to run the queries separately and insert the results into the table for the respective KPI. So I scrapped the idea of a single query due to performance. I concatenated each date and site_id to create the id, then leveraged ON DUPLICATE KEY UPDATE with each INSERT statement.
The Python dictionaries look like this, and I simply looped over them. Again, thanks for the help.
SELECT STATEMENTS (Python Dict)
"dt":"SELECT date(date_added) as dt_date, site_id as dt_site, count(site_id) as dt_count FROM mark_dialogtech WHERE site_id is not null GROUP BY dt_date, dt_site ORDER BY dt_date, dt_site;",
"ub":"SELECT date_submitted as ub_date, site_id as ub_site, count(site_id) as ub_count FROM mark_unbounce WHERE site_id is not null GROUP BY ub_date, ub_site;",
"wp":"SELECT date(timestamp) as wp_date, site_id as wp_site, count(site_id) as wp_count FROM mark_wordpress_contact WHERE site_id is not null GROUP BY wp_date, wp_site;",
"sn":"SELECT date(added_on) as sn_date, description as sn_site, count(description) as sn_count FROM m_shrednations WHERE description <> '' GROUP BY sn_date, sn_site;",
"ga":"SELECT date as ga_date, site_id as ga_site, sum(users) as ga_count FROM mark_ga WHERE users is not null GROUP BY ga_date, ga_site;"
INSERT STATEMENTS (Python Dict)
"dt":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, dt_calls, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE dt_Calls={dbdata[3]}, added_on='{dbdata[4]}';",
"ub":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, ub, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE ub={dbdata[3]}, added_on='{dbdata[4]}';",
"wp":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, wp, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE wp={dbdata[3]}, added_on='{dbdata[4]}';",
"sn":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, sn, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE sn={dbdata[3]}, added_on='{dbdata[4]}';",
"ga":f"INSERT INTO mark_helper_rollup (id, on_date, site_id, ga_organic, added_on) VALUES ('{dbdata[0]}','{dbdata[1]}',{dbdata[2]},{dbdata[3]},'{dbdata[4]}') ON DUPLICATE KEY UPDATE ga_organic={dbdata[3]}, added_on='{dbdata[4]}';",
It would be very difficult to analyze the query without the data. Anyway!
Try joining the tables and grouping; that should improve the performance.
Here is a LEFT JOIN sample:
SELECT column names
FROM table1
LEFT JOIN table2
ON table1.common_column = table2.common_column;
Check this for more detailed information: https://learnsql.com/blog/how-to-left-join-multiple-tables/
This is my query, with its performance (from the slow_query_log):
SELECT j.`offer_id`, o.`offer_name`, j.`success_rate`
FROM
(
SELECT
t.`offer_id`,
(
SUM(CASE WHEN `offer_id` = t.`offer_id` AND `sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*)
) AS `success_rate`
FROM `tblSales` AS t
WHERE DATE(t.`sales_time`) = CURDATE()
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
) AS j
LEFT JOIN `tblOffers` AS o
ON j.`offer_id` = o.`offer_id`
LIMIT 5;
# Time: 180113 18:51:19
# User@Host: root[root] @ localhost [127.0.0.1] Id: 71
# Query_time: 10.472599 Lock_time: 0.001000 Rows_sent: 0 Rows_examined: 1156134
Here, tblOffers has all the OFFERS listed, and tblSales contains all the sales. What I am trying to find out is the top-selling offers, based on the success rate (i.e. those sales which are SUCCESS).
The query works fine and provides the output I need, but it appears to be a bit slow.
offer_id and sales_status are already indexed in tblSales. So do you have any suggestions for improving the inner query (where it calculates the success rate) so that performance improves? I have been playing with the math for more than 2 hrs but couldn't find a better way.
Btw, tblSales has lots of data. It contains those sales which are SUCCESSFUL, FAILED, PENDING, etc.
Thank you
EDIT
As you requested, I am including the table design as well (only relevant fields are included):
tblSales
`sales_id` bigint UNSIGNED NOT NULL AUTO_INCREMENT,
`offer_id` bigint UNSIGNED NOT NULL DEFAULT '0',
`sales_time` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`sales_status` ENUM('WAITING', 'SUCCESS', 'FAILED', 'CANCELLED') NOT NULL DEFAULT 'WAITING',
PRIMARY KEY (`sales_id`),
KEY (`offer_id`),
KEY (`sales_status`)
There are some other fields in this table that hold other info (amount, user_id, etc.), which are not relevant to my question.
Numerous "problems", none of which involve "math".
JOINs make things difficult. LEFT JOIN says "I don't care whether the row exists in the 'right' table" (I suspect you don't need LEFT?). But it also says "there may be multiple rows in the right table". Based on the column names, I will guess that there is only one offer_name for each offer_id. If this is correct, then here is my first recommendation. (This will convince the Optimizer that there is no issue with the JOIN.) Change from
SELECT ..., o.offer_name, ...
LEFT JOIN `tblOffers` AS o ON j.`offer_id` = o.`offer_id`
...
to
SELECT ...,
( SELECT offer_name FROM tbloffers WHERE offer_id = j.offer_id
) AS offer_name, ...
It also gets rid of a bug wherein you are assuming that the inner ORDER BY will be preserved for the LIMIT. This used to be the case, but in newer versions of MariaDB / MySQL, it is not. The ORDER BY in a "derived table" (your subquery) is now ignored.
2 down, a few more to go.
"Don't hide an indexed column in a function." I am referring to DATE(t.sales_time) = CURDATE(). Assuming you have no sales_time values for the 'future', then that test can be changed to t.sales_time >= CURDATE(). If you really need to restrict to just today, then do this:
AND sales_time >= CURDATE()
AND sales_time < CURDATE() + INTERVAL 1 DAY
The ORDER BY and the LIMIT should usually be put together. In your case, you may as well add the LIMIT to the "derived table", thereby leading to only 5 rows for the outer query to work with. But... There is still the question of getting them sorted correctly. So change from
SELECT ...
FROM ( SELECT ...
ORDER BY ... )
LIMIT ...
to
SELECT ...
FROM ( SELECT ...
ORDER BY ...
LIMIT 5 ) -- trim sooner
ORDER BY ... -- deal with the loss of ordering from derived table
Rolling it all together, I have
SELECT j.`offer_id`,
( SELECT offer_name
FROM tbloffers
WHERE offer_id = j.offer_id
) AS offer_name,
j.`success_rate`
FROM
( SELECT t.`offer_id`,
AVG(t.sales_status = 'SUCCESS') AS `success_rate`
FROM `tblSales` AS t
WHERE t.sales_time >= CURDATE()
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 5
) AS j
ORDER BY `success_rate` DESC;
(I took the liberty of shortening the SUM(...) in two ways.)
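The two shortenings: the offer_id = t.offer_id test inside the SUM() was redundant (the GROUP BY already guarantees it), and since a comparison in MySQL evaluates to 1 or 0 (sales_status is NOT NULL per the schema), AVG() of it equals the SUM(...)/COUNT(*) ratio. Both expressions below return the same value:
SELECT SUM(CASE WHEN sales_status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*) AS rate_verbose,
       AVG(sales_status = 'SUCCESS') AS rate_short
FROM tblSales;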
Now for the indexes...
tblSales needs at least INDEX(sales_time), but let's go for a "covering" index (with sales_time specifically first):
INDEX(sales_time, sales_status, offer_id)
If tbloffers has PRIMARY KEY(offer_id), then no further index is worth adding. Else, add this covering index (in this order):
INDEX(offer_id, offer_name)
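In DDL form, those recommendations would be something like the following (index names are illustrative; the second statement only applies if tbloffers lacks PRIMARY KEY(offer_id)):
ALTER TABLE tblSales ADD INDEX idx_time_status_offer (sales_time, sales_status, offer_id);
ALTER TABLE tbloffers ADD INDEX idx_offer_name (offer_id, offer_name);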
(Apologies to other Answerers; I stole some of your ideas.)
Here, tblOffers has all the OFFERS listed, and tblSales contains all the sales. What I am trying to find out is the top-selling offers, based on the success rate (i.e. those sales which are SUCCESS).
Approach this with a simple JOIN and GROUP BY:
SELECT s.offer_id, o.offer_name,
AVG(s.sales_status = 'SUCCESS') as success_rate
FROM tblSales s JOIN
tblOffers o
ON o.offer_id = s.offer_id
WHERE s.sales_time >= CURDATE() AND
s.sales_time < CURDATE() + INTERVAL 1 DAY
GROUP BY s.offer_id, o.offer_name
ORDER BY success_rate DESC;
Notes:
The use of date arithmetic allows the query to make use of an index on tblSales(sales_time) -- or better yet tblSales(sales_time, offer_id, sales_status).
The arithmetic for success_rate has been simplified -- although this has minimal impact on performance.
I added offer_name to the GROUP BY. If you are learning SQL, you should always have all the unaggregated keys in the GROUP BY clause.
A LEFT JOIN is only needed if you have offers in tblSales which are not in tblOffers. I am guessing you have proper foreign key relationships defined, and this is not the case.
Based on the limited information you have provided (I mean the table schema), you could try the following.
SELECT `o`.`offer_id`, `o`.`offer_name`, SUM(CASE WHEN `t`.`sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) AS `success_rate`
FROM `tblOffers` `o`
INNER JOIN `tblSales` `t`
ON `o`.`offer_id` = `t`.`offer_id`
WHERE DATE(`t`.`sales_time`) = CURDATE()
GROUP BY `o`.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 0,5;
Without knowing your schema, the lowest hanging fruit I see is this part....
WHERE DATE(t.`sales_time`) = CURDATE()
Try changing that to a date range the index can use, for example:
WHERE t.sales_time >= CURDATE() AND t.sales_time < CURDATE() + INTERVAL 1 DAY
Question: What is the quickest way to assign values from multiple rows of a single column into separate variables?
Using MySQL 5.6, I have a table that stores data in regular time intervals. In the insert trigger, I am attempting to detect when, among other similar things, a "peak" in the values has occurred, defined as two consecutive increasing values to a highest value, followed by two consecutive decreasing values.
Because there are several similar calculations to perform on the same data, I am first retrieving the values from a column for the last five rows into variables. I have the ID numbers of the rows.
One way to do this would be individual SELECT queries for each ID. However, I would like to consolidate them into a single query for speed, since the data will be entered into the database in blocks of 40k rows at a time.
Individual Selects:
SET currentValue = (SELECT `value` FROM `table` WHERE `tableid`=currentID);
SET prev1Value = (SELECT `value` FROM `table` WHERE `tableid`=prev1ID);
SET prev2Value = (SELECT `value` FROM `table` WHERE `tableid`=prev2ID);
SET prev3Value = (SELECT `value` FROM `table` WHERE `tableid`=prev3ID);
SET prev4Value = (SELECT `value` FROM `table` WHERE `tableid`=prev4ID);
I know another way to do this would be to use joins on the same table. However, that seems like it could be slow. Is there a faster way to do this without JOINs? Thanks.
Multiple JOINs:
SELECT a0.`value`, a1.`value`, a2.`value`, a3.`value`, a4.`value`
INTO currentValue, prev1Value, prev2Value, prev3Value, prev4Value
FROM (SELECT `value` FROM `table` WHERE `tableid`=currentID) AS a0
INNER JOIN (SELECT `value` FROM `table` WHERE `tableid`=prev1ID) AS a1
INNER JOIN (SELECT `value` FROM `table` WHERE `tableid`=prev2ID) AS a2
INNER JOIN (SELECT `value` FROM `table` WHERE `tableid`=prev3ID) AS a3
INNER JOIN (SELECT `value` FROM `table` WHERE `tableid`=prev4ID) AS a4
Another thing I thought of is something like:
SELECT IF(`tableid`=currentID, `value`, NULL)
, IF(`tableid`=prev1ID, `value`, NULL)
, IF(`tableid`=prev2ID, `value`, NULL)
, IF(`tableid`=prev3ID, `value`, NULL)
, IF(`tableid`=prev4ID, `value`, NULL)
INTO currentValue, prev1Value, prev2Value, prev3Value, prev4Value
FROM `table`
WHERE `tableid` IN (currentID, prev1ID, prev2ID, prev3ID, prev4ID);
But I haven't tested it yet.
For now I am going to go with the JOINs until everything is in place, and I can test the IF statement model. However, if someone knows that isn't going to work or has another method that would work better, I would appreciate it. Also, would putting this into a VIEW help with the speed of the query?
Thanks.
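A note on the IF idea above: as written, that query returns one row per matching tableid, and SELECT ... INTO requires exactly one row. Wrapping each IF in MAX() collapses the five rows into one; a sketch (untested, using the question's variable names):
SELECT MAX(IF(`tableid` = currentID, `value`, NULL)),
       MAX(IF(`tableid` = prev1ID, `value`, NULL)),
       MAX(IF(`tableid` = prev2ID, `value`, NULL)),
       MAX(IF(`tableid` = prev3ID, `value`, NULL)),
       MAX(IF(`tableid` = prev4ID, `value`, NULL))
INTO currentValue, prev1Value, prev2Value, prev3Value, prev4Value
FROM `table`
WHERE `tableid` IN (currentID, prev1ID, prev2ID, prev3ID, prev4ID);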
I was wondering if it is possible, using MySQL, to somehow wrap the SELECT of the inner query so that the outer query can use it in its WHERE clause.
SELECT
`FirstName`,
`Surname`,
`AddressLine1`,
`AddressLine2`,
`AddressLine3`,
`AddressLocale`,
`AddressRegion`,
`AddressZip`,
`AddressCountry`,
`CopyShipAddress`
FROM `Contacts`
WHERE `EEID`
IN
(SELECT CONCAT_WS(',', `Sender`, `Receiver`, `Buyer` ) AS EEID_list
FROM `Transactions`
WHERE `TransactionID` = 3)
Sender, Receiver and Buyer are EEIDs. Perhaps there is a function other than CONCAT_WS I can use that will provide me with this functionality.
Don't use CONCAT_WS to build the list for an IN query; it may not give correct data.
IN here compares EEID against the single concatenated string, not against a list of values: with integers the string gets implicitly cast, so at best only the first value matches, and with strings it won't match at all because each value would need its own quotes.
Try this instead:
SELECT
`FirstName`,
`Surname`,
`AddressLine1`,
`AddressLine2`,
`AddressLine3`,
`AddressLocale`,
`AddressRegion`,
`AddressZip`,
`AddressCountry`,
`CopyShipAddress`
FROM `Contacts`
WHERE `EEID`
IN
(
select Sender as eeid FROM Transactions WHERE TransactionId=3
UNION ALL
select Receiver as eeid FROM Transactions WHERE TransactionId=3
UNION ALL
select Buyer as eeid FROM Transactions WHERE TransactionId=3
)
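If you would rather keep the concatenated string, MySQL's FIND_IN_SET does parse a comma-separated list; a sketch (assuming TransactionID identifies a single row, and only safe if the EEID values themselves never contain commas):
SELECT `FirstName`, `Surname`, `AddressLine1`, `AddressLine2`, `AddressLine3`,
       `AddressLocale`, `AddressRegion`, `AddressZip`, `AddressCountry`, `CopyShipAddress`
FROM `Contacts`
WHERE FIND_IN_SET(`EEID`,
      (SELECT CONCAT_WS(',', `Sender`, `Receiver`, `Buyer`)
       FROM `Transactions`
       WHERE `TransactionID` = 3));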
I have a table whose data looks like this:
INSERT INTO `cm_case_notes` (`id`, `case_id`, `date`, `time`, `description`, `username`, `supervisor`, `datestamp`) VALUES
(45977, '1175', '2010-11-19 16:27:15', 600, 'Motion hearing...Denied.', 'bjones', 'jharvey,', '2010-11-19 21:27:15'),
(46860, '1175', '2010-12-11 16:11:19', 300, 'Semester Break Report', 'bjones', 'jharvey,', '2010-12-11 21:11:19'),
(48034, '1175', '2011-05-04 17:30:03', 300, 'test', 'bjones', 'jharvey,', '2011-05-04 22:30:03'),
(14201, '1175', '2009-02-06 00:00:00', 3600, 'In court to talk to prosecutor, re: the file', 'csmith', 'sandrews', '2009-02-07 14:33:34'),
(14484, '1175', '2009-02-13 00:00:00', 6300, 'Read transcript, note taking', 'csmith', 'sandrews', '2009-02-16 17:22:36');
I'm trying to select the most recent case note (by date) for each user on each case. The best I've come up with is:
SELECT * , MAX( `date` ) FROM cm_case_notes WHERE case_id = '1175' GROUP BY username
This, however, doesn't give the most recent entry, but rather the first one for each user. I've seen several similar posts here, but I just can't seem to get my brain around them. Would anybody take pity on the SQL-deficient and help?
If you want only the dates of the most recent case note for every user and every case, you can use this:
--- Q ---
SELECT case_id
, username
, MAX( `date` ) AS recent_date
FROM cm_case_notes
GROUP BY case_id
, username
If you want all the columns from these rows (those with the most recent date), follow the Quassnoi link below for various solutions (or the other provided links). The easiest to write would be to make the above query into a subquery and join it to cm_case_notes:
SELECT cn.*
FROM
cm_case_notes AS cn
JOIN
( Q ) AS q
ON ( q.case_id, q.username, q.recent_date )
= ( cn.case_id, cn.username, cn.`date` )
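Written out with Q substituted in, the full statement reads:
SELECT cn.*
FROM
    cm_case_notes AS cn
JOIN
    ( SELECT case_id
           , username
           , MAX( `date` ) AS recent_date
      FROM cm_case_notes
      GROUP BY case_id
             , username
    ) AS q
    ON ( q.case_id, q.username, q.recent_date )
     = ( cn.case_id, cn.username, cn.`date` )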
If you just want the latest case note, but only for a particular case_id, then you could add the WHERE condition to both cn and Q (Q slightly modified):
SELECT cn.*
FROM
cm_case_notes AS cn
JOIN
( SELECT username
, MAX( `date` ) AS recent_date
FROM cm_case_notes
WHERE case_id = #particular_case_id
GROUP BY username
) AS q
ON ( q.username, q.recent_date )
= ( cn.username, cn.`date` )
WHERE cn.case_id = #particular_case_id
The reason why you don't get what you would like to fetch from the database is the use of SELECT * together with GROUP BY.
In fact, only the results of aggregate functions and/or the GROUP BY field(s) themselves can be safely SELECTed. Selecting anything else leads to indeterminate results (the exact result depends on order, query optimization and such).
What you are trying to achieve is called fetching the "groupwise maximum". This is a common task in SQL; you can read a nice writeup here:
http://jan.kneschke.de/projects/mysql/groupwise-max/
or in the MySQL manual here:
http://dev.mysql.com/doc/refman/5.1/en/example-maximum-column-group-row.html
or a detailed long explanation by stackoverflow user Quassnoi here:
http://explainextended.com/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/
Have you considered a DESC ordering and simply limiting 1?
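For a single user on a single case, that would be the following sketch; note that it returns only one row overall, so by itself it does not solve the per-user requirement:
SELECT *
FROM cm_case_notes
WHERE case_id = '1175'
  AND username = 'bjones'
ORDER BY `date` DESC
LIMIT 1;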