I am having performance issues with a query. I have 21 million records across the tables, and two of the tables I'm querying here have 8 million rows each; queried individually, they are very quick. The query I've written isn't very good, in my opinion, but it's the only way I know how to do it.
This query takes 65 seconds and I need to get it under 1 second. I think that's possible if I can avoid all the SELECT subqueries, but once again, I am not sure how else to do it with my SQL knowledge.
Database server version is MariaDB 10.6.
SELECT
pa.`slug`,
(
SELECT
SUM(`impressions`)
FROM `rh_pages_gsc_country`
WHERE `page_id` = pa.`page_id`
AND `country` = 'aus'
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as au_impressions,
(
SELECT
SUM(`clicks`)
FROM `rh_pages_gsc_country`
WHERE `page_id` = pa.`page_id`
AND `country` = 'aus'
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as au_clicks,
(
SELECT
COUNT(`keywords_id`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as keywords,
(
SELECT
AVG(`position`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as avg_pos,
(
SELECT
AVG(`ctr`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN
(
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
)
) as avg_ctr
FROM `rh_pages` pa
WHERE pa.`site_id` = 13
ORDER BY au_impressions DESC, keywords DESC, slug DESC
I don't think the table structure is needed here, as it's basically shown in the query, but here is a photo of the constraints and table types.
Any help is greatly appreciated.
Do NOT normalize any column that will be regularly used in a "range scan", such as date. The following is terribly slow:
AND `date_id` IN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH
AND NOW() )
It also consumes extra space to have BIGINT (8 bytes) pointing to a DATE (3 bytes).
Once you move the date to the various tables, the subqueries simplify, such as
SELECT AVG(`position`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date_id` IN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH
AND NOW() )
becomes
SELECT AVG(`position`)
FROM `rh_pages_gsc_keywords`
WHERE `page_id` = pa.`page_id`
AND `date` >= NOW() - INTERVAL 12 MONTH
I'm assuming that nothing after "NOW" has yet been stored.
If there are dates in the future, then add
AND `date` < NOW()
Each table will probably need a new index, such as
INDEX(page_id, date) -- in that order
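As a rough sketch of what that change might look like for one table: this assumes the new column is a DATE, that `date_id` is unique in `rh_pages_gsc_dates`, and it uses a made-up index name (`idx_page_date`). The same steps would be repeated for `rh_pages_gsc_country`. Try it on a copy first, since these statements rewrite a large table.

```sql
-- 1. Add the denormalized date column (nullable until it is backfilled).
ALTER TABLE `rh_pages_gsc_keywords`
    ADD COLUMN `date` DATE NULL;

-- 2. Backfill it from the lookup table.
UPDATE `rh_pages_gsc_keywords` AS k
JOIN   `rh_pages_gsc_dates`    AS d ON d.`date_id` = k.`date_id`
SET    k.`date` = d.`date`;

-- 3. Composite index: equality column first, range column last.
ALTER TABLE `rh_pages_gsc_keywords`
    ADD INDEX `idx_page_date` (`page_id`, `date`);
```

With `page_id` first, the optimizer can jump to one page's rows and then scan only the requested date range within them.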
(Yes, the "JOIN" suggestion by others is a good one. It's essentially orthogonal to my suggestions above and below.)
After you have made those changes, if the performance is still not good enough, we can discuss Summary Tables.
Your query is aggregating (summarizing) rows from two different detail tables, rh_pages_gsc_country and rh_pages_gsc_keywords, and doing so for a particular date range. And it has a lot of correlated subqueries.
The first steps in your path to better performance are:
1. Converting your correlated subqueries to independent subqueries, then JOINing them.
2. Writing one subquery for each detail table, rather than one for each column you need summarized.
You mentioned you've been struggling with this. The concept I hope you learn from this answer is this: you can often refactor away your correlated subqueries if you can come up with independent subqueries that give the same results, and then join them together. If you mention subqueries in your SELECT clause -- SELECT ... (SELECT whatever) whatever ... -- you probably have an opportunity to do this refactoring.
Here goes. First you need a subquery for your date range. You have this one right, just repeated.
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
Next you need a subquery for rh_pages_gsc_country. It is a modification of what you have. We'll fetch both SUMs in one subquery.
SELECT SUM(`impressions`) impressions,
SUM(`clicks`) clicks,
page_id, date_id
FROM `rh_pages_gsc_country`
WHERE `country` = 'aus'
GROUP BY page_id, date_id
See how this goes? This subquery yields a virtual table with exactly one row for every combination of page_id and date_id, containing the number of impressions and the number of clicks.
Next, let's join the subqueries together in a main query. This yields some columns of your result set.
SELECT pa.slug, country.impressions, country.clicks
FROM rh_pages pa
JOIN (
SELECT SUM(`impressions`) impressions,
SUM(`clicks`) clicks,
page_id, date_id
FROM `rh_pages_gsc_country`
WHERE `country` = 'aus' -- constant for country code
GROUP BY page_id, date_id
) country ON country.page_id = pa.page_id
JOIN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
) dates ON dates.date_id = country.date_id
WHERE pa.site_id = 13 -- constant for site id
ORDER BY country.impressions DESC
This runs through the rows of rh_pages_gsc_dates and rh_pages_gsc_country just once to satisfy your query. So, faster.
Finally let's do the same thing for your rh_pages_gsc_keywords table's summary.
SELECT pa.slug, country.impressions, country.clicks,
keywords.keywords, keywords.avg_pos, keywords.avg_ctr
FROM rh_pages pa
JOIN (
SELECT SUM(`impressions`) impressions,
SUM(`clicks`) clicks,
page_id, date_id
FROM `rh_pages_gsc_country`
WHERE `country` = 'aus' -- constant for country code
GROUP BY page_id, date_id
) country ON country.page_id = pa.page_id
JOIN (
SELECT COUNT(`keywords_id`) keywords,
AVG(`position`) avg_pos,
AVG(`ctr`) avg_ctr,
page_id, date_id
FROM `rh_pages_gsc_keywords`
GROUP BY page_id, date_id
) keywords ON keywords.page_id = pa.page_id
JOIN (
SELECT `date_id`
FROM `rh_pages_gsc_dates`
WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
) dates ON dates.date_id = country.date_id
AND dates.date_id = keywords.date_id
WHERE pa.site_id = 13 -- constant for site id
ORDER BY impressions DESC, keywords DESC, slug DESC
This will almost certainly be faster than what you have now. If it's fast enough, great. If not, please don't hesitate to ask another question for help, tagging it query-optimization. We will need to see your table definitions, your index definitions, and the output of EXPLAIN. Please read this before asking a followup question.
I did not, repeat not, debug any of this. That's up to you.
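One thing to check when adapting the queries above: because the subqueries group by both page_id and date_id, the joined result has one row per page per date rather than one row per page, which is what the original query returned. A possible variation, equally untested, applies the date filter inside each subquery and groups by page_id only. It assumes `date_id` is unique in `rh_pages_gsc_dates`, and it uses LEFT JOIN so that pages with no matching rows still appear (their aggregates come back as NULL, where the original COUNT gave 0).

```sql
SELECT pa.`slug`,
       country.au_impressions,
       country.au_clicks,
       keywords.keywords,
       keywords.avg_pos,
       keywords.avg_ctr
FROM `rh_pages` pa
LEFT JOIN (
    -- one row per page: totals over the last 12 months for 'aus'
    SELECT c.`page_id`,
           SUM(c.`impressions`) AS au_impressions,
           SUM(c.`clicks`)      AS au_clicks
    FROM `rh_pages_gsc_country` c
    JOIN `rh_pages_gsc_dates` d ON d.`date_id` = c.`date_id`
    WHERE c.`country` = 'aus'
      AND d.`date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
    GROUP BY c.`page_id`
) country ON country.`page_id` = pa.`page_id`
LEFT JOIN (
    -- one row per page: keyword statistics over the same period
    SELECT k.`page_id`,
           COUNT(k.`keywords_id`) AS keywords,
           AVG(k.`position`)      AS avg_pos,
           AVG(k.`ctr`)           AS avg_ctr
    FROM `rh_pages_gsc_keywords` k
    JOIN `rh_pages_gsc_dates` d ON d.`date_id` = k.`date_id`
    WHERE d.`date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
    GROUP BY k.`page_id`
) keywords ON keywords.`page_id` = pa.`page_id`
WHERE pa.`site_id` = 13
ORDER BY au_impressions DESC, keywords DESC, slug DESC;
```

Note that the subqueries still aggregate every page in the detail tables, not just those of site 13, so EXPLAIN is worth checking before relying on this.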
Related
I'm trying to select data at two different resolutions based on data points per unit of time. Right now I'm just running two queries and joining them with a UNION. To get the resolutions I want, I'm using this to get 1 data point per minute:
GROUP BY UNIX_TIMESTAMP(`datetime`) DIV 60
Just wondering if there is a more efficient way to do this?
SELECT (UNIX_TIMESTAMP(`datetime`)*1000) as `dt`, `value1`, `value2`
FROM `table`
WHERE `datetime` BETWEEN '2017-01-01' AND '2018-01-01'
GROUP BY UNIX_TIMESTAMP(`datetime`) DIV 240
UNION
SELECT (UNIX_TIMESTAMP(`datetime`)*1000) as `dt`, `value1`, `value2`
FROM `table`
WHERE `datetime` BETWEEN '2017-01-01' AND '2018-01-01'
AND TIME(`datetime`) BETWEEN TIME('12:00:00') AND TIME('13:00:00')
GROUP BY UNIX_TIMESTAMP(`datetime`) DIV 60
ORDER BY `dt` ASC;
Here's an alternative I came across. This one seems a little quicker and returns the actual values for the times selected instead of MySQL just picking one from each time group.
On this table datetime is an index.
SELECT a.`datetime`, a.`value1`, a.`value2`
FROM `table` a
INNER JOIN
(
SELECT `datetime`
FROM `table`
WHERE DATE(`datetime`) BETWEEN '2017-01-01' AND '2018-01-01'
GROUP BY UNIX_TIMESTAMP(`datetime`) DIV 240
UNION
SELECT `datetime`
FROM `table`
WHERE DATE(`datetime`) BETWEEN '2017-01-01' AND '2018-01-01'
AND TIME(`datetime`) BETWEEN '12:00:00' AND '13:00:00'
GROUP BY UNIX_TIMESTAMP(`datetime`) DIV 60
ORDER BY `datetime`
) b on a.`datetime` = b.`datetime`;
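A side note on the date filter, since `datetime` is indexed: wrapping the column in DATE() generally stops MySQL from using that index for a range scan (some newer MariaDB releases can optimize it, so check EXPLAIN). A minimal sketch of an equivalent filter for the first inner SELECT, written as a half-open range on the bare column; MIN() is my own addition so that the row picked from each 240-second bucket is deterministic:

```sql
SELECT MIN(`datetime`) AS `datetime`    -- earliest actual timestamp in each bucket
FROM `table`
WHERE `datetime` >= '2017-01-01'
  AND `datetime` <  '2018-01-02'        -- covers all of 2018-01-01, like DATE(...) BETWEEN
GROUP BY UNIX_TIMESTAMP(`datetime`) DIV 240;
```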
I have this query that I use to get statistics for blogs in our own tracking system.
I use UNION ALL over 2 tables because we aggregate data daily into one table and keep today's data in another table.
I want to show the last 10 months of traffic. This query does that, but if there is no traffic in a specific month, that row is not in the result.
I have previously used a calendar table in MySQL to join against to avoid that, but I'm simply not skilled enough to rewrite this query to join against that calendar table.
The calendar table has 1 field called "datefield", which is a date in the format YYYY-MM-DD.
This is the current query I use:
SELECT FORMAT(SUM(`count`),0) as `count`, DATE(`date`) as `date`
FROM
(
SELECT count(distinct(uniq_id)) as `count`, `timestamp` as `date`
FROM tracking
WHERE `timestamp` > now() - INTERVAL 1 DAY AND target_bid = 92
group by `datestamp`
UNION ALL
select sum(`count`),`datestamp` as `date`
from aggregate_visits
where `datestamp` > now() - interval 10 month
and target_bid = 92
group by `datestamp`
) a
GROUP BY MONTH(date)
Something like this?
select sum(COALESCE(t.`count`,0)),s.date as `date`
from DateTable s
LEFT JOIN (SELECT * FROM aggregate_visits
where `datestamp` > now() - interval 10 month
and target_bid = 92) t
ON(s.date = t.datestamp)
group by s.date
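To get one row per month, including months with no traffic, the same LEFT JOIN idea can be grouped by month. A sketch only: the table name `calendar` is a placeholder (the question names just the `datefield` column), `datestamp` is assumed to be a DATE, and today's rows from `tracking` are left out for brevity.

```sql
SELECT DATE_FORMAT(c.`datefield`, '%Y-%m') AS `month`,
       COALESCE(SUM(v.`count`), 0)         AS `count`   -- 0 for months with no rows
FROM `calendar` c
LEFT JOIN `aggregate_visits` v
       ON v.`datestamp` = c.`datefield`
      AND v.`target_bid` = 92              -- filter in ON, not WHERE, to keep empty months
WHERE c.`datefield` >  CURDATE() - INTERVAL 10 MONTH
  AND c.`datefield` <= CURDATE()
GROUP BY DATE_FORMAT(c.`datefield`, '%Y-%m')
ORDER BY `month`;
```

Putting the `target_bid` condition in the ON clause matters: in the WHERE clause it would discard the NULL-extended rows and the empty months would disappear again.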
I have a table, that pretty much looks like this:
users (id INT, masterId INT, date DATETIME)
Every user has exactly one master. But masters can have n users.
Now I want to find out how many users each master has. I'm doing that this way:
SELECT `masterId`, COUNT(`id`) AS `total` FROM `users` GROUP BY `masterId` ORDER BY `total` DESC
But now I also want to know how many new users each master has gained in the last 14 days. I could do it with this query:
SELECT `masterId`, COUNT(`id`) AS `last14days` FROM `users` WHERE `date` > DATE_SUB(NOW(), INTERVAL 14 DAY) GROUP BY `masterId` ORDER BY `last14days` DESC
Now the question: Could I somehow get this information with one query, instead of using 2 queries?
You can use conditional aggregation to do this, counting only the rows for which the condition is true. In standard SQL this would be done using a case expression inside the aggregate function:
SELECT
masterId,
COUNT(id) AS total,
SUM(CASE WHEN date > DATE_SUB(NOW(), INTERVAL 14 DAY) THEN 1 ELSE 0 END) AS last14days
FROM users
GROUP BY masterId
ORDER BY total DESC
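In MySQL specifically, a comparison evaluates to 1 or 0, so the CASE expression can be dropped. A sketch of that shorthand, which is not portable to most other databases:

```sql
SELECT
    masterId,
    COUNT(id) AS total,
    -- the comparison yields 1 (true) or 0 (false); NULL dates are ignored by SUM
    SUM(`date` > NOW() - INTERVAL 14 DAY) AS last14days
FROM users
GROUP BY masterId
ORDER BY total DESC;
```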
I'm trying to count how many results there are in each month.
This is my query:
SELECT
COUNT(*) as nb,
CONCAT(MONTH(t.date),0x3a,YEAR(t.date)) as period
FROM table1 t
WHERE t.criteria = 'value'
GROUP BY MONTH(t.date)
ORDER BY YEAR(t.date)
My Result:
nb period
---------------
7 6:2009
46 8:2009
2 10:2009
1 11:2009
14 1:2009
9 9:2010
161 7:2010
5 2:2010
88 3:2010
28 4:2010
4 5:2011
2 12:2011
The problem is that I'm sure I have results between 5:2011 and 12:2011, and in the other missing periods since 2009.
Is this a problem with my query or with my MySQL configuration?
Thanks a lot.
You have to group by both the year and the month. Otherwise your April 2012 rows are grouped with April 2011 (and April 2010 ...) rows as well.
SELECT
COUNT(*) AS nb,
CONCAT(MONTH(t.date), ':', YEAR(t.date)) AS period
FROM table1 AS t
WHERE t.criteria = 'value'
GROUP BY YEAR(t.date)
, MONTH(t.date) ;
(and is there a reason you used 0x3a and not ':'?)
You could also use some other DATE and TIME functions of MySQL so there are fewer function calls per row and probably a more efficient query:
SELECT
COUNT(*) AS nb,
DATE_FORMAT(t.date, '%m:%Y') AS period
FROM table1 AS t
WHERE t.criteria = 'value'
GROUP BY EXTRACT( YEAR_MONTH FROM t.date) ;
For several queries, it's useful to have a permanent Calendar table in your database (with all dates or all year-months) or even several Calendar tables. Example:
CREATE TABLE CalendarYear
( Year SMALLINT UNSIGNED NOT NULL
, PRIMARY KEY (Year)
) ENGINE = InnoDB ;
INSERT INTO CalendarYear
(Year)
VALUES
(1900), (1901), ..., (2099) ;
CREATE TABLE CalendarMonth
( Month TINYINT UNSIGNED NOT NULL
, PRIMARY KEY (Month)
) ENGINE = InnoDB ;
INSERT INTO CalendarMonth
(Month)
VALUES
(1), (2), ..., (12) ;
Those can also help us make the one we'll need here:
CREATE TABLE CalendarYearMonth
( Year SMALLINT UNSIGNED NOT NULL
, Month TINYINT UNSIGNED NOT NULL
, FirstDay DATE NOT NULL
, NextMonth_FirstDay DATE NOT NULL
, PRIMARY KEY (Year, Month)
) ENGINE = InnoDB ;
INSERT INTO CalendarYearMonth
(Year, Month, FirstDay, NextMonth_FirstDay)
SELECT
y.Year
, m.Month
, MAKEDATE(y.Year, 1) + INTERVAL (m.Month-1) MONTH
, MAKEDATE(y.Year, 1) + INTERVAL (m.Month) MONTH
FROM
CalendarYear AS y
CROSS JOIN
CalendarMonth AS m ;
Then you can use the Calendar tables to write more complex queries, like the variation you want (with missing months) and probably more efficiently. Tested in SQL-Fiddle:
SELECT
COUNT(t.date) AS nb,
CONCAT(cal.Month, ':', cal.Year) AS period
FROM
CalendarYearMonth AS cal
JOIN
( SELECT
YEAR(MIN(date)) AS min_year
, MONTH(MIN(date)) AS min_month
, YEAR(MAX(date)) AS max_year
, MONTH(MAX(date)) AS max_month
FROM table1
WHERE criteria = 'value'
) AS mm
ON (cal.Year, cal.Month) >= (mm.min_year, mm.min_month)
AND (cal.Year, cal.Month) <= (mm.max_year, mm.max_month)
LEFT JOIN
table1 AS t
ON t.criteria = 'value'
AND t.date >= cal.FirstDay
AND t.date < cal.NextMonth_FirstDay
GROUP BY
cal.Year, cal.Month ;
You must also GROUP BY the year:
GROUP BY MONTH(t.date), YEAR(t.date)
Your original query uses YEAR(t.date) in the SELECT clause outside of any aggregate function without grouping by it. As a result, you get at most 12 groups (one for each possible month), and for each group (which may contain dates across many years) MySQL picks a "random" year to display. Strictly speaking, this is meaningless and the query should never have been allowed to execute. But MySQL... sigh.
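If you would rather have the server reject such queries, MySQL has the ONLY_FULL_GROUP_BY SQL mode for that. A small sketch for trying it in one session; note that this statement replaces any other modes set for the session:

```sql
-- Enable strict GROUP BY checking for the current session only.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- The original query should now be rejected, because YEAR(t.date)
-- is neither aggregated nor part of the GROUP BY clause.
SELECT
    COUNT(*) AS nb,
    CONCAT(MONTH(t.date), ':', YEAR(t.date)) AS period
FROM table1 AS t
WHERE t.criteria = 'value'
GROUP BY MONTH(t.date);
```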
First of all, this is the query that creates the "player history".
It can be executed as often as you want, and it will only create new history rows for the players if there is no history row for yesterday or if the values have changed since the latest history entry in the past.
INSERT INTO `player_history` (`player_id`, `date`, `races`, `maps`, `playtime`, `points`)
SELECT `p`.`id`, DATE_SUB(NOW(), INTERVAL 1 DAY), `p`.`races`, `p`.`maps`, `p`.`playtime`, `p`.`points`
FROM `player` `p`
WHERE `p`.`playtime` IS NOT NULL
AND `p`.`playtime` > 0
AND (
SELECT `player_id`
FROM `player_history`
WHERE `player_id` = `p`.`id`
AND (
`date` = DATE_SUB(NOW(), INTERVAL 1 DAY)
OR (
`date` < DATE_SUB(NOW(), INTERVAL 1 DAY)
AND `races` = `p`.`races`
AND `points` = `p`.`points`
AND `maps` = `p`.`maps`
AND `playtime` = `p`.`playtime`
)
)
ORDER BY `date` DESC
LIMIT 1
) IS NULL;
Now the problem is that I also want to clean up the history table using a single query. The query below already selects all history entries older than 10 days except each player's latest one, but I can't simply write DELETE instead of SELECT *.
SELECT *
FROM `player_history` `ph`
WHERE `date` < DATE_SUB(NOW(), INTERVAL 10 DAY)
AND `date` != (SELECT `date`
FROM `player_history`
WHERE `player_id` = `ph`.`player_id`
ORDER BY `date` DESC
LIMIT 1);
So is there a way to do what I want using a single DELETE query?
Your query looks right to me, but you don't have the interval in the subquery.
I would do this:
DELETE FROM player_history
WHERE date < DATE_SUB(NOW(), INTERVAL 10 DAY)
AND date != (
SELECT MAX(date) FROM player_history
WHERE date < DATE_SUB(NOW(), INTERVAL 10 DAY)
)
What's the error message from MySQL?
Probably you can't do this in a single query because the documentation states:
Currently, you cannot delete from a table and select from the same table in a subquery.
As a workaround you could select the ids of the rows that have to be deleted into a temporary table and then use a multi-table delete statement to delete the records from the original table.
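A sketch of that workaround, under a few assumptions: the question does not show a row id column, so instead of collecting ids this variant stores each player's latest date in a temporary table (the name `ph_latest` is made up) and joins against it in a multi-table DELETE.

```sql
-- 1. Remember the newest history date per player.
CREATE TEMPORARY TABLE ph_latest AS
SELECT `player_id`, MAX(`date`) AS latest_date
FROM `player_history`
GROUP BY `player_id`;

-- 2. Delete rows older than 10 days, except each player's newest row.
DELETE ph
FROM `player_history` AS ph
JOIN ph_latest AS l ON l.`player_id` = ph.`player_id`
WHERE ph.`date` < NOW() - INTERVAL 10 DAY
  AND ph.`date` <> l.latest_date;

-- 3. Clean up.
DROP TEMPORARY TABLE ph_latest;
```

Running the SELECT from the question first is a cheap way to confirm that the DELETE will touch the same rows.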