I have arrived at a query that gives me what I want, but it is not efficient and takes over 45 seconds to execute. How can I modify it to run faster?
SELECT *
FROM   (SELECT DISTINCT email,
                        title,
                        first_name,
                        last_name,
                        'chauntry' AS source,
                        post_code AS postcode
        FROM   chauntry
        WHERE  mailing_indicator = 1) AS x
       LEFT JOIN (SELECT email,
                         Avg(amount_paid) AS avg_paid,
                         Count(*) AS no_times_booked,
                         Count(DISTINCT Month(added)) AS unique_months
                  FROM   chauntry
                  WHERE  added >= Now() - INTERVAL 1 year
                  GROUP  BY email) AS y
              ON x.email = y.email
Here are the data fields:
Here are the column headings I am after:
DRapp, appreciate the detailed feedback - spot on about the date being an oversight.
Are we talking about something like the below to speed up my query? I can't find much info about creating a covering index, and the source of the one below was questionable.
ALTER TABLE `chauntry`
ADD INDEX(`mailing_indicator`, `email`);
ALTER TABLE `chauntry`
ADD INDEX covering_index (`added`, `email`, `amount_paid`);
You use SELECT DISTINCT in the first subquery and GROUP BY in the second subquery, which have the same deduplicating effect.
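For illustration, these two statements return the same result set:

SELECT DISTINCT email FROM chauntry;
SELECT email FROM chauntry GROUP BY email;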
Subqueries in a FROM clause are often redundant: they produce derived tables, which aren't indexed. When you run EXPLAIN over the query you'll see four tables in the execution plan. This can be rewritten as a query without subqueries:
SELECT x.email,
       x.title,
       x.first_name,
       x.last_name,
       'chauntry' AS source,
       x.post_code AS postcode,
       Avg(y.amount_paid) AS avg_paid,
       Count(y.email) AS no_times_booked,
       Count(DISTINCT Month(y.added)) AS unique_months
FROM   chauntry x
       LEFT JOIN chauntry y
              ON x.email = y.email
                 AND y.added >= CURRENT_DATE - INTERVAL 1 YEAR
WHERE  x.mailing_indicator = 1  -- the filter from your original first subquery
GROUP  BY x.email
However, your model isn't properly normalized: you should have two tables, one with the account data and one with the payments, along the lines of the sketch below.
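A minimal sketch of that split (column names follow the query above; the types are assumptions, since the full chauntry definition isn't shown):

CREATE TABLE account (
    email             VARCHAR(255) PRIMARY KEY,
    title             VARCHAR(20),
    first_name        VARCHAR(100),
    last_name         VARCHAR(100),
    post_code         VARCHAR(10),
    mailing_indicator TINYINT NOT NULL DEFAULT 0
);

CREATE TABLE payment (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    email       VARCHAR(255) NOT NULL,
    amount_paid DECIMAL(10,2),
    added       DATETIME,
    FOREIGN KEY (email) REFERENCES account (email)
);

The aggregate side of the query then scans only payment rows, and the index suggestions below carry over to the payment table unchanged.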
To help your performance, you really need indexes. Since you are in essence running two different queries, I would have the following indexes on your CHAUNTRY table.
First: by having mailing_indicator in the first position, you jump straight to the matching rows, and the index also carries email, which is the basis for the join afterwards. You could actually extend the index to include title, first_name, last_name and post_code to make it a covering index, but that might be overkill.
( mailing_indicator, email )
For your LEFT JOIN query, it appears you want the count, average, etc. regardless of the mailing_indicator status. To help optimize this, I would have an index on
( added, email, amount_paid )
This WOULD be a covering index so the engine does not need to go to the raw data pages to query the data, but gets them directly from the index.
One additional note about your count of distinct months: you MIGHT be undercounting. Consider the middle of a month, say today is Jan 28.
If you have entries for Jan 29, 2014 and Jan 27, 2015, both fall inside the one-year window, but MONTH() returns 1 for both, so they count as one distinct month rather than two, even though they span month AND year. You might want to change that to
Count(DISTINCT DATE_FORMAT(added, '%M %Y')) AS unique_months_yrs
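To see the difference, given rows added on 2014-01-29 and 2015-01-27 (a sketch against the same table and one-year window):

SELECT Count(DISTINCT Month(added))                AS unique_months,     -- 1: both rows are in month 1
       Count(DISTINCT DATE_FORMAT(added, '%M %Y')) AS unique_months_yrs  -- 2: 'January 2014' vs 'January 2015'
FROM   chauntry
WHERE  added >= Now() - INTERVAL 1 year;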
Syntax to create an index:
CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name
[index_type]
ON tbl_name (index_col_name,...)
[index_type]
Create index Chauntry_MailInd_EMail on Chauntry ( mailing_indicator, email );
Create index Chauntry_Add_Email_Paid on Chauntry ( added, email, amount_paid );
I have read through quite a few posts with greatest-n-per-group but still don't seem to find a good solution in terms of performance. I'm running 10.1.43-MariaDB.
I'm trying to get the change in data values in a given time frame, so I need the earliest and latest row from this period. The largest number of rows in a time frame that needs to be calculated right now is around 700k, and it's only going to grow. For now I have resorted to doing two queries, one for the latest and one for the earliest date, but even this currently performs slowly. The table looks like this:
user_id  data  date
4567     109   28/06/2019 11:04:45
4252     309   18/06/2019 11:04:45
4567     77    18/02/2019 11:04:45
7893     1123  22/06/2019 11:04:45
4252     303   11/06/2019 11:04:45
4252     317   19/06/2019 11:04:45
The date and user_id columns are indexed. Without ordering, the rows aren't in any particular order in the database, if that makes a difference.
The furthest I have gotten is a query like this, currently over a one-year period (700k data points):
SELECT user_id,
MIN(date) as date, data
FROM datapoint_table
WHERE date >= '2019-01-14'
GROUP BY user_id
This gives me the right date and user_id very fast, in around ~0.05s. But, as is the common issue with greatest-n-per-group, the rest of the row (data in this case) is not from the same row as the MIN(date). I have read other similar questions and tried a subquery like this:
SELECT a.user_id, a.date, a.data
FROM datapoint_table a
INNER JOIN (
SELECT datapoint_table.user_id,
MIN(date) as date, data
FROM datapoint_table
WHERE date >= '2019-01-01'
GROUP BY user_id
) b ON a.user_id = b.user_id AND a.date = b.date
This query takes around 15s to complete and gets the correct data value. The 15s, though, is just way too long, and I must be doing something wrong when the first query is so fast. I also tried MAX(data) - MIN(data) with a GROUP BY on user_id, but it also performed slowly.
What would be more efficient way of getting the same data value as the date or even the difference in latest and earliest data for each user?
Assuming you are using a fairly recent version of either MariaDB or MySQL, then ROW_NUMBER would probably be the most efficient way to find the earliest record for each user:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) rn
FROM datapoint_table
WHERE date > '2019-01-14'
)
SELECT user_id, data, date
FROM cte
WHERE rn = 1;
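If you ultimately want the change between the earliest and latest data per user, one sketch along the same lines (column names are those from the question; window-function support assumed as above):

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date ASC)  AS rn_first,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date DESC) AS rn_last
    FROM datapoint_table
    WHERE date > '2019-01-14'
)
SELECT user_id,
       MAX(CASE WHEN rn_last  = 1 THEN data END)
     - MAX(CASE WHEN rn_first = 1 THEN data END) AS data_change
FROM cte
WHERE rn_first = 1 OR rn_last = 1
GROUP BY user_id;

For a user with a single row in the window, rn_first and rn_last are both 1 on the same row, so data_change comes out as 0.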
For the ROW_NUMBER query you could also consider adding the following index:
CREATE INDEX ix_user_date ON datapoint_table (user_id, date);
You could also try the following variant index with the columns reversed:
CREATE INDEX ix_date_user ON datapoint_table (date, user_id);
It is not clear which version of the index would perform the best, which would depend on your data and the execution plan. Ideally one of the above two indices would help the database execute ROW_NUMBER, along with the WHERE clause.
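One way to decide, as a sketch: create both indexes, run EXPLAIN, and check which index name appears in the key column of the plan. The aggregate form below avoids needing window-function support:

EXPLAIN
SELECT user_id, MIN(date) AS min_date
FROM datapoint_table
WHERE date > '2019-01-14'
GROUP BY user_id;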
If your database version does not support ROW_NUMBER, then you may continue with your current approach:
SELECT d1.user_id, d1.data, d1.date
FROM datapoint_table d1
INNER JOIN
(
SELECT user_id, MIN(date) AS min_date
FROM datapoint_table
WHERE date > '2019-01-14'
GROUP BY user_id
) d2
ON d1.user_id = d2.user_id AND d1.date = d2.min_date
WHERE
d1.date > '2019-01-14';
Again, the indices suggested should at least speed up the execution of the GROUP BY subquery.
I have a subquery that aggregates some UNION ALL selects. Over that I prepare the SELECT to create a cross-tab and limit it to, let's say, 20 rows. I would like to be able to retrieve the total COUNT of the subquery results before limiting them in the main query. This is for the purpose of building pagination that receives the total number of records and then the specific page's record grid.
Sample query:
SELECT
name,
sumIf(metric_value, metric_name = 'data') AS data,
sumIf(....
FROM
(SELECT
name, metric_name, SUM(metric_value) as metric_value
FROM
(SELECT
name, 'data' AS metric_name, SUM(data) AS metric_value
FROM
table
WHERE
date > '2017-01-01 00:00:00'
GROUP BY
name
UNION ALL
SELECT
name, 'data' AS metric_name, SUM(data) AS metric_value
FROM
table2
WHERE
date > '2017-01-01 00:00:00'
GROUP BY
name
UNION ALL
SELECT
name, 'data' AS metric_name, SUM(data) AS metric_value
FROM
table3
WHERE
date > '2017-01-01 00:00:00'
GROUP BY
name
UNION ALL
.
.
.)
GROUP BY
name, metric_name)
GROUP BY
name
ORDER BY
name ASC
LIMIT 0,20;
The first subselect returns tons of data, so I thought I could count it and return it as one column value, or one row, that would propagate to the main select that limits to 20 results. I need to know the size of the entire result set but don't want to call the same query twice, once without the limit just to get the COUNT. There are at least 12 UNION ALL third-level subselects, so why waste resources? I am looking for generic SQL solutions, not necessarily ClickHouse-specific ones.
I was thinking of using count(*) OVER (), however that is not supported, so if that's the only option I know I need to run the query twice.
The first thing that one should mention is that nobody is usually interested in the exact number of pages on a query. It can be easily estimated, and almost no one will care how exact the estimate is. However, if you have a link to the last page in your GUI, people will often click the link just to see whether it works.
Nevertheless, there are cases when an analyst should visit all the pages, and then the GUI should display the exact amount of work. The good news is that in that latter case a better strategy is to cache a snapshot of the whole result table, and counting the rows in that table is no longer a problem.
I mean, it makes sense to discuss with the customers whether they really need it, because unneeded full scans many times per day may have an effect on database load and billing.
Anyway, if you still need to estimate the number of rows, you can simplify the query so that it just counts rows. As I understand it, that is something like:
SELECT SUM(cnt) as row_count
FROM (
SELECT COUNT(DISTINCT name) as cnt FROM table1 WHERE date > ...
UNION ALL
SELECT COUNT(DISTINCT name) as cnt FROM table2 WHERE date > ...
...
) as counts;
or, if 'data' is a constant metric name, count each distinct name only once across all the tables:
SELECT COUNT(DISTINCT name) as row_count
FROM (
SELECT DISTINCT name FROM table1 WHERE date > ...
UNION ALL
SELECT DISTINCT name FROM table2 WHERE date > ...
...
) as names;
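If you do need the exact total alongside the page itself, one generic pattern (a sketch; it assumes CTE support, and whether the engine materializes the CTE once or evaluates it twice is implementation-dependent) is to name the aggregated result once and join the count back on:

WITH result AS (
    SELECT name, SUM(metric_value) AS metric_value
    FROM (
        SELECT name, SUM(data) AS metric_value
        FROM table1 WHERE date > '2017-01-01 00:00:00' GROUP BY name
        UNION ALL
        SELECT name, SUM(data) AS metric_value
        FROM table2 WHERE date > '2017-01-01 00:00:00' GROUP BY name
    ) AS u
    GROUP BY name
)
SELECT r.name, r.metric_value, t.total_rows
FROM result r
CROSS JOIN (SELECT COUNT(*) AS total_rows FROM result) t
ORDER BY r.name
LIMIT 0, 20;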
I have a table that stores simple log data:
CREATE TABLE chronicle (
id INT auto_increment PRIMARY KEY,
data1 VARCHAR(256),
data2 VARCHAR(256),
time DATETIME
);
The table is approaching 1m records, so I'd like to start consolidating data.
I want to be able to take the first and last record of each DISTINCT(data1, data2) each day and delete all the rest.
I know how to just pull in the data and process it in whatever language I want and then delete the records with a huge IN (...) query, but it seems like a better alternative would be to use SQL directly (am I wrong?).
I have tried several queries, but I'm not very good with SQL beyond JOINs.
Here is what I have so far:
SELECT id, Max(time), Min(time)
FROM (SELECT id, data1, data2, time, Cast(time AS DATE) AS day
      FROM chronicle) AS initial
GROUP BY day;
This gets me the first and last time for each day, but it's not separated out by the data (i.e. I get the last record of each day, not the last record for each distinct set of data for each day.) Additionally, the id is just for the Min(time).
The information I've found on this particular problem is only for finding the last record of the day, not the last record for each set of data.
IMPORTANT: I want the first/last record for each DISTINCT(data1, data2) for each day, not just the first/last record for each day in the table. There will be more than 2 records for each day.
Solution:
My solution thanks to Jonathan Dahan and Gordon Linoff:
SELECT o.data1, o.data2, o.time FROM chronicle AS o JOIN (
    SELECT Min(id) AS id FROM chronicle GROUP BY DATE(time), data1, data2
    UNION SELECT Max(id) AS id FROM chronicle GROUP BY DATE(time), data1, data2
) AS n ON o.id = n.id;
From here it's a simple matter of referencing the same table to delete rows.
This will improve performance when searching on dates:
ALTER TABLE chronicle
ADD INDEX `ix_chronicle_time` (`time` ASC);
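If you are on MySQL 5.7+ (an assumption; the question doesn't state a version), a stored generated column also lets the per-day grouping use an index:

ALTER TABLE chronicle
    ADD COLUMN `day` DATE AS (DATE(`time`)) STORED,
    ADD INDEX ix_chronicle_day_data (`day`, data1, data2);

The GROUP BY in the queries below could then reference `day` directly instead of recomputing the date per row.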
This will delete the records:
CREATE TEMPORARY TABLE tmp_ids (
    `id` INT NOT NULL,
    PRIMARY KEY (`id`)
);
INSERT INTO tmp_ids (id)
SELECT
min(id)
FROM
chronicle
GROUP BY
CAST(`time` AS DATE),
data1,
data2
UNION
SELECT
Max(id)
FROM
chronicle
GROUP BY
CAST(`time` AS DATE),
data1,
data2;
DELETE FROM
chronicle
WHERE
id NOT IN (SELECT id FROM tmp_ids)
AND `time` <= '2015-01-01'; -- if you want to consider all dates, then remove this condition
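Before running the DELETE, a quick sanity check (same temporary table as above) counts what would be removed:

SELECT COUNT(*)
FROM chronicle
WHERE id NOT IN (SELECT id FROM tmp_ids)
  AND `time` <= '2015-01-01';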
You have the right idea. You just need to join back to get the original information.
SELECT c.*
FROM chronicle c JOIN
     (SELECT data1, data2, date(time) as day, min(time) as mint, max(time) as maxt
      FROM chronicle
      GROUP BY data1, data2, date(time)
     ) cc
     ON c.data1 = cc.data1 AND c.data2 = cc.data2 AND c.time IN (cc.mint, cc.maxt);
Note that the join condition doesn't need to include day explicitly because it is part of the time. Of course, you could add date(c.time) = cc.day if you wanted to.
Instead of deleting rows in your original table, I would suggest that you make a new table. Something like this:
create table ChronicleByDay like chronicle;
insert into ChronicleByDay
SELECT c.*
FROM chronicle c JOIN
     (SELECT data1, data2, date(time) as day, min(time) as mint, max(time) as maxt
      FROM chronicle
      GROUP BY data1, data2, date(time)
     ) cc
     ON c.data1 = cc.data1 AND c.data2 = cc.data2 AND c.time IN (cc.mint, cc.maxt);
That way, you can have the more detailed information if you ever need it.
I have four queries that run on one web page. I use them for statistics and they are taking too long to load.
Here are my current configurations
Use the text-wrapping button on Pastebin to make it easier to read.
I have a lot of RAM dedicated to MySQL, but it still takes a long time. I have also indexed most of the columns.
I'm just trying to see what other options I have.
I put "show create table" and total count(*) in here. I'm going to rename everything and paste in SO. I agree that someone in the future may use it.
QUERY ONE
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsresults
WHERE
DID = 28
AND ActionTypeID = 1
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
For this, I would have a covering index based on your key elements so the engine does not have to go back to the raw data... Based on this and your following queries, I would have the DID column in the primary index position, such as
StatisticsResults -- index ( DID, ActionTypeID, DateActioned )
The ORDER BY on the respective YEAR() descending and MONTH() descending will do the same thing as your hard-coded references to find the field in the list.
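As a concrete statement (the index name is just a suggestion):

ALTER TABLE statisticsresults
    ADD INDEX ix_stats_did_type_date (DID, ActionTypeID, DateActioned);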
QUERY TWO
-- 381.812
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsdivision
WHERE
DID = 28
AND ActionTypeID = 9
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
On this one, I changed DID = '28' to DID = 28. If the column is numeric, don't confuse the engine by making it convert one type to the other. The same indexes from query one apply here too.
QUERY THREE
-- 33.899
SELECT SQL_NO_CACHE DISTINCT
AID,
COUNT(*) AS acount
FROM
db.statisticsresults
JOIN db.division_id USING(AID)
WHERE
DID = 28
GROUP BY
AID
ORDER BY
count(*) DESC
LIMIT
19
This one looks like a bit of a waste... you are joining to the division table based on an "AID" column in the stats table. Why are you doing the join unless you are actually expecting some invalid "AID" values not in the division table? Again, change your "DID" value to 28 instead of '28'. Ensure your division table has its index on "AID" for the join. The SECOND index from query one appears to be your better option.
QUERY FOUR
-- 21.403
SELECT SQL_NO_CACHE DISTINCT
TID,
tax,
agent,
COUNT(*) AS t_count
FROM
db.statisticsresults sr
JOIN db.tax_id USING(TID)
JOIN db.agent_id ai ON(ai.AID = sr.AID)
WHERE
DID = 28
GROUP BY
TID,
sr.AID
ORDER BY
COUNT(*) DESC
LIMIT 19
Again, "DID" column from '28' to 28
For your TAX_ID table, have a covering index on that too so it can handle the join to the agent table without going to the raw page data:
Tax_ID -- index ( tid, aid )
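As concrete statements (index names are suggestions; the join tables are those used in queries three and four):

ALTER TABLE tax_id      ADD INDEX ix_tax_tid_aid (TID, AID);
ALTER TABLE agent_id    ADD INDEX ix_agent_aid (AID);
ALTER TABLE division_id ADD INDEX ix_division_aid (AID);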
Finally, if your original list only needs things from Jan 2012 to Dec 2013, you can avoid scanning the ENTIRE stats table by adding to your WHERE clause...
AND DateActioned >= '2012-01-01'
So you completely skip over anything prior to 2012 (old data I presume?)
I am trying to output the total content views from my stats table and group by year... My stats table is InnoDB and has 8M rows and growing...
The table is essentially ID, DATE, MAKE, IP, REFERRER (indexes on id,date,make)
Each entry has an auto-incremented ID, the entry date YYYY-MM-DD HH:MM:SS, and a product make like 'sony', 'panasonic' etc...
I am trying to write a query that does not kill my server: one that sums up the total content views per year and shows them in order from most viewed to least viewed (for this year, 2011), so that I can use that data to populate a JS chart comparing this year with past years. I can do this with multiple queries and by walking through arrays in PHP, but I think there should be a way to get this in one query, but hell if I can figure it out.
Any ideas? Also, am I better off making three independent queries and dealing with the results in PHP, or can I get this into one query that is more efficient for MySQL?
The output I would like to see (although I cannot seem to make it do this) is simply
MAKE       2009Total  2010Total  2011Total
----       ---------  ---------  ---------
Panasonic        800       2345       3456
Sony             998       5346       2956
JVC             1300       1234       1944
Assume my table has data in it from 2009 to now; I need my array to contain one line per make...
Any help would be appreciated... I am amazed at how fast results like this come back from analytics tools, and mine take about 75 seconds on a 4x quad-core Xeon RAID MySQL server... this stats table is only written to once a day, to dump in the previous day's stats, so I am not sure why my three separate queries are so slow... hence my question... maybe a single query won't be any faster?
Anyway, any help would be appreciated and opinions about speeding up stats queries from a generic view stats table would be welcomed!
I have made an observation. Your query is requesting by year. You should do two things:
store the year
create a better index (product,year)
Here is how you can do so:
CREATE TABLE stats_entry_new LIKE stats_entry;
ALTER TABLE stats_entry_new ADD COLUMN entryyear SMALLINT NOT NULL AFTER date;
ALTER TABLE stats_entry_new ADD INDEX product_year_ndx (product,entryyear);
ALTER TABLE stats_entry_new DISABLE KEYS;
INSERT INTO stats_entry_new
SELECT ID, DATE, YEAR(date), product, IP, REFERRER FROM stats_entry;
ALTER TABLE stats_entry_new ENABLE KEYS;
ALTER TABLE stats_entry RENAME stats_entry_old;
ALTER TABLE stats_entry_new RENAME stats_entry;
Now the query looks like this:
SELECT A.product,B.cnt "2009Total",C.cnt "2010Total",D.cnt "2011Total"
FROM
(SELECT DISTINCT product FROM stats_entry) A
INNER JOIN
(SELECT product,COUNT(1) cnt FROM stats_entry WHERE entryyear=2009 GROUP BY product) B
USING (product)
INNER JOIN
(SELECT product,COUNT(1) cnt FROM stats_entry WHERE entryyear=2010 GROUP BY product) C
USING (product)
INNER JOIN
(SELECT product,COUNT(1) cnt FROM stats_entry WHERE entryyear=2011 GROUP BY product) D
USING (product);
Now, to be fair, if you do not want to add a year column to the table, then you still need an index:
ALTER TABLE stats_entry ADD INDEX product_date_ndx (product,date);
Your query then looks like this:
SELECT A.product,B.cnt "2009Total",C.cnt "2010Total",D.cnt "2011Total"
FROM
(SELECT DISTINCT product FROM stats_entry) A
INNER JOIN
(SELECT product,COUNT(1) cnt FROM stats_entry
WHERE date >= '2009-01-01 00:00:00'
AND date <= '2009-12-31 23:59:59'
GROUP BY product) B
USING (product)
INNER JOIN
(SELECT product,COUNT(1) cnt FROM stats_entry
WHERE date >= '2010-01-01 00:00:00'
AND date <= '2010-12-31 23:59:59'
GROUP BY product) C
USING (product)
INNER JOIN
(SELECT product,COUNT(1) cnt FROM stats_entry
WHERE date >= '2011-01-01 00:00:00'
AND date <= '2011-12-31 23:59:59'
GROUP BY product) D
USING (product);
Give it a Try !!!
SELECT make, year(date) as year, sum(views) as sum
FROM `stats`
group by make, year
o/p:
MAKE        year   sum
---------   ----   ----
Panasonic   2009    800
Panasonic   2010   2345
Panasonic   2011   3456
....
You can later segregate this on the PHP side.
or:
select make, group_concat(cast(yr_views as char)) as year_views
from (SELECT make,concat(year(date),':',sum(views)) as yr_views
FROM `stats`
group by make,year(date))as make_views
group by make
o/p:
make        year_views
---------   ---------------
panasonic   2009:800,2010:2345,2011:3456
...
Later, explode it at the PHP level and you have the result.
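If the years can be hard-coded, as in the first answer, a single-pass alternative is conditional aggregation; this is a sketch that assumes each row in stats is one view, per the table description in the question:

SELECT make,
       SUM(YEAR(date) = 2009) AS `2009Total`,
       SUM(YEAR(date) = 2010) AS `2010Total`,
       SUM(YEAR(date) = 2011) AS `2011Total`
FROM `stats`
GROUP BY make
ORDER BY `2011Total` DESC;

One scan over the table produces the pivoted output directly, at the cost of editing the query each year.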