Optimize MIN & MAX query - MySQL

My database table contains more than 10 million records. I am writing a query with the MIN and MAX functions on the created_date column, which I have already indexed. But when I run my SELECT statement it takes too long, and sometimes execution times out and I receive no output.
Is there any way to optimize my query? The query I am trying is below.
SELECT MIN(created_date) AS Min, MAX(created_date) as Max FROM network ORDER
BY id DESC LIMIT 1000000
The above query is meant to give the MIN and MAX of created_date from the latest 1,000,000 rows.

SELECT MIN(created_date) AS Min,
MAX(created_date) AS Max -- Get min and max from the 1M rows
FROM (
SELECT created_date
FROM network
ORDER BY created_date DESC
LIMIT 1000000
) AS recent -- Collect the latest 1M rows
This index would help some:
INDEX(created_date)
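The derived-table pattern is easy to verify on a toy dataset. Below is a hedged sketch using Python's sqlite3 as a stand-in for MySQL; the table name matches the question, but the data is made up, and LIMIT 5 stands in for LIMIT 1000000:

```python
import sqlite3

# Hypothetical in-memory stand-in for the `network` table
# (SQLite syntax, but the derived-table pattern is identical in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE network (id INTEGER PRIMARY KEY, created_date TEXT)")
conn.executemany(
    "INSERT INTO network (id, created_date) VALUES (?, ?)",
    [(i, f"2023-01-{i:02d} 00:00:00") for i in range(1, 11)],
)
conn.execute("CREATE INDEX idx_created ON network (created_date)")

# Min and max of the latest 5 rows by created_date, via a derived table.
row = conn.execute("""
    SELECT MIN(created_date) AS Min, MAX(created_date) AS Max
    FROM (
        SELECT created_date
        FROM network
        ORDER BY created_date DESC
        LIMIT 5
    ) AS recent
""").fetchone()

print(row)
```

With ten daily rows, the latest five run from Jan 6 to Jan 10, so Min and Max come back as those two endpoints.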
Rereading question
The latest date is simply MAX(created_date). The millionth date is ( SELECT created_date FROM network ORDER BY created_date DESC LIMIT 1000000, 1 ).
So, this is the first choice:
SELECT ( SELECT created_date FROM network
ORDER BY created_date DESC LIMIT 1000000, 1 ) AS Min,
MAX(created_date) AS Max
FROM network;
Summary Table
CREATE TABLE Dates (
create_date DATETIME NOT NULL,
ct INT UNSIGNED NOT NULL,
PRIMARY KEY(ct)
) ENGINE=InnoDB;
Then, every hour (or other unit of time), count the number of records and store it there.
Finding MIN(created_date) is a bit messy; it means summing through that table until the count adds up to about 1M, and declaring that the hour (or whatever unit) is when it happened.
Alternatively (and probably better), capture the exact DATETIME of every 1000th row. This means probing network frequently and storing just the created_date (drop the ct column). Then this finds the approximate time of the row 1M back:
SELECT created_date
FROM Dates
ORDER BY created_date DESC
LIMIT 1000, 1
(And use that as the subquery for Min.)
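The every-Nth-row marker idea can be sketched in miniature. Here is a hedged Python/sqlite3 sketch with hypothetical data, where a step of 3 stands in for the answer's step of 1000:

```python
import sqlite3

# Miniature sketch of the marker-table idea: capture the created_date of every
# 3rd row (standing in for every 1000th), then probe the markers with an OFFSET.
# SQLite stands in for MySQL; the data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE network (id INTEGER PRIMARY KEY, created_date TEXT)")
conn.execute("CREATE TABLE Dates (created_date TEXT NOT NULL)")

for i in range(1, 13):
    stamp = f"2023-01-{i:02d} 00:00:00"
    conn.execute("INSERT INTO network (id, created_date) VALUES (?, ?)", (i, stamp))
    if i % 3 == 0:  # every 3rd row gets a marker
        conn.execute("INSERT INTO Dates (created_date) VALUES (?)", (stamp,))

# Skipping 2 markers of 3 rows each approximates the timestamp ~6 rows back
# (the MySQL form LIMIT 2, 1 is equivalent to LIMIT 1 OFFSET 2).
approx_min = conn.execute("""
    SELECT created_date
    FROM Dates
    ORDER BY created_date DESC
    LIMIT 1 OFFSET 2
""").fetchone()[0]

print(approx_min)
```

The result is only approximate, accurate to within one marker interval, which is the trade-off the answer describes.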

Related

Select rows with unique values for one specific column in large table

table1 in my database has 3 columns: id, timestamp, cluster, and it has about 1M rows. I want to query the newest 24 rows with unique cluster values (no cluster value may be repeated in the returned 24 rows). The usual solution would be:
SELECT
*
FROM table1
GROUP BY cluster
ORDER BY timestamp DESC
LIMIT 24
However, since I have 1M rows, this query takes too long to execute, so my solution was to run:
WITH x AS
(
SELECT
*
FROM `table1`
ORDER BY timestamp DESC
LIMIT 50
)
SELECT
*
FROM x
GROUP BY x.cluster
ORDER BY x.timestamp DESC
LIMIT 24
This assumes we can find 24 rows with unique cluster values within every 50 rows. The query runs much faster (~0.007 sec). Now I want to ask: is there a more efficient/standard way to handle such a case?
You can use row_number(), but you need the right indexes:
select t.*
from (select t.*,
row_number() over (partition by cluster order by timestamp desc) as seqnum
from t
) t
where seqnum = 1
order by timestamp desc
limit 24;
The index you want is on (cluster, timestamp desc).
For your purposes, this may still not be sufficient because it is still processing all the rows, even with an index, when you only need a couple of dozen.
I don't know how many recent rows you need to be sure that you have 24 clusters. However, you might find that this works better if we assume that the most recent 1000 rows have at least 24 clusters:
select t.*
from (select t.*,
row_number() over (partition by cluster order by timestamp desc) as seqnum
from (select t.*
from t
order by timestamp desc
limit 1000
) t
) t
where seqnum = 1
order by timestamp desc
limit 24;
For this, you want an index only on (timestamp desc).
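The limit-then-rank trick can be demonstrated on a small dataset. Below is a hedged Python/sqlite3 sketch (SQLite 3.25+ for window functions) with made-up rows, where LIMIT 4 and LIMIT 2 stand in for the answer's LIMIT 1000 and LIMIT 24:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, timestamp TEXT, cluster TEXT)"
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    (1, "2023-01-01 10:00", "a"),
    (2, "2023-01-01 11:00", "b"),
    (3, "2023-01-01 12:00", "a"),
    (4, "2023-01-01 13:00", "c"),
    (5, "2023-01-01 14:00", "b"),
])

# Rank only the latest 4 rows per cluster, keep the newest row of each
# cluster, then take the 2 newest clusters overall.
result = conn.execute("""
    SELECT id, timestamp, cluster
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY cluster
                                  ORDER BY timestamp DESC) AS seqnum
        FROM (SELECT * FROM table1 ORDER BY timestamp DESC LIMIT 4) t
    ) ranked
    WHERE seqnum = 1
    ORDER BY timestamp DESC
    LIMIT 2
""").fetchall()

print(result)
```

Row 1 (cluster a) never enters the ranking because it falls outside the latest 4 rows, which illustrates both the speedup and the risk the answers point out: the inner LIMIT must be large enough to cover all the clusters you need.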
Note: You might find that a where clause on the timestamp works better in this case:
where timestamp > now() - interval 24 hour
for instance to only consider rows in the past 24 hours.
Your assumption that in the last 50 rows you will find 24 different clusters may not be correct.
Try with ROW_NUMBER() window function:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) rn
FROM table1
) t
WHERE rn = 1
ORDER BY timestamp DESC LIMIT 24
Since you want "one specific cluster value", this will be fast:
SELECT
*
FROM table1
WHERE cluster = ?
ORDER BY timestamp DESC
LIMIT 24
And have
INDEX(cluster, timestamp)
If that is not what you want, please reword the title and the Question.

MySQL average over day statistics

Use case: I have a cron job that checks some statistics every 5 minutes and inserts them into the database table stats.
**Structure**
`time` as DATETIME (indexed)
`skey` as VARCHAR(50) (indexed)
`value` as BIGINT
Primary key: (time, skey)
Now I want to create a graph displaying the daily average as it progresses over the day, e.g. a graph of playing users:
from 0-1 I have 10 playing users (avg value from 0-1 is now 10)
from 1-2 I have 6 playing users (avg value is now 8 => (10+6) / 2)
from 2-3 I have 14 playing users (avg value is now 10 => (10+6+14) / 3)
and the next day it begins from the start.
I already have queries running, but they take 3.5+ seconds.
First attempt:
SELECT *
, (SELECT AVG(value)
FROM stats as b
WHERE b.skey = stats.skey
AND b.time <= stats.time
AND DATE(b.time) = DATE(stats.time))
FROM stats
ORDER
BY stats.time DESC
Second attempt:
SELECT *
, (SELECT AVG(b.value)
FROM stats as b
WHERE b.skey = stats.skey
AND DATE(b.time) = DATE(stats.time)
AND b.time <= stats.time) as avg
FROM stats
WHERE skey = 'playingUsers'
GROUP
BY HOUR(stats.time)
, DATE(stats.time)
The first try was to get each entry and calculate the average; the second was to group by hour (like my example). Either way, the performance does not change.
Is there any way to boost performance in MySQL, or do I have to change the whole logic behind it?
DB Fiddle:
https://www.db-fiddle.com/f/krFmR1yPsmnPny2zi5NJGv/4
I suggest separating the calculation of the average per hour from the calculation of the day's average, and calculating these values only once per hour via grouping.
If you are on MySQL 8, I suggest using a CTE as follows:
with HOURLY AS (
SELECT
DATE_,
HOUR_,
AVG(b.value) as avg_per_hour
FROM (SELECT s.value, DATE(s.time) DATE_, HOUR(s.time) HOUR_
FROM stats s
where skey = 'playingUsers'
) b
GROUP BY b.DATE_, b.HOUR_
ORDER BY b.DATE_ DESC, b.HOUR_ DESC
)
SELECT *
, (SELECT AVG(b.avg_per_hour)
FROM HOURLY as b
WHERE b.DATE_ = HOURLY.DATE_
AND b.HOUR_ <= HOURLY.HOUR_) as avg
FROM HOURLY
This statement runs in under 300 ms in the given fiddle.
The calculation corresponds to the algorithm you described in the table above.
However, the results differ from those of the statements you presented.
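The hourly-average-then-running-average logic can also be expressed with a window function instead of the correlated subquery. Here is a hedged Python/sqlite3 sketch with the asker's own three-hour example as hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (time TEXT, skey TEXT, value INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", [
    ("2023-01-01 00:30", "playingUsers", 10),
    ("2023-01-01 01:30", "playingUsers", 6),
    ("2023-01-01 02:30", "playingUsers", 14),
])

# First average per (day, hour), then a running average over the day so far:
# 10, then (10+6)/2 = 8, then (10+6+14)/3 = 10, matching the example.
result = conn.execute("""
    WITH hourly AS (
        SELECT DATE(time) AS d,
               CAST(STRFTIME('%H', time) AS INTEGER) AS h,
               AVG(value) AS avg_per_hour
        FROM stats
        WHERE skey = 'playingUsers'
        GROUP BY d, h
    )
    SELECT d, h,
           AVG(avg_per_hour) OVER (PARTITION BY d ORDER BY h) AS day_avg
    FROM hourly
    ORDER BY d, h
""").fetchall()

print(result)
```

The window function works in MySQL 8 as well (with HOUR(time) instead of STRFTIME), and it scans the hourly rows once rather than re-aggregating per row.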

SQL queries optimization

I'm having trouble optimizing some SQL queries that take datetime fields into account.
First of all, my table structure is the following:
CREATE TABLE info (
id int NOT NULL auto_increment,
name varchar(20),
infoId int,
shortInfoId int,
text varchar(255),
token varchar(60),
created_at DATETIME,
PRIMARY KEY(id),
KEY(created_at));
After using explain on some of the simple queries I added the created_at key, that improved most of my simple queries performance. I'm having now trouble with the following query:
SELECT min(created_at), max(created_at) from info order by id DESC limit 10000
With this query I want to get the timespan covered by the last 10k results.
After using explain I get the following results:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE info ALL NULL NULL NULL NULL 4 NULL
Any idea on how can I improve the performance of this query?
If you want to examine the first 10k rows ordered by id then you need to use a sub-query to achieve your goal:
SELECT MIN(created_at), MAX(created_at)
FROM (
SELECT created_at
FROM info
ORDER BY id DESC
LIMIT 10000
) tenK
The inner query gets the latest 10k rows from the table, sorted by id (only the created_at field is needed). The outer query computes the minimum and maximum created_at from the result set generated by the inner query.
I didn't run an EXPLAIN on it, but I think it would say 'Using temporary' in the 'Extra' column (not great, but you cannot do better for this request). However, 10,000 rows is not that much; it runs fast, and the performance does not degrade as the table size increases.
Update:
Now I noticed this sentence in the question:
With this query I want to get the timespan between tha last 10k results.
If you want to get the value of created_at of the most recent row and the row that is 10k rows in the past then you can use two simple queries that use the index on created_at and run fast:
(
SELECT created_at
FROM info
ORDER BY id DESC
LIMIT 1
)
UNION ALL
(
SELECT created_at
FROM info
ORDER BY id DESC
LIMIT 9999,1
)
ORDER BY created_at
This query produces 2 rows: the first is the created_at of the 10,000th row back, the second is the created_at of the most recent row (I assume created_at always grows).
SELECT min(created_at), max(created_at) from info order by id DESC limit 10000
The above query will give you one row containing the minimum and maximum created_at values from the whole info table. Because it returns only 1 row, the ORDER BY and LIMIT clauses don't come into play.
The 10000th record from the end can be accessed with the ORDER BY and LIMIT clauses ORDER BY id DESC LIMIT 1 OFFSET 9999 (thanks @Mörre Noseshine for the correction).
So, we can write the intended query as follows:
SELECT
min_created_at.value,
max_created_at.value
FROM
(SELECT
created_at value
FROM info
ORDER BY id DESC
LIMIT 1 OFFSET 9999) min_created_at,
(SELECT
created_at value
FROM info
ORDER BY id DESC
LIMIT 1) max_created_at

DISTINCT ON query w/ ORDER BY max value of a column

I've been tasked with converting a Rails app from MySQL to Postgres asap and ran into a small issue.
The active record query:
current_user.profile_visits.limit(6).order("created_at DESC").where("created_at > ? AND visitor_id <> ?", 2.months.ago, current_user.id).distinct
Produces the SQL:
SELECT visitor_id, MAX(created_at) as created_at, distinct on (visitor_id) *
FROM "profile_visits"
WHERE "profile_visits"."social_user_id" = 21
AND (created_at > '2015-02-01 17:17:01.826897' AND visitor_id <> 21)
ORDER BY created_at DESC, id DESC
LIMIT 6
I'm pretty confident when working with MySQL but I'm honestly new to Postgres. I think this query is failing for multiple reasons.
I believe the distinct on needs to be first.
I don't know how to order by the results of max function
Can I even use the max function like this?
The high-level goal of this query is to return the 6 most recent profile views of a user. Any pointers on how to fix this ActiveRecord query (or its resulting SQL) would be greatly appreciated.
The high level goal of this query is to return the 6 most recent
profile views of a user.
That would be simple. You need neither max() nor DISTINCT for this:
SELECT *
FROM profile_visits
WHERE social_user_id = 21
AND created_at > (now() - interval '2 months')
AND visitor_id <> 21 -- ??
ORDER BY created_at DESC NULLS LAST, id DESC NULLS LAST
LIMIT 6;
I suspect your question is incomplete. If you want:
the 6 latest visitors with their latest visit to the page
then you need a subquery. You cannot get this sort order in one query level, neither with DISTINCT ON, nor with window functions:
SELECT *
FROM (
SELECT DISTINCT ON (visitor_id) *
FROM profile_visits
WHERE social_user_id = 21
AND created_at > (now() - interval '2 months')
AND visitor_id <> 21 -- ??
ORDER BY visitor_id, created_at DESC NULLS LAST, id DESC NULLS LAST
) sub
ORDER BY created_at DESC NULLS LAST, id DESC NULLS LAST
LIMIT 6;
The subquery sub gets the latest visit per visitor (but not older than two months, and not for a certain visitor, 21). ORDER BY must have the same leading columns as DISTINCT ON.
You then need the outer query to get the 6 latest visitors.
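The latest-visit-per-visitor step can be sketched portably: SQLite has no DISTINCT ON, so the sketch below uses ROW_NUMBER(), which is the portable equivalent of Postgres's DISTINCT ON (visitor_id) with that ORDER BY. Table and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_visits "
    "(id INTEGER PRIMARY KEY, visitor_id INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO profile_visits VALUES (?, ?, ?)", [
    (1, 100, "2023-01-01"),
    (2, 100, "2023-01-05"),
    (3, 200, "2023-01-03"),
    (4, 300, "2023-01-02"),
])

# Keep each visitor's latest visit (rn = 1 stands in for DISTINCT ON),
# then order those latest visits newest-first and take the top 2.
result = conn.execute("""
    SELECT visitor_id, created_at
    FROM (
        SELECT id, visitor_id, created_at,
               ROW_NUMBER() OVER (PARTITION BY visitor_id
                                  ORDER BY created_at DESC, id DESC) AS rn
        FROM profile_visits
    ) ranked
    WHERE rn = 1
    ORDER BY created_at DESC
    LIMIT 2
""").fetchall()

print(result)
```

Visitor 100's older visit (Jan 1) is dropped in the inner step, which is exactly what the two query levels in the answer accomplish: dedupe first, re-sort second.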
Consider the sequence of events:
Best way to get result count before LIMIT was applied
Why NULLS LAST? To be safe, since you did not provide the table definition:
PostgreSQL sort by datetime asc, null first?

MySQL select given number of rows and always select all rows within the same day

I want to do a MySQL query which selects a given number of rows from a single table from a given offset, like:
SELECT * FROM table
WHERE timestamp < '2011-11-04 09:01:05'
ORDER BY timestamp DESC
LIMIT 100
My problem is that I always want all rows within a day to be included in the result if any row of that day is included.
It would be no problem to have a result with e.g. 102 rows instead of 100.
Can I realize this with a single SQL statement?
Thanks for your help!
This seems to work on my system:
SELECT UserID, Created
FROM some_user
WHERE Created < '2011-11-04 09:10:11'
AND Created >= (
SELECT DATE(Created) -- note: DATE() strips out the time portion from datetime
FROM some_user
WHERE Created < '2011-11-04 09:10:11'
ORDER BY Created DESC
LIMIT 99, 1 -- note: counting starts from 0 so LIMIT 99, 1 returns 100th row
)
ORDER BY Created DESC
-- 0 rows affected, 102 rows found. Duration for 1 query: 0.047 sec.
There might be a faster alternative.
If I understand your question correctly, you're interested in retrieving 100 rows, plus any rows that are on the same day as ones already retrieved. You can do this using a subquery:
SELECT table.*
FROM table, (
    SELECT DISTINCT day
    FROM (
        SELECT TO_DAYS(timestamp) day
        FROM table
        WHERE timestamp < :?
        ORDER BY timestamp DESC -- the LIMIT needs an explicit order
        LIMIT 100
    ) recent -- MySQL requires an alias on every derived table
) days
WHERE TO_DAYS(table.timestamp) = days.day
ORDER BY timestamp
Exclude the time part in the query and remove the LIMIT.
SELECT * FROM table
WHERE timestamp < '2011-11-04 00:00:00'
ORDER BY timestamp DESC
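The first answer's technique (take the latest N rows, then push the lower bound back to midnight of the oldest day touched) can be demonstrated end to end. Here is a hedged Python/sqlite3 sketch with made-up rows, where 3 stands in for the question's 100:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT)")
conn.executemany("INSERT INTO events (id, ts) VALUES (?, ?)", [
    (1, "2011-11-01 09:00"),
    (2, "2011-11-02 08:00"),
    (3, "2011-11-02 10:00"),
    (4, "2011-11-03 07:00"),
    (5, "2011-11-03 12:00"),
])

# Take the latest 3 rows before the cutoff, then extend the window back to the
# start of the oldest day they touch, so no day is split. Here the Nov 2 early
# row is pulled in, growing the result from 3 rows to 4.
result = conn.execute("""
    SELECT id, ts
    FROM events
    WHERE ts < '2011-11-04 00:00'
      AND ts >= (
          SELECT DATE(ts)  -- DATE() strips the time, i.e. midnight of that day
          FROM events
          WHERE ts < '2011-11-04 00:00'
          ORDER BY ts DESC
          LIMIT 1 OFFSET 2  -- the 3rd-newest row; OFFSET N-1 for N rows
      )
    ORDER BY ts DESC
""").fetchall()

print(result)
```

This mirrors the structure of the first answer above: the scalar subquery finds the date of the Nth-newest row, and the outer query keeps everything from that day forward.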