I have a table that grows by tens of millions of rows each day. The rows in the table contain hourly information about page view traffic.
The indices on the table are on url and datetime.
I want to aggregate the information by day, rather than hourly. How should I do this? This is a query that exemplifies what I am trying to do:
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= "2012-08-29 00:00:00" AND datetime <= "2012-08-29 23:00:00"
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10;
The above query never finishes, though. There are millions of rows in the table. Is there a more efficient way that I can get this aggregate data?
Tens of millions of rows per day is quite a lot.
Assuming:
only 10 million new records per day;
your table contains only the columns that you mention in your question;
url is of type TEXT with a mean (Punycode) length of ~77 characters;
pageviews is of type INT;
int_views is of type INT;
ext_views is of type INT; and
datetime is of type DATETIME
then each day's data will occupy around 9.9 × 10⁸ bytes, which is almost 1 GiB/day. In reality it may be considerably more, because the above assumptions were quite conservative.
MySQL's maximum table size is determined, amongst other things, by the underlying filesystem on which its data files reside. If you're using the MyISAM engine (as suggested by your comment below) without partitioning on Windows or Linux, then a limit of a few GiB is not uncommon, which implies the table will reach its capacity well within a working week!
As @Gordon Linoff mentioned, you should partition your table. However, each table has a limit of 1024 partitions: with 1 partition/day (which would be eminently sensible in your case), you will be limited to storing under 3 years of data in a single table before the partitions start getting reused.
I would therefore advise that you keep each year's data in its own table, each partitioned by day. Furthermore, as @Ben explained, a composite index on (datetime, url) would help (I actually propose creating a date column from DATE(datetime) and indexing that, because it will enable MySQL to prune the partitions when performing your query); and, if row-level locking and transactional integrity are not important to you (for a table of this sort, they may not be), using MyISAM may not be daft:
CREATE TABLE news_2012 (
INDEX (date, url(100))
)
Engine = MyISAM
PARTITION BY HASH(TO_DAYS(date)) PARTITIONS 366
SELECT *, DATE(datetime) AS date FROM news WHERE YEAR(datetime) = 2012;
CREATE TRIGGER news_2012_insert BEFORE INSERT ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
CREATE TRIGGER news_2012_update BEFORE UPDATE ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
If you choose to use MyISAM, you can not only archive completed years (using myisampack) but can also replace your original table with a MERGE one comprising the UNION of all of your underlying year tables (an alternative that would also work in InnoDB would be to create a VIEW, but it would only be useful for SELECT statements as UNION views are neither updatable nor insertable):
DROP TABLE news;
CREATE TABLE news (
date DATE,
INDEX (date, url(100))
)
Engine = MERGE
INSERT_METHOD = FIRST
UNION = (news_2012, news_2011, ...)
SELECT * FROM news_2012 WHERE FALSE;
You can then run your above query (along with any other) on this merge table:
SELECT url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM news
WHERE date = '2012-08-29'
GROUP BY url
ORDER BY SUM(pageviews) DESC
LIMIT 10;
A few points:
As datetime is the only predicate that you're filtering on, you should probably have an index with datetime as the first column.
You're ordering by pageviews. I would have assumed that you want to order by sum(pageviews).
You're querying 23 hours of data not 24. You probably want to use an explicit less than, <, from midnight the next day to avoid missing anything.
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= '2012-08-29 00:00:00'
AND datetime < '2012-08-30 00:00:00'
GROUP BY url
ORDER BY sum(pageviews) DESC
LIMIT 10;
You could index this on (datetime, url, pageviews, int_views, ext_views) but I think that would be overkill; so, if the index isn't too big, (datetime, url) seems like a good way to go. The only way to be certain is to test it and decide whether any performance improvement in querying is worth the extra time taken in index maintenance.
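For reference, a hedged sketch of creating that (datetime, url) index (the index name and the 100-character prefix on url are assumptions, mirroring the prefix used elsewhere in this thread; a prefix is only needed if url is a TEXT or long VARCHAR column):
ALTER TABLE news ADD INDEX idx_datetime_url (datetime, url(100));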
As Gordon's just mentioned in the comments you may need to look into partitioning. This enables you to query a smaller "table" that is part of the larger one. If all your queries are based at the day level it sounds like you might need to create a new one each day.
The query is really simple, i.e.
SELECT
col1 , date_col
FROM table USE INDEX (device_date_col)
WHERE
device_id = "some_value"
AND date_col BETWEEN "2020-03-16 00:00:00" and "2020-04-16 00:00:00"
limit 1000000 ;
but it takes 30 to 60 seconds to finally return the result when run for the first time. After that it returns the result in under 10 seconds. Another problem is that when I change the device_id it again takes a long time. I cannot understand why this is happening despite using proper indexing.
We know that API Gateway has a 30-second limit, and because of this our API encounters timeouts. This started happening suddenly today.
The main goal is to retrieve minute-level data; it returns less data but still takes a long time, i.e.
....
AND col1 IS NOT NULL
GROUP BY
DATE(date_col),
HOUR(date_col),
MINUTE(date_col)
Below is some useful info:
AWS RDS instance db.m4.large (2 vCPUs and 8 GB RAM).
MySQL version 5.6.x
composite index on date_col and device_id
using InnoDB
the table has no id field (primary key)
total rows in the table: 7.5 million
each device has data every 3 seconds
the query returns around 600k rows
EXPLAIN shows the query is using the index
UPDATE
MySQL Workbench shows that when I run the query without GROUP BY it takes 2 seconds to execute but > 30 seconds to retrieve, and when I use GROUP BY the server takes > 30 seconds to execute but 2 seconds to retrieve.
I think we need:
more CPU for processing the data with GROUP BY
more RAM for fetching all the data (without GROUP BY)
The image below shows the query response without GROUP BY (note the duration/fetch time).
(original query)
SELECT col1 , date_col
FROM table USE INDEX (device_date_col)
WHERE device_id = "some_value"
AND date_col BETWEEN "2020-03-16 00:00:00"
AND "2020-04-16 00:00:00"
limit 1000000 ;
Discussion of INDEX(device_id, date_col, col1)
Start the index with the column(s) tested with =, namely device_id. This focuses the search somewhat.
Within that, further focus on the date range. So, add date_col to the index. You now have the optimal index for the WHERE.
Tack on all the other columns showing up anywhere in the SELECT, provided there are not too many of them and none are TEXT columns. Now you have a "covering" index. This allows the query to be performed using just the index's BTree, thereby giving a further boost in speed.
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
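For concreteness, a hedged sketch of creating that covering index (the index name is illustrative, and `table` is just the placeholder name used in the question):
ALTER TABLE `table`
  ADD INDEX device_date_col1 (device_id, date_col, col1);
With this index in place, the USE INDEX hint should no longer be needed.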
Other notes
LIMIT without ORDER BY is usually not meaningful -- you are at risk of getting a random set of rows.
That BETWEEN includes an extra midnight. I suggest
AND date_col >= "2020-03-16"
AND date_col < "2020-03-16" + INTERVAL 1 MONTH
Remove the USE INDEX -- It may help today, but it could hurt tomorrow, when the data changes or the constants change.
LIMIT 1000000 -- This could choke some clients. Do you really need that many rows? Perhaps more processing could be done in the database?
Adding on the GROUP BY -- Could there be two values for col1 within some of the minutes? Which value of col1 will you get? Consider MAX(col1), ANY_VALUE(col1), or GROUP_CONCAT(DISTINCT col1).
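Putting these notes together, one possible shape for the minutely aggregation is sketched below (MAX() is just one way to pick a value per minute; swap in whatever aggregate you actually need; `table` is the placeholder name from the question):
SELECT DATE_FORMAT(date_col, '%Y-%m-%d %H:%i:00') AS minute_start,
       MAX(col1) AS col1
FROM `table`
WHERE device_id = 'some_value'
  AND date_col >= '2020-03-16'
  AND date_col <  '2020-03-16' + INTERVAL 1 MONTH
  AND col1 IS NOT NULL
GROUP BY minute_start
ORDER BY minute_start;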
I'm facing a problem displaying data from a MySQL database.
I have a table with all user requests in this format:
| TIMESTAMP Time / +INDEX | Some other params |
I want to show this data on my website as a table with number of requests in each day.
The query is quite simple:
SELECT DATE(Time) as D, COUNT(*) as S FROM Stats GROUP BY D ORDER BY D DESC
But when looking into EXPLAIN this drives me mad:
Using index; Using temporary; Using filesort
The MySQL docs say that it creates a temporary table for this query on the hard drive.
How fast would it be with 1,000,000 records? And how fast with 100,000,000?
Is there any way to put INDEX on result of function?
Maybe I should create separate columns for DATE and TIME and then group by the DATE column?
What are other good ways of dealing with such problem? Caching? Another DB engine?
If you have an index on your Time column this operation is going to perform tolerably well. I'm guessing you do have that index, because your EXPLAIN output says it's using an index.
Why does this work well? Because MySQL can access this index in order -- it can scan the index -- to satisfy your query.
Don't be confused by Using temporary; Using filesort. This simply means MySQL needs to create and return a virtual table with a row for each day. That's pretty small and almost surely fits in memory. filesort doesn't necessarily mean the file has spilled to a temp file on disk; it just means MySQL has to sort the virtual table. It has to sort it to get the last day first.
By the way, if you can restrict the date range of the query you'll get predictable performance on this query even when your application has been in use for years. Try something like this:
SELECT DATE(Time) as D, COUNT(*) as S
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 30 DAY
GROUP BY D ORDER BY D DESC
First: a GROUP BY means sorting, and sorting is an expensive operation. The data in the index is sorted, but even in this case the database needs to group the dates. So I feel that indexing by DATE may help, as it will improve the speed of the query at the cost of refreshing another index on every insert. Please test it, I am not 100% sure.
Other alternatives are:
Using a partitioned table by month.
Using materialized views.
Updating a counter with every visit.
Precalculating and storing the data up to yesterday, and refreshing only today's count with a WHERE DATE(Time) = CURDATE() filter (see the sketch after this list). This way the server has to sort a much smaller amount of data.
It depends on how often users visit your page and when you need this data. Do not optimize prematurely if you do not need it.
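As a sketch of the precalculation idea, assuming a summary table along these lines (daily_stats and its columns are hypothetical; Stats and Time are the names from the question):
CREATE TABLE daily_stats (
  day      DATE NOT NULL PRIMARY KEY,
  requests INT UNSIGNED NOT NULL
);
INSERT INTO daily_stats (day, requests)
SELECT DATE(Time), COUNT(*)
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 1 DAY
  AND Time <  CURDATE()
GROUP BY DATE(Time)
ON DUPLICATE KEY UPDATE requests = VALUES(requests);
Running the INSERT once a day (e.g. from cron) keeps one row per completed day, so the website can read the tiny daily_stats table instead of grouping millions of rows.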
I have MySQL 5.6.12 Community Server.
I am trying to partition my MySQL InnoDB table, which contains 5M (and always growing) rows of history data. It is getting slower and slower and I figured partitioning would solve it.
I have these columns:
stationID int(4)
valueNumberID int(5)
logTime timestamp
value double
(stationID,valueNumberID,logTime) is my PRIMARY key.
I have 50 different stationID's. From each station comes history data and I need to store it for 5 years. There are only 2-5 different valueNumberID's from each stationID but hundreds of value changes per day. Each query in the system uses stationID,valueNumberID and logTime in that order. In most cases the queries are limited to current year.
I would like to partition by stationID, with each stationID having its own partition so that queries scan a smaller physical table, and further reduce the scanned size by subpartitioning on logTime. I do not know how to create a separate partition for each of the 50 different stationIDs and create subpartitions for them using the timestamp.
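For illustration only, a hedged sketch of one way to express that layout (the partition names, station values and subpartition count are placeholders; MySQL only allows HASH or KEY subpartitions beneath RANGE/LIST partitions, so pruning will mostly happen on the stationID partitions):
CREATE TABLE His (
  stationID     INT NOT NULL,
  valueNumberID INT NOT NULL,
  logTime       TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  value         DOUBLE,
  PRIMARY KEY (stationID, valueNumberID, logTime)
)
PARTITION BY LIST (stationID)
SUBPARTITION BY KEY (logTime)
SUBPARTITIONS 4 (
  PARTITION p_station1 VALUES IN (1),
  PARTITION p_station2 VALUES IN (2)
  -- ... one PARTITION ... VALUES IN (...) clause per stationID, 50 in total
);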
Thank you for your replies. SELECT queries are getting slower; to me it seems they slow down linearly with the rate at which the table grows. The issue must be with the GROUP BY. This is an example query:
SELECT DATE_FORMAT(logTime, '%Y%m%d%H%i%s') AS 'logTime', SUM(Value)
FROM His
WHERE stationID = 23
  AND valueNumberID = 4
  AND logTime > '2013-01-01 00:00:00'
  AND logTime < '2013-11-14 00:00:00'
GROUP BY DATE_FORMAT(logTime, '%Y%m')
ORDER BY logTime
LIMIT 0, 120;
The objective of this query (and queries like it) is to give AVG, MAX, MIN or SUM over hour, day, week or month intervals. The result of the query is tied tightly to how the results are presented to the user in various ways (graph, Excel file), and it would take a long time to change that if I changed the queries. So I was looking for an easy way out with partitioning.
An estimated 1.2-1.4M rows per month come into this table.
Thank you
Assume that we have a very large table, for example 3,000 rows of data.
We need to select all the rows whose status field is < 4.
We know that the relevant rows will be at most from 2 months ago (of course, each row has a date column).
Is this query the most efficient?
SELECT * FROM database.tableName WHERE status<4
AND DATE< '".date()-5259486."' ;
(date() is PHP; 5259486 seconds is roughly two months.)
Assuming you're storing dates as DATETIME, you could try this:
SELECT * FROM database.tableName
WHERE status < 4
AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH)
Also, for optimizing search queries you could use EXPLAIN ( http://dev.mysql.com/doc/refman/5.6/en/explain.html ) like this:
EXPLAIN [your SELECT statement]
Another point where you can tweak response times is by carefully placing appropriate indexes.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data.
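For example, an index on the date column used in the WHERE clause could be added like this (the index name is an assumption):
ALTER TABLE database.tableName ADD INDEX idx_date (DATE);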
Here are some explanations & tutorials on MySQL indexes:
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
However, keep in mind that using TIMESTAMP instead of DATETIME is more efficient; the former is 4 bytes; the latter is 8. They hold equivalent information (except for timezone issues).
3,000 rows of data is not large for a database. In fact, it is on the small side.
The query:
SELECT *
FROM database.tableName
WHERE status < 4 ;
Should run pretty quickly on 3,000 rows, unless each row is very, very large (say 10k). You can always put an index on status to make it run faster.
The query suggested by cassi.iup makes more sense:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH);
It will perform better with a composite index on (status, date); a sketch of that index follows below. My question is: do you want all rows with a status less than 4, or do you want all rows with a status less than 4 in the past two months? In the first case, you would have to continually change the query. You would be better off with:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < date('2013-06-19');
(as of the date when I am writing this.)
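A minimal sketch of that composite index (the index name is an assumption):
ALTER TABLE database.tableName ADD INDEX idx_status_date (status, DATE);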
My hosting company recently gave me this entry from the slow-query log. The rows examined seem excessive and might be helping to slow down the server. A test in phpMyAdmin resulted in duration of 0.9468 seconds.
The Check_in table ordinarily contains 10,000 to 17,000 rows. It also has one index: Num, unique = yes, cardinality = 10852, collation = A.
I would like to improve this query. The first five conditions following WHERE contain the fields to check to throw out duplicates.
# User#Host: fxxxxx_member[fxxxxx_member] # localhost []
# Query_time: 5 Lock_time: 0 Rows_sent: 0 Rows_examined: 701321
use fxxxxx_flifo;
SET timestamp=1364277847;
DELETE FROM Check_in USING Check_in,
Check_in as vtable WHERE
( Check_in.empNum = vtable.empNum )
AND ( Check_in.depCity = vtable.depCity )
AND ( Check_in.travelerName = vtable.travelerName )
AND ( Check_in.depTime = vtable.depTime )
AND ( Check_in.fltNum = vtable.fltNum )
AND ( Check_in.Num > vtable.Num )
AND ( Check_in.accomp = 'NO' )
AND Check_in.depTime >= TIMESTAMPADD ( MINUTE, 3, NOW() )
AND Check_in.depTime < TIMESTAMPADD ( HOUR, 26, NOW() );
Edit:
empNum int (6)
lastName varchar (30)
travelerName varchar (40) (99.9% = 'All')
depTime datetime
fltNum varchar (6)
depCity varchar (4)
23 fields total (including one blob, holding 25K images)
Edit:
ADD INDEX deleteQuery (empNum, lastName, travelerName, depTime, fltNum, depCity, Num)
Is this a matter of creating an index? If so, what type and what fields?
The last 3 conditions limit the number of rows by checking the accomplished flag and the time period. Could they be better positioned (earlier) in the query? Is the 5th AND ... necessary?
Open to all ideas. Thanks for looking.
It's hard to know exactly how to help without seeing the table definition.
Don't delete the self-join (the same table mentioned twice) because this query is clearing out duplicates (check_in.Num > vtable.Num).
Do you have an index on depTime? If not, add one.
You may also want to add a compound index on
(empNum,depCity,travelerName,depTime,fltNum)
to optimize the self-join. You probably have to muck about a bit to figure out what works.
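A hedged sketch of those suggestions (the index names are illustrative):
ALTER TABLE Check_in
  ADD INDEX idx_depTime (depTime),
  ADD INDEX idx_dedup_join (empNum, depCity, travelerName, depTime, fltNum);
Running EXPLAIN against an equivalent SELECT will show which index the optimizer actually picks.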
If your objective is to delete duplicates, then the solution is to avoid having duplicates in the first place: define a unique index across the fields that you deem to collectively define a duplicate (but you won't be able to create the index while you have duplicates in the database).
The index you need for this query is on (depTime, empNum, depCity, travelerName, fltNum, Num, accomp) in that order. The depTime field has to come first for it to optimize the 2 accesses on the table. Once you've removed the duplicates, make the index unique.
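In other words, something along these lines (the index name is an assumption; per the advice above, it can be rebuilt as UNIQUE once the duplicate rows have been deleted):
ALTER TABLE Check_in
  ADD INDEX idx_dedup (depTime, empNum, depCity, travelerName, fltNum, Num, accomp);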
Leaving that aside for now, you've got a whole load of performance problems.
1) you appear to be offering some sort of commercial service - so why are you waiting for your ISP to tell you that your site is running like a dog?
2) while your indexes should be designed to prevent duplicates, there are many cases where other indexes will help with performance - but in order to understand what those are you need to look at all the queries running against your data.
3) the blob should probably be in a separate table
Could they be better positioned (earlier) in the query?
Order of predicates at the same level in the query hierarchy has no impact on performance.
is the 5th AND necessary?
If you mean 'AND ( Check_in.Num > vtable.Num )', then yes - without that it will remove all the rows that are duplicated - i.e. it won't leave one row behind.
The purpose of indexes is to speed up searches and filters... the index is (in layman's terms) a sorted table that pinpoints each row of the data (which may itself be unsorted).
So, if you want to speed up your delete query, it would help to know where the data is. As a set of rules of thumb, you will need to add indexes on the following fields:
Every primary or foreign key
Every date on which you perform frequent searches / filters
Every numeric field on which you perform frequent searches / filters
I avoid indexes on text fields, since they are quite expensive (in terms of space), but if you need to perform frequent searches on text fields, you should also index them.