I am working on designing a large database. My application will have many rows; for example, I currently have one table with 4 million records. Most of my queries use a datetime clause to select data. Is it a good idea to index datetime fields in a MySQL database?
SELECT field1, field2, ..., field15
FROM table
WHERE field20 BETWEEN NOW() AND NOW() + INTERVAL 30 DAY
I am trying to keep my database performing well and my queries running smoothly.
Also, what should I keep in mind in order to design a high-efficiency database?
MySQL recommends using indexes for a variety of reasons, including the elimination of rows between conditions: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
This makes your datetime column an excellent candidate for an index if you are going to use it frequently in query conditions. If your only condition is BETWEEN NOW() AND DATE_ADD(NOW(), INTERVAL 30 DAY) and the column is not indexed, MySQL will have to do a full table scan on every query. I'm not sure how many rows are generated in 30 days, but as long as it's less than about 1/3 of the total rows, it will be more efficient to use an index on the column.
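As a minimal sketch (the table name events is hypothetical, used for illustration only), the index could be added and checked like this:

-- Hypothetical table/column names, for illustration only.
ALTER TABLE events ADD INDEX idx_field20 (field20);

-- EXPLAIN should now report a "range" access type instead of "ALL":
EXPLAIN SELECT field1, field2
FROM events
WHERE field20 BETWEEN NOW() AND NOW() + INTERVAL 30 DAY;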
Your question about creating an efficient database is very broad. I'd say just make sure it's normalized and that all appropriate columns are indexed (i.e. ones used in joins and WHERE clauses).
Here the author performed tests showing that an integer Unix timestamp is better than DATETIME. Note that he used MySQL, but I feel that no matter which DB engine you use, comparing integers is slightly faster than comparing dates, so an integer index is better than a DATETIME index. Take T1 as the time to compare two dates and T2 as the time to compare two integers. A search on an indexed field takes approximately O(log(rows)) time, because the index is based on some balanced tree; this may differ between DB engines, but log(rows) is the common estimate (unless you use a bitmap or R-tree based index). So the difference is (T1 - T2) * log(rows), which may matter if you run the query often.
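A minimal sketch of the integer-timestamp approach (table and column names are hypothetical):

-- Hypothetical schema storing a Unix timestamp as an integer.
CREATE TABLE events_int (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created_at INT UNSIGNED NOT NULL,
  INDEX idx_created_at (created_at)
) ENGINE=InnoDB;

-- The range scan then compares plain integers:
SELECT id FROM events_int
WHERE created_at BETWEEN UNIX_TIMESTAMP() AND UNIX_TIMESTAMP() + 30*24*3600;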
I know this question was asked years ago, but I just found my solution.
I added an index to a datetime column.
A query fetching the last 600 records sorted by datetime went from 1.6 seconds to 0.0028 seconds after adding the index. I'd say that's a win.
ALTER TABLE `database`.`table`
ADD INDEX `name_of_index` (`datetime_field_from_table`);
So my colleague created this query, which will run every hour on a table with 500K+ records.
DELETE FROM table WHERE timestamp > NOW() - INTERVAL 24 HOUR
I have a feeling this will be slow because it computes the time for each row. Am I right? How can I optimize it?
Update
With 2.8 million records, it took around 12 seconds to delete the matched rows.
I have a feeling this will be slow because it computes the time for each row. Am I right?
No, the time calculation is done once at the start of the query. It is a constant value for the duration of the query.
https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_now says:
NOW() returns a constant time that indicates the time at which the statement began to execute.
https://dev.mysql.com/doc/refman/8.0/en/where-optimization.html says:
Constant expressions used by indexes are evaluated only once.
You also asked:
How can I optimize it?
The easiest thing to do is make sure there is an index on the timestamp column.
A different solution is to use partitioning by the timestamp column, and drop 1 partition per day. This blog has a description of this solution: http://mysql.rjweb.org/doc.php/partitionmaint
run the query more frequently (say, hourly)
have an index on that column
PARTITION BY RANGE and use DROP PARTITION; I suggest partitioning by hour; see Partition (a sketch follows below)
More tips: http://mysql.rjweb.org/doc.php/deletebig
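As a hedged sketch of the partitioning idea (daily partitions on a hypothetical log_table with a DATETIME column; a TIMESTAMP column would need UNIX_TIMESTAMP() ranges instead, and MySQL requires the partitioning column to be part of every unique key, including the primary key):

ALTER TABLE log_table
PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
    PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
    PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- Dropping a whole partition is far cheaper than a row-by-row DELETE:
ALTER TABLE log_table DROP PARTITION p20240101;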
I have a monitoring table which holds monitoring data for some 200+ servers.
Each server adds 3 records of data to the table every minute of every day.
I hold 6 months of data for historical reports for customers, and as you can imagine the table gets pretty large.
My issue currently is that running SELECT queries on this table is taking an age.
I understand why; it's the sheer number of rows it's working through while performing the SELECT. I have tried to reduce the result set significantly by adding time lookups...
SELECT * FROM `host_monitoring_data`
WHERE parent_id = 47 AND timestamp > (NOW() - INTERVAL 5 MINUTE);
... but still I'm looking at a long time before the data is returned to me.
I'm used to working with fairly small tables and this is by far the biggest that I've ever worked with, so I'm not familiar with how to overcome this sort of issue.
Any help at all is vastly appreciated.
My table structure is currently id, parent_id, timestamp, type, U, A, T
U, A, T are Used/Available/Total; Type tells me what kind of measurable we are working with; Timestamp is exactly that; parent_id is the id of the parent host to which the data belongs; and id is an auto-incrementing id for the record in question.
When I'm doing lookups, I'm basically trying to get the most recent 20 rows where parent_id = x or whatever, so I just do...
SELECT u, a, t FROM host_monitoring_data
WHERE parent_id = X AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20
EDIT 1 - Including the results of EXPLAIN:
EXPLAIN SELECT * FROM `host_monitoring_data`
WHERE parent_id=36 AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20
id: 1
select_type: SIMPLE
table: host_monitoring_data
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2865454
Extra: Using where; Using filesort
Based on your EXPLAIN report, I see it says "type: ALL" which means it's scanning all the rows (the whole table) for every query.
You need an index to help it scan fewer rows.
Your first condition for parent_id=X is an obvious choice. You should create an index starting with parent_id.
The other condition on timestamp >= ... is probably the best second choice. Your index should include timestamp as the second column.
You can create this index this way:
ALTER TABLE host_monitoring_data ADD INDEX (parent_id, timestamp);
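After adding the index, re-running EXPLAIN should show an access type of ref or range instead of ALL, with a much smaller rows estimate (a hedged expectation, not verified against this data):

EXPLAIN SELECT u, a, t FROM host_monitoring_data
WHERE parent_id = 36 AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20;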
You might like my presentation How to Design Indexes, Really and a video of me presenting it: https://www.youtube.com/watch?v=ELR7-RdU9XU
P.S.: Please when you ask questions about query optimization, run SHOW CREATE TABLE <tablename> and include its output in your question. This shows us your columns, data types, current indexes, and constraints. Don't make us guess! Help us help you!
Three good tips:
EXPLAIN (as others said) will tell you what the query is doing and hint at how to do it better.
Avoid using "*"; instead, select only the fields you need.
Use PROCEDURE ANALYSE() to find the most appropriate data types for your columns (and change them if needed):
https://dev.mysql.com/doc/refman/5.7/en/procedure-analyse.html
Also, avoid using ORDER BY whenever you can.
I use following query frequently:
SELECT * FROM table WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime] AND publish = 1 AND type = 2 ORDER BY Timestamp
I would like to optimize this query, and I am thinking about making the timestamp part of the primary key for the clustered index. I think that if the timestamp is part of the primary key, inserted data will be written to disk sequentially by the timestamp field. I also think this would improve my query a lot, but I am not sure whether it would help.
The table has 3-4 million+ rows.
The timestamp field is never changed.
I use MySQL 5.6.11.
Another point: if this does improve my query, is it better to use TIMESTAMP (4 bytes in MySQL 5.6) or DATETIME (5 bytes in MySQL 5.6)?
Four million rows isn't huge.
A one-byte difference between the data types datetime and timestamp is the last thing you should consider in choosing between those two data types. Review their specs.
Making a timestamp part of your primary key is a bad, bad idea. Think about reviewing what primary key means in a SQL database.
Put an index on your timestamp column. Get an execution plan, and paste that into your question. Determine your median query performance, and paste that into your question, too.
Returning a single day's rows from an indexed, 4 million row table on my desktop computer takes 2ms. (It returns around 8000 rows.)
1) If the values of timestamp are unique, you can make it the primary key. If not, create an index on the timestamp column anyway, since you frequently use it in WHERE clauses.
2) Using a BETWEEN clause looks more natural here. I suggest you use a BTREE index (the default index type), not HASH.
3) When the timestamp column is indexed, you don't need to call ORDER BY; the rows already come back sorted
(of course, only if your index is BTREE, not HASH).
4) An integer unix_timestamp is better than datetime from both the memory-usage and the performance side: comparing dates is a more complex operation than comparing integers.
Searching an indexed field takes O(log(rows)) tree lookups. Comparing integers is O(1), while comparing dates is O(date_string_length). So the difference is (number of tree lookups) * (per-comparison difference) = O(log(rows)) * O(date_string_length).
I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data: "I'm going to look at this field quite often, so keep a list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're really tight on space, you should use them. If they don't apply, MySQL will just ignore them anyway.
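A minimal sketch using the names from the question (the index name is my own, and the placeholders are kept as-is):

ALTER TABLE MyTable ADD INDEX idx_timestamp (Timestamp);

-- The range query can then seek directly to the matching rows:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime];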
I don't disagree with the importance of indexing to improve SELECT query times, but if you can index on other keys (and form your queries around those indexes), an index on the timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead (see the sketch after the scenario below). In a table with many different users, this will considerably reduce the remaining set on which to search the timestamp.
...and if I'm not mistaken, the advantage of this would be avoiding the overhead of maintaining the timestamp index on each insertion; in a table with high insertion rates and highly unique timestamps, this could be an important consideration.
I'm struggling with the same problem of indexing based on timestamps and other keys. I still have testing to do to put proof behind what I say here, and I'll try to post back with my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% of the table size
Indexing on userId will quickly reduce query results to 20% of the table size
Indexing on category will quickly reduce query results to 75% of the table size
Insertion with an index on timestamp will have high overhead **
Even though we know our insertions will arrive with ever-increasing timestamps, I don't see any discussion of MySQL optimisations for monotonically increasing keys.
Insertion with an index on userId will have reasonably high overhead.
Insertion with an index on category will have reasonably low overhead.
** I'm sorry, I don't know the exact overhead of insertion with indexing.
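A hedged sketch of that userId-first alternative, with hypothetical table and column names; the userId index narrows the scan, and the timestamp filter then runs over the much smaller remaining set:

ALTER TABLE activity_log ADD INDEX idx_user (userId);

-- The index narrows to one user's rows; the timestamp bounds
-- (placeholders as in the question) filter the remainder:
SELECT * FROM activity_log
WHERE userId = 42
  AND timestamp > [SomeTime] AND timestamp < [SomeOtherTime];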
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
    ts INT NOT NULL
  , oldPK INT NOT NULL      -- assuming the previous primary key was an INT
  -- , ... other columns
  , PRIMARY KEY (ts, oldPK)
  , UNIQUE KEY (oldPK)
) ENGINE=InnoDB;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
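For example, a range query like the one in the question (placeholders kept as-is) would then read the rows sequentially from the clustered index:

SELECT * FROM perf
WHERE ts > [SomeTime] AND ts < [SomeOtherTime]
ORDER BY ts;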
The disadvantage is that your inserts will be a bit slower. Also, if you have other indexes on the table, they will use a bit more space (as they will include the 4-bytes-wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table, will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day, week, month, or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.
Hey, I have a very slow MySQL query. I'm sure all I need to do is add the correct index, but nothing I try works.
The query is:
SELECT DATE(DateTime) as 'SpeedDate', avg(LoadTime) as 'LoadTime'
FROM SpeedMonitor
GROUP BY Date(DateTime);
The Explain for the query is:
id: 1
select_type: SIMPLE
table: SpeedMonitor
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 7259978
Extra: Using temporary; Using filesort
And the table structure is:
CREATE TABLE `SpeedMonitor` (
`SMID` int(10) unsigned NOT NULL auto_increment,
`DateTime` datetime NOT NULL,
`LoadTime` double unsigned NOT NULL,
PRIMARY KEY (`SMID`)
) ENGINE=InnoDB AUTO_INCREMENT=7258294 DEFAULT CHARSET=latin1;
Any help would be greatly appreciated.
You're just asking for two columns in your query, so indexes could/should go there:
DateTime
LoadTime
Another way to speed up your query could be to split the DateTime field in two: date and time.
This way the db can group directly on the date field instead of calculating DATE(...).
EDITED:
If you prefer using a trigger, create a new DATE column called newdate, and try this (I can't try it now to see if it's correct):
-- Add the new column first (backfill existing rows separately if needed):
ALTER TABLE SpeedMonitor ADD COLUMN newdate DATE;

DELIMITER //
CREATE TRIGGER upd_check BEFORE INSERT ON SpeedMonitor
FOR EACH ROW
BEGIN
    SET NEW.newdate = DATE(NEW.DateTime);
END//
DELIMITER ;
EDITED AGAIN:
I've just created a db with the same speedmonitor table, filled with about 900,000 records.
Then I ran the query SELECT newdate, AVG(LoadTime) loadtime FROM speedmonitor GROUP BY newdate and it took about 100s!!
Removing the index on the newdate field (and clearing the cache using RESET QUERY CACHE and FLUSH TABLES), the same query took 0.6s!!!
Just for comparison: the query SELECT DATE(DateTime), AVG(LoadTime) loadtime FROM speedmonitor GROUP BY DATE(DateTime) took 0.9s.
So I suppose the index on newdate is not good: remove it.
I'm going to add as many records as I can now and test two queries again.
FINAL EDIT:
Removing the indexes on the newdate and DateTime columns, with 8 million records in the speedmonitor table, here are the results:
selecting and grouping on the newdate column: 7.5s
selecting and grouping on the DATE(DateTime) expression: 13.7s
I think that's a good speedup.
Times were taken by executing the queries in the mysql command prompt.
The problem is that you're using a function in your GROUP BY clause, so MySQL has to evaluate the expression Date(DateTime) on every record before it can group the results. I'd suggest adding a calculated field for Date(DateTime), which you could then index and see if that helps your performance.
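On MySQL 5.7+, one way to implement that calculated field is a stored generated column (a hedged sketch; the column and index names are my own):

ALTER TABLE SpeedMonitor
    ADD COLUMN SpeedDate DATE AS (DATE(DateTime)) STORED,
    ADD INDEX idx_speeddate (SpeedDate);

-- The GROUP BY can then use the plain, indexable column:
SELECT SpeedDate, AVG(LoadTime) AS LoadTime
FROM SpeedMonitor
GROUP BY SpeedDate;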
I hope you'll permit me to point out that before you put a table into production with millions of records you should seriously consider how that data is going to be used and plan accordingly.
What is happening right now is that your query cannot use any indexes and hence scans the entire table building a response. Not the fastest way to work with relatively large tables.
You have some things to consider if you want to get to a better state:
How fast is it collecting data?
How much history do you need?
How granular are your reporting requirements?
Are you able to suspend logging to make table changes?
If the answer to the last question is "No", you could always create a new table/solution and start writing records there... importing old data if/as needed.
Reporting granularity is important, as you could, for example, compress a day's worth of data into 24 records. Load the current day into an index-free loading table and then process it the next day into per-hour averages (a sketch follows below). Name each loading table based on the sample date, and you can delete old tables once processed.
Of course, hourly may not be fine grained enough.
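A hedged sketch of that roll-up (the dated loading table and the summary table names are hypothetical):

-- Roll yesterday's raw samples up into per-hour averages:
INSERT INTO hourly_averages (sample_hour, avg_load_time)
SELECT DATE_FORMAT(DateTime, '%Y-%m-%d %H:00:00') AS sample_hour,
       AVG(LoadTime) AS avg_load_time
FROM loading_20240101
GROUP BY sample_hour;

-- The processed loading table can then be dropped:
DROP TABLE loading_20240101;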
Depending on your retention needs, you might want to consider some type of partitioned storage. This can let you query against subsets of sample data and simply drop or archive old partitions when they are no longer current enough to be relevant.
Anyhow, you seem to be on the edge of having some type of massive sampling, reporting and/or monitoring system (particularly if you were reporting on a variety of sites or pages with different characteristics). You may want to put some effort into designing this so it will fit your needs... ;)