MySQL poor performance with a large table

I have a monitoring table which holds monitoring data for some 200+ servers.
Each server adds 3 records of data to the table every minute of every day.
I hold 6 months of data for historical reports for customers, and as you can imagine the table gets pretty large.
My issue currently is that running SELECT queries on this table is taking an age.
I understand why; it's the sheer number of rows it's working through whilst performing the SELECT, but I have tried to reduce the result set significantly by adding in time lookups...
SELECT * FROM `host_monitoring_data`
WHERE parent_id = 47 AND timestamp > (NOW() - INTERVAL 5 MINUTE);
... but still I'm looking at a long time before the data is returned to me.
I'm used to working with fairly small tables and this is by far the biggest that I've ever worked with, so I'm not familiar with how to overcome this sort of issue.
Any help at all is vastly appreciated.
My table structure is currently id, parent_id, timestamp, type, U, A, T
U, A, T are Used/Available/Total; Type tells me what kind of measurable we are working with; Timestamp is exactly that; parent_id is the id of the parent host to which the data belongs; and id is an auto-incrementing id for the record in question.
When I'm doing lookups, I'm basically trying to get the most recent 20 rows where parent_id = x or whatever, so I just do...
SELECT u,a,t from host_monitoring_data
WHERE parent_id=X AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20
EDIT 1 - Including the results of EXPLAIN:
EXPLAIN SELECT * FROM `host_monitoring_data`
WHERE parent_id=36 AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20
id : 1
select_type : SIMPLE
table : host_monitoring_data
type : ALL
possible_keys : NULL
key : NULL
key_len : NULL
ref : NULL
rows : 2865454
Extra : Using where; Using filesort

Based on your EXPLAIN report, I see it says "type: ALL" which means it's scanning all the rows (the whole table) for every query.
You need an index to help it scan fewer rows.
Your first condition for parent_id=X is an obvious choice. You should create an index starting with parent_id.
The other condition on timestamp >= ... is probably the best second choice. Your index should include timestamp as the second column.
You can create this index this way:
ALTER TABLE host_monitoring_data ADD INDEX (parent_id, timestamp);
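To confirm the new index is used, you can re-run the EXPLAIN from the question after creating it; a minimal check (the exact row estimate will vary):
EXPLAIN SELECT * FROM `host_monitoring_data`
WHERE parent_id=36 AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20;
-- "type" should change from ALL to range, "key" should show the new
-- (parent_id, timestamp) index, and "rows" should drop from ~2.8 million
-- to roughly the number of recent rows for that parent.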
You might like my presentation How to Design Indexes, Really and a video of me presenting it: https://www.youtube.com/watch?v=ELR7-RdU9XU
P.S.: When you ask questions about query optimization, please run SHOW CREATE TABLE <tablename> and include its output in your question. This shows us your columns, data types, current indexes, and constraints. Don't make us guess! Help us help you!

Three good tips:
EXPLAIN (as others said) will tell you what you are doing and give hints on how to do it better.
Avoid using "*"; instead, select only the fields you need.
Use PROCEDURE ANALYSE to find out the most suitable data types for your columns (and change them if needed); see the example after this list.
https://dev.mysql.com/doc/refman/5.7/en/procedure-analyse.html
I'd also avoid using ORDER BY whenever you can.
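A minimal sketch of the third tip against the question's table (this assumes MySQL 5.7 or earlier, since PROCEDURE ANALYSE was removed in 8.0):
SELECT * FROM host_monitoring_data PROCEDURE ANALYSE();
-- returns one row per column with the observed min/max values, average length,
-- and a suggested optimal data type.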

Related

Optimizing Datetime searches in huge MySQL InnoDB table

I am trying to optimize a big MySQL InnoDB Table with 50 million rows in it. It is a kind of a log. Each row contains some columns with information and a Datetime column.
These 50 million rows contain only 5-6 dates, so there are only a few distinct dates but with different hours, minutes and seconds. Each row has a unique ID (primary key). The DateTime column has an index.
The searches are performed on the date only (without using hours, minutes, and seconds), e.g.
select * from table where date(datetime_column) = '2021-03-08'
I've already tried to rewrite the queries without date() function, like:
select * from table where datetime_column >= '2021-03-08' and datetime_column <='2021-03-08 23:59:59'
But it's only a bit faster.
Also, I've created a new table, put the ID (the primary key from the main table), year, month, day, hour, minutes, and seconds into TINYINT columns (the year is INT(4)), made a combined index on them, and performed the select from the main table with a join to this new table. But it's still not fast enough, because the index on hours, minutes, and seconds becomes useless when those columns are not mentioned in the WHERE clause.
Also, I've thought about partitioning, but I think it won't help either.
Any ideas on how to solve it?
Change from
PRIMARY KEY(id),
INDEX(datetime)
to
PRIMARY KEY(datetime, id), -- to greatly speed up your range query
INDEX(id) -- sufficient to keep AUTO_INCREMENT happy
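A hedged sketch of that change as a single ALTER, using the column name from the question (log_table is a placeholder, and this assumes id is the AUTO_INCREMENT primary key):
ALTER TABLE log_table
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (datetime_column, id),
  ADD INDEX (id);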
Do not use DATE(datetime) = constant; it cannot use any index. Your other attempt can use an index in some situations. I like this way of phrasing it:
WHERE datetime >= '2021-03-08'
AND datetime < '2021-03-08' + INTERVAL 1 DAY
Oh, you say there is more to the WHERE? Let's see them; it may make a big difference! Also, let us know whether the datetime range does most of the filtering or the other clause(s) do more.
Many queries look something like
WHERE datetime in some range AND abc=123
That benefits from INDEX(abc, datetime), in that order. Pulling the PK trick above may also be beneficial: PRIMARY KEY(abc, datetime, id), INDEX(id).
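Hedged sketches of those two variants (abc and log_table are placeholders carried over from the example above):
ALTER TABLE log_table ADD INDEX (abc, datetime_column);
-- or, with the clustering trick:
ALTER TABLE log_table
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (abc, datetime_column, id),
  ADD INDEX (id);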

Optimize SQL to fetch 1 day data

I need to fetch the last 24 hours of data frequently, and this query runs often.
Since it scans many rows, running it frequently affects database performance.
MySQL's execution strategy picks the index on created_at, which returns approximately 100,000 rows; these rows are then scanned one by one to filter customer_id = 10, and my final result has 20,000 rows.
How can I optimize this query?
explain SELECT *
FROM `order`
WHERE customer_id = 10
and `created_at` >= NOW() - INTERVAL 1 DAY;
id : 1
select_type : SIMPLE
table : order
partitions : NULL
type : range
possible_keys : idx_customer_id, idx_order_created_at
key : idx_order_created_at
key_len : 5
ref : NULL
rows : 103357
filtered : 1.22
Extra : Using index condition; Using where
The first optimization I would do is on the access to the table:
create index ix1 on `order` (customer_id, created_at);
Then, if the query is still slow I would try appending the columns you are selecting to the index. If, for example, you are selecting the columns order_id, amount, and status:
create index ix1 on `order` (customer_id, created_at,
order_id, amount, status);
This second strategy could be beneficial, but you'll need to test it to find out what performance improvement it produces in your particular case.
The big improvement of this second strategy is that it walks the secondary index only, avoiding walking back to the primary clustered index of the table (which can be time-consuming).
Instead of two single-column indexes on customer_id and created_at, create a single composite index on (customer_id, created_at). This way the index engine can use BOTH parts of the WHERE clause instead of just hoping to get one. It jumps right to the customer ID, then jumps directly to the date desired, then gives the results. It SHOULD be very fast.
Additional Follow-up.
I hear your comment about having multiple indexes, but consider adding those columns into the main one instead, just after created_at, such as
( customer_id, created_at, updated_at, completion_time )
Then your queries could always include some help for the index in the WHERE clause. For example (I don't know your specific data): a record is created at some given point, and the updated and completion times will always be AFTER that. How long does it take (worst-case scenario) from creation to completion time... 2 days, 10 days, 90 days?
WHERE
    customer_id = ?
    AND created_at >= NOW() - INTERVAL 10 DAY
    AND updated_at >= NOW() - INTERVAL 1 DAY
Again, just an example, but if a person has thousands of orders and a relatively quick turn-around time, you could jump to the most recent ones and then find those updated within the time period. Again, just an option, as a single index vs 3, 4 or more indexes.
Since you seem to be dealing with a very fast-growing table, I would consider moving this frequent query to a cold table or a replica.
One more point: did you consider partitioning by customer_id? I don't quite understand the business logic behind querying customer_id = 10, but if it's a multi-tenancy application, try partitioning.
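A hedged sketch of what that might look like (order_id as the AUTO_INCREMENT primary key is an assumption, the 16-partition count is arbitrary, and MySQL requires the partitioning column to be part of every unique key, hence the primary key change; PARTITION BY has to be its own ALTER):
ALTER TABLE `order`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (order_id, customer_id);
ALTER TABLE `order`
  PARTITION BY HASH (customer_id) PARTITIONS 16;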
For this query:
SELECT o.*
FROM `order` o
WHERE o.customer_id = 10 AND
created_at >= NOW() - INTERVAL 1 DAY;
My first inclination would be a composite index on (customer_id, created_at) -- as others have suggested.
But, you appear to have a lot of data and many inserts per day. That suggests partitioning plus an index. The appropriate partition would be on created_at, probably on a daily basis, along with an index on customer_id.
A typical query would access the two most recent partitions. Because your queries are focused on recent data, this also reduces the memory occupied by the index, which might be an overall benefit.
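A hedged sketch of that layout (as in the previous sketch, the partitioning column must be part of the primary key; partition names and boundary dates are illustrative, and in practice a new partition would be added each day):
ALTER TABLE `order`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (order_id, created_at);
ALTER TABLE `order`
  PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p20230101 VALUES LESS THAN ('2023-01-02'),
    PARTITION p20230102 VALUES LESS THAN ('2023-01-03'),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
  );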
This technique should be better than all the other answers, though perhaps by only a small amount:
Instead of orders being indexed thus:
PRIMARY KEY(order_id) -- AUTO_INCREMENT
INDEX(customer_id, ...) -- created_at, and possibly others
do this to "cluster" the rows together:
PRIMARY KEY(customer_id, order_id)
INDEX (order_id) -- to keep AUTO_INCREMENT happy
Then you can optionally have more indexes starting with customer_id as needed. Or not.
Another issue: what will you do with 20K rows? That is a lot to feed to a client, especially of the human type. If you then munch on it, can't you make a more complex query that does more work and returns fewer rows? That will probably be faster.
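A hedged sketch of that idea, assuming the goal is a per-status summary rather than the raw rows (the amount and status columns are borrowed from the covering-index example above and may not exist in the real table):
SELECT status,
       COUNT(*)    AS orders_last_day,
       SUM(amount) AS total_amount
FROM `order`
WHERE customer_id = 10
  AND created_at >= NOW() - INTERVAL 1 DAY
GROUP BY status;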

is my large mysql table destined for failure?

I have built a mysql table on my local computer to store stock market data. The table name is minute_data, and the structure is simple enough:
You can see that I made the key column a combination of date and symbol -> concat(date,symbol). This way I do an insert ignore ... query to add data to the table without duplicating a date/symbol combination.
With this table, data retrieval is very simple. Say I wanted to get all the data for the symbol CSCO, then I could simply do this query:
select * from minute_data where symbol = "CSCO" order by date;
Everything has been "working". The table now has data for over 1000 symbols, with over 22 million rows already. I am thinking that it is not even half full for all 1000 symbols yet, so I am expecting the table to keep growing.
I am starting to see serious performance problems when querying this table. For example the following query (which I often want to do, to see the latest date for a particular symbol) takes well over 1 minute to complete, and only returns 1 row!
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query (which is also very important) is also taking over 1 minute on average:
select count(*), symbol from minute_data group by symbol;
The performance problems are making it unrealistic to keep working with the data in this way. These are the questions that I would like to ask the community:
Is it futile to continue building my data set into this table?
Is MySQL a bad choice altogether for a data set like this?
What can I do to this table to improve performance?
What kind of data structure should I use for this purpose (instead of a MySQL table)?
Thank You!
UPDATE
I am providing the output from EXPLAIN, which is the same for the following 2 queries:
explain select count(*), symbol from minute_data group by symbol;
explain select * from minute_data where symbol = "CSCO" order by date desc limit 1;
UPDATE 2
Pretty simple fix. I performed this query to remove the useless key_col that I had defined above, and made a primary key on 2 columns, date and symbol:
alter table minute_data drop primary key, add primary key (date,symbol);
Now I tried the following query, and it finished in less than 1 second:
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query still takes a long time to complete (72 seconds). I guess that's still because the query has to tabulate all 22 million rows in one go:
select count(*), symbol from minute_data group by symbol;
Your key_col is completely useless. Did you know that you can have a primary key over multiple columns? I'd recommend that you drop that column and create a new primary key on (date, symbol), in this order, since your date column has the higher cardinality. Additionally, you can then (if there's a need for it) create another unique index on (symbol, date). Post EXPLAINs of your most important queries. And what's the cardinality of symbol?
UPDATE:
What you can see in the EXPLAIN is that there's no index which can be used, so it scans the whole 22.5 million rows. Please have a try with the changes mentioned above. If you don't want to drop the key_col right now, you should at least add an index on the symbol column.
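A hedged sketch of those two suggestions, keeping the table and column names from the question:
ALTER TABLE minute_data ADD UNIQUE INDEX (symbol, date);
-- or, as a lighter first step while key_col is still in place:
ALTER TABLE minute_data ADD INDEX (symbol);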

Is it a good idea to index datetime field in mysql?

I am working on designing a large database. In my application I will have many rows; for example, I currently have one table with 4 million records. Most of my queries use a datetime clause to select data. Is it a good idea to index datetime fields in a MySQL database?
SELECT field1, field2, ....., field15
FROM table
WHERE field20 BETWEEN NOW() AND NOW() + INTERVAL 30 DAY
I am trying to keep my database working well and queries running smoothly.
More generally, what ideas do you have for building a high-efficiency database?
MySQL recommends using indexes for a variety of reasons including elimination of rows between conditions: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
This makes your datetime column an excellent candidate for an index if you are going to be using it in conditions frequently in queries. If your only condition is BETWEEN NOW() AND DATE_ADD(NOW(), INTERVAL 30 DAY) and you have no other index in the condition, MySQL will have to do a full table scan on every query. I'm not sure how many rows are generated in 30 days, but as long as it's less than about 1/3 of the total rows it will be more efficient to use an index on the column.
Your question about creating an efficient database is very broad. I'd say to just make sure that it's normalized and all appropriate columns are indexed (i.e. ones used in joins and where clauses).
Here the author performed tests showing that an integer Unix timestamp is better than DATETIME. Note that he used MySQL. But I feel that, no matter what DB engine you use, comparing integers is slightly faster than comparing dates, so an INT index is better than a DATETIME index. Take T1 as the time to compare 2 dates and T2 as the time to compare 2 integers. A search on an indexed field takes approximately O(log(rows)) time, because the index is based on some balanced tree; this may differ between DB engines, but log(rows) is a common estimate (if you don't use a bitmask or an R-tree based index). So the difference is (T2 - T1) * log(rows), which may play a role if you perform your query often.
I know this question was asked years ago, but I just found my solution.
I added an index to a datetime column.
Fetching the last 600 records sorted by datetime went from 1.6 seconds down to 0.0028 seconds after adding the index. I'd say it's a win.
ALTER TABLE `database`.`table`
ADD INDEX `name_of_index` (`datetime_field_from_table`);

Slow MySQL query

Hey I have a very slow MySQL query. I'm sure all I need to do is add the correct index but all the things I try don't work.
The query is:
SELECT DATE(DateTime) as 'SpeedDate', avg(LoadTime) as 'LoadTime'
FROM SpeedMonitor
GROUP BY Date(DateTime);
The Explain for the query is:
id : 1
select_type : SIMPLE
table : SpeedMonitor
type : ALL
possible_keys : NULL
key : NULL
key_len : NULL
ref : NULL
rows : 7259978
Extra : Using temporary; Using filesort
And the table structure is:
CREATE TABLE `SpeedMonitor` (
`SMID` int(10) unsigned NOT NULL auto_increment,
`DateTime` datetime NOT NULL,
`LoadTime` double unsigned NOT NULL,
PRIMARY KEY (`SMID`)
) ENGINE=InnoDB AUTO_INCREMENT=7258294 DEFAULT CHARSET=latin1;
Any help would be greatly appreciated.
You're just asking for two columns in your query, so indexes could/should go there:
DateTime
LoadTime
Another way to speed your query up could be to split the DateTime field in two: date and time.
This way the DB can group directly on the date field instead of calculating DATE(...).
EDITED:
If you prefer using a trigger, create a new DATE column called newdate, and try this (I can't try it now to see if it's correct):
DELIMITER //
CREATE TRIGGER upd_check BEFORE INSERT ON SpeedMonitor
FOR EACH ROW
BEGIN
  -- copy the date part of the incoming DATETIME into the helper column
  SET NEW.newdate = DATE(NEW.DateTime);
END//
DELIMITER ;
EDITED AGAIN:
I've just created a db with the same table speedmonitor filled with about 900,000 records.
Then I ran the query SELECT newdate, AVG(LoadTime) loadtime FROM speedmonitor GROUP BY newdate and it took about 100s!!
Removing the index on the newdate field (and clearing the cache using RESET QUERY CACHE and FLUSH TABLES), the same query took 0.6s!!!
Just for comparison: query SELECT DATE(DateTime),AVG(LoadTime) loadtime FROM speedmonitor GROUP BY DATE(DateTime) took 0.9s.
So I suppose that the index on newdate is not good: remove it.
I'm going to add as many records as I can now and test two queries again.
FINAL EDIT:
Removing the indexes on the newdate and DateTime columns, with 8 million records in the speedmonitor table, here are the results:
selecting and grouping on newdate column: 7.5s
selecting and grouping on DATE(DateTime) field: 13.7s
I think it's a good speedup.
Times were measured by executing the query inside the mysql command prompt.
The problem is that you're using a function in your GROUP BY clause, so MySQL has to evaluate the expression Date(DateTime) on every record before it can group the results. I'd suggest adding a calculated field for Date(DateTime), which you could then index and see if that helps your performance.
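On a MySQL version with generated columns (5.7 or later, which is an assumption here), a stored generated column is one way to get such a calculated field without a trigger; a hedged sketch (whether the extra index actually helps this full-table GROUP BY is worth testing, given the timings reported above):
ALTER TABLE SpeedMonitor
  ADD COLUMN SpeedDate DATE AS (DATE(`DateTime`)) STORED;
ALTER TABLE SpeedMonitor
  ADD INDEX idx_speeddate (SpeedDate);
-- the original query can then group on the plain column:
SELECT SpeedDate, AVG(LoadTime) AS LoadTime
FROM SpeedMonitor
GROUP BY SpeedDate;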
I hope you'll permit me to point out that before you put a table into production with millions of records you should seriously consider how that data is going to be used and plan accordingly.
What is happening right now is that your query cannot use any indexes and hence scans the entire table building a response. Not the fastest way to work with relatively large tables.
You have some things to consider if you want to get to a better state:
How fast is it collecting data?
How much history do you need?
How granular are your reporting requirements?
Are you able to suspend logging to make table changes?
If the answer is "No" to the last question you could always create a new table/solution and start writing records there... importing in old data if/as needed.
Reporting granularity is important as you could, for example, compress a day's worth of data into 24 records. Load the current day into an index free loading table and then process it the next day into per hour averages. Name each loading table based on the sample date and you can delete old tables as processed.
Of course, hourly may not be fine grained enough.
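A hedged sketch of an hourly roll-up along those lines (the summary table name and columns are illustrative, not from the original schema):
CREATE TABLE SpeedMonitorHourly (
  SpeedHour   DATETIME NOT NULL PRIMARY KEY,
  AvgLoadTime DOUBLE NOT NULL,
  SampleCount INT UNSIGNED NOT NULL
);
INSERT INTO SpeedMonitorHourly (SpeedHour, AvgLoadTime, SampleCount)
SELECT DATE_FORMAT(`DateTime`, '%Y-%m-%d %H:00:00') AS SpeedHour,
       AVG(LoadTime),
       COUNT(*)
FROM SpeedMonitor
WHERE `DateTime` >= CURDATE() - INTERVAL 1 DAY
  AND `DateTime` <  CURDATE()
GROUP BY SpeedHour;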
Depending on your retention needs you might want to consider some type of partitioned storage. This can let you query against subsets of sample data and simply drop or archive old partitions when they are no longer current enough to be relevant.
Anyhow, you seem to be on the edge of having some type of massive sampling, reporting and/or monitoring system (particularly if you were reporting on a variety of sites or pages with different characteristics). You may want to put some effort into designing this so it will fit your needs... ;)