I have an InnoDB table that stores all messages sent by my system. The table currently has 40 million rows and grows by 3/4 million per month.
My query basically selects the messages sent by a user within a date range. Here is a simplified CREATE TABLE:
CREATE TABLE `log` (
`id` int(10) NOT NULL DEFAULT '0',
`type` varchar(10) NOT NULL DEFAULT '',
`timeLogged` int(11) NOT NULL DEFAULT '0',
`orig` varchar(128) NOT NULL DEFAULT '',
`rcpt` varchar(128) NOT NULL DEFAULT '',
`user` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `timeLogged` (`timeLogged`),
KEY `user` (`user`),
KEY `user_timeLogged` (`user`,`timeLogged`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Note: I have individual indexes too because of other queries.
Query looks like this:
SELECT COUNT(*) FROM log WHERE timeLogged BETWEEN 1282878000 AND 1382878000 AND user = 20
The issue is that this query takes from 2 to 10 minutes, depending on the user and the server load, which is far too long to wait for a page to load. I have the MySQL query cache enabled and also cache in the application, but when a user searches a new range, the query doesn't hit either cache.
My questions are:
Would changing user_timeLogged index make any difference?
Is this a problem with MySQL and big databases? I mean, does Oracle or other DBs also suffer from this problem?
AFAIK, my indexes are correctly created and this query shouldn't take so long.
Thanks to anyone who helps!
You're using InnoDB but not taking full advantage of its clustered index (the primary key), since your typical query looks like this:
select <fields> from <table> where user_id = x and <datefield> between y and z
not
select <fields> from <table> where id = x
The following article should help you optimise your table design for your query.
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
If you understand the article correctly, you should find yourself with something like the following:
drop table if exists user_log;
create table user_log
(
user_id int unsigned not null,
created_date datetime not null,
log_type_id tinyint unsigned not null default 0, -- 1 byte vs varchar(10)
...
...
primary key (user_id, created_date, log_type_id)
)
engine=innodb;
Here are some query performance stats from the above design:
Counts
select count(*) as counter from user_log
counter
=======
37770394
select count(*) as counter from user_log where
created_date between '2010-09-01 00:00:00' and '2010-11-30 00:00:00'
counter
=======
35547897
User and date based queries (all queries run with cold buffers)
select count(*) as counter from user_log where user_id = 4755
counter
=======
7624
runtime = 0.215 secs
select count(*) as counter from user_log where
user_id = 4755 and created_date between '2010-09-01 00:00:00' and '2010-11-30 00:00:00'
counter
=======
7404
runtime = 0.015 secs
select
user_id,
created_date,
count(*) as counter
from
user_log
where
user_id = 4755 and created_date between '2010-09-01 00:00:00' and '2010-11-30 00:00:00'
group by
user_id, created_date
order by
counter desc
limit 10;
runtime = 0.031 secs
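For comparison, here is a rough sketch of how the question's own count might look against this design. It assumes the old Unix timestamps were converted to DATETIME when the data was loaded; FROM_UNIXTIME() converts using the session time zone, so the boundaries are only as exact as that conversion.

```sql
-- Count one user's messages in a date range.
-- user_id and created_date are the leading columns of the clustered
-- primary key, so matching rows sit next to each other on disk.
SELECT COUNT(*) AS counter
FROM user_log
WHERE user_id = 20
  AND created_date BETWEEN FROM_UNIXTIME(1282878000)
                       AND FROM_UNIXTIME(1382878000);
```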
Hope this helps :)
COUNT(*) is not loaded from the table cache because you have a WHERE clause. Run EXPLAIN, as @jason mentioned, and try changing it to COUNT(id) to see if that helps.
I could be wrong, but I also think that your indexes have to be in the same order as your WHERE clause. Since your WHERE clause uses timeLogged before user, your index should then be KEY `user_timeLogged` (`timeLogged`,`user`).
Again, EXPLAIN will tell you whether this index change makes a difference.
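A minimal sketch of that check, using the table and index names from the question. In the EXPLAIN output, the `key` column shows which index the optimizer picked, `rows` is its estimate of how many rows it will examine, and `Extra` contains "Using index" when the query is answered from the index alone.

```sql
-- Plan the optimizer chooses on its own.
EXPLAIN
SELECT COUNT(*)
FROM log
WHERE timeLogged BETWEEN 1282878000 AND 1382878000
  AND `user` = 20;

-- Same query, forced onto the composite index, to compare the estimates.
EXPLAIN
SELECT COUNT(*)
FROM log FORCE INDEX (user_timeLogged)
WHERE timeLogged BETWEEN 1282878000 AND 1382878000
  AND `user` = 20;
```

Repeating the second statement with a differently ordered index is a direct way to test whether column order matters for this query.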
I have a log table, but it becomes very slow when I sort it.
Here's my database table structure in short.
CREATE TABLE `webhook_logs` (
`ID` bigint(20) UNSIGNED NOT NULL,
`event_id` bigint(20) UNSIGNED DEFAULT NULL,
`object_id` bigint(20) UNSIGNED DEFAULT NULL,
`occurred_at` bigint(20) UNSIGNED DEFAULT NULL,
`payload` text COLLATE utf8mb4_unicode_520_ci,
`priority` bigint(1) UNSIGNED DEFAULT NULL,
`status` varchar(32) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;
ALTER TABLE `webhook_logs`
ADD PRIMARY KEY (`ID`),
ADD KEY `event_id` (`event_id`),
ADD KEY `object_id` (`object_id`),
ADD KEY `occurred_at` (`occurred_at`),
ADD KEY `priority` (`priority`),
ADD KEY `status` (`status`);
There are 5M+ records.
When I run
SELECT * FROM `webhook_logs` WHERE status = 'pending' AND occurred_at < 1652838913000 ORDER BY priority ASC LIMIT 100
it takes about 5 seconds to get the records.
However, when I remove the sorting and just run
SELECT * FROM `webhook_logs` WHERE status = 'pending' AND occurred_at < 1652838913000 LIMIT 100
it takes only 0.0022 seconds.
I've been playing around with the indexes to see if the time improves, but with no luck. I wonder if I'm doing something wrong here.
I tried creating a composite index on "occurred_at" and "priority", and another on "occurred_at", "priority" and "status". Neither improved the speed; the query still takes around 5 seconds. In case it helps, the server is running MySQL 5.7.12.
Any help will be appreciated. Thanks.
An index alone can't solve your problem. For your query, the DB must first find all records where "occurred_at < 1652838913000" and then sort them to get the records with the highest priority. No index can eliminate that sort.
But there is a solution, because priority only ever takes several values. You can create an index on (status, priority, occurred_at) and then write a query like this:
select * from (
(SELECT * FROM `webhook_logs` WHERE status = 'pending' and priority=1 AND occurred_at < 1652838913000 LIMIT 100)
union
(SELECT * FROM `webhook_logs` WHERE status = 'pending' and priority=2 AND occurred_at < 1652838913000 LIMIT 100)
) a ORDER BY priority asc LIMIT 100
In this query, the DB uses the index for each subquery of the union and then sorts only very few rows. The result can be returned in less than 0.1 seconds.
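As a rough sketch, the index this approach relies on could be created as follows (the index name is made up for illustration). The query is shown with UNION ALL: each branch selects a different priority value, so the branches cannot return the same row twice and the duplicate-removal pass of a plain UNION is not needed. The priority values 1 and 2 are taken from the example above; a real table would need one branch per value in use.

```sql
-- Hypothetical index name; the column order is the one suggested above.
ALTER TABLE webhook_logs
  ADD INDEX idx_status_priority_occurred (status, priority, occurred_at);

SELECT *
FROM (
  (SELECT * FROM webhook_logs
    WHERE status = 'pending' AND priority = 1
      AND occurred_at < 1652838913000
    LIMIT 100)
  UNION ALL
  (SELECT * FROM webhook_logs
    WHERE status = 'pending' AND priority = 2
      AND occurred_at < 1652838913000
    LIMIT 100)
) AS a
ORDER BY priority ASC
LIMIT 100;
```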
You don't need BIGINT for most of those columns. That datatype takes 8 bytes. There are much smaller datatypes. priority could be TINYINT UNSIGNED (1 byte, range of 0..255). status could be changed to a 1-byte ENUM. Such changes will shrink the data and index sizes, hence speed up most operations somewhat.
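A sketch of how those datatype changes might be applied. Only 'pending' is a status value known from the question; the other ENUM members below are placeholders and must be replaced with the values the first two queries actually report, otherwise existing rows would be rejected or blanked depending on the SQL mode. The ALTER rebuilds the table, so it is best tried on a copy first.

```sql
-- Check the real value ranges before narrowing anything.
SELECT MIN(priority), MAX(priority) FROM webhook_logs;
SELECT status, COUNT(*) FROM webhook_logs GROUP BY status;

-- 'processing', 'done' and 'failed' are hypothetical status values.
ALTER TABLE webhook_logs
  MODIFY priority TINYINT UNSIGNED DEFAULT NULL,
  MODIFY status ENUM('pending', 'processing', 'done', 'failed') DEFAULT NULL;
```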
Replace INDEX(status) with
INDEX(status, occurred_at, priority, id) -- in this order
Then your query will run somewhat faster, depending on the distribution of the data.
This might run even faster:
SELECT w.*
FROM (
SELECT id
FROM `webhook_logs`
WHERE status = 'pending'
AND occurred_at < 1652838913000
ORDER BY priority ASC
LIMIT 100
) AS t
JOIN webhook_logs AS w USING(id)
ORDER BY priority ASC -- yes, this is repeated
;
That is because it can pick the 100 ids from my index much faster, since the index is "covering", and then do 100 lookups to get "*".
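A minimal sketch of swapping the index and checking that the inner query is covered. The new index name is invented here; `status` is the name of the existing single-column index from the question's ALTER TABLE.

```sql
-- Replace the single-column index with the composite one.
ALTER TABLE webhook_logs
  DROP INDEX status,
  ADD INDEX idx_status_occurred_priority_id (status, occurred_at, priority, ID);

-- The id-only subquery touches no column outside the index, so the
-- Extra column of the plan should mention "Using index".
EXPLAIN
SELECT ID
FROM webhook_logs
WHERE status = 'pending'
  AND occurred_at < 1652838913000
ORDER BY priority ASC
LIMIT 100;
```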
My code in Laravel is:
Car::selectRaw('*,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update')
->leftJoin('car_prices', 'car_prices.car_id', 'cars.id')
->groupBy('car_prices.car_id')
->orderBy('latest_update', 'desc')
->paginate(10);
It runs for a long time and then fails with the error:
Maximum execution time of 60 seconds exceeded
The count of records in cars table is 100,000 and 6,000,000 in car_prices.
The tables structure:
CREATE TABLE `cars` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=110001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
CREATE TABLE `car_prices` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`car_id` bigint(20) unsigned NOT NULL,
`price` decimal(8,2) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `car_prices_car_id_foreign` (`car_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5506827 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The query:
select count(*) as aggregate
from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`;
select *,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`
order by `latest_update` desc
limit 10
offset 0;
How can I optimize it? Should I cache the data? Or is there a better query than this?
My hard disk is an SSD.
The value of innodb_flush_log_at_trx_commit is 1.
The number of writes/inserts is approximately 1000/second from 10 AM to 2 PM; before and after this period there are far fewer requests.
You need to either add a better index on the cars table for latest_update or remove ->orderBy('latest_update', 'desc') from the query and sort after receiving the results.
You can check the performance in MySQL with EXPLAIN:
EXPLAIN SELECT * FROM car order by latest_update desc;
Check this: https://www.exoscale.com/syslog/explaining-mysql-queries/#:~:text=the%20last%20decade.-,Explain,DELETE%20%2C%20REPLACE%20%2C%20and%20UPDATE%20.
and https://dev.mysql.com/doc/refman/5.7/en/using-explain.html#:~:text=The%20EXPLAIN%20statement%20provides%20information,%2C%20REPLACE%20%2C%20and%20UPDATE%20statements.&text=That%20is%2C%20MySQL%20explains%20how,joined%20and%20in%20which%20order.
Basically, you need to optimize (better index) your DB table "car" so that it performs well.
The other thing you might try is increasing the execution time.
In php.ini, set max_execution_time = 600 or more, just to check how much time the query needs to complete.
https://www.codewall.co.uk/increase-php-script-max-execution-time-limit-using-ini_set-function/
The query you have used is not suited to such large tables. Instead, whenever an entry is added to the car_prices table, compute the minimum and maximum values and store them in the cars table. Alternatively, you can set up a cron job for this.
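One way this could look, as a sketch only: the three summary columns, the index name, and the example values (car 123, price 19999.00) are all invented for illustration. The backfill runs once; the small UPDATE runs next to every insert into car_prices, or could be done in batches from a cron job.

```sql
-- Hypothetical summary columns on cars.
ALTER TABLE cars
  ADD COLUMN min_price DECIMAL(8,2) NULL,
  ADD COLUMN max_price DECIMAL(8,2) NULL,
  ADD COLUMN latest_price_update TIMESTAMP NULL,
  ADD INDEX idx_cars_latest_price_update (latest_price_update);

-- One-time backfill from the existing prices.
UPDATE cars AS c
JOIN (
  SELECT car_id,
         MIN(price)      AS min_price,
         MAX(price)      AS max_price,
         MAX(updated_at) AS latest_update
  FROM car_prices
  GROUP BY car_id
) AS p ON p.car_id = c.id
SET c.min_price = p.min_price,
    c.max_price = p.max_price,
    c.latest_price_update = p.latest_update;

-- Run alongside each new price row (example values only).
-- COALESCE handles a car that has no stored prices yet.
UPDATE cars
SET min_price = LEAST(COALESCE(min_price, 19999.00), 19999.00),
    max_price = GREATEST(COALESCE(max_price, 19999.00), 19999.00),
    latest_price_update = NOW()
WHERE id = 123;

-- The page query then reads cars only.
SELECT id, name, min_price, max_price, latest_price_update
FROM cars
ORDER BY latest_price_update DESC
LIMIT 10 OFFSET 0;
```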
In both queries, use
GROUP BY cars.id
instead of car_prices.car_id, which might be NULL because of the LEFT JOIN.
Once you have done that, the first query (with just the COUNT) can drop the JOIN. And then the GROUP BY becomes redundant:
select count(*) as aggregate
from `cars`
The second query has issues.
With the current design, you must go through all of both tables. Ugh.
Also... If there are no prices for a given car, it will have NULL for latest_update, therefore it will sort at the end of the 100,000 rows. Given that, you may as well not display those cars; this would simplify the query enough to be better optimized.
If you need to list the cars for which you have no prices, make that a separate request in the UI. That query will be a LEFT JOIN .. IS NULL and won't need the MAX()s.
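A sketch of the two separate requests described above, using only the columns from the question. The first lists cars that have prices (an inner join, so no NULL aggregates); the second lists cars with no price rows at all.

```sql
-- Cars with at least one price, newest price update first.
SELECT c.id,
       c.name,
       MIN(p.price)      AS min_price,
       MAX(p.price)      AS max_price,
       MAX(p.updated_at) AS latest_update
FROM cars AS c
JOIN car_prices AS p ON p.car_id = c.id
GROUP BY c.id, c.name
ORDER BY latest_update DESC
LIMIT 10 OFFSET 0;

-- Cars without any price: p.car_id is NULL only when no row matched.
SELECT c.id, c.name
FROM cars AS c
LEFT JOIN car_prices AS p ON p.car_id = c.id
WHERE p.car_id IS NULL;
```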
But, I am still concerned about the 10,000 pages that the user needs to paginate through.
Switch from MyISAM to InnoDB.
Toss created_at and updated_at, if you aren't using them for anything.
After that, cars is simply a mapping between id and name. This might allow you to avoid going through cars. Instead do something like
SELECT ( SELECT name FROM cars WHERE id = x.car_id ) AS name,
...
FROM ...
Another thought: whenever you add a row to car_prices, also update updated_at in cars. This would allow you to find the 10 cars entirely in cars.
Decide what you are willing to sacrifice.
More
Note: With MyISAM, a slow SELECT blocks UPDATE. With InnoDB, they can run in parallel; the SELECT uses the values from before the UPDATE. Either way, the SELECT sees the data as of some "point in time". But InnoDB allows more parallelism.
It is a tradeoff. A small slowdown in updates to achieve a big speedup on selects. (No, I don't know for sure that my suggestion is "faster")
Some further questions to analyze the tradeoff:
Disk: HDD or SSD?
Value of innodb_flush_log_at_trx_commit (after you change to InnoDB).
How much traffic? As a first cut, is the number of writes--insert/delete--more than 100/second?
I have to create a cron job, which is simple in itself, but because it will run every minute I'm worried about performance. I have two tables, one has user names and the other has details about their network. Most of the time a user will belong to just one network, but it is theoretically possible that they might belong to more, but even then very few, maybe two or three. So, in order to reduce the number of JOINs, I saved the network ids separated by | in a field in the user table, e.g.
|1|3|9|
The (simplified for this question) user table structure is
CREATE TABLE `users` (
`u_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`userid` VARCHAR(500) NOT NULL UNIQUE,
`net_ids` VARCHAR(500) NOT NULL DEFAULT '',
PRIMARY KEY (`u_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The (also simplified) network table structure is
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
PRIMARY KEY (`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I have to send a warning when a timeout occurs. My query is
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN users AS U ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');
I made N a subquery to reduce the number of rows joined. But I would like to know whether it would be faster to add a third table with u_id and n_id as columns, remove the net_ids column from users, and then join all three tables, because I read that using LIKE slows things down.
Which is the most efficient query to use in this case? One JOIN and a LIKE, or two JOINs?
P.S. I did some experimentation, and the initial timings for two JOINs are higher than for a JOIN and a LIKE. However, repeated runs of the same query seem to speed things up a lot. I suspect something is cached somewhere, either in my app or in the database, and both become comparable, so I did not find this data satisfactory. It also contradicts what I was expecting based on what I have been reading.
I used this table:
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
INDEX `u_id` (`u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
INDEX `n_id` (`n_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and this query:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id;
You should define composite indexes for the user_net table. One of them can (and should) be the primary key.
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`u_id`, `n_id`),
INDEX `uid_nid` (`n_id`, `u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
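If the pipe-separated net_ids column already holds data, a one-time migration into this table could reuse the LIKE join from the question. This is only a sketch and assumes every stored list has the |1|3|9| shape, with a leading and a trailing pipe.

```sql
-- Populate the link table from the old pipe-separated column.
-- INSERT IGNORE skips pairs that already exist under the primary key.
INSERT IGNORE INTO user_net (u_id, n_id)
SELECT U.u_id, N.n_id
FROM users AS U
JOIN network AS N
  ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');

-- Sanity check: users that have a non-empty list but no migrated row.
SELECT U.u_id, U.net_ids
FROM users AS U
LEFT JOIN user_net AS UN ON UN.u_id = U.u_id
WHERE U.net_ids <> ''
  AND UN.u_id IS NULL;
```

Once the check returns no rows, the net_ids column can be dropped.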
I would also rewrite your query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, N.timeout_mins, N.login_time), NOW()) < 60
While your subquery will probably not hurt much, there is no need for it.
Note that the last condition cannot use an index, because you have to combine two columns. If your MySQL version is at least 5.7.6 you can define an indexed virtual (calculated) column.
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
`is_open` TINYINT UNSIGNED,
`notify` TINYINT UNSIGNED,
`timeout_dt` DATETIME AS (`login_time` + INTERVAL `timeout_mins` MINUTE),
PRIMARY KEY (`n_id`),
INDEX (`timeout_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now change the query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.timeout_dt > NOW() - INTERVAL 60 SECOND
and it will be able to use the index.
You can also try to replace
INDEX (`timeout_dt`)
with
INDEX (`is_open`, `notify`, `timeout_dt`)
and see if it is of any help.
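For a table that already exists, the same change can be sketched with ALTER TABLE instead of a fresh CREATE TABLE. The index name is invented here, and is_open and notify are assumed to exist already, since the question's query filters on them. The column and the index are added in separate statements to keep each step simple.

```sql
-- Virtual generated column: computed on read, nothing stored in the row.
ALTER TABLE network
  ADD COLUMN timeout_dt DATETIME
    GENERATED ALWAYS AS (login_time + INTERVAL timeout_mins MINUTE) VIRTUAL;

-- Composite index: equality columns first, the range column last.
ALTER TABLE network
  ADD INDEX idx_open_notify_timeout (is_open, notify, timeout_dt);

-- Confirm the resulting definition.
SHOW CREATE TABLE network;
```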
Reformulate to avoid hiding columns inside functions. I can't grok your date expression, but note this:
login_time < NOW() - INTERVAL timeout_mins MINUTE
If you can achieve something like that, then this index should help:
INDEX(is_open, notify, login_time)
If that is not good enough, let's see the other formulation so we can compare them.
Having stuff separated by comma (or |) is likely to be a really bad idea.
Bottom line: Assume that JOINs are not a performance problem, write the queries with as many JOINs as needed. Then let's optimize that.
I have a table with a few million rows, something like this:
CREATE TABLE my_table (
`CONTVISITID` bigint(20) NOT NULL AUTO_INCREMENT,
`NODE_ID` bigint(20) DEFAULT NULL,
`CONT_ID` bigint(20) DEFAULT NULL,
`NODE_NAME` varchar(50) DEFAULT NULL,
`CONT_NAME` varchar(100) DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`HITS` bigint(20) DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CLIENT_TYPE` varchar(20) DEFAULT NULL,
`TYPE` bigint(1) DEFAULT NULL,
`PLAY_TIMES` bigint(20) DEFAULT NULL,
`FIRST_PUBLISH_TIME` bigint(20) DEFAULT NULL,
PRIMARY KEY (`CONTVISITID`),
KEY `cont_visit_contid` (`CONT_ID`),
KEY `cont_visit_createtime` (`CREATE_TIME`),
KEY `cont_visit_publishtime` (`FIRST_PUBLISH_TIME`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=57676834 DEFAULT CHARSET=utf8
I had a query that, starting from a flat SELECT, I have managed to optimize to the following:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits,type,first_publish_time
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
Can this be further optimized?
Edit:
I started with a flat SELECT, as I mentioned before. By "flat SELECT" I mean a single SELECT without a derived table, unlike my current query and like the single SELECT that someone responded with. A single SELECT is twice as slow, so it is not viable in my case.
Edit2: I have a DBA friend who suggested that I change the query to this:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
As I do not need the extra fields (type, first_publish_time), the temporary table is smaller, which makes the query faster: about 1/4 of the total time of the fastest version I had. He also suggested adding a composite index on (create_time, cont_id, hits). He says that with this index I will get really good performance, but I have not done that yet, as this is a production DB and the ALTER might affect replication. I will post results once done.
INDEX(type, first_publish_time)
INDEX(type, create_time)
Then do
SELECT cont_id, SUM(hits) AS tot_hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time > 1398310263000
AND type = 1
group by cont_id
order by tot_hits DESC
LIMIT 10;
Start the index with any = filters (type, in this case); then you get one chance to use a range.
The reason for 2 indexes: the Optimizer will look at the statistics and decide which one looks better based on the values given.
Consider shrinking the BIGINTs (8 bytes) to some smaller INT type. Saving space will help speed, especially if the table is too big to be cached.
For further discussion, please provide EXPLAIN SELECT ...;.
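A sketch of adding the two suggested indexes and producing the requested EXPLAIN. The index names are invented for illustration; the `key` column of the output shows which of the two the Optimizer chose for these particular values.

```sql
-- Hypothetical index names; equality column first, range column second.
ALTER TABLE my_table
  ADD INDEX idx_type_publish (`TYPE`, FIRST_PUBLISH_TIME),
  ADD INDEX idx_type_create  (`TYPE`, CREATE_TIME);

EXPLAIN
SELECT cont_id, SUM(hits) AS tot_hits
FROM my_table
WHERE create_time > '2017-03-10 00:00:00'
  AND first_publish_time > 1398310263000
  AND type = 1
GROUP BY cont_id
ORDER BY tot_hits DESC
LIMIT 10;
```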
I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds to process. That is perhaps acceptable now, but I'm worried about scalability as the table swells. The production system will likely have over 10000 pupils with 15-20 statuses each.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
Here are a few things you could try:
1. Use an INNER JOIN instead of the WHERE ... IN
SELECT ps.*
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date) AS status_date
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) x
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2. Store the value of NOW() in a variable. I am not sure whether the DB engine optimizes NOW() into a single call, but if it doesn't, this might help a bit.
These are only suggestions; you will need to compare the query plans to see whether there is any appreciable improvement.
Depending on how the query plan uses your indexes, robob's suggestion above could also come in handy.
Find out how long the query takes when you load the system with 10000 pupils, each with 15-20 statuses.
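A sketch combining both suggestions so the plans can be compared: the timestamp is captured once in a user variable (the name @now is arbitrary), and EXPLAIN is run on the join form. Running EXPLAIN on the original IN form as well shows whether the rewrite changes anything.

```sql
-- Capture the current time once.
SET @now := NOW();

EXPLAIN
SELECT ps.*
FROM pupil_status AS ps
INNER JOIN (
  SELECT status_pupil_id, MAX(status_date) AS status_date
  FROM pupil_status
  WHERE status_date < @now          -- ignore future statuses
  GROUP BY status_pupil_id
) AS x
  ON  ps.status_pupil_id = x.status_pupil_id
  AND ps.status_date     = x.status_date;
```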
Only refactor if it takes too long.
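One way to build such a test load, as a rough sketch. Everything here is invented for the test: the copy table pupil_status_load, the helper table digits, category ids 1 to 5, staff id 1, and dates spread at random from roughly 900 days in the past to 100 days in the future. It generates 10,000 pupils with 18 statuses each.

```sql
-- Work on a copy (same columns, indexes and engine) so real data is untouched.
CREATE TABLE pupil_status_load LIKE pupil_status;

-- Helper table holding the digits 0..9.
CREATE TABLE digits (d TINYINT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO digits (d) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- 10 x 10 x 10 x 10 pupils, each paired with 18 status slots.
INSERT INTO pupil_status_load
  (status_pupil_id, status_category_id, status_date,
   status_staff_id, status_notes)
SELECT
  1 + a.d + 10 * b.d + 100 * c.d + 1000 * e.d,   -- pupil ids 1..10000
  1 + FLOOR(RAND() * 5),                         -- made-up category ids 1..5
  NOW() + INTERVAL (8640000 - FLOOR(RAND() * 86400000)) SECOND,
  1,                                             -- made-up staff id
  'load test'
FROM digits AS a
CROSS JOIN digits AS b
CROSS JOIN digits AS c
CROSS JOIN digits AS e
CROSS JOIN (
  SELECT d FROM digits
  UNION ALL
  SELECT d + 10 FROM digits WHERE d < 8          -- slots 10..17
) AS s;

SELECT COUNT(*) FROM pupil_status_load;
```

Pointing the query at pupil_status_load instead of pupil_status then gives a timing at the expected production size.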