My code in Laravel is:
Car::selectRaw('*,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update')
->leftJoin('car_prices', 'car_prices.car_id', 'cars.id')
->groupBy('car_prices.car_id')
->orderBy('latest_update', 'desc')
->paginate(10);
It takes a long time to run before throwing an error:
Maximum execution time of 60 seconds exceeded
The cars table has 100,000 records and car_prices has 6,000,000.
The table structures:
CREATE TABLE `cars` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=110001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
CREATE TABLE `car_prices` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`car_id` bigint(20) unsigned NOT NULL,
`price` decimal(8,2) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `car_prices_car_id_foreign` (`car_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5506827 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The queries:
select count(*) as aggregate
from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`;
select *,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`
order by `latest_update` desc
limit 10
offset 0;
How can I optimize it? Should I cache the data? Or is there a better query than this?
My hard disk is SSD
Value of innodb_flush_log_at_trx_commit = 1
The number of writes/inserts is approximately 1,000/second from 10 AM to 2 PM; outside this period there are far fewer requests.
You need to either give the cars table a better index (e.g. a latest_update column that can be indexed), or remove ->orderBy('latest_update', 'desc') from the query and sort the results after receiving them.
You can check the performance in MySQL with EXPLAIN:
EXPLAIN SELECT * FROM cars ORDER BY latest_update DESC;
Check these for more on EXPLAIN:
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
Basically you need to optimize your cars table (with a better index) so that it performs well.
Another thing you might try is increasing the execution time.
In php.ini, set max_execution_time = 600 or more, just to check how much time the query needs to complete.
https://www.codewall.co.uk/increase-php-script-max-execution-time-limit-using-ini_set-function/
The query you have used is not apt for such large tables. Instead, whenever an entry comes into the car_prices table, compute the minimum and maximum values and store them in the cars table, or set up a cron job for this.
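A minimal sketch of that approach with a trigger, assuming you add hypothetical min_price, max_price, and latest_update columns to cars (the trigger name is made up too, and it only covers inserts, not updates or deletes of prices):

ALTER TABLE cars
ADD COLUMN min_price decimal(8,2) NULL,
ADD COLUMN max_price decimal(8,2) NULL,
ADD COLUMN latest_update timestamp NULL;

CREATE TRIGGER car_prices_after_insert
AFTER INSERT ON car_prices
FOR EACH ROW
UPDATE cars
SET min_price = LEAST(COALESCE(min_price, NEW.price), NEW.price),    -- keep a running MIN
    max_price = GREATEST(COALESCE(max_price, NEW.price), NEW.price), -- keep a running MAX
    latest_update = NEW.updated_at                                   -- remember the newest price change
WHERE id = NEW.car_id;

With the summary columns maintained this way, the paginated listing never has to touch the 6,000,000-row car_prices table.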
In both queries, use
GROUP BY cars.id
instead of car_prices.car_id, which can be NULL because of the LEFT JOIN.
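With that change, the main query becomes something like this (a sketch; the select list is narrowed to cars.* so every non-aggregated column is functionally dependent on the grouping key):

SELECT cars.*,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update
FROM cars
LEFT JOIN car_prices ON car_prices.car_id = cars.id
GROUP BY cars.id
ORDER BY latest_update DESC
LIMIT 10 OFFSET 0;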
Once you have done that, the first query (with just the COUNT) can drop the JOIN. And then the GROUP BY becomes redundant:
select count(*) as aggregate
from `cars`
The second query has issues.
With the current design, you must go through all of both tables. Ugh.
Also... If there are no prices for a given car, it will have NULL for latest_update, therefore it will sort at the end of the 100,000 rows. Given that, you may as well not display those cars; this would simplify the query enough to be better optimized.
If you need to list the cars for which you have no prices, make that a separate request in the UI. That query will be a LEFT JOIN .. IS NULL and won't need the MAX()s.
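That separate query could look something like this (a sketch):

-- cars that have no prices at all
SELECT cars.*
FROM cars
LEFT JOIN car_prices ON car_prices.car_id = cars.id
WHERE car_prices.car_id IS NULL;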
But, I am still concerned about the 10,000 pages that the user needs to paginate through.
Switch from MyISAM to InnoDB.
Toss created_at and updated_at, if you aren't using them for anything.
After that, cars is simply a mapping between id and name. This might allow you to avoid going through cars. Instead do something like
SELECT ( SELECT name FROM cars WHERE id = x.car_id ) AS name,
...
FROM ...
Another thought: whenever you add a row to car_prices, also update updated_at in cars. This would allow you to find the 10 newest cars entirely within cars.
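A sketch of what that buys you, assuming cars.updated_at is bumped on every car_prices insert (the index name is arbitrary):

ALTER TABLE cars ADD INDEX idx_cars_updated_at (updated_at);

-- the 10 most recently updated cars, with no join and no GROUP BY
SELECT id, name, updated_at
FROM cars
ORDER BY updated_at DESC
LIMIT 10;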
Decide what you are willing to sacrifice.
More
Note: With MyISAM, a slow SELECT blocks UPDATEs. With InnoDB, they can run in parallel; the SELECT uses the values as of before the UPDATE. Either way, the SELECT sees the data at some "point in time". But InnoDB allows more parallelism.
It is a tradeoff: a small slowdown in updates to achieve a big speedup on selects. (No, I don't know for sure that my suggestion is "faster".)
Some further questions to analyze the tradeoff:
Disk: HDD or SSD?
Value of innodb_flush_log_at_trx_commit (after you change to InnoDB).
How much traffic? As a first cut: is the number of writes (inserts/deletes) more than 100/second?
Related
I have a large table with user votes. I tried nearly every tutorial and essay about INDEX usage, and after that failed, every possible combination of fields as keys, but the query stays slow.
Is there any index I can use to speed this up?
(I will spare you my hideous attempts at indices so far...)
CREATE TABLE IF NOT EXISTS `votes` (
`uid` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`objectId` bigint(15) NOT NULL,
`vote` tinyint(1) NOT NULL,
`created` datetime NOT NULL,
UNIQUE KEY `unique input` (`uid`,`objectId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The table has around 1.3 million rows and will continue growing. This is the query I am trying:
EXPLAIN SELECT objectId, COUNT(uid) AS voteCount, AVG(vote) AS rating
FROM votes
GROUP BY objectId
Any pointers?
The only approach I could suggest is the following, although I don't know if it would increase performance. It assumes that you have an Objects table, with one row per ObjectId:
SELECT ObjectId,
(select count(*) from votes v where v.objectid = o.objectid) as votecount,
(select avg(vote) from votes v where v.objectid = o.objectid) as rating
FROM objects o;
Then you want the following index: votes(objectid, vote).
This would replace the outer group by with index scans, which may speed up the query.
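For reference, that index could be created like this (the index name is arbitrary):

ALTER TABLE votes ADD INDEX idx_objectid_vote (objectId, vote);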
I don't see how an index will help to speed this up, because the average function will require that you interact with each and every row. There's no WHERE clause.
Maybe you could create a VIEW that would amortize the cost for you.
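If you try that, a sketch might look like the following; note that MySQL views are not materialized, so this centralizes the query rather than caching its result:

CREATE VIEW vote_summary AS
SELECT objectId, COUNT(uid) AS voteCount, AVG(vote) AS rating
FROM votes
GROUP BY objectId;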
I think duffymo is correct. However, you could try swapping the two columns in your unique key around (or adding an index on objectId only), as it may help the GROUP BY.
I have a site where there is an activity feed, similar to how social sites like Facebook have one. It is a "newest first" list that describes actions taken by users. In production, there's about 200k entries in that table.
Since this is going to be asked anyway, I'll first share the full table structure:
CREATE TABLE `karmalog` (
`id` int(11) NOT NULL auto_increment,
`guid` char(36) default NULL,
`user_id` int(11) default NULL,
`user_name` varchar(45) default NULL,
`user_avat_url` varchar(255) default NULL,
`user_sec_id` int(11) default NULL,
`user_sec_name` varchar(45) default NULL,
`user_sec_avat_url` varchar(255) default NULL,
`event` enum('EDIT_PROFILE','EDIT_AVATAR','EDIT_EMAIL','EDIT_PASSWORD','FAV_IMG_ADD','FAV_IMG_ADDED','FAV_IMG_REMOVE','FAV_IMG_REMOVED','FOLLOW','FOLLOWED','UNFOLLOW','UNFOLLOWED','COM_POSTED','COM_POST','COM_VOTE','COM_VOTED','IMG_VOTED','IMG_UPLOAD','LIST_CREATE','LIST_DELETE','LIST_ADMINDELETE','LIST_VOTE','LIST_VOTED','IMG_UPD','IMG_RESTORE','IMG_UPD_LIC','IMG_UPD_MOD','IMG_GEO','IMG_UPD_MODERATED','IMG_VOTE','IMG_VOTED','TAG_FAV_ADD','CLASS_DOWN','CLASS_UP','IMG_DELETE','IMG_ADMINDELETE','IMG_ADMINDELETEFAV','SET_PASSWORD','IMG_RESTORED','IMG_VIEW','FORUM_CREATE','FORUM_DELETE','FORUM_ADMINDELETE','FORUM_REPLY','FORUM_DELETEREPLY','FORUM_ADMINDELETEREPLY','FORUM_SUBSCRIBE','FORUM_UNSUBSCRIBE','TAG_INFO_EDITED','IMG_ADDSPECIE','IMG_REMOVESPECIE','SPECIE_ADDVIDEO','SPECIE_REMOVEVIDEO','EARN_MEDAL','JOIN') NOT NULL,
`event_type` enum('follow','tag','image','class','list','forum','specie','medal','user') NOT NULL,
`active` bit(1) NOT NULL,
`delete` bit(1) NOT NULL default '\0',
`object_id` int(11) default NULL,
`object_cache` text,
`object_sec_id` int(11) default NULL,
`object_sec_cache` text,
`karma_delta` int(11) NOT NULL,
`gold_delta` int(11) NOT NULL,
`newkarma` int(11) NOT NULL,
`newgold` int(11) NOT NULL,
`migrated` int(11) NOT NULL default '0',
`date_created` timestamp NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `user_sec_id` (`user_sec_id`),
KEY `image_id` (`object_id`),
KEY `date_event` (`date_created`,`event`),
KEY `event` (`event`),
KEY `date_created` (`date_created`),
CONSTRAINT `karmalog_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE SET NULL,
CONSTRAINT `karmalog_ibfk_2` FOREIGN KEY (`user_sec_id`) REFERENCES `user` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Before optimizing this table, my query had 5 joins and I ran into slow query times. I have denormalized all of that data, so not a single join remains; the table and query are flat.
As you can see in the table design, there's an "event" field which is an enum, holding a few dozen possible values. Throughout the site, I show activity feeds based on specific event types. Typically that query looks like this:
SELECT * FROM karmalog as k
WHERE k.event IN ($events) AND k.delete=0
ORDER BY k.date_created DESC, k.id DESC
LIMIT 0,30
What this query does is to find the latest 30 entries in the total set that match any of the events passed in $events, which can be multiple.
Due to removing the joins and having indices on most fields, I was expecting this to perform very well, but it doesn't. On 200k entries, it still takes over 3 seconds and I don't understand why.
Regarding solutions, I know I could archive older entries or partition the table per event type, but that will have quite a code impact, and I first would like to understand why the above is so slow.
As a temporary work-around, I'm now doing this:
SELECT * FROM
(SELECT * FROM karmalog ORDER BY date_created DESC, id DESC LIMIT 0,1000) as karma
WHERE karma.event IN ($events) AND karma.delete=0
LIMIT $page,$pagesize
What this does is to limit the baseset to search in to the latest 1000 entries only, hoping and guessing that there's 30 entries to be found for the filters that I pass in. It's not very robust though. It will not work for more rare events, and it brings pagination issues.
Therefore, I'd first like to get to the root cause of why my initial query is slow, against my expectation.
Edit: I was asked to share the execution plan. Here's the test query:
EXPLAIN SELECT * FROM karmalog
WHERE event IN ('FAV_IMG_ADD','FOLLOW','COM_POST','IMG_VOTE','LIST_VOTE','JOIN','CLASS_UP','LIST_CREATE','FORUM_REPLY','FORUM_CREATE','FORUM_SUBSCRIBE','IMG_GEO','IMG_ADDSPECIE','SPECIE_ADDVIDEO','EARN_MEDAL') AND karmalog.delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,36
Execution plan:
id = 1
select_type = SIMPLE
table = karmalog
type = range
possible_keys = event
key = event
key_len = 1
ref = NULL
rows = 80519
Extra = Using where; Using filesort
I'm not sure how to read into the above, but I do know that the sort clause really seems to kill this query. With this sorting, it takes 4.3 secs, without 0.03 secs.
SELECT * sometimes slows down ordered queries by a huge amount, so let's start by refactoring your query as follows:
SELECT k.*
FROM karmalog AS k
JOIN (
SELECT id
FROM karmalog
WHERE event IN ($events)
AND delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,30
) AS m ON k.id = m.id
ORDER BY k.date_created DESC, k.id DESC
This will do your ORDER BY ... LIMIT operation without having to haul the whole table around in the sorting phase. Finally it will look up the appropriate thirty rows from the original table and sort just those again. This might save a whole lot of I/O and in-memory data shuffling.
Second, if id column values are assigned in ascending order as records are inserted, then the use of date_created in your ORDER BY operation is redundant. But MySQL doesn't know that, so leaving it out might help. This will be true if you always use the current date when inserting, and never update the dates.
Third, you might be able to use a compound covering index for the selection (inner) query. This is an index that contains all the fields you need. When you use a covering index, the whole query can be satisfied from the index, and there's no need to bounce back to the original table. This saves disk access time.
Try this compound covering index: (delete, event, id). If you decide you can't get rid of the use of date_created in your ordering, try this instead: (delete, event, date_created, id)
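Those indexes could be added like so (the index names are arbitrary; delete is a reserved word, hence the backticks):

ALTER TABLE karmalog ADD INDEX del_event_id (`delete`, `event`, `id`);
-- or, if date_created must stay in the ORDER BY:
ALTER TABLE karmalog ADD INDEX del_event_date_id (`delete`, `event`, `date_created`, `id`);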
Add a compound index over the two relevant columns. In your table, you can do that by specifying e.g.
KEY `date_created` (`date_created`, `event`)
This key can still be used to satisfy plain old date_created range searching. But in addition, the event data is included as well, so the DBMS will be able to find the relevant rows by looking only at the index.
If you want, you can try the other order as well: first event and then date. This might allow some optimization if there are many event types but your filter only contains few. On the other hand, I'm not sure the system will be able to make use of the LIMIT clause in this case, so I'm not certain that this other order will be any help at all.
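If you want to experiment with that order, the key would be declared like this (the name is arbitrary):

KEY `event_date` (`event`, `date_created`)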
Edit: I completely missed that your date_event index already has this info. According to your execution plan, though, that one isn't used. Looks like the optimizer is getting things wrong. You could try removing the event index, and perhaps the date index as well, and see what happens then.
I have a problem similar to
SQL: selecting rows where column value changed from previous row
The accepted answer by ypercube, which I adapted to:
CREATE TABLE `schange` (
`PersonID` int(11) NOT NULL,
`StateID` int(11) NOT NULL,
`TStamp` datetime NOT NULL,
KEY `tstamp` (`TStamp`),
KEY `personstate` (`PersonID`, `StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `states` (
`StateID` int(11) NOT NULL AUTO_INCREMENT,
`State` varchar(100) NOT NULL,
`Available` tinyint(1) NOT NULL,
`Otherstatuseshere` tinyint(1) NOT NULL,
PRIMARY KEY (`StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT
COALESCE((@statusPre <> s.Available), 1) AS statusChanged,
c.PersonID,
c.TStamp,
s.*,
@statusPre := s.Available
FROM schange c
INNER JOIN states s USING (StateID),
(SELECT @statusPre:=NULL) AS d
WHERE PersonID = 1 AND TStamp > "2012-01-01" AND TStamp < "2013-01-01"
ORDER BY TStamp ;
The query itself worked just fine in testing, and with the right mix of temporary tables I was able to generate reports with daily sum availability from a huge pile of data in virtually no time at all.
The real problem came when I discovered that the tables were using the MyISAM engine, which we have completely abandoned. I recreated the tables with InnoDB, and noticed the query no longer works as expected.
After some bashing of head into wall, I discovered that MyISAM seems to go over the columns of each row in order (selecting statusChanged before updating @statusPre), while InnoDB seems to do all the variable assigning first, and only after that populate the result rows, regardless of whether the assigning happens in the SELECT or WHERE clauses, in functions (COALESCE, comparisons, etc.), in subqueries or otherwise.
Trying to accomplish this in a query without variables always seems to end the same way: a subquery requiring exponentially more time the more rows are in the set, resulting in an excruciating minutes- (or hours-) long wait to get the beginning and ending events for one status, while a finished report should include daily sums of multiple statuses.
Can this type of query work on the InnoDB engine, and if so, how should one go about it?
Or is the only feasible option to go for a database product that supports WITH statements?
Removing
KEY personstate (PersonID, StateID)
fixes the problem.
No idea why, though. But it was not really required anyway; the timestamp key is the more important one and speeds up the query nicely.
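As a statement, the fix described above is simply:

ALTER TABLE schange DROP KEY personstate;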
I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1,400 statuses in the table, the query takes 0.185 seconds to process. Perhaps acceptable now, but I'm worried about scalability as the table swells. It is likely that the production system will have over 10,000 pupils, each with 15-20 statuses.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
There are a few things you could try:
1. Use an INNER JOIN instead of the WHERE ... IN subquery:
SELECT *
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) X
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2. Store the value of NOW() in a variable; I am not sure whether the DB engine optimizes this into a single call to NOW(), but if it doesn't, this might help a bit (see the sketch below).
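A sketch of suggestion 2 with a user variable:

SET @now := NOW();

SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < @now
GROUP BY status_pupil_id;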
These are some suggestions; however, you will need to compare the query plans and see whether there is any appreciable improvement.
Based on your usage of indexes as per the Query plan, robob's suggestion above could also come in handy
Find out how long the query takes when you load the system with 10,000 pupils, each with 15-20 statuses.
Only refactor if it takes too long.
I'm running the following query on a Macbook Pro 2.53ghz with 4GB of Ram:
SELECT
c.id AS id,
c.name AS name,
c.parent_id AS parent_id,
s.domain AS domain_name,
s.domain_id AS domain_id,
NULL AS stats
FROM
stats s
LEFT JOIN stats_id_category sic ON s.id = sic.stats_id
LEFT JOIN categories c ON c.id = sic.category_id
GROUP BY
c.name
It takes about 17 seconds to complete.
EXPLAIN:
(The EXPLAIN output was posted as a screenshot: http://img7.imageshack.us/img7/1364/picture1va.png)
The tables:
Information:
Number of rows: 147397
Data size: 20.3MB
Index size: 1.4MB
Table:
CREATE TABLE `stats` (
`id` int(11) unsigned NOT NULL auto_increment,
`time` int(11) NOT NULL,
`domain` varchar(40) NOT NULL,
`ip` varchar(20) NOT NULL,
`user_agent` varchar(255) NOT NULL,
`domain_id` int(11) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`referrer` varchar(400) default NULL,
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147398 DEFAULT CHARSET=utf8
Information second table:
Number of rows: 1285093
Data size: 11MB
Index size: 17.5MB
Second table:
CREATE TABLE `stats_id_category` (
`stats_id` int(11) NOT NULL,
`category_id` int(11) NOT NULL,
KEY `stats_id` (`stats_id`,`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Information third table:
Number of rows: 161
Data size: 3.9KB
Index size: 8KB
Third table:
CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`parent_id` int(11) default NULL,
`name` varchar(40) NOT NULL,
`questions_category_id` int(11) NOT NULL default '0',
`rank` int(2) NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=205 DEFAULT CHARSET=latin1
Hopefully someone can help me speed this up.
I see several WTF's in your query:
You use two LEFT OUTER JOINs but then you group by the c.name column which might have no matches. So perhaps you don't really need an outer join? If that's the case, you should use an inner join, because outer joins are often slower.
You are grouping by c.name but this gives ambiguous results for every other column in your select-list. I.e. there might be multiple values in these columns in each grouping by c.name. You're lucky you're using MySQL, because this query would simply give an error in any other RDBMS.
This is a performance issue because the GROUP BY is likely causing the "using temporary; using filesort" you see in the EXPLAIN. This is a notorious performance-killer, and it's probably the single biggest reason this query is taking 17 seconds. Since it's not clear why you're using GROUP BY at all (using no aggregate functions, and violating the Single-Value Rule), it seems like you need to rethink this.
You are grouping by c.name which doesn't have a UNIQUE constraint on it. You could in theory have multiple categories with the same name, and these would be lumped together in a group. I wonder why you don't group by c.id if you want one group per category.
SELECT NULL AS stats: I don't understand why you need this. It's kind of like creating a variable that you never use. It shouldn't harm performance, but it's just another WTF that makes me think you haven't thought this query through very well.
You say in a comment you're looking for number of visitors per category. But your query doesn't have any aggregate functions like SUM() or COUNT(). And your select-list includes s.domain and s.domain_id which would be different for every visitor, right? So what value do you expect to be in the result set if you only have one row per category? This isn't really a performance issue either, it just means your query results don't tell you anything useful.
Your stats_id_category table has an index over its two columns, but no primary key. So you can easily get duplicate rows, and this means your count of visitors may be inaccurate. You need to drop that redundant index and use a primary key instead. I'd order category_id first in that primary key, so the join can take advantage of the index.
ALTER TABLE stats_id_category DROP KEY stats_id,
ADD PRIMARY KEY (category_id, stats_id);
Now you can eliminate one of your joins, if all you need to count is the number of visitors:
SELECT c.id, c.name, c.parent_id, COUNT(*) AS num_visitors
FROM categories c
INNER JOIN stats_id_category sic ON (sic.category_id = c.id)
GROUP BY c.id;
Now the query doesn't need to read the stats table at all, or even the stats_id_category table. It can get its count simply by reading the index of the stats_id_category table, which should eliminate a lot of work.
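You can verify that with EXPLAIN; once the primary key above is in place, the Extra column should report "Using index" for stats_id_category:

EXPLAIN SELECT c.id, c.name, c.parent_id, COUNT(*) AS num_visitors
FROM categories c
INNER JOIN stats_id_category sic ON (sic.category_id = c.id)
GROUP BY c.id;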
You are missing the third table in the information provided (categories).
Also, it seems odd that you are doing a LEFT JOIN and then using the right table (which might be all NULLS) in the GROUP BY. You will end up grouping all of the non-matching rows together as a result, is that what you intended?
Finally, can you provide an EXPLAIN for the SELECT?
Harrison is right; we need the other table. I would start by adding an index on category_id to stats_id_category, though.
I agree with Bill. Point 2 is very important; the query doesn't even make logical sense. Also, the simple fact that there is no WHERE clause means that you have to pull back every row in the stats table, which seems to be around 140,000. It then has to sort all that data so that it can perform the GROUP BY. This is because sorting [O(n log n)] and then finding duplicates [O(n)] is much faster than finding duplicates without sorting the data set [O(n^2)?].