Selecting rows where column value changed from previous row, user variables, innodb - mysql

I have a problem similar to
SQL: selecting rows where column value changed from previous row
The accepted answer by ypercube which i adapted to
CREATE TABLE `schange` (
`PersonID` int(11) NOT NULL,
`StateID` int(11) NOT NULL,
`TStamp` datetime NOT NULL,
KEY `tstamp` (`TStamp`),
KEY `personstate` (`PersonID`, `StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `states` (
`StateID` int(11) NOT NULL AUTO_INCREMENT,
`State` varchar(100) NOT NULL,
`Available` tinyint(1) NOT NULL,
`Otherstatuseshere` tinyint(1) NOT NULL,
PRIMARY KEY (`StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT
COALESCE((#statusPre <> s.Available), 1) AS statusChanged,
c.PersonID,
c.TStamp,
s.*,
#statusPre := s.Available
FROM schange c
INNER JOIN states s USING (StateID),
(SELECT #statusPre:=NULL) AS d
WHERE PersonID = 1 AND TStamp > "2012-01-01" AND TStamp < "2013-01-01"
ORDER BY TStamp ;
The query itself worked just fine in testing, and with the right mix of temporary tables i was able to generate reports with daily sum availability from a huge pile of data in virtually no time at all.
The real problem comes in when i discovered that the tables where using the MyISAM engine, which we have completely abandoned, recreated the tables to use InnoDB, and noticed the query no longer works as expected.
After some bashing head into wall i have discovered that MyISAM seems to go over the columns each row in order (selecting statusChanged before updating #statusPre), while InnoDB seems to do all the variable assigning first, and only after that it populates result rows, regardless if the assigning happens in the select or where clauses, in functions (coalesce, greater etc), subqueries or otherwise.
Trying to accomplish this in a query without variables seems to always end the same way, a subquery requiring exponentially more time to process the more rows are in the set, resulting in a excrushiating minutes (or hours) long wait to get beginning and ending events for one status, while a finished report should include daily sums of multiple.
Can this type of query work on the InnoDB engine, and if so, how should one go about it?
or is the only feasible option to go for a database product that supports WITH statements?

Removing
KEY personstate (PersonID, StateID)
fixes the problem.
No idea why tho, but it was not really required anyway, the timestamp key is the more important one and speeds up the query nicely.

Related

Laravel - How to optimize MIN - MAX - orderBy queries?

My code in Laravel is:
Car::selectRaw('*,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update')
->leftJoin('car_prices', 'car_prices.car_id', 'cars.id')
->groupBy('car_prices.car_id')
->orderBy('latest_update', 'desc')
->paginate(10);
It takes long time to run until throwing error:
Maximum execution time of 60 seconds exceeded
The count of records in cars table is 100,000 and 6,000,000 in car_prices.
The tables structure:
CREATE TABLE `cars` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=110001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
CREATE TABLE `car_prices` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`car_id` bigint(20) unsigned NOT NULL,
`price` decimal(8,2) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `car_prices_car_id_foreign` (`car_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5506827 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The query:
select count(*) as aggregate
from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`;
select *,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`
order by `latest_update` desc
limit 10
offset 0;
How can I optimize it? Should I cache the data? Or there is some better query than this?
My hard disk is SSD
Value of innodb_flush_log_at_trx_commit = 1
The number of writes/inserts approximately 1000/second from 10AM - 02PM and before and after this period there are much less requests.
U need to either have better car table unique index latest_update or remove ->orderBy('latest_update', 'desc') in query. and sort it after receiving the results
U can check the performance in mysql with explain
EXPLAIN SELECT * FROM car order by latest_update desc;
/// Check this https://www.exoscale.com/syslog/explaining-mysql-queries/#:~:text=the%20last%20decade.-,Explain,DELETE%20%2C%20REPLACE%20%2C%20and%20UPDATE%20.
and https://dev.mysql.com/doc/refman/5.7/en/using-explain.html#:~:text=The%20EXPLAIN%20statement%20provides%20information,%2C%20REPLACE%20%2C%20and%20UPDATE%20statements.&text=That%20is%2C%20MySQL%20explains%20how,joined%20and%20in%20which%20order.
Basically u need to optimize (better index) your DB table "car" so that it perform well
And other thing u might to try increasing execution time
In php.ini u need to set max_execution_time = 600 or something more to just check how much time it needed to complete execution.
https://www.codewall.co.uk/increase-php-script-max-execution-time-limit-using-ini_set-function/
The query you have used is not apt for such large tables. instead whenever entry coming to the table car_prices set a operation and take minimum and maximum value and store it in the cars table. or you can setup a crone for this.
In both queries,
GROUP BY cars.id
This is instead of using car_prices.car_id, which might be missing because of the LEFT JOIN.
Once you have done that, the first query (with just the COUNT) can drop the JOIN. And then the GROUP BY becomes redundant:
select count(*) as aggregate
from `cars`
The second query has issues.
With the current design, you must go through all of both tables. Ugh.
Also... If there are no prices for a given car, it will have NULL for latest_update, therefore it will sort at the end of the 100,000 rows. Given that, you may as well not display those cars; this would simplify the query enough to be better optimized.
If you need to list the cars for which you have no prices, make that a separate request in the UI. That query will be a LEFT JOIN .. IS NULL and won't need the MAX()s.
But, I am still concerned about the 10,000 pages that the user needs to paginate through.
Switch from MyISAM to InnoDB.
Toss created_at and updated_at, if you aren't using them for anything.
After that, cars is simply a mapping between id and name. This might allow you to avoid going through cars. Instead do something like
SELECT ( SELECT name FROM cars WHERE id = x.car_id ) AS name,
...
FROM ...
Another thought that whenever you add a row to car_prices, you update updated_at in cars. This would allow you to find the 10 cars entirely in cars.
Decide what you are willing to sacrifice.
More
Note: With MyISAM, a slow SELECT blocks UPDATE. With InnoDB, the can run in parallel; the SELECT uses the values before the UPDATE. Either way, the select is at some "point in time". But InnoDB allows more parallelism.
It is a tradeoff. A small slowdown in updates to achieve a big speedup on selects. (No, I don't know for sure that my suggestion is "faster")
Some further questions to analyze the tradeoff:
Disk: HDD or SSD?
Value of innodb_flush_log_at_trx_commit (after you change to InnoDB).
How much traffic? As a first cut, is the number of writes--insert/delete--more than 100/second?

What is the optimal index for this DB table?

I have a table for storing stats. Currently this is populated with about 10 million rows at the end of the day then copied to daily stats table and deleted. For this reason I can't have an auto-incrementing primary key.
This is the table structure:
CREATE TABLE `stats` (
`shop_id` int(11) NOT NULL,
`title` varchar(255) CHARACTER SET latin1 NOT NULL,
`created` datetime NOT NULL,
`mobile` tinyint(1) NOT NULL DEFAULT '0',
`click` tinyint(1) NOT NULL DEFAULT '0',
`conversion` tinyint(1) NOT NULL DEFAULT '0',
`ip` varchar(20) CHARACTER SET latin1 NOT NULL,
KEY `shop_id` (`shop_id`,`created`,`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have a key on shop_id, created, ip but I'm not sure what columns I should use to create the optimal index to increase lookup speeds any further?
The query below takes about 12 seconds with no key and about 1.5 seconds using the index above:
SELECT DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane')) AS `date`, COUNT(*) AS `views`
FROM `stats`
WHERE `created` <= '2017-07-18 09:59:59'
AND `shop_id` = '17515021'
AND `click` != 1
AND `conversion` != 1
GROUP BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'))
ORDER BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'));
If there is no column (or combination of columns) that is guaranteed unique, then do have an AUTO_INCREMENT id. Don't worry about truncating/deleting. (However, if the id does not reset, you probably need to use BIGINT, not INT UNSIGNED to avoid overflow.)
Don't use id as the primary key, instead, PRIMARY KEY(shop_id, created, id), INDEX(id).
That unconventional PK will help with performance in 2 ways, while being unique (due to the addition of id). The INDEX(id) is to keep AUTO_INCREMENT happy. (Whether you DELETE hourly or daily is a separate issue.)
Build a Summary table based on each hour (or minute). It will contain the count for such -- 400K/hour or 7K/minute. Augment it each hour (or minute) so that you don't have to do all the work at the end of the day.
The summary table can also filter on click and/or conversion. Or it could keep both, if you need them.
If click/conversion have only two states (0 & 1), don't say != 1, say = 0; the optimizer is much better at = than at !=.
If they 2-state and you changed to =, then this becomes viable and much better: INDEX(shop_id, click, conversion, created) -- created must be last.
Don't bother with TZ when summarizing into the Summary table; apply the conversion later.
Better yet, don't use DATETIME, use TIMESTAMP so that you won't need to convert (assuming you have TZ set correctly).
After all that, if you still have issues, start over on the Question; there may be further tweaks.
In your where clause, Use the column first which will return the small set of results and so on and create the index in the same order.
You have
WHERE created <= '2017-07-18 09:59:59'
AND shop_id = '17515021'
AND click != 1
AND conversion != 1
If created will return the small number of set as compare to other 3 columns then you are good otherwise you that column at first position in your where clause then select the second column as per the same explanation and create the index as per you where clause.
If you think order is fine then create an index
KEY created_shopid_click_conversion (created,shop_id, click, conversion);.

Flat MySQL table with enum-based filters is unexpectedly slow

I have a site where there is an activity feed, similar to how social sites like Facebook have one. It is a "newest first" list that describes actions taken by users. In production, there's about 200k entries in that table.
Since this is going to be asked anyway, I'll first share the full table structure:
CREATE TABLE `karmalog` (
`id` int(11) NOT NULL auto_increment,
`guid` char(36) default NULL,
`user_id` int(11) default NULL,
`user_name` varchar(45) default NULL,
`user_avat_url` varchar(255) default NULL,
`user_sec_id` int(11) default NULL,
`user_sec_name` varchar(45) default NULL,
`user_sec_avat_url` varchar(255) default NULL,
`event` enum('EDIT_PROFILE','EDIT_AVATAR','EDIT_EMAIL','EDIT_PASSWORD','FAV_IMG_ADD','FAV_IMG_ADDED','FAV_IMG_REMOVE','FAV_IMG_REMOVED','FOLLOW','FOLLOWED','UNFOLLOW','UNFOLLOWED','COM_POSTED','COM_POST','COM_VOTE','COM_VOTED','IMG_VOTED','IMG_UPLOAD','LIST_CREATE','LIST_DELETE','LIST_ADMINDELETE','LIST_VOTE','LIST_VOTED','IMG_UPD','IMG_RESTORE','IMG_UPD_LIC','IMG_UPD_MOD','IMG_GEO','IMG_UPD_MODERATED','IMG_VOTE','IMG_VOTED','TAG_FAV_ADD','CLASS_DOWN','CLASS_UP','IMG_DELETE','IMG_ADMINDELETE','IMG_ADMINDELETEFAV','SET_PASSWORD','IMG_RESTORED','IMG_VIEW','FORUM_CREATE','FORUM_DELETE','FORUM_ADMINDELETE','FORUM_REPLY','FORUM_DELETEREPLY','FORUM_ADMINDELETEREPLY','FORUM_SUBSCRIBE','FORUM_UNSUBSCRIBE','TAG_INFO_EDITED','IMG_ADDSPECIE','IMG_REMOVESPECIE','SPECIE_ADDVIDEO','SPECIE_REMOVEVIDEO','EARN_MEDAL','JOIN') NOT NULL,
`event_type` enum('follow','tag','image','class','list','forum','specie','medal','user') NOT NULL,
`active` bit(1) NOT NULL,
`delete` bit(1) NOT NULL default '\0',
`object_id` int(11) default NULL,
`object_cache` text,
`object_sec_id` int(11) default NULL,
`object_sec_cache` text,
`karma_delta` int(11) NOT NULL,
`gold_delta` int(11) NOT NULL,
`newkarma` int(11) NOT NULL,
`newgold` int(11) NOT NULL,
`migrated` int(11) NOT NULL default '0',
`date_created` timestamp NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `user_sec_id` (`user_sec_id`),
KEY `image_id` (`object_id`),
KEY `date_event` (`date_created`,`event`),
KEY `event` (`event`),
KEY `date_created` (`date_created`),
CONSTRAINT `karmalog_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE SET NULL,
CONSTRAINT `karmalog_ibfk_2` FOREIGN KEY (`user_sec_id`) REFERENCES `user` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Before optimizing this table, my query had 5 joins and I ran into slow query times. I have denormalized all of that data, so that not a single join is there anymore. So the table and query is flat.
As you can see in the table design, there's an "event" field which is an enum, holding a few dozen possible values. Throughout the site, I show activity feeds based on specific event types. Typically that query looks like this:
SELECT * FROM karmalog as k
WHERE k.event IN ($events) AND k.delete=0
ORDER BY k.date_created DESC, k.id DESC
LIMIT 0,30
What this query does is to find the latest 30 entries in the total set that match any of the events passed in $events, which can be multiple.
Due to removing the joins and having indices on most fields, I was expecting this to perform very well, but it doesn't. On 200k entries, it still takes over 3 seconds and I don't understand why.
Regarding solutions, I know I could archive older entries or partition the table per event type, but that will have quite a code impact, and I first would like to understand why the above is so slow.
As a temporary work-around, I'm now doing this:
SELECT * FROM
(SELECT * FROM karmalog ORDER BY date_created DESC, id DESC LIMIT 0,1000) as karma
WHERE karma.event IN ($events) AND karma.delete=0
LIMIT $page,$pagesize
What this does is to limit the baseset to search in to the latest 1000 entries only, hoping and guessing that there's 30 entries to be found for the filters that I pass in. It's not very robust though. It will not work for more rare events, and it brings pagination issues.
Therefore, I first like to get to the root cause of why my initial query is slow, against my expectation.
Edit: I was asked to share the execution plan. Here's the test query:
EXPLAIN SELECT * FROM karmalog
WHERE event IN ('FAV_IMG_ADD','FOLLOW','COM_POST','IMG_VOTE','LIST_VOTE','JOIN','CLASS_UP','LIST_CREATE','FORUM_REPLY','FORUM_CREATE','FORUM_SUBSCRIBE','IMG_GEO','IMG_ADDSPECIE','SPECIE_ADDVIDEO','EARN_MEDAL') AND karmalog.delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,36
Execution plan:
id = 1
select_type = SIMPLE
table = karmalog
type = range
possible_keys = event
key = event
key_len = 1
red = NULL
rows = 80519
Extra = Using where; Using filesort
I'm not sure how to read into the above, but I do know that the sort clause really seems to kill this query. With this sorting, it takes 4.3 secs, without 0.03 secs.
SELECT * sometimes slows down ordered queries by a huge amount, so let's start by refactoring your query as follows:
SELECT k.*
FROM karmalog AS k
JOIN (
SELECT id
FROM karmalog
WHERE event IN ($events)
AND delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,30
) AS m ON k.id = m.id
ORDER BY k.date_created DESC, k.id DESC
This will do your ORDER BY ... LIMIT operation without having to haul the whole table around in the sorting phase. Finally it will look up the appropriate thirty rows from the original table and sort just those again. This might save a whole lot of I/O and in-memory data shuffling.
Second, if id column values are assigned in ascending order as records are inserted, then the use of date_created in your ORDER BY operation is redundant. But MySQL doesn't know that, so leaving it out might help. This will be true if you always use the current date when inserting, and never update the dates.
Third, you might be able to use a compound covering index for the selection (inner) query. This is an index that contains all the fields you need. When you use a covering index, the whole query can be satisfied from the index, and there's no need to bounce back to the original table. This saves disk access time.
Try this compound covering index: (delete, event, id). If you decide you can't get rid of the use of date_created in your ordering, try this instead: (delete, event, date_created, id)
Add a compound index over the two relevant questions. In your table, you can do that by specifying e.g.
KEY `date_created` (`date_created`, `event`)
This key can still be used to satisfy plain old date_created range searching. But in addition to that, the event data is included as well, so the DBS will be able to detect the relevant rows by only looking at the index.
If you want, you can try the other order as well: first event and then date. This might allow some optimization if there are many event types but your filter only contains few. On the other hand, I'm not sure the system will be able to make use of the LIMIT clause in this case, so I'm not certain that this other order will be any help at all.
Edit: I completely missed that your date_event index already has this info. According to your execution plan, though, that one isn't used. Looks like the optimizer is getting things wrong. You could try removing the event index, and perhaps the date index as well, and see what happens then.

Count the number of rows between unix time stamps for each ID

I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow while looping through each template which at times can be in the hundreds. The time stamps are stored as int and are properly indexed. Just to test thing out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
AND `template` = '1'
As expected, it runs very fast, but is obviously not restricting count results inside the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you may have in this case is composite one template + time_sent
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also as long as all your columns in the query are integer DON'T enclose their values in quotes (in some cases it could lead to issues, at least with older mysql versions)
First, you have to create an index that has both of your columns together (not seperately). Also check your table type, i think it would work great if your table is innoDB.
And lastly, use your WHERE clause in this fashion:
`WHEREtemplate= '1' ANDtime_sent` BETWEEN '1346904000' AND '1346993271'
What this does is first check if template is 1, if it is then it would check for the second condition else skip. This will definitely give you performance-edge
If you have to call the query for each template maybe it would be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271;
GROUP BY template
It's just a guess that this would be faster and you also would have to redesign your code a bit.
You could also try to use InnoDB instead of MyISAM. InnoDB uses a clustered index which maybe performs better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stackoverflow which discuss the performance between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB

Optimizing MySQL table structure. Advice needed

I have these table structures and while it works, using EXPLAIN on certain SQL queries gives 'Using temporary; Using filesort' on one of the table. This might hamper performance once the table is populated with thousands of data. Below are the table structure and explanations of the system.
CREATE TABLE IF NOT EXISTS `jobapp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fullname` varchar(50) NOT NULL,
`icno` varchar(14) NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '1',
`timestamp` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `icno` (`icno`)
) ENGINE=MyISAM;
CREATE TABLE IF NOT EXISTS `jobapplied` (
`appid` int(11) NOT NULL,
`jid` int(11) NOT NULL,
`jobstatus` tinyint(1) NOT NULL,
`timestamp` int(10) NOT NULL,
KEY `jid` (`jid`),
KEY `appid` (`appid`)
) ENGINE=MyISAM;
Query I tried which gives aforementioned statement:
EXPLAIN SELECT japp.id, japp.fullname, japp.icno, japp.status, japped.jid, japped.jobstatus
FROM jobapp AS japp
INNER JOIN jobapplied AS japped ON japp.id = japped.appid
WHERE japped.jid = '85'
AND japped.jobstatus = '2'
AND japp.status = '2'
ORDER BY japp.`timestamp` DESC
This system is for recruiting new staff. Once registration is opened, hundreds of applicant will register in a single time. They are allowed to select 5 different jobs. Later on at the end of registration session, the admin will go through each job one by one. I have used a single table (jobapplied) to store 2 items (applicant id, job id) to record who applied what. And this is the table which causes aforementioned statement. I realize this table is without PRIMARY key but I just can't figure out any other way later on for the admin to search specifically which job who have applied.
Any advice on how can I optimize the table?
Apart from the missing indexes and primary keys others have mentioned . . .
This might hamper performance once the
table is populated with thousands of
data.
You seem to be assuming that the query optimizer will use the same execution plan on a table with thousands of rows as it will on a table with just a few rows. Optimizers don't work like that.
The only reliable way to tell how a particular vendor's optimizer will execute a query on a table with thousands of rows--which is still a small table, and will probably easily fit in memory--is to
load a scratch version of the
database with thousands of rows
"explain" the query you're interested
in
FWIW, the last test I ran like this involved close to a billion rows--about 50 million in each of about 20 tables. The execution plan for that query--which included about 20 left outer joins--was a lot different than it was for the sample data (just a few thousand rows).
You are ordering by jobapp.timestamp, but there is no index for timestamp so the tablesort (and probably the temporary) will be necessary try adding and index for timestamp to jobapp something like KEY timid (timestamp,id)