What is better: select duplicates check OR making a unique index? - mysql

I have a table like this:
CREATE TABLE `tblorders` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `torderid` int(11) DEFAULT NULL,
  `tuid` int(11) DEFAULT NULL,
  `tcontid` int(11) DEFAULT NULL,
  `tstatus` varchar(10) DEFAULT 'pending',
  PRIMARY KEY (`id`)
);
The rule here is that the same user (tuid) can't have more than one pending order with the same tcontid.
So the first thing I could do is check whether there is already a pending order like this:
SELECT COUNT(id) INTO @cnt FROM tblorders WHERE tuid = 1 AND tcontid = 5 AND tstatus LIKE 'pending';
and if it is > 0 then I can't insert.
Or I can just make a unique index over the three columns, so the table won't accept duplicate records.
The question is:
WHICH WAY IS FASTER? Because this is going to be a large database...

A few suggestions:
Use tstatus = 'pending' instead of tstatus LIKE 'pending'.
Creating a composite unique key over tuid, tcontid, tstatus may not work if you are only considering the 'pending' status. What about the other statuses?
If you decide to index the columns, I would recommend you create a separate table for tstatus and use a foreign key reference here. That will save space in the indexed columns, and your query will always run against indexed fields.
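For illustration, a minimal sketch of that normalization (the status table, the column names, and the 1 = 'pending' mapping are all hypothetical):

CREATE TABLE tblstatus (
  statusid TINYINT UNSIGNED NOT NULL PRIMARY KEY,
  statusname VARCHAR(10) NOT NULL
);

-- Replace the varchar status with a small foreign-key column
-- (assumes tblorders is InnoDB, which is required for enforced FKs):
ALTER TABLE tblorders
  ADD COLUMN tstatusid TINYINT UNSIGNED NOT NULL DEFAULT 1,  -- hypothetical: 1 = 'pending'
  ADD CONSTRAINT fk_tstatus FOREIGN KEY (tstatusid) REFERENCES tblstatus (statusid);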

An index is clearly faster, as it is designed for exactly this use case.
It speeds up the search for the tuple you are looking for, and if the constraint is not satisfied the server raises an error, which your application code can handle more easily (and faster) than fetching a result set and checking it.

Assuming that a user will attempt to add a conflicting record less often than a valid record, the compound index will be faster. The SQL engine maintains the index and throws an error when there is an index constraint violation.
Even if you elected the SELECT method, you would still need to maintain an index for that query to be fast. Aside from that, pulling a result set from a SELECT all the way back into your application's memory space and then checking it is much slower than enforcing the rule with an index constraint.
For more info please see: https://dev.mysql.com/doc/refman/5.5/en/multiple-column-indexes.html
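To make the comparison concrete, here is a minimal sketch of the index approach (the index name is illustrative):

ALTER TABLE tblorders
  ADD UNIQUE KEY uq_uid_cont_status (tuid, tcontid, tstatus);

-- A second pending order for the same user and contid now fails fast:
INSERT INTO tblorders (tuid, tcontid, tstatus) VALUES (1, 5, 'pending');
INSERT INTO tblorders (tuid, tcontid, tstatus) VALUES (1, 5, 'pending');
-- ERROR 1062 (23000): Duplicate entry '1-5-pending' for key 'uq_uid_cont_status'

The application traps error 1062 (ER_DUP_ENTRY) instead of issuing a separate SELECT first. One caveat: MySQL UNIQUE indexes do not block rows where a key column is NULL, and tuid/tcontid are nullable in the schema above.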

Related

Mysql concept of key for the same column with shared primary key

As in the code below, I set chat_id and student_id as the primary key so that the same chat can be added for another student_id. When this goes to production there will be a lot of records. Should I add one more index on student_id alone so that the search is faster for every user when they come back to the screen to see their recent messages?
CREATE TABLE `tim_chat_recipients` (
`chat_id` INT(11) NOT NULL,
`student_id` INT(11) NOT NULL,
`message_status` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '1:New, 2:Read, 3:Deleted',
PRIMARY KEY (`chat_id`, `student_id`)
) COLLATE='latin1_swedish_ci' ENGINE=InnoDB;
If the code is going to search for student_id without a chat_id, then an index over at least student_id is required, or queries will result in non-scaling table/cluster scans. This is because the implicit cluster/index (chat_id, student_id) can't be used for index reduction without a supplied chat_id.
This secondary index - INDEX(student_id, [optionally other columns]) - might include other columns, either as key parts or as includes, although "appropriate selection" depends a good deal on the queries. In this case it may make sense to include the message_status column, to be able to use "messages not seen" as a filter with minimal read I/O, since it can avoid a probe back to the table/cluster; see the sketch below. (Indices are about balancing maintenance/disk costs against benefits to queries.)
In short: for an index to be selected by the Query Planner in a SARGable query, the left parts of the index must be known and mentioned in either the WHERE or, sometimes, the JOIN .. ON clauses.
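A hedged sketch of that secondary index (the index name and the example values are illustrative):

ALTER TABLE tim_chat_recipients
  ADD INDEX idx_student_status (student_id, message_status);

-- InnoDB secondary indexes implicitly carry the PK columns, so this
-- "new chats for a student" query is satisfied from the index alone:
SELECT chat_id
FROM tim_chat_recipients
WHERE student_id = 42        -- hypothetical student
  AND message_status = 1;    -- 1:New per the column comment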

High traffic table, optimal indexes?

I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high-traffic table with potentially thousands of rows every minute. Each row belongs to a monitor and contains a value and a time (unix timestamp).
I have three issues:
1.
After a number of months in dev, the table suddenly became very slow. Queries that previously completed in under a second could now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior was very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN on the query above looks like this:
id: 1
select_type: SIMPLE
table: md
type: range
possible_keys: monitor_id_data_time, monitor_data_time
key: monitor_id_data_time
key_len: 8
ref: NULL
rows: 149799
Extra: Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows, even though there aren't any duplicate rows in the table. Any explanation for this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle: start with the '=' column, then move on to the 'range' column.
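As a sketch, assuming any duplicate (monitor_id, monitor_data_time) pairs have been cleaned up first (adding a PRIMARY KEY rebuilds the table and will fail if duplicates remain):

ALTER TABLE monitor_data
  ADD PRIMARY KEY (monitor_id, monitor_data_time),
  DROP INDEX monitor_id_data_time;  -- now redundant with the PK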
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast this with an index on (time, id): it would scan the date range fine, but it would have to skip over any entries where id != 165, and it would still have to read all those rows to discover they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. It is suited to an index range scan operation, very quickly eliminating boatloads of rows that would otherwise need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
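For example, a covering variant of the existing index might look like this (a sketch; the index name is illustrative):

ALTER TABLE monitor_data
  ADD INDEX monitor_id_time_value (monitor_id, monitor_data_time, monitor_data_value);

-- EXPLAIN should then show "Using index" in the Extra column for the query above.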
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Or, equivalently, specify PRIMARY KEY in place of UNIQUE KEY and remove the index name:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation for this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee uniqueness.
Note that DISTINCT is a keyword in the SELECT list, not a function. The DISTINCT keyword applies to all expressions in the SELECT list, so the parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
Also, the secondary index on monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of it, though one suspects that those queries would also make effective use of a composite index with monitor_data_time as the leading column.)

Mysql Innodb deadlock problems on REPLACE INTO

I want to update a statistics count in MySQL.
The SQL is as follow:
REPLACE INTO `record_amount`(`source`,`owner`,`day_time`,`count`) VALUES (?,?,?,?)
Schema :
CREATE TABLE `record_amount` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
`owner` varchar(50) NOT NULL ,
`source` varchar(50) NOT NULL ,
`day_time` varchar(10) NOT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `src_time` (`owner`,`source`,`day_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
However, it caused DEADLOCK exceptions when running in multiple processes (i.e. Map-Reduce).
I've read some material online and am confused about the locks involved. I know InnoDB uses row-level locking. I could just use a table lock to solve the business problem, but that is a little extreme. I found some possible solutions:
change REPLACE INTO to transaction with SELECT id FOR UPDATE and UPDATE
change REPLACE INTO to INSERT ... ON DUPLICATE KEY UPDATE
I have no idea which is practical and better. Can someone explain it or offer some links for me to read and study? Thank you!
Are you building a summary table, one source row at a time, effectively doing UPDATE ... count = count+1? Throw away the code and start over. Map-Reduce for that is like using a sledgehammer on a thumbtack.
INSERT INTO summary (source, owner, day_time, count)
SELECT source, owner, day_time, COUNT(*)
FROM raw
GROUP BY source, owner, day_time
ON DUPLICATE KEY UPDATE count = count + VALUES(count);
A single statement approximately like that will do all the work at virtually disk I/O speed. No SELECT ... FOR UPDATE. No deadlocks. No multiple threads. Etc.
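For completeness, option 2 from the question applied to the original single-row statement would look roughly like this (placeholders as in the question):

INSERT INTO record_amount (source, owner, day_time, `count`)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE `count` = `count` + VALUES(`count`);

Unlike REPLACE INTO, which is effectively a DELETE plus an INSERT and takes correspondingly more locks, this updates the existing row in place.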
Further improvements:
Get rid of the AUTO_INCREMENT; turn the UNIQUE into PRIMARY KEY.
day_time -- is that a DATETIME truncated to an hour? (Or something like that.) Use DATETIME, you will have much more flexibility in querying.
To discuss further, please elaborate on the source data (CREATE TABLE, number of rows, frequency of processing, etc.) and other details. If this is really a Data Warehouse application with a summary table, I may have more suggestions.
If the data is coming from a file, do LOAD DATA to shovel it into a temp table raw so that the above INSERT..SELECT can work. If it is of manageable size, make raw Engine=MEMORY to avoid any I/O for it.
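A sketch of that staging step (the file path and layout here are hypothetical):

CREATE TABLE raw (
  source VARCHAR(50) NOT NULL,
  owner VARCHAR(50) NOT NULL,
  day_time VARCHAR(10) NOT NULL
) ENGINE=MEMORY;

LOAD DATA LOCAL INFILE '/tmp/feed.csv'  -- hypothetical path
INTO TABLE raw
FIELDS TERMINATED BY ',';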
If you have multiple feeds, my high-speed-ingestion blog discusses how to have multiple threads without any deadlocks.

Select rows where column LIKE dictionary word

I have 2 tables:
Dictionary - Contains roughly 36,000 words
CREATE TABLE IF NOT EXISTS `dictionary` (
`word` varchar(255) NOT NULL,
PRIMARY KEY (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Datas - Contains roughly 100,000 rows
CREATE TABLE IF NOT EXISTS `datas` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`hash` varchar(32) NOT NULL,
`data` varchar(255) NOT NULL,
`length` int(11) NOT NULL,
`time` int(11) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `hash` (`hash`),
KEY `data` (`data`),
KEY `length` (`length`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=105316 ;
I would like to somehow select all the rows from datas where the data column contains one or more dictionary words.
I understand this is a big ask; it would need to match these rows together in every combination possible, so it needs the best optimization.
I have tried the query below, but it just hangs for ages:
SELECT `datas`.*, `dictionary`.`word`
FROM `datas`, `dictionary`
WHERE `datas`.`data` LIKE CONCAT('%', `dictionary`.`word`, '%')
AND LENGTH(`dictionary`.`word`) > 3
ORDER BY `length` ASC
LIMIT 15
I have also tried something similar to the above with a LEFT JOIN and an ON clause containing the LIKE condition.
This is actually not an easy problem; what you are trying to perform is called Full Text Search, and relational databases are not the best tools for such a task. If this is core functionality, consider using a solution dedicated to this kind of operation, such as the Sphinx search server.
If this is not a "mission critical" system, you can try something else. I can see that the datas.data column isn't very long, so you can create a structure dedicated to your task and maintain it during operational use. For example, create this table:
CREATE TABLE dictionary_datas (
  datas_id INT(11) NOT NULL,
  word VARCHAR(255) NOT NULL,
  FOREIGN KEY (datas_id) REFERENCES datas (ID),
  FOREIGN KEY (word) REFERENCES dictionary (word)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now, any time you insert, delete, or modify the datas or dictionary tables, you update dictionary_datas, recording which datas_id contains which words (basically a many-to-many relation). Of course this will degrade your performance, so if you have a high transactional load on your system, you can instead do it periodically. For example, set up a cron job that runs every night at 03:00 and refreshes the table. To simplify the task you can add a TO_CHECK flag to the datas table and refresh data only for records having a 1 there (after you refresh dictionary_datas, switch the value to 0). Remember, by the way, to refresh the whole datas table after an update to the dictionary table. 36,000 and 100,000 are not big numbers in terms of data processing.
Once you have this table you can just query it like:
SELECT datas_id, count(*) AS words_num FROM dictionary_datas GROUP BY datas_id HAVING count(*) > 3;
To speed up the query (at the cost of slowing down its updates) you can create a composite index on its columns datas_id, word (in exactly that order). If you decide to refresh the data periodically, you should drop the index before the refresh, then refresh the data, and finally recreate the index afterwards - that way it will be faster.
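A hedged sketch of the refresh step, using the TO_CHECK flag described above (the LIKE join mirrors the query from the question):

INSERT INTO dictionary_datas (datas_id, word)
SELECT d.ID, w.word
FROM datas d
JOIN dictionary w
  ON d.data LIKE CONCAT('%', w.word, '%')
WHERE d.TO_CHECK = 1;  -- TO_CHECK is the maintenance flag suggested above

UPDATE datas SET TO_CHECK = 0 WHERE TO_CHECK = 1;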
I'm not sure I understood your problem, but I think this could be a solution. Also, I know people tend to dislike regular expressions, but this works for me to select rows whose value has more than one word.
SELECT * FROM datas WHERE data REGEXP "([a-z] )+"
Have you tried this?
select *
from dictionary, datas
where locate(word, data) > 0;
This is very inefficient, but might be good enough for you.
For better performance, you could try placing a MySQL FULLTEXT index on your data column and then using MATCH ... AGAINST instead of LOCATE.
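A sketch of that approach (FULLTEXT on InnoDB requires MySQL 5.6+; note that AGAINST() takes a constant string, not a column, so the application must loop over the dictionary words):

ALTER TABLE datas ADD FULLTEXT INDEX ft_data (data);

-- one probe per dictionary word, driven from the application:
SELECT * FROM datas
WHERE MATCH(data) AGAINST ('+example' IN BOOLEAN MODE);  -- 'example' is a placeholder word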

Effective indexing for a DB with millions of rows

I have a MyISAM MySQL table with many millions of rows that I've been asked to work with, but I need to make the queries faster first.
There was no indexing before at all! I added a new index on the type column, which has helped, but I wanted to know if there are any other columns that might be worth indexing too.
Here is my CREATE TABLE:
CREATE TABLE `clicks` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`companyid` int(11) DEFAULT '0',
`type` varchar(32) NOT NULL DEFAULT '',
`contextid` int(11) NOT NULL DEFAULT '0',
`period` varchar(16) NOT NULL DEFAULT '',
`timestamp` int(11) NOT NULL DEFAULT '0',
`location` varchar(32) NOT NULL DEFAULT '',
`ip` varchar(32) DEFAULT NULL,
`useragent` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `companyid` (`companyid`,`type`,`period`),
KEY `type` (`type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
A typical SELECT statement would commonly filter by the companyid, type and contextid columns.
For example:
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
or
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND type IN('direct') AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
The last part of my question is this: when I added the index on type it took about an hour - if I am adding or removing multiple indexes, can you do it in one query, or do you have to do them one by one and wait for each to finish?
Thanks for your thoughts.
Indexing is really powerful, but isn't as much of a black art as you might think. Learn about MySQL's EXPLAIN PLAN capabilities, this will help you systematically find where improvements can be made:
http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html
Which indexes to add really depends on your queries. Anything that you're filtering on (WHERE) or grouping/ordering by (GROUP BY, ORDER BY) is a good candidate for an index.
You may also want to have a look at how MySQL uses indexes.
As regards the time taken to add indexes: where you're sure you want to add multiple indexes, you could do a mysqldump, manually edit the table structure in the .sql file, and then reimport. This can take a while, but at least you can make all the changes at once. However, this doesn't really fit with the idea of testing as you go, so use this approach with care. (I've done it when modifying a number of tables with the same structure and wanting to add some indexes to all of them.)
Also, I'm not 100% sure, but I think that when you add an index, MySQL creates a copy of the table with the index and then deletes the original table - so make sure there's enough space on your server/partition for the current size of the table plus some margin.
Here's one of your queries, broken on multiple lines so it's easier to read.
SELECT period, count(period) as count
FROM clicks
WHERE contextid in (123)
AND timestamp > 123123123
GROUP BY period
ORDER BY timestamp ASC
I'm not even sure this is a valid query: I thought your GROUP BY and ORDER BY had to match up in SQL. Since the query groups on period, ordering on timestamp is ambiguous; you would have to order on period or on an aggregate such as the count.
The important part of the query for optimization is the WHERE clause. In this case, an index on contextid and timestamp would speed up the query.
Obviously, you can't index every WHERE clause. You index the most common WHERE clauses.
I'd add indexes to existing tables one at a time. Yes, it's slow. But you should only have to add the indexes once.
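For the two example queries above, that might look like this (a sketch; the index names are illustrative):

ALTER TABLE clicks ADD INDEX contextid_timestamp (`contextid`, `timestamp`);
-- and, for the query that also filters on type:
ALTER TABLE clicks ADD INDEX contextid_type_timestamp (`contextid`, `type`, `timestamp`);

Both put the equality columns first and the range column (timestamp) last.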
In my opinion, timestamp and period can be indexed, as they are used in the WHERE and GROUP BY clauses.
Also, instead of using contextid IN (123) use contextid = 123, and instead of type IN ('direct') use type = 'direct'.
You can add multiple indexes in a single statement. This will save some time overall, but the table will be inaccessible while you wait for the entire statement to complete:
ALTER TABLE table1
  ADD INDEX `Index1` (`col1`),
  ADD INDEX `Index2` (`col2`);
Regarding indexes: it's a complex subject. However, adding indexes on single high-cardinality columns that appear in your WHERE clause is a good place to start. MySQL will try to pick the best index for the query and use it.
To further tweak performance, you should consider multi-column indexes, which I see you've implemented with your 'companyid' index.
Whether an index can be utilized all the way through to a GROUP BY or ORDER BY clause depends on a lot of conditions that you might want to read up on.
To best utilize indexes, your database server must have enough RAM to store the indexes entirely in memory, and the server must be configured to actually use that memory.
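As a rough illustration for MyISAM, where the server caches only index blocks (data blocks rely on the OS file cache), the relevant setting is key_buffer_size; the value below is purely illustrative and depends on available RAM:

[mysqld]
# MyISAM index cache; size it to hold the hot indexes
key_buffer_size = 2G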