Check almost 1 million values for matches in MySQL DB [duplicate]

I have been experimenting with Redis and MongoDB lately and it would seem that there are often cases where you would store an array of id's in either MongoDB or Redis. I'll stick with Redis for this question since I am asking about the MySQL IN operator.
I was wondering how performant it is to list a large number (300-3000) of id's inside the IN operator, which would look something like this:
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
Imagine something as simple as a products and a categories table, which you might normally JOIN together to get the products from a certain category. In the example above you can see that, under a given category key in Redis (category:4:product_ids), I return all the product ids from the category with id 4 and place them in the SELECT query above inside the IN operator.
How performant is this?
Is this an "it depends" situation? Or is there a concrete "this is (un)acceptable" or "fast" or "slow" or should I add a LIMIT 25, or doesn't that help?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25
Or should I trim the array of product id's returned by Redis to 25 and only add 25 id's to the query, rather than including 3000 and LIMIT-ing it to 25 from inside the query?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)
Any suggestions/feedback is much appreciated!

Generally speaking, if the IN list gets too large (for some ill-defined value of 'too large' that is usually in the region of 100 or smaller), it becomes more efficient to use a join, creating a temporary table if need be to hold the numbers.
If the numbers are a dense set (no gaps - which the sample data suggests), then you can do even better with WHERE id BETWEEN 300 AND 3000.
However, presumably there are gaps in the set, at which point it may be better to go with the list of valid values after all (unless the gaps are relatively few in number, in which case you could use something like the following, adjusted to whatever the gaps actually are):
WHERE id BETWEEN 300 AND 3000 AND id NOT BETWEEN 742 AND 836
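As a hedged illustration of the join alternative (a PDO connection in $pdo is assumed, and the wanted_ids table name and its population step are placeholders, not part of the original answer):
// Sketch: join against a temporary table of IDs instead of a giant IN (...) list.
$pdo->exec("CREATE TEMPORARY TABLE wanted_ids (id INT NOT NULL PRIMARY KEY)");
// ... populate wanted_ids with the IDs fetched from Redis ...
$rows = $pdo->query(
    "SELECT p.id, p.name, p.price
     FROM products p
     JOIN wanted_ids w ON w.id = p.id"
)->fetchAll(PDO::FETCH_ASSOC);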

I have been doing some tests, and as David Fells says in his answer, IN is quite well optimized. As a reference, I created an InnoDB table with 1,000,000 rows; a SELECT using the "IN" operator with 500,000 random numbers takes only 2.5 seconds on my Mac, and selecting only the odd-numbered rows takes 0.5 seconds.
The only problem I had is that I had to increase the max_allowed_packet parameter in the my.cnf file. If not, a mysterious "MySQL server has gone away" error is generated.
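For reference, the change meant here is something along these lines in my.cnf (the 64M value is just an illustrative choice, not from the original answer):
[mysqld]
max_allowed_packet = 64M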
Here is the PHP code that I use to make the test:
$NROWS = 1000000;  // total rows to insert
$SELECTED = 50;    // ~50% of the rows end up in the IN list
$dsn = "mysql:host=localhost;port=8889;dbname=testschema";
$pdo = new PDO($dsn, "root", "root");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("DROP TABLE IF EXISTS `testtable`");
$pdo->exec("CREATE TABLE `testtable` (
    `id` INT NOT NULL,
    `text` VARCHAR(45) NULL,
    PRIMARY KEY (`id`))");
$before = microtime(true);
$Values = '';
$SelValues = '(';
$c = 0;
for ($i = 0; $i < $NROWS; $i++) {
    $r = rand(0, 99);
    if ($c > 0) $Values .= ",";
    $Values .= "( $i , 'This is value $i and r= $r')";
    if ($r < $SELECTED) {
        if ($SelValues != "(") $SelValues .= ",";
        $SelValues .= $i;
    }
    $c++;
    // Flush every 100 rows (and on the last iteration) as one multi-row INSERT.
    if (($c == 100) || (($i == $NROWS - 1) && ($c > 0))) {
        $pdo->exec("INSERT INTO `testtable` VALUES $Values");
        $Values = "";
        $c = 0;
    }
}
$SelValues .= ')';
echo "<br>";
$after = microtime(true);
echo "Insert execution time = " . ($after - $before) . "s<br>";
$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id IN $SelValues";
$result = $pdo->prepare($sql);
$after = microtime(true);
echo "Prepare execution time = " . ($after - $before) . "s<br>";
$before = microtime(true);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Random selection = $c Execution time = " . ($after - $before) . "s<br>";
$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id % 2 = 1";
$result = $pdo->prepare($sql);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Odd rows = $c Execution time = " . ($after - $before) . "s<br>";
And the results:
Insert execution time = 35.2927210331s
Prepare execution time = 0.0161771774292s
Random selection = 499102 Execution time = 2.40285992622s
Odd rows = 500000 Execution time = 0.465420007706s

You can create a temporary table where you can put any number of IDs, and run a nested query.
Example:
CREATE TEMPORARY TABLE tmp_IDs (`ID` INT NOT NULL, PRIMARY KEY (`ID`));
and select:
SELECT id, name, price
FROM products
WHERE id IN (SELECT ID FROM tmp_IDs);
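A hedged sketch of filling tmp_IDs from PHP (a PDO connection in $pdo and a $productIds array are assumed; the 1,000-row chunk size is arbitrary, chosen to stay well under max_allowed_packet):
$pdo->exec("CREATE TEMPORARY TABLE tmp_IDs (`ID` INT NOT NULL, PRIMARY KEY (`ID`))");
// Insert the IDs in chunks so no single statement grows too large.
foreach (array_chunk($productIds, 1000) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '(?)'));
    $stmt = $pdo->prepare("INSERT INTO tmp_IDs (ID) VALUES $placeholders");
    $stmt->execute($chunk);
}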

Using IN with a large parameter set on a large list of records will in fact be slow.
In a case I solved recently, I had two WHERE clauses, one with 2,500 parameters and the other with 3,500, querying a table of 40 million records.
My query took 5 minutes using the standard WHERE IN. By instead using a subquery for the IN statement (putting the parameters in their own indexed table), I got the query down to two seconds.
Worked for both MySQL and Oracle in my experience.

IN is fine, and well optimized. Just make sure you use it on an indexed field.
As far as the DB engine is concerned, it's functionally equivalent to:
(x = 1 OR x = 2 OR x = 3 ... OR x = 99)
EDIT: Please note this answer was written in 2011; see the comments on this answer for discussion of later MySQL features.

When you provide many values to the IN operator, it first has to sort them to remove duplicates (at least, I suspect that is what happens). So it would not be good to provide too many values, as sorting takes N log N time.
My experience has shown that slicing the set of values into smaller subsets and combining the results of all the queries in the application gives the best performance. I admit that I gathered this experience on a different database (Pervasive), but the same may apply to all engines. My count of values per set was 500-1,000; significantly more or fewer was slower.
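A minimal sketch of that slicing idea in PHP (the PDO connection, the $allIds array, the products table, and the 1,000-value slice size are all illustrative assumptions):
$rows = [];
// Run one query per slice of at most 1000 IDs and merge the results.
foreach (array_chunk($allIds, 1000) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare("SELECT id, name, price FROM products WHERE id IN ($placeholders)");
    $stmt->execute($chunk);
    $rows = array_merge($rows, $stmt->fetchAll(PDO::FETCH_ASSOC));
}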

Related

Mysql select sleep then return

I want to select X records from the database (in a PHP script), then sleep 60 seconds, then continue with the next 60 results...
SO:
SELECT * FROM TABLE WHERE A = 'B' LIMIT 60
SELECT SLEEP(60);
....
SELECT * FROM TABLE WHERE A = 'B' LIMIT X -- where X is the offset of the next 60 results, then
SELECT SLEEP(60);
and so on...
How can I achieve this?
There is no such thing as "the next 60 records". SQL tables represent unordered sets. Without an order by, a SQL statement can return a result set in any order -- and even in different orders on different executions.
Hence, you first need something to guarantee the ordering, that is, an order by with keys that uniquely identify each row.
You can then use offset/limit to accomplish what you want. Or, you could put the code into a stored procedure and use a while loop. Or, you could do this on the application side.
In PHP:
<?php
// Obtain the database connection; there's a heap of examples on the net,
// assuming you're using a library like mysqli.
$offset = 0;
while (true) {
    if ($offset == 0) {
        $res = $db->query("SELECT * FROM TABLE WHERE A = 'B' LIMIT 60");
    } else {
        $res = $db->query("SELECT * FROM TABLE WHERE A = 'B' LIMIT " . $offset . ",60");
    }
    $rows = $db->fetch_assoc($res);
    sleep(60);
    if ($offset >= $some_arbitrary_number) {
        break;
    }
    $offset += 60;
}
What you're doing is gradually incrementing the offset by 60 until you reach your limit. The easiest way to do it is in a while loop using true for the condition, breaking when you reach your stop condition.
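As the answer notes, the loop above is only deterministic with an order by; a hedged tweak of the query line, assuming an id primary key:
// Deterministic paging: order by a unique key so each batch of 60 is stable.
$res = $db->query("SELECT * FROM TABLE WHERE A = 'B' ORDER BY id LIMIT " . $offset . ",60");
// Keyset pagination is an alternative that avoids large offsets:
// ... WHERE A = 'B' AND id > $lastSeenId ORDER BY id LIMIT 60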

mysql - insert many to many relationship

I am trying to insert records in 2 different mysql tables. Here's the situation:
Table 1: is_main that contains records of resorts with a primary key called id.
Table 2: is_features that contains a list of features that a resort can have (i.e. beach, ski, spa etc...). Each feature has got a primary key called id.
Table 3: is_i2f to connect each resort id with the feature id. This table has got 2 fields: id_i and id_f. Both fields are primary key.
I have created a form to insert a new resort, but I'm stuck here. I need a proper MySQL query to insert a new resort into the is_main table and, for each feature it has, insert one record into is_i2f with the resort id in id_i and the feature id in id_f.
$features = ['beach','relax','city_break','theme_park','ski','spa','views','fine_dining','golf'];
mysql_query("INSERT INTO is_main (inv_name, armchair, holiday, sipp, resort, price, rooms, inv_length, more_info)
VALUES ('$name', '$armchair', '$holiday', '$sipp', '$resort', '$price', '$rooms', '$length', '$more_info')");
$id = mysql_insert_id();
foreach($features as $feature) {
if(isset($_POST[$feature])) {
$$feature = 1;
mysql_query("INSERT INTO is_i2f (id_i, id_f) VALUES (" . $id . ", ?????????????? /missing part here????/ ); }
else {
$$feature = 0; }
}
Thanks.
Please, I'm going CrAzY!!!!!!!!!!!!!!
This may not be relevant to you, but...
Would it not make more sense to leave the link table unpopulated? You can use JOINs to select what you need when populating the various views etc. in your application.
i.e. a query to get one resort with all features:
SELECT
    m.Id,
    f.Id,
    f.Name
FROM IS_MAIN m
CROSS JOIN IS_FEATURES f
WHERE m.Id = $RequiredResortId
Please find the answer on Mysql insert into 2 tables.
If you want to do multiple inserts at a time, you can write a stored procedure to fulfill your needs.
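Purely as an illustration of that two-table insert, here is a hedged PDO sketch (the transaction, the trimmed column list, and the $featureIds array of checked feature primary keys are all assumptions; the question's own code uses the old mysql_* API instead):
$pdo->beginTransaction();
// 1. Insert the resort and grab its new primary key.
$stmt = $pdo->prepare("INSERT INTO is_main (inv_name, resort, price) VALUES (?, ?, ?)");
$stmt->execute([$name, $resort, $price]);
$id = $pdo->lastInsertId();
// 2. Insert one link row per selected feature.
$link = $pdo->prepare("INSERT INTO is_i2f (id_i, id_f) VALUES (?, ?)");
foreach ($featureIds as $featureId) {
    $link->execute([$id, $featureId]);
}
$pdo->commit();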
If I understand you correctly, you could concatenate a variable number of to-be-inserted/selected values into one query. (This is the second query, which needs an id from the first.)
// initializing variables
$id = mysql_insert_id();
$qTail = '';
$i = -1;
// standard beginning
$qHead = "INSERT INTO `is_i2f` (`id`,`feature`) VALUES ";
// loop through a variable number of features
foreach ($features as $key => $feature) {
    $i++;
    // id stays the same, $feature varies
    $qValues[$i] = "('{$id}', '{$feature}')";
    // accumulate multiple values into one string
    $qTail .= $qValues[$i] . ',';
} // end of foreach
// concatenate the working query; need to remove the last comma from $qTail
$q = $qHead . rtrim($qTail, ',');
Now you should have a usable insert query in $q. Just echo it to see how it looks, and test whether it works.
Hope this was the case. If not, sorry...

Why doesn't this SELECT query return the results I expect?

I need help with a select query, but before asking the question, I will give a short description of how my system works:
My database has a many-to-many relationship:
table product:
prd_cod(pk) //stores the product code ex: 0,1,2
cat_cod(fk)
prd_name //stores the product name, ex: tv, gps, notebook
table description_characteristc:
prd_cod(fk)
id_characteristic(fk)
description //stores the description of the characteristic, ex: sony, 1kg, hj10
table characteristic:
id_characteristic (pk)
name_characteristic //store the name of characteristic, ex: brand, weight, model
I have already made a jQuery suggest feature (in index.php): every word I type calls suggest.php, which runs a SELECT and returns the result into the suggestion box in the index:
<?php
header('Content-type: text/html; charset=UTF-8');
$hostname = 'localhost';
$username = 'root';
$password = '';
$dbname = 'cpd';
mysql_connect($hostname, $username, $password) or die('Error trying to connect to the database.');
mysql_select_db($dbname);
if (isset($_REQUEST['query']) && $_REQUEST['query'] != "") {
    $q = mysql_real_escape_string($_REQUEST['query']);
    if (isset($_REQUEST['identifier']) && $_REQUEST['identifier'] == "sugestao") {
        $sql = "SELECT p.prd_name, d.description
                FROM product p
                INNER JOIN description_characteristc d USING (prd_cod)
                WHERE '" . $q . "' LIKE CONCAT(p.prd_name, '%') AND
                      CONCAT(p.prd_name, ' ', d.description) LIKE CONCAT('" . $q . "', '%')
                LIMIT 10";
        $r = mysql_query($sql);
        if ($r) {
            echo '<ul>' . "\n";
            $cont = 0;
            while ($l = mysql_fetch_array($r)) {
                $p = preg_replace('/(' . $q . ')/i', '<span style="font-weight:bold;">$1</span>',
                    $l['prd_name'] . ' ' . $l['description'] . ' ' . $l['description']);
                echo "\t" . '<li id="autocomplete_' . $cont . '" rel="' . $l['prd_name'] . '.' . $l['description'] . '">' . utf8_encode($p) . '</li>' . "\n";
                $cont++;
            }
            echo '</ul>';
        }
    }
}
?>
Here are my questions:
Currently, when the user types 't', the SELECT brings back nothing; only when the user types 'tv' does it bring results:
tv led
tv plasm
tv samsumg
I would like the SELECT to bring back 'tv' when the user types just 't'.
When you type 'tv plasm', it brings the same name_characteristic twice:
ex: tv plasm plasm
Currently my SELECT returns the prd_name plus the descriptions from the description_characteristc table:
tv led
I would like my SELECT to match the inverse order too, ex: led tv.
I would also like the shown results to be ordered, via some kind of cache feature, from most searched to least searched; remember that prd_name stores only 'tv'.
The help I'm looking for can be in the form of a SELECT or of a stored procedure. I can also edit the PHP file.
You should split and join your search query on the PHP side like this:
<?php
$words = preg_split("/[^\\w]+/", $q);
$first = $words[0] . "%";
$all = implode(" ", $words) . "%";
?>
then use the variables $first and $all in this query:
SELECT p.prd_name, d.description
FROM product p
JOIN description d
ON d.prd_cod = p.prd_cod
WHERE p.prd_name LIKE '$first'
AND CONCAT(p.prd_name, ' ', d.description) LIKE '$all'
Create an index on product (prd_name) for this to work fast.
If you want the words matched in any order, you will have to create a FULLTEXT index on your tables (this is only possible in MyISAM):
CREATE FULLTEXT INDEX fx_product_name ON product (prd_name);
CREATE FULLTEXT INDEX fx_description_name ON description (description);
and write a query like this:
SELECT p.prd_name, d.description
FROM (
SELECT prd_cod
FROM product pi
WHERE MATCH(prd_name) AGAINST ('lcd tv' IN BOOLEAN MODE)
UNION
SELECT prd_cod
FROM description di
WHERE MATCH(description) AGAINST ('lcd tv' IN BOOLEAN MODE)
) q
JOIN product p
ON p.prd_cod = q.prd_cod
JOIN description d
ON d.prd_cod = p.prd_cod
WHERE MATCH(p.prd_name, d.description) AGAINST ('+lcd +tv' IN BOOLEAN MODE)
Note the search term syntax change: 'lcd tv' in the inner query and '+lcd +tv' in the outer one.
You may also want to set @@ft_min_word_len to 1 for shorter words like tv or gps to match.
Since MySQL cannot build a fulltext index across two or more tables at once, it would be simpler if you denormalized your tables and put prd_name into the description table. That way you could get rid of the joins and just write:
SELECT prd_name, description
FROM description d
WHERE MATCH(prd_name, description) AGAINST ('+lcd +tv' IN BOOLEAN MODE)
You're using the LIKE clause badly and you don't seem to know what "AND" means. It's important to separate "and" as used in casual speech from "AND" as used in programming. AND in programming means "BOTH MUST BE TRUE". "and" in casual speech can mean "one of these conditions, you know what I mean?"
Also, you shouldn't be building SQL like this, it's an accident waiting to happen. You really should find a way to bind variables into SQL statements. I don't know PHP, so I can't help with that.
First, you should be using this in your WHERE clause p.prd_name LIKE '$q%'. Try this outside PHP -- outside the web -- just as a simple SQL query: SELECT * FROM PRODUCT P WHERE P.PRD_NAME LIKE 'T%'.
Second, you should fix "AND" to be "OR", since you want one condition OR the other condition to be true. If you want for BOTH conditions to be true, hardly anything will match.
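To illustrate both points (binding variables instead of concatenating SQL, and using OR rather than AND), here is a hedged PDO sketch; the answer itself gives no PHP, so the connection object and the exact conditions are assumptions:
// Bind the user's input instead of splicing it into the SQL string.
$stmt = $pdo->prepare(
    "SELECT p.prd_name, d.description
     FROM product p
     JOIN description_characteristc d USING (prd_cod)
     WHERE p.prd_name LIKE ?
        OR CONCAT(p.prd_name, ' ', d.description) LIKE ?
     LIMIT 10"
);
$stmt->execute([$q . '%', $q . '%']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);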

How do I compress MySQL ids leaving no gaps [duplicate]

"BIG" UPDATE:
Ok I was getting the whole
auto-increment point wrong. I though
this would be an easier way to target
the first, second, third and so row,
but it is just the wrong approach.
You should instead care about that the
auto_increments are unique and well...
that they increment. You should use
the for that.
I wont delete this question because I
think it might be helpful for someone
else with the same wrong idea, BUT
BE WARNED! :)
I have a very simple MySQL table which went like this:
id comment user
1 hello name1
2 bye name2
3 hola name3
Then I deleted the first two comments; the result:
id comment user
3 hola name3
So now when I add comments:
id comment user
3 hola name3
5 chau name4
6 xxx name5
My problem is that whenever a row gets deleted, I would need the ids to "start over" and look like this:
id comment user
1 hola name3
2 chau name4
3 xxx name5
I would like to know how it is possible to somehow "restart" the table so that it is "always" indexed 1, 2, 3 and so on.
Thanks in advance!!
I hope I have explained myself clearly enough; I'm sorry for all my "plain English", feel free to edit if you think a word might be confusing :) and please ask for any clarification needed!
BTW: I did not add any of my code because this is a simplified situation and I thought it would be more confusing and less helpful to others, but if you think it would help (or is necessary), tell me about it!
Assuming there are no foreign key issues to deal with, this code will do it:
set @id := 0;
update mytable
set id = (@id := @id + 1)
order by id;
If there are foreign key issues, make sure your constraints are defined like this before you execute the update:
ALTER TABLE child_table ADD CONSTRAINT
FOREIGN KEY (mytable_id) REFERENCES mytable (id)
ON UPDATE CASCADE; -- This is the important bit
When it's all done, execute this to fix up the auto_increment value:
SELECT MAX(ID) + 1 FROM MYTABLE; -- note the output
ALTER TABLE MYTABLE AUTO_INCREMENT = <result from above>;
Disclaimer: I can't think of one valid reason to do this, and it can break stuff very bad. However, I'm adding this for the sake of completeness and demonstration purposes.
You could use this really ugly solution, please only do this if you're at gunpoint or your dog is held hostage!
-- Create a new variable.
SET @newId := 0;
-- Set all id's in the table to a new one and
-- also increment the counter in the same step.
-- It's basically just setting id to ++id.
UPDATE
yourTableHere
SET
id = @newId := @newId + 1;
-- Now prepare and execute an ALTER TABLE statement
-- which sets the next auto-increment value.
SET #query:=CONCAT("ALTER TABLE yourTableHere AUTO_INCREMENT=", #newId+1);
PREPARE sttmnt FROM #query;
EXECUTE sttmnt;
DEALLOCATE PREPARE sttmnt;
This will reset all of the Ids to the position of the row in the table. Please be aware that this will reorder the rows to how MySQL gets them from the storage engine, so there's no guarantee on the order in any way.
If you have a system which is based on the Ids (like relationships between tables) then you'll be...well, let's say I hope you have a backup.
Can't be done using MySQL's autoincrement feature. You could roll your own solution, e.g. a mix between application logic and database triggers. BUT, seriously, your design is heavily broken if it requires you to recycle UNIQUE IDs.
Couldn't you just create another table where you'd save references like that (this could be done by querying the minimum) and let your main table point to that auxiliary table?
EDIT
Here's a blog I've googled that deals with your problem: see here.
ALTER TABLE event AUTO_INCREMENT = 1;
That's not the purpose of AUTO_INCREMENT. It exists to generate unique identifiers, not to maintain a gapless sequence.
If you have a valid reason for wanting this, then generate the identifiers yourself in your code. AUTO_INCREMENT won't provide this for you.
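One way to generate such display positions yourself, without touching the stored ids, is to number the rows at query time. A hedged sketch, assuming a PDO connection and a comments table shaped like the one in the question:
// Number the rows in id order at read time; the ids themselves stay untouched.
$pdo->exec("SET @n := 0");
$rows = $pdo->query(
    "SELECT (@n := @n + 1) AS position, `comment`, `user`
     FROM comments
     ORDER BY id"
)->fetchAll(PDO::FETCH_ASSOC);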
Actually, auto increment is made to increase, no matter how many rows there are.
ALTER TABLE table AUTO_INCREMENT = 1 does reset it, but bad things may happen if IDs start repeating.
So, my advise would be - leave it alone :)
function backup_tables($host, $user, $pass, $dbname, $tables = '*') {
    $connect = mysqli_connect($host, $user, $pass, $dbname);
    mysqli_query($connect, "SET NAMES 'utf8'");
    $return = '';
    // get all of the tables
    if ($tables == '*') {
        $tables = array();
        $result = mysqli_query($connect, 'SHOW TABLES');
        while ($row = mysqli_fetch_row($result)) {
            $tables[] = $row[0];
        }
    } else {
        $tables = is_array($tables) ? $tables : explode(',', $tables);
    }
    foreach ($tables as $table) {
        $table = trim($table);
        // getting all table fields
        $tblDetails = mysqli_query($connect, "SHOW FULL COLUMNS FROM $table");
        // we may need to know how to create our table
        $tblCreate = mysqli_fetch_row(mysqli_query($connect, 'SHOW CREATE TABLE ' . $table));
        // getting the last line of the table creation script to get info about the engine -> $suffix1
        $createLines = explode(PHP_EOL, $tblCreate[1]);
        $suffix1 = end($createLines);
        $suffix4 = '';
        // if there is an auto increment we have to remove it
        if (strpos($suffix1, "AUTO_INCREMENT")) {
            $tmpArr = explode(" ", $suffix1);
            $newStr = '';
            foreach ($tmpArr as $term) {
                if (!is_int(strpos($term, "AUTO_INCREMENT"))) {
                    $newStr .= $term . ' ';
                } else {
                    $suffix4 = $term; // $suffix4 stores the next value of auto_increment
                }
            }
            $suffix1 = $newStr;
        } // now the auto_increment is removed from the last line of the creation script
        $return .= "DROP TABLE IF EXISTS `" . $table . "` CASCADE;\n\n";
        // starting table creation with our rules
        $kgbReturn = "CREATE TABLE `$table` (\n";
        while ($cols = mysqli_fetch_row($tblDetails)) {
            if ($cols[2]) $cols[2] = " COLLATE " . $cols[2]; // if a collation is defined, add it to the line
            if ($cols[3] == 'NO') $cols[3] = " NOT NULL";    // if the field may not be null
            $kgbReturn .= "`" . $cols[0] . "` " . $cols[1] . $cols[2] . $cols[3] . ",\n"; // field creation line ready
        }
        $kgbReturn = rtrim($kgbReturn, ",\n") . "\n" . trim($suffix1, " ") . ";\n\n"; // table creation without auto_increment
        $tblDetails = mysqli_query($connect, "SHOW FULL COLUMNS FROM $table WHERE (`Key` LIKE 'PRI%')");
        $suffix2 = '';
        while ($cols = mysqli_fetch_row($tblDetails)) {
            $suffix2 .= "ALTER TABLE `" . $table . "` \n ADD PRIMARY KEY (`" . $cols[0] . "`);\n\n";
        }
        $tblDetails = mysqli_query($connect, "SHOW FULL COLUMNS FROM $table WHERE (Extra LIKE 'auto_increment%')");
        $suffix3 = '';
        while ($cols = mysqli_fetch_row($tblDetails)) {
            $suffix3 = "ALTER TABLE `" . $table . "` \n MODIFY `" . $cols[0] . "` " . $cols[1] . " NOT NULL AUTO_INCREMENT, " . $suffix4 . ";";
        }
        $return .= $kgbReturn;
        $result = mysqli_query($connect, 'SELECT * FROM ' . $table);
        $num_fields = mysqli_num_fields($result);
        // insert all values
        while ($row = mysqli_fetch_row($result)) {
            $return .= 'INSERT INTO ' . $table . ' VALUES(';
            for ($j = 0; $j < $num_fields; $j++) {
                $row[$j] = addslashes($row[$j]);
                $row[$j] = str_replace(array("\r\n", "\n", "\r", PHP_EOL), '\r', $row[$j]);
                if (isset($row[$j])) { $return .= '"' . $row[$j] . '"'; } else { $return .= '""'; }
                if ($j < ($num_fields - 1)) { $return .= ','; }
            }
            $return .= ");\n";
        }
        $return .= "\n\n"; // insert values completed
        // now add primary key and auto increment statements if they exist
        $return .= $suffix2 . $suffix3 . "\n\n\n";
        echo "<pre>" . $return . "</pre>"; // debug line, comment out if you don't like it
    }
    // we need to write to a file encoded as UTF-8
    $bkTime = date('Y_m_j_H_i_s');
    $fileName = 'backup-db-' . $bkTime . '.sql';
    $f = fopen($fileName, "w");
    // now UTF-8: add a byte order mark
    fwrite($f, pack("CCC", 0xef, 0xbb, 0xbf));
    fwrite($f, $return);
    fclose($f);
}
You shouldn't really be worrying about this - the only thing an id should be is unique; its actual value should be irrelevant.
That said, here is a way (see the top comment) to do exactly what you want to do.
For those that are looking to "reset" the auto_increment, say on a list that has had a few deletions and you want to renumber everything, you can do the following.
DROP the field you are auto_incrementing.
ALTER the table to ADD the field again with the same attributes.
You will notice that all existing rows are renumbered and the next auto_increment number will be equal to the row count plus 1.
(Keep in mind that DROPping that column will remove all existing data, so if you have exterior resources that rely on that data, or the numbers that are already there, you may break the link. Also, as with any major structure change, it's a good idea to backup your table BEFORE you make the change.)
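In PHP terms, those two steps would look roughly like this (mytable, the INT type, and the PDO connection are assumptions; as the caveat above says, all existing id values are destroyed):
// WARNING: dropping the column throws away every existing id value.
$pdo->exec("ALTER TABLE mytable DROP COLUMN id");
$pdo->exec("ALTER TABLE mytable ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST");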
Maybe your approach to the solution is what's not correct: what you're trying to achieve isn't possible "automatically", and doing it by hand when you have thousands of rows will make your system lag.
Is it really necessary that the system adjusts on every delete?
UPDATE table_name SET id = NULL;
ALTER TABLE table_name CHANGE COLUMN `id` `id` INT AUTO_INCREMENT;

MySQL "IN" operator performance on (large?) number of values

I have been experimenting with Redis and MongoDB lately and it would seem that there are often cases where you would store an array of id's in either MongoDB or Redis. I'll stick with Redis for this question since I am asking about the MySQL IN operator.
I was wondering how performant it is to list a large number (300-3000) of id's inside the IN operator, which would look something like this:
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
Imagine something as simple as a products and categories table which you might normally JOIN together to get the products from a certain category. In the example above you can see that under a given category in Redis ( category:4:product_ids ) I return all the product ids from the category with id 4, and place them in the above SELECT query inside the IN operator.
How performant is this?
Is this an "it depends" situation? Or is there a concrete "this is (un)acceptable" or "fast" or "slow" or should I add a LIMIT 25, or doesn't that help?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25
Or should I trim the array of product id's returned by Redis to limit it to 25 and only add 25 id's to the query rather than 3000 and LIMIT-ing it to 25 from inside the query?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)
Any suggestions/feedback is much appreciated!
Generally speaking, if the IN list gets too large (for some ill-defined value of 'too large' that is usually in the region of 100 or smaller), it becomes more efficient to use a join, creating a temporary table if need so be to hold the numbers.
If the numbers are a dense set (no gaps - which the sample data suggests), then you can do even better with WHERE id BETWEEN 300 AND 3000.
However, presumably there are gaps in the set, at which point it may be better to go with the list of valid values after all (unless the gaps are relatively few in number, in which case you could use:
WHERE id BETWEEN 300 AND 3000 AND id NOT BETWEEN 742 AND 836
Or whatever the gaps are.
I have been doing some tests, and as David Fells says in his answer, it is quite well optimized. As a reference, I have created an InnoDB table with 1,000,000 registers and doing a select with the "IN" operator with 500,000 random numbers, it takes only 2.5 seconds on my MAC; selecting only the even registers takes 0.5 seconds.
The only problem that I had is that I had to increase the max_allowed_packet parameter from the my.cnf file. If not, a mysterious “MYSQL has gone away” error is generated.
Here is the PHP code that I use to make the test:
$NROWS =1000000;
$SELECTED = 50;
$NROWSINSERT =15000;
$dsn="mysql:host=localhost;port=8889;dbname=testschema";
$pdo = new PDO($dsn, "root", "root");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("drop table if exists `uniclau`.`testtable`");
$pdo->exec("CREATE TABLE `testtable` (
`id` INT NOT NULL ,
`text` VARCHAR(45) NULL ,
PRIMARY KEY (`id`) )");
$before = microtime(true);
$Values='';
$SelValues='(';
$c=0;
for ($i=0; $i<$NROWS; $i++) {
$r = rand(0,99);
if ($c>0) $Values .= ",";
$Values .= "( $i , 'This is value $i and r= $r')";
if ($r<$SELECTED) {
if ($SelValues!="(") $SelValues .= ",";
$SelValues .= $i;
}
$c++;
if (($c==100)||(($i==$NROWS-1)&&($c>0))) {
$pdo->exec("INSERT INTO `testtable` VALUES $Values");
$Values = "";
$c=0;
}
}
$SelValues .=')';
echo "<br>";
$after = microtime(true);
echo "Insert execution time =" . ($after-$before) . "s<br>";
$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id IN $SelValues";
$result = $pdo->prepare($sql);
$after = microtime(true);
echo "Prepare execution time =" . ($after-$before) . "s<br>";
$before = microtime(true);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Random selection = $c Time execution time =" . ($after-$before) . "s<br>";
$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id %2 = 1";
$result = $pdo->prepare($sql);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Pairs = $c Exdcution time=" . ($after-$before) . "s<br>";
And the results:
Insert execution time =35.2927210331s
Prepare execution time =0.0161771774292s
Random selection = 499102 Time execution time =2.40285992622s
Pairs = 500000 Exdcution time=0.465420007706s
You can create a temporary table where you can put any number of IDs and run nested query
Example:
CREATE [TEMPORARY] TABLE tmp_IDs (`ID` INT NOT NULL,PRIMARY KEY (`ID`));
and select:
SELECT id, name, price
FROM products
WHERE id IN (SELECT ID FROM tmp_IDs);
Using IN with a large parameter set on a large list of records will in fact be slow.
In the case that I solved recently I had two where clauses, one with 2,50 parameters and the other with 3,500 parameters, querying a table of 40 Million records.
My query took 5 minutes using the standard WHERE IN. By instead using a subquery for the IN statement (putting the parameters in their own indexed table), I got the query down to TWO seconds.
Worked for both MySQL and Oracle in my experience.
IN is fine, and well optimized. Make sure you use it on an indexed field and you're fine.
It's functionally equivalent to:
(x = 1 OR x = 2 OR x = 3 ... OR x = 99)
As far as the DB engine is concerned.
EDIT: Please notice this answer was written in 2011, and see the comments of this answer discussing the latest MySQL features.
When you provide many values for the IN operator it first must sort it to remove duplicates. At least I suspect that. So it would be not good to provide too many values, as sorting takes N log N time.
My experience proved that slicing the set of values into smaller subsets and combining the results of all the queries in the application gives best performance. I admit that I gathered experience on a different database (Pervasive), but the same may apply to all the engines. My count of values per set was 500-1000. More or less was significantly slower.