MySQL 5.5 "select distinct" is really slow - mysql

One of the things my app does a fair amount is:
select count(distinct id) from x;
with id the primary key for table x. With MySQL 5.1 (and 5.0), it looks like this:
mysql> explain SELECT count(distinct id) from x;
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
| id | select_type | table    | type  | possible_keys | key             | key_len | ref  | rows    | Extra       |
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
|  1 | SIMPLE      | x        | index | NULL          | ix_blahblahblah | 1       | NULL | 1234567 | Using index |
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
On InnoDB, this isn't exactly blazing, but it's not bad, either.
This week I'm trying out MySQL 5.5.11, and was surprised to see that the same query is many times slower. With the cache primed, it takes around 90 seconds, compared to 5 seconds before. The plan now looks like this:
mysql> explain select count(distinct id) from x;
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
| id | select_type | table    | type  | possible_keys | key     | key_len | ref  | rows    | Extra                               |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
|  1 | SIMPLE      | x        | range | NULL          | PRIMARY | 4       | NULL | 1234567 | Using index for group-by (scanning) |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
One way to make it go fast again is to use select count(id) from x, which is safe because id is a primary key, but I'm going through some abstraction layers (like NHibernate) that make this a non-trivial task.
I tried analyze table x but it didn't make any appreciable difference.
It looks kind of like this bug, though it's not clear what versions that applies to, or what's happening (nobody's touched it in a year, yet it's marked "serious/high/high").
Is there any way, besides simply changing my query, to get MySQL to be smarter about this?
UPDATE:
As requested, here's a way to reproduce it, more or less. I wrote this SQL script to generate 1 million rows of dummy data (takes 10 or 15 minutes to run):
drop table if exists x;
create table x (
  id integer unsigned not null auto_increment,
  a integer,
  b varchar(100),
  c decimal(9,2),
  primary key (id),
  index ix_a (a),
  index ix_b (b),
  index ix_c (c)
) engine=innodb;
drop procedure if exists fill;
delimiter $$
create procedure fill()
begin
  declare i int default 0;
  while i < 1000000 do
    insert into x (a,b,c) values (1,"one",1.0);
    set i = i+1;
  end while;
end$$
delimiter ;
call fill();
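Aside: most of that run time is one implicit commit (and fsync) per row under autocommit. A batched variant of fill() (my own sketch, not part of the original post) commits every 10,000 rows and should run much faster:

drop procedure if exists fill;
delimiter $$
create procedure fill()
begin
  declare i int default 0;
  start transaction;
  while i < 1000000 do
    insert into x (a,b,c) values (1,"one",1.0);
    set i = i + 1;
    if i % 10000 = 0 then
      commit;            -- flush this batch of rows
      start transaction; -- and begin the next batch
    end if;
  end while;
  commit;
end$$
delimiter ;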
When it's done, I observe this behavior:
5.1.48:

select count(distinct id) from x
  EXPLAIN shows: key: ix_a, Extra: Using index
  takes under 1.0 sec to run

select count(id) from x
  EXPLAIN shows: key: ix_a, Extra: Using index
  takes under 0.5 sec to run

5.5.11:

select count(distinct id) from x
  EXPLAIN shows: key: PRIMARY, Extra: Using index for group-by
  takes over 7.0 sec to run

select count(id) from x
  EXPLAIN shows: key: ix_a, Extra: Using index
  takes under 0.5 sec to run
EDIT:
If I modify the query in 5.5 by saying
select count(distinct id) from x force index (ix_a);
it runs much faster. Indexes b and c also work (to varying degrees), and even forcing index PRIMARY helps.

I'm not making any promises that this will be better, but as a possible workaround you could try:
SELECT COUNT(*)
FROM (SELECT id
      FROM x
      GROUP BY id) t

I'm not sure why you need DISTINCT on a unique primary key. It looks like MySQL is treating the DISTINCT keyword as an operator and losing the ability to make use of the index (as would any operation on a field). Other SQL engines also sometimes don't optimize searches on expressions very well, so it's no surprise.
I note your comment in another answer about this being an artifact of your ORM. Have you ever read the famous Leaky Abstractions blog post by Joel Spolsky? I think that's where you are. Sometimes you end up spending more time straightening out the tool than you spend on the problem you're using the tool to solve.

I don't know if you have realised, but counting the rows of a large InnoDB table is slow even without the DISTINCT keyword. InnoDB does not cache the row count in the table metadata; MyISAM does.
I would suggest you do one of two things:
1) Create a trigger that inserts/updates distinct counts into another table on insertion (a minimal sketch follows).
2) Slave another MySQL server to your database, but change the table type on the slave only to MyISAM and perform your query there (this is probably overkill).
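Since id is the primary key here, count(distinct id) is just the row count, so a counter table kept current by triggers does the job. A minimal sketch of option 1 (the x_rowcount table and trigger names are my own):

create table x_rowcount (cnt bigint not null) engine=innodb;
insert into x_rowcount select count(*) from x;

delimiter $$
create trigger x_count_ai after insert on x for each row
begin
  update x_rowcount set cnt = cnt + 1;
end$$
create trigger x_count_ad after delete on x for each row
begin
  update x_rowcount set cnt = cnt - 1;
end$$
delimiter ;

-- the count is then a constant-time read
select cnt from x_rowcount;

Note that the single-row counter serializes concurrent writers on its row lock, so this trades some write throughput for the fast count.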

I may be misreading your question, but if id is the primary key of table x, then the following two queries are logically equivalent:
select count(distinct id) from x;
select count(*) from x;
...regardless of whether the optimizer realizes this. Distinct generally implies a sort or scanning the index in order, which is considerably slower than just counting the rows.

Creative use of autoincrement fields
Note that your id is autoincrement.
It adds 1 after each insert, and it does not reuse numbers, so if you delete a row you need to keep track of that.
My idea goes something like this.
Count(rows) = Max(id) - number of deletions - starting(id) + 1
Scenario using update
Create a separate table with the totals per table.
table counts
  id integer autoincrement primary key
  tablename varchar(45)            /* not needed if you only need to count 1 table */
  start_id integer default maxint
  delete_count integer default 0
Make sure you capture the starting id into that table before the first delete(!):
INSERT INTO counts (tablename, start_id, delete_count)
SELECT 'x', MIN(x.id), 0
FROM x;
Now create an AFTER DELETE trigger.
DELIMITER $$
CREATE TRIGGER ad_x_each AFTER DELETE ON x FOR EACH ROW
BEGIN
  UPDATE counts SET delete_count = delete_count + 1 WHERE tablename = 'x';
END $$
DELIMITER ;
If you want the count, you do:
SELECT max(x.id) - c.start_id + 1 - c.delete_count as number_of_rows
FROM x
INNER JOIN counts c ON (c.tablename = 'x')
This will give you your count instantly, without requiring a trigger to fire on every insert.
Scenario using insert
If you have lots of deletes, you can speed up the process by doing an insert instead of an update in the trigger, and counting the trigger rows in the final select.
TABLE count_x   /* 1 counting table per table to keep track of */
  id integer autoincrement primary key   /* make sure this field starts at 1 */
  start_id integer default maxint        /* do not put an index on this field! */
Seed the starting id into the count table.
INSERT INTO count_x (start_id) SELECT MIN(x.id) FROM x;
Now create an AFTER DELETE trigger.
DELIMITER $$
CREATE TRIGGER ad_x_each AFTER DELETE ON x FOR EACH ROW
BEGIN
  INSERT INTO count_x (start_id) VALUES (default);
END $$
DELIMITER ;
SELECT max(x.id) - min(c.start_id) + 1 - max(c.id) as number_of_rows
FROM x
JOIN count_x as c ON (c.id > 0)
You'll have to test which approach works best for you.
Note that in the insert scenario you don't need delete_count, because you are using the autoincrementing id to keep track of the number of deletions.

select count(*)
from (select distinct id from x) t

Related

What is the default select order in PostgreSQL or MySQL?

I have read in the PostgreSQL docs that without an ORDER statement, SELECT will return records in an unspecified order.
Recently in an interview, I was asked how to SELECT records in the order that they were inserted, without a PK or a created_at or any other field that can be used for ordering. The senior dev who interviewed me was insistent that without an ORDER statement the records will be returned in the order that they were inserted.
Is this true for PostgreSQL? Is it true for MySQL? Or any other RDBMS?
I can answer for MySQL. I don't know for PostgreSQL.
The default order is not the order of insertion, generally.
In the case of InnoDB, the default order depends on the order of the index read for the query. You can get this information from the EXPLAIN plan.
For MyISAM, it returns rows in the order they are read from the table. This might be the order of insertion, but MyISAM will reuse gaps after you delete records, so newer rows may be stored earlier.
None of this is guaranteed; it's just a side effect of the current implementation. MySQL could change the implementation in the next version, making the default order of result sets different, without violating any documented behavior.
So if you need the results in a specific order, you should use ORDER BY on your queries.
Following BK's answer, and by way of example...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table(id INT NOT NULL) ENGINE = MYISAM;
INSERT INTO my_table VALUES (1),(9),(5),(8),(7),(3),(2),(6);
DELETE FROM my_table WHERE id = 8;
INSERT INTO my_table VALUES (4),(8);
SELECT * FROM my_table;
+----+
| id |
+----+
|  1 |
|  9 |
|  5 |
|  4 |   -- is this what
|  7 |
|  3 |
|  2 |
|  6 |
|  8 |   -- we expect?
+----+
In the case of PostgreSQL, that is quite wrong.
If there are no deletes or updates, rows will be stored in the table in the order you insert them. And even though a sequential scan will usually return the rows in that order, that is not guaranteed: the synchronized sequential scan feature of PostgreSQL can have a sequential scan "piggy back" on an already executing one, so that rows are read starting somewhere in the middle of the table.
However, this ordering of the rows breaks down completely if you update or delete even a single row: the old version of the row will become obsolete, and (in the case of an UPDATE) the new version can end up somewhere entirely different in the table. The space for the old row version is eventually reclaimed by autovacuum and can be reused for a newly inserted row.
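A small demonstration of that effect (my own sketch; "typically" because none of this is guaranteed):

CREATE TABLE t (id int, val text);
INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
UPDATE t SET val = 'B' WHERE id = 2;  -- writes a new row version after the others
SELECT * FROM t;                      -- typically returns 1, 3, 2: insertion order is gone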
Without an ORDER BY clause, the database is free to return rows in any order. There is no guarantee that rows will be returned in the order they were inserted.
With MySQL (InnoDB), we observe that rows are typically returned in the order of an index used in the execution plan, or of the table's cluster key.
It is not difficult to craft an example...
CREATE TABLE foo
( id INT NOT NULL
, val VARCHAR(10) NOT NULL DEFAULT ''
, UNIQUE KEY (id,val)
) ENGINE=InnoDB;
INSERT INTO foo (id, val) VALUES (7,'seven') ;
INSERT INTO foo (id, val) VALUES (4,'four') ;
SELECT id, val FROM foo ;
MySQL is free to return rows in any order, but in this case, we would typically observe that MySQL will access rows through the InnoDB cluster key.
id    val
----  -----
   4  four
   7  seven
Not at all clear what point the interviewer was trying to make. If the interviewer is trying to sell the idea that, given a requirement to return rows from a table in the order the rows were inserted, a query without an ORDER BY clause is ever the right solution, I'm not buying it.
We can craft examples where rows are returned in the order they were inserted, but that is a byproduct of the implementation, not guaranteed behavior, and we should never rely on that behavior to satisfy a specification.

Mysql High Concurrency Updates

I have a mysql table:
CREATE TABLE `coupons` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `code` VARCHAR(255),
  `user_id` INT,
  PRIMARY KEY (`id`),
  UNIQUE KEY `code_idx` (`code`)
) ENGINE=InnoDB;
The table consists of thousands/millions of codes and initially user_id is NULL for everyone.
Now I have a web application which assigns a unique code to each of thousands of users visiting the application concurrently. I am not sure of the correct way to handle this considering the very high traffic.
The query I have written is:
UPDATE coupons SET user_id = <some_id> where user_id is NULL limit 1;
And the application runs this query with say a concurrency of 1000 req/sec.
What I have observed is the entire table gets locked and this is not scaling well.
What should I do?
Thanks.
As I understand it, coupons is prepopulated, and a NULL user_id is updated to one that is not null.
explain update coupons set user_id = 1 where user_id is null limit 1;
This likely requires an architectural solution, but you may wish to review the EXPLAIN after ensuring that the table has indexes for the columns involved, and that they facilitate rapid updates.
Adding an index to coupons.user_id, for example, alters MySQL's strategy:
create unique index user_id_idx on coupons(user_id);
explain update coupons set user_id = 1 where user_id is null limit 1;
+----+-------------+---------+------------+-------+---------------+-------------+---------+-------+------+----------+------------------------------+
| id | select_type | table   | partitions | type  | possible_keys | key         | key_len | ref   | rows | filtered | Extra                        |
+----+-------------+---------+------------+-------+---------------+-------------+---------+-------+------+----------+------------------------------+
|  1 | UPDATE      | coupons | NULL       | range | user_id_idx   | user_id_idx | 5       | const |    6 |   100.00 | Using where; Using temporary |
+----+-------------+---------+------------+-------+---------------+-------------+---------+-------+------+----------+------------------------------+
1 row in set (0.01 sec)
So you should work with a DBA to ensure that the database entity is optimized. Trade-offs need to be considered.
Also, since you have a client application, you have the opportunity to pre-fetch the id of a coupon whose user_id is NULL and then update directly by coupons.id, along the lines of the sketch below. Curious to hear of your solution.
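For instance (my own sketch; SELECT ... FOR UPDATE SKIP LOCKED requires MySQL 8.0+, and 123 stands in for the user being assigned):

START TRANSACTION;
-- claim one free coupon without waiting on rows other sessions hold
SELECT id FROM coupons
WHERE user_id IS NULL
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- then assign it by primary key, using the id the SELECT returned
UPDATE coupons SET user_id = 123 WHERE id = <id from the select>;
COMMIT;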
This question might be more suitable for DBAs (and I'm not a DBA), but I'll try to give you some ideas of what's going on.
InnoDB does not actually lock the whole table when you perform your update query. What it does is this: it puts a record lock which prevents any other transaction from inserting, updating, or deleting rows where the value of coupons.user_id is NULL.
With the query you have at the moment (which depends on user_id being NULL), you cannot have concurrency, because the transactions will run one after another, not in parallel.
Even an index on coupons.user_id won't help, because when putting the lock InnoDB creates a shadow index for you if you don't have one. The outcome would be the same.
So, if you want to increase your throughput, there are two options I can think of:
Assign a user to a coupon in async mode. Put all assignment requests in a queue, then process the queue in the background. This might not be suitable for your business rules.
Decrease the number of locked records. The idea here is to lock as few records as possible while performing an update. To achieve this, add one or more indexed columns to your table, then use that index in the WHERE clause of the UPDATE query. An example of such a column is a product_id, or a category, or maybe a user location (country, ZIP).
Then your query will look something like this:
UPDATE coupons SET user_id = <some_id> WHERE product_id = <product_id> AND user_id IS NULL LIMIT 1;
Now InnoDB will lock only records with product_id = <product_id>. This way you'll have concurrency.
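If there is no natural column to partition on, a synthetic one can work too; here is my own sketch (the bucket column, index name, and modulus are illustrative, not from the question):

ALTER TABLE coupons
  ADD COLUMN bucket TINYINT UNSIGNED NOT NULL DEFAULT 0,
  ADD INDEX bucket_user_idx (bucket, user_id);

-- spread the existing coupons over 16 buckets
UPDATE coupons SET bucket = id % 16;

-- each request picks a bucket at random, so concurrent updates mostly
-- contend on different index ranges
UPDATE coupons SET user_id = 123
WHERE bucket = 7 AND user_id IS NULL
LIMIT 1;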
Hope this helps!

mySQL Is non-clustering index + not indexed field still faster than 2 x not indexed fields?

For example, I have a table with 3 columns:
"id", "a", "b"
id is primary key
a - field without index
b - field without index
CREATE TABLE samples (id INT, a INT, b INT, PRIMARY KEY(id));
Now I want to do a select query:
SELECT * FROM samples where a = '77345' and b = '234234';
As I understand it, this query will be really fast if I have an index covering both "a" and "b" fields, like this:
CREATE INDEX ab_index ON samples (a, b) USING BTREE;
Question:
Will the select query above be faster if I add an index for the "a" field only (no other indexes):
CREATE INDEX a_index ON samples (a) USING BTREE;
If yes, how much faster will it be?
Table samples:
create table samples (id int NOT NULL AUTO_INCREMENT, a int, b int, PRIMARY KEY(id));
inserted 2,483,308 records
Testing query
select * from samples where a = 3434 and b = 4389;
Without indexes:
**Timing (as measured by the server):**
Execution time: 0:00:0.57075288
Table lock wait time: 0:00:0.00008100
With index on (a):
CREATE INDEX a_index ON samples (a) USING BTREE;
**Timing (as measured by the server):**
Execution time: 0:00:0.00021302
Table lock wait time: 0:00:0.00008300
With index (a, b) only:
CREATE INDEX ab_index ON samples (a, b) USING BTREE;
**Timing (as measured by the server):**
Execution time: 0:00:0.00019394
Table lock wait time: 0:00:0.00007600
With (a) and (a, b) indexes:
**Timing (as measured by the server):**
Execution time: 0:00:0.00022304
Table lock wait time: 0:00:0.00008300
Dropped indexes, without any indexes again:
**Timing (as measured by the server):**
Execution time: 0:00:0.57105565
Table lock wait time: 0:00:0.00008300
With (a) only index again:
Execution time: 0:00:0.00021866
Table lock wait time: 0:00:0.00008700
So yes, adding the (a)-only index increases speed significantly.
What is strange: EXPLAIN shows that with both the (a) and (a, b) indexes present, MySQL still uses the (a) index for some reason.
explain select * from samples where a = 45 and b = 3456;
+----+-------------+---------+------+------------------+---------+---------+-------+------+-------------+
| id | select_type | table   | type | possible_keys    | key     | key_len | ref   | rows | Extra       |
+----+-------------+---------+------+------------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | samples | ref  | a_index,ab_index | a_index | 5       | const |    1 | Using where |
+----+-------------+---------+------+------------------+---------+---------+-------+------+-------------+
"How much faster" questions are really quite difficult to answer without knowing a lot about the content of your data. If you index just your a column, and you have a large table, but not many distinct values of a, MySQL will still have to scan a large chunk of your table.
An index on just a or just b will probably, but not certainly, be faster than scanning the whole table. It's really hard to know without trying it on real data. It's certainly worth trying.
Pro tip: never use SELECT *; where you can, enumerate the names of the columns you need instead. That's because the query execution planner can sometimes take shortcuts when it knows it doesn't need all the columns.
The query you've given, rewritten to say
SELECT id,a,b
FROM samples
where a = '77345'
and b = '234234'
can be very fast indeed if you have an index on (a,b,id). That's because MySQL can satisfy the whole query from the index. It finds a, then it finds b, in the index. Then sitting right there is the id value. This is called a compound covering index. It's worth reading about.
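For example (my own sketch; on InnoDB the primary key is implicitly appended to every secondary index, so (a, b) alone already covers id):

CREATE INDEX ab_cover ON samples (a, b) USING BTREE;
EXPLAIN SELECT id, a, b FROM samples WHERE a = 77345 AND b = 234234;
-- Extra should show "Using index", meaning the query is answered
-- entirely from the index without touching the table rows.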

Default value for order field in mysql

In a given table I have a field (field_order) that will serve as a way to define a custom order for showing the rows of the table. When inserting a new record, I would like to set that particular field to the number of rows in the table plus one.
So if the table has 3 rows, at the time of inserting a new one, the default value for field_order should be 4.
What would be the best approach to set that value?
A simple select count inside the insert statement?
Is there a constant like CURRENT_TIMESTAMP for TIMESTAMP datatype that returns that value?
EDIT: The reason behind this is to be able to sort the table by that particular field; that field would be manipulated by the user on the client side using jQuery's sortable.
Okay, so the solutions surrounding this question actually involve a bit of nuance. Went ahead and decided to answer, but also wanted to address some of the nuance/details that the comments aren't addressing yet.
First off, I would very strongly advise you against using auto_increment on the primary key for this, if for no other reason than that it's very easy for those auto-increment ids to get thrown off (for example, rolled-back transactions will interfere with them: MySQL AUTO_INCREMENT does not ROLLBACK. So will deletes, as @Sebas mentioned).
Second, you have to consider your storage engine. If you are using MyISAM, you can very quickly obtain a COUNT(*) of the table (because MyISAM always knows how many rows are in each table). If you're using INNODB, that's not the case. Depending on what you need this table for, you may be able to get away with MyISAM. It's not the default engine, but it is certainly possible that you could encounter a requirement for which MyISAM would be a better choice.
The third thing you should ask yourself is, "Why?" Why do you need to store your data that way at all? What does that actually give you? Do you in fact need that information in SQL? In the same table of the same SQL database?
And if the "Why" has an answer that justifies its use, then the last thing I'd ask is "how?" In particular, how are you going to deal with concurrent inserts? How are you going to deal with deletes or rollbacks?
Given the requirement that you have, doing a count star of the table is basically necessary... but even then, there's some nuance involved (deletes, rollbacks, concurrency) and also some decisions to be made (which storage engine do you use; can you get away with using MyISAM, which will be faster for count stars?).
More than anything, though, I'd question why I needed this in the first place. Maybe you really do... but that's an awfully strange requirement.
IN LIGHT OF YOUR EDIT:
EDIT: The reason behind this is to be able to sort the table by that particular field; that field would be manipulated by the user on the client side using jQuery's sortable
Essentially what you are asking for is metadata about your tables. And I would recommend storing those metadata in a separate table, or in a separate service altogether (Elastic Search, Redis, etc). You would need to periodically update that separate table (or key value store). If you were doing this in SQL, you could use a trigger. Or, if you used something like Elastic Search, you could insert your data into SQL and ES at the same time. Either way, you have some tricky issues you need to contend with (for example, eventual consistency, concurrency, all the glorious things that can backfire when you are using triggers in MySQL).
If it were me, I'd note two things. One, not even Google delivers an always up-to-date COUNT(*). "Showing rows 1-10 out of approximately XYZ." They do that in part because they have more data than I imagine you do, and in part because it actually is impractical (and very quickly becomes infeasible and prohibitive) to calculate an exact COUNT(*) of a table and keep it up to date at all times.
So, either I'd change my requirement entirely and leverage a statistic I can obtain quickly (if you are using MyISAM for storage, go ahead and use count( * )... it will be very fast) or I would consider maintaining an index of the count stars of my tables that periodically updates via some process (cron job, trigger, whatever) every couple of hours, or every day, or something along those lines.
Regarding the bounty on this question: there will never be a single, canonical answer to this question. There are tradeoffs to be made no matter how you decide to manage it. They may be tradeoffs in terms of consistency, latency, scalability, precise vs approximate solutions, losing InnoDB in exchange for MyISAM... but there will be tradeoffs. And ultimately the decision comes down to what you are willing to trade in order to get your requirement.
If it were me, I'd probably flex my requirement. And if I did, I'd probably end up indexing it in Elastic Search and make sure it was up to date every couple of hours or so. Is that what you should do? That depends. It certainly isn't a "right answer" as much as it is one answer (out of many) that would work if I could live with my count(*) getting a bit out of date.
Should you use Elastic Search for this? That depends. But you will be dealing with tradeoffs which ever way you go. That does not depend. And you will need to decide what you're willing to give up in order to get what you want. If it's not critical, flex the requirement.
There may be a better approach, but all I can think of right now is to create a second table that holds the value you need, and use triggers to make the appropriate inserts / deletes:
Here's an example:
-- Let's say this is your table
create table tbl_test(
  id int unsigned not null auto_increment primary key,
  text varchar(50)
);

-- Now, here's the table I propose.
-- It will be related to your original table using 'id'.
-- (If you're using InnoDB you can add the appropriate constraint.)
create table tbl_incremental_values(
  id int unsigned not null primary key,
  incremental_value int unsigned not null default 0
);
-- The triggers that make this work:
delimiter $$
create trigger trig_add_one after insert on tbl_test for each row
begin
  declare n int unsigned default 0;
  set n = (select count(*) from tbl_test);
  insert into tbl_incremental_values
  values (NEW.id, (n));
end $$

-- If you're using InnoDB tables and you've created a constraint that cascades
-- delete operations, skip this trigger
create trigger trig_remove before delete on tbl_test for each row
begin
  delete from tbl_incremental_values where id = OLD.id;
end $$
delimiter ;
Now, let's test it:
insert into tbl_test(text) values ('a'), ('b');

select a.*, b.incremental_value
from tbl_test as a inner join tbl_incremental_values as b using (id);

-- Result:
-- id | text | incremental_value
-- ---+------+------------------
--  1 | a    | 1
--  2 | b    | 2

delete from tbl_test where text = 'b';

select a.*, b.incremental_value
from tbl_test as a inner join tbl_incremental_values as b using (id);

-- Result:
-- id | text | incremental_value
-- ---+------+------------------
--  1 | a    | 1

insert into tbl_test(text) values ('c'), ('d');

select a.*, b.incremental_value
from tbl_test as a inner join tbl_incremental_values as b using (id);

-- Result:
-- id | text | incremental_value
-- ---+------+------------------
--  1 | a    | 1
--  3 | c    | 2
--  4 | d    | 3
This will work fine for small datasets, but as evanv says in his answer:
Why?" Why do you need to store your data that way at all? What does that actually give you? Do you in fact need that information in SQL? In the same table of the same SQL table?
If all you need is to output that result, there's a much easier way to make this work: user variables.
Let's now say that your table is something like this:
create table tbl_test(
  id int unsigned not null auto_increment primary key,
  ts timestamp,
  text varchar(50)
);

insert into tbl_test(text) values('a');
insert into tbl_test(text) values('b');
insert into tbl_test(text) values('c');
insert into tbl_test(text) values('d');
delete from tbl_test where text = 'b';
insert into tbl_test(text) values('e');
The ts column will take the value of the date and time on which each row was inserted, so if you sort it by that column, you'll get the rows in the order they were inserted. But now: how to add that "incremental value"? Using a little trick with user variables it is possible:
select a.*
     , @n := @n + 1 as incremental_value
    -- ^^^^^^^^^^^^ this will update the value of @n on each row
from (select @n := 0) as init -- <-- you need to initialize @n to zero
   , tbl_test as a
order by a.ts;

-- Result:
-- id | ts                  | text | incremental_value
-- ---+---------------------+------+------------------
--  1 | xxxx-xx-xx xx:xx:xx | a    | 1
--  3 | xxxx-xx-xx xx:xx:xx | c    | 2
--  4 | xxxx-xx-xx xx:xx:xx | d    | 3
--  5 | xxxx-xx-xx xx:xx:xx | e    | 4
But now... how to deal with big datasets, where it's likely you'll use LIMIT? Simply initialize @n to the start value of the LIMIT:
-- A dull example:
prepare stmt from
  "select a.*, @n := @n + 1 as incremental_value
   from (select @n := ?) as init, tbl_test as a
   order by a.ts
   limit ?, ?";
-- The question marks work as "place holders" for values. If you're working
-- directly on the MySQL CLI or MySQL Workbench, you'll need to create user
-- variables to hold the values you want to use.
set @first_row = 2, @nrows = 2;
execute stmt using @first_row, @first_row, @nrows;
--                 ^^^^^^^^^^  ^^^^^^^^^^  ^^^^^^
--                 Initializes The "floor" The number
--                 the @n      of the      of rows
--                 value       LIMIT       you want
--
-- Set @first_row to zero if you want to get the first @nrows rows
--
-- Result:
-- id | ts                  | text | incremental_value
-- ---+---------------------+------+------------------
--  4 | xxxx-xx-xx xx:xx:xx | d    | 3
--  5 | xxxx-xx-xx xx:xx:xx | e    | 4
deallocate prepare stmt;
It seems like the original question was asking for an easy way to set a default sort order on a new record. Later on the user may adjust that "order field" value. Seems like DELETES and ROLLBACKS have nothing to do with this.
Here's a simple solution. For the sort order field, set your default value as 0, and use the primary key as your secondary sort. Simply change your sort order in the query to be DESC. If you want the default functionality to be "display most recently added first", then use:
SELECT * from my_table
WHERE user_id = :uid
ORDER BY field_order, primary_id DESC
If you want to "display most recently added last" use:
SELECT * from my_table
WHERE user_id = :uid
ORDER BY field_order DESC, primary_id
What I have done to avoid the SELECT COUNT(*) ... in the insert query is to have an unsorted state of the field_order column, let's say a default value of 0.
The select-query looks like:
SELECT * FROM my_table ... ORDER BY field_order, id_primary
As long as you don't apply a custom order, your query will result in chronological order.
When you want to apply custom sorting, field_order should be reset by counting the rows from -X up to 0:
id | sort
---+-----
1 | -2
2 | -1
3 | 0
When such altering occurs, the custom sort remains, and new rows will always be sorted chronologically at the end of the custom sorting already in place:
id | sort
---+-----
1 | -2
3 | 0
4 | 0
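A sketch of that resetting step using a user variable (my own illustration; id_primary is the primary key column from the query above):

-- renumber the current display order as -(n-1) ... 0
set @i := 0;
set @n := (select count(*) from my_table);
update my_table
set field_order = (@i := @i + 1) - @n
order by field_order, id_primary;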

MySQL query takes too long -- what should be the index?

Here is my query:
CREATE TEMPORARY TABLE temptbl (
  pibn INT UNSIGNED NOT NULL,
  page SMALLINT UNSIGNED NOT NULL
) ENGINE=MEMORY;
INSERT INTO temptbl (
  SELECT pibn,page FROM mytable
  WHERE word1=429907 AND word2=0);
ALTER TABLE temptbl ADD INDEX (pibn,page);
SELECT word1,COUNT(*) AS aaa
FROM mytable a
INNER JOIN temptbl b
  ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1 ORDER BY aaa DESC LIMIT 10;
DROP TABLE temptbl;
The issue is the SELECT word1,COUNT(*) AS aaa, specifically the count. That select statement takes 16 seconds.
EXPLAIN says:
+----+-------------+-------+------+---------------------------------+-------------+---------+---------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys                   | key         | key_len | ref                       | rows  | Extra                           |
+----+-------------+-------+------+---------------------------------+-------------+---------+---------------------------+-------+---------------------------------+
|  1 | SIMPLE      | b     | ALL  | pibn                            | NULL        | NULL    | NULL                      | 26778 | Using temporary; Using filesort |
|  1 | SIMPLE      | a     | ref  | w2pibnpage1,word21pibn,pibnpage | w2pibnpage1 | 9       | const,db.b.pibn,db.b.page |     4 | Using index                     |
+----+-------------+-------+------+---------------------------------+-------------+---------+---------------------------+-------+---------------------------------+
The index used (w2pibnpage1) is on:
word2,pibn,page,word1,id
I've been struggling with this for days, trying different combinations of columns for the index (which is annoying as it takes an hour to rebuild - millions of rows).
What should my indexes be, or what should I do to get this query to run in a fraction of a second (as it should)?
Here is a suggestion.
Presumably the temporary table is small. You can remove the index on that table, because a full table scan is fine there. In fact, that is what you want.
You then want indexes used on the big table. First the indexes need to match the join condition, then to match the where condition, and finally the group by condition. So, the suggestion is:
mytable(pibn, page, word2, word1)
I'm throwing in word1 (the group by column) so the query doesn't have to fetch the value from the original data.
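In DDL terms (a sketch; the index name is mine):

ALTER TABLE mytable ADD INDEX pibn_page_w2_w1 (pibn, page, word2, word1);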
The query is taking a long time, but the expensive part seems to be accessing mytable (you've not provided the structure of this); however, the optimizer seems to think it only needs to fetch 4 rows from it using an index, which should be very fast. That is, the data appears to be very skewed: how many rows does the last query examine (tally of counts)?
Without having a look at the exact distribution of data, it's hard to be definitive; certainly you may need to hint the query to get it to work efficiently. The problem with designing indexes is that they should make all the queries faster, or at least give a reasonable tradeoff.
Looking at the predicates in the queries you've provided...
WHERE word1=429907 AND word2=0
Would be best served by an index on word1,word2,.... or word2,word1,.....
ON a.pibn=b.pibn AND a.page=b.page
WHERE a.word2=0
Would be best served by an index on mytable with word2+pibn+page in the leading columns.
How many distinct values are there for mytable.word1 and for mytable.word2? If word2 has a low number of distinct values (less than 20 or so) then it's not adding much selectivity to the index and can be omitted.
An index on word2,pibn,page,word1 gives you a covering index for the second query.
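That is (a sketch; the index name is mine):

ALTER TABLE mytable ADD INDEX w2_pibn_page_w1 (word2, pibn, page, word1);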
If your temptbl is small, you want to first restrict the bigger table (mytable) and then join it (possibly by index) to your temptbl.
Right now, MySQL thinks it is better off by using the index of the bigger table to join.
You can get around this by doing a straight join:
SELECT word1,COUNT(*) AS aaa
FROM mytable a
STRAIGHT_JOIN temptbl b
ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1
ORDER BY aaa DESC LIMIT 10;
This should use your index in mytable for the where clause and join mytable to temptbl via the index in temptbl.
If MySQL still wants to do it different, you can use FORCE INDEX to make it use the index.
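For example (a sketch, reusing the w2pibnpage1 index named in the question):

SELECT word1,COUNT(*) AS aaa
FROM mytable a FORCE INDEX (w2pibnpage1)
STRAIGHT_JOIN temptbl b
  ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1
ORDER BY aaa DESC LIMIT 10;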
With your data volumes it is not going to work fast no matter what you do, not without changing the schema.
If I understand you right, you're looking for the top words which go along with 429907 on the same pages.
Your model as it is now would require counting all those words over and over again each time you run the query.
To speed it up, you would need to create an additional stats table:
CREATE TABLE word_pairs
(
  word1_1 INT NOT NULL,
  word1_2 INT NOT NULL,
  cnt BIGINT NOT NULL,
  PRIMARY KEY (word1_1, word1_2),
  INDEX (word1_1, cnt),
  INDEX (word1_2, cnt)
)
and update it each time you insert a record into the large table (increase the cnt for the newly inserted word paired with every word that appears on the same page).
This would probably be too slow for a single server, as such updates take some time, so you might also need to shard that table across multiple servers.
If you had such a table you could just run:
SELECT *
FROM word_pairs
WHERE word1_1 = 429907
ORDER BY
cnt DESC
LIMIT 10
which would be instant.
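A sketch of that per-insert maintenance step (my own illustration; @new_word, @pibn and @page stand for the row just inserted, and each pair is stored with the smaller word id first to match the primary key):

INSERT INTO word_pairs (word1_1, word1_2, cnt)
SELECT LEAST(m.word1, @new_word), GREATEST(m.word1, @new_word), 1
FROM mytable m
WHERE m.pibn = @pibn
  AND m.page = @page
  AND m.word1 <> @new_word
ON DUPLICATE KEY UPDATE cnt = cnt + 1;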
I came up with this:
CREATE TEMPORARY TABLE temp1 (
  pibn INT UNSIGNED NOT NULL,
  page SMALLINT UNSIGNED NOT NULL
) ENGINE=MEMORY;
INSERT INTO temp1 (
  SELECT pibn,page FROM mytable
  WHERE word1=429907 AND word2=0);
CREATE TEMPORARY TABLE temp2 (
  word1 MEDIUMINT UNSIGNED NOT NULL
) ENGINE=MEMORY;
INSERT INTO temp2 (
  SELECT a.word1
  FROM mytable a, temp1 b
  WHERE a.word2=0 AND a.pibn=b.pibn AND a.page=b.page);
DROP TABLE temp1;
CREATE INDEX index1 ON temp2 (word1) USING BTREE;
CREATE TEMPORARY TABLE temp3 (
  word1 MEDIUMINT UNSIGNED NOT NULL,
  num INT UNSIGNED NOT NULL
) ENGINE=MEMORY;
INSERT INTO temp3 (SELECT word1,COUNT(*) AS aaa FROM temp2 USE INDEX (index1) GROUP BY word1);
DROP TABLE temp2;
CREATE INDEX index1 ON temp3 (num) USING BTREE;
SELECT word1,num FROM temp3 USE INDEX (index1) ORDER BY num DESC LIMIT 10;
DROP TABLE temp3;
Takes 5 seconds.