Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, if I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyIASM is not an option because I need the features of InnoDB to maintain data integrity.
SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if
the index available is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable length indices (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indices can be counted quickly. Which makes sense.
MySQL really should perform this optimization automatically.
Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NON NULL then a COUNT() will be a million quite easily. If NULL values are allowed, those aren't indexed so the number of records is easily obtained by looking at the index size.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.
I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table. This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most case that's not going to be more than 10-20, and it's not much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.
Related
I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900ms. explain shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400ms. explain shows that the index, in this case, is used.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and use an RTREE index, and then I use MBRContains to count matching rows (and on a secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means that the data is sorted by the PK and where pk BETWEEN x AND y must read all the rows from x through y.
So, how does it do a scan by PK? It must read the data blocks. They are bulky in that they have other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, the artificial addition of a secondary index on the narrowest column is likely to speedup SELECT COUNT(*) FROM tbl.
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
SPATIAL and FULLTEXT indexing complicated this discussion. Especially if you have 2 parts to the WHERE, one with Spatial or Fulltext, one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.
I want a query that does a fulltext search on one field and then a sort on a different field (imagine searching some text document and order by publication date). The table has about 17M rows and they are more or less uniformly distributed in dates. This is to be used in a webapp request/response cycle, so the query has to finish in at most 200ms.
Schematically:
SELECT * FROM table WHERE MATCH(text) AGAINST('query') ORDER BY date=my_date DESC LIMIT 10;
One possibility is having a fulltext index on the text field and a btree on the publication date:
ALTER TABLE table ADD FULLTEXT index_name(text);
CREATE INDEX index_name ON table (date);
This doesn't work very well in my case. What happens is that MySQL evaluates two execution paths. One is using the fulltext index to find the relevant rows, and once they are selected use a FILESORT to sort those rows. The second is using the BTREE index to sort the entire table and then look for matches using a FULL TABLE SCAN. They're both bad. In my case MySQL chooses the former. The problem is that the first step can select some 30k results which it then has to sort, which means the entire query might take of the order 10 seconds.
So I was thinking: do composite indexes of FULLTEXT+BTREE exist? If you know how a FULLTEXT index works, it first tokenizes the column you're indexing and then builds an index for the tokens. It seems reasonable to me to imagine a composite index such that the second index is a BTREE in dates for each token. Does this exist in MySQL and if so what's the syntax?
BONUS QUESTION: If it doesn't exist in MySQL, would PostgreSQL perform better in this situation?
Use IN BOOLEAN MODE.
The date index is not useful. There is no way to combine the two indexes.
Beware, if a user searches for something that shows up in 30K rows, the query will be slow. There is no straightforward away around it.
I suspect you have a TEXT column in the table? If so, there is hope. Instead of blindly doing SELECT *, let's first find the ids and get the LIMIT applied, then do the *.
SELECT a.*
FROM tbl AS a
JOIN ( SELECT date, id
FROM tbl
WHERE MATCH(...) AGAINST (...)
ORDER BY date DESC
LIMIT 10 ) AS x
USING(date, id)
ORDER BY date DESC;
Together with
PRIMARY KEY(date, id),
INDEX(id),
FULLTEXT(...)
This formulation and indexing should work like this:
Use FULLTEXT to find 30K rows, deliver the PK.
With the PK, sort 30K rows by date.
Pick the last 10, delivering date, id
Reach back into the table 10 times using the PK.
Sort again. (Yeah, this is necessary.)
More (Responding to a plethora of Comments):
The goal behind my reformulation is to avoid fetching all columns of 30K rows. Instead, it fetches only the PRIMARY KEY, then whittles that down to 10, then fetches * only 10 rows. Much less stuff shoveled around.
Concerning COUNT on an InnoDB table:
INDEX(col) makes it so that an index scan works for SELECT COUNT(*) or SELECT COUNT(col) without a WHERE.
Without INDEX(col),SELECT COUNT(*)will use the "smallest" index; butSELECT COUNT(col)` will need a table scan.
A table scan is usually slower than an index scan.
Be careful of timing -- It is significantly affected by whether the index and/or table is already cached in RAM.
Another thing about FULLTEXT is the + in front of words -- to say that each word must exist, else there is no match. This may cut down on the 30K.
The FULLTEXT index will deliver the date, id is random order, not PK order. Anyway, it is 'wrong' to assume any ordering, hence it is 'right' to add ORDER BY, then let the Optimizer toss it if it knows that it is redundant. And sometimes the Optimizer can take advantage of the ORDER BY (not in your case).
Removing just the ORDER BY, in many cases, makes a query run much faster. This is because it avoids fetching, say, 30K rows and sorting them. Instead it simply delivers "any" 10 rows.
(I have not experience with Postgres, so I cannot address that question.)
I'm using a MySQL database and have to perform some select queries on large/huge tables (e.g. 267,736 rows and 30 columns).
Query details:
Only select queries (the data in the table is fixed, never an update, insert or delete)
Select query on all the columns (business requirement)
Mostly limit the number of rows (LIMIT 10 to all rows -> user can choose)
Could be ordered by one or multiple columns (creation of indexes here will not help since the user can order by any column he likes)
Could be filtered by a value the user chooses (where filter on one or more columns)
Currently the queries take up to 2 seconds, which is to long.
Is there a way to speed them up?
Which storage engine should I use: InnoDB/MyISAM/...
Should I have a primary key, even if I will never use him?
...?
You should (must actually) use indexes.
Create indexes on all columns with which WHERE or ORDER BY is going to be used. Also study and use EXPLAIN to see the impact of the indexes and to optimize your queries.
You don't have to create a primary key if there is no column with unique data in your table, but it is very likely that you do have such a column (id, time...). In this case you should use primary key to filter your queries.
Number of columns in the query has close to no impact on SELECT speed.
As long as you make "Only select queries" storage engine does not matter either. MyISAM might be a bit faster, but InnoDB has many features you will need when you decide that your "Only select queries" rule must be broken.
I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep an list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're real tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.
I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), the need to index on timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
...and If I'm not mistaken, the advantage of this would be to avoid the overhead of creating the timestamp index on each insertion -- in a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with indexes on timestamp will have high overhead **
Despite our knowledge that our insertions will respect the fact of have incrementing timestamps, I don't see any discussion of MySQL optimisation based on incremental keys.
Insertion with indexes on userId will reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead or insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
, ts INT NOT NULL
, oldPK
, ... other columns
, PRIMARY KEY(ts, oldPK)
, UNIQUE (oldPK)
) ENGINE=InnoDB ;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
Disadvantage is that your Inserts will be a bit slower. Also, If you have other indices on the table, they will be using a bit more space (as they will include the 4-bytes wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.
If I SELECT IDs then UPDATE using those IDs, then the UPDATE query is faster than if I would UPDATE using the conditions in the SELECT.
To illustrate:
SELECT id FROM table WHERE a IS NULL LIMIT 10; -- 0.00 sec
UPDATE table SET field = value WHERE id IN (...); -- 0.01 sec
The above is about 100 times faster than an UPDATE with the same conditions:
UPDATE table SET field = value WHERE a IS NULL LIMIT 10; -- 0.91 sec
Why?
Note: the a column is indexed.
Most likely the second UPDATE statement locks much more rows, while the first one uses unique key and locks only the rows it's going to update.
The two queries are not identical. You only know that the IDs are unique in the table.
UPDATE ... LIMIT 10 will update at most 10 records.
UPDATE ... WHERE id IN (SELECT ... LIMIT 10) may update more than 10 records if there are duplicate ids.
I don't think there can be a one straight-forward answer to your "why?" without doing some sort of analysis and research.
The SELECT queries are normally cached, which means that if you run the same SELECT query multiple times, the execution time of the first query is normally greater than the following queries. Please note that this behavior can only be experienced where the SELECT is heavy and not in scenarios where even the first SELECT is much faster. So, in your example it might be that the SELECT took 0.00s because of the caching. The UPDATE queries are using different WHERE clauses and hence it is likely that their execution times are different.
Though the column a is indexed, but it is not necessary that MySQL must be using the index when doing the SELECT or the UPDATE. Please study the EXPLAIN outputs. Also, see the output of SHOW INDEX and check if the "Comment" column reads "disabled" for any indexes? You may read more here - http://dev.mysql.com/doc/refman/5.0/en/show-index.html and http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html.
Also, if we ignore the SELECT for a while and focus only on the UPDATE queries, it is obvious that they aren't both using the same WHERE condition - the first one runs on id column and the latter on a. Though both columns are indexed but it does not necessarily mean that all the table indexes perform alike. It is possible that some index is more efficient than the other depending on the size of the index or the datatype of the indexed column or if it is a single- or multiple-column index. There sure might be other reasons but I ain't an expert on it.
Also, I think that the second UPDATE is doing more work in the sense that it might be putting more row-level locks compared to the first UPDATE. It is true that both UPDATES are finally updating the same number of rows. But where in the first update, it is 10 rows that are locked, I think in the second UPDATE, all rows with a as NULL (which is more than 10) are locked before doing the UPDATE. Perhaps MySQL first applies the locking and then runs the LIMIT clause to update only limited records.
Hope the above explanation makes sense!
Do you have a composite index or separate indexes?
If it is a composite index of id and a columns,
In 2nd update statement the a column's index would not be used. The reason is that only the left most prefix indexes are used (unless if a is the PRIMARY KEY)
So if you want the a column's index to be used, you need in include id in your WHERE clause as well, with id first then a.
Also it depends on what storage engine you are using since MySQL does indexes at the engine level, not server.
You can try this:
UPDATE table SET field = value WHERE id IN (...) AND a IS NULL LIMIT 10;
By doing this id is in the left most index followed by a
Also from your comments, the lookups are much faster because if you are using InnoDB, updating columns would mean that the InnoDB storage engine would have to move indexes to a different page node, or have to split a page if the page is already full, since InnoDB stores indexes in sequential order. This process is VERY slow and expensive, and gets even slower if your indexes are fragmented, or if your table is very big
The comment by Michael J.V is the best description. This answer assumes a is a column that is not indexed and 'id' is.
The WHERE clause in the first UPDATE command is working off the primary key of the table, id
The WHERE clause in the second UPDATE command is working off a non-indexed column. This makes the finding of the columns to be updated significantly slower.
Never underestimate the power of indexes. A table will perform better if the indexes are used correctly than a table a tenth the size with no indexing.
Regarding "MySQL doesn't support updating the same table you're selecting from"
UPDATE table SET field = value
WHERE id IN (SELECT id FROM table WHERE a IS NULL LIMIT 10);
Just do this:
UPDATE table SET field = value
WHERE id IN (select id from (SELECT id FROM table WHERE a IS NULL LIMIT 10));
The accepted answer seems right but is incomplete, there are major differences.
As much as I understand, and I'm not a SQL expert:
The first query you SELECT N rows and UPDATE them using the primary key.
That's very fast as you have a direct access to all rows based on the fastest possible index.
The second query you UPDATE N rows using LIMIT
That will lock all rows and release again after the update is finished.
The big difference is that you have a RACE CONDITION in case 1) and an atomic UPDATE in case 2)
If you have two or more simultanous calls of the case 1) query you'll have the situation that you select the SAME id's from the table.
Both calls will update the same IDs simultanously, overwriting each other.
This is called "race condition".
The second case is avoiding that issue, mysql will lock all rows during the update.
If a second session is doing the same command it will have a wait time until the rows are unlocked.
So no race condition is possible at the expense of lost time.