MySQL index performance on small "fast-moving" tables

We've got a table we use as a queue. Entries are constantly being added, updated, and then deleted. Though we might be adding 3 entries/sec, the table never grows to more than a few hundred rows.
To get entries out of the table we are doing a simple select:
SELECT * FROM queue_table WHERE some_id = ?
We are debating adding an index on some_id. I think the table's small size and the speed at which we add and remove rows argue against one, but conventional wisdom seems to say we should have an index.
Any thoughts?

If you are using InnoDB (which you should do having a table of this kind) and the table is accessed concurrently, then you should definitely create the index.
When performing DML operations, InnoDB locks all rows it scans, not only those that match the WHERE clause conditions.
This means that without an index, a query like this:
DELETE
FROM mytable
WHERE some_id = ?
will have to do a full table scan and lock all rows.
This kills all concurrency (even if the threads access different some_id's they'll still have to wait for each other), and may even result in deadlocks.
At 3 transactions per second, the cost of maintaining the index will be negligible, so just create it.
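For reference, a minimal sketch of adding that index, using the table and column names from the question:

ALTER TABLE queue_table ADD INDEX idx_some_id (some_id);  -- lets InnoDB scan and lock only the matching rows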

To be sure, you would need to benchmark both approaches.
Generally, though, if the access pattern is 50% reads and 50% writes, the penalty of updating an index may well not be worth it. But as the number of rows grows, the balance tips: the read savings from an index start to outweigh the write penalty, and an index should be used.

The only way to know for sure would be to benchmark in actual/real conditions; for example, measure the time each query takes, and:
for one day, collect that information each time the query is run -- without the index
and for another day, do exactly the same -- with the index.
For a table with a few hundred rows doing both lots of inserts/deletes and selects/updates, the difference should not be that big, so I think you can test in your production environment (and in real conditions) without much danger.
Yes, I know, testing in production is bad; but in this case it's the best way to know for sure: those conditions are probably too hard to replicate in a testing environment...
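One way to collect those timings is MySQL's session profiler (a sketch; the query is the one from the question):

SET profiling = 1;
SELECT * FROM queue_table WHERE some_id = 42;  -- run the real workload query
SHOW PROFILES;                                 -- lists each profiled query with its duration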

Related

Query performance increase from deleting rows in SQL database?

I have a database with a single table that keeps track of user state. When I'm done handling a row, it's no longer necessary to keep it in the database and can be deleted.
Now let's say I wanted to keep the row instead of deleting it (for historical purposes, analytics, etc.). Would it be better to:
Leave the data in the same table and mark the row as 'used' (with an extra column or something like that)
Delete the row from the table and insert it into a separate table that is created only for historical purposes
For choice #1, I wonder if leaving the unnecessary rows in the database will start to affect query performance. (All of my queries are on indexed columns, so maybe this doesn't matter?)
For choice #2, I wonder if the constant deleting of rows will end up causing problems such as fragmentation?
Query performance will be better in the long run if you delete:
What is happening with forever inserts:
The table grows, indexes grow, index lookup performance decreases with the size of the table, and insert performance in particular is hurt.
What is happening with delete:
Table pages get fragmented, so the deleted space is not re-used 100% as you might expect; in MySQL it is more like 50%. So the table still grows to about twice the size you might expect for your amount of data. The index gets fragmented and becomes lopsided: it contains your new data but also the structure for your old data. How bad this gets depends on the structure of your data. This situation, however, stabilizes at a certain performance level. That stable point has 2 benefits:
1) The table is more limited in size, so potential full table scans are faster
2) Your performance is predictable.
Due to the fragmentation, however, this stable point does not correspond to exactly twice your data amount; it tends to be a bit worse (benchmark it to see for yourself). The benefit of the delete scenario, however, is that since you have a smaller data set, you may be able to rebuild your index every reasonable period, thus improving your performance.
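A sketch of such a periodic rebuild (the table name is hypothetical):

OPTIMIZE TABLE user_state;  -- rebuilds the table and its indexes, reclaiming fragmented space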
Alternatives
There are two alternatives you can look at to improve performance:
Switch to MariaDB: This gains about 8% performance on large datasets (my observation, dataset just about 200GB compressed data)
Look at partitioning: if you have a handy partitioning parameter, you can create a series of "small tables" and avoid the need for delete, rebuild, and historic data management logic. This might give you the best performance profile.
If most of the table is flagged as deleted, you will be stumbling over them as you look for the non-deleted records. Adding is_deleted to many of the indexes is likely to help.
If you are deleting records purely on age, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent way to build the table. The DROP PARTITION is instantaneous, and the ALTER TABLE ... REORGANIZE ... to create a new week (or month or ...) partition is also instantaneous. See my blog for details.
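A minimal sketch of such a table, with hypothetical names and a weekly scheme:

CREATE TABLE user_state_history (
    id      BIGINT NOT NULL,
    state   VARCHAR(32),
    created DATETIME NOT NULL,
    PRIMARY KEY (id, created)  -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p_w01    VALUES LESS THAN (TO_DAYS('2023-01-08')),
    PARTITION p_w02    VALUES LESS THAN (TO_DAYS('2023-01-15')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Dropping the oldest week is instantaneous:
ALTER TABLE user_state_history DROP PARTITION p_w01;

-- Opening a new week by splitting the (empty) catch-all partition is also instantaneous:
ALTER TABLE user_state_history REORGANIZE PARTITION p_future INTO (
    PARTITION p_w03    VALUES LESS THAN (TO_DAYS('2023-01-22')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);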
If you "move" records to another table, then the table will not shrink very fast due to fragmentation. If you have enough disk space, this is not a bug deal. If some queries need to see both current and archived records, use UNION ALL; it is pretty easy and efficient.

MySQL query slow because of separate indexes?

Here is my situation. I have a MySQL MyISAM table containing about 4 million records with a total of 13.3 GB of data. The table contains messages received from an external system. Two of the columns in the table keep track of a timestamp and a boolean indicating whether the message has been handled or not.
When using this query:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message
The result shows up almost instantly.
However, I need to find the earliest timestamp of unhandled messages, like this:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message WHERE handled = 0
The results of this query show up after about 3 minutes, which is way too slow for the script I'm writing.
Both columns are individually indexed, not together. However, adding an index to the table would take incredibly long considering the amount of data that is already in there.
Does my problem originate from the fact that both columns are separately indexed, and if so, does anyone have a solution to my issue other than adding another index?
It is commonly recommended that if an index would match more than about 20% of the rows, a full table scan is preferable to index access. This means it is likely that your index on handled won't actually be used: given its selectivity, the optimizer will choose a full table scan instead.
A composite index on (handled, timestampCB) may actually improve performance. Because it is a composite index, MySQL would most likely still use it even if the selectivity isn't great, and even if it didn't, you could force its use.
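A sketch of that composite index, using the names from the question:

ALTER TABLE webshop_cb_onx_message ADD INDEX idx_handled_ts (handled, timestampCB);

With this index in place, MIN(timestampCB) ... WHERE handled = 0 can be answered from the index alone: MySQL jumps to the first idx_handled_ts entry with handled = 0 and reads the timestamp there, without touching the table data.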

Does this field need an index?

I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
Then you cannot put an index on count, you need to put an index on the group by field instead.
If you have a field named count, then putting an index on it may speed up this query, or it may make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the WHERE clause. If it's more than about 30%, MySQL (or any SQL database, for that matter) will refuse to use the index.
It will just stubbornly insist on doing a full table scan (i.e. reading all rows).
This is because using an index requires reading 2 files (1 index file and then the real table file with the actual data).
If you select a large percentage of rows, reading the extra index file is not worth it, and just reading all the rows in order will be faster.
If only a few rows pass the test, using an index will speed up this query a lot.
Know your data
Using EXPLAIN SELECT will tell you what indexes MySQL has available, which one it picked, and (kind of/sort of, in a complicated way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
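For example, against the hypothetical table1 from above:

EXPLAIN SELECT id, `count` FROM table1 WHERE `count` > 10;

The key column in the output shows which index (if any) was chosen, and rows estimates how many rows MySQL expects to read.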
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the table is MyISAM and you have a lot of people posting in a short amount of time, you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to setup a simulation that resembles production and run before and after.
I think it's unlikely that the write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough: if, for example, more than 10% of your users have count > 10, the fastest query plan might be to skip the index and just scan the entire table.

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status = enabled and 150 Million records have status = disabled
My understanding is that without an index on status, a select with where status = 'enabled' would result in a full table scan, with 300 Million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look up records by the where status = 'enabled' clause in the given example, with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. A table can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, e.g. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: you have a where clause of where status = 'enabled', so the index would return 150M rows, and the database would have to read each of them in turn using separate small reads, whereas accessing the table with a full table scan lets the database use more efficient, larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With MySQL you can use FORCE INDEX (idx_name) as part of your query to compare the two table access methods.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
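A sketch of such a comparison, with hypothetical table and index names:

-- Run both variants and compare the timings:
SELECT COUNT(*) FROM mytable FORCE INDEX (idx_status) WHERE status = 'enabled';
SELECT COUNT(*) FROM mytable IGNORE INDEX (idx_status) WHERE status = 'enabled';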
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full record scans MySQL has to do, thereby limiting IO, which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as it now needs to load the index file (a full index for 300M records is probably not in memory). So it might well be that you end up with extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO; also, comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used with the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should EXPLAIN a query on a database approximately the same (in size and data) as the real one. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, then a query like where status = 'enabled' only avoids scanning half of the table.
Whether an index on such a column is worth having depends completely on the distribution of the data; i.e., if 90% of the entries have status enabled and the other 10% disabled, then for a query with where status = 'disabled' the index means scanning only 10% of the table.
So having an index on such columns depends on the distribution of the data.
#a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run against the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly ever need all 150 million records at once, so I guess status will always be used in conjunction with other columns. Perhaps it would make more sense to use a compound index like (status, fullname).
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply where status = 'enabled' without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor such as LIMIT 10, an index may help. Also, remember that indexes are used in GROUP BY and ORDER BY optimizations too. If you are doing SELECT COUNT(*), status FROM table GROUP BY status, an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
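A sketch of that conversion (table and column names are hypothetical; test on a copy first, since each ALTER on 300M rows rewrites the table):

ALTER TABLE mytable ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 0;
UPDATE mytable SET status_flag = (status = 'enabled');  -- evaluates to 1 if enabled, 0 if disabled
ALTER TABLE mytable DROP COLUMN status;
ALTER TABLE mytable CHANGE status_flag status TINYINT NOT NULL;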
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly, I deleted the index. I say foolishly because I now suspect the queries (where column = 0) may have still benefited from it. So, instead, I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.
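Index hints would make that per-value choice explicit; a sketch with hypothetical table, column, and index names:

SELECT * FROM mytable IGNORE INDEX (idx_col) WHERE col = 1;  -- matches ~90% of rows: a scan is faster
SELECT * FROM mytable FORCE INDEX (idx_col) WHERE col = 0;   -- matches ~10% of rows: the index pays off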

MySQL indexes - how many are enough?

I'm trying to fine-tune my MySQL server, so I check my settings, analyze the slow-query log, and simplify my queries where possible.
Sometimes indexing correctly is enough, sometimes not. I've read somewhere (please correct me if this is stupidity) that having more indexes than I need can have the same effect as having no indexes at all.
How many indexes are enough? You can say it depends on hundreds of factors, but I'm curious how I can clean up my mysql-slow.log enough to reduce server load.
Furthermore, I saw some "interesting" log entries like this:
# Query_time: 0 Lock_time: 0 Rows_sent: 22 Rows_examined: 44
SELECT * FROM `categories` ORDER BY `orderid` ASC;
The table in question contains exactly 22 rows, with an index on orderid. Why is this query showing up in the log at all? And why examine 44 rows if the table only contains 22?
The right amount of indexing, and where the line of too much lies, will depend on a lot of factors. On small tables like your categories table you usually don't want or need an index, and it can actually hurt performance. The reason is that it takes I/O (i.e. time) to read the index and then more I/O and time to retrieve the records associated with the matched rows. An exception is when you only query the columns contained within the index.
In your example you are retrieving all the columns, and with only 22 rows it may be faster to just do a table scan and sort the rows rather than use the index. The optimizer may/should be doing this and ignoring the index. If that is the case, then the index is just taking up space with no benefit. If your categories table is accessed often, you may want to consider pinning it into memory so the db server keeps it accessible without having to go to the disk all the time.
When adding indexes you need to balance out disk space, query performance, and the performance of updating and inserting into the tables. You can get away with more indexes on tables that are static and don't change much, as opposed to tables with millions of updates a day. You'll start feeling the effects of index maintenance at that point. What is acceptable in your environment, though, can only be determined by you and your organization.
When doing your analysis, be sure to generate/update your table and index statistics so that you can be assured of accurate calculations.
As a general rule, you should have indexes on all primary keys (you don't have a choice in that), all foreign keys, and any other fields you commonly use to fetch rows.
For example, if I commonly look up users by username, I would have that indexed, even if user ID was the primary key.
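A sketch with a hypothetical schema:

CREATE INDEX idx_username ON users (username);
SELECT * FROM users WHERE username = 'alice';  -- now resolved via idx_username instead of a scan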
How many indexes you need depends entirely on the queries you're running, what kinds of joins are being done (if any), the kind of data stored in the table, and how big the tables are (as well as many other factors). There's really no exact science to it. The greatest tool in your arsenal for figuring out how to optimize a query is EXPLAIN. Using EXPLAIN you can find out what kind of joins are being done, what possible keys could be used and which key (if any) was used, as well as how many rows were examined for each table in the join.
Using this information you can decide how to key your tables and/or modify your queries to make them more efficient. The syntax for explain is very simple.
EXPLAIN SELECT * FROM `categories` ORDER BY `orderid` ASC;
Note, explain does not actually run the query. So if you're using this to debug a query that takes 5 minutes to run, explain will still be very fast.
You do need to be careful when adding indexes, though, as they do cause inserts and updates to go slower, and on very large tables this performance hit can become noticeable, especially if that same table is used for a lot of reads. While adding a lot of indexes generally won't kill the performance of a query, you should still only add them as you need them.
Also keep in mind that MySQL will use at most one index per table in a SELECT statement (so if you are using a join, it can use one index for each joined table). Indexing "just because" is a waste of disk space and will slow the database down on writes. If you commonly use a WHERE clause on two columns, create one index containing both of those columns; it will be significantly faster than indexing just one alone.
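A sketch of such a two-column index, with hypothetical names:

-- One composite index serves WHERE col_a = ? AND col_b = ?
-- better than two separate single-column indexes would.
ALTER TABLE mytable ADD INDEX idx_a_b (col_a, col_b);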
An index can speed up a SELECT query, but it will slow down INSERT/UPDATE/DELETE queries because they need to update the index as well, not just the row.
This is just personal opinion (I've got no facts to back it up), but I think that if there is a query that is taking a long time and an index would speed it up - go for it! "Too many" indexes would be if you added indexes that didn't do any good (e.g. there were no queries it would speed up). For example, a silly thing to do would be to place an index on every column "just because".
There's no magic number for the "best" number of indexes. The basic rule is this: add indexes for queries that are used often and/or need to run quickly.
Having "too many" indexes shouldn't slow down queries, but it each index added adds a small amount of time to add/update items in the db (since it modifies the indices as well), and a small amount of space. However, if you're just adding indexes as required, this is probably not a big concern.