Should I reset a table index / optimize after deletion of many rows? - mysql

I have a table with 1,000,000 records and I'm running a statement that deletes ~700k rows. The auto-increment counter is of course still at 1,000,001, while the highest remaining primary key is, for example, 40,000.
After such a huge deletion of rows, should I manually set the auto-increment counter back to 40,001 or optimize the table in some way? Or does MySQL not care about this huge gap when inserting new rows and using the index in SELECT statements afterwards (in terms of speed)?

The MySQL manual says:
OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows.
But don't reset the auto-increment value; doing so can mess things up. The INT datatype (presumably) still has plenty of room to grow beyond 1M.
In terms of query speed, it doesn't matter whether the index value is at 1 000 000 or at 1 000 000 000.

Using OPTIMIZE TABLE afterwards helps MySQL decide how best/fastest to perform queries, e.g. whether to use an index or just a table scan. It uses what are called statistics to do so, and calling OPTIMIZE TABLE is basically giving it a hint that you've made a significant change.
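As a rough sketch of what that looks like in practice (the table name mytable is just a placeholder):
-- Rebuild the table and refresh index statistics after the mass delete
-- (for InnoDB, OPTIMIZE TABLE maps to a table rebuild plus an analyze step)
OPTIMIZE TABLE mytable;
-- If you are curious about the current auto-increment counter you can inspect it,
-- but there is normally no need to reset it:
SHOW TABLE STATUS LIKE 'mytable';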


Duplicate table fields vs indexing only

I have a huge and very busy table (a few thousand INSERTs per second). The table stores login logs; it has a BIGINT ID which is not generated by MySQL but rather by a pseudorandom generator on the MySQL client.
Simply put, the table has loginlog_id, client_id, and tons of other columns with details about the session...
I have a few indexes on this table, such as PRIMARY KEY(loginlog_id) and INDEX(client_id).
In some other part of our system I need to fetch client_id based on loginlog_id. This does not happen that often (just a few hundred SELECT client_id FROM loginlogs WHERE loginlog_id=XXXXXX per second). The loginlogs table is read by various other scripts now and then, and various columns are needed each time. But the most frequent read is definitely the above-mentioned "get client_id by loginlog_id".
My question is: should I create another table loginlogs_clientids and duplicate loginlog_id, client_id in there (which means a few thousand more INSERTs, as for every loginlogs INSERT I get this new one)? Or should I trust InnoDB to handle my lookups by PRIMARY KEY efficiently?
We have tons of RAM (128 GB, most of which is used by MySQL). MySQL's load is between 40% and 350% CPU (we have a 12-core CPU). When I tried using the new table, I did not see any difference. But I am asking for the future: if our usage grows even more, what is the suggested approach? Duplicate or index?
Thanks!
No.
Looking up table data for a single row using the primary key is extremely efficient, and will take the same time for both tables.
Exceptions to that might be very large row sizes (e.g. 8 KB+) where client_id is, for example, a VARCHAR that is stored off-page; in that case you might need to read an additional data block, which at least theoretically could cost you some milliseconds.
Even if this strategy had an advantage, you would not actually implement it by creating a new table, but by adding an index on (loginlog_id, client_id) to your original table. InnoDB stores everything, including the actual data, in an index structure, so adding an index is basically the same as adding a new table with the same columns, but without you having to synchronize those two "tables" yourself.
Having a structure with a smaller row size can have some advantages for range scans, e.g. MySQL will evaluate select count(*) from tablename using the smallest index of the table, as it has to read fewer bytes. You already have such a small index (on client_id), so even in that regard, adding such an additional table/index shouldn't have an effect. If you have any range scan on the primary key (which is probably unlikely for pseudorandom data), you may want to consider this though, or keep it in mind for when you do.
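If that strategy ever did pay off, a minimal sketch of it (reusing the question's table and column names, with a made-up index name) would be an extra index rather than an extra table:
-- Covering secondary index instead of a duplicate table
ALTER TABLE loginlogs ADD INDEX idx_loginlog_client (loginlog_id, client_id);
-- The frequent point lookup can then be answered from that index alone:
SELECT client_id FROM loginlogs WHERE loginlog_id = 123456;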

Table Insert Rate Slows As Table Size Increases

I am parsing some data and inserting it into 3 tables from .NET, using Table Valued Parameters to pass the data, as some inserts are 600,000 rows. I'm passing objects (not DataTables) and they are passed by reference (the nature of TVPs). I got a 100:1 gain over a straight VALUES insert, as an INSERT with literal values is limited to 1000 rows at a time.
In the stored procedure, the insert into the actual table from the TVP is sorted by the clustered index. These tables have no index other than the clustered index. The SP takes a TABLOCK, as these are write-once tables with one data loader. Fill factor is 100. There is no increase in data or transaction log size; both are sized for the total data load.
Finally, the question: in the last 4 hours I have inserted 200 million rows, and insert performance has dropped by half. If the fill factor is 100 and I am inserting sorted by the clustered index, then why the drop in performance? What can I do to fix this?
I did not get TVP until I used it - it is like a reverse DataReader.
I would like to thank you for your help and apologize for an incorrect problem statement. For each parse (in this case I am parsing 200,000) the insert is sorted by the clustered index. However, for only 1 of the 3 tables is the next parse, as a whole, in clustered index order. After parsing 70,000, the good table has a scan density of 99%, but the other two tables are severely fragmented, with a scan density of 12%.
I set a fill factor of 50 on the two fragmented tables and re-indexed. Now I am getting about 1/3 of the maximum speed. I will just need to stop the process and re-index every few hours.
What I ended up doing is changing the clustered index to match the insert order, and creating a unique index on what used to be the clustered key. I disable the unique index, insert the data, and then rebuild the unique index. With that scenario I get 300:1 performance on a 10-hour run. That is not an extra 0 - three hundred to one. And that is not fudging - it's compared to starting with the index ordered and a fill factor of 30. Even with the extra index, my table size is smaller, as I can keep both fill factors at 100.
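A rough sketch of that disable / load / rebuild cycle in SQL Server (index and table names are hypothetical):
-- Disable the non-clustered unique index so inserts don't have to maintain it
ALTER INDEX ux_old_clustered_key ON dbo.LoadTable DISABLE;
-- ... bulk insert the parsed rows here, sorted by the new clustered index ...
-- Rebuild the unique index once the load is done
ALTER INDEX ux_old_clustered_key ON dbo.LoadTable REBUILD;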
I use #temp tables in some queries so I can get the rows in an order only known to the query. I converted the #temp to a TVP and gained 1/2 second (about the time it takes to create and drop a #temp table).
Per OP converting comment to answer...
In addition to auto-stats, as #SqlACID mentions, constraint checking could get more expensive as the table fills up. If I'm going to seriously load a table and speed is my ultimate goal, I usually plan to disable or drop indexes and constraints and re-create them afterwards. This may mean deleting rows after the fact if a constraint is violated, or doing better validation on the bulk data up front when possible.
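For the constraint part specifically, a sketch (SQL Server syntax, hypothetical table name) of switching checking off for the load and re-validating afterwards:
-- Stop checking CHECK and FOREIGN KEY constraints during the bulk load
ALTER TABLE dbo.LoadTable NOCHECK CONSTRAINT ALL;
-- ... bulk load ...
-- Re-enable and re-validate so the constraints become trusted again;
-- this is where violating rows surface and may need to be cleaned up
ALTER TABLE dbo.LoadTable WITH CHECK CHECK CONSTRAINT ALL;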

How can I speed-up the table reconstruction in MySQL when altering the schema?

I have a relatively big MySQL InnoDB table (compressed), and I sometimes need to alter its schema (increasing column size or adding a field).
It takes around 1 hour for a 500 MB table with millions of rows, but the server doesn't seem to be very busy (CPU around 5%, not much RAM used, and about 2.5 MB/s of I/O).
The table is not used in production so there are no concurrent requests at the same time. There is only a primary index (on the first 5 columns) and one foreign key constraint.
Do you have any suggestion on how to speed-up the table alteration process?
Changing the storage engine (to a newer-generation engine like TokuDB) seems to be the way to go, until InnoDB is "fixed".
It would be helpful to know the exact table and primary key/index definitions and, of lesser importance, the row count to the nearest million, although I would guess that as the table is only 500 MB it's probably less than 20 million rows. Also, your approach to changing the table - are you creating a new schema and inserting into it, or using ALTER TABLE, etc.?
I've had success in this area before with approaches like
changing the index key composition, adding a unique key
dropping the indexes first, then changing the table, then adding the indexes back (see the sketch after this list). Sometimes independent indexes can be a real performance killer if they are affected by the change.
optimizing the table structure to remove unneeded or oversized columns
separating out data (normally columns but you can vertically partition in some circumstances) that won't change from the core structure that might change, so you only churn a smaller part of your table
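A minimal sketch of the drop-then-re-add approach from the second item (table, index, and column names are placeholders):
-- Drop secondary indexes so the rebuild only has to maintain the primary key
ALTER TABLE big_table DROP INDEX idx_secondary;
-- Do the schema change (this still rebuilds the table for InnoDB)
ALTER TABLE big_table MODIFY COLUMN note VARCHAR(500);
-- Re-create the index afterwards
ALTER TABLE big_table ADD INDEX idx_secondary (note);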

mySQL (and MSSQL), using both indexed and non-indexed columns in where clause

The database I use is currently mySQL but maybe later MSSQL.
My question is about how MySQL and MSSQL handle indexed and non-indexed columns.
Let's say I have a simple table like this:
*table_ID - auto-increment, just an ID, indexed
*table_user_ID - every user has a unique ID, indexed
*table_someOtherID - some data...
*....
Let's say that I have A LOT!! of rows in this table, but the number of rows that each user adds to this table is very small (10-100).
And I want to find one or a few specific rows in this table - a row or rows from a specific user (indexed column).
If I use the following WHERE clause:
..... WHERE table_user_ID= 'someID' AND table_someOtherID='anotherValue'.
Will the database first narrow down on the indexed column, and then search for "anotherValue" inside those rows, or how does the database handle this?
I guess the database size will increase a lot if I have to index every column in all tables...
But what do you think - is it enough to index only those columns that will narrow the result down to just ten, maybe a hundred, rows?
Database optimizers generally work on a cost basis, looking at all the possible indexes that could be used for the query. In your specific case it will see 2 columns - table_user_ID with an index and someOtherID without an index. If you really only have 10-100 rows per user ID then the cost of using this index will be very low and it will be used, because the cardinality is high and the DB can read only the few rows it needs without touching the rows for every other user it's not interested in. However, if the cost of using the index is very high (very few unique user IDs and many entries per user), it might actually be more efficient to not use the index and scan the whole table, to avoid the random seeking that happens as it jumps around the table grabbing rows based on the index.
Once it picks the index, the DB just grabs the rows that match that index (10 to 100 in your case) and tries to match them against your other criteria, searching for rows where someOtherID='anotherValue'.
But the number of rows that every user add to this table is very small (10-100)
You only need to index the user_id. It should give you good performance regardless of your query, as long as it includes the user_id in the filter. Until you have identified other use cases, it will pretty much work as you state.
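A minimal sketch of that single index, reusing the column names from the question (the table name is a placeholder):
-- One index on the user column is enough for this access pattern
CREATE INDEX idx_user ON mytable (table_user_ID);
-- The optimizer seeks on table_user_ID, then filters the 10-100 matching rows
-- on the non-indexed column:
SELECT * FROM mytable
WHERE table_user_ID = 'someID' AND table_someOtherID = 'anotherValue';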
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
Yes (in layman's terms, that is close).
In regards to SQL Server:
The ordering of the index columns is important, depending on how you query and how the indexes are structured. If you create an index on the columns
-table_user_id
-table_someotherID
The index is ordered by the table_user_id first. Example:
1-2
1-5
1-6
2-3
2-5
2-6
For the first record in the index, 1 is the table_user_id and 2 is the other value.
If you run a query with a WHERE on table_user_id = blah, it will be very fast to use this index, since the table_user_id values are indexed in order.
But if you run a query that only uses table_someotherID in the WHERE clause, it might not even use this index, as instead of doing a quick seek in the index for the matching value, it will do a rough scan of the index (which is less efficient than a seek).
Also, SQL Server has an INCLUDE feature that associates the columns you want in the SELECT clause with the index you create on the WHERE or JOIN columns.
So to answer your question, it all depends on how you create the indexes and how you query them. You're right not to think about indexing every column, as indexes take up storage and incur a performance hit on inserts and updates to the table.
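As an illustration of the composite index and the INCLUDE feature described above (SQL Server syntax; all names are placeholders):
-- Ordered by table_user_id first, then table_someotherID, with an extra column
-- carried along so covered SELECTs can avoid touching the base table
CREATE INDEX ix_user_other
    ON dbo.mytable (table_user_id, table_someotherID)
    INCLUDE (table_anotherColumn);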

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an index really pointless if there are only two distinct values? Given a table as follows (MySQL database, InnoDB):
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 million records have status=enabled and 150 million records have status=disabled
My understanding is that, without an index on status, a SELECT with where status='enabled' would result in a full table scan with 300 million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe other kinds of indexes) does MySQL/InnoDB provide to efficiently look up records by the where status='enabled' clause in the given example, with such low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: with a where clause of where status = 'enabled', the index will return 150M rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient, larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With MySQL you can use FORCE INDEX (idx_name) as part of your query to compare the two table access methods.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
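A rough sketch of such a comparison, assuming a hypothetical index named idx_status:
-- Let the optimizer choose (most likely a full table scan for this predicate)
SELECT COUNT(*) FROM mytable WHERE status = 'enabled';
-- Force the index to compare the two access paths
SELECT COUNT(*) FROM mytable FORCE INDEX (idx_status) WHERE status = 'enabled';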
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full-record searches MySQL has to do, thereby limiting IO, which is usually the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated, and in the search itself, as it now needs to load the index file (the full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a BOOL or TINYINT, as that decreases the row length and can thereby limit disk IO; comparisons on numbers are also faster.
If you need speed and you seldom use the disabled records, you may wish to have two tables, one for enabled and one for disabled records, and move the records when the status changes. As it increases complexity and risk, this would be my very last choice, of course. Definitely do the move in one transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
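A minimal sketch of that check (table and index names are placeholders):
-- The "key" column of the EXPLAIN output shows whether idx_status is used
-- or the query falls back to a full table scan
EXPLAIN SELECT Id, fullname FROM mytable WHERE status = 'enabled';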
If the data is distributed 50:50, then an index only lets a query like where status='enabled' avoid scanning half of the table.
Whether an index on such a column is worthwhile depends completely on the distribution of the data: if 90% of entries have status enabled and the other 10% are disabled, then for a query with where status='disabled' it scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
#a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run on the database.
In the OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resources.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 million records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it would make more sense to use a compound index like (status, fullname).
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor, such as "LIMIT 10", an index may help. Also, remember that indexes are used in GROUP BY and ORDER BY optimizations. If you are doing "select count(*), status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
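One possible sketch of that conversion (column and table names follow the question; on 300M rows each step is a long, table-rebuilding operation):
-- Add the compact column, backfill it, then swap it in for the VARCHAR
ALTER TABLE mytable ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 0;
UPDATE mytable SET status_flag = IF(status = 'enabled', 1, 0);
ALTER TABLE mytable DROP COLUMN status;
ALTER TABLE mytable CHANGE COLUMN status_flag status TINYINT NOT NULL;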
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly, I deleted the index. I say foolishly because I now suspect the queries (where column = 0) may still have benefited from it. So instead, I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.
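A sketch of those per-query hints (assuming a hypothetical index named idx_column; the column name is backtick-quoted because COLUMN is a reserved word):
-- Skip the index for the common value, where a full scan is cheaper
SELECT * FROM mytable IGNORE INDEX (idx_column) WHERE `column` = 1;
-- Use it for the rare value, where an index seek touches only ~10% of the rows
SELECT * FROM mytable FORCE INDEX (idx_column) WHERE `column` = 0;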