Index creation time in MySQL

I am designing a database in MySQL that will be filled with quite large amounts of raw data. Should I define the indexes before I insert the data, or should I insert the data first and then create the indexes? Is there any difference?
Also, if I want an index on 2 columns, is it better to index them separately or together?
Thanks

If you are doing a bulk load, my opinion is to not have indexes up front; constantly writing index pages will slow the load, especially with a larger data set. That being said, after the tables are populated, use a SINGLE statement to build ALL the indexes you expect instead of building them one by one. I learned this the hard way a long time ago. I had a table of 14+ million rows and had to build 15+ indexes. Each index took longer to build than the last. It appeared that each time I added a new index, it needed to rebuild the pages for the prior ones. Doing them all at once proved significantly better.
As for multiple column indexes... it depends on how you will query the data. If many queries WILL use two or more columns together in the WHERE condition, then yes, use multiple columns in a single index.
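A minimal sketch of the load-then-index approach described above. The table name raw_data, its columns, and the index names are hypothetical; adjust them to your own schema.

```sql
-- Hypothetical table, created with only its primary key for the bulk load.
CREATE TABLE raw_data (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    col_a   INT NOT NULL,
    col_b   INT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
);

-- ... bulk load the data here (LOAD DATA INFILE or batched INSERTs) ...

-- Build all secondary indexes in one statement instead of one by one.
ALTER TABLE raw_data
    ADD INDEX idx_col_a_col_b (col_a, col_b),
    ADD INDEX idx_col_b (col_b);
```

Whether the single statement is faster than separate ones depends on the MySQL version and storage engine, so it is worth timing both on a copy of the data.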

"Also I want to know: if I want an index on 2 columns, is it better to index them separately or together?"
This depends on your queries. When you have an index (colA, colB), the database generally can't use it for lookups when colA is missing from the WHERE condition of your queries. If you have queries with WHERE colB = ?, then you need an index that starts with that column.
index (colA, colB);
WHERE colA = ?; -- can use the index
WHERE colA = ? AND colB = ?; -- can use the index
This one can't use the index:
WHERE colB = ?;
But... if you change the order of the columns in the index:
index (colB, colA); -- different order
WHERE colB = ?; -- can use the index
WHERE colA = ? AND colB = ?; -- can use the index
And now this one can't use the index:
WHERE colA = ?;
Check your queries, use EXPLAIN, and create only the indexes you really need.
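The leftmost-prefix behaviour above can be checked with EXPLAIN. This is a sketch on a hypothetical table named demo; on an empty or very small table the optimizer may choose a different plan than it would on real data, so load representative rows before drawing conclusions.

```sql
-- Hypothetical demo table with one composite index.
CREATE TABLE demo (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    colA INT NOT NULL,
    colB INT NOT NULL,
    note VARCHAR(100),
    INDEX idx_a_b (colA, colB)
);

-- Leftmost column present: idx_a_b is a candidate (see the "key" column).
EXPLAIN SELECT * FROM demo WHERE colA = 1;
EXPLAIN SELECT * FROM demo WHERE colA = 1 AND colB = 2;

-- Leftmost column missing: idx_a_b cannot be used for a direct lookup.
EXPLAIN SELECT * FROM demo WHERE colB = 2;
```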

Insert data first.
If you search on the two columns both in combination and individually, you would (under normal circumstances) create:
idx_a (fldA, fldB)
idx_b (fldB)
regards,
//t

Typically, when you are doing large inserts of data, you will want to index it afterwards. That way MySQL doesn't have to maintain and rebuild the indexes as data is inserted, which speeds up the insert process.
The indexing strategy depends entirely on how you intend to query the database. Are you going to query the columns as a set (i.e. have both in the WHERE clause together) or individually (i.e. have one or the other in your WHERE clause)?

Related

Avoiding mysql filesort

I have a client running a PHP photo gallery (on PHP 5.5, MySQL 5.5, using MyISAM tables) that uses the directory tree method. Unfortunately, some of the queries in their gallery application are demanding horribly long filesorts. The offending query:
SELECT `name`, `slug`
FROM `db_table`
WHERE `left_ptr` <= '914731'
AND `right_ptr` >= '914734'
AND `id` <> 1
ORDER BY `left_ptr` ASC
There are indexes on id, left_ptr and right_ptr, but according to the EXPLAIN, none of them are being used in the query.
I heard that creating a composite index (on the 'condition' columns) would make things faster, but does that apply to this case? The last condition is really just an 'anything but 1' clause, so would a composite index apply to that, too? Thanks for any insight into this.
Yes, a composite index on (left_ptr, right_ptr) should make this query run better.
MySQL will usually use only one index per table in a query (the index merge optimization is the exception). It's likely not using any single index because it has determined that no single index would be much faster than a full table scan. For example, id <> 1 matches every row but one, so MySQL just does a full table scan. The other two filters depend on how the data is distributed, but if a filter doesn't exclude a significant portion of the table, MySQL won't use an index for it.
A composite index on (left_ptr, right_ptr) should make this query run better. Don't bother with id; as above, id <> 1 only filters out one row.
MySQL can use the first column of a composite index alone, so this composite index also replaces the one on left_ptr alone.
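As a sketch, the suggested composite index could be added and checked like this. The index name idx_left_right is made up for illustration, and whether the filesort disappears depends on how the data is distributed.

```sql
-- Hypothetical index name; the columns come from the query above.
ALTER TABLE `db_table`
    ADD INDEX `idx_left_right` (`left_ptr`, `right_ptr`);

-- Re-run EXPLAIN and compare the "key" and "Extra" columns with the old plan.
EXPLAIN
SELECT `name`, `slug`
FROM `db_table`
WHERE `left_ptr` <= '914731'
  AND `right_ptr` >= '914734'
  AND `id` <> 1
ORDER BY `left_ptr` ASC;
```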

Index on mysql partitioned tables

I have a table with two partitions. The partitions are pactive = 1 and pinactive = 0. I understand that two partitions do not give much of a gain, but I use them to truncate and load one partition and to do plain inserts into the other.
The problem comes when I create indexes.
Query goes this way
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
Created index -
create index idx_try on customformattributes(partitionflag,companyid,activityname,completiondate,attributename,isclosed)
Around 200000 records will be retrieved by the above query, but the query with the mentioned index takes 30+ seconds. What is the reason for such a long time? Also, if I remove partitionflag from the mentioned index, the index is not even used.
And is my understanding correct that, even with the partitions available, the optimizer needs the partitioning column in the index definition so that it only hits the required partition?
Any ideas on understanding this would be very helpful
You can optimize your index by reordering its columns. Usually the columns in an index are ordered by their cardinality (starting from the highest and going down to the lowest). Cardinality is the uniqueness of the data in a given column. So in your case I suppose there are many distinct values of companyid in the customformattributes table, while partitionflag will have a cardinality of 2 (if the only values for this column are 1 and 0).
With your index, the query will first filter all the rows with partitionflag=0, then it will filter by companyid, and so on.
When you removed partitionflag from the index, the query did not use the index, maybe because the optimizer decided that a full table scan would be faster than using the index (in most cases the optimizer is right).
For the given query:
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
the following index might be better:
create index idx_try on customformattributes(companyid,activityname, completiondate,attributename, partitionflag, isclosed)
For the query to use an index, the following rule must be met: the leftmost column of the index should be present in the WHERE clause. Depending on the MySQL version you are using, additional query requirements may apply. For example, if you are using an old version of MySQL, you may need to list the columns in the WHERE clause in the same order as they are listed in the index. In recent versions of MySQL the query optimizer takes care of ordering the conditions in the WHERE clause correctly.
Your SELECT query took 30+ seconds because it returns 200k rows and because the index might not be optimal for the given query.
For the second question about the partitioning: the common rule is that the column you are partitioning by must be part of all the UNIQUE keys in a table (Primary key is also unique key by definition so the column should be added to the PK also). If table structure and logic allows you to add the partitioning column to all the UNIQUE indexes in the table then you add it and partition the table.
When the partitioning is done correctly, you can take advantage of partition pruning: a SELECT query then searches only the partitions where the requested data is stored (otherwise it looks in all partitions).
You can read more about partitioning here:
https://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
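To see whether pruning happens for the query in question, something like the following can be used. It assumes the table is partitioned on partitionflag, as the question suggests; the partitions column of the output should then list only the partition that holds partitionflag = 0.

```sql
-- MySQL 5.6 syntax. From 5.7 on, plain EXPLAIN already prints a
-- "partitions" column, and EXPLAIN PARTITIONS was removed in 8.0.
EXPLAIN PARTITIONS
SELECT partitionflag, companyid, activityname
FROM customformattributes
WHERE companyid = 47
  AND activityname = 'Activity 1'
  AND partitionflag = 0;
```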
The query is slow simply because disks are slow.
Cardinality is not important when designing an index.
The optimal index for that query is
INDEX(companyid, activityname, partitionflag) -- in any order
It is "covering" since it includes all the columns mentioned anywhere in the SELECT. This is indicated by "Using index" in the EXPLAIN.
Leaving off the other 3 columns makes the query faster because it will have to read less off the disk.
If you make any changes to the query (add columns, change from '=' to '>', add ORDER BY, etc), then the index may no longer be optimal.
"Also, if remove the partitionflag from the mentioned index, the index is not even used." -- That is because it was no longer "covering".
Keep in mind that there are two ways an index may be used -- "covering" versus being a way to look up the data. When you don't have a "covering" index, the optimizer chooses between using the index and bouncing between the index and the data versus simply ignoring the index and scanning the table.
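A sketch of the covering index suggested above, with a hypothetical index name (idx_cover). If the index is used as a covering index, the Extra column of EXPLAIN should show "Using index".

```sql
-- Three-column index containing every column the query mentions.
CREATE INDEX idx_cover
    ON customformattributes (companyid, activityname, partitionflag);

-- Check the "key" and "Extra" columns of the plan.
EXPLAIN
SELECT partitionflag, companyid, activityname
FROM customformattributes
WHERE companyid = 47
  AND activityname = 'Activity 1'
  AND partitionflag = 0;
```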

Does an index improve performance when using modulo?

Imagine a MySQL table with one field id containing 1 billion rows from number 1 to a billion.
When I do a query like this
SELECT * FROM table WHERE id > 2000 AND id < 5000;
It is obvious that an index on id will improve the performance of that query.
However does such an index also help with modulo as in the following query
SELECT * FROM table WHERE (id % 4) = 0;
Does using an index help when using modulo?
No.
Functions on columns used in an index (almost) always preclude the use of the index. Even if this weren't true, the optimizer might decide not to use an index anyway. Fetching just one out of four records may not be selective enough for the index to be worthwhile.
In Oracle DB, for example, you can define so-called function-based indexes for this purpose, where the modulo function is part of the index definition. MySQL had no equivalent for a long time; MySQL 5.7 and later can index a generated column, and MySQL 8.0.13 and later accept functional key parts.
What you could do as a workaround is add an additional column in which you store the result of your modulo function. You have to modify your insert scripts to fill it for future inserts, and update the existing rows. Then you can add an index on that column and use it in your WHERE clause.
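The workaround could look roughly like this. The table name numbers and the column name id_mod4 are made up for illustration. As noted above, one row in four may still not be selective enough for the optimizer to prefer the index.

```sql
-- Hypothetical stand-in for the one-column table in the question.
CREATE TABLE numbers (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY
);

-- Extra column that stores the precomputed modulo result.
ALTER TABLE numbers
    ADD COLUMN id_mod4 TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- Backfill existing rows (on a very large table, do this in batches).
UPDATE numbers SET id_mod4 = id % 4;

CREATE INDEX idx_id_mod4 ON numbers (id_mod4);

-- Future inserts must fill the column themselves.
INSERT INTO numbers (id, id_mod4) VALUES (1000000001, 1000000001 % 4);

-- The filter is now a plain comparison on an indexed column.
SELECT * FROM numbers WHERE id_mod4 = 0;
```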

Two non-primary/unique indexes in a three column table

I've got a three-column table. It has a unique index, and another two indexes (on two different columns) for faster queries.
+-------------+-------------+----------+
| category_id | related_id | position |
+-------------+-------------+----------+
Sometimes the query is
SELECT * FROM table WHERE category_id = foo
and sometimes it's
SELECT * FROM table WHERE related_id = foo
So I decided to index both category_id and related_id for better performance. Is this bad practice? What are the downsides of this approach?
If I already have 100,000 rows in that table and am inserting another 100,000, will it be overkill to have the indexes refreshed with every new insert? Would that operation then take too long? Thanks
There are no downsides if it's doing exactly what you want: you query on a specific column a lot, so you index that column; that's the whole point. If, on the other hand, you have a 60-column table and you're adding indexes to columns you never query on, then you are wasting resources, because those indexes need to be maintained on INSERT/UPDATE/DELETE operations.
If you have created an index for each of the two columns, then you will definitely benefit from it.
For these queries, which each filter on a single column, don't go for composite (multiple-column) indexes.
You yourself can see the advantage of index in your query by using EXPLAIN (statement provides information about how MySQL executes statements).
EXAMPLE:
EXPLAIN SELECT * FROM table WHERE category_id = foo;
Hope this will help.
~K
It's good to have indexes. Just understand that indexes take more disk space in exchange for faster searches.
It is in your best interest to index those fields which have few repeated values. For example, indexing a field that contains a Boolean flag might not be a good idea.
Since in your case you are having an id, hence I think you won't be having any problem in keeping the indexes that you have created.
Also, the inserts would be slower, but since you are saving id's there won't be much of a difference in the time required to insert. Go ahead and do the insert.
My personal advice :
When you are inserting a large number of rows into a single table in one go, don't insert them all with a single query unless you have to; split them into batches. This prevents your table from being locked and inaccessible for a long time.
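A sketch of what batched inserts could look like. The table name category_relations is hypothetical (the question does not name the table), the values are made up, and a sensible batch size is something to measure rather than assume.

```sql
-- Batch 1: a multi-row INSERT of modest size.
INSERT INTO category_relations (category_id, related_id, position) VALUES
    (1, 10, 1),
    (1, 11, 2),
    (1, 12, 3);

-- Batch 2: sent as a separate statement, so other sessions
-- get a chance to access the table between batches.
INSERT INTO category_relations (category_id, related_id, position) VALUES
    (2, 10, 1),
    (2, 13, 2);
```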

SELECT vs UPDATE performance with index

If I SELECT IDs then UPDATE using those IDs, then the UPDATE query is faster than if I would UPDATE using the conditions in the SELECT.
To illustrate:
SELECT id FROM table WHERE a IS NULL LIMIT 10; -- 0.00 sec
UPDATE table SET field = value WHERE id IN (...); -- 0.01 sec
The above is about 100 times faster than an UPDATE with the same conditions:
UPDATE table SET field = value WHERE a IS NULL LIMIT 10; -- 0.91 sec
Why?
Note: the a column is indexed.
Most likely the second UPDATE statement locks many more rows, while the first one uses a unique key and locks only the rows it's going to update.
The two queries are not identical. You only know that the IDs are unique in the table.
UPDATE ... LIMIT 10 will update at most 10 records.
UPDATE ... WHERE id IN (SELECT ... LIMIT 10) may update more than 10 records if there are duplicate ids.
I don't think there can be a one straight-forward answer to your "why?" without doing some sort of analysis and research.
The SELECT queries are normally cached, which means that if you run the same SELECT query multiple times, the execution time of the first query is normally greater than the following queries. Please note that this behavior can only be experienced where the SELECT is heavy and not in scenarios where even the first SELECT is much faster. So, in your example it might be that the SELECT took 0.00s because of the caching. The UPDATE queries are using different WHERE clauses and hence it is likely that their execution times are different.
Although the column a is indexed, MySQL does not necessarily use that index when doing the SELECT or the UPDATE. Please study the EXPLAIN outputs. Also, look at the output of SHOW INDEX and check whether the "Comment" column reads "disabled" for any index. You may read more here - http://dev.mysql.com/doc/refman/5.0/en/show-index.html and http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html.
Also, if we ignore the SELECT for a while and focus only on the UPDATE queries, it is obvious that they aren't using the same WHERE condition: the first one filters on the id column and the latter on a. Although both columns are indexed, that does not necessarily mean that all the table's indexes perform alike. One index may be more efficient than another depending on the size of the index, the datatype of the indexed column, or whether it is a single- or multiple-column index. There might well be other reasons, but I'm not an expert on it.
Also, I think that the second UPDATE is doing more work, in the sense that it might be taking more row-level locks than the first UPDATE. It is true that both UPDATEs end up updating the same number of rows. But where the first UPDATE locks 10 rows, I think the second UPDATE locks all rows with a as NULL (which is more than 10) before doing the update. Perhaps MySQL first applies the locking and then applies the LIMIT clause to update only the limited set of records.
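The checks mentioned above might look like this. The table name items is a hypothetical stand-in for the question's table (which is literally called "table", a reserved word), and 'value' is a placeholder.

```sql
-- Which index, if any, does each statement use? See the "key" column.
EXPLAIN SELECT id FROM items WHERE a IS NULL LIMIT 10;

-- EXPLAIN for UPDATE statements requires MySQL 5.6.3 or later.
EXPLAIN UPDATE items SET field = 'value' WHERE a IS NULL LIMIT 10;

-- Lists the table's indexes; check the Comment column for "disabled".
SHOW INDEX FROM items;
```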
Hope the above explanation makes sense!
Do you have a composite index or separate indexes?
If it is a composite index of id and a columns,
in the 2nd UPDATE statement the index would not be used for column a. The reason is that only a leftmost prefix of the index can be used (unless a is the PRIMARY KEY).
So if you want the composite index to be used for a, you need to include id in your WHERE clause as well, with id first and then a.
Also it depends on what storage engine you are using since MySQL does indexes at the engine level, not server.
You can try this:
UPDATE table SET field = value WHERE id IN (...) AND a IS NULL LIMIT 10;
By doing this, id is the leftmost column of the index, followed by a.
Also, from your comments: the lookups are much faster because, if you are using InnoDB, updating indexed columns means that the InnoDB storage engine may have to move index entries to a different page, or split a page if the page is already full, since InnoDB stores indexes in sorted order. This process is VERY slow and expensive, and it gets even slower if your indexes are fragmented or your table is very big.
The comment by Michael J.V is the best description. This answer assumes a is a column that is not indexed and 'id' is.
The WHERE clause in the first UPDATE command is working off the primary key of the table, id
The WHERE clause in the second UPDATE command is working off a non-indexed column. This makes finding the rows to be updated significantly slower.
Never underestimate the power of indexes. A table will perform better if the indexes are used correctly than a table a tenth the size with no indexing.
Regarding "MySQL doesn't support updating the same table you're selecting from"
UPDATE table SET field = value
WHERE id IN (SELECT id FROM table WHERE a IS NULL LIMIT 10);
Just do this:
UPDATE table SET field = value
WHERE id IN (SELECT id FROM (SELECT id FROM table WHERE a IS NULL LIMIT 10) AS tmp);
The accepted answer seems right but is incomplete, there are major differences.
As far as I understand, and I'm not a SQL expert:
In the first case you SELECT N rows and UPDATE them using the primary key.
That's very fast, as you have direct access to all the rows through the fastest possible index.
In the second case you UPDATE N rows using LIMIT.
That will lock all the rows and release them again after the update is finished.
The big difference is that you have a RACE CONDITION in case 1) and an atomic UPDATE in case 2)
If you have two or more simultaneous calls of the case 1) queries, you can end up selecting the SAME ids from the table.
Both calls will then update the same ids simultaneously, overwriting each other.
This is called a "race condition".
The second case avoids that issue: MySQL will lock all the rows during the update.
If a second session runs the same command, it will have to wait until the rows are unlocked.
So no race condition is possible at the expense of lost time.
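One possible way to keep the fast primary-key UPDATE of case 1) while closing the race window is a locking read inside a transaction. This is not taken from the answers above; it is a sketch that assumes InnoDB (MyISAM has neither transactions nor row locks), and the table name items and the id values are placeholders.

```sql
START TRANSACTION;

-- Locking read: a concurrent session running the same statement waits
-- here until this transaction commits or rolls back.
SELECT id FROM items WHERE a IS NULL LIMIT 10 FOR UPDATE;

-- The application substitutes the ids returned above;
-- 1, 2, 3 are placeholders.
UPDATE items SET field = 'value' WHERE id IN (1, 2, 3);

COMMIT;
```

Note that the waiting session will pick the same rows again after the commit unless the UPDATE changes them so that they no longer match a IS NULL.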