Multi-column index combined with unique index efficiency - MySQL

We have a table with multiple columns. One column (let's call it gbid) has a UNIQUE index, and another column (let's call it flag) has no index. This table can be quite large. We query WHERE gbid IN (...) AND flag = 1 a lot, we occasionally query WHERE gbid = "XXX", and we rarely query WHERE flag = 1.
Which is more efficient when it comes to indexes:
1. Have gbid as UNIQUE and flag with no index
2. Have gbid as UNIQUE and a multi-column index on (gbid, flag)
3. Have gbid as UNIQUE and a multi-column index on (flag, gbid)

It depends on the percentage of rows with flag = 1, and on how many rows you select (how many gbids you have in the IN clause).
If the percentage is low (1-2%) and you are selecting a lot of gbids, options 2 and 3 might be faster. I think option 3 will be better in that case, since the equality test flag = 1 can lead the index.
If you have a more even distribution of flag values, having it in the index won't make a difference.
If you want to be sure, benchmark with a sample of real data.
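If you go with option 3, here is a minimal sketch (the table name t is a placeholder; since the query touches only flag and gbid, the index is also covering):
ALTER TABLE t ADD INDEX idx_flag_gbid (flag, gbid);
EXPLAIN SELECT gbid FROM t WHERE gbid IN ('A', 'B', 'C') AND flag = 1;
-- with (flag, gbid), flag = 1 is an equality prefix, so each gbid in the IN list
-- becomes a point lookup inside the flag = 1 range; EXPLAIN should report "Using index"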

Related

Clustered index on integer column surprisingly slow

I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900 ms. EXPLAIN shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400 ms. EXPLAIN shows that the index is used in this case.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and an RTREE index, and I use MBRContains to count matching rows (via that secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means the data is sorted by the PK, so WHERE pk BETWEEN x AND y must read all the rows from x through y.
So how does it scan by PK? It must read the data blocks, and they are bulky because they carry all the other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, artificially adding a secondary index on the narrowest column is likely to speed up SELECT COUNT(*) FROM tbl.
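For example, a minimal sketch (idx_status and status are hypothetical; status stands for whatever narrow column the table already has):
ALTER TABLE tbl ADD INDEX idx_status (status);
EXPLAIN SELECT COUNT(*) FROM tbl;  -- the key column should now show idx_status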
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
SPATIAL and FULLTEXT indexing complicate this discussion, especially if the WHERE has two parts: one with a SPATIAL or FULLTEXT test, one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.

Is full table scan needed for counting rows with some attribute bigger than x?

Let's say there is a table of people, with an age column which is indexed. How fast would a query be to count people older than 20: SELECT COUNT(*) FROM people WHERE age > 20? Is a full table scan required? The database is MySQL.
If the column age is not indexed, then yes, a full table scan is required.
Even if it is indexed, a table scan may be required anyway when the distribution of age values is such that more than a certain threshold percentage of the records have age > 20. It works this way: for each row the query would return, the processor must execute n disk I/O operations, where n is the number of levels in the index. Say there are a million rows in the table and the index on age is 5 levels deep. If more than 200k rows have age > 20, then for each of those rows the processor has to execute 5 I/Os, for a total of 200k * 5 = 1 million I/Os. So the optimizer reasons: if my statistics indicate that more than 200k rows would be returned, I might as well do a complete table scan, which will require less than 1 million I/Os.
The only exception is if the entire table is clustered on the age column; then you only need to traverse the index to find the boundaries of the age range you want to filter on.
There are some errors in the Accepted Answer. Rather than dissect that Answer, I will start fresh:
Given SELECT COUNT(*) FROM people WHERE age > 20, here is the performance for InnoDB, fastest first:
1. `INDEX(age)` -- Range scan within the index
2. `INDEX(age, ...)` -- Range scan within the index
3. `INDEX(foo, age)` -- Full Index scan
4. `PRIMARY KEY(age, ...)` -- Range scan within the table
5. No indexes -- Table scan needed
6. `PRIMARY KEY(foo, ...)` -- Table scan needed (same as "No index")
Notes and caveats:
INDEX(age, ...) is a little slower than INDEX(age) only because the index is bulkier.
Any secondary index containing all the columns mentioned anywhere in the SELECT (just age, in this example) is called a "covering" index. EXPLAIN will say Using index (not to be confused with Using index condition). A covering index is faster than other secondary indexes. (If we had another column in the select, I could say more.)
Note "Range scan" vs "scan" -- This is where the processing can drill down the BTree index (primary or secondary) to where age = 20 and scan forward. That is, it does not need to scan the entire table or index, hence Range scan" is faster than "scan".
Items 3 and 4 may not be in the correct order. Item 3 may be faster when the index is significantly less bulky than the table. Item 4 may be faster when the range is a small fraction of the table. Because of this "maybe", I can't say "a covering index is always faster than using the PRIMARY KEY". Instead I can only say "usually faster".
A million rows is likely to have only 3 levels of BTree. However this part of the computation is almost never worth pursuing. (Rule of Thumb: each level of an index or table BTree fans out by a factor of 100.)
If the necessary part of the data or index is not already cached in RAM, then there will be I/O -- this can drastically slow down any of the cases. It can even turn the fastest case into slower than all the rest.
If the data/index is too big to be cached, then there will always be I/O. In this case the ordering will stay roughly the same, but the differences will be more pronounced. (For example, "bulkier" becomes a significant factor.)
SELECT name FROM t WHERE age>20 is a different can of worms. Some of what I have said does not carry over to it. (Ask another Question if you want me to spell that out. It will have more cases.)
MyISAM and MEMORY have differences relative to InnoDB.
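A quick way to check which case you are in (a sketch, assuming the table and column names from the question; idx_age is a hypothetical name):
CREATE INDEX idx_age ON people (age);
EXPLAIN SELECT COUNT(*) FROM people WHERE age > 20;
-- expect type = range, key = idx_age, Extra = "Using where; Using index"
-- ("Using index" means the count is satisfied from the index alone)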

Ensure certain default sort order in MySQL table

I have a large MySql table with over 11 million rows. This is just a huge data set and my task is to be able to analyze the dataset based on certain rules.
Each row belongs to a certain category. There are 2 million different categories. I want to get all rows for a category and perform operations on that.
So currently, I do the following:
Select distinct categories from the table.
For each category: SELECT fields FROM table WHERE category = <category>
Even though my category column is indexed, it takes a really long time to execute Step 2. This is mainly because of the huge data set.
Alternatively, I could use a GROUP BY clause, but I am not sure it would be as fast: GROUP BY on such a huge dataset may be expensive, especially considering that I will be running my analysis several times on parts of the dataset. A way to permanently ensure a sorted table would be useful.
Therefore, as an alternative, I could speed up my queries if my table were pre-sorted by category. Then I could just read the table row by row and perform the same operations much faster, as all rows of one category would be fetched consecutively.
As the dataset (MySQL table) is fixed and no update, delete, or insert operations will be performed on it, I want a way to maintain a default sort order by category. Can anyone suggest a trick to ensure the default sort order of the rows?
Maybe read all rows and rewrite them to a new table or add a new primary key which ensures this order?
Even though my category column is indexed
Indexed by a secondary index? If so, you can encounter the following performance problems:
InnoDB tables are always clustered and the secondary index in clustered table can require a double-lookup (see the "Disadvantages of clustering" in this article).
Indexed rows can be scattered all over the place (the index can have a bad clustering factor - the link is for Oracle but the principle is the same). If so, an index range scan (such as WHERE category = whatever) can end up loading many table pages, even though the index is used and only a small subset of rows is actually selected. This can destroy the range-scan performance.
As an alternative to the secondary index, consider using a natural primary key, which in InnoDB tables also acts as the clustering key. A primary/clustering key such as (category, no) will keep the rows of the same category physically close together, making both of your queries (and especially the second one) maximally efficient.
OTOH, if you want to keep the secondary index, consider covering all the fields that you query, so the primary B-Tree doesn't have to be touched at all.
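A sketch of that clustering-key approach (MySQL 8+ for ROW_NUMBER(); names are placeholders, and `no` is just a per-category sequence that makes the key unique):
CREATE TABLE t_sorted (
    category INT NOT NULL,
    `no` INT NOT NULL,             -- sequence within the category
    payload VARCHAR(255),          -- stands in for the real data columns
    PRIMARY KEY (category, `no`)   -- clustering key: same-category rows are stored together
) ENGINE=InnoDB;
INSERT INTO t_sorted (category, `no`, payload)
SELECT category,
       ROW_NUMBER() OVER (PARTITION BY category ORDER BY category),
       payload
FROM t;
After the rebuild, SELECT ... WHERE category = X reads one contiguous range of pages instead of gathering scattered rows.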
You can do this in one pass regardless of indexing, with something like the following (a minimal sketch as a MySQL stored procedure with a cursor; table and column names are placeholders):
DELIMITER //
CREATE PROCEDURE process_by_category()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE last_category INT DEFAULT NULL;
    DECLARE cur_category INT;
    -- scan in category order; fetch whatever other columns you need alongside category
    DECLARE cur CURSOR FOR SELECT category FROM t ORDER BY category;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO cur_category;
        IF done THEN
            LEAVE read_loop;
        END IF;
        IF last_category IS NULL OR last_category <> cur_category THEN
            -- do any "new category" steps here
            SET last_category = cur_category;
        END IF;
        -- process the row here
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;
With the index on category I'd expect this to perform OK. Your performance issues may be down to what you are doing when processing each row.
Here's an example: http://sqlfiddle.com/#!2/e53c98/1

Is there a fast alternative to "ORDER BY RAND()" that does not require evenly distributed integers as primary keys?

I want to extract a random row from a table. Using ORDER BY RAND() and taking the first row is slow because a temporary table has to be created and sorted. The standard alternative is to rely on a unique primary index that has to be an integer.
However, this does not return good results if the primary keys aren't evenly distributed. Additionally, it requires that I maintain an additional column of integers.
I have done some random selection in T-SQL with unevenly distributed keys that does not require an additional column to be added. This is how:
Check how many valid rows there are in the table (COUNT(...))
Randomize a number between 1 and the number of rows
Query the row using the random number as an index
Even if your primary keys aren't evenly distributed, you can still make use of them with an open-ended range query:
SELECT thing FROM table WHERE pk_id > 134 LIMIT 1;
Even if there is no row that has a key of 134, you'll get the next one in the chain. The nice thing about this approach is that it's a simple range scan and highly efficient. You also do not need to know how many rows are in the table (say via SELECT COUNT(*) ...), which is costly with InnoDB (and you should be using InnoDB). You DO need to know the max row id, but that's efficient to grab (SELECT MAX(pk_id) FROM table) and can be cached.
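A minimal sketch of the whole recipe (assuming an InnoDB table t with integer primary key pk_id; RAND() is evaluated once, in the first statement, so it cannot vary per row):
SET @r = FLOOR(1 + RAND() * (SELECT MAX(pk_id) FROM t));       -- random point in the key space
SELECT thing FROM t WHERE pk_id >= @r ORDER BY pk_id LIMIT 1;  -- cheap range scan
Note the bias this inherits from the gaps: a row sitting after a large gap in pk_id is chosen more often, which is exactly the "next one in the chain" behavior described above.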

Indexing SQL database

I have some questions about indexing an SQL database:
Is it better to index a boolean column, or rather not, because there are only 2 possible values? I know that if the table is small, indexing will not change anything, but I'm asking about a table with 1 million records.
If I have two dates, ValidFrom and ValidTo, is it better to create one index with both columns or two separate indexes? In 90% of queries I use WHERE validfrom < date AND validto > date, but there are also a few selects with only validfrom or only validto.
What's the difference between a clustered and a non-clustered index? I can't find any article, so a link would be great.
You tagged both MySQL and SQL Server. This answer is MySQL-inspired.
It depends on many things, but more important than the size is the variation. If about 50% of the values are TRUE, that means the rest of the values (also about 50%) are FALSE and an index will not help much. If only 2% of the values are TRUE and your queries often only need TRUE records, this index will be useful!
If your queries often use both, put both in the index. If one is used more than the other, put that one FIRST in the index, so the composite index can be used for the one field as well.
A clustered index means that the data actually is inside the index. A non-clustered index just points to the data, which is actually stored elsewhere. The PRIMARY KEY in InnoDB is a clustered index.
If you want to use Indexes in MySQL, EXPLAIN is your friend!
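Applied to the two-date question above, that suggests something like this (a sketch; periods and idx_valid are placeholder names):
CREATE INDEX idx_valid ON periods (validfrom, validto);
-- serves WHERE validfrom < ? AND validto > ? (range scan on the first column,
-- with validto checked inside the index) and WHERE validfrom < ? alone (leftmost prefix)
EXPLAIN SELECT * FROM periods WHERE validfrom < '2020-06-01' AND validto > '2020-06-01';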
This is all for SQL Server, which is what I know about...
1 - Depends on cardinality, but as a rule an index on a single boolean field (BIT in SQL Server) won't be used since it's not very selective.
2 - Make 2 indexes, one with both, and the other with just the second field from the first index. Then you are covered in both cases.
3 - Clustered indexes contain the data for ALL fields at the leaf level (the entire table basically) ordered by your clustered index field. Non-clustered indexes contain only the key fields and any INCLUDEd fields at the leaf level, with a pointer to the clustered index row if you need any other data from other fields for that row.
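A sketch of that two-index scheme (names are placeholders; the CREATE INDEX syntax works in both SQL Server and MySQL):
CREATE INDEX IX_ValidFrom_ValidTo ON Periods (ValidFrom, ValidTo);  -- covers ValidFrom alone, and both together
CREATE INDEX IX_ValidTo ON Periods (ValidTo);                       -- covers queries on ValidTo alone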
If you use the "Filtered Index", the number of records up to 2 million with no problems.
Create 1 Non clustered index instead of 2 Filtered Index
Different in user experience, these two aspects are not related to each other nothing. The search index (PK: Primary Key) is different than searching for a range of values ​​(Non clustered Index often used in tracing the value range), in fact finding by PK represented less than 1% queries