SQL Server 2008 Index Optimization - clustered lookup vs nonclustered include - sql-server-2008

This is a long, involved question about index optimization theory. This is not homework, though I was first exposed to this question in a sample exam for Microsoft's 70-432. The original question was about general query optimization, but then I found this peculiar behavior I could not explain.
First, the table:
CREATE TABLE Invoice_details (
    Invoice_id int NOT NULL,
    Customer_id int NOT NULL,
    Invoice_date datetime DEFAULT GETDATE() NULL,
    Amount_total int NULL,
    Serial_num int IDENTITY (1,1) NOT NULL)
Now, a clustered index, and the two indexes for testing:
CREATE UNIQUE CLUSTERED INDEX [ix_serial] ON [dbo].[Invoice_details] ([Serial_num] ASC)
/* Below is the "original" index */
CREATE NONCLUSTERED INDEX [ix_invoice_customer] ON [dbo].[Invoice_details]
([Invoice_id] ASC,[Customer_id] ASC)
/* Below is the "optimized" index (adds one included field) */
CREATE NONCLUSTERED INDEX [ix_invoice_customer_inc] ON [dbo].[Invoice_details]
([Invoice_id] ASC,[Customer_id] ASC) INCLUDE ([Invoice_date])
I also added some random test data to the table - 100000 rows. Invoice_id, Customer_id, and Amount_total each received their own random values (range 1000-9999), and Invoice_date received GETDATE() plus a random number of seconds (range 1000-9999). I can provide the actual routine I used, but did not think the specifics would be relevant.
And finally, the query:
SELECT Invoice_id,Customer_id,Invoice_date FROM Invoice_details WHERE Customer_id=1234;
Obviously, the query's first step will be a nonclustered index scan. Regardless of which index is used, that first step will return the same number of index rows. With the "original" index, the next step will be a key lookup via the clustered index to retrieve Invoice_date, with the results joined back to the index rows. With the "optimized" index, that field is included in the index leaf, so the planner goes straight to returning the results.
Which index results in faster execution, and why?

It depends ... on the tipping point.

Assuming no issues such as fragmentation, it comes down to the selectivity of the query.
The two indexes are very similar. Because the "optimized" one includes an additional column in the leaf pages, a full scan of that index may well mean more pages need to be read compared to the original one. However, if more than a handful of rows are to be returned, I would expect the benefit of not needing the lookups to very quickly outweigh this minor disadvantage.
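The covering effect is easy to see in miniature. The sketch below uses Python's bundled sqlite3 rather than SQL Server (SQLite has no INCLUDE, so the "optimized" index is approximated by appending Invoice_date as a trailing key column), but the plan it reports demonstrates the same thing: when the index carries every column the query needs, no lookup into the base table is required.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Invoice_details (
    Serial_num   INTEGER PRIMARY KEY,   -- stands in for the clustered index
    Invoice_id   INTEGER NOT NULL,
    Customer_id  INTEGER NOT NULL,
    Invoice_date TEXT,
    Amount_total INTEGER
);
-- No INCLUDE in SQLite: the "optimized" index simply appends the column.
CREATE INDEX ix_invoice_customer_inc
    ON Invoice_details (Invoice_id, Customer_id, Invoice_date);
""")

query = ("SELECT Invoice_id, Customer_id, Invoice_date "
         "FROM Invoice_details WHERE Customer_id = 1234")
plans = [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query)]
print(plans)  # reports a COVERING INDEX: the query never touches the table
```

The plan line contains "USING COVERING INDEX", which is SQLite's way of saying the lookup step has been eliminated, just as INCLUDE does in SQL Server.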

Related

MySQL Clustered vs Non Clustered Index Performance

I'm running a couple of tests on MySQL clustered vs non-clustered indexes, where I have a table 100gb_table which contains ~60 million rows:
100gb_table schema:
CREATE TABLE 100gb_table (
id int PRIMARY KEY NOT NULL AUTO_INCREMENT,
c1 int,
c2 text,
c3 text,
c4 blob NOT NULL,
c5 text,
c6 text,
ts timestamp NOT NULL default(CURRENT_TIMESTAMP)
);
and I'm executing a query that only reads the clustered index:
SELECT id FROM 100gb_table ORDER BY id;
I'm seeing that it takes ~55 minutes for this query to complete, which is strangely slow. I modified the table by adding another index on top of the primary key column and ran the following query, which forces the non-clustered index to be used:
SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
This finished in <10 minutes, much faster than reading with the clustered index. Why is there such a large discrepancy between these two? My understanding is that both indexes store the index column's values in a tree structure, except the clustered index contains table data in the leaf nodes so I would expect both queries to be similarly performant. Could the BLOB column possibly be distorting the clustered index structure?
The answer comes in how the data is laid out.
The PRIMARY KEY is "clustered" with the data; that is, the data is ordered by the PK in a B+Tree structure. To read all of the ids, the entire B+Tree must be read.
Any secondary index is also in a B+Tree structure, but it contains (1) the columns of the index, and (2) the columns of the PK.
In your example (with lots of [presumably] bulky columns), the data BTree is a lot bigger than the secondary index (on just id). Either test probably required reading all the relevant blocks from the disk.
A side note... This is not as bad as it could be. There is a limit of about 8KB on how big a row can be. TEXT and BLOB columns, when short enough, are included in that 8KB. But when one is bulky, it is put in another place, leaving behind a 'pointer' to the text/blob. Hence, the main part of the data BTree is smaller than it might be if all the text/blob data were included directly.
Since SELECT id FROM tbl is a mostly unnecessary query, the design of InnoDB does not worry about the inefficiency you discovered.
Tack on ORDER BY or WHERE, etc., and there are many different optimizations that could come into play. You might even find that INDEX(c1) will let your query run in not much more than 10 minutes. (I think I have given you all the clues for 'why'.)
Also, if you had done SELECT * FROM tbl, it might have taken much longer than 55 minutes. This is because of having extra [random] fetches to get the texts/blobs from the "off-record" storage. And from the network time to shovel far more data.
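A back-of-envelope calculation makes the size gap concrete. The row sizes below are assumed for illustration (the real averages depend on the data), but the arithmetic shows why scanning the wide clustered B+Tree costs so much more than scanning the narrow secondary index:

```python
# Assumed sizes, not measurements; only the ratio matters here.
rows = 60_000_000
clustered_row_bytes = 1_700    # wide row: several text columns kept in-page
secondary_row_bytes = 8        # id (4 bytes) + the implicit PK copy (4 bytes)
page_bytes = 16 * 1024         # InnoDB's default page size

clustered_pages = rows * clustered_row_bytes // page_bytes
secondary_pages = rows * secondary_row_bytes // page_bytes
print(clustered_pages, secondary_pages)  # the clustered tree spans ~200x more pages
```

With numbers in this ballpark, the clustered scan must read roughly two hundred times as many pages, which lines up with a 55-minute vs sub-10-minute difference once caching and read-ahead are factored in.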

MySQL query run time is better even though its execution plan is bad

I am trying to optimize this MySQL query and having less experience in understanding execution plan I am having hard time making sense of the execution plan.
My question is: can you please help me understand why the query execution plan of the new query is worse than that of the original query, even though the new query performs better in prod?
SQL needed to reproduce this case is here
I have also kept the relevant table definitions at the end (table bill_range references bill via the foreign key bill_id).
The original query takes 10 seconds to complete in PROD:
select *
from bill_range
where (4050 between low and high )
order by bill_id limit 1;
while the new query (where I force/suggest an index) takes 5 seconds to complete in PROD:
select *
from bill_range
use index ( bill_range_low_high_index)
where (4050 between low and high )
order by bill_id limit 1;
But the execution plan suggests the original query is better (this is the part where my understanding seems to be wrong):
Original query
New query
Column "type" for original query suggest index while new query
says ALL
Column "Key" is bill_id (perhaps index on FK) for
original queryand Null for new query
Column "rows" for original query is 1 while for new query says 9
So given all this information wouldn't it imply that new query is actually worse than original query .
And if that is true why is new query performing better? Or am I reading the execution plan wrong.
Table definitions
CREATE TABLE bill_range (
id int(11) NOT NULL AUTO_INCREMENT,
low varchar(255) NOT NULL,
high varchar(255) NOT NULL,
PRIMARY KEY (id),
bill_id int(11) NOT NULL,
FOREIGN KEY (bill_id) REFERENCES bill(id)
);
CREATE TABLE bill (
id int(11) NOT NULL AUTO_INCREMENT,
label varchar(10),
PRIMARY KEY (id)
);
create index bill_range_low_high_index on bill_range( low, high);
NOTE: The reason I am providing the definitions of both tables is that the original query decided to use an index based on the foreign key to the bill table.
Your index isn't quite optimal for your query. Let me explain if I may.
MySQL indexes use BTREE data structures. Those work well in indexed-sequential access mode (hence the MyISAM name of MySQL's first storage engine). They favor queries that jump to a particular place in an index and then run through the index element by element. The typical example is this, with an index on col:
SELECT whatever FROM tbl WHERE col >= constant AND col <= constant2
That is a rewrite of WHERE col BETWEEN constant AND constant2.
Let's recast your query so this pattern is obvious, and so the columns you want are explicit.
select id, low, high, bill_id
from bill_range
where low <= 4050
and high >= 4050
order by bill_id limit 1;
An index on the high column allows a range scan starting with the first eligible row with high >= 4050. Then, we can go on to make it a compound index, including the bill_id and low columns.
CREATE INDEX high_billid_low ON bill_range (high, bill_id, low);
Because we want the lowest matching bill_id, we put that into the index next, then finally the low value. So the query planner random-accesses the index to the first eligible row by high, then scans until it finds the first index item that meets the low criterion. And then it's done: that's the desired result. It's already ordered by bill_id, so it can stop; the ORDER BY comes from the index. The query can be satisfied entirely from the index: it is a so-called covering index.
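Here is the same idea in miniature with Python's sqlite3 (a different engine than MySQL, and low/high are INTEGER here for simplicity, but the covering behavior is the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bill_range (
    id      INTEGER PRIMARY KEY,
    low     INTEGER NOT NULL,
    high    INTEGER NOT NULL,
    bill_id INTEGER NOT NULL
);
CREATE INDEX high_billid_low ON bill_range (high, bill_id, low);
""")

plan = [row[-1] for row in con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, low, high, bill_id FROM bill_range "
    "WHERE low <= 4050 AND high >= 4050 "
    "ORDER BY bill_id LIMIT 1")]
print(plan)  # the index satisfies the query without touching the table
```

The plan reports "USING COVERING INDEX high_billid_low": every needed column comes straight out of the index B-tree.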
As to why your two queries performed differently: In the first, the query planner decided to scan your data in bill_id order looking for the first matching low/high pair. Possibly it decided that actually sorting a result set would likely be more expensive than scanning bill_ids in order. It looks to me like your second query did a table scan. Why that was faster, who knows?
Notice that this index would also work for you.
CREATE INDEX low_billid_high ON bill_range (low DESC, bill_id, high);
In InnoDB the table's PK id is implicitly part of every index, so there's no need to mention it in the compound index.
And, you can still write it the way you first wrote it; the query planner will figure out what you want.
Pro tip: Avoid SELECT * ... the * makes it harder to reason about the columns you need to retrieve.

mariadb (mysql) sub partition error (total sub partition count exceeds 64)

Hello,
I want to configure monthly partitions with day-by-day subpartitions.
If the total number of subpartitions exceeds 64, the table is not created, due to the error
'(errno: 168 "Unknown (generic) error from engine")'
(creating fewer than 64 succeeds).
I know that the maximum number of partitions (including subpartitions) that can be created is 8,192, so is there anything I missed?
Below is the log table.
create table detection_log
(
id bigint auto_increment,
detected_time datetime default '1970-01-01' not null,
malware_title varchar(255) null,
malware_category varchar(30) null,
user_name varchar(30) null,
department_path varchar(255) null,
PRIMARY KEY (detected_time, id),
INDEX `detection_log_id_uindex` (id),
INDEX `detection_log_malware_title_index` (malware_title),
INDEX `detection_log_malware_category_index` (malware_category),
INDEX `detection_log_user_name_index` (user_name),
INDEX `detection_log_department_path_index` (department_path)
);
SUBPARTITIONs provide no benefit that I know of.
HASH partitioning either provides no benefit or hurts performance.
So... Explain what you hoped to gain by partitioning; then we can discuss whether any type of partitioning is worth doing. Also, provide the likely SELECTs so we can discuss the optimal INDEXes. If you need a "two-dimensional" index, that might indicate a need for partitioning (but still not subpartitioning).
More
I see PRIMARY KEY(detected_time,id). This provides a very fast way to do
SELECT ...
WHERE detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
In fact, it will probably be faster than if you also partition the table. (As a general rule it is useless to partition on the first part of the PK.)
If you need to do
SELECT ...
WHERE user_id = 123
AND detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
Then this is optimal:
INDEX(user_id, detected_time, id)
Again, probably faster than any form of partitioning on any column(s).
And
A "point query" (WHERE key = 123) takes a few milliseconds more in a 1-billion-row table compared to a 1000-row table. Rarely is the difference important. The depth of the BTree (perhaps 5 levels vs 2 levels) is the main difference. If you PARTITION the table, you are removing perhaps 1 or 2 levels of the BTree, but replacing them with code to "prune" down to the desired partition. I claim that this tradeoff does not provide a performance benefit.
A "range query" is very nearly the same speed regardless of the table size. This is because the structure is actually a B+Tree, so it is very efficient to fetch the 'next' row.
Hence, the main goal in optimizing queries on a huge table is to take advantage of the characteristics of the B+Tree.
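The "few milliseconds more" claim is just the logarithm at work: B-tree depth grows with log(rows), so a million-fold increase in row count only adds a couple of levels. A rough sketch, with an assumed fanout (the real number depends on key size):

```python
import math

# Assumed keys per B-tree page; real fanout varies with key and page size.
fanout = 200
depths = {rows: math.ceil(math.log(rows) / math.log(fanout))
          for rows in (1_000, 1_000_000_000)}
print(depths)  # a million-fold more rows adds only a couple of tree levels
```

With fanout 200, a thousand rows fit in a 2-level tree while a billion rows need only about 4 levels, which is the whole difference a point query sees.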
Pagination
SELECT log.detected_time, log.user_name, log.department_path,
log.malware_category, log.malware_title
FROM detection_log as log
JOIN
(
SELECT id
FROM detection_log
WHERE user_name = 'param'
ORDER BY detected_time DESC
LIMIT 25 OFFSET 1000
) as temp ON temp.id = log.id;
The good part: Finding ids, then fetching the data.
The slow part: Using OFFSET.
Have this composite index: INDEX(user_name, detected_time, id) in that order. Make another index for when you use department_path.
Instead of OFFSET, "remember where you left off". A blog specifically about that: http://mysql.rjweb.org/doc.php/pagination
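"Remember where you left off" (keyset pagination) can be sketched with Python's sqlite3; the table is cut down to the relevant columns and the data is made up, but the query shape is the point: page 2 seeks past the last row already seen instead of re-reading and discarding an OFFSET's worth of rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE detection_log (
    id INTEGER PRIMARY KEY,
    user_name TEXT,
    detected_time TEXT
);
CREATE INDEX ix_user_time_id ON detection_log (user_name, detected_time, id);
""")
con.executemany(
    "INSERT INTO detection_log (user_name, detected_time) VALUES (?, ?)",
    [("param", "2023-01-%02d" % d) for d in range(1, 31)])

# Page 1: a plain LIMIT, newest first.
page1 = con.execute(
    "SELECT detected_time, id FROM detection_log "
    "WHERE user_name = 'param' "
    "ORDER BY detected_time DESC, id DESC LIMIT 25").fetchall()

# Page 2: seek past the last row seen instead of using OFFSET.
last_time, last_id = page1[-1]
page2 = con.execute(
    "SELECT detected_time, id FROM detection_log "
    "WHERE user_name = 'param' "
    "AND (detected_time < ? OR (detected_time = ? AND id < ?)) "
    "ORDER BY detected_time DESC, id DESC LIMIT 25",
    (last_time, last_time, last_id)).fetchall()

print(len(page1), len(page2))  # → 25 5
```

The (detected_time, id) pair passed back by the client is the bookmark; each page is then a cheap index seek regardless of how deep into the result set the user has scrolled.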
Purging
Deleting after a year is an excellent use of PARTITIONing. Use PARTITION BY RANGE(TO_DAYS(detected_time)) and have either ~55 weekly or ~15 monthly partitions. See http://mysql.rjweb.org/doc.php/partitionmaint for details. DROP PARTITION is immensely faster than DELETE. (This partitioning will not speed up SELECT.)

What is the difference between single or composite column indexes? [duplicate]

This question already has answers here:
When should I use a composite index?
(9 answers)
Closed 7 years ago.
In any relational database, we can create indexes that boost query speed. But creating more indexes can hurt update/insert speed, because the DB system has to update each index when new data comes in (insert, update, merge, etc.).
Let's use an example. We can create an index called index1:
ADD INDEX index1 (order_id ASC, buyer_id ASC)
Or we can create two indexes, index2 and index3:
ADD INDEX index2 (order_id ASC)
ADD INDEX index3 (buyer_id ASC)
In a query like this
select * from tablename where order_id>100 and buyer_id>100
Which one is faster: using index1, or index2 and index3?
On the other side of the equation, when inserting or updating, I assume it will be much faster to maintain just one index instead of two, but I haven't tested it against MySQL or SQL Server, so I can't be sure. If anyone has experience on that matter, please share it.
And the last thing is about int-typed values: I thought it's not possible, or not useful, to create an index just on int-typed columns because it doesn't boost query time. Is that true?
The performance of an index is linked to its selectivity. Whether to use two single-column indexes or one composite index must be assessed in the context of the application and of its particularly performance-critical queries: the more selective the WHERE conditions on the indexed fields, the more the index reduces the number of rows to process (and to feed into joins).
In your case, since an order usually has only one buyer, the index (order_id, buyer_id) is not very selective beyond its first column (though it is useful for join operations); the reverse, (buyer_id, order_id), would be more useful, as it facilitates searching for the orders of a buyer.
For the exact query you mentioned I would personally go for index1 (you will have a seek operation covering both conditions at once). The same index should also do the job even if you filter by order_id only, because order_id is the first column of the index, so the same BTREE structure still helps even if you omit the buyer.
At the same time, index1 would not help much if you filter by buyer_id only, because the BTREE is structured first by the missing order_id, as per the index creation statement. You will probably end up with an index scan when using index1, while having separate indexes would still work in that scenario (a seek on index3 is what should be expected).
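The seek-vs-scan distinction is easy to observe with Python's sqlite3 (a different engine than the question's, but the planner logic is analogous; the table and column names follow the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, buyer_id INTEGER)")
con.execute("CREATE INDEX index1 ON orders (order_id, buyer_id)")

def plan(where):
    # Ask the planner how it would run the query; no data is needed for this.
    return con.execute(
        "EXPLAIN QUERY PLAN SELECT order_id, buyer_id "
        "FROM orders WHERE " + where).fetchone()[-1]

p_leading = plan("order_id > 100")   # leading column: a SEARCH (seek)
p_trailing = plan("buyer_id > 100")  # trailing column only: a SCAN
print(p_leading)
print(p_trailing)
```

Filtering on the leading column produces a SEARCH (seek) on index1, while filtering only on the trailing column degrades to a SCAN, exactly the asymmetry described above.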

How to replace clustered index scan with a non-clustered index seek or clustered index seek?

Below is my create table script:-
CREATE TABLE [dbo].[PatientCharts](
[PatientChartId] [uniqueidentifier] ROWGUIDCOL NOT NULL,
[FacilityId] [uniqueidentifier] NOT NULL,
[VisitNumber] [varchar](200) NOT NULL,
[MRNNumber] [varchar](100) NULL,
[TimeIn] [time](7) NULL,
[TimeOut] [time](7) NULL,
[DateOfService] [date] NULL,
[DateOut] [date] NULL)
I have one clustered index on PatientChartId and two non-clustered index on VisitNumber and MRNNumber. This table has millions of records.
The following query is doing a clustered index scan:-
SELECT *
FROM dbo.PatientCharts
INNER JOIN ( SELECT FacilityID
FROM Facilities
WHERE RemoteClientDB IN (
SELECT SiteID
FROM RemoteClient WITH ( NOLOCK )
WHERE Code = 'IN-ESXI-EDISC14'
)
) AS Filter ON dbo.PatientCharts.FacilityId = Filter.FacilityID
This clustered index scan is taking a lot of time in production because of data volume.
The execution plan is :-
I have even tried adding a non-clustered index on FacilityId, including PatientChartId, but I still get the same execution plan.
I am doing DBCC FREEPROCCACHE every time, to instruct SQL Server to compile a new plan each time.
Is there anything else I should do to prevent the clustered index scan?
The clustered scan will occur since there is no index to support your query. Even if you index FacilityId and PatientChartId, you are still potentially asking for enough data to warrant a scan, due to going past the tipping point (Google "Kimberly Tripp tipping point").
There is no easy way to say the next part, but for a system with millions of records where such a trivial query is causing you a problem, you are going to have to become a lot more aware of indexing in general and of how the SQL plan engine behaves. I would recommend Kalen Delaney's SQL Server Internals, and if you search on here for book recommendations, there are questions with a number of good, solid answers.
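The scan-to-seek flip that a supporting index produces can be shown in a toy setup using Python's sqlite3 (not SQL Server, and the table is cut down to a few columns; on the real table the tipping point described above may still favor a scan):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PatientCharts (
    PatientChartId TEXT PRIMARY KEY,
    FacilityId     TEXT NOT NULL,
    VisitNumber    TEXT
)""")

q = "EXPLAIN QUERY PLAN SELECT * FROM PatientCharts WHERE FacilityId = 'f1'"
before = con.execute(q).fetchone()[-1]   # no supporting index: a full scan

con.execute("CREATE INDEX ix_facility ON PatientCharts (FacilityId)")
after = con.execute(q).fetchone()[-1]    # equality on the indexed column: a seek

print(before)
print(after)
```

Before the index the planner reports a SCAN; afterwards the same query becomes a SEARCH on ix_facility. Whether a real optimizer sticks with the seek then depends on how many rows the predicate actually returns.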
Have you tried implementing this as a straight query with inner joins instead of using subqueries for each step?
I would be happy to take a look at the resulting execution plan if you change the query to the following form:
SELECT * FROM dbo.PatientCharts AS pc
INNER JOIN dbo.Facilities AS f ON f.FacilityID = pc.FacilityId
INNER JOIN dbo.RemoteClient AS rc ON rc.SiteID = f.RemoteClientDB
WHERE rc.Code = 'IN-ESXI-EDISC14';
I think the optimizer will choose the correct indexes once you get rid of the subqueries. Try it and share the execution plan.
Also, on another note: do you need all fields in the result set? You might benefit from switching to specific columns instead of * in the select list.
I hope this helps.
As Andrew mentioned, your clustered index isn't helping you or hurting you here- if you didn't have the clustered index, you'd see a table scan instead (which I assure you would be no more fun than the clustered index scan).
Assuming that this is the most important query on this table, I'd say that you should change the table design so that the clustered index is on FacilityID instead. That would be dramatically faster.
I think you should avoid doing a SELECT * and specify the columns which you require. Then you can plan your indexes based on the execution plan you get.