Index on a bit column - sql-server-2008

Consider a table with a bit column indicating whether the object is active or inactive where the majority of the items are inactive (closed).
My understanding was that because of the limited number of distinct values for this column (2), the SQL Engine found it more efficient to perform a table scan to find the open items rather than attempt to index over a bit column.
SQL 2008 has a new feature that allows filters on an index. Without knowing much about its internals, I would assume that the index contains a reference to a record only if it meets the filter criteria, and that this approach would provide an efficient means of retrieving all of the active records without having to resort to splitting the active records into separate tables or partitions.
I used to place the primary keys of the open records into a table that identified the active records, and then join the main table to this "active list" table to return only the active records.
Is there any reason why using a filtered index for this purpose would not be appropriate in this situation?

A filtered index on an Active bit field is a valid choice.
You will probably want to add a specific UPDATE STATISTICS for filtered indexes (especially on volatile data) that specifically updates their stats with a FULLSCAN. The good news is that they are likely to be smaller indexes (and therefore easier/less-costly statistics to update).
This is because the update statistics threshold is based on the underlying column rather than the filtered index values only.
Ref.
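As a rough illustration of the idea in the question, here is a minimal sketch that uses SQLite's partial index as a stand-in for a SQL Server filtered index (the SQL Server equivalent would be a CREATE INDEX statement with a WHERE Active = 1 clause). The table name items, the index name ix_items_active and the data are assumptions made up for the demo, and SQLite's planner and statistics behave differently from SQL Server's, so this only shows the shape of the technique.

```python
import sqlite3

def build_demo():
    """Create an in-memory table where most rows are inactive (closed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, active INTEGER NOT NULL)"
    )
    # 1,000 hypothetical rows; only every 100th row is active.
    rows = [(i, f"item-{i}", 1 if i % 100 == 0 else 0) for i in range(1, 1001)]
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
    # Partial index: only rows that satisfy the WHERE clause get an index entry.
    conn.execute("CREATE INDEX ix_items_active ON items (active) WHERE active = 1")
    return conn

def active_ids(conn):
    """Return the ids of the active rows."""
    sql = "SELECT id FROM items WHERE active = 1 ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

conn = build_demo()
print(active_ids(conn))
# Show how SQLite plans the query (the wording varies between SQLite versions).
for row in conn.execute("EXPLAIN QUERY PLAN SELECT id FROM items WHERE active = 1"):
    print(row[3])
```

The index only carries entries for the ten active rows, which is why it stays small compared with an index over the whole column.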

Related

Is there a benefit to index mysql column if I always have different value in every row?

The question is about columns like a timestamp, where a different value is stored in every row.
I have already searched Stack Overflow and read about indexes, but I don't understand the benefit if no value equals another. The index cardinality will equal the number of rows, so what is the benefit?
This kind of column would actually be an excellent candidate for an index, preferably a unique one.
Tables are unsorted sets of data, so without any knowledge about the table, the database will have to go over the entire table sequentially to find the rows you're looking for (O(n) complexity, where n is the number of rows).
An index is essentially a tree that stores values in sorted order, which allows the database to intelligently find the rows you're looking for (O(log n)). In addition, making the index unique tells the database there can be only one row per timestamp value, so once a single row is retrieved the database can stop searching for more.
The performance benefit for such an index, assuming you search for rows according to timestamps, should be significant.
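To make the scan-versus-search difference concrete, here is a small sketch using SQLite through Python's sqlite3 module. The question is about MySQL, so treat this as an analogy only: the events table, the ux_events_ts index name and the integer timestamps are hypothetical, and the plan text shown by EXPLAIN QUERY PLAN is SQLite-specific.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER NOT NULL, payload TEXT)")
# Every row gets a different timestamp value.
conn.executemany(
    "INSERT INTO events (ts, payload) VALUES (?, ?)",
    [(1_700_000_000 + i, f"event-{i}") for i in range(10_000)],
)

query = "SELECT payload FROM events WHERE ts = ?"
target = (1_700_000_123,)

before = plan(conn, query, target)   # no index on ts yet: full table scan
conn.execute("CREATE UNIQUE INDEX ux_events_ts ON events (ts)")
after = plan(conn, query, target)    # unique index available: index search

print("without index:    ", before)
print("with unique index:", after)
print(conn.execute(query, target).fetchone())
```

The same query returns the same row either way; only the access path changes.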
An index is a map between key values and retrieval pointers. The DBMS uses an index during a query if a strategy that uses the index appears to be optimal.
If the index never gets used, then it is useless.
Indexes can speed up lookups based on a single key value, or based on a range of key values (depending on the index type), or by allowing index-only retrieval in cases where only the key is needed for the query. Speedups can be as low as two to one or as high as a hundred to one, depending on the size of the table and various other factors.
If your timestamp field is never used in the WHERE clause or the ON clause of a query, the chances are you are better off with no index. The art of choosing indexes well goes a lot deeper than this, but this is a start.

Finding records not updated in last k days efficiently

I have a table which contains the records of the last n days. There are around 100 million records in this table. I need to find the records which have not been updated in the last k days.
My solution to this problem is
Partition the table on k1. Index the timestamp column. Then, instead of updating the timestamp (so that the index is not rebuilt), perform a remove + insert. By doing this, I think the query to find the records not updated in the last k days will be fast.
Is there any other better way to optimize these operations?
For example,
Suppose we have many users and each user can use different products. A user can also start using (become the owner of) new products at any time. If a user does not use a product for n days, his ownership expires. We need to find all the products for a user which he has not used in the last k days. The number of users is on the order of 10,000, and the number of products he can choose from is on the order of 100,000.
I modeled this problem using a table with schema (user_id, product_id, last_used), where product_id is the id of the product the user is using. Whenever a user uses the product, last_used is updated. A user's ownership of a product expires if the user has not used it for n days. I partitioned the table on user_id and indexed last_used (timestamp). Instead of updating, I performed delete + create. I did the partitioning and indexing to optimize the query that fetches records not updated in the last k days for a user.
Is there a better way to solve this problem?
You have said you need to "find" and, I think "expire" the records belonging to a particular user after a certain number of days.
Look, this can be done even in a large table with good indexing without too much trouble. I promise you, partitioning the table will be a lot of trouble. You have asserted that it's too expensive in your application to carry an index on your last_used column because of updates. But, considering the initial and ongoing expense of maintaining a partitioned table, I strongly suggest you prove that assertion first. You may be wrong about the cost of maintaining indexes.
(Updating one row with a column that's indexed doesn't rebuild the index, it modifies it. The MySQL storage engine developers have optimized that use case, I promise you.)
As I am sure you know, this query will retrieve old records for a particular user.
SELECT product_id
FROM tbl
WHERE user_id = <<<chosen user>>>
AND last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
This will yield your list of products. It will work very efficiently indeed if you have a compound covering index on (user_id, last_used, product_id). If you don't know what a compound covering index is, you really should find out using your favorite search engine. This one will random-access the particular user and then do a range scan on the last_used date. It will then return the product ids from the index.
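For readers who have not met a compound covering index before, the following sketch shows one in action. It runs against SQLite rather than MySQL so that it is self-contained; the index name ix_user_last_used, the sample rows and the ISO-formatted date strings are assumptions for illustration, and the cutoff date is passed in as a parameter instead of being computed from the current date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (user_id INTEGER, product_id INTEGER, last_used TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, 10, "2024-01-01"),
    (1, 11, "2024-03-01"),
    (1, 12, "2024-01-15"),
    (2, 10, "2024-01-02"),
])
# Equality column first, range column second, selected column last,
# so the whole query can be answered from the index alone.
conn.execute("CREATE INDEX ix_user_last_used ON tbl (user_id, last_used, product_id)")

STALE_SQL = (
    "SELECT product_id FROM tbl "
    "WHERE user_id = ? AND last_used <= ? ORDER BY last_used"
)

def stale_products(conn, user_id, cutoff):
    """Products the user has not used since the cutoff date (inclusive)."""
    return [row[0] for row in conn.execute(STALE_SQL, (user_id, cutoff))]

def stale_plan(conn, user_id, cutoff):
    """SQLite's description of the access path for the query above."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + STALE_SQL, (user_id, cutoff))
    return " | ".join(row[3] for row in rows)

print(stale_products(conn, 1, "2024-02-01"))
print(stale_plan(conn, 1, "2024-02-01"))
```

The column order matters: putting last_used before user_id would force a range scan across all users instead of one user's slice of the index.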
If you want to get rid of all old records, I suggest you write a host program that repeats the query below in a loop until it processes zero rows. Run this at an off-peak time in your application. The LIMIT clause will prevent each individual query from taking too long and interfering with other uses of the table. For the sake of speed on this query, you'll need an index on last_used.
DELETE FROM tbl
WHERE last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
LIMIT 500
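The "host program" loop could look something like the sketch below. It uses SQLite so it can run anywhere, and since stock SQLite builds usually do not accept DELETE ... LIMIT, the batch is selected through a rowid subquery instead; against MySQL you would issue the DELETE ... LIMIT statement above directly. The function name purge_old, the index name and the sample data are hypothetical.

```python
import sqlite3

def purge_old(conn, cutoff, batch_size=500):
    """Delete rows with last_used <= cutoff in small batches.

    Returns (rows_deleted, batches_run).
    """
    deleted = batches = 0
    while True:
        cur = conn.execute(
            "DELETE FROM tbl WHERE rowid IN ("
            "  SELECT rowid FROM tbl WHERE last_used <= ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # keep each batch in its own short transaction
        if cur.rowcount == 0:
            break      # nothing left to delete
        deleted += cur.rowcount
        batches += 1
    return deleted, batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (user_id INTEGER, product_id INTEGER, last_used TEXT)")
conn.execute("CREATE INDEX ix_last_used ON tbl (last_used)")
# 1,200 stale rows and 300 fresh rows.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(i % 10, i, "2023-12-31") for i in range(1200)]
    + [(i % 10, i, "2024-06-01") for i in range(300)],
)
conn.commit()

result = purge_old(conn, "2024-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
print(result, remaining)
```

In a real job you would probably also sleep briefly between batches so other sessions get a chance at the table.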
I hope this helps. It comes from someone who's made the costly mistake of trying to partition something that didn't need partitioning.
MySQL doesn't "rebuild" indexes (not completely) when you modify an indexed value. In fact, it doesn't even reorder the records. It just moves the record to the proper 16KB page.
Within a page, the records are in the order they were added. If you inserted in order, then they're in order, otherwise, they're not.
So, when they say that MySQL's clustered indexes are in physical order, it's only true down to the page level, but not within the page.
Clustered indexes still get the benefit that the row data is on the same page as the index, so no further lookup is needed if the row data is small enough to fit in the pages. Reading is faster, but restructuring is slower because you have to move the data with the index. Secondary indexes are much faster to update, but to actually retrieve the data (with the exception of covering indexes), a further lookup must be made to retrieve the actual data via the primary key that the secondary index yields.
Example
Page 1 might hold user records for people whose last name starts with A through B. Page 2 might hold names C through D, etc. If Bob renames himself Chuck, his record just gets copied over from page 1 to page 2. His record will always be put at the end of page 2. The keys are kept sorted, but not the data they point to.
If the page becomes full, MySQL will split the page. In this case, assuming even distribution between C and D, page 1 will be A through B, page 2 will be C, and page 3 will be D.
When a record is deleted, the space is compacted, and if the page becomes less than half full, MySQL will merge neighboring pages and possibly free up a page in between.
All of these changes are buffered, and MySQL does the actual writes when it's not busy.
The example works the same for both clustered (primary) and secondary indexes, but remember that with a clustered index, the keys point to the actual table data, whereas with a secondary index, the keys point to a value equal to the primary key.
Summary
After a while, page splitting caused by random inserts will cause the pages to become noncontiguous on disk. The table will become "fragmented". Optimizing the table (rebuilding the table/index) fixes this.
There would be no benefit in deleting then reinserting the record. In fact, you'll just be adding transactional overhead. Let MySQL handle updating the index for you.
Now that you understand indexes a bit more, perhaps you can make a better decision of how to optimize your database.

Insert Performance with Auto Incrementing Column

If I have a table that has a PK that is an auto incrementing column and my table has one single index that only contains that column, what is the performance impact on that vs having no key/index at all?
I am curious about the impact on both MySQL and MS SQL - if there is a difference.
Update
Just to clarify, my table has other columns, but none of them are keys/indexes. I am only concerned right now about insert performance.
Under MySQL with InnoDB if you do not specify a PK or have no UK then it will automatically create a hidden column that is very similar to an auto-increment column which is used for the clustering of the table (see the corresponding documentation). So no matter whether you define one explicitly or not there will be some kind of auto-incremented column with an index. So performance will not be negatively impacted in adding one explicitly (in both cases inserting will be done according to a clustering index), quite the opposite, if you query on that column, it's the fastest possible access path in MySQL.
From what I could gather about MS SQL Server, it provides a non-clustered option, i.e. a way to specify a table without a PK and without organizing it by any index. In that case there will be a certain overhead associated with specifying an auto-increment column as a PK, as there's additional data. And depending on whether you specify clustering or not, it will either have to insert data according to the index, as MySQL does (clustered), or can just put data at the end (non-clustered).
"Index-Organized Tables and Clustered Indexes" goes into much more detail about that kind of stuff for various databases.
Indexing is a way of paging the data contained within your table. Think of indexes like bookmarks in a phonebook labelled A-Z: if you want to find someone with a name starting with 'G', you'll find it much quicker and more efficient to use the bookmarks to jump to G than if you had to scan the whole phonebook with your eyes.
Indexing works just like this
SELECT * FROM [TABLE]
WHERE [COLUMN1] = 'G'
Without an index, the SQL Engine will search the entire table and eliminate every row that doesn't meet the WHERE condition. With an index, the SQL Query Optimiser will SEEK out only the rows with the corresponding search condition. You'll likely find this noticeable on bigger tables.

What is difference between INDEX and VIEW in MySQL

Which one is faster, an index or a view? Both are used for optimization, and both are implemented on a table's columns. Can anyone explain which one is faster, what the difference between them is, and in which scenarios we should use a view or an index?
VIEW
A view is a logical table. It is a database object that stores a query rather than the data itself; a view just refers to data that is stored in base tables.
A view is a logical entity. It is a SQL statement stored in the database in the system tablespace. Data for a view is built in a table created by the database engine in the TEMP tablespace.
INDEX
Indexes are pointers that map to the physical address of data. So by using indexes, data retrieval becomes faster.
An index is a performance-tuning method of allowing faster retrieval of records. An index creates an entry for each value that appears in the indexed columns.
ANALOGY:
Suppose in a shop, assume you have multiple racks. Categorizing each rack based on the items saved is like creating an index. So, you would know where exactly to look for to find a particular item. This is indexing.
In the same shop, you want to know multiple data, say, the Products, inventory, Sales data and stuff as a consolidated report, then it can be compared to a view.
Hope this analogy explains when you have to use a view and when you have to use an index!
Both are different things in the perspective of SQL.
VIEWS
A view is nothing more than a SQL statement that is stored in the database with an associated name. A view is actually a composition of a table in the form of a predefined SQL query.
A view can contain all rows of a table or selected rows from a table. A view can be created from one or many tables, depending on the SQL query written to create the view.
Views, which are a kind of virtual table, allow users to do the following:
Structure data in a way that users or classes of users find natural or intuitive.
Restrict access to the data such that a user can see and (sometimes) modify exactly what they need and no more.
Summarize data from various tables which can be used to generate reports.
INDEXES
Indexes are special lookup tables that the database search engine can use to speed up data retrieval. Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book.
For example, if you want to reference all pages in a book that discuss a certain topic, you first refer to the index, which lists all topics alphabetically, and are then referred to one or more specific page numbers.
An index helps speed up SELECT queries and WHERE clauses, but it slows down data input, with UPDATE and INSERT statements. Indexes can be created or dropped with no effect on the data.
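The difference is easy to see in a few lines. The sketch below uses SQLite via Python (the thread is about MySQL, but plain CREATE VIEW and CREATE INDEX behave the same way for this purpose); the products table, the furniture view and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO products (name, category, price) VALUES (?, ?, ?)",
    [("pen", "office", 1.5), ("desk", "furniture", 120.0), ("chair", "furniture", 80.0)],
)

# A view is a named, stored SELECT; it holds no rows of its own.
conn.execute(
    "CREATE VIEW furniture AS "
    "SELECT name, price FROM products WHERE category = 'furniture'"
)
# An index is a separate lookup structure built on a base-table column.
conn.execute("CREATE INDEX ix_products_category ON products (category)")

def furniture_names(conn):
    """Query the view exactly as if it were a table."""
    return [row[0] for row in conn.execute("SELECT name FROM furniture ORDER BY name")]

before = furniture_names(conn)
conn.execute("INSERT INTO products (name, category, price) VALUES ('shelf', 'furniture', 45.0)")
after_insert = furniture_names(conn)   # the view sees the new base-table row
conn.execute("DROP INDEX ix_products_category")
after_drop = furniture_names(conn)     # dropping the index changes no data

print(before, after_insert, after_drop)
```

So the two are not alternatives: the view changes what a query looks like, while the index changes how fast the underlying table can be searched.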
view:
1) A view is one of the database objects.
A view contains the logical data of a base table, whereas the base table holds the actual (physical) data. Put another way, a view is like a window through which data from a table can be viewed or changed.
2) It is simply a stored SQL statement with an object name. It can be used in any SELECT statement like a table.
index:
1) Indexes are created on columns. Using indexes makes fetching rows faster.
2) It is a way of cataloging the table info based on one or more columns. One table may have one or more indexes. Indexes are like a 2-D structure holding the ROWID and the (ordered) indexed column. When table data is retrieved based on this column (a column used in the WHERE clause), the index comes into play automatically and its pointers locate the required ROWIDs. These ROWIDs are then matched with the actual table's ROWIDs, and the records from the table are returned.

When should my indexes have the active column?

I have several tables and I'm wondering if my composite index is helpful or not. I am using MySQL 5+ but I guess this would apply to any database (or not?).
Anyway, say I have the following table:
username       active
-----------------------------------
Moe.Howard     1
Larry.Fine     0
Shemp.Howard   1
So I normally select like:
select * from users where username = 'shemp.howard' and active = 1;
The active=1 condition is used in many of our tables. Normally, my index would be on the username column, but I'm thinking of adding the active flag as well (to the same index).
My logic is that as the query engine is scanning through the index, it would be scanning against an index like:
moe.howard,1
shemp.howard,1
larry.fine,0
and find Shemp before it hits the inactive users (Larry).
Now, our active columns are usually TINYINTS and Unsigned. But I'm concerned the index might be backward!
larry.fine,0
moe.howard,1
shemp.howard,1
How should I best handle this and make sure my indexes are correct? Should I not add the active column to the same index as username? Or should I create a separate index for the active and make it descending?
Thanks.
If you combine those two fields in a composite index with the active flag as the second part of the key, then the index order will only depend on that value when (iff) the name field for two or more rows is identical (which seems unlikely in this situation based on the assumption that one would want user names in a system to be unique). The first key in the composite index will define the order of the keys whenever they are different. In other words, if the user name is unique, then adding the active flag as the second segment of a composite index will not change the order of the index.
Also, note that for the example query, the database won't "scan" the index to find the value. Rather it will seek to the first matching entry, which in the example given consists of a single match. The "scan" would happen if multiple entries pass the WHERE clause.
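A quick way to convince yourself of both points is to model the composite index as a sorted list of (username, active) pairs. This is only an analogy, assuming lowercase usernames as in the query: a sorted Python list with binary search stands in for the B-tree, and the seek helper is a made-up name.

```python
from bisect import bisect_left

# The three rows from the question, in arbitrary table order.
rows = [("moe.howard", 1), ("larry.fine", 0), ("shemp.howard", 1)]

# A composite key compares username first and active only on ties,
# which is how Python compares tuples as well.
index_entries = sorted(rows)

def seek(entries, username, active):
    """Binary-search for an exact (username, active) key.

    Returns the position of the entry, or None if there is no match.
    """
    key = (username, active)
    pos = bisect_left(entries, key)
    if pos < len(entries) and entries[pos] == key:
        return pos
    return None

print(index_entries)
print(seek(index_entries, "shemp.howard", 1))
print(seek(index_entries, "larry.fine", 1))
```

With unique usernames the active flag never gets to influence the ordering, and the lookup jumps straight to the candidate entry instead of walking past Larry or Moe.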
Having said that, unless there are lots of cases where you have duplicate names, my initial reaction would be to not create the composite key. If the names are "generally" unique, then you would not be buying a lot of savings with the composite key. On the other hand, if there are generally quite a few duplicate names with differing active flag values, it could help. At that point, you may need to just test.
Really, we can only second-guess what the query optimiser will try to do. However, it is commonly recommended that if the selectivity of an index is over 20%, a full table scan is preferable to index access. This means it is very likely that even if you index active, the index won't actually be used, assuming you have many more active than non-active users.
MySQL can only use the index in order, so if you create a composite index of username,active that is entirely pointless as you're not going to have multiple users with the same username.
You really need to analyse your query requirements, and then you can design an indexing plan to suit them. Profile each query and don't try to over-optimize everything, as this can have a negative result.
An index should be added only if the values you expect it to help you filter in/out are representative, statistically speaking.
What does that mean?
If, say, the filter in your WHERE clause on the column you're indexing narrows the result down to about 20% of the rows, you should add an index on it. The exact percentage depends on your particular case and should be tried out, but that's the idea.
In your case, just by the name, you would have 100% of exclusion. Adding an index on the active column would be then useless because it wouldn't help reducing the final recordset (except if you have possibly n times the same name but only one active?)
The situation would be different if you decided to filter ONLY active users, not caring about the name.
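To put numbers on this, here is a tiny sketch that computes the fraction of rows each filter keeps. The data set is hypothetical (1,000 users with unique names, 900 of them active), the selectivity helper is made up for the example, and the 20% figure is the rule of thumb quoted in the answers, not a hard limit.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate (0.0 = none, 1.0 = all)."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)

# Hypothetical data: unique usernames, 90% of users active.
users = [{"username": f"user{i}", "active": 1 if i < 900 else 0} for i in range(1000)]

by_name = selectivity(users, lambda u: u["username"] == "user42")
by_active = selectivity(users, lambda u: u["active"] == 1)
by_inactive = selectivity(users, lambda u: u["active"] == 0)

RULE_OF_THUMB = 0.20  # threshold mentioned in the answers; tune for your own case

for label, value in [("username", by_name), ("active = 1", by_active), ("active = 0", by_inactive)]:
    verdict = "index likely helps" if value <= RULE_OF_THUMB else "scan likely preferred"
    print(f"{label}: {value:.3f} -> {verdict}")
```

On this data the username filter already reduces the result to a single row, so adding active to that index buys nothing; an index on active alone would only be attractive for the rare value.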