How to decide which fields must be indexed in a database table - mysql

Explanation
I have a table which does not have a primary key (or not even a composite key).
The table is for storing the time slots (opening hours and food delivery available hours) of the food shops. Let's call the table "business_hours" and the main fields are as below.
shop_id
day (0 - 6, means Sunday - Saturday)
type (open, delivery)
start_time
end_time
As an example, if shop A is opened on Monday from 9.00am - 01.00pm and 05.00pm to 10.00pm, there will be two records in business_hours table for this scenario.
| shop_id | day | type | start_time | end_time |
+---------+-----+------+------------+----------+
| 1000    | 1   | open | 09:00:00   | 13:00:00 |
| 1000    | 1   | open | 17:00:00   | 22:00:00 |
When I query this table, I will use shop_id always as the first condition in where clause.
Ex:
SELECT COUNT(*) FROM business_hours WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
Question
Is an index on "shop_id" enough, or should the "day" and "type" fields also be indexed?
It would also help if you could explain how indexing really works.

It depends on several factors that you should specify:
How fast will the data grow
What is the estimated table size in rows
What queries will be run against that table
How fast do you expect the queries to run
Think of it this way: if some service makes thousands of inserts per hour, old records are archived nightly, and reports are generated nightly from that table, you may prefer not to create many indexes, since they slow down inserts.
On the other hand if your table will grow and change slowly and many users will run queries against it, you need to have proper indexes to speed up queries.
If you can, try to create a clustered unique primary key that most queries can benefit from. If your data forms a timeline and most queries fetch ranges of data by datetime criteria (from/to), it is better to include the datetime in the clustered index; that gives the fastest query performance.
So something like this will give you the best performance for the SELECT you mentioned. (But you then cannot store duplicate business hours for the same shop, day and type.)
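CREATE TABLE Business_hours
( shop_id INT NOT NULL
, day INT NOT NULL
, type VARCHAR(10) NOT NULL -- column types from here down are assumed
, start_time TIME NOT NULL
, end_time TIME NOT NULL
-- other columns
, CONSTRAINT Business_hours_PK
PRIMARY KEY (shop_id, day, type, start_time, end_time) -- your clustered index
)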
Just creating an index on the fields used in the SELECT (all of them, or just the most frequently used ones) will speed up your query too:
CREATE INDEX BusinessHours_IX ON business_hours (shop_id,day,type, start_time, end_time);
The difference between clustered and non-clustered indexes is that a clustered index determines the order in which the records are stored on disk.
You can use EXPLAIN to find missing indexes in your database; see this answer.
For more detail, see this blog.
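As a rough way to see what the composite index changes, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. That substitution is an assumption made only so the example is self-contained: SQLite's EXPLAIN QUERY PLAN output differs from MySQL's EXPLAIN, and SQLite does not have the clustered/non-clustered distinction described above. The helper name query_plan is hypothetical.

```python
import sqlite3

QUERY = """
SELECT COUNT(*) FROM business_hours
WHERE shop_id = 1000 AND day = 1 AND type = 'open'
  AND start_time <= '13:29:00' AND end_time > '13:29:00'
"""

def query_plan(conn, sql):
    """Return the human-readable steps of SQLite's plan for `sql`."""
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("""
    CREATE TABLE business_hours (
        shop_id INTEGER NOT NULL,
        day INTEGER NOT NULL,
        type TEXT NOT NULL,
        start_time TEXT NOT NULL,
        end_time TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO business_hours VALUES (?, ?, ?, ?, ?)",
    [(1000, 1, "open", "09:00:00", "13:00:00"),
     (1000, 1, "open", "17:00:00", "22:00:00")],
)

# Without any index the engine has to read the whole table.
plan_without_index = query_plan(conn, QUERY)

conn.execute(
    "CREATE INDEX BusinessHours_IX ON business_hours "
    "(shop_id, day, type, start_time, end_time)"
)
# With the composite index the engine can search the index instead.
plan_with_index = query_plan(conn, QUERY)

print(plan_without_index)
print(plan_with_index)
```

The first plan should report a scan of the table and the second a search that names BusinessHours_IX. On MySQL you would compare the EXPLAIN output before and after creating the index in the same way.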

Yes, you can create a clustered index on the columns (shop_id, day, type). I created an index like this (SQL Server syntax):
CREATE CLUSTERED INDEX Ix ON business_hours (shop_id, day, type)
Then use the index in your SELECT query with a table hint:
SELECT COUNT(*) FROM business_hours WITH (INDEX (Ix)) WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
You will get the result fast. However, if the table already has a primary key, do not create a clustered index; create a nonclustered index instead.

It depends on your usage. If you are not updating the records, use a clustered index:
CREATE CLUSTERED INDEX Saleperday ON business_hours (shop_id,day,type);
A clustered index is traversed along the B-tree and stores the entire row in the leaf node itself, so searching is fast. Updating records is costly, however, because the entire row is moved and a new entry is created for the same record.
Otherwise, if you are updating the records, use a nonclustered index.
If you are building a data warehouse, use columnstore indexes.
For a better understanding, see these links:
http://www.programmerinterview.com/index.php/database-sql/clustered-vs-non-clustered-index/
http://www.patrickkeisler.com/2014/04/what-is-non-clustered-columnstore-index.html
http://searchsqlserver.techtarget.com/feature/SQL-Server-2014-columnstore-index-the-good-the-bad-and-the-clustered

Having decided against a primary key means the following would be allowed:
| shop_id | day | type | start_time | end_time
+---------+-----+--------+------------+---------
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 17:00:00 | 22:00:00
| 1000 | 1 | closed | 17:00:00 | 22:00:00
So you can have duplicate entries that may lead to strange query results and even have a shop open and closed in the very same time range. (But well, we all know that even with a primary key you'd still need a before-insert trigger to detect a range overlapping, e.g. 12:00-15:00 vs. 13:00-16:00, and throw an error in case. - How I wish there were some built-in range detection, so we could, say, have a unique index on (shop_id, day, range(start_time, end_time)).)
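The before-insert check mentioned above could look roughly like the following sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made to keep the example self-contained; a MySQL trigger would raise the error with SIGNAL rather than SQLite's RAISE). The trigger name business_hours_no_overlap and the helper try_insert are hypothetical. Two ranges are treated as overlapping when each one starts before the other ends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
    CREATE TABLE business_hours (
        shop_id INTEGER NOT NULL,
        day INTEGER NOT NULL,
        type TEXT NOT NULL,
        start_time TEXT NOT NULL,   -- 'HH:MM:SS' strings compare in time order
        end_time TEXT NOT NULL
    );

    -- Reject a new slot if it overlaps an existing slot of the same
    -- shop, day and type.
    CREATE TRIGGER business_hours_no_overlap
    BEFORE INSERT ON business_hours
    BEGIN
        SELECT RAISE(ABORT, 'overlapping time range')
        WHERE EXISTS (
            SELECT 1 FROM business_hours
            WHERE shop_id = NEW.shop_id
              AND day = NEW.day
              AND type = NEW.type
              AND start_time < NEW.end_time
              AND end_time > NEW.start_time
        );
    END;
""")

def try_insert(slot):
    """Insert one slot; return True if accepted, False if the trigger rejected it."""
    try:
        conn.execute("INSERT INTO business_hours VALUES (?, ?, ?, ?, ?)", slot)
        return True
    except sqlite3.DatabaseError:
        return False

accepted_first = try_insert((1000, 1, "open", "12:00:00", "15:00:00"))
accepted_overlap = try_insert((1000, 1, "open", "13:00:00", "16:00:00"))
accepted_later = try_insert((1000, 1, "open", "17:00:00", "22:00:00"))
print(accepted_first, accepted_overlap, accepted_later)  # True False True
```

With this condition an exact duplicate row also counts as an overlap and is rejected, while a slot that starts exactly when another ends is allowed.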
As to your question: Provided your database is built well, you already have a foreign key on shop_id. You don't need any further index as long as you consider your queries fast enough.
Once you think you need to speed them up, you can add composite indexes as needed. That would usually be an index on all columns in the slow query's WHERE clause. If that still doesn't suffice add the columns that are in the GROUP BY clause, if any. Next step would be to add the columns of the HAVING clause, if any. Next would be the columns of the ORDER BY clause. And the last step would be to even add all columns in your SELECT clause, which would give you a covering index, i.e. all data needed for the query would be in the index and the table itself would hence not have to be accessed any longer.
But as mentioned: As long as you don't have performance issues, you don't have to add any composite indexes.
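To illustrate the covering-index step described above, here is a small sketch using Python's built-in sqlite3 module in place of MySQL (an assumption for the sake of a self-contained example; on MySQL the corresponding signal is "Using index" in the Extra column of EXPLAIN). The helper plan_for_index and the index name ix are hypothetical.

```python
import sqlite3

QUERY = "SELECT start_time, end_time FROM business_hours WHERE shop_id = 1000"

def plan_for_index(index_columns):
    """Build a fresh table with one index and return the plan steps for QUERY."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
    conn.execute(
        "CREATE TABLE business_hours ("
        "shop_id INTEGER, day INTEGER, type TEXT, "
        "start_time TEXT, end_time TEXT)"
    )
    conn.execute(f"CREATE INDEX ix ON business_hours ({index_columns})")
    steps = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return steps

# Index on the WHERE column only: the table must still be read for the times.
plan_where_only = plan_for_index("shop_id")
# Index that also carries the selected columns: the index alone answers the query.
plan_covering = plan_for_index("shop_id, start_time, end_time")

print(plan_where_only)
print(plan_covering)
```

Only the second plan should mention a covering index, which is the point at which the table itself no longer needs to be accessed.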

To decide which fields must be indexed in a database table, you need to observe the behavior of each query sent to the table. Indexes provide an efficient access path between the application and the data, so when a query asks the database for data, the database knows where to go to retrieve it.
Here is some official Microsoft documentation:
Clustered Indexes
A clustered index stores the actual table data pages at the leaf level, and the table data is ordered physically around the key. A table can have only one clustered index, and when this index is created, the following events also occur:
• Table data is rearranged.
• New index pages are created.
• All nonclustered indexes within the database are rebuilt.
As a result, there are many disk I/O operations and extensive use of system and memory resources. If you plan to create a clustered index, be sure you have free space equal to at least 1.5 times the amount of data in the table. The extra free space ensures that you have enough space to complete the operation efficiently.
Nonclustered Indexes
In a nonclustered index, pages at the leaf level contain a bookmark that tells SQL Server where to find the data row corresponding to the key in the index. If the table has a clustered index, the bookmark indicates the clustered index key. If the table does not have a clustered index, the bookmark is an actual row locator. When you create a nonclustered index, SQL Server creates the required index pages but does not rearrange table data.
The indexing method recommended by professionals comprises three phases: monitor, analyze, and then implement the index. That means you need to observe the behavior of your database when you run a query, then work to get the best performance.
SQL Server uses these operations to fetch data:
Table scan: Reads the entire heap and, most likely, passes all the data to a secondary filter operation.
Index scan: Reads the entire leaf level (every row) of the clustered index or non-clustered index. The index scan operation might filter the rows and return only those rows that meet the criteria, or it might pass all the rows to another filter operation depending on the complexity of the criteria. The data may or may not be ordered.
Index seek: Locates specific row(s) using the index and returns only the selected rows in an ordered list.
So, once you know that, you can run the query with the Display Estimated Execution Plan option and analyze the performance.
I recommend reading this post SQL SERVER – Index Seek Vs. Index Scan and Optimizing Your Query Plans with the SQL
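To give a feel for the difference between a scan and a seek, here is a purely conceptual sketch in Python. It is not how SQL Server or MySQL implement their indexes; a sorted list searched with the standard bisect module stands in for the ordered leaf level of an index, and all names (rows, table_scan, build_index, index_seek) are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical heap of rows: (shop_id, day) in arbitrary order.
rows = [(1003, 2), (1000, 1), (1002, 5), (1000, 3), (1001, 0)]

def table_scan(rows, shop_id):
    """Examine every row and keep the matches (work grows with table size)."""
    return [i for i, row in enumerate(rows) if row[0] == shop_id]

def build_index(rows):
    """Sorted (key, row position) pairs, standing in for an index's leaf level."""
    return sorted((row[0], i) for i, row in enumerate(rows))

def index_seek(index, shop_id):
    """Binary-search to the first matching key, then read only the matching run."""
    lo = bisect_left(index, (shop_id, -1))
    hi = bisect_right(index, (shop_id, len(index)))
    return [pos for _, pos in index[lo:hi]]

index = build_index(rows)
print(table_scan(rows, 1000))   # [1, 3]
print(index_seek(index, 1000))  # [1, 3]
```

Both calls find the same rows; the scan touches all five entries, while the seek only needs a logarithmic number of comparisons plus the matching entries.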

Related

Optimize SQL to fetch 1 day data

I need to fetch the last 24 hours of data, and this query runs frequently.
Since it scans many rows, running it frequently affects database performance.
MySQL's execution strategy picks the index on created_at, which returns approximately 100,000 rows; these rows are then scanned one by one to filter on customer_id = 10, and my final result has 20,000 rows.
How can I optimize this query?
explain SELECT *
FROM `order`
WHERE customer_id = 10
and `created_at` >= NOW() - INTERVAL 1 DAY;
id : 1
select_type : SIMPLE
table : order
partitions : NULL
type : range
possible_keys : idx_customer_id, idx_order_created_at
key : idx_order_created_at
key_len : 5
ref : NULL
rows : 103357
filtered : 1.22
Extra : Using index condition; Using where
The first optimization I would do is on the access to the table:
create index ix1 on `order` (customer_id, created_at);
Then, if the query is still slow, I would try appending the columns you are selecting to the index. If, for example, you are selecting the columns order_id, amount, and status:
create index ix2 on `order` (customer_id, created_at,
order_id, amount, status);
This second strategy could be beneficial, but you'll need to test it to find out what performance improvement it produces in your particular case.
The big improvement of this second strategy is that it walks the secondary index only, avoiding the walk back to the primary clustered index of the table (which can be time consuming).
Instead of two single-column indexes on customer_id and created_at, create a single composite index on (customer_id, created_at). This way the engine can use BOTH parts of the WHERE clause instead of just one. It jumps right to the customer ID, then directly to the desired date, then returns the results. It SHOULD be very fast.
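A quick way to check that a composite index really serves both predicates is to compare query plans. The sketch below does this with Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the plan text is SQLite's, not MySQL's). The table is named orders here only to avoid quoting the reserved word order, the fixed timestamp replaces NOW() - INTERVAL 1 DAY, and the helper plan_with_index is hypothetical.

```python
import sqlite3

QUERY = """
SELECT * FROM orders
WHERE customer_id = 10 AND created_at >= '2024-01-01 00:00:00'
"""

def plan_with_index(index_sql):
    """Create an empty orders table with one index and return the plan for QUERY."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "created_at TEXT, amount REAL)"
    )
    conn.execute(index_sql)
    steps = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return steps

# Index on the date alone: customer_id has to be filtered row by row afterwards.
single = plan_with_index(
    "CREATE INDEX idx_order_created_at ON orders (created_at)"
)
# Composite index: equality on customer_id, then the range on created_at.
composite = plan_with_index(
    "CREATE INDEX ix1 ON orders (customer_id, created_at)"
)

print(single)
print(composite)
```

In the composite plan both columns should appear among the index constraints, which matches the "jump to the customer, then to the date" description.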
Additional Follow-up.
I hear your comment about having multiple indexes, but you could add those columns to the main index, right after the first two, such as
( customer_id, created_at, updated_at, completion_time )
Then your queries could always include some help for the index in the WHERE clause. For example (and I don't know your specific data): a record is created at some given point. The updated and completion times will always be AFTER that. How long does it take (worst-case scenario) from creation to completion time... 2 days, 10 days, 90 days?
where
customer_id = ?
AND created_at >= NOW() - INTERVAL 10 DAY
AND updated_at >= NOW() - INTERVAL 1 DAY
Again, just an example, but if a person has 1000's of orders and relatively quick turn-around time, you could jump to those most recent and then find those updated within the time period.. Again, just an option as a single index vs 3, 4 or more indexes.
It seems you are dealing with a very fast-growing table; I would consider moving this frequent query to a cold table or a replica.
One more point: have you considered partitioning by customer_id? I do not quite understand the business logic behind querying customer_id = 10, but if this is a multi-tenant application, try partitioning.
For this query:
SELECT o.*
FROM `order` o
WHERE o.customer_id = 10 AND
created_at >= NOW() - INTERVAL 1 DAY;
My first inclination would be a composite index on (customer_id, created_at) -- as others have suggested.
But you appear to have a lot of data and many inserts per day. That suggests partitioning plus an index. The appropriate partition would be on created_at, probably on a daily basis, along with an index for customer_id.
A typical query would access the two most recent partitions. Because your queries are focused on recent data, this also reduces the memory occupied by the index, which might be an overall benefit.
This technique should be better than all the other answers, though perhaps by only a small amount:
Instead of orders being indexed thus:
PRIMARY KEY(order_id) -- AUTO_INCREMENT
INDEX(customer_id, ...) -- created_at, and possibly others
do this to "cluster" the rows together:
PRIMARY KEY(customer_id, order_id)
INDEX (order_id) -- to keep AUTO_INCREMENT happy
Then you can optionally have more indexes starting with customer_id as needed. Or not.
Another issue -- What will you do with 20K rows? That is a lot to feed to a client, especially of the human type. If you then munch on it, can't you make a more complex query that does more work, and returns fewer rows? That will probably be faster.

Designing a database for storing 500 million domain names with full text search

I'm about to build an application that stores up to 500 million records of domain names.
I'll index the '.net' or '.com' part and strip the 'www' at the beginning.
So I believe the table would look like this:
domain_id | domain_name | domain_ext
----------+--------------+-----------
1 | dropbox | 2
2 | digitalocean | 2
domain_ext = 2 means it's a '.com' domain.
The queries I'm about to perform:
I need to be able to insert new domains easily.
I also need to make sure I'm not inserting a duplicate (each domain should have only 1 record), so I plan to make domain_name + domain_ext a UNIQUE index (with MySQL - InnoDB).
Query domains in batches. For example: SELECT * FROM tbl_domains LIMIT 300000, 600;
What do you think? Will that table hold hundreds of millions of records?
How about partitioning by first letter of the domain name, would that be good?
Let me know your suggestions, I'm open minded.
Partitioning is unlikely to provide any benefit, certainly not if you are partitioning on the first letter.
Don't use OFFSET and LIMIT for batching. Instead "remember where you left off". See my blog for more details.
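"Remember where you left off" can be sketched as follows. This is a minimal illustration, not the blog's code: it uses Python's built-in sqlite3 module instead of MySQL so that it runs on its own, and the helper iter_batches and the sample rows are hypothetical. Each batch starts after the last primary key seen, so the engine never has to step over skipped rows the way OFFSET does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE tbl_domains ("
    "domain_id INTEGER PRIMARY KEY, domain_name TEXT, domain_ext INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl_domains (domain_name, domain_ext) VALUES (?, ?)",
    [(f"example{i}", 2) for i in range(10)],
)

def iter_batches(conn, batch_size):
    """Yield batches in primary-key order, resuming after the last id seen."""
    last_id = 0  # assumes ids are positive, as with auto-increment keys
    while True:
        batch = conn.execute(
            "SELECT domain_id, domain_name, domain_ext FROM tbl_domains "
            "WHERE domain_id > ? ORDER BY domain_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return
        yield batch
        last_id = batch[-1][0]  # remember where we left off

batches = list(iter_batches(conn, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

The same WHERE domain_id > ? ORDER BY domain_id LIMIT ? pattern carries over to MySQL unchanged.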
If you have declared domain_ext to be INT, then I ask why? INT takes 4 bytes. So does .com. Even if you counter with SMALLINT or .uk, I will counter-counter with "The small difference does not justify the complexity."
Edit (on UNIQUE)
A non-partitioned table can have a UNIQUE index. (Note: A PRIMARY KEY is a UNIQUE index.) When you have a UNIQUE index, checking for uniqueness is virtually instantaneous, even for 500M rows. (Drilling down about 5 levels of BTree is very fast.)
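The "about 5 levels" figure can be reproduced with a little arithmetic. The sketch below assumes a fanout of 100 entries per BTree block, which is a round illustrative number and not a measured value for any particular table; the function name btree_levels is hypothetical.

```python
def btree_levels(row_count, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) holds row_count."""
    levels, capacity = 0, 1
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels

# Fanout of 100 entries per block is an assumed round figure.
print(btree_levels(500_000_000, 100))  # 5
```

Because the depth grows with the logarithm of the row count, going from a million rows to 500 million rows adds only a couple of levels under this assumption.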
With PARTITIONing, every UNIQUE key must include the "partition key". If the domain is not split, you cannot use PARTITION BY RANGE. Splitting off the extension (top-level domain) as an INT, you could use BY RANGE or BY LIST. The UNIQUE would be possible since the TLD is both the partition key and needed as part of the domain. But it would not gain any performance. A lookup would (1) pick the partition ("partition pruning"), then (2) drill down 4-5 levels of BTree to get to the row to check.
Conclusion: Doing a uniqueness check, while possible in this case, will not be any faster with PARTITIONing.

MySQL performance boost after create & drop index

I have a large MySQL, MyISAM table of around 4 million rows running in a core 2 duo, 8G RAM laptop.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN to the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
First, I'm not sure why another_index is chosen. It is in fact a composite index on indexed_varchar_column and another 2 columns (which are among the selected ones). Perhaps this makes sense, since it may make things a bit faster by not having to read 2 of the columns in the query. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. On the 2nd time I query against 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it takes again 5 seconds. So, it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again makes all further queries against a new 'something_other' take only 0.15 secs. Please note that I 1) create an index and 2) drop it again, so everything is in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this I get exactly the same as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B suggesting that when MySQL creates an index it internally does a SELECT..., I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting the queries are slow again. I guess this is because MySQL is using some sort of OS caching.
Any ideas? How can I explicitly cache the table?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It's 4 million rows but I remove lots of old fields regularly. I also add new ones. Since I had large gaps in IDs (for the rows deleted) every day I drop the primary index (ID) and create it again with consecutive numbers. The table may be then very fragmented and therefore IO must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command on a cron that is run everyday. It takes 2 minutes but it's worth it.
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried the following?
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';
What is the order of the columns in your composite index?
Your query must use a leftmost prefix of the index's columns.
If you have an index on (foo, bar, baz), it will not be usable as an index against bar or baz by themselves. Only (foo), (foo,bar), and (foo,bar,baz).
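The prefix rule can be written down as a tiny function. This is a simplified model that only considers equality filters and ignores refinements real optimizers have (range conditions, index-only scans and so on); the name usable_prefix is hypothetical.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that the filtered columns can use."""
    usable = []
    for column in index_columns:
        if column not in equality_columns:
            break  # a gap ends the usable prefix
        usable.append(column)
    return usable

index = ("foo", "bar", "baz")
print(usable_prefix(index, {"foo", "bar"}))  # ['foo', 'bar']
print(usable_prefix(index, {"bar", "baz"}))  # []
print(usable_prefix(index, {"foo", "baz"}))  # ['foo']
```

The last case shows that filtering on foo and baz can still use the index, but only for the foo part.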
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
EDIT Here's a postgres explain of a simple left join query for comparison.
Nested Loop Left Join (cost=0.00..16.97 rows=13 width=103)
Join Filter: (pagesets.id = pages.pageset_id)
-> Index Scan using ix_pages_pageset_id on pages (cost=0.00..8.51 rows=13 width=80)
Index Cond: (pageset_id = 515)
-> Materialize (cost=0.00..8.27 rows=1 width=23)
-> Index Scan using pagesets_pkey on pagesets (cost=0.00..8.27 rows=1 width=23)
Index Cond: (id = 515)

SQL - Primary Key, Clustered Index, Auto-increment

My website displays posts by DATE even though the SQL table is ordered by ID. Since the order of the ID is not always the same as the order of the DATE, I run the query with ORDER BY `DATE`.
SQL Table Example:
----------------------------
| ID | DATE |
----------------------------
| 1 | 2011-10-20 00:00:00 |
| 2 | 2012-10-20 00:00:00 |
| 3 | 2010-10-20 00:00:00 |
| 4 | 2011-09-20 00:00:00 |
----------------------------
To query I use: SELECT * FROM `table` ORDER BY `DATE`;
My questions:
Would it benefit the query performance if the clustered index or primary key of the table was the DATE column?
Is it possible to have the ID column auto-increment when it is not the primary key?
What I want to do is make the query as fast as possible (which I think would be possible by making the DATE the clustered index or primary key) but also allow each post to have a unique auto-increment ID. I tried to make DATE the primary key but I got an error saying "there can be only one auto column and it must be defined as a key".
I would not define the date as a primary key, but rather add an index on the field. Unique, if needed. I believe it is possible to have an autoincrement on a non primary key field, but trying it yourself will give you the best answer!
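Here is a small sketch of that suggestion: keep the auto-increment ID as the primary key and add a secondary index on the date. It uses Python's built-in sqlite3 module rather than MySQL (an assumption made so it is self-contained; storage details and plan wording differ in InnoDB), and the table name posts, the index name ix_posts_date and the helper plan are hypothetical. The sample rows are the ones from the question.

```python
import sqlite3

QUERY = "SELECT * FROM posts ORDER BY date"

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, date TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO posts (date) VALUES (?)",
    [("2011-10-20 00:00:00",), ("2012-10-20 00:00:00",),
     ("2010-10-20 00:00:00",), ("2011-09-20 00:00:00",)],
)

def plan(conn, sql):
    """Return the human-readable steps of SQLite's plan for `sql`."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index on date, the rows have to be sorted for every query.
plan_before = plan(conn, QUERY)

conn.execute("CREATE INDEX ix_posts_date ON posts (date)")
# With the index, rows can be read in date order directly.
plan_after = plan(conn, QUERY)

ordered_ids = [row[0] for row in conn.execute(QUERY)]
print(plan_before)
print(plan_after)
print(ordered_ids)  # [3, 4, 1, 2]
```

The first plan should include a temporary sorting step that disappears once the index exists, while the IDs keep auto-incrementing independently of the date order.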
<-- EDIT -->
To answer your comment question, I can't say it's a BAD idea, but dates are always tricky. For one, you have to decide whether to use UTC or local dates, consider how daylight saving time affects your program, foresee whether a date might need to be updated at some point in the application's life, and things like that. I would rather forget about all that and just go with the unique autogenerated key.
If you do go for the date as PK, you can use timestamp and avoid the second sequence column.
I found more info about dates as primary keys at techtarget.com and made2mentor.com.
It is nice for indexes if the values going into them are unordered. Not mandatory, but nice. Since they are trees, if an index is only on an autoincrement column you end up with an unbalanced tree right from the beginning. Each time you rebuild the index, it is guaranteed to become unbalanced again as new data gets added, because new data will only get added to one leaf of the tree (until the index page is full).
For clustered indexes on auto-increment fields (which primary keys are by default in Sybase, MS SQL and probably everything else), it is probably a good idea to do relatively frequent index rebuilds. My philosophy is to cluster on the most common scan. So I might set my primary key to the ID column but cluster on the DATE, so that when I do things like select Date from table where ... or select ... order by Date, the query scans consecutive items as it reads the pages off disk.

Is it possible to optimize a query that uses the '<>' operator?

This is a follow-up to a previous question.
How can I optimize this query so that it does not perform a full table scan?
SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
explain SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Employee | ALL | PRIMARY | NULL | NULL | NULL | 5000 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
(Employee.id is the primary key, in case that isn't clear.)
Have a covering index for name and id, and it should be able to fulfill the query using the index. This might be faster, because there's a good chance the entire index will already be in memory, while a table scan is more likely to need to go to disk.
Because of the low (non-existent) selectivity of your where clause you may need to provide a hint to get the database to use your index. I'm a sql server guy, and so I'm not sure of the syntax needed in mysql to hint an index, or even if mysql is able to take advantage of a covering index in this manner.
That said, I doubt you can get much improvement: you're returning every row but one. You should expect that to need to scan the table.
There are a lot of things to try, it depends on how the database engine chooses to parse it, really. Some options:
select employee.name from employee where employee.id not in (1000);
You could also try a union with a less than and then a greater than.
But in the specific example you are giving (which may just be too simple for your real case) a table scan isn't necessarily a bad thing. If all the records have to be returned except one, using an index may in fact be slower.
In traditional databases, you can't!
Of course, you could just omit all Employees with the given id (when it is a key or has an index), but normally you will still have the vast majority of the table left to read. So using an index might complicate things, and thus a full table scan (FTS) is normally the faster option.
When you have specialized databases, you could store the names of all employees adjacent to each other.
Edit: I have now seen Joel's other answer. Yes, this could be a way, since your special index is in effect a specialized form of storing part of the content. Good databases can use the index content alone when it covers the columns needed, which is rather nice. Of course, you will end up with a so-called "full index scan" (but that is normally much faster than a full table scan).
Nothing you can do will increase performance. In this case the database must do a complete table scan, as you are asking for every record save one. Reading every page in an index on top of that would only reduce performance. Fortunately, even if you added an index, the database would be smart enough to ignore it...
EDIT to address Juergen's comment.
Juergen, you are right about a covering index, but there are conflicting effects here. Any use of an index in a scenario like this has bad effects in one sense... The query engine could have to perform one I/O Operation for each level in the index, for each row it needs to examine. If there are, say, 5 levels in the index, and 1M rows, that would be 5 Million I/O operations, compared to only 1M I/Os to do a complete table scan. This is why, in this scenario, most query optimizers would ignore any available index and do the table scan anyway. (unless you force it to use the index with a hint) The only mitigating factor is if EVERY attribute required by the query is in the index (covering index) and the number of index rows per page on disk is sufficiently smaller than the number of table rows per page to counteract the negative effect of having to traverse each level of the index for each row returned by the query.
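The arithmetic in this answer can be spelled out as below. It follows the answer's own simplified cost model (one I/O per index level for each row, and one I/O per row for a table scan), which ignores caching and the fact that many rows share a page; the function names are hypothetical.

```python
def per_row_index_ios(index_levels, rows):
    """I/O count if every row is reached by descending the whole index."""
    return index_levels * rows

def table_scan_ios(rows):
    """I/O count under the same simplification of one I/O per row."""
    return rows

rows = 1_000_000
print(per_row_index_ios(5, rows), table_scan_ios(rows))  # 5000000 1000000
```

Under this model the index route costs five times as much as the scan, which is why the answer expects the optimizer to prefer the table scan unless the index is covering.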