SQL - Primary Key, Clustered Index, Auto-increment - mysql

My website displays posts by DATE even though the SQL table is ordered by ID. Since the order of the ID is not always the same as the order of the DATE, I run the query with ORDER BY 'DATE'.
SQL Table Example:
----------------------------
| ID | DATE |
----------------------------
| 1 | 2011-10-20 00:00:00 |
| 2 | 2012-10-20 00:00:00 |
| 3 | 2010-10-20 00:00:00 |
| 4 | 2011-09-20 00:00:00 |
----------------------------
To query I use: SELECT * FROM `table` ORDER BY 'DATE';
My questions:
Would it benefit the query performance if the cluster index or primary key of the table was the DATE column?
Is it possible to have the ID column auto-increment when it is not the primary key?
What I want to do is make the query as fast as possible (which I think would be possible by making the DATE the cluster index or primary key) but also allow each post to have a unique auto-increment ID. I tried to make DATE the primary key but I got an error saying "there can be only one auto column and it must be defined as a key".

I would not define the date as a primary key, but rather add an index on the field. Unique, if needed. I believe it is possible to have an autoincrement on a non primary key field, but trying it yourself will give you the best answer!
<-- EDIT -->
To answer your comment question, I can't say its a BAD idea, but dates are always picky. For once, you have to decide if you use UTC or local date, preview how daylight saving time affects your program, foresee if the need of a date update would be possible at some time of the application life, and things like that. I rather forget about that and just go with the unique autogenerated key.
If you do go for the date as PK, you can use timestamp and avoid the second sequence column.
I found more info about dates as primary keys at techtarget.com and made2mentor.com.

It is nice for indexes if the values going into it are unordered. Not mandatory but nice. Since they are trees if an index is only an autoincrement column you end up with an unbalanced tree right from the beginning each time you rebulid the index you are guaranteed to always get unbalanced as new data gets added because it will only get added to one leaf of the tree (until the index page is full).
For the clustered indexes on auto increment fields (which primary keys are by default in Sybase, MS SQL and probably everything else) it is probably a good idea to do relatively frequent index rebuilds. My philosophy is to cluster on the most common scan. So I might set my primary key to the ID column but I'd cluster on the DATE so when I do things like select Date from table where or select ... order by Date the query will scan consecutive items in as it reads the pages off disk.

Related

MySQL - is query with where by timestamp more efficient?

I have a table with primary key, indexed field and an unindexed timestamp field.
Does it more efficient to query by timestamp too? lets say - 12 hours period?
Is it enough to query by primary key or is it better to use indexed fields too? Lets say that query by the indexed field is not a must.
example:
p_key | project | name | timestamp
-----------------------------------------
1 | 1 | a | 18:00
2 | 1 | b | 19:00
I want to get record 1.
should I ask:
SELECT *
FROM tbl
WHERE p_key = 1 AND project = 1 AND timestamp BETWEEN 16:30 AND 18:30)
OR
SELECT *
FROM tbl
WHERE p_key = 1
Lets say that I have many records.
In your example it doesn't matter which query is more efficient in terms of execution time. The important piece to note is that a primary key is unique.
Your query:
SELECT * FROM tbl WHERE p_key = 1
Will return the same row as your other query:
SELECT * FROM tbl WHERE p_key = 1 AND project = 1 AND timestamp BETWEEN 16:30 AND 18:30)
Because both filter on the p_key = 1. The worst case scenario here is that the entry does not actually fall within your time span in the second query and you get no results at all.
I am assuming you have an index on the primary key here. This means there is absolutely no need to run the second query vs the first query, unless it is possible that it does not fall within the timespan requested.
So your efficiency in your database will be in that you do not need to create and maintain a new index for the second query. If you have "many" rows as you stated, this efficiency can become quite important.
A filter by an integer indexed field will be the fastest way to get your data under normal circunstances.
Understanding that your data looks like your example (I mean, the Timestamp is not significant in your query and filtering by the primary key you get a single record...)
In addition, by default a primary key generates an index, so you don't need to create it by yourself on this field.
The second option obviously!!!
SELECT * FROM tbl WHERE p_key = 1
Filtering by primary key is clearly more efficient than by any other field (in your example) since it is the only one to be indexed.
Furthermore the primary key is enough to get the record you expect. No need to add complexity, bug risk...and computing time (yes, the conditions in the where clause need to be processed. The more you add, the longer it can take)

Why does it contain "Using where"?

Here is my table schema.
CREATE TABLE `usr_block_phone` (
`usr_block_phone_uid` BIGINT (20) UNSIGNED NOT NULL AUTO_INCREMENT,
`time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`usr_uid` INT (10) UNSIGNED NOT NULL,
`block_phone` VARCHAR (20) NOT NULL,
`status` INT (4) NOT NULL,
PRIMARY KEY (`usr_block_phone_uid`),
KEY `block_phone` (`block_phone`),
KEY `usr_uid_block_phone` (`usr_uid`, `block_phone`) USING BTREE,
KEY `usr_uid` (`usr_uid`) USING BTREE
) ENGINE = INNODB DEFAULT CHARSET = utf8
And This is my SQL
SELECT
ubp.usr_block_phone_uid
FROM
usr_block_phone ubp
WHERE
ubp.usr_uid = 19
AND ubp.block_phone = '80000000001'
By the way, when I ran "EXPLAIN", I got the result as following.
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| 1 | SIMPLE | ubp | ref | block_phone,usr_uid_block_phone,usr_uid | usr_uid_block_phone | 66 | const,const | 1 | Using where; Using index |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
Why is index usr_uid_block_phone not working?
I want to use using index only.
This table has 20000 rows now.
Your index is actually used, see the key column. At the moment the query looks good and the execution plan is good as well.
Fill it with at least a hundred for it to be used (and ensure you still use a predicate that filters just one row).
And a general advice: it's near to impossible to predict how optimiser would behave in a particular situation unless you're a mysql dbms developer yourself. So it's always better to try on a dataset that is as close (in terms of size and quality of data) to your production as possible.
Both columns that are used in the WHERE clause (usr_uid and block_phone) are present in the usr_uid_block_phone index and this makes it a possible key to be used to process the query. Even more, it is the index selected but because of the small number of rows in the table, MySQL decides that is faster to not use an index.
The reason is in the expressions present in the SELECT clause:
SELECT
ubp.usr_block_phone_uid
Because the column usr_block_phone_uid is not present in the selected index, in order to process the query MySQL needs to read both the index (to determine what rows match the WHERE conditions) and the table data (to get the value of column usr_block_phone_uid of those rows).
It is faster to read only the table data and use the WHERE conditions to find the matching rows and get their usr_block_phone_uid column. It needs to read data from storage from one place. It needs to read the same data and the index data if it uses an index.
The situation (and the report of EXPLAIN) changes when the table grows. At some point, reading information from the index (and using it to filter out rows) is compensated by the large number of rows that are filtered out (i.e. their data is not read from the storage).
The exact point when this happens is not fixed. It depends a lot of the structure of your table and how the values in the table are spread out. Even when the table is large, MySQL can decide to ignore the index in order to read less information from the storage medium. For example, if a large percentage (let's say 90%) of the table rows match the WHERE condition, it is more efficient to read all the table data (and ignore the index) than to read 90% of table data and 90% of the index.
90% in the previous paragraph is a figure I made up for explanation purposes. I don't know how MySQL decides that it's better to ignore the index.

How to decide which fields must be indexed in a database table

Explanation
I have a table which does not have a primary key (or not even a composite key).
The table is for storing the time slots (opening hours and food delivery available hours) of the food shops. Let's call the table "business_hours" and the main fields are as below.
shop_id
day (0 - 6, means Sunday - Saturday)
type (open, delivery)
start_time
end_time
As an example, if shop A is opened on Monday from 9.00am - 01.00pm and 05.00pm to 10.00pm, there will be two records in business_hours table for this scenario.
-----------------------------------------------
| shop_id | day | type | start_time | end_time
-----------------------------------------------
| 1000 | 1 | open | 09:00:00 | 13:00:00
-----------------------------------------------
| 1000 | 1 | open | 17:00:00 | 22:00:00
-----------------------------------------------
When I query this table, I will use shop_id always as the first condition in where clause.
Ex:
SELECT COUNT(*) FROM business_hours WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
Question
Applying index for "shop_id" is enough or "day" & "type" fields also should be indexed?
Also better if you can explain how the indexing really works.
It depends on several factors that you should specify:
How fast will the data grow
What is the estimated table size in rows
What queries will be run against that table
How fast do you expect the queries to run
It is more about thinking like: Some service will make thousands of inserts of new records per hour, the old records will be archived nightly and reports are to be created nightly from that table. In such a case you may prefer to not to create many indexes since they slow down inserts.
On the other hand if your table will grow and change slowly and many users will run queries against it, you need to have proper indexes to speed up queries.
If you can, try to create clustered unique primary key that most queries can benefit from. If you have data that form some timeline and most queries will get ranges of data using the datetime criteria (like from - to), it is better to include datetime in clustered index - you will get fastest query performance.
So something like this will grant you best performance for the mentioned select. (But you cannot store duplicate business hours for one shop and type)
CREATE TABLE Business_hours
( shop_id INT NOT NULL
, day INT NOT NULL
--- other columns
, CONSTRAINT Business_hours_PK
PRIMARY KEY (shop_id, day, type, start_time, end_time) -- your clustered index
)
Just creating an index on fields used in the SELECT (all of them or just some of them most used), will speed up your query too:
CREATE INDEX BusinessHours_IX ON business_hours (shop_id,day,type, start_time, end_time);
Difference between clustered and non-clustered is that clustered index affects order in which are db records stored on disk.
You can use EXPLAIN to find missing indexes in your database, see this answer.
For more detail this blog.
Yes, You are create a clustered index on this column (shop_id,day,type). I have create a index like above:
Create clustered index Ix on business_hours (shop_id,day,type)
Use this index your select query like above:
SELECT COUNT(*) FROM business_hours with (index (Ix)) WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
You are get result fast but a table which have a primary key than not create
clustered index and create a non clustered index
It depends on your usability if you are not updating the record then use clustered index
on
CREATE CLUSTERED INDEX Saleperday ON business_hours (shop_id,day,type);
because Clustered index traverse along the B Tree and stores the entire row on node itself, So searching is fast. But Updating records is memory cost effective as it shifts the entire row from memory crating new entry for same record.
OR ELSE
If Your are updating the records then non clustered index.
If you create ware house then use Column Store Indexes
For better understanding your can go to these links
http://www.programmerinterview.com/index.php/database-sql/clustered-vs-non-clustered-index/
http://www.patrickkeisler.com/2014/04/what-is-non-clustered-columnstore-index.html
http://searchsqlserver.techtarget.com/feature/SQL-Server-2014-columnstore-index-the-good-the-bad-and-the-clustered
Please reply for answer.
Having decided against a primary key means the following would be allowed:
| shop_id | day | type | start_time | end_time
+---------+-----+--------+------------+---------
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 17:00:00 | 22:00:00
| 1000 | 1 | closed | 17:00:00 | 22:00:00
So you can have duplicate entries that may lead to strange query results and even have a shop open and closed in the very same time range. (But well, we all know that even with a primary key you'd still need a before-insert trigger to detect a range overlapping, e.g. 12:00-15:00 vs. 13:00-16:00, and throw an error in case. - How I wish there were some built-in range detection, so we could, say, have a unique index on (shop_id, day, range(start_time, end_time)).)
As to your question: Provided your database is built well, you already have a foreign key on shop_id. You don't need any further index as long as you consider your queries fast enough.
Once you think you need to speed them up, you can add composite indexes as needed. That would usually be an index on all columns in the slow query's WHERE clause. If that still doesn't suffice add the columns that are in the GROUP BY clause, if any. Next step would be to add the columns of the HAVING clause, if any. Next would be the columns of the ORDER BY clause. And the last step would be to even add all columns in your SELECT clause, which would give you a covering index, i.e. all data needed for the query would be in the index and the table itself would hence not have to be accessed any longer.
But as mentioned: As long as you don't have performance issues, you don't have to add any composite indexes.
To decide which fields must be indexed in a database table you need to observe the behavior of each query sent to the table. Indexes are the means of providing an efficient access path between the application and the data. The index provides the access path; so, when query asks for data to the database, it will know where to go to retrieve the data.
Here is some official Microsoft documentation
Clustered Indexes A clustered index stores the actual table data pages at the leaf level, and the table data is ordered physically
around the key. A table can have only one clustered index, and when
this index is created, the following events also occur: • Table data
is rearranged. • New index pages are created. • All nonclustered
indexes within the database are rebuilt. As a result, there are many
disk I/O operations and extensive use of system and memory resources.
If you plan to create a clustered index, be sure you have free space
equal to at least 1.5 times the amount of data in the table. The extra
free space ensures that you have enough space to complete the
operation efficiently.
Nonclustered Indexes In a nonclustered index, pages at the leaf level contain a bookmark that tells SQL Server where to find the data
row corresponding to the key in the index. If the table has a
clustered index, the bookmark indicates the clustered index key. If
the table does not have a clustered index, the bookmark is an actual
row locator. When you create a nonclustered index, SQL Server creates
the required index pages but does not rearrange table data.
The Indexing Method recommended by professionals is comprised of three phases: Monitor, Analyze, and then implements the index. That
means you need to observe the behavior of your database when you run a
query then work for get the best performance
SQL server use this operation for fetch the data:
Table scan: Reads the entire heap and, most likely, passes all the data to a secondary filter operation
Index scan: Reads the entire leaf level (every row) of the clustered index or non-clustered index. The index scan operation might
filter the rows and return only those rows that meet the criteria, or
it might pass all the rows to another filter operation depending on
the complexity of the criteria. The data may or may not be ordered.
Index seek: Locates specific row(s) data using the index and returns only the selected rows in an ordered list
So, once you know that you can run the query and use the option Display the Estimated Execution Plan and analyses the performance,
I recommend reading this post SQL SERVER – Index Seek Vs. Index Scan and Optimizing Your Query Plans with the SQL

Optimizing the performance of MySQL regarding aggregation

I'm trying to optimize a report query, as most of report queries this one incorporates aggregation. Since the size of table is considerable and growing, I need to tend to its performance.
For example, I have a table with three columns: id, name, action. And I would like to count the number of actions each name has done:
SELECT name, COUNT(id) AS count
FROM tbl
GROUP BY name;
As simple as it gets, I can't run it in a acceptable time. It might take 30 seconds and there's no index, whatsoever, I can add which is taken into account, nevertheless improves it.
When I run EXPLAIN on the above query, it never uses any of indices of the table, i.e. an index on name.
Is there any way to improve the performance of aggregation? Why the index is not used?
[UPDATE]
Here's the EXPLAIN's output:
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| 1 | SIMPLE | tbl | ALL | NULL | NULL | NULL | NULL | 4025567 | 100.00 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
And here is the table's schema:
CREATE TABLE `tbl` (
`id` bigint(20) unsigned NOT NULL DEFAULT '0',
`name` varchar(1000) NOT NULL,
`action` int unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `inx` (`name`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The problem with your query and use of indexes is that you refer to two different columns in your SELECT statement yet only have one column in your indexes, plus the use of a prefix on the index.
Try this (refer to just the name column):
SELECT name, COUNT(*) AS count
FROM tbl
GROUP BY name;
With the following index (no prefix):
tbl (name)
Don't use a prefix on the index for this query because if you do, MySQL won't be able to use it as a covering index (will still have to hit the table).
If you use the above, MySQL will scan through the index on the name column, but won't have to scan the actual table data. You should see USING INDEX in the explain result.
This is as fast as MySQL will be able to accomplish such a task. The alternative is to store the aggregate result separately and keep it updated as your data changes.
Also, consider reducing the size of the name column, especially if you're hitting index size limits, which you most likely are hence why you're using the prefix. Save some room by not using UTF8 if you don't need it (UTF8 is 3 bytes per character for index).
It's a very common question and key for solution lies in fact, that your table is growing.
So, first way would be: to create index by name column if it isn't created yet. But: this will solve your issue for a time.
More proper approach would be: to create separate statistics table like
tbl_counts
+------+-------+
| name | count |
+------+-------+
And store your counts separately. When changing (insert/update or delete) your data in tbl table - you'll need to adjust corresponding row inside tbl_counts table. This way allows you to get rid of performing COUNT query at all - but will need to add some logic inside tbl table.
To maintain integrity of your statistics table you can either use triggers or do that inside application. This method is good if performance of COUNT query is much more important for you than your data changing queries (but overhead from changing tbl_counts table won't be too high)

What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| 1 | SIMPLE | accesslog | range | date,dateurl,dateaff | date | 3 | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| 1 | SIMPLE | accesslog | index | NULL | accessid | 257 | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing column width on accessid from varchar 255 to char 32 improved query time by ~75%.
Adding a date+accessid index had no effect on query time.
An index on (date,accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(256) for accessid? If so, can't you use a more compact type? If it's a number, it should by INT (SMALLINT, BIGINT, whatever fits your needs) and if it's an alphanumeric ID, can it really be 256 chars long? If its length is fixed, can't you use CHAR (CHAR(32) for example) instead?
Your problem is that your condition is a range clause (on the date column).
A multi-column index of date->accessid likely wont help the situation as MySQL can't use index columns after a range condition. In theory they should be able to use it to cover the computation in this case, but it appears to be a shortcoming in MySQL, I've never gotten it to use a multi column index in this situation successfully.
You can try creating an index on (date,accessid) hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold much hope. There's not a great deal you can do.
Edit:
My answer is courtesy of High Performance MySQL - Second Edition, worth it's weight in gold if you have to do serious MySQL development.
Why doesn't the query with the date clause not use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z, I don't know much about MySQL) would benefit from an index on date+accessid since the access IDs would be sorted within dates in that index. That DBMS will use the date+accessid key to efficiently use the where clause to whittle down the search space and to return distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).
The query uses the 'date' index because thats what you use in the where clause.
This is the only sensible option, if it used the access id index it would need to read all the accessid rows then check the date before it and only then decide if it was distinct.
If this is a really big table a compound index on date and accessid might help.
Why doesn't the query with the date clause not use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table - in fact, it would do better in the context to ignore the index, but I started of with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid - the information is in the index. So, this might convert a query that access an index and a table into one that only accesses the index - which will reduce the amount of I/O needed and therefore improve the performance of the query.
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.
I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.
Index optimizations if often like alchemy. Different DBMS behave differently, and sometimes you need to simply try (and fail) various combinations. I’m not saying it’s not possible to reason. It is in many cases, but up to a certain point. Often it’s simply faster and easier to follow your instinct.