MySQL Integer vs DateTime index - mysql

Let me start by saying I have looked at many similar questions, but all of them relate to the Timestamp and DateTime field types without indexing. At least that is my understanding.
As we all know, there are certain advantages when it comes to DateTime. Putting them aside for a minute, and assuming the table's engine is InnoDB with 10+ million records, which query would perform faster when the criterion is based on:
DateTime with index
int with index
In other words, is it better to store the date and time as a DateTime, or as a UNIX timestamp in an int column? Keep in mind there is no need for any built-in MySQL functions to be used.
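For concreteness, here is a minimal sketch of what the two variants might look like (the actual table definitions are not shown in this question; the column and index details below are assumptions, though the tbl_dt / tbl_int / created names match the test in the update):
-- Variant 1: native DATETIME column with a secondary index
CREATE TABLE tbl_dt (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  created DATETIME NOT NULL,
  PRIMARY KEY (id),
  KEY idx_created (created)
) ENGINE=InnoDB;
-- Variant 2: UNIX timestamp stored as an integer, also indexed
CREATE TABLE tbl_int (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  created INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  KEY idx_created (created)
) ENGINE=InnoDB;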
Update
Tested with MySQL 5.1.41 (64-bit) and 10 million records; initial testing showed a significant speed difference in favour of int. Two tables were used: tbl_dt with a DateTime column and tbl_int with an int column. A few results:
SELECT SQL_NO_CACHE COUNT(*) FROM `tbl_dt`;
+----------+
| COUNT(*) |
+----------+
| 10000000 |
+----------+
1 row in set (2 min 10.27 sec)
SELECT SQL_NO_CACHE COUNT(*) FROM `tbl_int`;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (25.02 sec)
SELECT SQL_NO_CACHE COUNT(*) FROM `tbl_dt` WHERE `created` BETWEEN '2009-01-30' AND '2009-12-30';
+----------+
| COUNT(*) |
+----------+
| 835663 |
+----------+
1 row in set (8.41 sec)
SELECT SQL_NO_CACHE COUNT(*) FROM `tbl_int` WHERE `created` BETWEEN 1233270000 AND 1262127600;
+----------+
| COUNT(*) |
+----------+
| 835663 |
+----------+
1 row in set (1.56 sec)
I'll post another update with both fields in one table as suggested by shantanuo.
Update #2
Final results after numerous server crashes :) The int type is significantly faster; no matter what query was run, the speed difference was more or less the same as in the results above.
One "strange" thing observed: execution time was more or less the same when both field types were stored in the same table. It seems MySQL is smart enough to figure out when the same value is stored in both a DateTime and an int column. I haven't found any documentation on the subject, so this is just an observation.

I see that in the test mentioned in the above answer, the author basically proves that when the UNIX time is calculated in advance, INT wins.
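To illustrate what "calculated in advance" presumably means here, a sketch against the question's tbl_int table (the boundary values are the ones from the update above):
-- Boundaries precomputed as integers, so the indexed int column is compared directly:
SELECT COUNT(*) FROM tbl_int WHERE created BETWEEN 1233270000 AND 1262127600;
-- Wrapping the column in a conversion function instead would defeat the index on created:
SELECT COUNT(*) FROM tbl_int
WHERE FROM_UNIXTIME(created) BETWEEN '2009-01-30' AND '2009-12-30';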

My instinct would be to say that ints are always faster. However, this seems not to be the case:
http://gpshumano.blogs.dri.pt/2009/07/06/mysql-datetime-vs-timestamp-vs-int-performance-and-benchmarking-with-myisam/
Edited to add: I realize that you're using InnoDB rather than MyISAM, but I haven't found anything to contradict this in the InnoDB case. Also, the same author did an InnoDB test:
http://gpshumano.blogs.dri.pt/2009/07/06/mysql-datetime-vs-timestamp-vs-int-performance-and-benchmarking-with-innodb/

It depends on your application, as you can see in an awesome comparison and benchmark of the DATETIME, TIMESTAMP and INT types in MySQL Date Format: What Datatype Should You Use? We Compare Datetime, Timestamp and INT. In some situations INT performs better, and in others DATETIME does; it completely depends on your application.

Related

why is mysql select count(1) taking so long?

When I first started using MySQL, a select count(*) or select count(1) was almost instantaneous. But I'm now using version 5.6.25 hosted at Dreamhost, and it sometimes takes 20-30 seconds to do a select count(1). However, the second time it's fast, as if the index were cached, but not super fast, as it would be if the data were coming from just the metadata index.
Anybody understand what's going on, and why it has changed?
mysql> select count(1) from times;
+----------+
| count(1) |
+----------+
| 1511553 |
+----------+
1 row in set (22.04 sec)
mysql> select count(1) from times;
+----------+
| count(1) |
+----------+
| 1512007 |
+----------+
1 row in set (0.54 sec)
mysql> select version();
+------------+
| version() |
+------------+
| 5.6.25-log |
+------------+
1 row in set (0.00 sec)
mysql>
I guess when you first started, you used MyISAM, and now you are using InnoDB. InnoDB just doesn't store this information. See documentation: Limits on InnoDB Tables
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If an approximate row count is sufficient, SHOW TABLE STATUS can be used. See Section 9.5, “Optimizing for InnoDB Tables”.
So when your index is entirely in the buffer pool after the (slower) first query, the second query is fast again.
MyISAM doesn't need to care about problems that concurrent transactions might create, because it doesn't support transactions, and so select count(*) from t will just look up and return a stored value very fast.
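Two workarounds the quoted documentation points at, sketched below (the counter table and trigger names are only examples, built around the question's times table):
SHOW TABLE STATUS LIKE 'times';   -- the Rows column: fast, but only an estimate for InnoDB
-- Exact count kept in a separate counter table maintained by triggers:
CREATE TABLE times_row_count (n BIGINT NOT NULL) ENGINE=InnoDB;
INSERT INTO times_row_count SELECT COUNT(*) FROM times;   -- one-time seed
CREATE TRIGGER times_count_ins AFTER INSERT ON times
  FOR EACH ROW UPDATE times_row_count SET n = n + 1;
CREATE TRIGGER times_count_del AFTER DELETE ON times
  FOR EACH ROW UPDATE times_row_count SET n = n - 1;
SELECT n FROM times_row_count;    -- instant, regardless of table size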

Why is my sql query much slower if I select a varchar field rather than a numeric one?

I have an SQL query that is rather big (joining 3 huge tables) and running too slowly. I'm trying to optimize it and ran into a strange observation:
SELECT board FROM ((foo JOIN bar ON id_bar=bar.id) JOIN baz ON id_baz=baz.id) ORDER BY foo.id DESC LIMIT 1;
+-------+
| board |
+-------+
| 3 |
+-------+
1 row in set (3,99 sec)
board is an int field, and there is an index on it. Good. But now, if I select an indexed varchar(6) field instead, I get this slow result:
SELECT type FROM ((foo JOIN bar ON id_bar=bar.id) JOIN baz ON id_baz=baz.id) ORDER BY foo.id DESC LIMIT 1;
+--------+
| type |
+--------+
| normal |
+--------+
1 row in set (17,76 sec)
How is that possible? I thought the slow part of a query was in the JOIN / ORDER / GROUP / WHERE parts, not in the actual display of results. How can I improve that query?
INT is 4 bytes long, while VARCHAR(6) can be as much as 12 bytes long (in a multibyte encoding). That increases the size of the index, and thus increases the time.
One thing you could consider is changing the type column to another datatype, namely ENUM. ENUM columns let you efficiently store a value from a limited set of possible values (and type columns often have a limited number of possible values). Because less space is used to store the data, the indexes on these columns are also smaller, and thus faster.
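A sketch of that change; the question doesn't say which table holds type, so baz below is just a stand-in, and apart from 'normal' the value list is made up (the list must cover every value already stored, otherwise those rows end up with an empty value):
ALTER TABLE baz
  MODIFY COLUMN `type` ENUM('normal', 'special', 'archived');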

Improve performance of count and sum when already indexed

First, here is the query I have:
SELECT
COUNT(*) as velocity_count,
SUM(`disbursements`.`amount`) as summation_amount
FROM `disbursements`
WHERE
`disbursements`.`accumulation_hash` = '40ad7f250cf23919bd8cc4619850a40444c5e90c978f88635a09ccf66a82ffb38e39ea51cdfd651b0ebdac5f5ca37cd7a17e0f60fea6cbce1397ccff5fa37346'
AND `disbursements`.`caller_id` = 1
AND `disbursements`.`active` = 1
AND (version_hash != '86b4111677294b27a1805643d193b8d437b6ddb170b4ed5dec39aa89bf070d160cbbcd697dfc1988efea8429b1f1557625bf956180c65d3dcd3a318280e0d2da')
AND (`disbursements`.`created_at` BETWEEN '2012-12-15 23:33:22'
AND '2013-01-14 23:33:22') LIMIT 1
Explain extended returns the following:
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| 1 | SIMPLE | disbursements | range | unique_request_index,index_disbursements_on_caller_id,disbursement_summation_index,disbursement_velocity_index,disbursement_version_out_index | disbursement_summation_index | 1543 | NULL | 191422 | 100.00 | Using where; Using index |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
The actual query counts about 95,000 rows. If I explain another query that hits ~50 rows the explain is identical, just with fewer rows estimated.
The index being chosen covers accumulation_hash, caller_id, active, version_hash, created_at, amount in that order.
I've tried playing around with doing COUNT(id) or COUNT(caller_id) since these are non-null fields and return the same thing as count(*), but it doesn't have any impact on the plan or the run time of the actual query.
This is also a heavy insert table, essentially every single query will have had a row inserted or updated since the last time it was run, so the mysql query cache isn't entirely useful.
Before I go and make some sort of bucketed time-sequence cache with something like memcache or redis, is there an obvious solution to getting this to work much faster? A normal ~50 row query returns in 5 ms, while the ones across 90k+ rows are taking 500-900 ms, and I really can't afford anything much past 100 ms.
I should point out the dates are a rolling 30 day window that needs to be essentially real time. Expiration could probably happen with ~1 minute granularity, but new items need to be seen immediately upon commit. I'm also on RDS, Read IOPS are essentially 0, and cpu is about 60-80%. When I'm not querying the giant 90,000+ record items, CPU typically stays below 10%.
You could try an index that has created_at before version_hash; that might give you a better shot at an index range scan. It's not clear how the non-equality predicate on version_hash affects the plan, but I suspect it disables a range scan on the created_at column.
Other than that, the query and the index look about as good as you are going to get; the EXPLAIN output shows the query being satisfied from the index.
And the performance of the statement doesn't sound too unreasonable, given that it's aggregating 95,000+ rows, especially given the key length of 1543 bytes. That's a much larger key size than I normally deal with.
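A sketch of the index reordering suggested above (the index name is made up; the column list simply moves created_at ahead of the inequality on version_hash):
CREATE INDEX disbursement_summation_created_idx
  ON disbursements (accumulation_hash, caller_id, active, created_at, version_hash, amount);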
What are the datatypes of the columns in the index, and what is the cluster key or primary key?
accumulation_hash - 128-character representation of 512-bit value
caller_id - integer or numeric (?)
active - integer or numeric (?)
version_hash - another 128-characters
created_at - datetime (8bytes) or timestamp (4bytes)
amount - numeric or integer
95,000 rows at 1543 bytes each is on the order of 140MB of data.

Why does mysql decide that this subquery is dependent?

On a MySQL 5.1.34 server, I have the following perplexing situation:
mysql> explain select * FROM master.ObjectValue WHERE id IN ( SELECT id FROM backup.ObjectValue ) AND timestamp < '2008-04-26 11:21:59';
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
| 1 | PRIMARY | ObjectValue | range | IX_ObjectValue_Timestamp,IX_ObjectValue_Timestamp_EventName | IX_ObjectValue_Timestamp_EventName | 9 | NULL | 541944 | Using where |
| 2 | DEPENDENT SUBQUERY | ObjectValue | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using index |
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
2 rows in set (0.00 sec)
mysql> select * FROM master.ObjectValue WHERE id IN ( SELECT id FROM backup.ObjectValue ) AND timestamp < '2008-04-26 11:21:59';
Empty set (2 min 48.79 sec)
mysql> select count(*) FROM master.ObjectValue;
+----------+
| count(*) |
+----------+
| 35928440 |
+----------+
1 row in set (2 min 18.96 sec)
How can it take 3 minutes to examine 500,000 records when it only takes 2 minutes to visit all records?
How can a subquery on a separate database be classified as dependent?
What can I do to speed up this query?
UPDATE:
The actual query that took a long time was a DELETE, but you can't run EXPLAIN on those; that is why I used a subselect above. I have now read the documentation and found the "DELETE FROM t USING ..." syntax. Rewriting the query from:
DELETE FROM master.ObjectValue
WHERE timestamp < '2008-06-26 11:21:59'
AND id IN ( SELECT id FROM backup.ObjectValue ) ;
into:
DELETE FROM m
USING master.ObjectValue m INNER JOIN backup.ObjectValue b ON m.id = b.id
WHERE m.timestamp < '2008-04-26 11:21:59';
Reduced the time from minutes to 0.01 seconds for an empty backup.ObjectValue.
Thank you all for the good advice.
The dependent subquery slows your outer query down to a crawl (I suppose you know it means it's run once per row found in the dataset being looked at).
You don't need the subquery there, and not using one will speed up your query quite significantly:
SELECT m.*
FROM master.ObjectValue m
JOIN backup.ObjectValue USING (id)
WHERE m.timestamp < '2008-06-26 11:21:59'
MySQL frequently treats subqueries as dependent even though they are not. I've never really understood the exact reasons for that; maybe it's simply because the query optimizer fails to recognize the subquery as independent. I never bothered looking into it in more detail, because in these cases you can virtually always move the subquery to the FROM clause, which fixes it.
For example:
DELETE FROM m WHERE m.rid IN (SELECT id FROM r WHERE r.xid = 10);
-- vs
DELETE m FROM m WHERE m.rid IN (SELECT id FROM r WHERE r.xid = 10);
The former will produce a dependent subquery and can be very slow. The latter will tell the optimizer to isolate the subquery, which avoids a table scan and makes the query run much faster.
Notice how it says there is only 1 row for the subquery? There is obviously more than 1 row. That is an indication that MySQL is loading only 1 row at a time. What MySQL is probably trying to do is "optimize" the subquery so that it only loads records from the subquery that also exist in the master query: a dependent subquery. This is how a join works, but the way you phrased your query you have forced a reversal of the optimized logic of a join.
You've told MySQL to load the backup table (the subquery) and then match it against the filtered result of the master table (timestamp < '2008-04-26 11:21:59'). MySQL determined that loading the entire backup table was probably not a good idea, so it decided to use the filtered result of the master query to filter the backup query. But the master query hasn't completed yet when the subquery needs to be filtered, so it has to check as it loads each record from the master query. Thus your dependent subquery.
As others mentioned, use a join, it's the right way to go. Join the crowd.
How can it take 3 minutes to examine 500000 records when it only takes 2 minutes to visit all records?
COUNT(*) is always transformed to COUNT(1) in MySQL, so it doesn't even have to read each record fully, and I would imagine that it uses in-memory indexes, which speeds things up. In the long-running query you use the range (<) and IN operators, so for each record it visits it has to do extra work, especially since it recognizes the subquery as dependent.
How can a subquery on a separate database be classified dependent?
Well, it doesn't matter that it's in a separate database. A subquery is dependent if it depends on values from the outer query, which you could still do in your case... but you don't, so it is indeed strange that it's classified as a dependent subquery. Maybe it is just a bug in MySQL, and that's why it's taking so long: it executes the inner query for every record selected by the outer query.
What can I do to speed up this query?
To start with, try using JOIN instead:
SELECT master.*
FROM master.ObjectValue master
JOIN backup.ObjectValue backup
ON master.id = backup.id
AND master.timestamp < '2008-04-26 11:21:59';
The real answer is: don't use MySQL; its optimizer is rubbish. Switch to Postgres, it will save you time in the long run.
To everyone saying "use JOIN": that's just nonsense perpetuated by the MySQL crowd, who have refused for 10 years to fix this glaringly horrible bug.

How can I make MySQL as fast as a flat file in this scenario?

Assume a key-value table with at least 10s of millions of rows.
Define an operation that takes a large number of IDs (again, 10s of millions) finds the corresponding values and sums them.
Using a database, this operation seems like it can approach (disk seek time) * (number of lookups).
Using a flat file, and reading through the entire contents, this operation will approach (file size)/(drive transfer rate).
Plugging in some (rough) values (from wikipedia and/or experimentation):
seek time = 0.5ms
transfer rate = 64MByte/s
file size = 800M (for 70 million int/double key/values)
65 million value lookups
DB time = 0.5ms * 65000000 = 32500s = 9 hours
Flat file = 800M/(64MB/s) = 12s
Experimental results are not as bad for MySQL, but the flat file still wins.
Experiments:
Create InnoDB and MyISAM id/value pair tables. e.g.
CREATE TABLE `ivi` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`val` double DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
Fill with 32 million rows of data of your choice. Query with:
select sum(val) from ivi where id not in (1,12,121,1121); -- be sure to change the numbers each time, or clear the query cache
Use the following code to create & read key/value flat file from java.
private static void writeData() throws IOException {
    long t = -System.currentTimeMillis();
    File dat = new File("/home/mark/dat2");
    if (dat.exists()) {
        dat.delete();
    }
    FileOutputStream fos = new FileOutputStream(dat);
    ObjectOutputStream os = new ObjectOutputStream(new BufferedOutputStream(fos));
    // Write 32 million sequential int/double pairs to the flat file.
    for (int i = 0; i < 32000000; i++) {
        os.writeInt(i);
        os.writeDouble(i / 2.0);
    }
    os.flush();
    os.close();
    t += System.currentTimeMillis();
    System.out.println("time ms = " + t);
}
private static void performSummationQuery() throws IOException {
    long t = -System.currentTimeMillis();
    File dat = new File("/home/mark/dat2");
    FileInputStream fin = new FileInputStream(dat);
    ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(fin));
    // IDs to exclude from the sum (the flat-file equivalent of "id not in (...)").
    HashSet<Integer> set = new HashSet<Integer>(Arrays.asList(11, 101, 1001, 10001, 100001));
    int i;
    double d;
    double sum = 0;
    try {
        // Stream the whole file, summing every value whose id is not excluded.
        while (true) {
            i = in.readInt();
            d = in.readDouble();
            if (!set.contains(i)) {
                sum += d;
            }
        }
    } catch (EOFException e) {
        // End of file reached; fall through with the accumulated sum.
    }
    System.out.println("sum = " + sum);
    t += System.currentTimeMillis();
    System.out.println("time ms = " + t);
}
RESULTS:
InnoDB 8.0-8.1s
MyISAM 3.1-16.5s
Stored proc 80-90s
FlatFile 1.6-2.4s (even after: echo 3 > /proc/sys/vm/drop_caches)
My experiments have shown that a flat file wins against the database here. Unfortunately, I still need to do "standard" CRUD operations on this table. But this is the use pattern that's killing me.
So what's the best way I can have MySQL behave like itself most of the time, yet win over a flat file in the above scenario?
EDIT:
To clarify some points:
1. I have dozens of such tables, some of which will have hundreds of millions of rows, and I cannot store them all in RAM.
2. The case I have described is what I need to support. The values associated with an ID might change, and the selection of IDs is ad hoc. Therefore there is no way to pre-generate and cache any sums; I need to do the work of "find each value and sum them all" every time.
Thanks.
Your numbers assume that MySQL will perform disk I/O 100% of the time, while in practice that is rarely the case. If your MySQL server has enough RAM and your table is indexed appropriately, your cache hit rate will rapidly approach 100% and MySQL will perform very little disk I/O as a direct result of your sum operation. If you frequently have to deal with calculations across 10,000,000 rows, you may also consider adjusting your schema to reflect real-world usage (keeping a "cached" sum on hand isn't always a bad idea, depending on your specific needs).
I highly recommend you put together a test database, throw in tens of millions of test rows, and run some real queries in MySQL to determine how the system will perform. Spending 15 minutes doing this will give you far more accurate information.
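One quick way to generate rows at that scale, as a sketch reusing the ivi definition from the question (each repeated INSERT ... SELECT doubles the row count):
INSERT INTO ivi (val) VALUES (0.5), (1.0), (1.5), (2.0);   -- seed a handful of rows
-- Repeat the next statement until the table is big enough: 4 rows doubled 23 times is ~33.5 million.
INSERT INTO ivi (val) SELECT val FROM ivi;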
Telling MySQL to ignore the primary (and only) index speeds both queries up.
For InnoDB it saves about a second on the queries. On MyISAM it keeps the query time consistently at the minimum time seen.
The change is to add
ignore index(`PRIMARY`)
after the table name in the query.
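So the test query from the question would become something like:
SELECT SUM(val)
FROM ivi IGNORE INDEX (`PRIMARY`)
WHERE id NOT IN (1, 12, 121, 1121);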
EDIT:
I appreciate all the input, but much of it was of the form "you shouldn't do this", "do something completely different", etc. None of it addressed the question at hand:
"So what's the best way I can have
MySQL behave like itself most of the
time, yet win over a flat file in the
above scenario?"
So far, the solution I have posted (use MyISAM and ignore the index) seems to be the closest to flat-file performance for this use case, while still giving me a database when I need one.
I'd use a summary table maintained by triggers, which gives sub-second performance; something like the following:
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)
Full schema
drop table if exists ivi_sum_total;
create table ivi_sum_total
(
tot decimal(65,5) default 0
)
engine=innodb;
drop table if exists ivi;
create table ivi
(
id int unsigned not null auto_increment,
val decimal(65,5) default 0,
primary key (id, val)
)
engine=innodb;
delimiter #
create trigger ivi_before_ins_trig before insert on ivi
for each row
begin
update ivi_sum_total set tot = tot + new.val;
end#
create trigger ivi_before_upd_trig before update on ivi
for each row
begin
update ivi_sum_total set tot = (tot - old.val) + new.val;
end#
-- etc...
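The elided part presumably also includes a delete trigger; a guess at what it would look like, following the same pattern:
create trigger ivi_before_del_trig before delete on ivi
for each row
begin
update ivi_sum_total set tot = tot - old.val;
end#
delimiter ;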
Testing
select count(*) from ivi;
+----------+
| count(*) |
+----------+
| 32000000 |
+----------+
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)
select sum(val) from ivi where id not in (1,12,121,1121);
+---------------------+
| sum(val) |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (29.89 sec)
select * from ivi_sum_total;
+---------------------+
| tot |
+---------------------+
| 1048317683047.43227 |
+---------------------+
1 row in set (0.03 sec)
select * from ivi where id = 2;
+----+-------------+
| id | val |
+----+-------------+
| 2 | 11781.30443 |
+----+-------------+
1 row in set (0.01 sec)
start transaction;
update ivi set val = 0 where id = 2;
commit;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
select * from ivi where id = 2;
+----+---------+
| id | val |
+----+---------+
| 2 | 0.00000 |
+----+---------+
1 row in set (0.00 sec)
select * from ivi_sum_total;
+---------------------+
| tot |
+---------------------+
| 1048317671266.12784 |
+---------------------+
1 row in set (0.00 sec)
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (0.01 sec)
select sum(val) from ivi where id not in (1,12,121,1121);
+---------------------+
| sum(val) |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (31.07 sec)
You are comparing apples and oranges as far as I can see. MySQL (or any other relational database) isn't meant to work with data that has to hit disk I/O all the time; that destroys the point of the index. Even worse, the index becomes a burden when it doesn't fit in RAM at all. That's why people use sharding and summary tables. In your example the size of the database (and so the disk I/O) would be much larger than the flat file, since there is a primary index on top of the data itself. As z5h stated, ignoring the primary index can save you some time, but it will never be as fast as a plain text file.
I would suggest you use summary tables: have a background job compute a summary, and UNION that summary table with the rest of the "live" table (a sketch follows the P.S. below). But even MySQL won't handle rapidly growing data well; after some hundreds of millions of rows it starts to struggle. That's why people work on distributed systems like HDFS and map/reduce frameworks like Hadoop.
P.S.: My technical examples are not 100% right; I just want to get the concepts across.
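A rough sketch of the summary-table-plus-UNION idea (all object names here are hypothetical; the background job would periodically advance max_id and fold the covered rows into total_val):
CREATE TABLE ivi_summary
(
  total_val DOUBLE NOT NULL,        -- sum of val for all rows with id <= max_id
  max_id    INT UNSIGNED NOT NULL   -- highest id already folded into total_val
) ENGINE=InnoDB;
-- At query time, combine the precomputed part with the still-"live" tail of the table:
SELECT SUM(val) AS grand_total
FROM
(
  SELECT total_val AS val FROM ivi_summary
  UNION ALL
  SELECT val FROM ivi WHERE id > (SELECT max_id FROM ivi_summary)
) AS combined;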
Is it a single-user system?
The performance of a flat file will degrade significantly with multiple users. With a DB, it "should" schedule disk reads to satisfy queries running in parallel.
There is one option nobody has considered yet...
Since the aforementioned Java code uses a HashSet, why not use a hash index?
By default, indexes in MyISAM tables use BTREE indexing.
By default, indexes in MEMORY tables use HASH indexing.
Simply force the MyISAM table to use a HASH index instead of a BTREE
CREATE TABLE `ivi`
(
`id` int(11) NOT NULL AUTO_INCREMENT,
`val` double DEFAULT NULL,
PRIMARY KEY (`id`) USING HASH
) ENGINE=MyISAM;
Now that should level the playing field a little. However, index range searching has poor performance when using a hash index. If you retrieve one id at a time, it should be faster than your previous testing in MyISAM.
If you want to load the data much faster
Get rid of the AUTO_INCREMENT property
Get rid of the primary key
Use a regular index
CREATE TABLE `ivi`
(
`id` int(11) NOT NULL,
`val` double DEFAULT NULL,
KEY id (`id`) USING HASH
) ENGINE=MyISAM;
Then do something like this:
ALTER TABLE ivi DISABLE KEYS;
...
... (Load data and manually generate id)
...
ALTER TABLE ivi ENABLE KEYS;
This will build the index after the data is done being loaded.
You should also consider sizing the key_buffer_size in /etc/my.cnf to handle large numbers of MyISAM keys.
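For example (the number is only illustrative, and the equivalent key_buffer_size line can go under [mysqld] in /etc/my.cnf):
SET GLOBAL key_buffer_size = 512 * 1024 * 1024;   -- 512MB of MyISAM key cache; size it to the available RAM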
Give it a try and let us know if this helped and what you found!
You might want to have a look at NDBAPI. I imagine these people were able to achieve speeds close to working with a flat file while still having the data stored in InnoDB.