Which performs better in a MySQL where clause: YEAR() vs BETWEEN? - mysql

I need to find all records created in a given year from a MySQL database. Is there any way that one of the following would be slower than the other?
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
or
WHERE YEAR(create_date) = '2009'

This:
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
...works better because it doesn't alter the data in the create_date column. That means that if there is an index on the create_date, the index can be used--because the index is on the actual value as it exists in the column.
An index can't be used on YEAR(create_date), because it's only using a portion of the value (that requires extraction).

Whenever you use a function against a column, it must perform the function on every row in order to see if it matches the constant. This prevents the use of an index.
The basic rule of thumb, then, is to avoid using functions on the left side of the comparison.
Sargable means that the DBMS can use an index. Use a column on the left side and a constant on the right side to allow the DBMS to utilize an index.
Even if you don't have an index on the create_date column, there is still overhead on the DBMS to run the YEAR() function for each row. So, no matter what, the first method is most likely faster.
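To make this concrete, here is a small sketch using Python's sqlite3 module as a stand-in (MySQL's EXPLAIN shows the same pattern; the table and index names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table; the second column keeps the index non-covering.
con.execute("CREATE TABLE orders (create_date TEXT, amount REAL)")
con.execute("CREATE INDEX idx_create_date ON orders (create_date)")

def plan(sql):
    """Return the query plan as one string."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Sargable: bare column compared to constants -- the index can be used.
sargable = plan("SELECT * FROM orders "
                "WHERE create_date >= '2009-01-01' AND create_date < '2010-01-01'")

# Non-sargable: a function wraps the column -- forces a full scan.
non_sargable = plan("SELECT * FROM orders "
                    "WHERE strftime('%Y', create_date) = '2009'")

print(sargable)      # SEARCH ... USING INDEX idx_create_date ...
print(non_sargable)  # SCAN ... (no index used)
```

The same experiment against a real MySQL table with EXPLAIN shows a range access type for the first form and a full table scan for the second.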

I would expect the former to be quicker as it is sargable.

Ideas:
Examine the explain plans; if they are identical, query performance will probably be nearly the same.
Test the performance on a large corpus of test data (which has most of its rows in years other than 2009) on a production-grade machine (ensure that the conditions are the same, e.g. cold / warm caches)
But I'd expect BETWEEN to win, unless the optimiser is clever enough to do the optimisation for YEAR(), in which case they would be the same.
ANOTHER IDEA:
I don't think you care.
If you have only a few records per year, then the query would be fast even if it did a full table scan, because even with (say) 100 years' data, there are so few records.
If you have a very large number of records per year (say 10^8) then the query would be very slow in any case, because returning that many records takes a long time.
You didn't say how many years' data you keep. I guess if it's an archaeological database, you might have a few thousand, in which case you might care if you have a massive load of data.
I find it extremely unlikely that your application will actually notice the difference between a "good" explain plan (using an index range scan) and a "bad" explain plan (full table scan) in this case.

Related

Making a groupby query faster

This is my data from my table:
I have exactly one million rows, so this is just a snippet.
I would like to make this query faster:
It basically groups the values by time (ev represents the year, honap the month, and so on). The problem is that it takes a lot of time. I tried to apply indexes, as you can see here:
but it does absolutely nothing.
Here is my index:
I have also tried adding perc (which represents the minute) because of its cardinality, but MySQL doesn't want to use it. Could you give me any suggestions?
Is the data realistic? If so, why run the query -- it essentially delivers exactly what was in the table.
If, on the other hand, you had several rows per minute, then the GROUP BY makes sense.
The index you have is not worth using. However, the Optimizer seemed to like it. That's a bug.
In that case, I would simplify it to this:
SELECT AVG(konyha1) AS 'avg',
LEFT(time, 16) AS 'time'
FROM onemilliondata
GROUP BY LEFT(time, 16)
A DATE or TIME or DATETIME can be treated as such a datatype or as a VARCHAR; LEFT() asks for it to be treated as a string.
Even in this case, no index is useful. However, this would make it a little faster:
PRIMARY KEY(time)
and the table would have only 2 columns: time, konyha1.
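As a rough illustration of that per-minute grouping (using SQLite via Python as a stand-in, where LEFT(time, 16) is spelled substr(time, 1, 16); the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical sample: several readings within the same minute.
con.execute("CREATE TABLE onemilliondata (time TEXT, konyha1 REAL)")
con.executemany("INSERT INTO onemilliondata VALUES (?, ?)", [
    ("2020-01-01 12:00:05", 10.0),
    ("2020-01-01 12:00:35", 20.0),
    ("2020-01-01 12:01:10", 30.0),
])

# Truncate the datetime string to minute precision and average per minute.
# (SQLite spells LEFT(time, 16) as substr(time, 1, 16).)
rows = con.execute(
    "SELECT substr(time, 1, 16) AS t, AVG(konyha1) AS avg_v "
    "FROM onemilliondata GROUP BY substr(time, 1, 16) ORDER BY t"
).fetchall()
print(rows)  # [('2020-01-01 12:00', 15.0), ('2020-01-01 12:01', 30.0)]
```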
It is rarely beneficial to break a date and/or time into components and put them into columns.
A million points will probably choke a graphing program -- and the screen, which has a resolution of only a few thousand pixels.
Perhaps you should group by hour? And use LEFT(time, 13)? Performance would probably be slightly faster -- but only because less data is being sent to the client.
If you are collecting this data "forever", consider building and maintaining a "summary table" of the averages for each unit of time. Then the incremental effort is, say, aggregating yesterday's data each morning.
You might find MIN(konyha1) and MAX(konyha1) interesting to keep on an hourly or daily basis. Note that daily or weekly aggregates can be derived from hourly values.

Indexing not working when large data affected in where condition

I have a query, as follows:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/06/30 23:59:00'
It uses indexing effectively. Screen shot of result of explain:
There are 154500 records in date range '2016/06/01 00:00:00' AND '2016/07/26 23:59:00'.
But when I increase data range as,
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
Now this is not using indexing. Screen shot of result of explain:
There are 3089464 records in date range '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
After increasing the date range, the query no longer uses the index, so it becomes very slow, even though I am forcing it to use the index. I can't figure out why this is happening, since nothing has changed in the query or the indexing. Can you please help me understand why?
Don't use USE INDEX or FORCE INDEX. This will slow down the query when most of the table is being accessed. In particular, the Optimizer will decide, rightly, to do a table scan if the index seems to point to more than about 20% of the rows. Using an index involves bouncing back and forth between the index and the data, whereas doing a table scan smoothly reads the data sequentially (albeit having to skip over many of the rows).
There is another solution to the real problem. I assume you are building "reports" summarizing data from a large Data Warehouse table?
Instead of always starting with raw data ('Fact' table), build and maintain a "Summary Table". For your data, it would probably have 1 row per day. Each night you would tally the SUMs and COUNTs for the various things for the day. Then the 'report' would sum the sums and sum the counts to get the desired tallies for the bigger date range.
More discussion: http://mysql.rjweb.org/doc.php/summarytables
Your 'reports' will run more than 10 times as fast, and you won't even be tempted to FORCE INDEX. After all, 60 rows should be a lot faster than 3089464.
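A minimal sketch of the summary-table approach, using SQLite via Python as a stand-in and hypothetical column names borrowed from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical 'fact' table and a one-row-per-day summary table.
con.execute("CREATE TABLE caseDetails (updatedAt TEXT, amountPaid REAL)")
con.execute("""CREATE TABLE caseSummary (
    day TEXT PRIMARY KEY, paid_sum REAL, paid_count INTEGER)""")
con.executemany("INSERT INTO caseDetails VALUES (?, ?)", [
    ("2016-06-01 09:00:00", 100.0),
    ("2016-06-01 17:30:00", 50.0),
    ("2016-06-02 11:00:00", 200.0),
])

# Nightly job: tally each day's raw rows into a single summary row.
con.execute("""
    INSERT INTO caseSummary
    SELECT substr(updatedAt, 1, 10), SUM(amountPaid), COUNT(amountPaid)
    FROM caseDetails GROUP BY substr(updatedAt, 1, 10)""")

# Report: sum the sums and sum the counts over the date range.
paid, cnt = con.execute(
    "SELECT SUM(paid_sum), SUM(paid_count) FROM caseSummary "
    "WHERE day BETWEEN '2016-06-01' AND '2016-06-02'").fetchone()
print(paid, cnt)  # 350.0 3
```

The report now touches one row per day instead of rescanning millions of raw rows.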
less time (more likely)
Using an index can be inferior even when it would mean fewer disk reads (see below). Most disk drives support bulk reads: you request data from a certain block/page plus the n following pages. This is especially fast on rotating disks, tapes, and other hardware where sequential access is dramatically more efficient than random access.
Essentially you gain a time advantage by sequential read versus random access.
fewer disk reads (less likely)
Using an index is effective when you actually gain speed or efficiency: an index is good when it significantly reduces the number of disk reads and thus the time needed. When reading the index and then the rows it points to results in almost as many disk reads as reading the whole table, using the index is probably unwise.
This will probably happen if your data is spread out enough (in respect to search criteria), so that you most likely have to read (almost) all pages/blocks anyway.
ideas for a fix
If you only access your table in this way (that is, the date is the most important search criterion), it might very well be worth the time to order the data on disk. I believe MySQL might provide such a feature (OPTIMIZE TABLE appears to do some of this).
this would decrease query duration for index usage (and the index is more likely to be used)
alternatives
see post from Rick James (essentially: store aggregates instead of repeatedly calculating them)
It has been a long time since I asked this question, and I now have a better solution that works really smoothly for me. I hope my answer may help someone.
I used partitioning and observed that the performance of the query is much higher now. I altered the table by creating range partitioning on the updatedAt column.
Range Partitioning
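For reference, a range-partitioning ALTER along these lines might look like the following sketch (partition names and boundaries are hypothetical; note also that MySQL requires the partitioning column to be part of every unique key on the table):

```sql
-- Hypothetical monthly range partitions on updatedAt.
ALTER TABLE caseDetails
PARTITION BY RANGE (TO_DAYS(updatedAt)) (
    PARTITION p2016_06 VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p2016_07 VALUES LESS THAN (TO_DAYS('2016-08-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
```

With this in place, a date-range WHERE clause lets MySQL prune to the relevant partitions instead of scanning the whole table.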

Rails 4/ postgresql index - Should I use as index filter a datetime column which can have infinite number of values?

I need to optimize a query fetching all the deals in a certain country that were last accessed by users before a certain datetime.
My plan is to implement the following index
add_index(:deals, [:country_id, :last_user_access_datetime])
I doubt the relevance and efficiency of this index, since the column last_user_access_datetime can have ANY date value, e.g. 13/09/2015 3:06pm, and it will change very often (it is updated each time a user accesses the deal).
Doesn't that mean an effectively unlimited number of distinct values would be indexed?
Should I do it, or should I avoid putting a column with unbounded possible values, such as a totally free datetime column, inside an index?
If you have a query like this:
select t.*
from table t
where t.country_id = :country_id and t.last_user_access_datetime >= :some_datetime;
Then the best index is the one you propose.
If you have a heavy load on the machine in terms of accesses (and think many accesses per second), then maintaining the index can become a burden on the machine. Of course, you are updating the last access date time value anyway, so you are already incurring overhead.
The number of possible values does not affect the usefulness of the index. A database cannot store an "infinite" number of values (at least not on any hardware currently available), so I'm not sure what your concern is.
The index will be used. UPDATE and INSERT statements just take that much longer, because the index is updated each time as well. For tables with many more UPDATEs/INSERTs than SELECTs, it may not be fruitful to index the column. Or, you may want to make an index that looks more like the types of queries hitting the table: include the IDs and timestamps that are in the SELECT clause, include the IDs and timestamps that are in the WHERE clause, etc.
Also, if a table has a lot of DELETEs, a lot of indices can slow down operations a lot.

Mysql query very slow within date range in timestamp

There is one table in MySQL which has about 1.76 million records and is growing. It seems like the more records it has, the slower it gets. It takes about 65 seconds to run the simple query below. date_run is a TIMESTAMP field; I wonder if that makes it slower to run. Any suggestions for tweaks in the options file to make this babe faster?
select *
from stocktrack
where date(date_run) >= '2014-5-22'
and date(date_run) <= '2014-5-29'
mysql version 5.6
Windows 8.1 64 bit
Intel Core i7-4770, 3.40Ghz 12gb RAM
To improve performance of this query, have a suitable index available (with date_run as the leading column in the index), and reference the "bare column" in equivalent predicates.
Wrapping the column in a function (like DATE(), as in your query) prevents the MySQL optimizer from using a range scan operation. With your query, even with an index available, MySQL does a full scan of every single row in the table, each time you run that query.
For improved performance, use predicate on the "bare" column, like this, for example:
WHERE date_run >= '2014-5-22' AND date_run < '2014-5-29' + INTERVAL 1 DAY
(Note that when we leave out the time portion of a date literal, MySQL assumes a time component of midnight '00:00:00'. We know every datetime/timestamp value with a date component equal to '2014-05-29' is guaranteed to be less than midnight of '2014-05-30'.)
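That boundary logic is easy to sanity-check; the sketch below (plain Python, dates taken from the question) verifies the half-open range:

```python
from datetime import datetime, timedelta

# The half-open range [start, end + 1 day) captures every timestamp on the
# last day, including times that BETWEEN ... '2014-05-29' (midnight) misses.
start = datetime(2014, 5, 22)
upper = datetime(2014, 5, 29) + timedelta(days=1)  # '2014-05-29' + INTERVAL 1 DAY

def in_range(ts):
    return start <= ts < upper

assert in_range(datetime(2014, 5, 29, 23, 59, 59))   # late on the 29th: included
assert not in_range(datetime(2014, 5, 30, 0, 0, 0))  # midnight of the 30th: excluded
print("half-open range behaves as described")
```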
An appropriate index is needed for MySQL to use an efficient range scan operation for this particular query. The simplest index suitable for this query would be:
... ON stocktrack (date_run)
(Note that any index with date_run as the leading column would be suitable.)
The range scan operation using an index is (usually) much more efficient (and faster) on large sets, because MySQL can very quickly eliminate vast swaths of rows from consideration. Absent a range scan operation, MySQL has to check every single row in the table.
Use EXPLAIN to compare MySQL query plans, between the original and the modified.
The question you asked...
"... any tweaks to the options file ..."
The answer to that question is really going to depend on which storage engine you are using (MyISAM or InnoDB). The biggest bang for the buck comes from allocating sufficient buffer area to hold database blocks in memory, to reduce I/O... but that's at the cost of having less memory available to whatever else is running, and there's no benefit to over allocating memory. Questions about MySQL server tuning, beyond query performance, would probably be better asked on dba.stackexchange.com.
First, create an index on the date_run column, then compare against bare values like this (assuming your format is 'Y-m-d H:i:s'):
select *
from stocktrack
where date_run >= '2014-05-22 00:00:00'
and date_run < '2014-05-30 00:00:00'
(Note the exclusive upper bound: using <= '2014-05-29 00:00:00' would exclude everything on May 29 after midnight.)

Does it improve performance to index a date column?

I have a table with millions of rows where one of the columns is a TIMESTAMP and against which I frequently select for date ranges. Would it improve performance any to index that column, or would that not furnish any notable improvement?
EDIT:
So, I've indexed the TIMESTAMP column. The following query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
Takes 3.1 seconds.
There are just over 3 million records in the interactions table.
The above query produces a result of ~976k
Does this seem like a reasonable amount of time to perform this task?
If you want improvement on the efficiency of queries, you need 2 things:
First, index the column.
Second, and this is more important, make sure the conditions on your queries are sargable, i.e. that indexes can be used. In particular, functions should not be used on the columns. In your example, one way to write the condition would be:
WHERE interaction_time >= '2013-10-10'
AND interaction_time < (CURRENT_DATE + INTERVAL 1 DAY)
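A small sketch of the difference, again using SQLite via Python as a stand-in (the index name and sample rows are invented); both forms return the same count, but only the bare-column form can use the index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE interactions (interaction_time TEXT, user_id INTEGER)")
con.execute("CREATE INDEX idx_time ON interactions (interaction_time)")
con.executemany("INSERT INTO interactions VALUES (?, ?)", [
    ("2013-10-09 23:00:00", 1),   # before the range
    ("2013-10-10 08:00:00", 2),   # inside
    ("2013-10-11 12:30:00", 3),   # inside
])

# Non-sargable original vs. sargable rewrite: same answer, different plans.
slow = con.execute("SELECT COUNT(*) FROM interactions "
                   "WHERE date(interaction_time) >= '2013-10-10'").fetchone()[0]
fast = con.execute("SELECT COUNT(*) FROM interactions "
                   "WHERE interaction_time >= '2013-10-10'").fetchone()[0]
assert slow == fast == 2

plan = con.execute("EXPLAIN QUERY PLAN SELECT COUNT(*) FROM interactions "
                   "WHERE interaction_time >= '2013-10-10'").fetchall()
print(plan[-1][-1])  # SEARCH ... INDEX idx_time ...
```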
The general rule with indexes is they speed retrieval of data with large data sets, but SLOW the insertion and update of records.
If you have millions of rows, and need to select a small subset of them, then an index most likely will improve performance when doing a SELECT. (If you need most or all of them if will make little or no difference.)
Without an index, a table scan (ie read of every record to locate required ones) will occur which can be slow.
With tables with only a few records, a table scan can actually be faster than an index, but this is not your situation.
Another consideration is how many discrete values you have. If you only have a handful of different dates, indexing probably won't help much if at all, however if you have a wide range of dates the index will most likely help.
One caveat, if the index is very big and won't fit in memory, you may not get the performance benefits you might hope for.
Also you need to consider what other fields you are retrieving, joins etc, as they all have an impact.
A good way to check how performance is impacted is to use the EXPLAIN statement to see how MySQL will execute the query.
It would improve performance if:
there are at least "several" different values
your query uses a date range that would select less than "most" of the rows
To find out for sure, use EXPLAIN to show which index is being used. Run EXPLAIN before creating the index and again after; you should see whether the new index is being used. If it is, you can be confident performance is better.
You can also simply compare query timings.
For the query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
to be optimized, you need to do the following:
Use just interaction_time instead of date(interaction_time)
Create an index that covers interaction_time column
(optional) Use just '2013-10-10' not date('2013-10-10')
You need #1 because indexes are only used if the columns appear in comparisons as-is, not as arguments to other expressions.
Adding an index on date column definitely increases performance.
My table has 11 million rows, and a query to fetch rows which were updated on a particular date took the following times:
Without index: ~2.5s
With index: ~5ms