MySQL query very slow within date range on a TIMESTAMP column

There is a table in my MySQL database that has about 1.76 million records and is growing. It seems like the more records it has, the slower it gets. The simple query below takes about 65 seconds to run. date_run is a TIMESTAMP field; I wonder if that makes it slower to run. Are there any settings I can tweak in the options file to make this faster?
select *
from stocktrack
where date(date_run) >= '2014-5-22'
and date(date_run) <= '2014-5-29'
MySQL version 5.6
Windows 8.1, 64-bit
Intel Core i7-4770 @ 3.40 GHz, 12 GB RAM

To improve performance of this query, have a suitable index available (with date_run as the leading column in the index), and reference the "bare column" in equivalent predicates.
Wrapping the column in a function (like DATE(), as in your query) prevents the MySQL optimizer from using a range scan operation. With your query, even with an index available, MySQL does a full scan of every single row in the table each time you run it.
For improved performance, use a predicate on the "bare" column, for example:
WHERE date_run >= '2014-5-22' AND date_run < '2014-5-29' + INTERVAL 1 DAY
(Note that when we leave out the time portion of a date literal, MySQL assumes a time component of midnight '00:00:00'. We know every datetime/timestamp value with a date component equal to '2014-05-29' is guaranteed to be less than midnight of '2014-05-30'.)
An appropriate index is needed for MySQL to use an efficient range scan operation for this particular query. The simplest index suitable for this query would be:
... ON stocktrack (date_run)
(Note that any index with date_run as the leading column would be suitable.)
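Putting the two pieces together, a minimal sketch (the index name stocktrack_ix1 is just an example):
CREATE INDEX stocktrack_ix1 ON stocktrack (date_run);

SELECT *
  FROM stocktrack
 WHERE date_run >= '2014-05-22'
   AND date_run <  '2014-05-29' + INTERVAL 1 DAY;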
The range scan operation using an index is (usually) much more efficient (and faster) on large sets, because MySQL can very quickly eliminate vast swaths of rows from consideration. Absent a range scan operation, MySQL has to check every single row in the table.
Use EXPLAIN to compare MySQL query plans, between the original and the modified.
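For example, run both forms through EXPLAIN and compare the type and key columns of the output:
-- original form (expect type = ALL, i.e. a full table scan)
EXPLAIN SELECT * FROM stocktrack
 WHERE date(date_run) >= '2014-5-22' AND date(date_run) <= '2014-5-29';

-- modified form (with the index in place, expect type = range and key = the new index)
EXPLAIN SELECT * FROM stocktrack
 WHERE date_run >= '2014-05-22' AND date_run < '2014-05-29' + INTERVAL 1 DAY;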
The question you asked...
"... any tweaks to the options file ..."
The answer to that question really depends on which storage engine you are using (MyISAM or InnoDB). The biggest bang for the buck comes from allocating a buffer area large enough to hold database blocks in memory, to reduce I/O... but that comes at the cost of having less memory available to whatever else is running, and there is no benefit to over-allocating memory. Questions about MySQL server tuning, beyond query performance, would probably be better asked on dba.stackexchange.com.
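As a rough illustration only (the numbers below are placeholders, not recommendations; the right values depend on your engine and on what else the machine runs), the relevant settings in the options file (my.ini on Windows) look like this:
[mysqld]
# InnoDB: in-memory cache for data and index pages
innodb_buffer_pool_size = 4G
# MyISAM: cache for index blocks (data blocks rely on the OS file cache)
key_buffer_size = 512M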

First, create an index on the date_run column, then compare against full datetime literals like this (assuming your format is 'Y-m-d H:i:s'):
SELECT *
FROM stocktrack
WHERE date_run >= '2014-05-22 00:00:00'
  AND date_run < '2014-05-30 00:00:00'
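The index mentioned in the first step could be created like this (the index name idx_date_run is just a placeholder):
ALTER TABLE stocktrack ADD INDEX idx_date_run (date_run);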

Related

Do I need a NoSQL solution like Elasticsearch to query 2 billion rows in one table, or is MySQL enough?

I have an application that will go to production soon. The application will insert approximately 10 million rows into one table with 16 columns in an Amazon Aurora MySQL database. The column types are BIGINT, INT, BIT, and DATETIME. The other tables in the database have fewer than 3 thousand rows.
We will run SQL queries with a few inner joins, and the WHERE clause will only have a datetime range and the BIGINT value of one column on that large table, covering the last 6 months. That means we will have about 2 billion rows in that large table.
Data older than 6 months will be deleted from that large table.
If I put an index on the date column, it will probably slow down inserts, and maybe querying will still be slow.
If I instead use Elasticsearch and create an application task that inserts rows into multiple shards grouped by date every 5 minutes, creates MySQL database backups every 3 days, and deletes rows older than 3 days, maybe querying the data will be faster.
What do you think?
Is it better and more efficient to use Elasticsearch, or is MySQL enough?
It depends on what you are using your data for (analysis and log handling are two completely different purposes) and how you access it.
What are you worried about?
Query times: indexing the column that will be queried most will increase speed. Designing this table with a snowflake schema may also help, and there are other options, such as picking a column-oriented DBMS, that could improve speed. It also depends on the type of data you are storing (you are using dates, so Timescale could help). If this is all about logging, then Elasticsearch can handle it pretty nicely, but it seems like you already know what data will fit.
Storage issues: if you are worried about really large tables, there are distributed database systems (e.g. VoltDB), which provide redundancy as well.
The last thing I can think of is how it will fit into your architecture. If you are already using MySQL, you don't really need to change much for 10 million rows. The points above are for the case where the size keeps growing and you need specific solutions.
If MySQL only...
1 billion or 10 million -- MySQL can handle either. But there are limitations on what queries will run "fast".
I need to see a specific query.
Purging old data -- Plan on PARTITIONing by the datetime column; otherwise the DELETE will be terribly slow. See Partition. For "6 months", I recommend 8 monthly or 29 weekly partitions; see the link for a discussion of the starting and "future" partitions.
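A rough sketch of what that could look like (table, column, and partition names here are illustrative, and note that MySQL requires the partitioning column to be part of every unique key, including the primary key):
-- illustrative names; real boundaries would be added/dropped by a scheduled job
ALTER TABLE big_fact_table
  PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2022_03 VALUES LESS THAN (TO_DAYS('2022-04-01')),
    PARTITION p2022_04 VALUES LESS THAN (TO_DAYS('2022-05-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
  );

-- purging a month of old data then becomes a near-instant metadata operation
ALTER TABLE big_fact_table DROP PARTITION p2022_03;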
DATE / DATETIME
It is usually unwise to have an index starting with the datetime column. I need to see your query to go into more detail on what the index should look like.
A DATE column is equivalent to a DATETIME for midnight of that morning. I recommend the following pattern for testing against a date or datetime range; it avoids leap-year issues, midnight issues, etc. This correctly tests for a one-week range:
WHERE d >= '2022-04-24'
AND d < '2022-04-24' + INTERVAL 7 DAY
You want a 1 day range?
WHERE d >= '2022-04-24'
AND d < '2022-04-24' + INTERVAL 1 DAY
Or noon to noon the next day:
WHERE d >= '2022-04-24 12:00:00'
AND d < '2022-04-24 12:00:00' + INTERVAL 24 HOUR
Those work for DATE, DATETIME, TIMESTAMP -- with or without fractional seconds.
Extra indexes are not a big deal. By "10M rows", did you mean 10M/day? That is about 120/second (or more during spikes). That is "moderate", assuming SSD drives are being used.
Is your INSERT application single-threaded? Can it batch, say, 100 rows at a time?
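Batching means sending many rows in one INSERT statement instead of one statement per row, for example (table and column names here are made up for illustration):
INSERT INTO big_fact_table (device_id, reading, recorded_at)
VALUES (101, 20.4, '2022-04-24 12:00:00'),
       (101, 20.7, '2022-04-24 12:00:05'),
       (102, 19.9, '2022-04-24 12:00:05');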
If latitude/longitude are involved, say so; that is a different kettle of fish.
Will 2 billion rows slow down Inserts? I need to see the tentative list of indexes (PRIMARY, UNIQUE, SPATIAL, FULLTEXT, and others). Properly designed, I don't see a problem.
Normalization
You should normalize the Fact table to help with disk space (hence speed), but don't over-normalize. To advise further, I need a feel for the data, not just the datatypes. Do not normalize datetime columns or any other columns to be tested as a "range"; such values need to be in the Fact table.
What you have said so far does not indicate the need for sharding (in a MySQL-only implementation).
(I cannot address whether Elasticsearch would be better or worse than MySQL. NoSQL requires re-inventing much of what MySQL or ES do automatically.)

Index not used when a large amount of data matches the WHERE condition

I have a query, as follows:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/06/30 23:59:00'
It uses the index effectively; I verified this with EXPLAIN.
There are 154,500 records in the date range '2016/06/01 00:00:00' to '2016/07/26 23:59:00'.
But when I increase the date range, as in:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
Now it is not using the index (again checked with EXPLAIN).
There are 3,089,464 records in the date range '2016/06/01 00:00:00' to '2016/07/30 23:59:00'.
After increasing the date range, the query is not using the index anymore, so it gets very slow, even though I am forcing it to use the index. I cannot figure out why this is happening, as there is no change in the query or the index. Can you please help me understand why this is happening?
Don't use USE INDEX or FORCE INDEX. This will slow down the query when most of the table is being accessed. In particular, the Optimizer will decide, rightly, to do a table scan if the index seems to point to more than about 20% of the rows. Using an index involves bouncing back and forth between the index and the data, whereas doing a table scan smoothly reads the data sequentially (albeit having to skip over many of the rows).
There is another solution to the real problem. I assume you are building "reports" summarizing data from a large Data Warehouse table?
Instead of always starting with raw data ('Fact' table), build and maintain a "Summary Table". For your data, it would probably have 1 row per day. Each night you would tally the SUMs and COUNTs for the various things for the day. Then the 'report' would sum the sums and sum the counts to get the desired tallies for the bigger date range.
More discussion: http://mysql.rjweb.org/doc.php/summarytables
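A sketch of the idea, using the columns from the query above (the summary-table name and exact column types are assumptions):
CREATE TABLE caseDetails_daily (
  summary_date DATE NOT NULL PRIMARY KEY,
  total_cases  INT UNSIGNED NOT NULL,
  total_paid   DECIMAL(14,2) NOT NULL
);

-- nightly job: tally yesterday's rows once
INSERT INTO caseDetails_daily (summary_date, total_cases, total_paid)
SELECT DATE(updatedAt), COUNT(id), SUM(amountPaid)
  FROM caseDetails
 WHERE updatedAt >= CURRENT_DATE - INTERVAL 1 DAY
   AND updatedAt <  CURRENT_DATE
 GROUP BY DATE(updatedAt);

-- the report then sums the daily sums over the wanted range
SELECT SUM(total_cases), SUM(total_paid)
  FROM caseDetails_daily
 WHERE summary_date >= '2016-06-01'
   AND summary_date <  '2016-08-01';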
Your 'reports' will run more than 10 times as fast, and you won't even be tempted to FORCE INDEX. After all, reading 60 rows should be a lot faster than reading 3,089,464.
less time (more likely)
Using an index might be inferior even when it would mean fewer disk reads (see below). Most disk drives support bulk reads: you request data from a certain block/page and from the n following pages. This is especially fast for rotating disks, tapes, and other drives where accessing data sequentially is more efficient than random access (really, much more efficient).
Essentially you gain a time advantage by sequential read versus random access.
fewer disk reads (less likely)
Using an index is effective when you actually gain speed/efficiency, i.e. when it significantly reduces the number of disk reads and the time needed. When reading the index plus the rows it points to costs almost as many disk reads as reading the whole table, using the index is probably unwise.
This will typically happen if your data is spread out enough (with respect to the search criteria) that you most likely have to read (almost) all pages/blocks anyway.
ideas for a fix
If you only access your table in this way (that is, the date is the most important search criterion), it may well be worth the time to order the data on disk. I believe MySQL provides such a feature (OPTIMIZE TABLE appears to do some of this).
This would decrease query duration when the index is used (and make the index more likely to be used).
alternatives
See the post from Rick James above (essentially: store aggregates instead of repeatedly calculating them).
It has been a long time since I asked this question, and I now have a better solution that is working really smoothly for me. I hope my answer may help someone.
I used partitioning, and the performance of the query is much better now. I altered the table to create range partitioning on the updatedAt column, roughly as sketched below.
Range Partitioning
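For reference, the kind of statement involved looks roughly like this (the partition names and boundaries are illustrative; MySQL also requires every unique key, including the primary key, to contain the partitioning column):
ALTER TABLE caseDetails
  PARTITION BY RANGE (TO_DAYS(updatedAt)) (
    PARTITION p201606 VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p201607 VALUES LESS THAN (TO_DAYS('2016-08-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
  );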

Does it improve performance to index a date column?

I have a table with millions of rows where one of the columns is a TIMESTAMP and against which I frequently select for date ranges. Would it improve performance any to index that column, or would that not furnish any notable improvement?
EDIT:
So, I've indexed the TIMESTAMP column. The following query:
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
takes 3.1 seconds.
There are just over 3 million records in the interactions table.
The above query produces a result of ~976k
Does this seem like a reasonable amount of time to perform this task?
If you want to improve the efficiency of your queries, you need 2 things:
First, index the column.
Second, and this is more important, make sure the conditions on your queries are sargable, i.e. that indexes can be used. In particular, functions should not be used on the columns. In your example, one way to write the condition would be:
WHERE interaction_time >= '2013-10-10'
AND interaction_time < (CURRENT_DATE + INTERVAL 1 DAY)
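For the first point, creating the index could look like this (the index name is just an example):
CREATE INDEX ix_interactions_time ON interactions (interaction_time);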
The general rule with indexes is they speed retrieval of data with large data sets, but SLOW the insertion and update of records.
If you have millions of rows and need to select a small subset of them, then an index will most likely improve performance when doing a SELECT. (If you need most or all of them, it will make little or no difference.)
Without an index, a table scan (i.e. a read of every record to locate the required ones) will occur, which can be slow.
With tables with only a few records, a table scan can actually be faster than an index, but this is not your situation.
Another consideration is how many discrete values you have. If you only have a handful of different dates, indexing probably won't help much if at all, however if you have a wide range of dates the index will most likely help.
One caveat, if the index is very big and won't fit in memory, you may not get the performance benefits you might hope for.
Also you need to consider what other fields you are retrieving, joins etc, as they all have an impact.
A good way to check how performance is impacted is to use the EXPLAIN statement to see how MySQL will execute the query.
It would improve performance if:
there are at least "several" different values
your query uses a date range that would select less than "most" of the rows
To find out for sure, use EXPLAIN to show which index is being used. Run EXPLAIN before creating the index and again after; you should see whether the new index is being used. If it is being used, you can be confident performance is better.
You can also simply compare query timings.
For the query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
to be optimized, you need to do the following:
Use just interaction_time instead of date(interaction_time)
Create an index that covers the interaction_time column
(optional) Use just '2013-10-10' not date('2013-10-10')
You need #1 because indexes are only used if the columns are used in comparisons as-is, not as arguments to other expressions.
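Putting points 1 and 3 together, the rewritten query would look roughly like this (the upper bound mirrors the original date(now()), i.e. it still includes all of today):
SELECT COUNT(*)
  FROM interactions
 WHERE interaction_time >= '2013-10-10'
   AND interaction_time <  CURRENT_DATE + INTERVAL 1 DAY;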
Adding an index on a date column definitely increases performance.
My table has 11 million rows, and a query to fetch rows that were updated on a particular date took the following times:
Without index: ~2.5s
With index: ~5ms

Which sorting algorithm(s) does MySQL use?

If MySQL is to run a query select * from table order by datetime, where datetime is a datetime column, on a table with >10 million rows, which sorting algorithm does it use?
You can find the details in the documentation on Order By Optimisation.
Essentially the MySQL engine will consider using an index, if a suitable one is available, and it is estimated that using it would be beneficial to the performance.
If no such index is selected, then a so-called "filesort" operation will be performed, which -- despite its name -- might very well execute completely in memory. But it may also use temporary files to swap in/out partitions that are (to be) sorted, and to merge sorted partitions into bigger ones.
In-memory sorting is performed with Quick Sort. You can find a mf_qsort.c file in the source files in the mysys folder.
A datetime is represented by 5 to 8 bytes (depending on whether fractional seconds are used), and sorting by it is no different from sorting a BIGINT, which also occupies 8 bytes.
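As a sketch, an index on the ORDER BY column gives the optimizer the option of reading rows in index order and skipping the sort entirely; whether it actually chooses that plan depends on the cost estimate mentioned above (the index name is illustrative, and `table`/`datetime` are the names used in the question):
ALTER TABLE `table` ADD INDEX ix_datetime (`datetime`);

-- check which plan was chosen; "Using filesort" in the Extra column
-- means the index order was not used for sorting
EXPLAIN SELECT * FROM `table` ORDER BY `datetime`;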
It works if your datetime column has an index declared. Without an index, your query will be slow on millions of records. If you are using it only for reporting, you should be fine.
For general usage (fast interaction with many users), this is not good practice. It is recommended to add conditions to the WHERE clause to further filter your data, along with an additional index on the columns used in the WHERE clause. A LIMIT clause also helps.

Which performs better in a MySQL where clause: YEAR() vs BETWEEN?

I need to find all records created in a given year from a MySQL database. Is there any way that one of the following would be slower than the other?
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
or
WHERE YEAR(create_date) = '2009'
This:
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
...works better because it doesn't alter the data in the create_date column. That means that if there is an index on the create_date, the index can be used--because the index is on the actual value as it exists in the column.
An index can't be used on YEAR(create_date), because it's only using a portion of the value (that requires extraction).
Whenever you use a function against a column, it must perform the function on every row in order to see if it matches the constant. This prevents the use of an index.
The basic rule of thumb, then, is to avoid using functions on the left side of the comparison.
Sargable means that the DBMS can use an index. Use a column on the left side and a constant on the right side to allow the DBMS to utilize an index.
Even if you don't have an index on the create_date column, there is still overhead on the DBMS to run the YEAR() function for each row. So, no matter what, the first method is most likely faster.
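A variation on the same sargable idea uses a half-open range, which also covers any fractional seconds that an upper bound of '23:59:59' would miss:
WHERE create_date >= '2009-01-01'
  AND create_date <  '2010-01-01'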
I would expect the former to be quicker as it is sargable.
Ideas:
Examine the explain plans; if they are identical, query performance will probably be nearly the same.
Test the performance on a large corpus of test data (which has most of its rows in years other than 2009) on a production-grade machine (ensure that the conditions are the same, e.g. cold / warm caches)
But I'd expect BETWEEN to win, unless the optimiser is clever enough to do the optimisation for YEAR(), in which case the two would perform the same.
ANOTHER IDEA:
I don't think you care.
If you have only a few records per year, then the query would be fast even if it did a full table scan, because even with (say) 100 years' data, there are so few records.
If you have a very large number of records per year (say 10^8) then the query would be very slow in any case, because returning that many records takes a long time.
You didn't say how many years' data you keep. I guess if it's an archaeological database, you might have a few thousand, in which case you might care if you have a massive load of data.
I find it extremely unlikely that your application will actually notice the difference between a "good" explain plan (using an index range scan) and a "bad" explain plan (full table scan) in this case.