Index not used when a WHERE condition matches a large amount of data - MySQL

I have a query, as follows:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/06/30 23:59:00'
It uses the index effectively. Screenshot of the EXPLAIN result:
There are 154,500 records in the date range from '2016/06/01 00:00:00' to '2016/07/26 23:59:00'.
But when I increase the date range:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
Now it does not use the index. Screenshot of the EXPLAIN result:
There are 3,089,464 records in the date range from '2016/06/01 00:00:00' to '2016/07/30 23:59:00'.
After increasing the date range, the query no longer uses the index, so it becomes very slow, even though I am forcing it to use the index. I am not able to figure out why this is happening, as nothing has changed in the query or in the indexing. Can you please help me understand why this happens?

Don't use USE INDEX or FORCE INDEX. This will slow down the query when most of the table is being accessed. In particular, the Optimizer will decide, rightly, to do a table scan if the index seems to point to more than about 20% of the rows. Using an index involves bouncing back and forth between the index and the data, whereas doing a table scan smoothly reads the data sequentially (albeit having to skip over many of the rows).
There is another solution to the real problem. I assume you are building "reports" summarizing data from a large Data Warehouse table?
Instead of always starting with raw data ('Fact' table), build and maintain a "Summary Table". For your data, it would probably have 1 row per day. Each night you would tally the SUMs and COUNTs for the various things for the day. Then the 'report' would sum the sums and sum the counts to get the desired tallies for the bigger date range.
More discussion: http://mysql.rjweb.org/doc.php/summarytables
Your 'reports' will run more than 10 times as fast, and you won't even be tempted to FORCE INDEX. After all, 60 rows should be a lot faster than 3089464.
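A hedged sketch of what that might look like here (only caseDetails and its columns come from the question; the summary-table name, the column types, and the nightly job are assumptions):
-- Daily summary table: one row per day.
CREATE TABLE caseDetails_daily (
    dy DATE NOT NULL,
    pos DECIMAL(14,2),   -- SUM(principalBalance)
    totalCases INT,      -- COUNT(id)
    paid DECIMAL(14,2),  -- SUM(amountPaid)
    paidCount INT,       -- COUNT(amountPaid)
    pdc DECIMAL(14,2),   -- SUM(amountPdc)
    ptp DECIMAL(14,2),   -- SUM(amountPtp)
    ptpCount INT,        -- COUNT(amountPtp)
    PRIMARY KEY (dy)
);

-- Nightly: add yesterday's tallies.
INSERT INTO caseDetails_daily
SELECT DATE(updatedAt), SUM(principalBalance), COUNT(id),
       SUM(amountPaid), COUNT(amountPaid), SUM(amountPdc),
       SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails
WHERE updatedAt >= CURDATE() - INTERVAL 1 DAY
  AND updatedAt <  CURDATE()
GROUP BY DATE(updatedAt);

-- Report: sum the sums and sum the counts over the wanted range.
SELECT SUM(pos) AS pos, SUM(totalCases) AS TotalCases,
       SUM(paid) AS paid, SUM(paidCount) AS paidCount,
       SUM(pdc) AS Pdc, SUM(ptp), SUM(ptpCount)
FROM caseDetails_daily
WHERE dy BETWEEN '2016-06-01' AND '2016-07-30';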

less time (more likely)
Using an index might be inferior even when disk reads would be fewer (see below). Most disk drives support bulk reads: you request data from a certain block/page and from the n following pages. This is especially fast for rotating disks, tapes, and any other storage where accessing data sequentially is far more efficient than random access.
Essentially you gain a time advantage by sequential read versus random access.
fewer disk reads (less likely)
Using an index is effective when you actually gain speed/efficiency: an index is good when it significantly reduces the number of disk reads and therefore the time needed. When reading the index plus the rows it points to results in almost as many disk reads as reading the whole table, using the index is probably unwise.
This will probably happen if your data is spread out enough (with respect to the search criteria) that you most likely have to read (almost) all pages/blocks anyway.
ideas for a fix
if you only access your table in this way (that is, the date is the most important search criterion) it might very much be worth the time to order the data on disk. I believe MySQL provides something along these lines (OPTIMIZE TABLE appears to do some of this; see the sketch below)
this would decrease query duration for index usage (and the index is more likely to be used)
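One hedged sketch of ordering the data by date (this assumes InnoDB, that the current primary key is id, and that id is AUTO_INCREMENT; none of that is stated in the question): InnoDB clusters rows by primary key, so putting updatedAt first in the key keeps rows for a date range physically close together.
-- Sketch only: verify the assumptions above before running anything like this.
ALTER TABLE caseDetails
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (updatedAt, id),  -- cluster the data by date
    ADD KEY (id);                     -- AUTO_INCREMENT still needs its own index

-- Rebuilds and defragments the table in primary-key order.
OPTIMIZE TABLE caseDetails;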
alternatives
see post from Rick James (essentially: store aggregates instead of repeatedly calculating them)

It has been a long time since I asked this question, and I now have a better solution that is working really smoothly for me. I hope my answer may help someone.
I used partitioning, and the performance of the query is much better now. I altered the table to create range partitioning on the updatedAt column.
Range Partitioning
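For reference, a minimal sketch of monthly range partitioning on updatedAt (the partition names and boundaries are just examples, and MySQL requires the partitioning column to be part of every unique key on the table, including the primary key):
ALTER TABLE caseDetails
PARTITION BY RANGE (TO_DAYS(updatedAt)) (
    PARTITION p201606 VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p201607 VALUES LESS THAN (TO_DAYS('2016-08-01')),
    PARTITION p201608 VALUES LESS THAN (TO_DAYS('2016-09-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
With this in place, a query whose WHERE clause restricts updatedAt to a couple of months only has to touch the matching partitions (partition pruning).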

Related

Making a GROUP BY query faster

This is the data from my table (I have exactly one million rows, so this is just a snippet):
I would like to make this query faster:
It basically groups the values by time (ev represents the year, honap the month, and so on). The one problem is that it takes a lot of time. I tried to apply indexes, as you can see here, but it does absolutely nothing.
Here is my index:
I have also tried to include perc (which represents the minute) because of its cardinality, but MySQL doesn't want to use that. Could you give me any suggestions?
Is the data realistic? If so, why run the query -- it essentially delivers exactly what was in the table.
If, on the other hand, you had several rows per minute, then the GROUP BY makes sense.
The index you have is not worth using. However, the Optimizer seemed to like it. That's a bug.
In that case, I would simplify it to this:
SELECT AVG(konyha1) AS 'avg',
LEFT(time, 16) AS 'time'
FROM onemilliondata
GROUP BY LEFT(time, 16)
A DATE or TIME or DATETIME can be treated as such a datatype or as a VARCHAR. I'm asking for it to be a string.
Even in this case, no index is useful. However, this would make it a little faster:
PRIMARY KEY(time)
and the table would have only 2 columns: time, konyha1.
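A minimal sketch of that restructured table (the new table name is hypothetical and the konyha1 type is an assumption):
-- One reading per minute, keyed and clustered by its timestamp.
CREATE TABLE sensor_minutes (
    time    DATETIME NOT NULL,
    konyha1 FLOAT,
    PRIMARY KEY (time)
);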
It is rarely beneficial to break a date and/or time into components and put them into columns.
A million points will probably choke a graphing program. And the screen -- which has a resolution of only a few thousand.
Perhaps you should group by hour? And use LEFT(time, 13)? Performance would probably be slightly faster -- but only because less data is being sent to the client.
If you are collecting this data "forever", consider building and maintaining a "summary table" of the averages for each unit of time. Then the incremental effort is, say, aggregating yesterday's data each morning.
You might find MIN(konyha1) and MAX(konyha1) interesting to keep on an hourly or daily basis. Note that daily or weekly aggregates can be derived from hourly values.
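For example, a sketch of the hourly version with MIN/MAX (same table and columns as in the query above):
-- LEFT(time, 13) keeps 'YYYY-MM-DD HH', i.e. one group per hour.
SELECT LEFT(time, 13) AS hr,
       AVG(konyha1)   AS avg_konyha1,
       MIN(konyha1)   AS min_konyha1,
       MAX(konyha1)   AS max_konyha1
FROM onemilliondata
GROUP BY LEFT(time, 13);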

SQL Index to Optimize WHERE Query

I have a Postgres table with several columns; one of them is the datetime at which the row was last updated. My query is to get all the rows updated between a start and end time. My understanding is that this query should use plain WHERE comparisons rather than BETWEEN. The basic query is as follows:
SELECT * FROM contact_tbl contact
WHERE contact."UpdateTime" >= '20150610' and contact."UpdateTime" < '20150618'
I am new at creating SQL queries, and I believe this query is doing a full table scan; I would like to optimize it if possible. I have placed a normal index on the UpdateTime column, which takes a long time to create, but with this index the query is faster. One thing I am not sure about is whether I have to keep recalculating this index as the table gets bigger or columns get changed. I am also considering a CLUSTERED index on the UpdateTime column, but I wanted to ask first whether there is a canonical way of optimizing this and whether I am on the right track.
Placing an index on UpdateTime is correct. It will allow the index to be used instead of full table scans.
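For example (a sketch; the index name is an assumption, while the table and quoted column name come from the question):
-- Plain B-tree index on the case-sensitive, quoted column.
CREATE INDEX idx_contact_updatetime ON contact_tbl ("UpdateTime");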
Two WHERE conditions like the above versus using the BETWEEN keyword amount to the same thing (BETWEEN is equivalent to a >= plus a <= comparison):
http://dev.mysql.com/doc/refman/5.7/en/comparison-operators.html#operator_between
BETWEEN is just "syntactical sugar" for those that like that syntax better.
Indexes allow for faster reads, but slow down writes (because like you mention, the new data has to be inserted into the index as well). The entire index does not need to be recalculated. Indexes are smart data structures, so the extra data can be added without a lot of extra work, but it does take some.
You're probably doing many more reads than writes, so using an index is a good idea.
If you're doing lots of writes and few reads, then you'd want to think a bit more about it. It would then come down to business requirements. Although overall the throughput may be slowed, read latency may not be a requirement but write latency may be, in which case you wouldn't want the index.
For instance, think of this lottery example: every time someone buys a ticket, you have to record their name and the ticket number. However, the only time you ever have to read that data is after the one and only drawing, to see who had the winning ticket number. In this database, you wouldn't want to index the ticket number, since there will be so many writes and very few reads.

Does it improve performance to index a date column?

I have a table with millions of rows where one of the columns is a TIMESTAMP and against which I frequently select for date ranges. Would it improve performance any to index that column, or would that not furnish any notable improvement?
EDIT:
So, I've indexed the TIMESTAMP column. The following query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
Takes 3.1 seconds.
There are just over 3 million records in the interactions table.
The above query produces a result of ~976k
Does this seem like a reasonable amount of time to perform this task?
If you want to improve the efficiency of your queries, you need two things:
First, index the column.
Second, and this is more important, make sure the conditions on your queries are sargable, i.e. that indexes can be used. In particular, functions should not be used on the columns. In your example, one way to write the condition would be:
WHERE interaction_time >= '2013-10-10'
AND interaction_time < (CURRENT_DATE + INTERVAL 1 DAY)
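For the first point, the index might be created like this (a sketch; the index name is an assumption, while the table and column come from the question):
CREATE INDEX idx_interaction_time ON interactions (interaction_time);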
The general rule with indexes is they speed retrieval of data with large data sets, but SLOW the insertion and update of records.
If you have millions of rows, and need to select a small subset of them, then an index most likely will improve performance when doing a SELECT. (If you need most or all of them, it will make little or no difference.)
Without an index, a table scan (i.e. reading every record to locate the required ones) will occur, which can be slow.
With tables with only a few records, a table scan can actually be faster than an index, but this is not your situation.
Another consideration is how many discrete values you have. If you only have a handful of different dates, indexing probably won't help much, if at all; however, if you have a wide range of dates, the index will most likely help.
One caveat, if the index is very big and won't fit in memory, you may not get the performance benefits you might hope for.
Also you need to consider what other fields you are retrieving, joins etc, as they all have an impact.
A good way to check how performance is impacted is to use the EXPLAIN statement to see how MySQL will execute the query.
It would improve performance if:
there are at least "several" different values
your query uses a date range that would select less than "most" of the rows
To find out for sure, use EXPLAIN to show which index is being used. Run EXPLAIN before creating the index and again after; you should see whether or not the new index is being used. If it is, you can be confident performance is better.
You can also simply compare query timings.
For
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
to be optimized, you need to do the following:
Use just interaction_time instead of date(interaction_time)
Create an index that covers interaction_time column
(optional) Use just '2013-10-10' not date('2013-10-10')
You need #1 because indexes are only used if the columns are compared as-is, not as arguments to other expressions.
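Putting the three points together, the rewritten query might look like this (a sketch based on the query quoted above):
-- Compares the raw, indexed column against constants, so the range can use the index.
SELECT COUNT(*)
FROM interactions
WHERE interaction_time >= '2013-10-10'
  AND interaction_time <  CURRENT_DATE + INTERVAL 1 DAY;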
Adding an index on date column definitely increases performance.
My table has 11 million rows, and a query to fetch rows which were updated on a particular date took the following times:
Without index: ~2.5s
With index: ~5ms

No real performance gain after index rebuild on SQL Server 2008

I have a table whose compound clustered index (int, DateTime) was 99% fragmented.
After defragmenting and making sure that statistics were updated, I still get the same response time when I run this query:
SELECT *
FROM myTable
WHERE myIntField = 1000
AND myDateTimeField >= '2012-01-01'
and myDateTimeField <= '2012-12-31 23:59:59.999'
Well, I see a small response-time improvement (about 5-10%), but I really expected my queries to get dramatically faster after that index rebuild and stats update.
The estimated execution plan is:
SELECT Cost: 0%
Clustered Index Seek (Clustered)[MyTable].[IX_MyCompoundIndex] Cost: 100%
Is this because the index is a clustered index? Am I missing something?
You should avoid SELECT * - probably even if you do need all of the columns in the table (which is rare).
Also, you are doing something very dangerous here. Did you know that your end range rounds up (DATETIME is only accurate to roughly 3 milliseconds, so '23:59:59.999' rounds to the next day), meaning you may be including data from 2013-01-01 at midnight? Try:
AND myDateTimeColumn >= '20120101'
AND myDateTimeColumn < '20130101'
(This won't change performance, but it is easier to generate and is guaranteed to be accurate no matter what the underlying data type is.)
To eliminate network delays from your analysis of query time, you could consider SQL Sentry Plan Explorer - which allows you to generate an actual plan by running the query against the server, but discards the results, so that isn't an interfering factor.
Disclaimer: I work for SQL Sentry.
The execution time of the query is going to be spent reading enough pages of the index's B-tree to generate the result. Defragmenting the index puts adjacent rows together, reducing the number of pages that need to be read. It can also help by turning a largely random I/O pattern into a sequential one.
If your rows are wide and you don't get many rows per page, you won't see much reduction in the number of pages read.
If your index fill factor is low, you'll not get as many rows per page.
If your pages are in cache, you won't see any streaming v random IO benefit.
If you have spare CPU capacity on the machine, you may benefit from using page compression. This essentially trades more CPU for less IO.
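For example, page compression can be enabled as part of an index rebuild (a sketch; the table and index names come from the execution plan above, and page compression requires an edition that supports it, e.g. Enterprise on SQL Server 2008):
-- Trades extra CPU on reads/writes for fewer pages to read from disk.
ALTER INDEX IX_MyCompoundIndex ON myTable
REBUILD WITH (DATA_COMPRESSION = PAGE);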

Which performs better in a MySQL where clause: YEAR() vs BETWEEN?

I need to find all records created in a given year from a MySQL database. Is there any way that one of the following would be slower than the other?
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
or
WHERE YEAR(create_date) = '2009'
This:
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
...works better because it doesn't alter the data in the create_date column. That means that if there is an index on the create_date, the index can be used--because the index is on the actual value as it exists in the column.
An index can't be used on YEAR(create_date), because it's only using a portion of the value (that requires extraction).
Whenever you use a function against a column, it must perform the function on every row in order to see if it matches the constant. This prevents the use of an index.
The basic rule of thumb, then, is to avoid using functions on the left side of the comparison.
Sargable means that the DBMS can use an index. Use a column on the left side and a constant on the right side to allow the DBMS to utilize an index.
Even if you don't have an index on the create_date column, there is still overhead on the DBMS to run the YEAR() function for each row. So, no matter what, the first method is most likely faster.
I would expect the former to be quicker as it is sargable.
Ideas:
Examine the explain plans; if they are identical, query performance will probably be nearly the same.
Test the performance on a large corpus of test data (which has most of its rows in years other than 2009) on a production-grade machine (ensure that the conditions are the same, e.g. cold / warm caches)
But I'd expect BETWEEN to win, unless the optimiser is clever enough to do the optimisation for YEAR() itself, in which case the two would be the same.
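A quick way to check, as a sketch (the table name is hypothetical; create_date comes from the question):
-- With an index on create_date, the first plan should show type = range on that index;
-- the YEAR() form typically shows type = ALL (full table scan).
EXPLAIN SELECT * FROM records
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59';

EXPLAIN SELECT * FROM records
WHERE YEAR(create_date) = 2009;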
ANOTHER IDEA:
I don't think you care.
If you have only a few records per year, then the query would be fast even if it did a full table scan, because even with (say) 100 years' data, there are so few records.
If you have a very large number of records per year (say 10^8) then the query would be very slow in any case, because returning that many records takes a long time.
You didn't say how many years' data you keep. I guess if it's an archaeological database, you might have a few thousand, in which case you might care if you have a massive load of data.
I find it extremely unlikely that your application will actually notice the difference between a "good" explain plan (using an index range scan) and a "bad" explain plan (full table scan) in this case.