Making a GROUP BY query faster - MySQL

This is the data from my table:
I have exactly one million rows, so this is just a snippet.
I would like to make this query faster:
It basically groups the values by time (ev represents the year, honap represents the month, and so on). The problem is that it takes a lot of time. I tried to apply indexes, as you can see here:
but it does absolutely nothing.
Here is my index:
I have also tried adding perc (which represents the minute) because of its cardinality, but MySQL doesn't want to use it. Could you give me any suggestions?

Is the data realistic? If so, why run the query -- it essentially delivers exactly what was in the table.
If, on the other hand, you had several rows per minute, then the GROUP BY makes sense.
The index you have is not worth using. However, the Optimizer seemed to like it. That's a bug.
In that case, I would simplify it to this:
SELECT AVG(konyha1) AS 'avg',
LEFT(time, 16) AS 'time'
FROM onemilliondata
GROUP BY LEFT(time, 16)
A DATE or TIME or DATETIME can be treated as that datatype or as a VARCHAR. Here I'm asking for it to be treated as a string.
Even in this case, no index is useful. However, this would make it a little faster:
PRIMARY KEY(time)
and the table would have only 2 columns: time, konyha1.
It is rarely beneficial to break a date and/or time into components and put them into columns.
A million points will probably choke a graphing program -- and the screen, which has a resolution of only a few thousand pixels.
Perhaps you should group by hour? And use LEFT(time, 13)? Performance would probably be slightly faster -- but only because less data is being sent to the client.
If you are collecting this data "forever", consider building and maintaining a "summary table" of the averages for each unit of time. Then the incremental effort is, say, aggregating yesterday's data each morning.
You might find MIN(konyha1) and MAX(konyha1) interesting to keep on an hourly or daily basis. Note that daily or weekly aggregates can be derived from hourly values.
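As a rough sketch of such a summary table -- assuming the two-column layout (time, konyha1) suggested above; the summary table's name and datatypes are made up:
CREATE TABLE konyha1_hourly (      -- hypothetical summary table: one row per hour
    hr          DATETIME NOT NULL,     -- start of the hour
    avg_konyha1 FLOAT NOT NULL,
    min_konyha1 FLOAT NOT NULL,
    max_konyha1 FLOAT NOT NULL,
    cnt         INT NOT NULL,          -- kept so hourly averages can be re-weighted into daily ones
    PRIMARY KEY (hr)
);
-- run each morning to fold in yesterday's rows
INSERT INTO konyha1_hourly
SELECT DATE_FORMAT(`time`, '%Y-%m-%d %H:00:00'),
       AVG(konyha1), MIN(konyha1), MAX(konyha1), COUNT(*)
FROM onemilliondata
WHERE `time` >= CURDATE() - INTERVAL 1 DAY
  AND `time` <  CURDATE()
GROUP BY 1;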

Related

Indexing not working when large data affected in where condition

I have a query, as follows:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/06/30 23:59:00'
It uses the index effectively. Screenshot of the EXPLAIN result:
There are 154500 records in the date range '2016/06/01 00:00:00' to '2016/07/26 23:59:00'.
But when I increase the date range, as in:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
then it no longer uses the index. Screenshot of the EXPLAIN result:
There are 3089464 records in the date range '2016/06/01 00:00:00' to '2016/07/30 23:59:00'.
After increasing the date range, the query no longer uses the index, so it becomes very slow, even though I am forcing it to use the index. I cannot figure out why this is happening, since neither the query nor the index has changed. Can you please help me understand why?
Don't use USE INDEX or FORCE INDEX. This will slow down the query when most of the table is being accessed. In particular, the Optimizer will decide, rightly, to do a table scan if the index seems to point to more than about 20% of the rows. Using an index involves bouncing back and forth between the index and the data, whereas doing a table scan smoothly reads the data sequentially (albeit having to skip over many of the rows).
There is another solution to the real problem. I assume you are building "reports" summarizing data from a large Data Warehouse table?
Instead of always starting with raw data ('Fact' table), build and maintain a "Summary Table". For your data, it would probably have 1 row per day. Each night you would tally the SUMs and COUNTs for the various things for the day. Then the 'report' would sum the sums and sum the counts to get the desired tallies for the bigger date range.
More discussion: http://mysql.rjweb.org/doc.php/summarytables
Your 'reports' will run more than 10 times as fast, and you won't even be tempted to FORCE INDEX. After all, 60 rows should be a lot faster than 3089464.
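As a rough illustration of that approach (the summary table's name and datatypes are made up; the columns mirror the query above):
CREATE TABLE caseDetails_daily (   -- hypothetical summary table: one row per day
    dy         DATE NOT NULL,
    pos        DECIMAL(20,2),      -- SUM(principalBalance)
    totalCases INT,                -- COUNT(id)
    paid       DECIMAL(20,2),      -- SUM(amountPaid)
    paidCount  INT,                -- COUNT(amountPaid)
    pdc        DECIMAL(20,2),      -- SUM(amountPdc)
    ptp        DECIMAL(20,2),      -- SUM(amountPtp)
    ptpCount   INT,                -- COUNT(amountPtp)
    PRIMARY KEY (dy)
);
-- each night, add one row for yesterday
INSERT INTO caseDetails_daily
SELECT DATE(updatedAt), SUM(principalBalance), COUNT(id),
       SUM(amountPaid), COUNT(amountPaid), SUM(amountPdc),
       SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails
WHERE updatedAt >= CURDATE() - INTERVAL 1 DAY
  AND updatedAt <  CURDATE()
GROUP BY 1;
-- the report then scans ~60 small rows instead of millions
SELECT SUM(pos), SUM(totalCases), SUM(paid), SUM(paidCount),
       SUM(pdc), SUM(ptp), SUM(ptpCount)
FROM caseDetails_daily
WHERE dy BETWEEN '2016-06-01' AND '2016-07-30';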
less time (more likely)
Using an index might be inferior even when disk reads would be fewer (see below). Most disk drives support bulk read. That is, you request data from a certain block/page and from the n following pages. This is especially fast for almost all rotating disks, tapes and all other hard drives where accessing data in a sequential manner is more efficient than random access (like ... really more efficient).
Essentially you gain a time advantage by sequential read versus random access.
fewer disk reads (less likely)
Using an index is effective when you actually gain speed/efficiency. An index is good when it reduces the number of disk reads significantly, and therefore the time needed. When reading the index plus the rows it points to requires almost as many disk reads as reading the whole table, using the index is probably unwise.
This will probably happen if your data is spread out enough (in respect to search criteria), so that you most likely have to read (almost) all pages/blocks anyway.
ideas for a fix
If you only access your table in this way (that is, the date is the most important search criterion), it might very much be worth the time to order the data on disk. I believe MySQL might provide such a feature (OPTIMIZE TABLE appears to do some of this).
This would decrease query duration when the index is used (and the index would then be more likely to be used).
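With InnoDB, rows are physically stored in primary-key order, so one way to get that ordering -- a sketch only, assuming the table from the question and that changing the primary key is acceptable -- is:
ALTER TABLE caseDetails
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (updatedAt, id),  -- id keeps the key unique
    ADD KEY (id);                     -- still needed if id is AUTO_INCREMENT
-- OPTIMIZE TABLE rebuilds the table, defragmenting it in primary-key order
OPTIMIZE TABLE caseDetails;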
alternatives
See the answer from Rick James (essentially: store aggregates instead of repeatedly calculating them).
Hey, it has been a long time since I asked this question, but now I have a better solution, which is working really smoothly for me. I hope my answer may help someone.
I used partitioning and observed that the performance of the query is really good now. I altered the table to create range partitioning on the updatedAt column.
Range Partitioning
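For illustration only -- the partition names and boundaries below are made up, and note that MySQL requires the partitioning column to be part of every unique key, including the primary key -- the DDL looks roughly like this:
ALTER TABLE caseDetails
PARTITION BY RANGE (TO_DAYS(updatedAt)) (
    PARTITION p201606 VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p201607 VALUES LESS THAN (TO_DAYS('2016-08-01')),
    PARTITION p201608 VALUES LESS THAN (TO_DAYS('2016-09-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
-- a BETWEEN on updatedAt can then be pruned to the matching partitions only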

In MySQL, is it worthwhile creating more than one multi-column index on the same set of columns?

I am new to SQL, and certainly to MySQL.
I have created a table from streaming market data named trade that looks like
date | time |instrument|price |quantity
----------|-----------------------|----------|-------|--------
2017-09-08|2017-09-08 13:16:30.919|12899586 |54.15 |8000
2017-09-08|2017-09-08 13:16:30.919|13793026 |1177.75|750
2017-09-08|2017-09-08 13:16:30.919|1346049 |1690.8 |1
2017-09-08|2017-09-08 13:16:30.919|261889 |110.85 |50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date,time,instrument) because most of my queries will select a specific date
or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry too much about the additional database write time due to index updates?
I get data every second, and take about 100 ms to process it and store in a database. As long as I continue to take less than 1 sec I am fine.
To get the most efficient query you need to query on a clustered index. According to the documentation this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument.
A couple of recommendations:
There is no need to store date and time separately if the time column already contains the date. You can instead have one datetime column and store the timestamps in it.
You can then have one index on the datetime and instrument columns, which will make the queries run faster.
With so many inserts and a fixed SELECT pattern (i.e. always by date first, followed by instrument), I would suggest looking into columnar databases (like Cassandra). You will get faster writes and reads for such a structure.
First, your use case sounds like two indexes would be useful (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume. But, you may need to dive into more advanced functionality, such as partitioning and clustered indexes to get the functionality you need.
Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
^^^
Optimal index for such is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: this pattern works for DATE, DATETIME, DATETIME(3), etc., without tripping over midnight at the end of the range.
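Putting those pieces together, a rough sketch of the table and the one-instrument, one-day query (the column types are assumptions based on the sample rows):
CREATE TABLE trade (
    `time`     DATETIME(3) NOT NULL,   -- millisecond precision; the date is implied
    instrument INT NOT NULL,
    price      DECIMAL(12,4) NOT NULL, -- precision is an assumption
    quantity   INT NOT NULL,
    INDEX inst_time (instrument, `time`)
);
SELECT *
FROM trade
WHERE instrument = 261889
  AND `time` >= '2017-09-08'
  AND `time` <  '2017-09-08' + INTERVAL 1 DAY;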
Data volume?
150M rows? 10 new rows per second? That means you have about 5 years' data? A steady 10/sec insertion rate is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, then there could be a problem. Need to see the datatypes to look for shrinking the size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always 1 day? One instrument? How many rows are typically returned?
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?
SPACE IS CHEAP. Indexes take time to create and to maintain on insert (once per row), but shave time on retrieval (many, many times).
My experience is to create indexes with all the relevant fields, in all the orders that queries might use. This way, MySQL can choose the best index for your query.
So if you have 3 relevant fields
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.

Does it improve performance to index a date column?

I have a table with millions of rows where one of the columns is a TIMESTAMP and against which I frequently select for date ranges. Would it improve performance any to index that column, or would that not furnish any notable improvement?
EDIT:
So, I've indexed the TIMESTAMP column. The following query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
Takes 3.1 seconds.
There are just over 3 million records in the interactions table.
The above query produces a result of ~976k
Does this seem like a reasonable amount of time to perform this task?
If you want improvement on the efficiency of queries, you need 2 things:
First, index the column.
Second, and this is more important, make sure the conditions on your queries are sargable, i.e. that indexes can be used. In particular, functions should not be used on the columns. In your example, one way to write the condition would be:
WHERE interaction_time >= '2013-10-10'
AND interaction_time < (CURRENT_DATE + INTERVAL 1 DAY)
The general rule with indexes is they speed retrieval of data with large data sets, but SLOW the insertion and update of records.
If you have millions of rows, and need to select a small subset of them, then an index most likely will improve performance when doing a SELECT. (If you need most or all of them, it will make little or no difference.)
Without an index, a table scan (ie read of every record to locate required ones) will occur which can be slow.
With tables with only a few records, a table scan can actually be faster than an index, but this is not your situation.
Another consideration is how many discrete values you have. If you only have a handful of different dates, indexing probably won't help much if at all, however if you have a wide range of dates the index will most likely help.
One caveat, if the index is very big and won't fit in memory, you may not get the performance benefits you might hope for.
Also you need to consider what other fields you are retrieving, joins etc, as they all have an impact.
A good way to check how performance is impacted is to use the EXPLAIN statement to see how MySQL will execute the query.
It would improve performance if:
there are at least "several" different values
your query uses a date range that would select less than "most" of the rows
To find out for sure, use EXPLAIN to show what index is being used. Run EXPLAIN before creating the index and again after; you should see whether the new index is being used. If it's being used, you can be confident performance is better.
You can also simply compare query timings.
For the query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
to be optimized, you need to do the following:
Use just interaction_time instead of date(interaction_time)
Create an index that covers interaction_time column
(optional) Use just '2013-10-10' not date('2013-10-10')
You need #1 because indexes are only used if the columns are used in comparisons as-is, not as arguments to other expressions.
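Concretely, that might look like this (the index name is made up):
CREATE INDEX idx_interaction_time ON interactions (interaction_time);  -- index name is illustrative
SELECT COUNT(*)
FROM interactions
WHERE interaction_time >= '2013-10-10'
  AND interaction_time <  CURDATE() + INTERVAL 1 DAY;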
Adding an index on a date column definitely increases performance.
My table has 11 million rows, and a query to fetch rows that were updated on a particular date took the following times:
Without index: ~2.5s
With index: ~5ms

...still not getting results trying to optimize a MySQL InnoDB table for a fast count

I posted this question here a while ago. I tried out the suggestions and came to the conclusion that I must be doing something fundamentally wrong.
What I basically want to do is this:
I have a table containing 83 million time/price pairs. As the index I'm using a millisecond-accurate Unix timestamp; the price ranges between 1.18775 and 1.60400 (DECIMAL with precision 5).
I have a client that needs to get the price densities for a given time interval, meaning I want to take a specified interval of time and count how many times each of the different prices appears in this interval.
How would you guys do this? How would you design/index the table? Right now I'm building a temporary subtable containing only the data for the given interval and then doing the counts on the prices. Is there a better way to do this? My general DB settings are already tuned and pretty performant. Thanks for any hints! I will provide any additional information needed as fast as I can!
Given that you have a large amount of data and it's growing very rapidly, I'd be inclined to add a second table of:
price (primary key)
time (some block, also part of the PK)
count
Do an 'INSERT ... ON DUPLICATE KEY UPDATE count = count + 1' sort of thing (a sketch follows below). Group the time field by some predetermined interval (depending on the sorts of queries you get: ms/sec/hour/whatever). This way you:
don't have to mess with temp tables -- with a table of this size a temp table will spill to disk, which is slow even with an SSD
don't have to touch the initial table every time you want to do your query - might run into locking issues
You will have to average out your data a bit, but the granularity can be predetermined to cause as few issues as possible.
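A sketch of that idea with minute-level buckets (all names and values here are illustrative):
CREATE TABLE price_density (         -- hypothetical pre-aggregated table
    bucket DATETIME NOT NULL,        -- the timestamp truncated to the chosen interval
    price  DECIMAL(6,5) NOT NULL,
    cnt    INT UNSIGNED NOT NULL,
    PRIMARY KEY (price, bucket)
);
-- for every incoming tick (or batch of ticks)
INSERT INTO price_density (bucket, price, cnt)
VALUES ('2013-11-05 14:37:00', 1.34275, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
-- density for an interval: sum the pre-aggregated counts
SELECT price, SUM(cnt)
FROM price_density
WHERE bucket >= '2013-11-05 00:00:00'
  AND bucket <  '2013-11-06 00:00:00'
GROUP BY price;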

Which performs better in a MySQL where clause: YEAR() vs BETWEEN?

I need to find all records created in a given year from a MySQL database. Is there any way that one of the following would be slower than the other?
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
or
WHERE YEAR(create_date) = '2009'
This:
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
...works better because it doesn't alter the data in the create_date column. That means that if there is an index on the create_date, the index can be used--because the index is on the actual value as it exists in the column.
An index can't be used on YEAR(create_date), because it's only using a portion of the value (that requires extraction).
Whenever you use a function against a column, it must perform the function on every row in order to see if it matches the constant. This prevents the use of an index.
The basic rule of thumb, then, is to avoid using functions on the left side of the comparison.
Sargable means that the DBMS can use an index. Use a column on the left side and a constant on the right side to allow the DBMS to utilize an index.
Even if you don't have an index on the create_date column, there is still overhead on the DBMS to run the YEAR() function for each row. So, no matter what, the first method is most likely faster.
I would expect the former to be quicker as it is sargable.
Ideas:
Examine the explain plans (a sketch follows below); if they are identical, query performance will probably be nearly the same.
Test the performance on a large corpus of test data (which has most of its rows in years other than 2009) on a production-grade machine (ensure that the conditions are the same, e.g. cold / warm caches)
But I'd expect BETWEEN to win, unless the optimiser is clever enough to do the optimisation for YEAR(), in which case it would be the same.
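For example (the table and index here are hypothetical), you can compare the two plans directly:
-- 'records' is a stand-in table assumed to have an index on create_date
EXPLAIN SELECT * FROM records
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59';
-- with an index on create_date this typically reports type = range and names that index under key
EXPLAIN SELECT * FROM records
WHERE YEAR(create_date) = '2009';
-- this typically reports type = ALL (a full table scan) and key = NULL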
ANOTHER IDEA:
I don't think you care.
If you have only a few records per year, then the query would be fast even if it did a full table scan, because even with (say) 100 years' data, there are so few records.
If you have a very large number of records per year (say 10^8) then the query would be very slow in any case, because returning that many records takes a long time.
You didn't say how many years' data you keep. I guess if it's an archaeological database, you might have a few thousand, in which case you might care if you have a massive load of data.
I find it extremely unlikely that your application will actually notice the difference between a "good" explain plan (using an index range scan) and a "bad" explain plan (full table scan) in this case.