How do indexes work with a MySQL partitioned table? - mysql

I have a table that stores time information, so it has columns like year, month, day, hour and so on.
The table holds data across many years and is quite big, so I decided to partition it and started learning about MySQL partitioning, but I got stuck on a few questions.
I would really appreciate it if someone could help me understand how partitions and indexes work together.
If I partition on the year column and also have an index on the same column, how will the partition and the index work together? How will performance compare with having only an index on the year column and no partitioning?
Example SQL: SELECT month, day, hour ... FROM time_table WHERE year = '2017';
If the table is partitioned on the year column and a query filters on the month column, which is indexed, how will the index on month and the partition on year affect SELECT performance?
Example SQL: SELECT year, month, day ... FROM time_table WHERE month = '05';

Partitioning splits a table up into, shall we say, "sub-tables". Each sub-table is essentially a table, with data and index(es).
When SELECTing from the table, the first thing done is to decide which partition(s) may contain the desired data. This is "partition pruning" and uses the "partition key" (which is apparently to be year). Then the select is applied to the subtables that are relevant, using whatever index is appropriate. In that case it is a waste to have INDEX(year, ...), since you are already pruned down to the year.
Your second sample SELECT (WHERE month = '05') cannot do partition pruning since it does not specify year in the WHERE clause. Hence, it will look in all partitions, and will be slower than if you did not partition the table.
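To make the pruning argument concrete, here is a minimal Python sketch. It is only a toy model, not how MySQL is implemented: the rows, the `partitions` dictionary and the `select` helper are all made up for illustration. It counts how many "sub-tables" have to be opened when the filter includes the partition key (year) and when it does not.

```python
from collections import defaultdict

# Hypothetical sample rows; in the question these live in time_table.
rows = [
    {"year": 2016, "month": 5, "day": 1},
    {"year": 2017, "month": 5, "day": 2},
    {"year": 2017, "month": 6, "day": 3},
    {"year": 2018, "month": 5, "day": 4},
]

# Toy model of partitioning: one list of rows ("sub-table") per year.
partitions = defaultdict(list)
for row in rows:
    partitions[row["year"]].append(row)


def select(year=None, month=None):
    """Return (matching rows, number of partitions that had to be opened)."""
    # Pruning is only possible when the partition key (year) is constrained.
    candidates = [year] if year is not None else list(partitions)
    matches = []
    for key in candidates:
        for row in partitions.get(key, []):
            if month is None or row["month"] == month:
                matches.append(row)
    return matches, len(candidates)


by_year, touched_for_year = select(year=2017)   # WHERE year = 2017
by_month, touched_for_month = select(month=5)   # WHERE month = 5
print(touched_for_year, touched_for_month)
```

With a filter on year only one partition is touched; with a filter on month alone every partition has to be visited, which is the extra cost described above.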
Don't use partitioning unless you expect at least a million rows. (That would be a lot of years.)
Don't use partitioning unless you have a use case where it will help you. (Apparently not your case.)
Don't have columns for the parts of a datetime, when it is so easy to compute the parts: YEAR(date), MONTH(date), etc.
Don't index columns with low cardinality; the Optimizer will end up scanning the entire table anyway -- because it is faster. (eg: month='05')
If you would like to back up a step and explain what you are trying to accomplish, perhaps we can discuss another approach.

Related

MySQL - Group By date/time functions on a large table

I have a bunch of financial stock data in a MySQL table. The data is stored in a 1min tick per row format (OHLC). From that data I'd like to create 30min/hourly/daily aggregates. The problem is that the table is enormous and grouping by date functions on the timestamp column yields horrible performance results.
Ex: The following query produces the right result but ends up taking too long.
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary index on the (market, timestamp) columns. And I have also added an additional index on the timestamp column. However, that is not of much help as the usage of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a different database than MySQL that provides specialized date/time indexes? if so what would be a good option?
One thing to note is that it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
MySQL version: 5.7
Thanks in advance for the help.
Edit: Here is what Explain shows on a smaller DB of the exact same format:

What keys should be indexed here to make this query optimal

I have a query that looks like the following:
SELECT * from foo
WHERE days >= DATEDIFF(CURDATE(), last_day)
In this case, days is an INT. last_day is a DATE column.
So do I need two individual indexes here, one for days and one for last_day?
This query predicate, days >= DATEDIFF(CURDATE(), last_day), is inherently not sargeable.
If you keep the present table design you'll probably benefit from a compound index on (last_day, days). Nevertheless, satisfying the query will require a full scan of that index.
Single-column indexes on either one of those columns, or both, will be useless or worse for improving this query's performance.
If you must have this query perform very well, you need to reorganize your table a bit. Let's figure that out. It looks like you are trying to exclude "overdue" records, that is, to keep only the rows whose expiration date (last_day plus days) has not passed yet: you want expiration_date >= CURDATE(). That is a sargeable search predicate.
So if you added a new column expiration_date to your table, and then set it as follows:
UPDATE foo SET expiration_date = last_day + INTERVAL days DAY
and then indexed it, you'd have a well-performing query.
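Rearranging days >= DATEDIFF(CURDATE(), last_day) gives last_day + INTERVAL days DAY >= CURDATE(), which is why a precomputed expiration_date column can replace the expression. The following Python sketch checks that equivalence with plain date arithmetic. The fixed `today` value standing in for CURDATE() and the sample ranges are assumptions chosen only for illustration.

```python
from datetime import date, timedelta


def original_predicate(days, last_day, today):
    # Mirrors: days >= DATEDIFF(CURDATE(), last_day)
    return days >= (today - last_day).days


def rewritten_predicate(days, last_day, today):
    # Mirrors: expiration_date >= CURDATE(),
    # where expiration_date = last_day + INTERVAL days DAY
    expiration_date = last_day + timedelta(days=days)
    return expiration_date >= today


today = date(2024, 3, 15)  # fixed stand-in for CURDATE()

# Try a grid of (days, last_day) combinations around today.
samples = [
    (days, today - timedelta(days=age))
    for days in range(0, 10)
    for age in range(-3, 15)
]
mismatches = [
    (days, last_day)
    for days, last_day in samples
    if original_predicate(days, last_day, today)
    != rewritten_predicate(days, last_day, today)
]
print(len(samples), len(mismatches))
```

Both forms select the same rows; the difference is that the rewritten one compares a stored column against a constant, so an index on that column can be used for a range lookup.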
You must be careful with indexes: they can speed up reads, but they can slow down inserts.
You may also consider partitioning the table on the last_day field.
I would first try an index on the last_day field only, but I think the best approach is to run some performance tests with different configurations.
Since you are using an expression in the WHERE criteria, MySQL will not be able to use indexes on either of the two fields. If you use this expression regularly and you have at least MySQL v5.7.8, then you can create a generated column and create an index on it.
The other option is to create a regular column and set its value to the result of this expression and index this column. You will need triggers to keep it updated.

How can I make my mysql getting one record per month query faster?

I have a big database with about 3 million records with records containing a time stamp.
Now I want to select one record per month and it works using this query:
SELECT timestamp, id, gas_used, kwh_used1, kwh_used2 FROM energy
GROUP BY MONTH(timestamp) ORDER BY timestamp ASC
It works but it is very slow.
I have indexes on id and on timestamp.
What can I do to make this query fast?
GROUP BY MONTH(timestamp) is forcing the engine to look at each record individually, aka a sequential scan, which obviously is very slow when you have 3M records.
A common solution is to add an indexed column with just the criterion you will want to select on. However, I highly suspect that you will actually want to select on Year-Month, if your db is not reset every year.
To avoid data corruption issues, it may be best to create an insert trigger that automatically fills that field. That way this extra column doesn't interfere with your business logic.
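As a small illustration of why Year-Month is the safer grouping key, this Python sketch compares the two keys on a handful of made-up timestamps. The helper names and the YYYYMM integer encoding are assumptions for the example, not something taken from the question.

```python
from datetime import datetime

# Hypothetical timestamps spanning two years.
timestamps = [
    datetime(2016, 1, 10),
    datetime(2016, 1, 20),
    datetime(2017, 1, 5),
    datetime(2017, 2, 1),
]


def month_key(ts):
    # Same idea as MONTH(timestamp): the year is thrown away.
    return ts.month


def year_month_key(ts):
    # Year and month combined, e.g. 201701 for January 2017.
    return ts.year * 100 + ts.month


by_month = sorted({month_key(ts) for ts in timestamps})
by_year_month = sorted({year_month_key(ts) for ts in timestamps})
print(by_month)        # [1, 2]
print(by_year_month)   # [201601, 201701, 201702]
```

Grouping by month alone merges January 2016 with January 2017, while the Year-Month key keeps them apart.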
It is not good practice to SELECT columns that don't appear in the GROUP BY clause, unless they are wrapped in an aggregate function such as MIN(), MAX(), SUM() etc.
In your query this applies to columns:
id, gas_used, kwh_used1, kwh_used2
You will not get the "earliest" (by timestamp) row for each month in this case.
More:
https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html

Two different queries on the same table with the same WHERE clause

I have two different queries. But they are both on the same table and both have the same WHERE clause, so they are selecting the same rows.
Query 1:
SELECT HOUR(timestamp), COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY HOUR(timestamp)
Query 2:
SELECT country, COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY country
How can I make this more efficient?
If this table is indexed correctly, it honestly doesn't matter how big the entire table is because you're only looking at today's rows.
If the table is indexed incorrectly the performance of these queries will be terrible no matter what you do.
Your WHERE timestamp >= CURDATE() clause means you need to have an index on the timestamp column. In one of your queries the GROUP BY country shows that a compound covering index on (timestamp, country) will be a great help.
So, a single compound index (timestamp, country) will satisfy both the queries in your question.
Let's explain how that works. To look for today's records (or indeed any records starting and ending with particular timestamp values) and group them by country, and count them, MySQL can satisfy the query by doing these steps:
1. Random-access the index to the first record that matches the timestamp. O(log n).
2. Grab the first country value from the index.
3. Scan to the next country value in the index and count. O(n).
4. Repeat step three until the end of the timestamp range.
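The steps above can be sketched in Python with a sorted list standing in for the (timestamp, country) index. This is a simplified model under stated assumptions, not MySQL's actual B-tree code: the sample entries and the `hits_by_country` helper are hypothetical, and the counting is done with a tally per country because the entries are ordered by timestamp first.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical index entries (timestamp, country), kept in sorted order
# the way the leaves of a compound index would be.
index = sorted([
    ("2024-03-14 23:50:00", "US"),
    ("2024-03-15 00:05:00", "DE"),
    ("2024-03-15 08:30:00", "US"),
    ("2024-03-15 09:10:00", "DE"),
    ("2024-03-15 21:45:00", "FR"),
])


def hits_by_country(index, start_ts):
    # Step 1: binary search to the first entry at or after start_ts, O(log n).
    # The empty string sorts before any country, so equal timestamps are kept.
    first = bisect_left(index, (start_ts, ""))
    # Steps 2-4: walk the rest of the range once and tally each country.
    counts = Counter()
    for _, country in index[first:]:
        counts[country] += 1
    return dict(counts)


result = hits_by_country(index, "2024-03-15 00:00:00")
print(result)
```

Only the entries from the start of the range onward are read, and the country comes straight from the index entry, so the table rows themselves are never needed.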
This index scan operation is about as fast as a team of ace developers (the MySQL team) can get it to be with a decade of hard work. (You may not be able to outdo them on a Saturday afternoon.) MySQL satisfies the whole query with a small subset of the index, so it doesn't really matter how big the table behind it is.
If you run one of these queries right after the other, it's possible that MySQL will still have some or all the index data blocks in a RAM cache, so it might not have to re-fetch them from disk. That will help even more.
Do you see how your example queries lead with timestamp? The most important WHERE criterion chooses a timestamp range. That's why the compound index I suggested has timestamp as its first column. If you don't have any queries that lead with country your simple index on that column probably is useless.
You asked whether you really need compound covering indexes. You probably should read about how they work and make that decision for yourself.
There's obviously a tradeoff in choosing indexes. Each index slows the process of INSERT and UPDATE a little, and can speed up queries a lot. Only you can sort out the tradeoffs for your particular application.
Since both queries have different GROUP BY clauses they are inherently different and cannot be combined. Assuming there already is an index present on the timestamp field there is no straightforward way to make this more efficient.
If the dataset is huge (10 million or more rows) you might get a little extra efficiency out of adding a combined index on (country, timestamp), but that's unlikely to be measurable, and the lack of it will usually be mitigated by MySQL's own in-memory buffering if these 2 queries are executed one directly after the other.

mysql partitioning with int and timestamp

I have MySQL 5.6.12 Community Server.
I am trying to partition my MySQL InnoDB table, which contains 5M (and always growing) rows of history data. It is getting slower and slower and I figured partitioning would solve it.
I have these columns:
stationID int(4)
valueNumberID int(5)
logTime timestamp
value double
(stationID,valueNumberID,logTime) is my PRIMARY key.
I have 50 different stationID's. From each station comes history data and I need to store it for 5 years. There are only 2-5 different valueNumberID's from each stationID but hundreds of value changes per day. Each query in the system uses stationID,valueNumberID and logTime in that order. In most cases the queries are limited to current year.
I would like to create partitioning with stationID with each stationID having own partition so the queries use smaller physical table to scan, and further reduce the size of the table by subpartitioning it by logTime. I do not know how to create own partition for 50 different stationID's and create subpartitions for them using timestamp.
Thank you for your replies. SELECT queries are getting slower. To me it seems like they are getting slower linearly as the table grows. The issue must be with the GROUP BY clause. This is an example query:
SELECT DATE_FORMAT(logTime,"%Y%m%d%H%i%s") AS 'logTime', SUM(Value)
FROM His
WHERE stationID=23 AND valueNumberID=4 AND logTime > '2013-01-01 00:00:00' AND logTime < '2013-11-14 00:00:00'
GROUP BY DATE_FORMAT(logPVM,"%Y%m")
ORDER BY logTime LIMIT 0,120;
The objective of queries like this is to give AVG, MAX, MIN or SUM in hour, day, week or month intervals. The result of the query is tightly bound to how the results are presented to the user in various ways (graph, Excel file), and changing the queries would take a long time. So I was looking for an easy way out with partitioning.
An estimated 1.2-1.4M rows per month are added to this table.
Thank you