Partition pruning with Impala and Parquet - partitioning

We have a fact table we wish to partition by month. (This is because of our data volume, and because we want partition files of at least 256 MB, as per Parquet best practice.) If the data grows, we may want to go weekly.
The table will ALWAYS be queried for a specific day, and one day only. (It's a snapshot.)
So I tried a simple test: a basic table with an integer date key, partitioned by an integer year-month key.
I imagined that if I queried for 01/01/2011, it would use only the 01-2011 partition. Unfortunately it doesn't: the explain plan shows it scans both partitions.
I computed stats too, thinking the stats would know the min and max values of the date column and would therefore know not to hit one of the partitions, but this didn't change anything.
Is that expected? Maybe my example is too simplistic. Is the explain plan misleading? I can imagine many use cases where you would filter by a single date field but be partitioned by year and month. How is this supposed to work?
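For reference, here is a minimal sketch of the kind of test described above, in Impala SQL. The table and column names (fact_snapshot, date_key, month_key, amount) are made up for illustration and are not from the original test. As far as I understand Impala's partition pruning, it works from predicates on the partition key column itself, so the second query, which also filters on month_key, is the form I would expect to be pruned; compare the two plans to check.

```sql
-- Hypothetical test table: integer date key, partitioned by integer year-month key.
CREATE TABLE fact_snapshot (
    date_key INT,      -- e.g. 20110101
    amount   DOUBLE
)
PARTITIONED BY (month_key INT)   -- e.g. 201101
STORED AS PARQUET;

-- Filter on the date key only: nothing here mentions the partition column.
EXPLAIN
SELECT * FROM fact_snapshot
WHERE date_key = 20110101;

-- Same filter plus an explicit predicate on the partition column.
EXPLAIN
SELECT * FROM fact_snapshot
WHERE date_key = 20110101
  AND month_key = 201101;
```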

Related

Making a groupby query faster

This is my data from my table:
I have exactly one million rows, so this is just a snippet.
I would like to make this query faster:
It basically groups the values by time (ev represents year, honap represents month, and so on). The problem is that it takes a lot of time. I tried to add indexes, as you can see here:
but it does absolutely nothing.
Here is my index:
I have also tried adding perc (which represents minute) to the index because of its cardinality, but MySQL doesn't want to use that. Could you give me any suggestions?
Is the data realistic? If so, why run the query -- it essentially delivers exactly what was in the table.
If, on the other hand, you had several rows per minute, then the GROUP BY makes sense.
The index you have is not worth using. However, the Optimizer seemed to like it. That's a bug.
In that case, I would simplify it to this:
SELECT AVG(konyha1) AS 'avg',
LEFT(time, 16) AS 'time'
FROM onemilliondata
GROUP BY LEFT(time, 16)
A DATE or TIME or DATETIME can be treated as such a datatype or as a VARCHAR. I'm asking for it to be a string.
Even in this case, no index is useful. However, this would make it a little faster:
PRIMARY KEY(time)
and the table would have only 2 columns: time, konyha1.
It is rarely beneficial to break a date and/or time into components and put them into columns.
A million points will probably choke a graphing program. And the screen -- which has a resolution of only a few thousand.
Perhaps you should group by hour? And use LEFT(time, 13)? Performance would probably be slightly faster -- but only because less data is being sent to the client.
If you are collecting this data "forever", consider building and maintaining a "summary table" of the averages for each unit of time. Then the incremental effort is, say, aggregating yesterday's data each morning.
You might find MIN(konyha1) and MAX(konyha1) interesting to keep on an hourly or daily basis. Note that daily or weekly aggregates can be derived from hourly values.
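A minimal sketch of such a summary table, assuming onemilliondata has a DATETIME column named time and the value column konyha1 as in the query above. The table name konyha1_hourly and its columns are hypothetical. Storing the count and the sum, rather than the average, is what lets daily or weekly averages be derived correctly from the hourly rows.

```sql
-- Hypothetical hourly summary table.
CREATE TABLE konyha1_hourly (
    hr          DATETIME     NOT NULL,  -- start of the hour
    cnt         INT UNSIGNED NOT NULL,  -- number of raw rows in that hour
    sum_konyha1 DOUBLE       NOT NULL,
    min_konyha1 DOUBLE       NOT NULL,
    max_konyha1 DOUBLE       NOT NULL,
    PRIMARY KEY (hr)
);

-- Run once each morning: aggregate yesterday's raw rows.
INSERT INTO konyha1_hourly (hr, cnt, sum_konyha1, min_konyha1, max_konyha1)
SELECT DATE_FORMAT(`time`, '%Y-%m-%d %H:00:00'),
       COUNT(*), SUM(konyha1), MIN(konyha1), MAX(konyha1)
FROM onemilliondata
WHERE `time` >= CURDATE() - INTERVAL 1 DAY
  AND `time` <  CURDATE()
GROUP BY DATE_FORMAT(`time`, '%Y-%m-%d %H:00:00');

-- Daily figures derived from the hourly rows.
SELECT DATE(hr)                    AS day,
       SUM(sum_konyha1) / SUM(cnt) AS avg_konyha1,
       MIN(min_konyha1)            AS min_konyha1,
       MAX(max_konyha1)            AS max_konyha1
FROM konyha1_hourly
GROUP BY DATE(hr);
```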

In MySQL, is it worthwhile creating more than one multi-column index on the same set of columns?

I am new to SQL, and certainly to MySQL.
I have created a table from streaming market data named trade that looks like
date | time |instrument|price |quantity
----------|-----------------------|----------|-------|--------
2017-09-08|2017-09-08 13:16:30.919|12899586 |54.15 |8000
2017-09-08|2017-09-08 13:16:30.919|13793026 |1177.75|750
2017-09-08|2017-09-08 13:16:30.919|1346049 |1690.8 |1
2017-09-08|2017-09-08 13:16:30.919|261889 |110.85 |50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date,time,instrument) because most of my queries will select a specific date
or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry much about the additional database write time caused by updating a second index?
I get data every second, and take about 100 ms to process it and store in a database. As long as I continue to take less than 1 sec I am fine.
To get the most efficient query you need to query on a clustered index. According to the documentation, this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument
A couple of recommendations:
There is no need to store date and time separately if time corresponds to time of the same date. You can instead have one datetime column and store timestamps in it
You can then have one index on datetime and instrument columns, that will make the queries run faster
With so many inserts and a fixed format of SELECT query (i.e. always by date first, followed by instrument), I would suggest looking into a wide-column database (like Cassandra). You will get faster writes and reads for such a structure
First, your use case sounds like two indexes would be useful: (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume. But, you may need to dive into more advanced functionality, such as partitioning and clustered indexes to get the functionality you need.
Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
^^^
Optimal index for such is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
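As a sketch, using the trade table and columns from the question (the index name inst_date is made up):

```sql
-- Equality column first, the range-capable column last.
ALTER TABLE trade ADD INDEX inst_date (instrument, `date`);

-- One instrument, one day.
SELECT *
FROM trade
WHERE instrument = 261889
  AND `date` = '2017-09-08';

-- One instrument, several days: still a single contiguous range in the index.
SELECT *
FROM trade
WHERE instrument = 261889
  AND `date` >= '2017-09-04'
  AND `date` <  '2017-09-09';
```

With the date first, a multi-day range would have to walk every instrument's rows for those days before filtering on instrument.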
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: This pattern works for DATE, DATETIME, DATETIME(3), etc, without messing up with the midnight at the end of the range.
Data volume?
150M rows? 10 new rows per second? That means you have about half a year of data? A steady 10/sec insertion rate is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, then there could be a problem. Need to see the datatypes to look for shrinking the size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always 1 day? One instrument? How many rows are typically returned.
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?
SPACE IS CHEAP. Indexes cost time when they are created and on each insert (once per row), but they save time on retrieval (many, many times).
My experience is to create as many indexes as needed, covering the relevant fields in all orders. This way, MySQL can choose the best index for your query.
So if you have 3 relevant fields
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.

How to partition MySQL table by day?

I'm running MySQL 5.1 and storing data from web logs into a table. There is a datetime column which I want to partition by day. Every night I add new data from the previous day into the table, which is why I want to partition by day. It is usually a few million rows. I want to partition by day because it usually takes 20 seconds for a MySQL query to complete.
In short, I want to partition by each day because users can click on a calendar to get web log information consisting of a day's worth of data. The data spans millions of row (for a single day).
The problem that I've seen with a lot of partitioning articles is that you have to explicitly specify what values you want to partition for? I don't like this way because it means that I'll have to alter the table every night in order to add an extra partition. Is there a built in MySQL feature to do this for me automatically, or will I have to write a bash script/cron job to alter the table for me every night?
For example, if I were to follow the following example:
http://datacharmer.blogspot.com/2008/12/partition-helper-improving-usability.html
In one year, I would have 365 partitions.
Indexes are a must for any table. The details of the index(es) derive from the SELECTs you have; let's see them.
Rules of thumb:
Don't partition a table of less than a million rows
Don't use more than about 50 partitions.
If you are 'purging old data' after some number of days/weeks/months, see my blog for the code on how to do that.
PARTITION BY RANGE() is the only useful partition mechanism.
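For what it is worth, a minimal sketch of daily RANGE partitioning and of the statement a nightly job would have to issue. The table and column names (weblog, log_time, url) and the partition names are invented for illustration. As far as I know MySQL 5.1 has no built-in way to create the next partition automatically, so something like a cron job runs the ALTER. Note that one partition per day quickly exceeds the "about 50 partitions" rule of thumb above unless old partitions are dropped.

```sql
-- Hypothetical web log table, one partition per day plus a catch-all.
CREATE TABLE weblog (
    log_time DATETIME     NOT NULL,
    url      VARCHAR(255) NOT NULL
)
PARTITION BY RANGE (TO_DAYS(log_time)) (
    PARTITION p20110101 VALUES LESS THAN (TO_DAYS('2011-01-02')),
    PARTITION p20110102 VALUES LESS THAN (TO_DAYS('2011-01-03')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Nightly: split the catch-all to make room for the next day.
ALTER TABLE weblog REORGANIZE PARTITION pfuture INTO (
    PARTITION p20110103 VALUES LESS THAN (TO_DAYS('2011-01-04')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Purging an old day is a cheap metadata operation compared with DELETE.
ALTER TABLE weblog DROP PARTITION p20110101;
```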
I tried this once. I ended up creating a cron job to do the partitioning on a regular basis (once a month). Keep in mind that you have a maximum of 1024 partitions per table (http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations.html).
Offhand, I probably wouldn't recommend it. For my needs, I saw this created a significant slowdown in any searches that required cross-partition results.
Based on your updated explanation, I would first recommend creating the necessary indexes. I would read the MySQL Optimization chapter (specifically the section on indexes) to better learn how to ensure you have the necessary indexes. You can also use the slow query log to help isolate the problematic queries.
Once you have that narrowed down, I can see your need for partitioning change to wanting to partition to limit the size of a particular partition (perhaps for storage space or for quick truncation, etc). At that point, you may decide to partition on a monthly or annual basis.
Partitioning using the date as a partition key will obviously force you into creating an index for the date field. Start with that and see how it goes before you get into the extra efforts of partitioning on a scheduled basis.

...still not getting results trying to optimize mysql innodb table for fast count

I posted this question here a while ago. I tried out the suggestions and came to the conclusion that I must be doing something fundamentally wrong.
What I basically want to do is this:
I have a table containing 83 million time/price pairs. As the index I'm using a millisecond-accurate Unix timestamp; the price ranges between 1.18775 and 1.60400 (decimal with 5 decimal places).
I have a client that needs to get the price densities for a given time interval, meaning I want to take a specified interval of time and count how many times each of the different prices appears in this interval.
How would you guys do this? How would you design/index the table? Right now I'm building a temporary subtable containing only the data for the given interval and then doing the counts on the prices. Is there a better way to do this? My general DB settings are already tuned and pretty performant. Thanks for any hints! I will provide any additional information needed as fast as I can!
Given that you have a large amount of data and it's growing very rapidly, I'd be inclined to add a second table of:
price (primary key)
time( some block - also part of PK )
count
Do an 'insert on duplicate key update count++' sort of thing. Group the time field by some predetermined interval (depends on the sorts of queries you get.. ms/sec/hour/whatever). This way you:
don't have to mess with temp tables - with a table of this size it will write to disk - slow even with SSD
don't have to touch the initial table every time you want to do your query - might run into locking issues
You will have to average out your data a bit, but the granularity can be predetermined to cause as few issues as possible.
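A rough sketch of that counting table, assuming the raw timestamps are Unix milliseconds and choosing one-hour blocks. The names price_counts, time_block and cnt are hypothetical, and so is the block size. I put the time block first in the primary key so that a time-range query can use it; the column list above has price first, so treat the order as a design choice to test.

```sql
-- Hypothetical per-block price counter.
CREATE TABLE price_counts (
    time_block INT UNSIGNED NOT NULL,  -- Unix milliseconds DIV 3600000 = hour number
    price      DECIMAL(6,5) NOT NULL,  -- fits the 1.18775 .. 1.60400 range
    cnt        INT UNSIGNED NOT NULL,
    PRIMARY KEY (time_block, price)
);

-- For every incoming tick: create the counter row, or bump it if it exists.
INSERT INTO price_counts (time_block, price, cnt)
VALUES (1300000000123 DIV 3600000, 1.23456, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;

-- Price density for an interval given as two Unix-millisecond bounds.
SELECT price, SUM(cnt) AS occurrences
FROM price_counts
WHERE time_block BETWEEN 1300000000000 DIV 3600000
                     AND 1300086400000 DIV 3600000
GROUP BY price;
```

The query is only exact to the block boundary, which is the loss of granularity mentioned above.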

Best way to handle MySQL date for performance with thousands of users

I am currently part of a team designing a site that will potentially have thousands of users who will be doing a number of date related searches. During the design phase we have been trying to determine which makes more sense for performance optimization.
Should we store the datetime field as a MySQL datetime? Or should we break it up into a number of fields (year, month, day, hour, minute, ...)?
The question is with a large data set and a potentially large set of users, would we gain performance wise breaking the datetime into multiple fields and saving on relying on mysql date functions? Or is mysql already optimized for this?
Have a look at the MySQL Date & Time Functions documentation, because you can pull specific information from a date using existing functions like YEAR, MONTH, etc. But while these exist, if you have an index on the date column(s), using these functions means those indexes can not be used...
The problem with storing a date as separate components is the work needed to reconstruct them into a date when you want to do range comparisons or date operations.
Ultimately, choose what works best with your application. If there's seldom need for the date to be split out, consider using a VIEW to expose the date components without writing possibly redundant information into your tables.
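To illustrate both points with a sketch (the table events, its columns, and the view name are invented for the example): the first query wraps the indexed column in functions, the second expresses the same month as a range on the bare column, and the view exposes the components without storing them.

```sql
-- Hypothetical table with an index on the datetime column.
CREATE TABLE events (
    id         INT      NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL,
    INDEX (created_at)
);

-- The functions hide the column from the index.
SELECT * FROM events
WHERE YEAR(created_at) = 2010 AND MONTH(created_at) = 6;

-- Same rows, written as a range on the column itself, so the index is usable.
SELECT * FROM events
WHERE created_at >= '2010-06-01'
  AND created_at <  '2010-06-01' + INTERVAL 1 MONTH;

-- Date components exposed through a view instead of extra columns.
CREATE VIEW events_with_parts AS
SELECT id,
       created_at,
       YEAR(created_at)  AS yr,
       MONTH(created_at) AS mon,
       DAY(created_at)   AS dy
FROM events;
```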
Use a regular datetime field. You can always switch over to the separated components down the line if performance becomes an issue. Try to avoid premature optimization - in many cases, YAGNI. You may wind up employing both the datetime field and the separated component methodology, since they both have their strengths.
If you know ahead of time some key criteria that all searches will have, MySQL (>= v5.1) table partitioning might help.
For example, if you have a table like this:
create table Books(pubDate dateTime, title varchar(50));
And you know all searches must at least include a year, you could partition it on the date field, along these lines:
create table Books(pubDate dateTime, title varchar(50))
partition by hash(year(pubDate)) partitions 10;
Then, when you run a select against the table, if your where clause includes criteria that limit the partition the results can exist on, the search will only scan that partition, rather than a full table scan. You can see this in action with:
-- scans entire table
explain partitions select * from Books where title like '%title%';
versus something like:
-- scans just one partition
explain partitions select * from Books
where year(pubDate)=2010
and title like '%title%';
The MySQL documentation on this is quite good, and you can choose from multiple partitioning algorithms.
Even if you opt to break up the date, a table partition on, say, year (int) (assuming searches will always specify a year) could help.
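A variant sketch using RANGE partitioning on the year, with the filter written as a range on pubDate itself. As far as I know, the optimizer prunes from comparisons against the partitioning column, so this form is the safer one to rely on; check the partitions column of the EXPLAIN output on your own version rather than taking my word for it. The table name BooksByYear and the partition names are made up.

```sql
-- Hypothetical variant of the Books table, one partition per year.
CREATE TABLE BooksByYear (
    pubDate DATETIME NOT NULL,
    title   VARCHAR(50)
)
PARTITION BY RANGE (YEAR(pubDate)) (
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- The partitions column should list only a subset of the partitions.
EXPLAIN PARTITIONS
SELECT * FROM BooksByYear
WHERE pubDate >= '2010-01-01'
  AND pubDate <  '2011-01-01'
  AND title LIKE '%title%';
```

On newer MySQL versions plain EXPLAIN shows the partitions column and the PARTITIONS keyword is no longer accepted.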