Can I construct partitions in a DolphinDB database by hour? - partitioning

According to the DolphinDB tutorial, the optimal size of a partition is around 100 MB. For my use case, partitions based on dates are too large; I need to construct partitions based on date plus hour. Does anyone know the easiest way to do this?

You can use a COMPOSITE partition of DATE and HOUR, but a more straightforward way is to use the data type "DATEHOUR".
https://www.dolphindb.com/help/index.html?datehour.html
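As a minimal sketch, a DFS database with one VALUE partition per datehour might look like the following (the path dfs://tickDB, the table name tick, and the columns ts, sym, price are invented for illustration; check the manual page above for exact signatures):

// one VALUE partition per hour, spanning two days
db = database("dfs://tickDB", VALUE, datehour(2023.02.01T00:00:00)..datehour(2023.02.02T23:00:00))
// empty in-memory table defining the schema; ts (TIMESTAMP) is the partitioning column
t = table(1:0, `ts`sym`price, [TIMESTAMP, SYMBOL, DOUBLE])
pt = db.createPartitionedTable(t, `tick, `ts)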

Related

In MySQL, is it worthwhile creating more than one multi-column index on the same set of columns?

I am new to SQL, and certainly to MySQL.
I have created a table named trade from streaming market data; it looks like this:
date       | time                    | instrument | price   | quantity
-----------|-------------------------|------------|---------|---------
2017-09-08 | 2017-09-08 13:16:30.919 | 12899586   | 54.15   | 8000
2017-09-08 | 2017-09-08 13:16:30.919 | 13793026   | 1177.75 | 750
2017-09-08 | 2017-09-08 13:16:30.919 | 1346049    | 1690.8  | 1
2017-09-08 | 2017-09-08 13:16:30.919 | 261889     | 110.85  | 50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date, time, instrument), because most of my queries will select a specific date or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry too much about the additional database write time due to index updates?
I get data every second, and it takes about 100 ms to process and store it in the database. As long as that stays under 1 second, I am fine.
To get the most efficient query, you need to query on a clustered index. According to the documentation, this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument.
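A rough sketch of that suggestion against the trade table from the question (untested; note that a primary key must be unique, so this only works if duplicate (time, instrument) pairs cannot occur):

ALTER TABLE trade
  DROP COLUMN `date`,
  ADD PRIMARY KEY (`time`, instrument);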
A couple of recommendations:
There is no need to store the date and time separately if the time already contains the date. You can instead have one datetime column and store full timestamps in it.
You can then have one index on the datetime and instrument columns, which will make the queries run faster.
With so many inserts and a fixed SELECT shape (always by date first, followed by instrument), I would also suggest looking into columnar databases (like Cassandra); you will get faster writes and reads for such a structure.
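A sketch of what the first two recommendations could look like as a fresh table definition (column types are guesses from the sample data in the question):

CREATE TABLE trade (
    dt         DATETIME(3) NOT NULL,  -- single column replacing date + time, millisecond precision
    instrument INT NOT NULL,
    price      DECIMAL(10, 2) NOT NULL,
    quantity   INT NOT NULL,
    INDEX idx_dt_inst (dt, instrument)
);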
First, your use case sounds like two indexes would be useful: (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume, but you may need to dive into more advanced functionality, such as partitioning and clustered indexes, to get the functionality you need.
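A hedged sketch of daily RANGE partitioning for the trade table (partition names and boundaries are illustrative; note that MySQL requires the partitioning column to be part of every unique key on the table):

ALTER TABLE trade
PARTITION BY RANGE COLUMNS (`date`) (
    PARTITION p20170908 VALUES LESS THAN ('2017-09-09'),
    PARTITION p20170909 VALUES LESS THAN ('2017-09-10'),
    PARTITION pmax      VALUES LESS THAN (MAXVALUE)
);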
Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
(note the AND in place of the comma)
The optimal index for such a query is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
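For example, with INDEX(instrument, date), a multi-day request for one instrument becomes a single contiguous index range scan (a sketch using the table from the question):

SELECT * FROM trade
WHERE instrument = 261889
  AND `date` >= '2017-09-08'
  AND `date` <  '2017-09-11';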
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: this pattern works for DATE, DATETIME, DATETIME(3), etc., without tripping over midnight at the end of the range.
Data volume?
150M rows? Ten new rows per second? At one row per second that works out to about five years of data; at ten per second, closer to half a year. Either way, a steady insertion rate of 10/sec is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, there could be a problem. Need to see the datatypes, too, to look for ways to shrink the row size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always one day? One instrument? How many rows are typically returned?
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?
SPACE IS CHEAP. Indexes take time to create and to maintain on each insert (once), but they shave time off retrieval (many, many times).
My experience is to create several indexes covering all the relevant fields in various orders. This way, MySQL can choose the best index for your query.
So if you have 3 relevant fields
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.
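In concrete syntax, that list corresponds to something like the following (index and table names invented for illustration):

CREATE INDEX idx_f1_f2_f3 ON mytable (field1, field2, field3);
CREATE INDEX idx_f1_f3    ON mytable (field1, field3);
CREATE INDEX idx_f2_f3    ON mytable (field2, field3);
CREATE INDEX idx_f3       ON mytable (field3);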

Should I use timestamp or separated time columns in large EAV table?

I have an extremely long table using the EAV model.
My columns are typically ID, Timestamp, Value.
I currently have an index on ID and Timestamp to increase the performance of my queries, but they still seem slow.
What happens if I split the timestamp into separate integer fields and create an index on those fields? Something like this:
Year(Int), Month(Int), Day(Int), Time(TimeStamp), ID, Value.
Does it increase performance?
Today I'm using two kinds of databases, MySQL and PostgreSQL, but I have the same doubt for both.
It's definitely the wrong direction. The new index won't increase performance, and some of your queries may give you trouble. Think, e.g., about the condition
where tstamp between '2015-11-22' and '2016-02-03'
and try to write it so it can use the new index(es).
(At least in MySQL...)
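To make that concrete, here is roughly what the same range becomes once the timestamp is split into Year/Month/Day integer columns (a sketch using the column names from the question); no simple index range can serve it:

WHERE (Year = 2015 AND ((Month = 11 AND Day >= 22) OR Month = 12))
   OR (Year = 2016 AND (Month = 1 OR (Month = 2 AND Day <= 3)))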
Do not split up a datetime.
Do not use partitioning.
Do not use multiple tables (one per month / user / etc).
Do think about alternatives to EAV; that model is inherently bad. (I added a tag; look at other discussions.)
Do use appropriate 'composite' indexes.
Do provide SHOW CREATE TABLE and some SELECTs so we can help you more specifically.
Do use this pattern for date ranges:
WHERE dt >= '2017-02-26'
AND dt < '2017-02-26' + INTERVAL 7 DAY
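Combining the last two points for the table described in the question (the table name eav and the columns ID and tstamp are assumed for illustration):

ALTER TABLE eav ADD INDEX idx_id_ts (ID, tstamp);

SELECT Value
FROM eav
WHERE ID = 42
  AND tstamp >= '2017-02-26'
  AND tstamp <  '2017-02-26' + INTERVAL 7 DAY;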

MySQL Partitioning user-generated rows by user

I have two tables: userMessages and userStatistics
I realized that I need to set up partitioning in order to ensure efficiency. From all the information I could gather, I am supposed to use HASH partitioning.
PARTITION BY HASH(user_id) PARTITIONS 101
Why do I have to define the number of partitions? Is it possible to partition by the number of users? I want to partition all the messages and statistics by user. What number of partitions should I use?
More Context
Let's use my userStatistics table as an example. It stores a new entry every day to capture each user's daily activity: impressions, click-throughs, etc. I expect this table to get very large over time, exceeding a million rows within a year. I was thinking of just creating a separate indexed table for each user, but was told about partitioning using HASH. What is the best way to approach this case?
Partition by HASH will not provide any efficiency. In fact, there are only a few RANGE partitionings that provide any efficiency. I suspect your use case does not apply. See http://mysql.rjweb.org/doc.php/partitionmaint
Describe your use case further if you would like to debate the issue.
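For contrast, one of the few partition layouts that does provide a benefit is RANGE on time, used for purging old data. A hedged sketch with assumed table, column, and partition names:

ALTER TABLE userStatistics
PARTITION BY RANGE (TO_DAYS(stat_date)) (
    PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    PARTITION pmax     VALUES LESS THAN (MAXVALUE)
);
-- dropping a month of history is then a near-instant metadata operation
ALTER TABLE userStatistics DROP PARTITION p2023_01;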

MySQL Partition usage stats

I applied partitioning to my tables today, and would now like to see stats for each partition (how many rows per partition).
Now, I partitioned it by date, so it's quite easy to get via "SELECT COUNT(*) FROM table WHERE date >= ... AND date <= ...". However, what happens when you partition your tables by, e.g., KEY?
I checked the MySQL online manual, but it only shows solutions similar to the one I described above. There has got to be a simpler (or fancier-looking, so to speak) method.
Cheers
Put EXPLAIN PARTITIONS in front of your select:
EXPLAIN PARTITIONS SELECT ... FROM table ....
For more info see:
http://dev.mysql.com/doc/refman/5.1/en/partitioning-info.html
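If what you want is specifically row counts per partition, INFORMATION_SCHEMA.PARTITIONS exposes them directly (the counts are estimates for InnoDB; the schema and table names below are placeholders):

SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME = 'mytable';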

Best way to handle MySQL date for performance with thousands of users

I am currently part of a team designing a site that will potentially have thousands of users who will be doing a number of date related searches. During the design phase we have been trying to determine which makes more sense for performance optimization.
Should we store the datetime field as a MySQL DATETIME, or should we break it up into a number of fields (year, month, day, hour, minute, ...)?
The question is: with a large data set and a potentially large set of users, would we gain performance by breaking the datetime into multiple fields rather than relying on MySQL date functions? Or is MySQL already optimized for this?
Have a look at the MySQL Date & Time Functions documentation: you can pull specific information from a date using existing functions like YEAR, MONTH, etc. But if you have an index on the date column(s), applying these functions in a WHERE clause means those indexes cannot be used.
The problem with storing a date as separate components is the work needed to reconstruct them into a date when you want to do range comparisons or date operations.
Ultimately, choose what works best with your application. If there's seldom need for the date to be split out, consider using a VIEW to expose the date components without writing possibly redundant information into your tables.
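A sketch of that VIEW idea, with an invented table trades and datetime column dt:

CREATE VIEW trades_expanded AS
SELECT t.*,
       YEAR(t.dt)  AS yr,
       MONTH(t.dt) AS mo,
       DAY(t.dt)   AS dy
FROM trades t;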
Use a regular DATETIME field. You can always switch over to separated components down the line if performance becomes an issue. Try to avoid premature optimization; in many cases, YAGNI. You may wind up employing both the datetime field and the separated-component approach, since they both have their strengths.
If you know ahead of time some key criteria that all searches will have, MySQL (>= v5.1) table partitioning might help.
For example, if you have a table like this:
create table Books(pubDate dateTime, title varchar(50));
And you know all searches must at least include a year, you could partition it on the date field, along these lines:
create table Books(pubDate dateTime, title varchar(50))
partition by hash(year(pubDate)) partitions 10;
Then, when you run a SELECT against the table, if your WHERE clause includes criteria that limit which partitions the results can live in, the search will scan only those partitions rather than the full table. You can see this in action with:
-- scans entire table
explain partitions select * from Books where title like '%title%';
versus something like:
-- scans just one partition
explain partitions select * from Books
where year(pubDate)=2010
and title like '%title%';
The MySQL documentation on this is quite good, and you can choose from multiple partitioning algorithms.
Even if you opt to break up the date, a table partition on, say, year (int) (assuming searches will always specify a year) could help.