Apache Drill Kudu query doesn't support range + hash multilevel partition

Drill Kudu query doesn't support range + hash multilevel partition.
Kudu table :
CREATE TABLE test1 (
id INT,
name STRING,
value STRING,
PRIMARY KEY (id, name)
)
PARTITION BY HASH (name) PARTITIONS 8,
RANGE (id) (
PARTITION 0 <= VALUES < 10000,
PARTITION 10000 <= VALUES < 20000,
PARTITION 20000 <= VALUES < 30000,
PARTITION 30000 <= VALUES < 40000
);
I then inserted 20,002 rows into test1, but the query is not supported.
Query SQL: select count(1) from kudu.table_name; result: **No result found.**

Related

MySQL optimize query with Date range

I have the following table structure.
id (INT) index
date (TIMESTAMP) index
companyId (INT) index
This is the problem I am facing
companyId 111: has a total of 100000 rows in a 1 year time period.
companyId 222: has a total of 8000 rows in a 1 year time period.
If companyId 111 has 100 rows between '2020-09-01 00:00:00' AND '2020-09-06 23:59:59' and companyId 222 has 2000 rows in the same date range, companyId 111 is much slower than 222, even though it has fewer rows in the selected date range.
Shouldn't MySQL ignore all the rows outside the date range so the query becomes faster?
This is a query example I am using:
SELECT columns FROM table WHERE date BETWEEN '2020-09-01 00:00:00' AND '2020-09-06 23:59:59' AND companyId = 111;
Thank you
I would suggest a composite index here:
CREATE INDEX idx ON yourTable (companyId, date);
The problem with your premise is that, while you have an index on each column, you don't have any indices completely covering the WHERE clause of your example query. As a result, MySQL might even choose to not use any of your indices. You can also try reversing the order of the index above to compare performance:
CREATE INDEX idx ON yourTable (date, companyId);
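The effect of the composite index can be sketched with Python's sqlite3 module as a stand-in for MySQL (the planner details differ, but the covering-the-WHERE-clause idea is the same); the table and column names mirror the question, and the sample rows are invented:

```python
import sqlite3

# Minimal sketch using sqlite3 in place of MySQL; yourTable, companyId and
# date come from the question, the data below is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER PRIMARY KEY, date TEXT, companyId INTEGER)")
conn.executemany(
    "INSERT INTO yourTable (date, companyId) VALUES (?, ?)",
    [("2020-09-%02d 12:00:00" % d, c) for d in range(1, 30) for c in (111, 222)],
)

# The composite index covers the whole WHERE clause of the example query.
conn.execute("CREATE INDEX idx ON yourTable (companyId, date)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM yourTable "
    "WHERE date BETWEEN '2020-09-01 00:00:00' AND '2020-09-06 23:59:59' "
    "AND companyId = 111"
).fetchall()
# The plan detail should report a search using idx rather than a full scan.
print(plan)
```

Running the same EXPLAIN before creating the index shows a full table scan instead, which is the behaviour the question describes.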

Select one piece of data from every day at a specific hour MySQL

My database has data inserted every 1 minute, stored in the format 2020-04-05 16:20:04 under a column called timestamp.
I need a MySQL query to select data from every day at a specific hour (the exact minute and second do not matter); for example, I want the data from 16:00 of every day from the past 30 days.
Currently it just grabs the data from the past 30 days and the PHP application sorts it; however, this causes very slow loading times, hence I want to select only the wanted rows in the database.
Please try the following sql:
select
d.timestamp, hour(d.timestamp)
from
demo1 d
where
DATEDIFF(NOW(), d.timestamp) < 30 and hour(d.timestamp) = 16;
The create SQL is as follows:
CREATE TABLE `demo1` (
`id` int(11) not null auto_increment primary key,
`serverid` int(11) not null,
`timestamp` datetime not null,
KEY `idx_timestamp` (`timestamp`)
) engine = InnoDB;
insert into `demo1` (serverid, timestamp)
VALUES (1, "2020-07-05 16:20:04"),
(2, "2020-07-06 17:20:04"),
(3, "2020-07-07 16:40:04"),
(4, "2020-07-08 08:20:04"),
(5, "2020-07-05 15:20:04"),
(5, "2020-07-05 16:59:04"),
(5, "2020-06-04 16:59:04");
Zhiyong's response will work, but won't perform well. You need to figure out a way to get the query to use indexes.
You can add a simple index on timestamp and run the query this way:
SELECT
d.timestamp, d.*
FROM demo1 d
WHERE 1
AND d.timestamp > CURDATE() - INTERVAL 30 DAY
AND hour(d.timestamp) = 16;
In MySQL 5.7 and up, you can create a generated column (also called a calculated column) to store the hour of the timestamp in a separate column. You can then index this column, perhaps as a composite index of hour + timestamp, so that the query above performs really quickly.
ALTER TABLE demo1
ADD COLUMN hour1 tinyint GENERATED ALWAYS AS (HOUR(timestamp)) STORED,
ADD KEY (hour1, timestamp);
The result query would be:
SELECT
d.timestamp, d.*
FROM demo1 d
WHERE 1
AND d.timestamp > CURDATE() - INTERVAL 30 DAY
AND hour1 = 16;
More info on that here:
https://dev.mysql.com/doc/refman/5.7/en/create-table-generated-columns.html
https://dev.mysql.com/doc/refman/5.7/en/generated-column-index-optimizations.html
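The hour-of-day filter can be sketched with Python's sqlite3 module; the demo1 schema and sample rows follow the answer above. SQLite has no HOUR() function, so CAST(strftime('%H', ...) AS INTEGER) stands in for it, and an expression index plays roughly the role MySQL's indexed generated column plays. A fixed reference date ('2020-07-10') replaces CURDATE() so the sample data stays in range:

```python
import sqlite3

# Sketch of the hour-of-day filter using sqlite3 in place of MySQL; schema
# and sample rows follow the answer above, the reference date is invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE demo1 (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    serverid INTEGER NOT NULL,
    timestamp TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO demo1 (serverid, timestamp) VALUES (?, ?)",
    [(1, "2020-07-05 16:20:04"),
     (2, "2020-07-06 17:20:04"),
     (3, "2020-07-07 16:40:04"),
     (4, "2020-07-08 08:20:04"),
     (5, "2020-07-05 15:20:04"),
     (5, "2020-07-05 16:59:04"),
     (5, "2020-06-04 16:59:04")])

# Expression index on the extracted hour, analogous to indexing the
# MySQL generated column hour1 together with the timestamp.
conn.execute("CREATE INDEX idx_hour_ts ON demo1 "
             "(CAST(strftime('%H', timestamp) AS INTEGER), timestamp)")

rows = conn.execute("""
    SELECT timestamp FROM demo1
    WHERE timestamp > date('2020-07-10', '-30 days')
      AND CAST(strftime('%H', timestamp) AS INTEGER) = 16
    ORDER BY timestamp""").fetchall()
print(rows)
```

Only the three rows falling in hour 16 within the 30-day window come back; the 2020-06-04 row is excluded by the date bound even though its hour matches.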

MySQL - Move data between partitions aka re-partition

I have a MySQL table whose partitions look as below:
p2015h1 - Contains data where date < 2015-07-01 (has data from 2015-06-01, hence only a month's worth of data)
p2015h2 - Contains data where date < 2016-01-01
p2016h1 - Contains data where date < 2016-07-01
p2016h2 - Contains data where date < 2017-01-01
I'd like the new partitions to be quarterly based as below -
p0 - Contains data where date < 2015-10-01
p1 - Contains data where date < 2016-01-01
p2 - Contains data where date < 2016-04-01
p3 - Contains data where date < 2016-07-01
I started by reorganizing the first partition & executed the below command. All went well.
alter table `table1` reorganize partition `p2015half1` into (partition `p0` values less than ('2015-10-01'));
Now, as the existing partition p2015h2 has data up to 2015-10-01, how could I move that part of the data into partition p0? I would need to do the same with the other partitions too as I continue building the new ones.
I did try to remove partitioning on the table fully, but, the table is billions of rows in size & hence the operation will take days. Post this I will have to rebuild the partitions which will take days again. Hence, I decided to take the approach of splitting partitions.
I'm stuck at this point in time. I'd fully appreciate any guidance here please.
mysql> alter table `table1` reorganize partition p0,p2015half2 into (partition p00 values less than ('2015-07-01'), partition p1 values less than ('2016-01-01'));
mysql> alter table `table1` reorganize partition p00 into (partition p0 values less than ('2015-07-01'));
mysql> alter table `table1` reorganize partition p2016half1,p2016half2 into (partition p2 values less than ('2016-04-01'), partition p3 values less than ('2016-07-01'),partition p4 values less than maxvalue);

Join two tables with a less-than-or-equals condition in the join

Below are two tables I have, with sample data. Table A contains the dollar rate (into Indian rupees) per year, and Table B contains amounts per year. I want to convert dollars into rupees as per the year.
Table A
Rate Year
47 2001
49 2003
55 2004
Table B
Amt Year
25$ 2001
34$ 2002
Question: for the first record (year 2001) we have an entry in both tables, so we can do this easily using the query below
sel A.Rate * B.Amt
from A,
B
where B.year = A.year
But for the second record (i.e. year 2002) we do not have an entry in table A (the rate table), so for these kinds of cases I want to use the rate value from the previous available year (i.e. 47 rupees from year 2001).
Here is the solution (for each B row, pick the A row with the greatest year not exceeding B's year):
select A.rate*B.amt
from A,B
where A.year = (select max(a2.year) from A a2 where a2.year <= B.year);
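This "carry the latest earlier rate forward" lookup can be verified with Python's sqlite3 module; tables A and B mirror the question, with the amounts stored as plain numbers. The key point is that the correlated subquery must find the greatest A.year that does not exceed each B.year:

```python
import sqlite3

# Sketch of the as-of lookup in sqlite3; tables A (rate per year) and
# B (amount per year) mirror the question, amounts stored as numbers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (rate INTEGER, year INTEGER)")
conn.execute("CREATE TABLE B (amt INTEGER, year INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(47, 2001), (49, 2003), (55, 2004)])
conn.executemany("INSERT INTO B VALUES (?, ?)", [(25, 2001), (34, 2002)])

# For every B row, join to the A row with the greatest year <= B.year,
# so 2002 falls back to the 2001 rate.
rows = conn.execute("""
    SELECT B.year, A.rate * B.amt
    FROM A, B
    WHERE A.year = (SELECT MAX(a2.year) FROM A a2 WHERE a2.year <= B.year)
    ORDER BY B.year""").fetchall()
print(rows)
```

2001 uses its own rate (47 * 25), while 2002, which has no rate row, falls back to the 2001 rate (47 * 34).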
Oracle: use the LEAD analytic function so you can work out the validity period of each rate.
documentation for LEAD
This is my code :
SELECT
trx.*
,rates.rate_start_date
,rates.rate_end_date
,rates.rate
,trx.amount * rates.rate rup_amount
FROM
xxcjp_forex_trx trx
--this inline view works out the validity period of each rate by ordering all
--the rates and working out the start date of the next row. It uses analytic
--function LEAD
,(SELECT
xfr.rate_date rate_start_date
,xfr.rate
,xfr.currency
,(LEAD(xfr.rate_date) OVER (ORDER BY xfr.currency, xfr.rate_date))-1 rate_end_date
FROM
xxcjp_forex_rates xfr
) rates
WHERE 1=1
AND trx.trx_date BETWEEN rates.rate_start_date AND rates.rate_end_date
AND rates.currency = 'RUP'
ORDER BY
trx.trx_date
;
Based on this data :
CREATE TABLE XXCJP_FOREX_RATES
(rate_date DATE
,currency VARCHAR2(20)
,rate NUMBER
)
;
CREATE TABLE XXCJP_FOREX_TRX
(trx_date DATE
,currency VARCHAR2(20)
,amount NUMBER
)
;
INSERT INTO XXCJP_FOREX_RATES VALUES (TO_DATE('01/03/2016','DD/MM/YYYY'),'RUP',47) ;
INSERT INTO XXCJP_FOREX_RATES VALUES (TO_DATE('03/03/2016','DD/MM/YYYY'),'RUP',49) ;
INSERT INTO XXCJP_FOREX_RATES VALUES (TO_DATE('10/03/2016','DD/MM/YYYY'),'RUP',55) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('01/03/2016','DD/MM/YYYY'),'USD',10) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('02/03/2016','DD/MM/YYYY'),'USD',20) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('03/03/2016','DD/MM/YYYY'),'USD',30) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('04/03/2016','DD/MM/YYYY'),'USD',40) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('05/03/2016','DD/MM/YYYY'),'USD',50) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('06/03/2016','DD/MM/YYYY'),'USD',60) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('07/03/2016','DD/MM/YYYY'),'USD',70) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('08/03/2016','DD/MM/YYYY'),'USD',80) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('09/03/2016','DD/MM/YYYY'),'USD',90) ;
INSERT INTO XXCJP_FOREX_TRX VALUES (TO_DATE('10/03/2016','DD/MM/YYYY'),'USD',100) ;
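The LEAD validity-window idea above can also be run with Python's sqlite3 module (window functions need SQLite 3.25+); ISO date strings replace Oracle DATEs, and a COALESCE fallback is added so transactions after the last rate's start date are not dropped, otherwise the tables and values follow the answer:

```python
import sqlite3

# Sketch of the LEAD validity-window approach in sqlite3; table names and
# values follow the Oracle answer, dates are ISO strings, and the
# '9999-12-31' sentinel end date is an added assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxcjp_forex_rates (rate_date TEXT, currency TEXT, rate REAL)")
conn.execute("CREATE TABLE xxcjp_forex_trx (trx_date TEXT, currency TEXT, amount REAL)")
conn.executemany("INSERT INTO xxcjp_forex_rates VALUES (?, 'RUP', ?)",
                 [("2016-03-01", 47), ("2016-03-03", 49), ("2016-03-10", 55)])
conn.executemany("INSERT INTO xxcjp_forex_trx VALUES (?, 'USD', ?)",
                 [("2016-03-%02d" % d, d * 10) for d in range(1, 11)])

rows = conn.execute("""
    SELECT trx.trx_date, rates.rate, trx.amount * rates.rate AS rup_amount
    FROM xxcjp_forex_trx trx,
         (SELECT xfr.rate_date AS rate_start_date,
                 xfr.rate,
                 xfr.currency,
                 -- next rate's start date is this rate's exclusive upper
                 -- bound; the last rate stays valid via the sentinel
                 COALESCE(LEAD(xfr.rate_date) OVER (PARTITION BY xfr.currency
                                                    ORDER BY xfr.rate_date),
                          '9999-12-31') AS rate_end_date
          FROM xxcjp_forex_rates xfr) rates
    WHERE trx.trx_date >= rates.rate_start_date
      AND trx.trx_date < rates.rate_end_date
      AND rates.currency = 'RUP'
    ORDER BY trx.trx_date""").fetchall()
print(rows)
```

Each transaction picks up the rate whose validity window contains its date: 47 for March 1-2, 49 for March 3-9, and 55 from March 10 on.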

SQL range partitioning using arithmetic expressions (addition) on unsigned ints - will it optimise WHERE queries? (MySQL, PostgreSQL)

I read about range partitioning in MySQL (and PostgreSQL) here. I am also aware that if I partition my table, some WHERE queries will be optimised.
For example partitioning by used_at date:
PARTITION BY RANGE (used_at) (
PARTITION p0 VALUES LESS THAN ('2012-01-01'),
PARTITION p1 VALUES LESS THAN ('2013-01-01'),
PARTITION p2 VALUES LESS THAN ('2014-01-01')
);
Will make queries like:
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
faster, for example, as the search will in practice only use a subtable about a third of the size.
Well the question is if I have two tables:
user (3 000 000 records):
user_id UNSIGNED INT ...
...
messages (50 000 000 records)
sender UNSIGNED INT (refers to user)
recipient UNSIGNED INT (refers to user)
We get threads like:
WHERE ... (sender = 1234567 OR recipient = 1234567)
...
GROUP BY (sender + recipient)
Well, my question is:
a) Am I able to partition by
PARTITION BY RANGE (sender + recipient) (
PARTITION p0 VALUES LESS THAN (1000000),
PARTITION p1 VALUES LESS THAN (2000000),
...
PARTITION p5 VALUES LESS THAN (6000000)
);
?
b) If yes, will it optimise WHERE conditions like
WHERE ... (sender = 1234567 OR recipient = 1234567)
in case of unsigned ints?
The question is basically about MySQL but I am also curious about PostgreSQL and Oracle for the future.
MySQL...
WHERE ... (sender = 1234567 OR recipient = 1234567)
Does not optimize well. It would be better to do:
( SELECT ... WHERE sender = 1234567 )
UNION DISTINCT
( SELECT ... WHERE recipient = 1234567 )
and have separate indexes on sender and recipient (or at least starting with each).
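The OR-to-UNION rewrite can be sketched with Python's sqlite3 module; the messages table, its columns, and the sample rows are invented to match the question's shape. Each UNION branch can use its own single-column index, and UNION (as opposed to UNION ALL) removes the duplicate row where the user is both sender and recipient:

```python
import sqlite3

# Sketch of the OR-to-UNION rewrite in sqlite3; the messages table and
# sample rows are invented to match the question's shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, sender INTEGER, recipient INTEGER)")
conn.executemany("INSERT INTO messages (sender, recipient) VALUES (?, ?)",
                 [(1234567, 42), (42, 1234567), (7, 8), (1234567, 1234567)])
conn.execute("CREATE INDEX idx_sender ON messages (sender)")
conn.execute("CREATE INDEX idx_recipient ON messages (recipient)")

or_rows = conn.execute(
    "SELECT id FROM messages "
    "WHERE sender = 1234567 OR recipient = 1234567 ORDER BY id").fetchall()

# Each branch hits its own index; UNION deduplicates the row where the
# same user appears as both sender and recipient.
union_rows = conn.execute("""
    SELECT id FROM messages WHERE sender = 1234567
    UNION
    SELECT id FROM messages WHERE recipient = 1234567
    ORDER BY id""").fetchall()
print(or_rows, union_rows)
```

Both forms return the same rows; the point of the rewrite is that on MySQL the UNION form lets each branch use a single-column index instead of falling back to a scan.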
PARTITION can handle only a few kinds of expressions, not including (x+y).
GROUP BY (sender + recipient)
cannot be optimized by any form of INDEX or PARTITION. It will involve a full scan and probably a filesort.
If you meant GROUP BY sender, recipient, that's another matter.
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
Does not benefit from partitioning, at least not compared to having an INDEX starting with used_at.
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
AND x = 1
Would benefit from INDEX(x, used_at).
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
AND x > 1
Is problematical -- two ranges. In this example, BY RANGE partitioning on either x or used_at would be beneficial. This is because "partition pruning" would first pick the desired partition(s), then ordinary indexing (if any) would take over to finish the task. (It is impossible to say what the optimal INDEX would be without further details on the table and the data distribution.)
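The pruning step described above can be modelled with a few lines of pure Python: rows are bucketed by a range key, a range predicate first selects the buckets that could match, and only those buckets are scanned. Everything here (bucket bounds, row data) is invented for illustration:

```python
import bisect

# Toy model of range partitioning: each upper bound acts like a MySQL
# "VALUES LESS THAN" boundary; rows go into the first bucket whose bound
# exceeds their key.
bounds = [10000, 20000, 30000, 40000]
partitions = [[] for _ in bounds]

def insert(row_id, key):
    partitions[bisect.bisect_right(bounds, key)].append((row_id, key))

for i, key in enumerate([5, 9999, 15000, 25000, 39999]):
    insert(i, key)

def scan_range(lo, hi):
    """Return rows with lo <= key < hi, scanning only pruned partitions."""
    first = bisect.bisect_right(bounds, lo)   # first bucket that can match
    last = bisect.bisect_left(bounds, hi)     # buckets past hi are skipped
    scanned = partitions[first:last + 1]
    return [r for part in scanned for r in part if lo <= r[1] < hi], len(scanned)

rows, parts_scanned = scan_range(12000, 28000)
print(rows, parts_scanned)
```

Only two of the four partitions are touched for this predicate, which is exactly the saving partition pruning buys before ordinary indexing takes over within each partition.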
I did not get your point on users plus messages.