SQL range partitioning using arithmetical elements (addition) on unsigned ints - will it optimise WHERE queries? (MySQL, PostgreSQL) - mysql

I read about range partitioning in MySQL (and PostgreSQL) here. I am also aware that if I partition my table, some WHERE queries will be optimised.
For example partitioning by used_at date:
PARTITION BY RANGE COLUMNS (used_at) (
PARTITION p0 VALUES LESS THAN ('2012-01-01'),
PARTITION p1 VALUES LESS THAN ('2013-01-01'),
PARTITION p2 VALUES LESS THAN ('2014-01-01')
);
Will make querying things like:
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
faster, for example, because the search practically only uses a subtable about 1/3 the size.
Well the question is if I have two tables:
user (3 000 000 records):
user_id UNSIGNED INT ...
...
messages (50 000 000 records)
sender UNSIGNED INT (refers to user)
recipient UNSIGNED INT (refers to user)
We get threads like:
WHERE ... (sender = 1234567 OR recipient = 1234567)
...
GROUP BY (sender + recipient)
Well, my question is:
a) Am I able to partition by
PARTITION BY RANGE (sender + recipient) (
PARTITION p0 VALUES LESS THAN (1000000),
PARTITION p1 VALUES LESS THAN (2000000),
...
PARTITION p5 VALUES LESS THAN (6000000)
);
?
b) If yes, will it optimise WHERE conditions like
WHERE ... (sender = 1234567 OR recipient = 1234567)
in case of unsigned ints?
The question is basically about MySQL but I am also curious about PostgreSQL and Oracle for the future.

MySQL...
WHERE ... (sender = 1234567 OR recipient = 1234567)
Does not optimize well. It would be better to do
( SELECT ... WHERE sender = 1234567 )
UNION DISTINCT
( SELECT ... WHERE recipient = 1234567 )
and have separate indexes on sender and recipient (or at least starting with each).
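As a rough illustration of the rewrite above, the following Python sketch uses the standard-library sqlite3 module as a stand-in for MySQL. That is an assumption made only to keep the example self-contained, and the query planner behaves differently. The id column, the index names and the sample rows are hypothetical. It only checks that the UNION form returns the same rows as the OR form, including a row that matches both branches.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory messages table with one index per column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, sender INTEGER, recipient INTEGER)"
    )
    conn.execute("CREATE INDEX idx_sender ON messages (sender)")
    conn.execute("CREATE INDEX idx_recipient ON messages (recipient)")
    # Hypothetical sample rows; row 4 matches both branches of the UNION.
    rows = [(1, 1234567, 42), (2, 42, 1234567), (3, 7, 8), (4, 1234567, 1234567)]
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    return conn

OR_QUERY = "SELECT id FROM messages WHERE sender = ? OR recipient = ?"
# In SQLite a plain UNION removes duplicates, like UNION DISTINCT in MySQL.
UNION_QUERY = (
    "SELECT id FROM messages WHERE sender = ? "
    "UNION "
    "SELECT id FROM messages WHERE recipient = ?"
)

def ids(conn, sql, user_id):
    """Run a two-placeholder query for one user and return the sorted ids."""
    return sorted(row[0] for row in conn.execute(sql, (user_id, user_id)))

conn = build_demo_db()
or_ids = ids(conn, OR_QUERY, 1234567)
union_ids = ids(conn, UNION_QUERY, 1234567)
```

Each branch of the UNION filters on a single column, which is what lets a single-column index serve it.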
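PARTITION BY accepts only a limited set of expressions. MySQL does permit +, - and * on integer columns, so (sender + recipient) can be declared, but it would not help: a test on sender or recipient alone says nothing about their sum, so no partitions can be pruned.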
GROUP BY (sender + recipient)
cannot be optimized by any form of INDEX or PARTITION. It will involve a full scan and probably a filesort.
If you meant GROUP BY sender, recipient, that's another matter.
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
Does not benefit from partitioning, at least not compared to having some INDEX starting with used_at.
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
AND x = 1
Would like INDEX(x, used_at)
WHERE used_at >= '2013-05-01' AND used_at < '2013-09-01'
AND x > 1
Is problematical -- two ranges. In this example BY RANGE partitioning on either x or used_at would be beneficial. This is because "partition pruning" would first pick the desired partition(s), then ordinary indexing (if any) would take over to finish the task. (It is impossible to say what the optimal INDEX would be without further details on the table and the data distribution.)
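The pruning step can be pictured with a small Python model. This is a simplified sketch, not how MySQL implements it: the function name pruned_partitions is hypothetical, partitions are assumed to be named p0, p1, ... in bound order, and dates are compared as ISO strings.

```python
from bisect import bisect_left, bisect_right

def pruned_partitions(upper_bounds, lo, hi):
    """Return the RANGE partitions that can hold values v with lo <= v < hi.

    upper_bounds[i] is the exclusive VALUES LESS THAN bound of partition p<i>,
    listed in ascending order.
    """
    if lo >= hi:
        return []
    first = bisect_right(upper_bounds, lo)   # count of bounds <= lo
    last = bisect_left(upper_bounds, hi)     # count of bounds < hi
    # Values at or beyond the last bound belong to no partition in this model.
    last = min(last, len(upper_bounds) - 1)
    return [f"p{i}" for i in range(first, last + 1)]

bounds = ["2012-01-01", "2013-01-01", "2014-01-01"]
touched = pruned_partitions(bounds, "2013-05-01", "2013-09-01")
```

Only the partitions returned here would then be searched, each with whatever index it has.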
I did not get your point on users plus messages.

Related

MySQL Partition By DATEDIFF

I have a table with the ReferenceDate field. I intend to partition it using this field as follows:
partition_0: Values more than 1 year old;
partition_1: Values older than 6 months;
partition_2: Values older than 3 months;
partition_3: Values for the last 3 months;
For this I tried the following script to change the table:
ALTER TABLE `MyTable`
PARTITION BY RANGE (DATEDIFF(NOW(), `ReferenceDate`))
(
PARTITION p0_historic_data VALUES LESS THAN (90),
PARTITION p1_intermediary_data VALUES LESS THAN (180),
PARTITION p2_intermediary_data VALUES LESS THAN (365),
PARTITION p3_current_data VALUES LESS THAN MAXVALUE
);
However, I believe that I cannot use the NOW() function in the partitioning clause. I was able to use TO_DAYS, but it does not give me the result I need: DATEDIFF gives the difference between the current date and ReferenceDate, whereas TO_DAYS returns the number of days from year 0 to the given date.
I would like to know if there is really no way to use DATEDIFF, or if there is any alternative in that sense.
A PARTITIONed table is one where some of the rows are permanently put in one 'sub-table' or another, based on the instructions in PARTITION BY ....
So, it is flatly not possible. To implement such, MySQL would have to move rows from one partition to another, even when you are not touching the table.
Even if it were possible, it might not provide any performance improvement. After all, you can have something like this:
WHERE ReferenceDate >= NOW() - INTERVAL 180 DAY
AND ReferenceDate < NOW() - INTERVAL 90 DAY
Then, if you also have
AND CustomerId = 123
then this index would be excellent for finding the desired rows:
INDEX(CustomerId, ReferenceDate)
That does not need PARTITIONing.
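The idea behind the WHERE clause above is that the moving part (the current date) is evaluated at query time and compared against the stored, unchanging ReferenceDate. A minimal Python sketch of the same arithmetic follows; the function name and the fixed "today" are hypothetical, and plain dates are used where NOW() would give a datetime.

```python
from datetime import date, timedelta

def reference_date_window(today, newest_age_days, oldest_age_days):
    """Return (start, end) so that start <= ReferenceDate < end selects rows
    whose age in days lies between newest_age_days and oldest_age_days."""
    start = today - timedelta(days=oldest_age_days)
    end = today - timedelta(days=newest_age_days)
    return start, end

# Hypothetical run date; in the query this role is played by NOW().
start, end = reference_date_window(date(2024, 1, 1), 90, 180)
```

The stored rows never move; only the two bounds change from one day to the next.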

Multiple queries in phpmyadmin - Distance using coordinates, Slope, Intercept, Angle, and few more

I have around 500 Excel sheets in .csv format with data captured for my experiment, with the following columns in place.
Now I need to calculate the following parameters using this data. I have done these in Excel, but repeating this for each of so many sheets is difficult, so I want to write an SQL query in phpMyAdmin, which will save some time.
Last character typed - need to capture the last character from the column 'CharSq'
Slope (in column J) =(B3-B2)/(A3-A2)
Intercept (in column K) =B2-(A2*(J3))
Angle (in degrees) =MOD(DEGREES(ATAN2((A3-A2),(B3-B2))), 360)
Index of Difficulty =LOG(((E1/7.1)+1),2)
Speed value length (if speed value length >3, then mark as 1 or else 0) =IF(LEN(D3) >= 3, "1","0")
Wrong sequence (if I3=I2, then mark search time, else actual time) =IF(I3=I2,"Search Time","Actual Time")
Mark character into (1,2,3) =IF(I2="A",1, IF(I2="B",2, IF(I2="C",3, 0)))
I have started with this SQL query
SELECT id, type, charSq, substr(charSq,-1,1) AS TypedChar, xCoordinate, yCoordinate, angle, distance, timestamp, speed FROM table 1 WHERE 1
Need help for the rest of the parameters. Thanks.
Note - I am going to run this in phpMyAdmin SQL
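For checking results against the spreadsheet, the formulas listed above can be written out in Python. This is a sketch under assumptions: columns A and B are taken to be the x and y coordinates and E the distance, and the function names are hypothetical. Note that Excel's ATAN2 takes (x, y) while Python's math.atan2 takes (y, x).

```python
import math

def segment_metrics(x_prev, y_prev, x, y, distance):
    """Slope, intercept, angle (degrees, 0 to 360) and index of difficulty
    for the move from the previous point to the current one."""
    dx, dy = x - x_prev, y - y_prev
    # =(B3-B2)/(A3-A2); undefined for a vertical move, so return None there.
    slope = dy / dx if dx != 0 else None
    # =B2-(A2*J3)
    intercept = None if slope is None else y_prev - x_prev * slope
    # =MOD(DEGREES(ATAN2(dx, dy)), 360); Python's % also yields a non-negative result.
    angle = math.degrees(math.atan2(dy, dx)) % 360
    # =LOG((E/7.1)+1, 2)
    index_of_difficulty = math.log2(distance / 7.1 + 1)
    return slope, intercept, angle, index_of_difficulty

def typed_char_code(char_sq):
    """Map the last character of CharSq to 1, 2 or 3 (A, B, C), else 0."""
    return {"A": 1, "B": 2, "C": 3}.get(char_sq[-1:], 0)

slope, intercept, angle, difficulty = segment_metrics(0, 0, 1, 1, 7.1)
```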
CREATE TABLE test.Table10
-- outer query: final column list, intercept, and angle normalised to 0..360
SELECT mm.myid, mm.id, mm.type1 AS GESTURE, mm.CHARSQ, mm.TYPE2 AS TYPEDCHAR,
  mm.MYCHAR, mm.XCOR, mm.YCOR, mm.SLOPE, l4 - (l2 * SLOPE) AS Intercept,
  IF(ANGLE1 < 0, ANGLE1 + 360, ANGLE1) AS ANGLE0,
  mm.DISTANCE, mm.DW, mm.INDDIFF, mm.TIME1, mm.SPEED, mm.SPDFILT, mm.TIMETYPE
FROM (
  -- middle query: slope, angle, index of difficulty, time type, character code
  SELECT c11.*,
    (YCOR - l4) / (XCOR - l2) AS SLOPE,
    MOD(DEGREES(ATAN2(YCOR - l4, XCOR - l2)), 360) AS ANGLE1,
    (YCOR - l4) / (XCOR - l2) ATT,
    LOG2(DW + 1) AS INDDIFF,
    IF(TYPE2 = LAG(TYPE2) OVER (PARTITION BY MYID ORDER BY ID), "Search Time", "Actual Time") AS TIMETYPE,
    CASE WHEN type2 = "A" THEN "1" WHEN type2 = "B" THEN 2 WHEN type2 = "C" THEN 3 ELSE 0 END AS MYCHAR
  FROM (
    -- inner query: previous-row coordinates (l2, l4), scaled distance, last typed character
    SELECT b.*,
      LEAD(XCOR) OVER (PARTITION BY charsq) l1,
      LAG(XCOR) OVER (PARTITION BY MYID ORDER BY ID) l2,
      LEAD(YCOR) OVER (PARTITION BY MYID) l3,
      LAG(YCOR) OVER (PARTITION BY MYID ORDER BY ID) l4,
      distance / 7.1 AS DW,
      IF(LENGTH(speed) >= 3, "1", "0") AS SPDFILT,
      RIGHT(charSq, 1) AS TYPE2
    FROM test.table2 b
  ) c11
) mm

Optimizing MySQL query with a composite index

I have a table which currently has about 80 million rows, created as follows:
create table records
(
id int auto_increment primary key,
created int not null,
status int default '0' not null
)
collate = utf8_unicode_ci;
create index created_and_status_idx
on records (created, status);
The created column contains unix timestamps and status can be an integer between -10 and 10. The records are evenly distributed regarding the created date, and around half of them are of status 0 or -10.
I have a cron that selects records that are between 32 and 8 days old, processes them and then deletes them, for certain statuses. The query is as follows:
SELECT
records.id
FROM records
WHERE
(records.status = 0 OR records.status = -10)
AND records.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
The query was fast when the records were at the beginning of the creation interval, but now that the cleanup reaches the records at the end of interval it takes about 10 seconds to run. Explaining the query says it uses the index, but it parses about 40 million records.
My question is if there is anything I can do to improve the performance of the query, and if so, how exactly.
Thank you.
I think union all is your best approach:
(SELECT r.id
FROM records r
WHERE r.status = 0 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
) UNION ALL
(SELECT r.id
FROM records r
WHERE r.status = -10 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
)
LIMIT 500;
This can use an index on records(status, created, id).
Note: use union if records.id could have duplicates.
You are also using LIMIT with no ORDER BY. That is generally discouraged.
Your index is in the wrong order. You should put the IN column (status) first (you phrased it as an OR), and put the 'range' column (created) last:
INDEX(status, created)
(Don't give me any guff about "cardinality"; we are not looking at individual columns.)
Are there really only 3 columns in the table? Do you need id? If not, get rid of it and change to
PRIMARY KEY(status, created)
Other techniques for walking through large tables efficiently.
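Why the column order matters can be seen with a toy model of a composite index as a sorted list of tuples. This is only a sketch: the function names and the synthetic data are hypothetical, and it ignores everything else the MySQL optimizer does. It counts how many index entries a scan has to look at for each ordering.

```python
from bisect import bisect_left, bisect_right

def examined_created_first(rows, lo, hi):
    """Entries touched with INDEX(created, status): one scan over the whole
    created range, whatever the status of each entry."""
    index = sorted((created, status) for status, created in rows)
    start = bisect_left(index, (lo,))      # first entry with created >= lo
    end = bisect_left(index, (hi + 1,))    # first entry with created > hi (integers)
    return end - start

def examined_status_first(rows, statuses, lo, hi):
    """Entries touched with INDEX(status, created): one narrow scan per status."""
    index = sorted(rows)                   # rows are (status, created) tuples
    total = 0
    for status in statuses:
        start = bisect_left(index, (status, lo))
        end = bisect_right(index, (status, hi))
        total += end - start
    return total

# Synthetic data: 100 created values, each with every status from -10 to 10.
rows = [(status, created) for created in range(100) for status in range(-10, 11)]
created_first = examined_created_first(rows, 20, 59)
status_first = examined_status_first(rows, (0, -10), 20, 59)
```

In this model the (created, status) order visits every entry in the date range, while the (status, created) order visits only entries that actually match.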

Index is not being considered in mysql

I have a problem deciding on an index. I have a table in MySQL, say X, with at least 70 million records. I need to query a few fields based on a filter like Year = X and Quarter = X and ( user = X or manager = X or ..)
I have an index on year and quarter, which is not considered. If it were considered, less than 10% of the data would be read.
I also have an index on year, quarter and all the user fields. Even then the index is not considered.
What am I doing wrong?

mysql query execution consuming time

I have seen several questions on SO and improved my SQL query based on them.
But it sometimes takes 12 seconds and sometimes 3 seconds to execute, so the minimum time we can get is 3 seconds. The query looks like this:
SELECT ANALYSIS.DEPARTMENT_ID
,SCORE.ID
,SCORE.KPI_ SCORE.R_SCORE
,SCORE.FACTOR_SCORE
,SCORE.FACTOR_SCORE
,SCORE.FACTOR_SCORE
,SCORE.CREATED_DATE
,SCORE.UPDATED_DATE
FROM SCORE_INDICATOR SCORE
,AG_SENTIMENT ANALYSIS
WHERE SCORE.TAG_ID = ANALYSIS.ID
AND ANALYSIS.ORGANIZATION_ID = 1
AND ANALYSIS.DEPARTMENT_ID IN (1,2,3,4,5)
AND DATE (ANALYSIS.REVIEW_DATE) BETWEEN DATE ('2016-05-02') AND DATE ('2017-05-02')
ORDER BY ANALYSIS.DEPARTMENT_ID
Now, one table, SCORE_INDICATOR, has 19345116 rows and the latter has 19057025 rows in total. I added an index on ORGANIZATION_ID and DEPARTMENT_ID, and another on the combination of ORGANIZATION_ID and DEPARTMENT_ID. Is there any other way to improve it, or is this the maximum I can achieve with this amount of data?
Here is a checklist:
1) Make sure the logs table (ANALYSIS) uses the MyISAM engine (it's fast for OLAP queries).
2) Make sure that you've indexed the ANALYSIS.REVIEW_DATE field.
3) Make sure that ANALYSIS.REVIEW_DATE is of type DATE (not CHAR, VARCHAR).
4) Change the query (rearrange the query plan):
SELECT
ANALYSIS.DEPARTMENT_ID
,SCORE.ID
,SCORE.KPI_ SCORE.R_SCORE
,SCORE.FACTOR_SCORE
,SCORE.FACTOR_SCORE
,SCORE.FACTOR_SCORE
,SCORE.CREATED_DATE
,SCORE.UPDATED_DATE
FROM SCORE_INDICATOR SCORE
,AG_SENTIMENT ANALYSIS
WHERE
SCORE.TAG_ID = ANALYSIS.ID
AND
ANALYSIS.REVIEW_DATE >= '2016-05-02' AND ANALYSIS.REVIEW_DATE < '2017-05-03'
AND
ANALYSIS.ORGANIZATION_ID = 1
AND
ANALYSIS.DEPARTMENT_ID IN (1,2,3,4,5)
ORDER BY ANALYSIS.DEPARTMENT_ID;
I have changed the order and style to JOIN syntax. The Score table seems to be the child to the primary criteria of the Analysis table. All your criteria are based on qualifying Analysis records. Now, the indexing. Wrapping a column in a DATE() function call does not help the optimizer, because it keeps an index on that column from being used for the range. So, to cover all possible date/time components, I have changed from BETWEEN to >= the first date and LESS THAN one day beyond the end. In your example, an upper bound of DATE('2017-05-02') is the same as LESS THAN '2017-05-03', which includes 2017-05-02 up to 23:59:59, and the index can be applied to the date better.
Now for the index. A compound index based on the fields for the filter, join and ORDER BY might help:
AG_SENTIMENT table... index ON(Organization_ID, Department_ID, Review_Date, ID)
SELECT
ANALYSIS.DEPARTMENT_ID,
SCORE.ID,
SCORE.KPI_ SCORE.R_SCORE,
SCORE.FACTOR_SCORE,
SCORE.FACTOR_SCORE,
SCORE.FACTOR_SCORE,
SCORE.CREATED_DATE,
SCORE.UPDATED_DATE
FROM
AG_SENTIMENT ANALYSIS
JOIN SCORE_INDICATOR SCORE
ON ANALYSIS.ID = SCORE.TAG_ID
where
ANALYSIS.ORGANIZATION_ID = 1
AND ANALYSIS.DEPARTMENT_ID IN (1,2,3,4,5)
AND ANALYSIS.REVIEW_DATE >= '2016-05-02'
AND ANALYSIS.REVIEW_DATE < '2017-05-03'
ORDER BY
ANALYSIS.DEPARTMENT_ID
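The half-open date range used in the query above can be produced mechanically from an inclusive first and last day. A small Python sketch, where the helper name half_open_bounds is hypothetical:

```python
from datetime import date, timedelta

def half_open_bounds(first_day, last_day):
    """Turn an inclusive [first_day, last_day] range of ISO dates into
    (start, end_exclusive) for use as col >= start AND col < end_exclusive."""
    start = date.fromisoformat(first_day)
    end_exclusive = date.fromisoformat(last_day) + timedelta(days=1)
    return start.isoformat(), end_exclusive.isoformat()

start, end_exclusive = half_open_bounds("2016-05-02", "2017-05-02")
```

Comparing the bare column against these two constants leaves REVIEW_DATE free of function calls, so an index that includes it can be used for the range.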