I have a database with four columns corresponding to the geographical coordinates x,y for the start and end position. The columns are:
x0
y0
x1
y1
I have an index for these four columns with the sequence x0, y0, x1, y1.
I have a list of about a hundred combinations of geographical pairs. How would I go about querying this data efficiently?
I would like to do something like this, as suggested in this SO answer, but it only works for Oracle databases, not MySQL:
SELECT * FROM my_table WHERE (x0, y0, x1, y1) IN ((4, 3, 5, 6), ... ,(9, 3, 2, 1));
I was thinking it might be possible to do something with the index? What would be the best approach (ie: fastest query)? Thanks for your help!
Notes:
I cannot change the schema of the database
I have about 100'000'000 rows
EDIT:
The code as-is was actually working, however it was extremely slow and did not take advantage of the index (as we have an older version of MySQL v5.6.27).
To make effective use of the index, we can rewrite the IN predicate. For example, this predicate:
(x0, y0, x1, y1) IN ((4, 3, 5, 6),(9, 3, 2, 1))
can be written as a series of OR'ed equality conditions:
( ( x0 = 4 AND y0 = 3 AND x1 = 5 AND y1 = 6 )
OR ( x0 = 9 AND y0 = 3 AND x1 = 2 AND y1 = 1 )
)
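With about a hundred tuples, the OR'ed form is easier to generate than to type. A minimal sketch in Python, assuming a DB-API driver that uses %s placeholders (as MySQL drivers commonly do) and trusted column names; build_tuple_predicate is a hypothetical helper name, not part of any library:

```python
def build_tuple_predicate(columns, rows):
    """Build '((c1 = %s AND c2 = %s) OR (...))' plus a flat parameter list."""
    if not rows:
        # An empty list of tuples should match nothing.
        return "1 = 0", []
    # One parenthesised group of equality tests per tuple.
    one_row = "(" + " AND ".join(f"{col} = %s" for col in columns) + ")"
    predicate = "(" + " OR ".join([one_row] * len(rows)) + ")"
    # Flatten the tuples in the same order as the placeholders.
    params = [value for row in rows for value in row]
    return predicate, params


predicate, params = build_tuple_predicate(
    ["x0", "y0", "x1", "y1"],
    [(4, 3, 5, 6), (9, 3, 2, 1)],
)
sql = "SELECT * FROM my_table WHERE " + predicate
```

The values travel as bound parameters, so only the column names are interpolated into the SQL text.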
EDIT
Newer versions of the MySQL optimizer fix the performance problem: they generate execution plans that make more effective use of available indexes.
The (a,b) IN ((7,43),(7,44),(8,1)) syntax has been supported in MySQL for many versions, but there were performance problems with it (at least with non-trivial sets) because of the suboptimal execution plan generated by the older optimizer.
Note a similar related problem with OR constructs. Here's an example query intended to get the "next page" of 20 rows ordered by columns seq and sub (unique tuple). The last row of the previously fetched page had (seq,sub) = (7,42).
With much older versions of MySQL, this syntax would not be accepted:
WHERE (seq,sub) > (7,42)
ORDER BY seq, sub
LIMIT 20
And when MySQL did support the syntax, we would get an execution plan as if we had written
WHERE ( seq > 7 )
OR ( seq = 7 AND sub > 42 )
ORDER BY seq, sub
LIMIT 20
We would get a much more efficient execution plan if we instead wrote something subtly different:
WHERE ( seq >= 7 )
AND ( seq > 7 OR sub > 42 )
ORDER BY seq, sub
LIMIT 20
With this form we'd expect the optimizer to use an available UNIQUE INDEX on (seq,sub) and return rows in index order from a range scan operation...
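The two WHERE clauses select the same rows for non-NULL values; only the shape the optimizer sees differs. A small brute-force check in Python (a sketch over an arbitrary grid of sample values, not a proof; the function names are made up for illustration):

```python
from itertools import product


def tuple_form(seq, sub):
    # Row comparison (seq, sub) > (7, 42); Python tuples compare the same way.
    return (seq, sub) > (7, 42)


def or_form(seq, sub):
    # seq > 7 OR (seq = 7 AND sub > 42)
    return seq > 7 or (seq == 7 and sub > 42)


def and_form(seq, sub):
    # seq >= 7 AND (seq > 7 OR sub > 42)
    return seq >= 7 and (seq > 7 or sub > 42)


# Compare all three forms on every (seq, sub) pair in a small grid around (7, 42).
all_agree = all(
    tuple_form(s, b) == or_form(s, b) == and_form(s, b)
    for s, b in product(range(0, 15), range(30, 55))
)
```

The AND form starts with a plain range condition on the leading column (seq >= 7), which is what lets an index on (seq,sub) be used for a range scan.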
I do not understand your point. The following query is valid MySQL syntax:
SELECT *
FROM my_table
WHERE (x0, y0, x1, y1) IN ((4, 3, 5, 6), ... ,(9, 3, 2, 1));
I would expect MySQL to use the composite index that you have described. But, if it doesn't you could do:
SELECT *
FROM my_table
WHERE x0 = 4 AND y0 = 3 AND x1 = 5 AND y1 = 6
UNION ALL
. . .
SELECT *
FROM my_table
WHERE x0 = 9 AND y0 = 3 AND x1 = 2 AND y1 = 1
The equality comparisons in the WHERE clause will take advantage of an index.
MySQL allows row constructor comparisons like you show, but the optimizer didn't know how to use an index to help performance until MySQL 5.7.
https://dev.mysql.com/doc/refman/5.7/en/row-constructor-optimization.html
You can concatenate the four values into a string and check them like that:
SELECT *
FROM my_table
WHERE CONCAT_WS(',', x0, y0, x1, y1) IN ('4,3,5,6', ..., '9,3,2,1');
The way you are doing it gives correct results in the MySQL version on my machine. I am using v5.5.55. Maybe you are using an older one. Please check that.
Only read on if you still want to solve this problem in your own version, or if the solution mentioned above doesn't work.
I am still not clear about the data types and ranges of your columns. So I am assuming that the data type is integer and the range is 0 to 9. If this is the case you can easily do it as shown below.
select * from s1 where 1000*x0+100*y0+10*x1+y1 in (4356,..., 9321);
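The encoding implied by the values 4356 and 9321 puts x0 in the thousands place, y0 in the hundreds, x1 in the tens and y1 in the units. A short Python sketch of that mapping (encode is a hypothetical name), which also shows why the 0 to 9 assumption matters:

```python
def encode(x0, y0, x1, y1):
    # Same arithmetic as 1000*x0 + 100*y0 + 10*x1 + y1 in SQL.
    return 1000 * x0 + 100 * y0 + 10 * x1 + y1


in_range = encode(4, 3, 5, 6)       # every value is a single digit
collision_a = encode(1, 0, 0, 0)    # x0 = 1
collision_b = encode(0, 10, 0, 0)   # y0 = 10, outside the assumed 0..9 range
```

Once any value leaves the single-digit range, two different tuples can produce the same number. As with the CONCAT_WS variant, the comparison is made on an expression over the columns, so it is unlikely to benefit from the composite index.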
I have a table with close to a billion records, and need to query it with HAVING. It's very slow (about 15 minutes on decent hardware). How to speed it up?
SELECT ((mean - 3.0E-4)/(stddev/sqrt(N))) as t, ttest.strategyid, mean, stddev, N,
kurtosis, strategies.strategyId
FROM ttest,strategies
WHERE ttest.strategyid=strategies.id AND dataset=3 AND patternclassid="1"
AND exitclassid="1" AND N>= 300 HAVING t>=1.8
I think the problem is t cannot be indexed because it needs to be computed. I cannot add it as a column because the '3.0E-4' will vary per query.
Table:
create table ttest (
strategyid bigint,
patternclassid integer not null,
exitclassid integer not null,
dataset integer not null,
N integer,
mean double,
stddev double,
skewness double,
kurtosis double,
primary key (strategyid, dataset)
);
create index ti3 on ttest (mean);
create index ti4 on ttest (dataset,patternclassid,exitclassid,N);
create table strategies (
id bigint ,
strategyId varchar(500),
primary key(id),
unique key(strategyId)
);
explain select.. :
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | ttest | NULL | range | PRIMARY,ti4 | ti4 | 17 | NULL | 1910344 | 100.00 | Using index condition; Using MRR |
| 1 | SIMPLE | strategies | NULL | eq_ref | PRIMARY | PRIMARY | 8 | Jellyfish_test.ttest.strategyid | 1 | 100.00 | Using where |
The query needs to be reformulated and an index needs to be added.
Plan A:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
s.strategyId
FROM ttest AS tt
JOIN strategies AS s ON tt.strategyid = s.id
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
and a 'composite' and 'covering' index on ttest. Replace your ti4 with this (to make it 'covering'):
INDEX(dataset, patternclassid, exitclassid, -- any order
N, strategyid) -- in this order
Plan B:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
( SELECT s.strategyId
FROM strategies AS s
WHERE s.id = tt.strategyid
) AS strategyId
FROM ttest AS tt
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
With the same index.
Unfortunately the expression for t needs to be repeated. Moving it from HAVING to WHERE avoids gathering unwanted rows only to end up throwing them away. Maybe the optimizer will do that automatically. Please provide EXPLAIN SELECT ... to see.
Also, it is unclear whether one of the two formulations will run faster than the other.
To be honest, I've never seen HAVING being used like this; for 20+ years I've assumed it can only be used in GROUP BY situations!
Anyway, IMHO you don't need it here, as Rick James points out, you can put it all in the WHERE.
Rewriting it a bit I end up with:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
For most of that we can indeed foresee a reasonable index. The problem remains with the last calculation:
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
However, before we go to that: how many rows are there if you ignore this 'formula'? 100? 200? If so, indexing as foreseen in Rick James' answer should be sufficient IMHO.
If it's 1000's or many more, then the question becomes: how many of those are thrown out by the formula? 1%? 50%? 99%? If it's on the low side then again, indexing as proposed by Rick James will do. If however you only need to keep a few, you may want to further optimize this and index accordingly.
From your explanation I understand that 3.0E-4 is variable so we can't include it in the index.. so we'll need to extract the parts we can:
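If my algebra isn't failing me (and assuming stddev > 0, so that multiplying both sides by stddev / sqrt(N) keeps the direction of the inequality), you can play with the formula like this: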
AND ((t.mean - 3.0E-4) / (t.stddev / sqrt(t.N))) >= 1.8
AND ((t.mean - 3.0E-4) ) >= 1.8 * (t.stddev / sqrt(t.N))
AND t.mean - 3.0E-4 >= (1.8 * (t.stddev / sqrt(t.N)))
AND - 3.0E-4 >= (1.8 * (t.stddev / sqrt(t.N))) - t.mean
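A quick numeric sanity check of the rearrangement, sketched in Python. The sample rows are made up for illustration, and the check assumes stddev > 0 and N > 0:

```python
import math


def original_form(mean, stddev, n, offset=3.0e-4, threshold=1.8):
    # (mean - offset) / (stddev / sqrt(N)) >= threshold
    return (mean - offset) / (stddev / math.sqrt(n)) >= threshold


def rearranged_form(mean, stddev, n, offset=3.0e-4, threshold=1.8):
    # threshold * (stddev / sqrt(N)) - mean <= -offset
    return threshold * (stddev / math.sqrt(n)) - mean <= -offset


# Hypothetical (mean, stddev, N) rows, chosen well away from the boundary.
samples = [
    (0.01, 0.05, 400),
    (0.001, 0.05, 400),
    (-0.002, 0.01, 900),
    (0.005, 0.02, 300),
]
results = [(original_form(*row), rearranged_form(*row)) for row in samples]
```

The left-hand side of the rearranged form depends only on the row and the 1.8 threshold, while the variable 3.0E-4 ends up alone on the right, which is what makes the left-hand side a candidate for indexing.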
So the query becomes:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND (1.8 * (t.stddev / sqrt(t.N))) - t.mean <= -3.0E-4
I'm not familiar with MySQL, but glancing at the documentation it should be possible to include 'generated columns' in an index. So, we'll do exactly that with (1.8 * (t.stddev / sqrt(t.N))) - t.mean.
Your indexed fields thus become:
dataset, patternclassid, exitclassid, N, ((1.8 * (stddev / sqrt(N))) - mean)
Note that the system will have to calculate this value for each and every row on insert (and possibly update) you do on the table. However, once there (and indexed) it should make the query quite a bit faster.
I have the following query:
SELECT
(sign(mr.p1_h2h_win_one_time - mr.p2_h2h_win_one_time)) AS h2h_win_one_time_1,
(abs(mr.p1_h2h_win_one_time - mr.p2_h2h_win_one_time) ^ 2) AS h2h_win_one_time_2
FROM belgarath.match_result AS mr
LIMIT 10
Which returns:
However, when I try to multiply the two fields:
SELECT
(
sign(mr.p1_h2h_win_one_time - mr.p2_h2h_win_one_time)
) *
(
abs(mr.p1_h2h_win_one_time - mr.p2_h2h_win_one_time) ^ 2
) AS h2h_win_one_time_comb
FROM belgarath.match_result AS mr
LIMIT 10
Workbench simply returns OK instead of any rows.
Doing some investigation I can get the first two rows to display if I use LIMIT 2. Looking at the returned values above I guess there must be some issue with multiplying the minus values or zero values from rows 3-10. However, this can be done simply on a calculator so what am I missing?
Maybe you think that the operator ^ is the power operator when in fact it is the Bitwise XOR operator.
MySQL has the function pow() for your case:
pow(abs(mr.p1_h2h_win_one_time - mr.p2_h2h_win_one_time), 2)
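Python happens to use ^ for bitwise XOR as well, so the difference is easy to see there. A small sketch (the function names are only for illustration):

```python
def xor_two(diff):
    # Bitwise XOR with 2, which is what abs(diff) ^ 2 computes.
    return abs(diff) ^ 2


def squared(diff):
    # Actual squaring, the counterpart of pow(abs(diff), 2).
    return abs(diff) ** 2


# (difference, XOR result, squared result) for a few sample differences
comparison = [(d, xor_two(d), squared(d)) for d in (-3, -1, 0, 1, 3)]
```

A likely reason the combined query returned no rows, though the thread does not confirm it: MySQL's bit operators produce unsigned 64-bit integers, and multiplying such a value by a negative sign() result can fail with an out-of-range error.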
I have some initial rows in a table. I would like to modify them with a recursive call. In my example code this function is a simple multiplication by two, and I would like to execute it 5 times:
WITH RECURSIVE cte (n,v) AS
(
-- initial values
SELECT 0,2
UNION ALL
SELECT 0,3
UNION ALL
-- generator
SELECT n + 1, v * 2 FROM cte WHERE n < 5
)
SELECT v FROM cte where n = 5;
It works, but my problem is that it only filters out the unneeded values at the end of the query. If I start with many more rows, it can degrade performance, because I have far more rows in memory than I should. Is it possible to keep only the newest values in each iteration?
SQLFiddle: http://sqlfiddle.com/#!5/9eecb7/6761
In SQLite you can use the OFFSET clause:
The OFFSET clause, if it is present and has a positive value N,
prevents the first N rows from being added to the recursive table. The
first N rows are still processed by the recursive-select — they just
are not added to the recursive table. Rows are not counted toward
fulfilling the LIMIT until all OFFSET rows have been skipped.
Demo: http://sqlfiddle.com/#!5/9eecb7/6804
WITH RECURSIVE cte (n,v) AS
(
-- initial values
SELECT 0,2
UNION ALL
SELECT 0,3
UNION ALL
-- generator
SELECT n + 1, v * 2 FROM cte WHERE n < 5 LIMIT 1000 OFFSET 10
)
SELECT * FROM cte
| n | v |
|---|----|
| 5 | 64 |
| 5 | 96 |
In the example above the offset is calculated as the number of initial rows in the initial select (2 rows) times the number of iterations (5) => 2*5=10
By the way, in this concrete example the better solution would be to simply calculate X * 2^5 (X multiplied by 2 to the power of 5) instead of using recursion.
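Both approaches can be compared with Python's built-in sqlite3 module. This is a sketch against an in-memory SQLite database, so it says nothing about MySQL:

```python
import sqlite3

QUERY = """
WITH RECURSIVE cte (n, v) AS (
    -- initial values
    SELECT 0, 2
    UNION ALL
    SELECT 0, 3
    UNION ALL
    -- generator; OFFSET 10 = 2 initial rows * 5 iterations
    SELECT n + 1, v * 2 FROM cte WHERE n < 5 LIMIT 1000 OFFSET 10
)
SELECT n, v FROM cte
"""

conn = sqlite3.connect(":memory:")
recursive_rows = conn.execute(QUERY).fetchall()
conn.close()

# Closed form: each starting value multiplied by 2 five times.
closed_form = [(5, v * 2 ** 5) for v in (2, 3)]
```

The OFFSET has to be recomputed whenever the number of initial rows or iterations changes, which is one more argument for the closed form when it is available.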
In SQLite, the CTE is implemented as a coroutine (as shown by the EXPLAIN output), so only the current row is kept in memory, and performance will not degrade due to memory usage.
MySQL does not allow LIMIT in the recursive SELECT part. If I interpret WL#3634 correctly, the implementation in version 8.0 always completely materializes recursive CTEs.
So in SQLite, you do not need to do anything, and in MySQL, you cannot do anything.
I am quite new to queries and I would like to know if there is an easier solution for the query I am working on.
For instance I want to get the data where x is 5,7,9,11,13,15 and 17.
I have a query like below;
SELECT * FROM abc WHERE x = 5 or x = 7 or x = 9 or x = 11 or x = 13 or x = 15 or x = 17;
Is it okay to use this query, or is there a simpler and more efficient solution?
EDIT
Does it affect the performance when I use x=[5,7,8,11,13,15,17] vs x=[5,11,7,15,8,17,13]?
X is the ID of another category for instance.
This is shorter but performs equally
SELECT * FROM abc WHERE x in (5,7,9,11,13,15,17)
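But remember that if one entry in the IN list is NULL and no other entry matches, the expression returns NULL rather than FALSE (a match still returns TRUE). This matters mostly for NOT IN.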
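The effect of a NULL in the list can be observed with Python's built-in sqlite3 module. This sketch runs against SQLite rather than MySQL, on the assumption that both follow the standard three-valued logic here; NULL comes back as None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A matching entry still yields true (1), even with a NULL in the list.
match_found = conn.execute("SELECT 5 IN (5, 7, NULL)").fetchone()[0]
# No matching entry plus a NULL yields NULL, not false.
no_match = conn.execute("SELECT 6 IN (5, 7, NULL)").fetchone()[0]
# NOT IN with a NULL in the list is NULL too, so a WHERE clause drops the row.
not_in_with_null = conn.execute("SELECT 6 NOT IN (5, 7, NULL)").fetchone()[0]
conn.close()
```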
A simple quiz:
Probably many guys know this before,
In my app there is a query in which I'm using concat in the where condition like this,
v_book_id and v_genre_id are 2 variables in my procedure.
SELECT link_id
FROM link
WHERE concat(book_id,genre_id) = concat(v_book_id,v_genre_id);
Now, I know there is a catch/bug in this, which will occur only twice in your lifetime. Can you tell me what it is?
I found this out yesterday and thought I should make some noise about it for everyone else who does this.
Thanks.
Let's have a look
WHERE concat(book_id,genre_id) = concat(v_book_id,v_genre_id);
as opposed to
WHERE book_id = v_book_id AND genre_id = v_genre_id;
There. The second solution is
faster (optimal index usage)
easier to write (less code)
easier to read (what on earth was the author thinking to concatenate numbers???)
more correct (as Alnitak also stated in the question's comments). Check out this sample data:
book_id | genre_id
1 | 12
11 | 2
Now add (or concat) v_book_id = 1 and v_genre_id = 12 and see how you'll get funny results with your concat() query
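The collision can be reproduced outside the database. A minimal Python sketch using the two sample rows above (concat_key is a hypothetical stand-in for concat(book_id, genre_id)):

```python
def concat_key(book_id, genre_id):
    # Mirrors concat(book_id, genre_id): both numbers become text and are joined.
    return f"{book_id}{genre_id}"


rows = [(1, 12), (11, 2)]

# Looking for book 1 / genre 12 with the concatenated key...
wanted = concat_key(1, 12)
concat_matches = [row for row in rows if concat_key(*row) == wanted]

# ...versus comparing the two columns separately.
tuple_matches = [row for row in rows if row == (1, 12)]
```

Both rows concatenate to '112', so the concat() comparison returns the row for book 11 / genre 2 as well.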
Note, some databases (including MySQL) allow operations on tuples, which may be what the clever author of the above really intended to do:
WHERE (book_id, genre_id) = (v_book_id, v_genre_id);
A working example of such a tuple predicate:
SELECT * FROM (
SELECT 1 x, 2 y FROM DUAL UNION ALL
SELECT 1 x, 3 y FROM DUAL UNION ALL
SELECT 1 x, 2 y FROM DUAL
) a
WHERE (x, y) = (1, 2)
Note, some databases will need extra parentheses around the right-hand side tuple : ((1, 2))