Efficiently joining over interval ranges in SQL - mysql

Suppose I have two tables as follows (data taken from this SO post):
Table d1:
x start end
a 1 3
b 5 11
c 19 22
d 30 39
e 7 25
Table d2:
x pos
a 2
a 3
b 3
b 12
c 20
d 52
e 10
The first row in each table holds the column headers. I'd like to extract all the rows in d2 where column x matches d1 and pos falls within (including boundary values) d1's start and end columns. That is, I'd like the result:
x pos start end
a 2 1 3
a 3 1 3
c 20 19 22
e 10 7 25
The way I've seen this done so far is:
SELECT * FROM d1 JOIN d2 USING (x) WHERE pos BETWEEN start AND end
But what is not clear to me is whether this operation is performed as efficiently as it can be (i.e., optimised internally). For example, computing the entire join first is not really a scalable approach, IMHO (in terms of both speed and memory).
Are there any other efficient query optimisations (e.g., using interval trees) or other algorithms that can handle ranges efficiently (again, in terms of both speed and memory) in SQL that I can make use of? It doesn't matter whether it's SQLite, PostgreSQL, MySQL, etc.
What is the most efficient way to perform this operation in SQL?
Thank you very much.

Not sure how it all works out internally, but depending on the situation I would advise playing around with a table that 'rolls out' all the values from d1 and then joining on that one. This way the query engine can pinpoint the right record 'exactly' instead of having to find a combination of boundaries that match the value being looked for.
e.g.
x value
a 1
a 2
a 3
b 5
b 6
b 7
b 8
b 9
b 10
b 11
c 19 etc..
Given an index on the value column (**), this should be quite a bit faster than joining with BETWEEN start AND end on the original d1 table, IMHO.
Of course, each time you make changes to d1, you'll need to adjust the rolled-out table too (trigger?). If this happens frequently, you'll spend more time updating the rolled-out table than you gained in the first place! Additionally, this might take quite a bit of (disk) space quickly if some of the intervals are really big; and it also assumes we don't need to look for non-whole numbers (e.g., what if we look for the value 3.14?).
(** You might consider experimenting with a unique index on (value, x) here...)
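For concreteness, a minimal sketch of that approach, assuming MySQL; the d1_expanded table and its layout are hypothetical:

-- Rolled-out lookup table: one row per integer value covered by each d1 interval,
-- carrying start/end along so the result needs no second lookup against d1.
CREATE TABLE d1_expanded (
  x       VARCHAR(10) NOT NULL,
  value   INT         NOT NULL,
  `start` INT         NOT NULL,
  `end`   INT         NOT NULL,
  PRIMARY KEY (value, x)   -- the unique (value, x) index suggested above
);

-- Populate it from d1 (batch job or trigger); the range predicate then
-- collapses to a plain equality join:
SELECT e.x, d2.pos, e.`start`, e.`end`
FROM d2
JOIN d1_expanded e ON e.x = d2.x AND e.value = d2.pos;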

Related

Mysql put record between two records order

Here are some records; we want to move id #1 between #3 and #4:
id title sort
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
Method one:
take #3's sort number, add 1 to it, and update #1's sort with that, so we have
id title sort
1 a 4
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
and then add 1 to #4's sort and to every record after it,
and we have
id title sort
1 a 4
2 b 2
3 c 3
4 d 5
5 e 6
6 f 7
and after sort
id title sort
2 b 2
3 c 3
1 a 4
4 d 5
5 e 6
6 f 7
It works fine, but imagine we have 2,000,000 records and all of them must be updated...
Method two:
take the sum of #3's and #4's sort values and divide by 2 => (3+4)/2 = 3.5
and just put it for #1 sort
id title sort
2 b 2
3 c 3
1 a 3.5
4 d 4
5 e 5
6 f 6
It works fine too, but imagine thousands of these operations producing long floats like 3.99999999999; after a while it's horrible.
Is there any MySQL/MariaDB trick or method for doing this?
Your "drop it half-way between items" method may be the best.
Let's go with BIGINT UNSIGNED since it gives you 64 bits in 8 bytes. Less good: DOUBLE would give you 53 bits in 8 bytes, and some funny business with exponents. DECIMAL gives you more bits at a cost of more bytes, while not eliminating the need for the following code.
You know which row to put it "after" based on user input?
Discover the row after it by using ORDER BY ... ASC LIMIT 1.
Average the two values; check whether the average is equal to either of them -- if so, you have a bad case.
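A minimal sketch of those steps in MySQL; the table name items, its columns, and the @prev_sort / @moved_id variables are all assumptions standing in for application-supplied values.

START TRANSACTION;

-- Lock and fetch the sort value of the row that currently follows the anchor row.
SELECT sort INTO @next_sort
FROM items
WHERE sort > @prev_sort
ORDER BY sort ASC
LIMIT 1
FOR UPDATE;

-- Integer midpoint between the two neighbours.
SET @new_sort = (@prev_sort + @next_sort) DIV 2;

-- If @new_sort equals either neighbour, there is no gap left: fall back to the
-- "spread out ~20 surrounding rows" repair described below. Otherwise:
UPDATE items SET sort = @new_sort WHERE id = @moved_id;

COMMIT;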
Digression... 2M rows. Start with 2K, 4K, 6K, etc. as the sort values (2M * 2K = 4G, which fits comfortably in BIGINT UNSIGNED).
This says you can squeeze 2K items between any adjacent pair. However, in the worst case of repeatedly inserting exactly after the first value, you get only 11 inserts before hitting the wall: 11 ~= log2(2000). That is, the re-sort may be quick, but up to 1 time in 11 it will be costly.
(Please don't quibble between 2K meaning 2000 vs 2048; it does not matter to the algorithm.)
So, what to do when there is no room to insert a new sort value? Rebuilding the numbers would lock the table (of 2M rows) for "too long", so let's try to avoid that.
How about this:
Grab the 10 rows before and after (2 SELECTs with ORDER BY and LIMIT). Fix those sort values so that they are evenly spread out.
There is probably no issue with hitting the start or end of the table; it would just mean fewer than 20 rows. And there are implicit boundaries at 0 and 4G-1.
If the 20 rows are not enough, then broaden the span.
Do all this (including the original, simple, half-way code) in a transaction.
Use FOR UPDATE on all(?) SELECTs so that other threads are blocked.
Check for deadlocks. If encountered, start over completely. (The second try will probably find that the half-way attempt works fine -- because some other thread finished spreading the sort values out.)
Timing:
The half-way case, even with transaction, will probably take a millisecond or so.
The more complex case won't take much longer, in spite of locking and updating 20 rows.
You could probably handle 1K actions per second.

Speeding up an SQL query selection from a large table of integers

I have a table numbers that looks like this
id (int) | start (int u) | end (int u)
1 50 100
2 250 396
3 900 1000
It has about 400k rows and the data in it never changes.
The ranges do not overlap.
I am running a query like this against it:
SELECT id FROM numbers WHERE *somenumber* BETWEEN start AND end LIMIT 1
The query takes about .3s to finish, which is an eternity, so I tried to come up with some solutions to make it faster.
The only thing I came up with was slapping some indexes on the start and end columns, but doing so actually made it SLOWER: the same query now amazingly takes .9s to finish with INDEXES present on the two columns.
So, how can I make this query faster if at all possible?
First try an index on numbers(start).
If that doesn't help (the BETWEEN can impede index use), then, assuming the ranges don't overlap, try this:
SELECT id
FROM numbers
WHERE *somenumber* >= start
ORDER BY start DESC
LIMIT 1;
If the ranges do overlap, then you have a bigger issue. I would recommend creating a new table with non-overlapping ranges.
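For the non-overlapping case, a concrete sketch of the suggestion above (MySQL syntax assumed; 723 stands in for *somenumber*, and the index name is made up):

CREATE INDEX idx_numbers_start ON numbers (start);

-- With non-overlapping ranges, the only candidate is the row with the
-- largest start <= the probe value; if the probe can fall in a gap between
-- ranges, re-check the end column on the returned row.
SELECT id
FROM numbers
WHERE start <= 723
ORDER BY start DESC
LIMIT 1;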
Creating a multi-column index on (start, end) will speed up the process for your use case.
REVISED...
After re-thinking, it can be simplified even further.
Let's extrapolate on your sample data, even with the condition that the id numbers are not in exact sequential order:
id (int) | start (int u) | end (int u)
1 50 100
2 250 396
3 900 1000
4 101 175
5 418 724
6 397 417
7 176 249
Say you are looking for number 723 (now in record #5).
SELECT N.*
FROM numbers N
WHERE N.start <= 723
AND N.End >= 723
AND N.start < 723
The BETWEEN is the same as the explicit >= and <=, but by also adding that the start MUST be less than the number you want, you eliminate all the higher ranges from consideration. It forces the list to the lowest qualifier.

KDB: apply dyadic function across two lists

Consider a function F[x;y] that generates a table. I also have two lists; xList:[x1;x2;x3] and yList:[y1;y2;y3]. What is the best way to do a simple comma join of F[x1;y1],F[x1;y2],F[x1;y3],F[x2;y1],..., thereby producing one large table?
You have asked for the cross product of your argument lists, so the correct answer is
raze F ./: xList cross yList
Depending on what you are doing, you might want your function to operate on the entire list of x and the entire list of y and return a single table, rather than operating on each pair and returning a list of tables that then has to be razed. The performance impact can be considerable; for example, see below.
q)g:{x?y} //your core operation
q)//this takes each pair of x,y, performs an operation and returns a table for each
q)//which must then be flattened with raze
q)fm:{flip `x`y`res!(x;y; enlist g[x;y])}
q)//this takes all x, y at once and returns one table
q)f:{flip `x`y`res!(x;y;g'[x;y])}
q)//let's set a seed to compare answers
q)\S 1
q)\ts do[10000;rm:raze fm'[x;y]]
76 2400j
q)\S 1
q)\ts do[10000;r:f[x;y]]
22 2176j
q)rm~r
1b
Set up our example:
q)f:{([] total:enlist x+y; x:enlist x; y:enlist y)}
q)x:1 2 3
q)y:4 5 6
Demonstrate F[x1;y1]
q)f[1;4]
total x y
---------
5 1 4
q)f[2;5]
total x y
---------
7 2 5
Use the multi-valent apply operator together with each' to apply to each pair of arguments.
q)raze .'[f;flip (x;y)]
total x y
---------
5 1 4
7 2 5
9 3 6
Another way to achieve it, using each-both:
x: 1 2 3
y: 4 5 6
f:{x+y}
f2:{ a:flip x cross y ; f'[a 0;a 1] }
f2[x;y]
5 6 7 6 7 8 7 8 9

how to push data down a row in sql results

I would like help with SQL query code to push the data in a specific column down by one row.
For example in a random table like the following,
x column y column
6 6
9 4
89 30
34 15
the results should be "pushed" down a row, meaning
x column y column
6 null or 0 (preferably)
9 6
89 4
34 30
SQL tables have no inherent concept of ordering. Hence, the concept of "next row" does not make sense.
Your example has no column that specifies the order for the rows. There is no definition of next. So, what you want to do cannot be done.
I am not aware of a simple way to do this with the way you are showing the table being formatted. If you added two consecutively numbered integer fields that provide row number and row number + 1 values, you could join the table to itself and get that information, as in the sketch below.
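A minimal sketch of that self-join, assuming a hypothetical ordering column rn (1, 2, 3, ...) added to a table t(x, y):

SELECT cur.x,
       COALESCE(prev.y, 0) AS y   -- 0 (or NULL) for the first row, which has no predecessor
FROM t AS cur
LEFT JOIN t AS prev ON prev.rn = cur.rn - 1
ORDER BY cur.rn;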
After taking a backup of your table:
Make a PHP function that will:
- load all values of y into an array
- set y = 0 (MySQL UPDATE)
- load the values back from the PHP array into MySQL

MySQL: Matching inexact values using "ON"

I'm way out of my league here...
I have a mapping table (table1) to assign particular values (value) to a whole number (map_nu). My second table (table2), is a collection of averages (avg) for each user (user_id).
(I couldn't figure out how to properly make a markdown table, please feel free to edit!)
table1:
(value) (Map_nu)
1       1
1.045   2
1.09    3
1.135   4
1.18    5
1.225   6
1.27    7
1.315   8
1.36    9
1.405   10

table2:
(user_id) (avg)
1         1.111
2         1.2
3         1.33333
4         1
5         1.389
6         1.42
7         1.07
The value Map_nu is a special number that each user gets assigned according to their average. I need to find a way to match the averages from table2 to the closest value in table1. I only need to match to the second digit past the decimal, so I've added the TRUNCATE function:
SELECT table2.user_id, map_nu
FROM `table1`
JOIN table2 ON TRUNCATE(table1.value,2)=TRUNCATE(table2.avg,2)
I still miss the values that don't match the averages exactly. Is there a way to pick the nearest truncated value, or even to round to the second decimal? Rounding up/down won't matter as long as it's applied to all values the same.
I am trying to have the following result (if rounded up):
(user_id)(Map_nu)
----
1 4
2 6
3 6
4 1
5 10
6 11
7 3
Thanks!
I think you might have to do this in 2 separate queries. There is no 'nearest' operator in SQL, so you can either calculate it in your software, or you could use
select map_nu from table1 ORDER BY abs(value - $avg) LIMIT 1
inside a loop. However, that cannot be used as a join condition, because it requires the ORDER BY and LIMIT, which are not valid in joins.
Another way of looking at it: your map_nu and value seem to be deterministic in relation to each other - value = 1 + ((map_nu - 1) * 0.045) - so maybe you could make use of that fact and calculate the integer directly from that equation, assuming the relationship holds true for all values of map_nu.
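A minimal sketch of that calculation, assuming the linear relationship above really does hold for every row of table1 (column names are taken from the question):

SELECT t2.user_id,
       ROUND((t2.avg - 1) / 0.045) + 1 AS map_nu  -- nearest step; use CEILING() instead of ROUND() to always round up
FROM table2 AS t2;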
This is an awkward database design. What is the data representing and what are you trying to solve? There might be a better way.
Maybe do something like...
SELECT a.user_id, b.map_nu, abs(a.avg - b.value)
FROM
table2 a
join table1 b
left join table1 c on abs(a.avg - b.value) > abs(a.avg - c.value)
where c.value is null
order by a.user_id
It doesn't actually produce the same output as the one you were expecting (it doesn't do any rounding), though you should be able to tweak it from there. The above query will produce the output below (with the data you've provided):
user_id map_nu abs(a.avg - b.value)
------- ------ --------------------
1 3 0.0209999999999999
2 5 0.02
3 8 0.01833
4 1 0
5 10 0.016
6 10 0.0149999999999999
7 3 0.02
Beware, though, if you're dealing with large tables. Evaluate the EXPLAIN of the above query to decide whether it is practical to run within MySQL or better done outside it.
Note 2: It will produce duplicate rows if there are avg values that are equidistant from two value entries within table1 (e.g., if the values for map_nu 11 and 12 were 2 and 3 and someone got an avg of 2.5). Your question doesn't really specify what to do in that case, so you might want to take it into account.
It's taking a little extra work, but I figure the easiest way to get my results will be to map all values to the second decimal place in table1:
1 1
1.01 1
1.02 1
1.03 1
1.04 1
1.05 2
1.06 2
1.07 2
1.08 2
1.09 3
1.1 3
1.11 3
1.12 3
1.13 3
1.14 4
...
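With that expanded mapping in place (whether added to table1 itself or kept as a separate lookup table; table1_expanded is a hypothetical name, and value should be stored as DECIMAL so the equality comparison is exact), the lookup becomes a plain join on the truncated average:

SELECT t2.user_id, m.map_nu
FROM table2 AS t2
JOIN table1_expanded AS m ON m.value = TRUNCATE(t2.avg, 2);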
Thanks for the suggestions! Sorry I couldn't present the question more clearly.