I have a table numbers that looks like this:
id (int) | start (int unsigned) | end (int unsigned)
1        | 50                   | 100
2        | 250                  | 396
3        | 900                  | 1000
It has about 400k rows and the data in it never changes.
The ranges do not overlap.
I am running a query like this against it:
SELECT id FROM numbers WHERE *somenumber* BETWEEN start AND end LIMIT 1
The query takes about 0.3s to finish, which is an eternity, so I tried to come up with some ways to make it faster.
The only thing I came up with was slapping some indexes on the start and end columns, but doing so actually made it SLOWER: the same query now, amazingly, takes 0.9s to finish with indexes present on the two columns.
So, how can I make this query faster if at all possible?
First try an index on numbers(start).
If that doesn't help (BETWEEN can impede the use of an index), then, assuming the ranges don't overlap, try this:
SELECT id
FROM numbers
WHERE *somenumber* >= start
ORDER BY start DESC
LIMIT 1;
If the ranges do overlap, then you have a bigger issue. I would recommend creating a new table with non-overlapping ranges.
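For reference, a minimal sketch of the single-column index suggested above (the index name is an assumption):
CREATE INDEX idx_numbers_start ON numbers (start);
With that index in place, the ORDER BY start DESC LIMIT 1 query above can scan the index backwards from *somenumber* and stop at the first matching row instead of scanning the whole table.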
Creating a multi-column index on the start and end columns will speed up the process for your use case.
REVISED...
After rethinking, it can be simplified even further.
Let's extend your sample data to a case where the id numbers are not in exact sequential order:
id (int) | start (int unsigned) | end (int unsigned)
1        | 50                   | 100
2        | 250                  | 396
3        | 900                  | 1000
4        | 101                  | 175
5        | 418                  | 724
6        | 397                  | 417
7        | 176                  | 249
Say you are looking for number 723 (now in record #5).
SELECT N.*
FROM numbers N
WHERE N.start <= 723
AND N.end >= 723
AND N.start < 723
The BETWEEN is the same as the explicit >= and <=, but by also requiring that start be strictly less than the number you want, you eliminate all the higher ranges from consideration; it forces the search down to the lowest qualifying range.
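A minimal sketch of the multi-column index mentioned above, assuming the table and column names from the question (the index name is an assumption):
CREATE INDEX idx_numbers_start_end ON numbers (start, end);
With start and end indexed together, both conditions in the WHERE clause above can be checked against the index entries rather than the full table rows.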
Here are my records; we want to move id #1 between #3 and #4:
id title sort
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
method one:
take #3's sort number, add 1 to it, and update #1's sort with that, so we have
id title sort
1 a 4
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
and then add 1 to #4's sort and to every record after it,
and we have
id title sort
1 a 4
2 b 2
3 c 3
4 d 5
5 e 6
6 f 7
and after sorting by sort:
id title sort
2 b 2
3 c 3
1 a 4
4 d 5
5 e 6
6 f 7
It works fine, but imagine we have 2,000,000 records and all of them must be updated...
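For illustration, method one written as SQL (a sketch only; the table name items is an assumption):
-- take #3's sort value plus 1 as the target position
SET @new_sort = (SELECT sort FROM items WHERE id = 3) + 1;
-- shift #4 and every record after it up by one
UPDATE items SET sort = sort + 1 WHERE sort >= @new_sort;
-- drop #1 into the freed slot
UPDATE items SET sort = @new_sort WHERE id = 1;
The second UPDATE is the expensive part: with 2,000,000 records it can rewrite a large share of the table.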
method two:
take the sum of #3's and #4's sort values and divide by 2 => (3+4)/2 = 3.5
and just use that as #1's sort
id title sort
2 b 2
3 c 3
1 a 3.5
4 d 4
5 e 5
6 f 6
It works fine too, but imagine thousands of these operations: they produce long floats like 3.99999999999, and after a while it gets horrible.
Is there any MySQL/MariaDB trick or method for doing this?
Your "drop it half-way between items" method may be the best.
Let's go with BIGINT UNSIGNED since it gives you 64 bits in 8 bytes. Less good: DOUBLE would give you 53 bits in 8 bytes, and some funny business with exponents. DECIMAL gives you more bits at a cost of more bytes, while not eliminating the need for the following code.
You know which row to put it "after" based on user input?
Discover the row after by using ORDER BY ... ASC LIMIT 1.
Average the two values; check whether the average is equal to either of them -- if so, you have a bad case. (This half-way step is sketched in SQL after the steps below.)
Digression... 2M rows. Start with 2K, 4K, 6K, etc. as the sort values (2M*2K = 4G, the limit of INT UNSIGNED.)
This says you can squeeze 2K items between any adjacent pair. However, in the worst case of repeatedly inserting exactly after the first value, you get only 11 inserts before hitting the wall; 11 ~= log2(2000). That is, the re-sort may be quick, but up to 1 time in 11 it will be costly.
(Please don't quibble between 2K meaning 2000 vs 2048; it does not matter to the algorithm.)
So, what to do when there is no room to insert a new sort value? Rebuilding the numbers would lock the table (of 2M rows) for "too long", so let's try to avoid that.
How about this:
Grab the 10 rows before and after (2 SELECTs with ORDER BY and LIMIT). Fix those sort values so that they are evenly spread out.
Possibly no issue with hitting the start or end of the table, since fewer than 20 rows would be involved. And there are silent boundaries at 0 and 4G-1.
If the 20 rows are not enough, then broaden the span.
Do all this (including the original, simple, half-way code) in a transaction.
Use FOR UPDATE on all(?) SELECTs so that other threads are blocked.
Check for deadlocks. If encountered, start over completely. (The second try will probably find that the half-way attempt works fine -- because some other thread finished spreading the sort values out.)
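A minimal sketch of the simple half-way case inside a transaction, assuming a table items(id, title, sort BIGINT UNSIGNED) and that we are moving row id 1 to sit just after row id 3 (the names and ids come from the question's example; the exact statements are an assumption):
START TRANSACTION;
-- sort value of the row we are inserting after
SELECT sort INTO @prev_sort FROM items WHERE id = 3 FOR UPDATE;
-- sort value of the next row up
SELECT sort INTO @next_sort FROM items
WHERE sort > @prev_sort ORDER BY sort ASC LIMIT 1 FOR UPDATE;
-- half-way point, written to avoid overflow on the addition
SET @new_sort = @prev_sort + (@next_sort - @prev_sort) DIV 2;
-- bad case: @new_sort = @prev_sort means there is no room left, so the
-- surrounding sort values must be spread out first (application logic)
UPDATE items SET sort = @new_sort WHERE id = 1;
COMMIT;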
Timing:
The half-way case, even with transaction, will probably take a millisecond or so.
The more complex case won't take much longer, in spite of locking and updating 20 rows.
You could probably handle 1K actions per second.
I have a table "transactions" with a million records:
id trx secret_string (varchar(50)) secret_id (int(2))
1 80 52987624f7cb03c61d403b7c68502fb0 1
2 28 52987624f7cb03c61d403b7c68502fb0 1
3 55 8502fb052987624f61d403b7c67cb03c 2
4 61 52987624f7cb03c61d403b7c68502fb0 1
5 39 8502fb052987624f61d403b7c67cb03c 2
..
999997 27 8502fb052987624f61d403b7c67cb03c 2
999998 94 8502fb052987624f61d403b7c67cb03c 2
999999 40 52987624f7cb03c61d403b7c68502fb0 1
1000000 35 8502fb052987624f61d403b7c67cb03c 2
As you can notice, secret_string and secret_id will always match.
Let's say, I need to select records where secret_string = "52987624f7cb03c61d403b7c68502fb0".
Is it faster to do:
SELECT id FROM transactions WHERE secret_id = 1
Than:
SELECT id FROM transactions WHERE secret_string = "52987624f7cb03c61d403b7c68502fb0"
Or does it not matter? What about other operations such as SUM(trx), COUNT(trx), AVG(trx), etc.?
Column secret_id currently does not exist, but if it is faster to search records by it, I am planning to add it and populate it upon row insertion.
Thank you
I hope I make sense.
Int comparisons are faster than varchar comparisons, for the simple fact that ints take up much less space than varchars.
This holds true both for unindexed and indexed access. The fastest way to go is an indexed int column.
There is another reason to use an int, and that is to normalise the database. Instead of having the text '52987624f7cb03c61d403b7c68502fb0' stored thousands of times in the table, you should store its id and keep the secret string itself once in a separate table. It's the same deal for other operations such as SUM, COUNT and AVG.
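A minimal sketch of that normalisation; the table name secrets, the index names and the column definitions are assumptions:
-- look-up table holding each secret string exactly once
CREATE TABLE secrets (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    secret_string VARCHAR(50) NOT NULL,
    UNIQUE KEY uq_secret_string (secret_string)
);
-- transactions then reference the secret by its int id
ALTER TABLE transactions
    ADD COLUMN secret_id INT UNSIGNED NOT NULL,
    ADD INDEX idx_secret_id (secret_id);
-- queries filter on the indexed int instead of the varchar
SELECT id FROM transactions WHERE secret_id = 1;
SELECT SUM(trx), COUNT(trx), AVG(trx) FROM transactions WHERE secret_id = 1;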
As the others told you: selecting by an int is definitely faster than by a string. However, if you need to select by secret_string, note that all the given strings look like hex strings, so you could consider converting them to a more compact binary form, e.g. storing UNHEX('52987624f7cb03c61d403b7c68502fb0') in a BINARY(16) column instead of the varchar (a 32-digit hex value is 128 bits, so it does not fit in a BIGINT).
OK, I'm not sure if I can explain this right.
Let's say I have a table with three columns (id, price, maxcombo).
Maybe there are like 5 rows in this table with random numbers for price; id is just an incremental unique key.
maxcombo specifies that the price can be in a combination of up to whatever number it is.
If x was 3, I would need to find the combination of 1 to 3 rows that has the maximum sum.
So say the table had:
1 - 100 - 1
2 - 50 - 3
3 - 10 - 3
4 - 15 - 3
5 - 20 - 2
The correct answer would be just row id 1,
since 100 alone (and it can only be alone, based on its maxcombo number)
is greater than, say, 50 + 20 + 15, or 20 + 15, or 10 + 20, etc.
Does that make sense?
I mean, I could just calculate all the different combinations and see which has the largest value, but I would imagine that would take a very long time if the table were larger than 5 rows.
Was wondering if any math genius or super dev out there has some advice or a creative way to figure this out in a more efficient manner.
Thanks ahead of time!
I built this solution to achieve the desired query. However, it hasn't been tested in terms of efficiency.
Following the example of columns 1-3:
SELECT max(a+b+c) FROM sample_table WHERE a < 3;
EDIT:
Looking at:
The correct answer will be just row id 1
...I considered that maybe I misunderstood your question and you want the query to return just the row id. So I made this other one:
SELECT a FROM sum_combo WHERE a+b+c=(
SELECT max(a+b+c) FROM sum_combo WHERE a > 3
);
This would surely take too long on tables larger than just 5 rows.
Suppose I have two tables as follows (data taken from this SO post):
Table d1:
x start end
a 1 3
b 5 11
c 19 22
d 30 39
e 7 25
Table d2:
x pos
a 2
a 3
b 3
b 12
c 20
d 52
e 10
The first row in both tables is the column header. I'd like to extract all the rows in d2 where column x matches d1 and pos falls within (including boundary values) d1's start and end columns. That is, I'd like this result:
x pos start end
a 2 1 3
a 3 1 3
c 20 19 22
e 10 7 25
The way I've seen this done so far is:
SELECT * FROM d1 JOIN d2 USING (x) WHERE pos BETWEEN start AND end
But what is not clear to me is whether this operation is done as efficiently as it can be (i.e., optimised internally). For example, computing the entire join first is not really a scalable approach IMHO (in terms of both speed and memory).
Are there any other efficient query optimisations (e.g., using interval trees) or other algorithms that can handle ranges efficiently (again, in terms of both speed and memory) in SQL that I can make use of? It doesn't matter if it's SQLite, PostgreSQL, MySQL, etc.
What is the most efficient way to perform this operation in SQL?
Thank you very much.
Not sure how it all works out internally, but depending on the situation I would advise playing around with a table that 'rolls out' all the values from d1 and then joining on that one. This way the query engine can pinpoint the right record 'exactly' instead of having to find a combination of boundaries that match the value being looked for.
e.g.
x value
a 1
a 2
a 3
b 5
b 6
b 7
b 8
b 9
b 10
b 11
c 19 etc..
Given an index on the value column (**), this should be quite a bit faster than joining with the BETWEEN start AND end on the original d1 table, IMHO.
Of course, each time you make changes to d1 you'll need to adjust the rolled-out table too (a trigger?). If this happens frequently, you'll spend more time updating the rolled-out table than you gained in the first place! Additionally, this might take quite a bit of (disk) space quickly if some of the intervals are really big; and it also assumes we don't need to look for non-whole numbers (e.g. what if we look for the value 3.14?).
(**: You might consider experimenting with a unique index on (value, x) here.)
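A minimal sketch of such a rolled-out table, assuming integer positions; the table name d1_rolled and the helper table seq (a plain table of consecutive integers) are assumptions:
-- one row per (x, value) covered by a d1 interval
CREATE TABLE d1_rolled (
    x     VARCHAR(10) NOT NULL,
    value INT NOT NULL,
    KEY idx_value (value)   -- or a UNIQUE KEY on (value, x), as per (**)
);
INSERT INTO d1_rolled (x, value)
SELECT d1.x, seq.n
FROM d1
JOIN seq ON seq.n BETWEEN d1.start AND d1.end;
-- the range lookup becomes an exact equality join
SELECT d2.x, d2.pos
FROM d2
JOIN d1_rolled r ON r.x = d2.x AND r.value = d2.pos;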
I'm implementing an algorithm that returns the posts that are popular at the moment, given their likes and dislikes.
To do this, for each post I add up all its likes (1) and dislikes (-1) to get its score, but each like/dislike is weighted: the more recent, the heavier. For example, at the moment a user likes a post, that like weighs 1. After 1 day it weighs 0.95 (or -0.95 if it's a dislike), after 2 days 0.90, and so on, with a minimum of 0.01 reached after 21 days. (PS: these are approximate values.)
Here is how my tables are made:
Posts table
id | Title | user_id | ...
-------------------------------------------
1 | Random post | 10 | ...
2 | Another post | 36 | ...
n | ... | n | ...
Likes table
id | vote | post_id | user_id | created
----------------------------------------
1 | 1 | 2 | 10 | 2014-08-18 15:34:20
2 | -1 | 1 | 24 | 2014-08-15 18:54:12
3 | 1 | 2 | 54 | 2014-08-17 21:12:48
Here is the SQL query I'm currently using, which does the job:
SELECT Post.*, `Like`.*,
SUM(`Like`.vote *
(1 - IF((TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21 > 0.99, 0.99, (TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21))
) AS score
FROM posts Post
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id)
GROUP BY Post.id
ORDER BY score DESC
PS: I'm using TIMESTAMPDIFF with MINUTE and not DAY directly because I'm calculating the days myself; otherwise it returns an integer, and I want a float value in order to decay gradually over time and not day by day. So TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 just gives me the number of days passed since the like's creation, with the decimal part.
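For example (a standalone check with made-up timestamps):
SELECT TIMESTAMPDIFF(MINUTE, '2014-08-17 12:00:00', '2014-08-18 00:00:00') / 60 / 24 AS days;
-- 720 minutes / 60 / 24 = 0.5 days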
Here are my questions :
Look at the IF(expr1, expr2, expr3) part: it is necessary in order to set a minimum value for the like's weight, so it will not go below 0.01 or become negative (so that even an old like still carries a little weight). But I'm calculating the same thing twice: the expression tested in expr1 is repeated as expr3. Isn't there a way to avoid this duplicated expression? (A possible workaround is sketched after these questions.)
I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on a big Post and Like table. Is the cache really necessary or not ? I'm aiming to run this query on a table with 50 000 entries, and for each 200 associated likes (that makes a 10 000 000 entries Like table).
Should I create an index on the Like table for post_id? And for created?
Thank you !
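A note on question 1 (a sketch, not taken from the answer below): MySQL's LEAST() function can cap the weight directly and avoid repeating the TIMESTAMPDIFF expression, along these lines:
SELECT Post.*,
SUM(`Like`.vote *
(1 - LEAST(0.99, (TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21))
) AS score
FROM posts Post
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id)
GROUP BY Post.id
ORDER BY score DESC;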
EDIT: Imagine a Post can have multiple tags, and each tag can belong to multiple posts. If I want to get popular Posts given a Tag or multiple Tags, I can't cache each query, as there is a large number of possible queries. Is the query still viable in that case?
EDIT FOR FINAL SOLUTION: I finally did some tests. I created a Post table with 30,000 entries and a Like table with 250,000 entries.
Without indexes, the query was incredibly long (timed out after more than 10 minutes), but with indexes on Post.id (primary), Like.id (primary) and Like.post_id it took ~0.5s.
So I'm not caching the data, nor updating it every 5 minutes. If the table keeps growing, this is still a possible solution (over 1s it would not be acceptable).
2: I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on a big Post and Like table. Is the cache really necessary or not ? I'm aiming to run this query on a table with 50 000 entries, and for each 200 associated likes (that makes a 10 000 000 entries Like table).
50,000 and 10,000,000 rows are considered small on current hardware. With those table sizes you probably won't need any cache, unless the query will run several times per second.
Anyway, I would do a performance test before deciding to have a cache.
3: Should I create an index on the Like table for post_id? And for created?
I would create an index for (post_id, created, vote). That way the query can get all information from the index and doesn't need to read the table at all.
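A minimal sketch of that index, assuming the likes table from the question (the index name is an assumption):
ALTER TABLE likes ADD INDEX idx_post_created_vote (post_id, created, vote);
Since post_id, created and vote are all present in the index, the SUM can be computed from the index entries alone (a covering index).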
Edit (response to comments):
An extra index will slow down inserts/updates slightly. In the end, the path you choose will dictate the characteristics of what you need in terms of CPU/RAM/Disk I/O.
If you have enough RAM for the DB so that you expect the entire Like table to be cached in RAM then you might be better off with an index on just post_id.
In terms of total load you need to consider the ratio between insert and select and the relative cost of insert and select with or without the index.
My gut feeling is that the total load will be lower with the index.
Regarding your question on concurrency (selecting and inserting simultaneously): what happens depends on the isolation level. The general advice is to keep inserts/updates as short as possible. If you don't do unnecessary things between the start of the insert and the commit, you should be fine.