I need to compare very long bit sequences (up to 1000 bits) with each other. Currently I've got a table with 16 columns (16*64 = 1024):
userid  block_1  block_2  block_3  ...
1       1001...  1100...  0010...
2       1101...  1011...  0111...
3       1011...  0111...  1100...
My statement:
select sh.userid,
       sum(bit_count(se.block_1 & sh.block_1) + bit_count(se.block_2 & sh.block_2) + ...)
from my_table se, my_table sh
where se.userid = 1
  and se.userid != sh.userid
group by sh.userid
With about 5 million entries, the query time is ~1.5 seconds, which is already pretty good, I guess. Most of the time is spent on the bit_count() and & operations, so I'm wondering whether there's room for improvement.
Is there a better way to compare very long binary sequences?
EDIT
I expect se.block_X to contain many more 0s than sh.block_X; does it make a difference if I do
se.block_X & sh.block_X
or
sh.block_X & se.block_X
? I would expect the first one to be faster.
The 1s and 0s represent yes and no for a lot of categories, and I'm only interested in the cases where BOTH entries said yes. So
10101011100101011
&
10010101101001010
=================
10000001100001010
and now I know where both of them said yes. For my use case I have to compare, e.g., my answers with ~5 million entries of other users' answers, and there are about 1000 questions. So I've got a table with 5 million entries, each with 16 x 64-bit columns, and I basically have to compare one entry against all the others.
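As a quick sanity check of that masking idea, a minimal sketch using MySQL bit-value literals (the constants are just the two example sequences above):

select bit_count(b'10101011100101011' & b'10010101101001010') as shared_yes;
-- returns 5: the number of positions where both sequences have a 1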
Related
I would really appreciate some help with a MySQL query for the following matter.
My table, let's call it "data", contains the following fields: "timestamp" and "temperature".
Every 30 seconds a new record is being added into it.
My goal is to identify the record (timestamp) which, compared to the one added 2 minutes later (4 records later), has a temperature difference of 20 degrees (or more).
Ex.
...
19:14:08 99
19:14:38 100
19:15:08 101
19:15:38 105
19:16:08 115
19:16:38 126
19:17:08 150
19:17:38 151
...
In this case, the timestamp which I have to find is 19:14:38, because if compared to the one at 19:16:38, we have 126-100 = 26 > 20.
There are some other conditions (not worth mentioning) which have to be met as well, but at least those I can handle myself.
Thanks for your help.
If your timestamps are exactly 30 seconds apart, you can use a self-join:
select t.*
from t join
     t tnext
     on t.timestamp = tnext.timestamp - interval 2 minute
where tnext.temperature - t.temperature > 20;
This is highly dependent on the accuracy of your timestamps, however.
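If the timestamps can drift by a few seconds, one option (my own variant, not part of the original answer) is to join on a small window around the 2-minute offset instead of an exact match:

select t.*
from t join
     t tnext
     on tnext.timestamp between t.timestamp + interval 110 second
                            and t.timestamp + interval 130 second
where tnext.temperature - t.temperature > 20;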
So I believe you're on the right track with a join to the same table; something like this should get you started. It's untested air-code, and I typically write in Oracle SQL, so pardon any syntax nuances...
SELECT
     a.TEMPERATURE AS NEW_TEMP
    ,b.TEMPERATURE AS PRIOR_TEMP
FROM
    DATA a
    INNER JOIN DATA b ON
        b.TIMESTAMP = a.TIMESTAMP - INTERVAL '2' MINUTE
        AND ABS(a.TEMPERATURE - b.TEMPERATURE) > 20
Additionally - using a timestamp is probably not as reliable as you might think, since there could be variation that you do not want (for example, the timestamp may be off by a second, i.e. it is exactly 1:59 prior to the new record, in which case this join would miss it). If you were instead using an auto-incremented ID as suggested above, you could simply replace that first join clause with:
b.RECORD_ID = a.RECORD_ID-4
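A fuller sketch of that ID-based variant (my own illustration; it assumes an auto-incremented RECORD_ID column, which is a hypothetical name, and keeps the question's table name data):

SELECT
     b.TIMESTAMP   AS START_TIME    -- the earlier record, which is the one the question asks for
    ,b.TEMPERATURE AS PRIOR_TEMP
    ,a.TEMPERATURE AS NEW_TEMP
FROM
    data a
    INNER JOIN data b ON
        b.RECORD_ID = a.RECORD_ID - 4   -- 4 rows = 2 minutes at one row per 30 seconds
WHERE
    a.TEMPERATURE - b.TEMPERATURE >= 20;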
I have to create a table that assigns a user id and a product id to some data (it models two one-to-many relationships). I will run a lot of queries like
select * from table where userid = x;
The first thing I am interested in is how big the table can get before the query's runtime becomes noticeable (let's say it takes more than 1 second).
Also, how can this be optimised?
I know that this might depend on the implementation. I will use MySQL for this specific project, but I am interested in more general answers as well.
It all depends on the horsepower of your machine. To make that query more efficient, create an index on "userid".
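For example (a minimal sketch; my_table is just a placeholder, since the question doesn't name the table):

CREATE INDEX idx_userid ON my_table (userid);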
how big can the table get before the query's runtime becomes noticeable (let's say it takes more than 1 second)
There are too many factors to deterministically measure run time. CPU speed, memory, I/O speed, etc. are just some of the external factors.
how can this be optimised?
That's more straightforward. If there is an index on userid then the query will likely do an index seek, which is about as fast as you can get as far as finding the record goes. If userid is the clustered index it will be faster still, because the engine won't have to use the position from the index to find the record in the data pages - the data is physically organized as part of the index.
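In MySQL/InnoDB the primary key is the clustered index, so a sketch of that layout might look like the following (the table and column names beyond userid are my assumptions, not from the question):

CREATE TABLE user_product (
    userid    INT NOT NULL,
    productid INT NOT NULL,
    data      VARCHAR(255),
    PRIMARY KEY (userid, productid)  -- clustered in InnoDB: rows are physically ordered by (userid, productid)
);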
let's say it takes more than 1 second
With an index on userid, MySQL will manage to find the correct row in O(log n) in the worst case. How many "seconds" that is depends on the performance of your machine.
It is impossible to give you an exact number, without considering how long one operation takes.
As an example: assume you have a database with 4 records. This requires 2 operations in the worst case. Any time you double your data, one more operation is required.
For example:

 # records | # operations to find entry (worst case)
-----------+-----------------------------------------
         2 | 1
         4 | 2
         8 | 3
        16 | 4
       ... | ...
     4'096 | 12
       ... | ...
      ~1 B | 30
      ~2 B | 31
So, with a huge number of records, the time remains almost constant. For 1 billion records you would need to perform ~30 operations.
And it continues like that: 2 billion records, 31 operations.
So, let's say your query executes in 0.001 seconds for 4096 entries (12 operations);
it would then take around (0.001 / 12 * 30 =) 0.0025 seconds for 1 billion records.
Important side note: this only considers the runtime complexity of the binary search, but it shows how the cost would scale.
In a nutshell: your database will be unimpressed by a single query on an indexed value. However, if you run a large number of those queries at the same time, the time of course increases.
I have 2 tables: posts<id, user_id, text, votes_counter, created> and votes<id, post_id, user_id, vote>. Here vote can be either 1 (upvote) or -1 (downvote). Now if I need to fetch the total votes (upvotes - downvotes) on a post, I can do it in 2 ways.
Use count(*) to count the number of upvotes and downvotes on that post from the votes table and then do the maths.
Set up a counter column votes_counter and increment or decrement it every time a user upvotes or downvotes. Then simply read that votes_counter.
My question is which one is better, and under what conditions. By conditions I mean factors like scalability, peak time, et cetera.
As far as I know, with method 1 count(*) could be a heavy operation on a table with millions of rows. But if I use a counter to avoid that, then during peak time the votes_counter column might get deadlocked, with too many users trying to update the counter at once!
Is there a third way better than both and as simple to implement?
The two approaches represent a common tradeoff between complexity of implementation and speed.
The first approach is very simple to implement, because it does not require you to do any additional coding.
The second approach is potentially a lot faster, especially when you need to count a small percentage of items in a large table.
The first approach can be sped up by well designed indexes. Rather than searching through the whole table, your RDBMS could retrieve a few records from the index and do the counts using them.
The second approach can become very complex very quickly:
You need to consider what happens to the counts when a user gets deleted
You should consider what happens when the table of votes is manipulated by tools outside your program. For example, merging records from two databases may prove a lot more complex when the current counts are stored along with the individual ones.
I would start with the first approach, and see how it performs. Then I would try optimizing it with indexing. Finally, I would consider going with the second approach, possibly writing triggers to update counts automatically.
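To make that concrete, here is a minimal sketch of both pieces against the posts/votes schema from the question (the index name, trigger name, and the example post_id value 42 are placeholders of my own):

-- Approach 1, helped by an index: vote is +1/-1, so SUM() already gives upvotes - downvotes
CREATE INDEX idx_votes_post ON votes (post_id, vote);
SELECT COALESCE(SUM(vote), 0) AS score FROM votes WHERE post_id = 42;

-- Approach 2, kept up to date by a trigger on new votes
-- (removing or changing a vote would need matching DELETE/UPDATE triggers)
CREATE TRIGGER trg_votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
    UPDATE posts SET votes_counter = votes_counter + NEW.vote WHERE id = NEW.post_id;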
As this sounds a lot like StackExchange, I'll refer you to this answer on the meta about the database schema used on the site. The votes table looks like this:
Votes table:
Id
PostId
VoteTypeId, one of the following values:
1 - AcceptedByOriginator
2 - UpMod
3 - DownMod
4 - Offensive
5 - Favorite (if VoteTypeId = 5, UserId will be populated)
6 - Close
7 - Reopen
8 - BountyStart (if VoteTypeId = 8, UserId will be populated)
9 - BountyClose
10 - Deletion
11 - Undeletion
12 - Spam
15 - ModeratorReview
16 - ApproveEditSuggestion
UserId (only present if VoteTypeId is 5 or 8)
CreationDate
BountyAmount (only present if VoteTypeId is 8 or 9)
And so based on that it sounds like the way it would be run is:
SELECT VoteTypeId FROM Votes WHERE PostId = @PostId AND (VoteTypeId = 2 OR VoteTypeId = 3)
And then based on the value, do the maths:
int score = 0;
for each vote in voteQueryResults:
    if (vote == 2) score++;
    if (vote == 3) score--;
Even with millions of results, this is probably going to be a very fast operation as it's so simple.
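If you'd rather have the database do the maths in one go, a hedged sketch of the same calculation as a single query (with @PostId as a placeholder for the post being scored):

SELECT SUM(CASE VoteTypeId WHEN 2 THEN 1 WHEN 3 THEN -1 ELSE 0 END) AS Score
FROM Votes
WHERE PostId = @PostId
  AND VoteTypeId IN (2, 3);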
I have a table with approximately 1 crore records.
Required: I need to fire a query which fetches records from this table having a login_date within the last 6 months (it yields about 5 lac records), plus some other conditions; the query is taking approx 60 sec.
Consideration: if I keep the records with a login_date in the last 6 months in a separate table, then the query takes just 1 to 2 seconds.
Solution?
Should I create the separate table by using a trigger?
Or is there a better solution, like views or something similar?
Are you using an index on this table? Creating a B-tree index on login_date should give you about the same performance as having a second table, without the schema complexity.
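For instance (a minimal sketch; my_table stands in for your actual table name):

CREATE INDEX idx_login_date ON my_table (login_date);

-- a 6-month range predicate like this can then be answered from the index:
SELECT COUNT(*) FROM my_table
WHERE login_date >= NOW() - INTERVAL 6 MONTH;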
Also, crore and lac aren't very common English words. Try "ten million" and "five hundred thousand", and more people should understand what you mean.
I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.
Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.
However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch random data.
Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:
Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?
When will performance become an issue?
SELECT user.*
FROM (
    SELECT id
    FROM user
    WHERE is_active = 1
      AND deleted = 0
      AND expiretime > '.time().'
      AND id NOT IN (10, 13, 15)
      AND id NOT IN (20, 30, 50)
      AND id NOT IN (103, 140, 250)
    ORDER BY RAND()
    LIMIT 100
) AS random_users
STRAIGHT_JOIN user
    ON user.id = random_users.id
I strongly urge you to read this article. The last segment covers the selection of multiple random rows, and you should notice the SELECT statement in the PROCEDURE described there. That is the spot where you add your specific WHERE conditions.
The problem with ORDER BY RAND() is that this operation has a complexity of n*log2(n), while the method described in the article I linked has almost constant complexity.
Let's assume that selecting a random row from a table which contains 10 entries using ORDER BY RAND() takes 1 time unit:
   entries | time units
-----------+-------------------------------------------
        10 | 1         /* if this takes 0.001 s */
       100 | 20
     1'000 | 300
    10'000 | 4'000
   100'000 | 50'000
 1'000'000 | 600'000   /* then this will need 10 minutes */
And you wrote that you are dealing with a table on the scale of millions.
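For reference, a minimal sketch of the near-constant-time idea behind that kind of article (my own illustration, not the article's exact procedure; it assumes ids are roughly contiguous and picks one matching row per execution):

SELECT u.*
FROM user u
JOIN (SELECT FLOOR(1 + RAND() * (SELECT MAX(id) FROM user)) AS rid) r
    ON u.id >= r.rid
WHERE u.is_active = 1
  AND u.deleted = 0
ORDER BY u.id
LIMIT 1;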
I'm afraid no-one is going to be able to answer your question with any accuracy. If you really want to know, you'll need to run some benchmarks against your system (ideally not the live one, but an exact copy). Benchmark this solution against a different solution (getting the random rows using PHP, for example) and compare the numbers to what you/your client consider "good performance". Then ramp up your data, trying to keep the distribution of column values as close to real as you can, and see where performance starts to drop off. To be honest, if it works for you now with a bit of headroom, I'd go for it. When (if!) it becomes a bottleneck, you can look at it again - or just throw extra iron at your database...
Preprocess as much as possible
Try something like this (VB-like example):
Dim sRND As New System.Text.StringBuilder() : Dim iRandom As New Random()
Dim iMaxID As Integer = **put your maxId here**
Dim excluded As New HashSet(Of Integer)({10, 13, 15, 20, 30, 50, 103, 140, 250})
Dim chosen As New HashSet(Of Integer)()
While chosen.Count < 100
    Dim RndVal As Integer = iRandom.Next(1, iMaxID + 1)
    ' skip the excluded ids and any id we already picked
    If Not excluded.Contains(RndVal) AndAlso chosen.Add(RndVal) Then
        sRND.Append("," & RndVal)
    End If
End While
Dim sql As String = String.Format("SELECT * FROM user WHERE is_active = 1 AND deleted = 0 AND expiretime > {0} AND id IN ({1}) ...blahblahblah... LIMIT 100", time(), Mid(sRND.ToString(), 2))
I didn't check the syntax, but you'll get my drift, I hope.
This will make MySQL read only the records that match the IN list and stop when it reaches 100, without the need to preprocess all records first.
Please let me know the elapsed-time difference if you try it. (I'm curious.)