Calculating the cost of Block Nested Loop Joins - mysql

I am trying to calculate the cost of the (most efficient) block nested loop join in terms of NPDR (number of disk page reads). Suppose you have a query of the form:
SELECT COUNT(*)
FROM county JOIN mcd
ON county.state_code = mcd.state_code
AND county.fips_code = mcd.fips_code
WHERE county.state_code = #NO
where #NO is substituted for a state code on each execution of the query.
I know that I can derive the NPDR using: NPDR(R x S) = |Pages(R)| + Pages(R) / B - 2 . |Pages(S)|
(where the smaller table is used as the outer in order to produce fewer page reads. Ergo:
R = county, S = mcd).
I also know that Page size = 2048 bytes
Pointer = 8 byte
Num. rows in mcd table = 35298
Num. rows in county table = 3141
Free memory buffer pages B = 100
Pages(X) = (rowsize)(numrows) / pagesize
What I am trying to figure out is how the "WHERE county.state_code = #NO" affects my cost?
Thanks for your time.

First a couple of observations regarding the formula you wrote:
I'm not sure why you write "B - 2" instead of "B - 1". From a theoretical perspective, you need only a single buffer page to read in relation S (you can do it by reading one page at a time).
Make sure you use all the brackets. I would write the formula as:
NPDR(R x S) = |Pages(R)| + |Pages(R)| / (B-2) * |Pages(S)|
All numbers in the formula should be rounded up (but this is nitpicking).
The explanation for the generic BNLJ formula:
You read in as many tuples from the smaller relation (R) as you can keep in memory (B-1 or B-2 pages worth of tuples).
For each group of B-2 pages worth of tuples, you then have to read the whole S relation (|Pages(S)| page reads) to perform the join for that particular chunk of relation R.
At the end of the join, relation R is read exactly one time and relation S is read as many times as we filled the memory buffer, namely |Pages(R)| / (B-2) times.
Now the answer:
In your example a selection criterion is applied to relation R (table county in this case). This is the WHERE county.state_code = #NO part of the query. Therefore, the generic formula does not apply directly.
When reading from relation R (i.e., table county in your example), we can discard all the non-qualifying tuples that do not match the selection criterion. Assuming that there are 50 states in the USA and that all states have the same number of counties, only 2% of the tuples in table county qualify on average and need to be stored in memory. This reduces the number of iterations of the inner loop of the join (i.e., the number of times we need to scan relation S / table mcd). The 2% figure is obviously just the expected average and will change depending on the actual given state.
The formula for your problem therefore becomes:
NPDR(R x S) = |Pages(County)| + |Pages(County)| / (B - 2) * |Counties in state #NO| / |Rows in table County| * |Pages(Mcd)|
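To see the orders of magnitude involved, here is the arithmetic with made-up row sizes (the question does not give row sizes, so the page counts below are purely illustrative):
Assume rowsize(county) = 80 bytes and rowsize(mcd) = 100 bytes.
Pages(county) = ceil(3141 * 80 / 2048) = 123
Pages(mcd) = ceil(35298 * 100 / 2048) = 1724
Qualifying fraction of county = 1/50 = 0.02 (the 2% above)
NPDR = 123 + ceil(123 * 0.02 / (100 - 2)) * 1724 = 123 + 1 * 1724 = 1847 page reads
Without the WHERE filter, the generic formula would give 123 + ceil(123 / 98) * 1724 = 123 + 2 * 1724 = 3571 page reads, so the selection roughly halves the cost here.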

Related

SQL - Add To Existing Average

I'm trying to build a reporting table to track server traffic and popularity overall. Each SID is a unique game server hosting a particular game, and each UCID is a unique player key connecting to that server.
Say I have a table like so:
SID UCID AvgTime NumConnects
-----------------------------------------
1 AIE9348ietjg 300.55 5
1 Po328gieijge 500.66 7
2 AIE9348ietjg 234.55 3
3 Po328gieijge 1049.88 18
We can see that there are 2 unique players, and 3 unique servers, with SID 1 having 2 players that have connected to it at some point in the past. The AvgTime is the average amount of time those players spent on that server (in seconds), and the NumConnects is the size of the average (ie. 300.55 is averaged out of 5 elements).
Now I run a job in the background where I process a raw connection table and pull out player connections like so:
SID UCID ConnectTime DisconnectTime
-----------------------------------------
1 AIE9348ietjg 90.35 458.32
2 Po328gieijge 30.12 87.15
2 AIE9348ietjg 173.12 345.35
This table has no ID or other fluff to help condense my example. There may be multiple connect/disconnect records for multiple players in this table. What I want to do is add to my existing AvgTime for each SID these new values.
There is a formula I am trying to use (taken from this math stackexchange answer: https://math.stackexchange.com/questions/1153794/adding-to-an-average-without-unknown-total-sum/1153800#1153800):
Average = (Average * Size + NewValue) / (Size + 1)
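(With the sample rows above, server 1 / player AIE9348ietjg would go from an average of 300.55 over 5 connects to (300.55 * 5 + (458.32 - 90.35)) / (5 + 1) ≈ 311.79 over 6 connects.)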
How can I write an update query to update each ServerID's row in the traffic table above, adding to the average using the above formula for each pair of records? I tried something like the following, but it didn't work (it returned NULL):
UPDATE server_traffic st
LEFT JOIN connect_log l
ON st.SID = l.SID AND st.UCID = l.UCID
SET AvgTime = (AvgTime * NumConnects + SUM(l.DisconnectTime - l.ConnectTime) / NumConnects + COUNT(l.UCID)
I would prefer an answer in MySql, but I'll accept MS SQL as well.
EDIT
I understand that statistics and calculations are generally not to be stored in tables and that you can run reports that would crunch the numbers for you. My requirement is that users can go to a website and view the popularity of various servers. This needs to be done in a way that
A: running a complex query per user doesn't crash or slow down the system
B: the page returns the data within a few seconds at most
See this example here: https://bf4stats.com/pc/shinku555555
This is a web page for battlefield 4 stats - notice that the load is almost near instant for this player, and I get back a load of statistics without waiting for some complex report query to return the data. I'm assuming they store these calculations in preprocessed tables where the webpage just needs to do a simple select to return back the values. That's the same approach I want to take with my Database and Web Application design.
Sorry if this is off topic to the original question - but hopefully this adds additional context that helps people understand my needs.
Since you cannot use aggregate functions like SUM and COUNT at the row level in an UPDATE ... LEFT JOIN, but only inside an aggregate query, consider joining to an aggregate subquery instead. Also, adjust the parentheses in the SET clause so it matches the formula above.
Also, note that because you use a LEFT JOIN, rows with no matching IDs will have NULL for the aggregate fields; NULL cannot be used in arithmetic, so the whole expression will come out NULL. You can convert NULLs to zero with IFNULL(), but the formula's division can still misbehave.
UPDATE server_traffic s
LEFT JOIN
(SELECT SID, UCID, COUNT(UCID) As GrpCount,
SUM(DisconnectTime - ConnectTime) AS SumTimeDiff
FROM connect_log
GROUP BY SID, UCID) l
ON s.SID = l.SID AND s.UCID = l.UCID
SET s.AvgTime = (s.AvgTime * s.NumConnects + l.SumTimeDiff) / (s.NumConnects + l.GrpCount)
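If you only want to touch rows that actually have new log entries (which sidesteps the NULL issue entirely), an INNER JOIN variant of the same statement could look like the sketch below; whether leaving the other rows untouched is acceptable for your report is an assumption on my part:
UPDATE server_traffic s
INNER JOIN
    (SELECT SID, UCID, COUNT(UCID) AS GrpCount,
            SUM(DisconnectTime - ConnectTime) AS SumTimeDiff
     FROM connect_log
     GROUP BY SID, UCID) l
    ON s.SID = l.SID AND s.UCID = l.UCID
SET s.AvgTime = (s.AvgTime * s.NumConnects + l.SumTimeDiff)
                / (s.NumConnects + l.GrpCount);
You would presumably also follow up with a second UPDATE that adds the per-pair connection count to NumConnects, so the stored sample size stays in step with the stored average.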
Aside: reconsider storing calculations/statistics in tables, since they can always be produced by queries (even filtered by timestamps). Ideally, database tables should store raw values.

MySQL get duplicate rows in subquery

I want to display all duplicate records from my table; the rows look like this:
uid planet degree
1 1 104
1 2 109
1 3 206
2 1 40
2 2 76
2 3 302
I have many different OR conditions with different combinations in the subquery, and I want to count every one that matches, but it only displays the first match for each planet and degree.
Query:
SELECT DISTINCT
p.uid,
(SELECT COUNT(*)
FROM Params AS p2
WHERE p2.uid = p.uid
AND(
(p2.planet = 1 AND p2.degree BETWEEN 320 - 10 AND 320 + 10) OR
(p2.planet = 7 AND p2.degree BETWEEN 316 - 10 AND 316 + 10)
...Some more OR statements...
)
) AS counts
FROM Params AS p
HAVING counts > 0
ORDER BY p.uid DESC
any solution folks?
updated
So, the problem most people have with their counting-joined-subquery-group-queries is that the base query isn't right, so the following may seem like complete overkill for this question ;o)
base data
in this particular example, what you want as a data basis is first this:
(uidA, planetA, uidB, planetB) for every combination of player A planets and player B planets. that one is quite simple (l is for left, r is for right):
SELECT l.uid, l.planet, r.uid, r.planet
FROM params l, params r
first step done.
filter data
now you want to determine if - for one row, meaning one pair of planets - the planets collide (or almost collide). this is where the WHERE comes in.
WHERE ABS(l.degree-r.degree) < 10
would for example only leave those pairs of planets with a difference in degrees of less than 10. more complex stuff is possible (your crazy conditional ...), for example if the planets have different diameters you may add additional terms. however, my advice would be to put some of the additional data you currently have hard-coded in your query into tables.
for example, if all players' 1st planets have the same size, you could have a table with (planet_id, size). if every planet can have a different size, add the size to the params table as a column.
then your WHERE clause could be like:
WHERE ABS(l.degree-r.degree) < l.size+r.size
if for example two big planets with size 5 and 10 should at least be 15 degrees apart, this query would find all those planets that aren't.
we assume that you have a nice conditional, so at this point we have a list of (uidA, planetA, uidB, planetB) pairs of planets that are colliding or close to colliding (whatever semantics you chose). the next step is to get the data you're actually interested in:
limit uidA to a specific user_id (the currently logged in user for example)
add l.uid = <uid> to your WHERE.
count for every planet A, how many planets B exist, that threaten collision
add GROUP BY l.uid, l.planet,
replace r.uid, r.planet with count(*) as counts in your SELECT clause
then you can even filter: HAVING counts > 1 (HAVING is the WHERE for after you have GROUPed)
and of course, you can
filter out certain players B that may not have planetary interactions with player A
add to your WHERE
r.uid NOT IN (1)
find only self collisions
WHERE l.uid = r.uid
find only non-self collisions
WHERE l.uid <> r.uid
find only collisions with one specific planet
WHERE l.planet = 1
conclusion
a structured query, where you start from the correct base data, then filter it appropriately and then group it, is usually the best approach. if some of the concepts are unclear to you, please read up on them online; there are manuals everywhere
final query could look something like this
SELECT l.uid, l.planet, count(*) as counts
FROM params l, params r
WHERE [ collision-condition ]
GROUP BY l.uid, l.planet
HAVING counts > 0
if you want to collide against a non-planet object, you might want to make a "virtual table": instead of FROM params l, params r you do (with possibly different fields, I just assume you add a size field that is somehow used):
FROM params l, (SELECT 240 as degree, 2 as planet, 5 as size) r
multiple:
FROM params l, (SELECT 240 as degree, 2 as planet, 5 as size
UNION
SELECT 250 as degree, 3 as planet, 10 as size
UNION ...) r
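to tie this back to your original query: if your OR list is really just a set of (planet, target degree) pairs with a +/- 10 window, the "virtual table" trick above gives you the per-user counts directly. a sketch using only the two conditions you showed (extend the UNION ALL with the rest of your targets):
SELECT l.uid, count(*) as counts
FROM Params l,
     (SELECT 1 as planet, 320 as degree
      UNION ALL
      SELECT 7 as planet, 316 as degree) r
WHERE l.planet = r.planet
  AND l.degree BETWEEN r.degree - 10 AND r.degree + 10
GROUP BY l.uid
HAVING counts > 0
ORDER BY l.uid DESC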

Need a different permutation of groups of numbers

I have numbers from 1 to 36. What I am trying to do is put all these numbers into three groups and work out all the various permutations of groups.
Each group must contain 12 numbers, from 1 to 36
A number cannot appear in more than one group, per permutation
Here is an example....
Permutation 1
Group 1: 1,2,3,4,5,6,7,8,9,10,11,12
Group 2: 13,14,15,16,17,18,19,20,21,22,23,24
Group 3: 25,26,27,28,29,30,31,32,33,34,35,36
Permutation 2
Group 1: 1,2,3,4,5,6,7,8,9,10,11,13
Group 2: 12,14,15,16,17,18,19,20,21,22,23,24
Group 3: 25,26,27,28,29,30,31,32,33,34,35,36
Permutation 3
Group 1: 1,2,3,4,5,6,7,8,9,10,11,14
Group 2: 12,13,15,16,17,18,19,20,21,22,23,24
Group 3: 25,26,27,28,29,30,31,32,33,34,35,36
Those are three examples; I would expect there to be millions/billions more.
The analysis that follows assumes the order of groups matters - that is, if the numbers were 1, 2, 3 then the grouping [{1},{2},{3}] is distinct from the grouping [{3},{2},{1}] (indeed, there are six distinct groupings when taking from this set of numbers).
In your case, how do we proceed? Well, we must first choose the first group. There are 36 choose 12 ways to do this, or (36!)/[(12!)(24!)] = 1,251,677,700 ways. We must then choose the second group. There are 24 choose 12 ways to do this, or (24!)/[(12!)(12!)] = 2,704,156 ways. Since the second choice is already conditioned upon the first, we may get the total number of ways of taking the three groups by multiplying the numbers; the total number of ways to choose three equal groups of 12 from a pool of 36 is 3,384,731,762,521,200. If you represented numbers using 8-bit bytes, then storing every list would take at least ~3 petabytes (well, times the size of each list, which would be 36 bytes, so more like ~108 petabytes). This is a lot of data that will take some time to generate and no small amount of disk space to store, so be aware of this.
Actually implementing this is not so terrible. However, I think you are going to have undue difficulty implementing it in SQL, if it's possible at all. A single operation in pure SQL returns at most n^2 entries (for a simple cross join), so getting such huge numbers of results would require a large number of joins. Moreover, it does not strike me as possible to generalize the procedure, since pure SQL has no ability to do general recursion and therefore cannot do a variable number of joins.
You could use a procedural language to generate the groupings and then write the groupings into a database. I don't know whether this is what you are after.
n = 36
group1[1...12] = []
group2[1...12] = []
group3[1...12] = []
function Choose(input[1...n], m, minIndex, group)
    if minIndex + m > n + 1 then
        return                      // not enough items left to fill this group
    if m = 0 then                   // this group is full
        if group = group1 then
            Choose(input[1...n], 12, 1, group2)
        else if group = group2 then
            group3[1...12] = input[1...12]
            print group1, group2, group3
        return
    for i = minIndex to n do
        group[12 - m + 1] = input[i]
        Choose(input[1 ... i - 1].input[i + 1 ... n], m - 1, i, group)
When you call this like Choose([1...36], 12, 1, group1) what it does is fill in group1 with all possible ordered subsequences of length 12. At that point, m = 0 and group = group1, so the call Choose([?], 12, 1, group2) is made (for every possible choice of group1, hence the ?). That will choose all remaining ordered subsequences of length 12 for group2, at which point again m = 0 and now group = group2. We may now safely assign group3 to the remaining entries (there is only one way to choose group3 after choosing group1 and group2).
We take ordered subsequences only by propagating the index at which to begin looking on the recursive call (minIndex). We take ordered subsequences to avoid getting permutations of the same set of 12 items (since order doesn't matter within a group).
Each recursive call to Choose in the loop passes input with one element removed: precisely that element that just got added to the group under consideration.
We check for minIndex + m > n + 1 and stop the recursion early because, in this case, we have skipped too many items in the input to be able to ever fill up the current group with 12 items (while choosing the subsequence to be ordered).
You will notice I have hard-coded the assumption of 12/36/3 groups right into the logic of the program. This was done for brevity and clarity, not because you can't parameterize it by the input size N and the number of groups k to form. To do this, you'd need to create an array of groups (k groups of size N/k each), then call Choose with N/k instead of 12 and use a select/switch-case statement instead of if/then/else to determine whether to Choose again or print. But those details can be left as an exercise.

Tree like data collation in SQL (Mysql)

I have two tables in my database
Table A with columns user_id, free_data, used_data
Table B with columns donor_id, receptor_id, share_data
Basically, a user (let's call him x) has some data in his account, which is represented by his entry in Table A. The data is stored in the free_data column. He can donate data to any other user (let's call him y), which will show up as an entry in Table B. The same amount of data gets deducted from user x's free_data column.
When the entry in Table B gets created, an entry in Table A for user y is also created, with free_data equal to share_data. Now user y can give away data to user z and the process continues.
Each user keep using their data & the entry used_data in Table A keeps on adding up to indicate how much data each user has used.
This is like a tree structure, where there is an entry with all the data (the root node) which eventually gives data to others, who in turn give data to other nodes.
Now I would like to write an SQL query such that, given a node x (the id of an entry in Table A), I can sum up the total data x has given and, for all the beneficiaries at every level below, collate their used_data and show it against x.
Basically, I want to collate
Overall data x has donated.
How much of the donated data from x has been used up.
While the implementation is more graph-like, I am interested to know whether, if we assume it to be a tree below node x, we can come up with a single SQL query that gets the data I need.
Example
Table A
user_id, free_data, used_data
1 50 10
2 30 20
3 20 20
Table B
donor_id, receptor_id, share_data
1 2 30
1 3 20
Total data donated by 1 - 30 + 20 = 50
Total donated data used - 20 + 20 = 40
This is just one level, where 1 donated to 2 & 3. 2 could in turn donate to 4, and all of that data would need to be collated in a bubbled-up fashion when calculating the overall donated data usage.
Yes, it's possible using the nested set model. There's a book by Joe Celko that describes it, but if you want to get straight into it there's an article that talks about it. Both of the collated values that you need can be retrieved by a single SELECT statement like this:
SELECT * FROM TableB WHERE `left` > some_value1 AND `right` < some_value2
In the article's example, to get all the child nodes of "Portable Electronics" the query would be:
SELECT * FROM Electronics WHERE `left` > 10 and `right` < 19
The article describes how the left and right columns should be initialised.
If I understand the problem correctly, the following should give you the desired results:
SELECT B.donor_id AS donor_id, SUM(A.used_data) AS total_used_data FROM A
INNER JOIN B ON A.user_id = B.receptor_id GROUP BY B.donor_id;
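(On the sample data above this returns donor_id 1 with total_used_data = 20 + 20 = 40, which matches the expected "total donated data used" for the one-level example; deeper levels of the donation chain are not walked by this query.)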
Hope this will solve your problem now.
Try the query below (note that you will have to pass the user id in 2 places):
SELECT SUM(share_data) AS total_donated, SUM(used_data) AS total_used
FROM tableA
LEFT JOIN tableB
    ON tableA.user_id = tableB.donor_id
WHERE user_id IN (SELECT receptor_id AS id
                  FROM (SELECT * FROM tableB
                        ORDER BY donor_id, receptor_id) u_sorted,
                       (SELECT @pv := '1') initialisation
                  WHERE find_in_set(donor_id, @pv) > 0
                    AND @pv := concat(@pv, ',', receptor_id)) OR user_id = 1;
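If you are on MySQL 8.0 or later, a recursive CTE is usually a more readable way to walk every level below a node than the session-variable trick above. A minimal sketch against the same tables, with the root user id 1 hard-coded in two places just like the query above (on the example data it returns 50 donated and 40 used):
WITH RECURSIVE descendants AS (
    SELECT receptor_id FROM tableB WHERE donor_id = 1
    UNION ALL
    SELECT b.receptor_id
    FROM tableB b
    JOIN descendants d ON b.donor_id = d.receptor_id
)
SELECT
    (SELECT SUM(share_data) FROM tableB WHERE donor_id = 1) AS total_donated,
    (SELECT SUM(used_data) FROM tableA
     WHERE user_id IN (SELECT receptor_id FROM descendants)) AS total_used;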

How to randomly select multiple rows satisfying certain conditions from a MySQL table?

I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.
Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.
However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch random data.
Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:
Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?
When will performance become an issue?
SELECT user.*
FROM (
SELECT id
FROM user
WHERE is_active = 1
AND deleted = 0
AND expiretime > '.time().'
AND id NOT IN (10, 13, 15)
AND id NOT IN (20, 30, 50)
AND id NOT IN (103, 140, 250)
ORDER BY RAND()
LIMIT 100
)
AS random_users
STRAIGHT JOIN user
ON user.id = random_users.id
I strongly urge you to read this article. The last segment covers the selection of multiple random rows, and you should notice the SELECT statement in the PROCEDURE described there. That is the spot where you would add your specific WHERE conditions.
The problem with ORDER BY RAND() is that this operation has a complexity of n*log2(n), while the method described in the article I linked has almost constant complexity.
Let's assume that selecting a random row from a table which contains 10 entries using ORDER BY RAND() takes 1 time unit:
entries | time units
-------------------------
10 | 1 /* if this takes 0.001s */
100 | 20
1'000 | 300
10'000 | 4'000
100'000 | 50'000
1'000'000 | 600'000 /* then this will need 10 minutes */
And you wrote that you are dealing with a table on the scale of millions of rows.
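For reference, the core idea of the technique in that article is to pick random values in the id range and join them back to the table, instead of sorting the whole table. The article wraps this in a stored procedure that handles the edge cases properly; a minimal hand-rolled sketch of the idea (assuming a reasonably dense AUTO_INCREMENT id column, and substituting UNIX_TIMESTAMP() for the PHP time() call) would be something like:
SELECT u.*
FROM user u
JOIN (
    SELECT FLOOR(1 + RAND() * (SELECT MAX(id) FROM user)) AS id
    FROM user
    LIMIT 300   -- oversample: gaps in the ids and the WHERE filters will drop some candidates
) r ON u.id = r.id
WHERE u.is_active = 1
  AND u.deleted = 0
  AND u.expiretime > UNIX_TIMESTAMP()
  AND u.id NOT IN (10, 13, 15, 20, 30, 50, 103, 140, 250)
LIMIT 100;
Duplicate random ids are possible with this naive version, so in practice you would either deduplicate the candidate list or oversample more aggressively and add DISTINCT.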
I'm afraid no-one's going to be able to answer your question with any accuracy. If you really want to know, you'll need to run some benchmarks against your system (ideally not the live one, but an exact copy). Benchmark this solution against a different solution (getting the random rows using PHP, for example) and compare the numbers to what you/your client consider "good performance". Then ramp up your data, trying to keep the distribution of column values as close to real as you can, and see where performance starts to drop off. To be honest, if it works for you now with a bit of headroom, then I'd go for it. When (if!) it becomes a bottleneck then you can look at it again - or just chuck extra iron at your database...
Preprocess as much as possible
try something like (VB-like example)
Dim sRND As New StringBuilder : Dim iRandom As New Random()
Dim iMaxID As Integer = **put your maxId here**
Dim Cnt As Integer = 0
While Cnt < 100
    Dim RndVal As Integer = iRandom.Next(1, iMaxID)
    ' wrap the list in commas so that e.g. 1 does not match 10, 13 or 103
    If Not (",10,13,15,20,30,50,103,140,250,").Contains("," & RndVal & ",") Then
        Cnt += 1
        sRND.Append("," & RndVal)
    End If
End While
String.Format("SELECT * FROM user WHERE is_active = 1 AND deleted = 0 AND expiretime > {0} AND id IN ({1}) ...blahblahblah... LIMIT 100", time(), Mid(sRND.ToString, 2))
I didn't check for syntax but you'll get my drift I hope.
This will make MySql read records that fit the 'IN' and stop when it reaches 100 without the need to preprocess all records first.
Please let me know the elapsed time difference if you try it. (I'm curious)