How to randomly select multiple rows satisfying certain conditions from a MySQL table?

I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.
Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.
However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch random data.
Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:
Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?
When will performance become an issue?
SELECT user.*
FROM (
SELECT id
FROM user
WHERE is_active = 1
AND deleted = 0
AND expiretime > '.time().'
AND id NOT IN (10, 13, 15)
AND id NOT IN (20, 30, 50)
AND id NOT IN (103, 140, 250)
ORDER BY RAND()
LIMIT 100
)
AS random_users
STRAIGHT JOIN user
ON user.id = random_users.id

I strongly urge you to read this article. Its last section covers selecting multiple random rows. Note the SELECT statement in the PROCEDURE described there: that is where you would add your specific WHERE conditions.
The problem with ORDER BY RAND() is that the operation has a complexity of n*log2(n), while the method described in the linked article has almost constant complexity.
Let's assume that selecting a random row from a table of 10 entries using ORDER BY RAND() takes 1 time unit:
entries | time units
-------------------------
10 | 1 /* if this takes 0.001s */
100 | 20
1'000 | 300
10'000 | 4'000
100'000 | 50'000
1'000'000 | 600'000 /* then this will need 10 minutes */
And you wrote that you are dealing with a table on the scale of millions of rows.
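The time units in the table above follow directly from the n*log2(n) estimate. A small sketch that reproduces them, assuming the 10-row case is the baseline (the function name relative_cost is made up for illustration):

```python
import math

def relative_cost(n, baseline=10):
    """Cost of an n*log2(n) sort relative to sorting `baseline` rows."""
    return (n * math.log2(n)) / (baseline * math.log2(baseline))

# Print the same columns as the table: entries | time units
for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} | {relative_cost(n):>9,.0f}")
```

This only models the sort itself; real timings also depend on I/O, memory limits and how many rows survive the WHERE clause before sorting.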

I'm afraid no one is going to be able to answer your question with any accuracy. If you really want to know, you'll need to run some benchmarks against your system (ideally not the live one but an exact copy). Benchmark this solution against a different one (getting the random rows using PHP, for example) and compare the numbers to what you or your client consider "good performance". Then ramp up your data, keeping the distribution of column values as close to real as you can, and see where performance starts to drop off. To be honest, if it works for you now with a bit of headroom, I'd go for it. When (if!) it becomes a bottleneck you can look at it again, or just chuck extra iron at your database...

Preprocess as much as possible
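try something like this (VB-like example)
Dim sRND = New StringBuilder : Dim iRandom As New Random()
' IDs that must never be selected. A set gives exact matching; a substring test on "10,13,15,..." would also reject 1, 3, 5 and so on.
Dim Excluded As New HashSet(Of Integer) From {10, 13, 15, 20, 30, 50, 103, 140, 250}
Dim iMaxID As Integer = **put your maxId here**
Dim Cnt As Integer = 0
While Cnt < 100
' The upper bound of Next is exclusive, so add 1 to make iMaxID selectable
Dim RndVal As Integer = iRandom.Next(1, iMaxID + 1)
' Add returns False for an ID that is excluded or already picked, which also prevents duplicates
If Excluded.Add(RndVal) Then
Cnt += 1
sRND.Append("," & RndVal)
End If
End While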
String.Format("SELECT * FROM (Select ID FROM(User) WHERE(is_active = 1) AND deleted = 0 AND expiretime > {0} AND id IN ({1}) .blahblablah.... LIMIT 100",time(), Mid(sRND.ToString, 2))
I didn't check the syntax, but I hope you get my drift.
This lets MySQL read only the records that match the IN list and stop once it reaches 100, without having to process all records first.
Please let me know the difference in elapsed time if you try it. (I'm curious.)
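The same pre-selection idea can be sketched in Python. This is only an illustration under stated assumptions: the helper name pick_candidate_ids and the max_id value are made up, IDs are assumed to run from 1 to max_id, and the %s placeholder style is the one used by common MySQL drivers for Python.

```python
import random

def pick_candidate_ids(max_id, excluded, count, rng=random):
    """Draw `count` distinct ids from 1..max_id, skipping ids in `excluded`."""
    excluded = set(excluded)
    usable = max_id - len({i for i in excluded if 1 <= i <= max_id})
    if usable < count:
        raise ValueError("not enough ids to choose from")
    picked = set()
    while len(picked) < count:
        candidate = rng.randint(1, max_id)  # inclusive on both ends
        if candidate not in excluded:
            picked.add(candidate)  # a set silently drops duplicates
    return sorted(picked)

# Hypothetical max_id; in practice it would come from SELECT MAX(id) FROM user.
ids = pick_candidate_ids(1_000_000, {10, 13, 15, 20, 30, 50, 103, 140, 250}, 100)

# One placeholder per id, to be passed to the driver together with `ids`.
placeholders = ",".join(["%s"] * len(ids))
sql = (
    "SELECT * FROM user WHERE is_active = 1 AND deleted = 0 "
    f"AND id IN ({placeholders})"
)
```

Note that IDs which do not exist, or whose rows fail the other WHERE conditions, simply drop out, so the query can return fewer than 100 rows. Drawing more candidates than needed and trimming the result afterwards is one way around that.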

Related

SQL - Add To Existing Average

I'm trying to build a reporting table to track server traffic and popularity overall. Each SID is a unique game server hosting a particular game, and each UCID is a unique player key connecting to that server.
Say I have a table like so:
SID UCID AvgTime NumConnects
-----------------------------------------
1 AIE9348ietjg 300.55 5
1 Po328gieijge 500.66 7
2 AIE9348ietjg 234.55 3
3 Po328gieijge 1049.88 18
We can see that there are 2 unique players, and 3 unique servers, with SID 1 having 2 players that have connected to it at some point in the past. The AvgTime is the average amount of time those players spent on that server (in seconds), and the NumConnects is the size of the average (ie. 300.55 is averaged out of 5 elements).
Now I run a job in the background where I process a raw connection table and pull out player connections like so:
SID UCID ConnectTime DisconnectTime
-----------------------------------------
1 AIE9348ietjg 90.35 458.32
2 Po328gieijge 30.12 87.15
2 AIE9348ietjg 173.12 345.35
This table has no ID or other fluff to help condense my example. There may be multiple connect/disconnect records for multiple players in this table. What I want to do is add to my existing AvgTime for each SID these new values.
I am trying to use a formula taken from this Math Stack Exchange answer: https://math.stackexchange.com/questions/1153794/adding-to-an-average-without-unknown-total-sum/1153800#1153800
Average = (Average * Size + NewValue) / (Size + 1)
How can I write an update query that updates the server traffic table above for each SID, adding to the average with the above formula for each pair of records? I tried something like the following, but it didn't work (it returned NULL):
UPDATE server_traffic st
LEFT JOIN connect_log l
ON st.SID = l.SID AND st.UCID = l.UCID
SET AvgTime = (AvgTime * NumConnects + SUM(l.DisconnectTime - l.ConnectTime) / NumConnects + COUNT(l.UCID)
I would prefer an answer in MySql, but I'll accept MS SQL as well.
EDIT
I understand that statistics and calculations are generally not to be stored in tables and that you can run reports that would crunch the numbers for you. My requirement is that users can go to a website and view the popularity of various servers. This needs to be done in a way that
A: running a complex query per user doesn't crash or slow down the system
B: the page returns the data within a few seconds at most
See this example here: https://bf4stats.com/pc/shinku555555
This is a web page for battlefield 4 stats - notice that the load is almost near instant for this player, and I get back a load of statistics without waiting for some complex report query to return the data. I'm assuming they store these calculations in preprocessed tables where the webpage just needs to do a simple select to return back the values. That's the same approach I want to take with my Database and Web Application design.
Sorry if this is off topic to the original question - but hopefully this adds additional context that helps people understand my needs.
Aggregate functions such as SUM and COUNT cannot be used at the row level in an UPDATE; they have to run inside an aggregate query. Consider joining to an aggregate subquery in the UPDATE...LEFT JOIN. Also, adjust the parentheses in SET to match the formula above.
Note that because you use LEFT JOIN, rows without a match get NULL for the aggregate fields, and any arithmetic on NULL returns NULL. You can convert NULL to zero with IFNULL(), but the formula's division may then fail.
UPDATE server_traffic s
LEFT JOIN
(SELECT SID, UCID, COUNT(UCID) As GrpCount,
SUM(DisconnectTime - ConnectTime) AS SumTimeDiff
FROM connect_log
GROUP BY SID, UCID) l
ON s.SID = l.SID AND s.UCID = l.UCID
SET s.AvgTime = (s.AvgTime * s.NumConnects + l.SumTimeDiff) / (s.NumConnects + l.GrpCount)
Aside - reconsider saving calculations/statistics in tables, since they can always be computed by queries, even filtered by timestamp. Ideally, database tables should store raw values.
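The arithmetic behind the UPDATE can be checked outside the database. A minimal sketch of folding a batch of new session lengths into a stored average, generalising the single-value formula to several values at once (the function name merge_into_average is made up for illustration):

```python
def merge_into_average(avg, size, new_values):
    """Fold new_values into a running average of `size` earlier samples.

    Returns the updated (average, size) pair.
    """
    new_values = list(new_values)
    new_size = size + len(new_values)
    if new_size == 0:
        return avg, 0  # nothing recorded yet, avoid dividing by zero
    new_avg = (avg * size + sum(new_values)) / new_size
    return new_avg, new_size

# SID 1 / AIE9348ietjg from the tables above: one new session of
# 458.32 - 90.35 seconds is added to an average of 300.55 over 5 connects.
avg, connects = merge_into_average(300.55, 5, [458.32 - 90.35])
print(round(avg, 2), connects)
```

The stored NumConnects has to grow by the number of merged sessions as well, otherwise the next merge weights the old average incorrectly.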

How to select two MySQL rows and then compare a column and return an output

I've a table with a structure something like this,
Device | paid | time
abc 1 2 days ago
abc 0 1 day ago
abc 0 5 mins ago
Is it possible to write a query that checks the paid column on all rows where Device = abc and then outputs the two most recent rows that differ? Basically, something like an if statement saying: if row 1 = 1 and row 2 = 0, output that, but only if they are the two most recent rows that differ. In this case, that is the first and second rows. The table is updated whenever a user changes from a free to a paid account, etc. It is also updated in other columns for other reasons, hence the duplicate 0s.
I know this would probably be done better by having another table altogether and updating that every time the user switches account type, but is there any way to make this work?
Thanks
Example:
http://rextester.com/MABU7860 This needs further testing on edge cases, but it seems to work.
SELECT A.*, B.*
FROM SQLfoo A
INNER JOIN SQLfoo B
on A.Device = B.Device
and A.mTime < B.mTime
WHERE A.Paid <> B.Paid
and A.Device = 'abc'
ORDER BY A.mTime DESC, B.mTime ASC
LIMIT 1
The self join pairs each row with every later row for the same device (A.mTime < B.mTime), so a row never matches itself and each pair appears only once. The WHERE clause keeps only the pairs whose paid values differ, and since it restricts the result to a single device we don't need to concern ourselves with the other devices. Ordering by A.mTime descending puts the latest row that was followed by a change first, and B.mTime ascending then picks the first later row with a different value; LIMIT 1 returns that pair. With the sample data this is the first and second rows.
Or using user variables
http://rextester.com/TWVEVX7830
In other engines one might accomplish this task by performing the join as above, assigning a row number partitioned by the device, and then simply returning the rows whose row number is 1, which would be the earliest date discrepancy.
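A self join of this kind can be tried out quickly with an in-memory SQLite database. The sketch below is an illustration, not the code behind the rextester links: the table name device_log, the column m_time and the integer timestamps are assumptions. It orders by the earlier row's time descending and the later row's time ascending, so the returned pair straddles the most recent change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_log (device TEXT, paid INTEGER, m_time INTEGER)")
conn.executemany(
    "INSERT INTO device_log VALUES (?, ?, ?)",
    [("abc", 1, 100), ("abc", 0, 200), ("abc", 0, 300)],  # oldest to newest
)

# Pair every row (a) with each later row (b) of the same device whose paid
# value differs, then keep the pair around the latest change.
row = conn.execute(
    """
    SELECT a.paid, a.m_time, b.paid, b.m_time
    FROM device_log a
    JOIN device_log b
      ON a.device = b.device AND a.m_time < b.m_time
    WHERE a.paid <> b.paid AND a.device = ?
    ORDER BY a.m_time DESC, b.m_time ASC
    LIMIT 1
    """,
    ("abc",),
).fetchone()
print(row)
```

For the three sample rows this pairs the row at time 100 (paid = 1) with the row at time 200 (paid = 0), which corresponds to the first and second rows of the question.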
Use LIMIT to limit the number of records in MySQL:
http://www.mysqltutorial.org/mysql-limit.aspx
In your case, use LIMIT 2,
then put the two records you just selected into an array and compare their values. If they are different, print them.

MySQL- Counting rows VS Setting up a counter

I have 2 tables, posts<id, user_id, text, votes_counter, created> and votes<id, post_id, user_id, vote>. Here the vote column can be either 1 (upvote) or -1 (downvote). Now if I need to fetch the total votes (upvotes - downvotes) on a post, I can do it in 2 ways.
Use count(*) to count the number of upvotes and downvotes on that post from the votes table and then do the maths.
Set up a counter column votes_counter and increment or decrement it every time a user upvotes or downvotes. Then simply extract that votes_counter.
My question is which one is better and under what conditions. By conditions, I mean factors like scalability, peak time, et cetera.
As far as I know, if I use method 1, count(*) could be a heavy operation for a table with millions of rows. If I use a counter to avoid that, then during peak time the votes_counter column might get deadlocked, with too many users trying to update the counter!
Is there a third way better than both and as simple to implement?
The two approaches represent a common tradeoff between complexity of implementation and speed.
The first approach is very simple to implement, because it does not require you to do any additional coding.
The second approach is potentially a lot faster, especially when you need to count a small percentage of items in a large table.
The first approach can be sped up by well-designed indexes. Rather than searching through the whole table, your RDBMS could retrieve a few records from the index and do the counts using them.
The second approach can become very complex very quickly:
You need to consider what happens to the counts when a user gets deleted
You should consider what happens when the table of votes is manipulated by tools outside your program. For example, merging records from two databases may prove a lot more complex when the current counts are stored along with the individual ones.
I would start with the first approach, and see how it performs. Then I would try optimizing it with indexing. Finally, I would consider going with the second approach, possibly writing triggers to update counts automatically.
As this sounds a lot like StackExchange, I'll refer you to this answer on the meta about the database schema used on the site. The votes table looks like this:
Votes table:
Id
PostId
VoteTypeId, one of the following values:
1 - AcceptedByOriginator
2 - UpMod
3 - DownMod
4 - Offensive
5 - Favorite (if VoteTypeId = 5, UserId will be populated)
6 - Close
7 - Reopen
8 - BountyStart (if VoteTypeId = 8, UserId will be populated)
9 - BountyClose
10 - Deletion
11 - Undeletion
12 - Spam
15 - ModeratorReview
16 - ApproveEditSuggestion
UserId (only present if VoteTypeId is 5 or 8)
CreationDate
BountyAmount (only present if VoteTypeId is 8 or 9)
And so based on that it sounds like the way it would be run is:
SELECT VoteTypeId FROM Votes WHERE PostId = ? AND (VoteTypeId = 2 OR VoteTypeId = 3)
And then based on the value, do the maths:
int score = 0;
for each vote in voteQueryResults
if(vote == 2) score++;
if(vote == 3) score--;
Even with millions of results, this is probably going to be a very fast operation as it's so simple.
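For the posts/votes schema from the question, the maths can also be left to the database. A sketch using an in-memory SQLite database as a stand-in for MySQL; the index name idx_votes_post and the helper post_score are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes ("
    "id INTEGER PRIMARY KEY, post_id INTEGER, user_id INTEGER, vote INTEGER)"
)
# An index on (post_id, vote) lets the engine answer the query from the
# index entries of one post instead of scanning the whole table.
conn.execute("CREATE INDEX idx_votes_post ON votes (post_id, vote)")
conn.executemany(
    "INSERT INTO votes (post_id, user_id, vote) VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 1), (1, 12, -1), (2, 10, -1)],
)

def post_score(conn, post_id):
    """Upvotes minus downvotes for one post; 0 when the post has no votes."""
    (score,) = conn.execute(
        "SELECT COALESCE(SUM(vote), 0) FROM votes WHERE post_id = ?",
        (post_id,),
    ).fetchone()
    return score

print(post_score(conn, 1), post_score(conn, 2), post_score(conn, 3))
```

Because vote is stored as 1 or -1, SUM(vote) is already the score; with a type code such as VoteTypeId, a SUM over a CASE expression would play the same role.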

Can SQL query do this?

I have a table "audit" with a "description" column, a "record_id" column and a "record_date" column. I want to select only those records where the description matches one of two possible strings (say, LIKE "NEW%" OR LIKE "ARCH%") where the record_id in each of those two matches each other. I then need to calculate the difference in days between the record_date of each other.
For instance, my table may contain:
id description record_id record_date
1 New Sub 1000 04/14/13
2 Mod 1000 04/14/13
3 Archived 1000 04/15/13
4 New Sub 1001 04/13/13
I would want to select only rows 1 and 3 and then calculate the number of days between 4/15 and 4/14 to determine how long it took to go from New to Archived for that record (1000). Both a New and an Archived entry must be present for any record for it to be counted (I don't care about ones that haven't been archived). Does this make sense and is it possible to calculate this in a SQL query? I don't know much beyond basic SQL.
I am using MySQL Workbench to do this.
The following is untested, but it should work assuming that any given record_id can only show up once with "New Sub" and once with "Archived".
select n.id as new_id
,a.id as archive_id
,record_id
,n.record_date as new_date
,a.record_date as archive_date
,DateDiff(a.record_date, n.record_date) as days_between
from audit n
join audit a using(record_id)
where n.description = 'New Sub'
and a.description = 'Archived';
I changed from OR to AND, because I thought you wanted only the number of days between records that were actually archived.
My test was in SQL Server, so the syntax might need to be tweaked slightly for your database (especially the DATEDIFF function), but you can select from the same table twice, one side grabbing the 'new' rows and one grabbing the 'archived' rows, then link them by record_id...
SELECT
newsub.id,
newsub.description,
newsub.record_date,
arc.id,
arc.description,
arc.record_date,
DATEDIFF(day, newsub.record_date, arc.record_date) AS DaysBetween
FROM
foo1 arc
, foo1 newsub
WHERE
(newsub.description LIKE 'NEW%')
AND
(arc.description LIKE 'ARC%')
AND
(newsub.record_id = arc.record_id)
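The join-the-table-to-itself idea can be tried against the sample rows with an in-memory SQLite database. This is a sketch under assumptions: dates are stored as ISO strings, and SQLite's julianday() stands in for MySQL's DATEDIFF().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit ("
    "id INTEGER PRIMARY KEY, description TEXT, record_id INTEGER, record_date TEXT)"
)
conn.executemany(
    "INSERT INTO audit VALUES (?, ?, ?, ?)",
    [
        (1, "New Sub", 1000, "2013-04-14"),
        (2, "Mod", 1000, "2013-04-14"),
        (3, "Archived", 1000, "2013-04-15"),
        (4, "New Sub", 1001, "2013-04-13"),
    ],
)

# n is the "new" side, a is the "archived" side; the inner join drops
# records that have no archived entry (record 1001 here).
rows = conn.execute(
    """
    SELECT n.record_id, n.id, a.id,
           CAST(julianday(a.record_date) - julianday(n.record_date) AS INTEGER)
    FROM audit n
    JOIN audit a ON a.record_id = n.record_id
    WHERE n.description LIKE 'New%' AND a.description LIKE 'Arch%'
    """
).fetchall()
print(rows)
```

Only record 1000 comes back, pairing rows 1 and 3 with a difference of one day.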

MySQL, how to repeat same line x times

I have a query that outputs address order data:
SELECT ordernumber
, article_description
, article_size_description
, concat(NumberPerBox,' pieces') as contents
, NumberOrdered
FROM customerorder
WHERE customerorder.id = 1;
I would like the above line to be output NumberOrdered (e.g. 50,000) divided by NumberPerBox (e.g. 2,000) = 25 times.
Is there a SQL query that can do this? I'm not against using temporary tables to join against if that's what it takes.
I checked out the previous questions, however the nearest one:
is to be posible in mysql repeat the same result
Only gave answers that give a fixed number of rows, and I need it to be dynamic depending on the value of (NumberOrdered div NumberPerBox).
The result I want is:
Boxnr Ordernr as_description contents NumberOrdered
------+--------------+----------------+-----------+---------------
1 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
2 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
....
25 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
First, let me say that I am more familiar with SQL Server so my answer has a bit of a bias.
Second, I did not test my code sample and it should probably be used as a reference point to start from.
It would appear to me that this situation is a prime candidate for a numbers table. Simply put, it is a table (usually called "Numbers") that is nothing more than a single PK column of integers from 1 to n. Once you've used a Numbers table and are aware of how it works, you'll start finding many uses for it, such as querying for time intervals, string splitting, etc.
That said, here is my untested response to your question:
SELECT
IV.number as Boxnr
,customerorder.ordernumber
,article_description
,article_size_description
,concat(customerorder.NumberPerBox,' pieces') as contents
,NumberOrdered
FROM
customerorder
INNER JOIN (
SELECT
Numbers.number
,customerorder.ordernumber
,customerorder.NumberPerBox
FROM
Numbers
INNER JOIN customerorder
ON Numbers.number BETWEEN 1 AND customerorder.NumberOrdered / customerorder.NumberPerBox
WHERE
customerorder.id = 1
) AS IV
ON customerorder.ordernumber = IV.ordernumber
As I said, most of my experience is in SQL Server. I reference http://www.sqlservercentral.com/articles/Advanced+Querying/2547/ (registration required). However, there appear to be quite a few resources available when I search for "SQL numbers table".
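The numbers-table technique can be tried end to end with an in-memory SQLite database. This is a simplified sketch, not the query above: it joins customerorder to the numbers table directly, leaves out article_size_description, and fills the numbers table with an assumed upper bound of 1,000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The numbers table: a single primary-key column holding 1..1000.
conn.execute("CREATE TABLE numbers (number INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1, 1001)])

conn.execute(
    "CREATE TABLE customerorder ("
    "id INTEGER PRIMARY KEY, ordernumber TEXT, article_description TEXT, "
    "NumberPerBox INTEGER, NumberOrdered INTEGER)"
)
conn.execute(
    "INSERT INTO customerorder VALUES (1, 'CORDO1245', 'Carrying bags', 2000, 50000)"
)

# Each order row is repeated once per box: numbers 1..(ordered / per box).
rows = conn.execute(
    """
    SELECT n.number AS Boxnr, c.ordernumber, c.article_description,
           c.NumberPerBox || ' pieces' AS contents, c.NumberOrdered
    FROM customerorder c
    JOIN numbers n
      ON n.number BETWEEN 1 AND c.NumberOrdered / c.NumberPerBox
    WHERE c.id = ?
    ORDER BY n.number
    """,
    (1,),
).fetchall()
print(len(rows), rows[0], rows[-1])
```

The numbers table has to reach at least the largest box count that can occur, and an order that does not divide evenly loses its partial last box unless the upper bound is rounded up.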