SQL select a sample of rows - sql-server-2008

I need to select sample rows from a set. For example, if my select query returns x rows and x is greater than 50, I want only 50 rows returned, but not simply the top 50: I want 50 rows that are evenly spread over the result set. The table in this case records routes (GPS locations plus a DateTime).
I am ordering on DateTime and need a reasonable sample of the Latitude and Longitude values.
Thanks in advance
[ SQL Server 2008 ]

To get sample rows in SQL Server, use this query:
SELECT TOP 50 * FROM Table
ORDER BY NEWID();
If you want to get every n-th row (every 10th, in this example), try this query (ROW_NUMBER rather than DENSE_RANK, since DENSE_RANK gives tied values the same rank and would not step through the rows one by one):
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Column ASC) AS RowNum
FROM Table
) AS Ranking
WHERE RowNum % 10 = 0;
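Applied to the route table from the question, a sketch (dbo.RouteLog, RecordedAt, Latitude, and Longitude are assumed names): keep every step-th row in DateTime order, and fall back to every row when there are 50 or fewer:
SELECT TOP (50) Latitude, Longitude, RecordedAt
FROM (
    SELECT Latitude, Longitude, RecordedAt,
           ROW_NUMBER() OVER (ORDER BY RecordedAt) AS rn,   -- position in time order
           COUNT(*)     OVER ()                    AS total -- total row count
    FROM dbo.RouteLog
) AS numbered
WHERE rn % (CASE WHEN total > 50 THEN total / 50 ELSE 1 END) = 0
ORDER BY RecordedAt;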
More examples of queries selecting random rows for other popular RDBMS can be found here: http://www.petefreitag.com/item/466.cfm

Every n-th row to get 50 (a window function can't appear directly in WHERE, so it goes in a derived table):
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER (ORDER BY SomeColumn) AS rn,
COUNT(*) OVER () / 50 AS step
FROM table t
) AS numbered
WHERE rn % step = 0
FETCH FIRST 50 ROWS ONLY
And if you want a random sample, go with jimmy_keen's answer.
UPDATE:
In regard to the requirement for it to run on MS SQL, I think it should be changed to this (no MS SQL Server around to test, though; T-SQL requires an ORDER BY inside OVER, and the derived table needs an alias):
SELECT TOP 50 *
FROM (
SELECT t.*, ROW_NUMBER() OVER (ORDER BY SomeColumn) AS rn, (SELECT COUNT(*) FROM table) / 50 AS step
FROM table t
) AS numbered
WHERE rn % step = 0

I suggest adding a calculated column to your result set that holds a random number, and then selecting the top 50 sorted by that column. That will give you a random sample.
For example:
SELECT TOP 50 *, RAND(Id) AS Random
FROM SourceData
ORDER BY Random
where SourceData is your source data table or view. This assumes T-SQL on SQL Server 2008, by the way. It also assumes that you have an Id column with unique ids on your data source. If your ids are small, consecutive numbers, it is good practice to multiply them by a large integer before passing them to RAND (nearby seeds yield nearly identical values), like this:
RAND(Id * 10000000)
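A common alternative that needs no Id column at all is to order by a per-row random checksum (a sketch; CHECKSUM(NEWID()) yields an independent pseudo-random integer for each row, avoiding the seed-spacing issue entirely):
SELECT TOP 50 *
FROM SourceData
ORDER BY CHECKSUM(NEWID());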

If you want a statistically correct sample, TABLESAMPLE is the wrong solution: it samples whole pages rather than rows. A good solution, as I described here based on a Microsoft Research paper, is to create a materialized view over your table which includes an additional column like
CAST(ROW_NUMBER() OVER (...) % 100 AS tinyint) AS RAND_COL_
(a modulus keeps the value in range for a one-byte column). You can then add an index on this column, plus other interesting columns, and get statistically correct samples for your queries fairly quickly, by using WHERE RAND_COL_ = 1.
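A minimal sketch of the same idea using a persisted computed column instead of a materialized view (table and column names are hypothetical; a hash of the key is deterministic, so it can be persisted and indexed):
ALTER TABLE dbo.Routes
    ADD RAND_COL_ AS CAST(ABS(CHECKSUM(Id) % 100) AS tinyint) PERSISTED;
CREATE INDEX IX_Routes_RandCol ON dbo.Routes (RAND_COL_)
    INCLUDE (Latitude, Longitude, RecordedAt);
-- roughly a 1% sample, served from the index:
SELECT Latitude, Longitude, RecordedAt
FROM dbo.Routes
WHERE RAND_COL_ = 1;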

Related

SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. It works well, but I started to encounter some performance issues with my SELECT request.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried the following solution:
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects only one row below the randomly generated id number. This works pretty well (an average of less than 100 ms per request on 150k rows). But after my program has processed a row, its FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my request, doing this :
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that request there are no more problems of skipping rows, but performance decreases drastically (an average of 1 s per request on 150k rows).
I could simply execute the fast one while I still have a lot of rows to process and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the job.
Thank you for your help; please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > R.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is added to be sure that the smallest matching id, the "next" one, is returned; without it, MySQL is free to return any matching row, in theory even the maximum id every time.
The problem is gaps in ids, and it is easy to create distributions where some rows can never be chosen. Say the four ids are 1, 5, 6, and 7: with COUNT(*) the random threshold never exceeds 3, so the smallest id above it is always 1 or 5, and rows 6 and 7 are never returned. Seeded with MAX(id), every row is reachable.
Perhaps the simplest solution to your problem is to run the first query multiple times until it returns a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use; that might speed up the query.
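The suggested index, for reference (backticks because TABLE is a reserved word in MySQL):
CREATE INDEX idx_fk_id ON `Table` (FOREIGNKEY_ID, ID);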
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time; on 150k rows, roughly 150 survive the 1/1000 filter (in the unlikely case none do, just rerun the query).
Unfortunately, this still requires scanning the table, so you probably won't get responses in the 100 ms range on a 150k-row table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
from t
where random >= @random
order by random
limit 10
) union all
(select t.*
from t
where random < @random
order by random desc
limit 10
)
) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary: 10 before and 10 after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number), randomized, and one is returned.
If you chose ids directly, arbitrary gaps would make the chosen values not quite uniform. By taking a larger sample around the chosen value, the probability of any one row being chosen approaches a uniform distribution. There would still be edge effects, but they should be minor on a large amount of data.
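The trigger mentioned above might look like this (a sketch; it keeps the random column populated for newly inserted rows, assuming the table really is named t):
CREATE TRIGGER t_random_bi BEFORE INSERT ON t
FOR EACH ROW SET NEW.random = RAND();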
Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.
A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records, so a COUNT(*)-based random threshold never exceeds 6. With the query above, thresholds 3 through 6 all land on record 10, so the selection is unbalanced (it picks 10 far too often) and records 11-13 are never picked.
To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer ids will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time, or, for example, 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If a prepared statement can be used, then this should work:
SELECT @skip := FLOOR(RAND() * COUNT(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
The offset in the LIMIT clause is what skips the rows here.

MySQL split table query using varchar column

I have a table in MySQL which I want to query in parallel by executing multiple select statements that select non-overlapping, equal parts of the table, like:
1. select * from mytable where col between 1 and 1000
2. select * from mytable where col between 1001 and 2000
...
The problem is that col in my case is varchar. How can I split the query in this case?
In Oracle we can operate with NTILE in combination with rowids, but I didn't find a similar approach for MySQL.
That's why my thinking is to hash the col value and mod it by the number of equal parts I want to have (see the sketch after this question). Or, instead of hashing, dynamically generated rownums could be used.
What would be an optimal solution, considering that the table is big (xxxM rows) and I want to avoid full table scans for each of the queries?
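A minimal sketch of the hash-and-mod idea (using MySQL's CRC32; note that each such query still scans the full table unless the bucket value is precomputed into an indexed column):
-- split the table into 4 non-overlapping parts:
SELECT * FROM mytable WHERE MOD(CRC32(col), 4) = 0;
SELECT * FROM mytable WHERE MOD(CRC32(col), 4) = 1;
-- ... and likewise for buckets 2 and 3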
You can use limit for the purpose of paging, so you will have:
1. select * from mytable limit 0, 1000
2. select * from mytable limit 1000, 1000
(Add an ORDER BY if the pages must be deterministic and non-overlapping; without one, MySQL may return rows in any order.)
You can cast the varchar column to an integer like this: CAST(col AS SIGNED). (MySQL's CAST targets are SIGNED/UNSIGNED rather than INT.)
With an index on ID, this produces the page without scanning the full table:
SELECT * FROM mytable
ORDER BY ID
OFFSET 0 ROWS
FETCH NEXT 100 ROWS ONLY
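Note that OFFSET ... FETCH is SQL Server syntax; since this question is about MySQL, the equivalent page would be:
SELECT * FROM mytable
ORDER BY ID
LIMIT 0, 100;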

mysql randomizing result and optimization

I want to have randomized rows from a query, but using order by rand() is just exhausting on a table that has 120k+ rows. I have found a small solution that outputs a given number of rows, but it starts from a random index and returns the rows that follow it. It is pretty fast, but it only returns consecutive rows after a random position. The code goes like:
SELECT *
FROM lieky AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(col_0)
FROM lieky)) AS id)
AS r2
WHERE r1.col_0 >= r2.id
ORDER BY r1.col_0 ASC
LIMIT 100
and I found it here: http://jan.kneschke.de/projects/mysql/order-by-rand/
Is there something that would help me?
I am trying to get randomized data into pagination, so that when a user queries the database, they always get the rows in a random order.
Thanks for help.
It should be noted that
(SELECT (RAND() * (SELECT MAX(col_0) FROM lieky)) AS id)
can return a value close to MAX(col_0), in which case you'll get only one row (because of WHERE r1.col_0 >= r2.id).
I think a good solution would be something like this:
add two columns, groupId int and seed int, and add an index on (groupId, seed)
every x seconds (maybe every hour or every day) run a script that recalculates these columns (see below)
when a user opens your row list for the first time (or when you want to re-randomize the items), save a random groupId in the user's session; groupId can be from 0 to (select max(groupId) from lieky)
to show rows, use a query like (select * from lieky where groupId=%saved groupId% order by Seed limit x,100) - it should be very fast
The recalc script will be rather slow, so it's a good idea to run it at night.
The seed you can update using:
update lieky set Seed = rand()*1000000
Then set GroupId=0 for the first N rows, GroupId=1 for the following N rows, and so on. N is the maximum number of rows you can show a user: (max_page)*(per_page_count).
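A hypothetical recalc script for this scheme (N = 1000 rows per group; the classic user-variable trick numbers the rows in their new seed order):
-- step 1: re-seed every row
UPDATE lieky SET Seed = FLOOR(RAND() * 1000000);
-- step 2: assign group ids in blocks of 1000, following the new seed order
SET @rownum := -1;
UPDATE lieky
SET groupId = FLOOR((@rownum := @rownum + 1) / 1000)
ORDER BY Seed;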

MySQL: return top 16 results in random order

I am trying to query my database to return, say, the top 16 results (ordered by a field called rank), but in a random order.
I can do this easily by shuffling the returned (and ordered) 16 results in PHP. I am wondering if there is an easy way to do this directly in the query itself.
try
select * from
(
select * from your_table
order by rank
limit 16
) x
order by rand()

MySQL LIMIT clause equivalent for SQL SERVER

I've been doing a lot of reading on alternatives to the LIMIT clause for SQL Server. It's so frustrating that they still refuse to adopt it. Anyway, I really haven't been able to get my head around this. The query I'm trying to convert is this:
SELECT ID, Name, Price, Image FROM Products ORDER BY ID ASC LIMIT $start_from, $items_on_page
Any assistance would be much appreciated, thank you.
In SQL Server 2012, there is support for the ANSI standard OFFSET / FETCH syntax. I blogged about this and here is the official doc (this is an extension to ORDER BY). Your syntax converted for SQL Server 2012 would be:
SELECT ID, Name, Price, Image
FROM Products
ORDER BY ID ASC
OFFSET (@start_from - 1) ROWS -- not sure if you need -1
-- because I don't know how you calculated @start_from
FETCH NEXT @items_on_page ROWS ONLY;
Prior to that, you need to use various workarounds, including the ROW_NUMBER() method. See this article and the follow-on discussion. If you are not on SQL Server 2012, you can't use standard syntax or MySQL's non-standard LIMIT but you can use a more verbose solution such as:
;WITH o AS
(
SELECT TOP ((@start_from - 1) + @items_on_page)
-- again, not sure if you need -1 because I
-- don't know how you calculated @start_from
RowNum = ROW_NUMBER() OVER (ORDER BY ID ASC)
/* , other columns */
FROM Products
)
SELECT
RowNum
/* , other columns */
FROM
o
WHERE
RowNum >= @start_from
ORDER BY
RowNum;
There are many other ways to skin this cat, this is unlikely to be the most efficient but syntax-wise is probably simplest. I suggest reviewing the links I posted as well as the duplicate suggestions noted in the comments to the question.
For SQL Server 2005 and 2008
This is an example query to select rows 11 to 20 from the Report table, ordered by LastName.
SELECT a.* FROM
(SELECT *, ROW_NUMBER() OVER (ORDER BY LastName) as row FROM Report) a
WHERE a.row > 10 and a.row <= 20
Try this:
SELECT TOP $items_on_page ID, Name, Price, Image
FROM (SELECT TOP ($start_from + $items_on_page - 1) * FROM Products ORDER BY ID) AS T
ORDER BY ID DESC
EDIT: Explanation-
There is no getting around the subquery, but this is an elegant solution (note the parentheses: T-SQL only accepts an expression after TOP when it is parenthesized).
Say you wanted 10 items per page, starting at the 5th row: this would give you the bottom 10 rows of the top 14 rows, essentially LIMIT 5,10. The rows come back in descending order; see the sketch below for restoring ascending order.
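If the page must come back in ascending order, wrap the query once more; a sketch with the same placeholders:
SELECT * FROM
(
SELECT TOP $items_on_page ID, Name, Price, Image
FROM (SELECT TOP ($start_from + $items_on_page - 1) ID, Name, Price, Image
FROM Products ORDER BY ID) AS T1
ORDER BY ID DESC
) AS T2
ORDER BY ID ASC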
you can use SET ROWCOUNT, which caps the number of rows returned by the statements that follow; when you're done, reset it with SET ROWCOUNT 0.
SET ROWCOUNT 100
or
you can try using a TOP query:
SELECT TOP 100 * FROM Sometable ORDER BY somecolumn
If you allow the application to store a tad of state, you can do this using just TOP items_on_page. What you do is to store the ID of the last item you retrieved, and for each subsequent query add AND ID > [last ID]. So you get the items in batches of items_on_page, starting where you left off each time.
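A sketch of that keyset approach (the @last_id variable is illustrative; the application supplies the last ID it displayed):
DECLARE @last_id int;
SET @last_id = 0; -- ID of the last row already shown; 0 for the first page
SELECT TOP 10 ID, Name, Price, Image
FROM Products
WHERE ID > @last_id
ORDER BY ID ASC;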