Update Random Sample in Large Table - sql-server-2008

Using SQL Server 2012, I have a table with 7 million rows. The PK column is a GUID (COMB GUID). I am trying to test the performance of a query, and first I need to update a random sampling of data; I want to change a column value (not the PK) of 50,000 rows.
Selecting TOP 50,000 ORDER BY NEWID() takes way too long; I think SQL Server is scanning the whole table. I also cannot seem to get the syntax right for TABLESAMPLE; it returns an empty set.
What is the best way to get this to work?

To treat it as an update, wrap the TABLESAMPLE query in a CTE:
;WITH x AS
(
  SELECT TOP (50000) col
  FROM dbo.table TABLESAMPLE (50000 ROWS)
)
UPDATE x SET col = 'something else';
But a couple of notes:
You probably won't see a huge performance improvement over ORDER BY NEWID(). On a table with 1 million rows this took over a minute on my machine.
The TOP is there because TABLESAMPLE doesn't guarantee an exact number of rows - it's based on a rough calculation of how many pages might contain 50,000 rows. You may end up with fewer or more depending on your fill factor, how many variable-length columns you have, how many NULL values there are, etc. The TOP above helps cap the result at 50,000 when the estimate leads to more pages being read, but it won't help if the estimate comes up short.
There is some discussion of this going on in another question right now.
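As a rough alternative sketch (not from the answer above), you can skip sorting entirely and sample roughly 0.7% of the 7 million rows (about 50,000) with a per-row random filter. The row count is only approximate, and the table is still scanned once:
-- approximate sampling: each row passes with ~0.7% probability
-- dbo.[table] and col stand for the same placeholders used above
UPDATE t
SET col = 'something else'
FROM dbo.[table] AS t
WHERE (CHECKSUM(NEWID()) & 0x7fffffff) % 1000 < 7;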

Related

MySQL: in designing a loot drop table, is it possible to specify a number of times the query repeats itself and outputs each result on the same table

As part of teaching myself SQL, I'm coding a loot drop table that I hope to use in D&D campaigns.
The simplest form of the query is:
SELECT rarity,
  CASE
    WHEN rarity = 'common'   THEN (SELECT item FROM common.table)
    WHEN rarity = 'uncommon' THEN (SELECT item FROM uncommon.table)
    -- ...etc
  END AS loot
FROM rarity.table
ORDER BY RAND() * (1 / weight)
LIMIT 1
the idea is that the query randomly chooses a rarity from the rarity.table based on a weighted probability. There are 10 types of rarity, each represented on the rarity.table as a single row and having a column for probabilistic weight.
If I want to randomly output 1 item (limit 1), this works great.
However, attempting to output more than 1 item at a time isn't probabilistic in that the query can only put out 1 row of each rarity. If say I want to roll 10 items (limit 10) for my players, it will just output all 10 rows, producing 1 item from each rarity, and never multiple of the higher weighted rarities.
I have tried something similar, creating a different rarity.table that is 1000 rows long; instead of a 'weight' column, the probabilistic weight is represented by the number of rows, e.g. common is rows 1-20, uncommon rows 21-35, etc.
Then writing the query as
ORDER BY RAND()
LIMIT x
-- (where x is the number of items I want to output)
and while this is better in some ways, its results are still limited by the number of rows for each rarity. I.e., if I set the limit to 100, it again just gives me the whole table without taking probability into consideration. This is fine in that I probably won't be rolling 100 items at once, but it feels incorrect that the output will always be limited to 20 common items, 15 uncommon, etc. This is also MUCH slower, as my actual code has a lot of case and sub-case statements.
So my thoughts moved on to whether it is possible to run the query with LIMIT 1, but have the query run x number of times and then include each result in the same output table, preserving probability and not being limited by the number of rows in the table. However, I haven't figured out how to do so.
Any thoughts on how to achieve these results? Or maybe a better approach?
Please let me know if I can clarify anything.
Thank you!
A big no-no is having several virtually identical tables (common and uncommon) as separate tables. Instead, have one table with an extra column to distinguish the types. That will let your sample query be written more simply, possibly with a JOIN.
attempting to output more than 1 item at a time isn't probabilistic in that the query can only put out 1 row of each rarity
Let's try to tackle that with something like
( SELECT ... WHERE ... 'common'   ORDER BY ... LIMIT 1 )
UNION
( SELECT ... WHERE ... 'uncommon' ORDER BY ... LIMIT 1 )
...
(The parentheses are required in MySQL when each branch has its own ORDER BY / LIMIT.)
If you don't want the entire list like that, then do
(
((the UNION above))
) ORDER BY RAND() LIMIT 3; -- to pick 3 of the 10
Yes, it looks inefficient. But ORDER BY RAND() LIMIT 1 is inherently inefficient -- it fetches the entire table, shuffles the rows, then peels off one row.
Munch on those. There are other possibilities.
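A minimal sketch of that idea, assuming the single combined table suggested above is called items with columns item and rarity (these names are illustrative, not from the original post):
SELECT * FROM (
  ( SELECT item, rarity FROM items WHERE rarity = 'common'   ORDER BY RAND() LIMIT 1 )
  UNION ALL
  ( SELECT item, rarity FROM items WHERE rarity = 'uncommon' ORDER BY RAND() LIMIT 1 )
  UNION ALL
  ( SELECT item, rarity FROM items WHERE rarity = 'rare'     ORDER BY RAND() LIMIT 1 )
  -- ...one branch per rarity...
) AS all_rarities
ORDER BY RAND() LIMIT 3;  -- keep 3 of the branches at random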
While I'm sure there is room for improvement / optimization, I actually figured out a solution for myself, in case anyone is interested.
Instead of querying the rarity table first, I made a new table that is thousands of entries long, called rolls.table, and query that table first. Here, LIMIT works as a way to select the number of rolls I want to make.
Then, every time this table outputs a row the query selects from the rarity.table independently.
Does that make sense?
I'll work with this for now, but would love to hear how to make it better.... it takes like 20 seconds for the output table to load haha.

MySQL query speed on a table which has 1.5 million rows

It takes around 5 seconds to get the result of a query on a table containing 1.5 million rows. The query is "select * from table where code=x".
Is there a setting to increase speed? Or should I jump to another database apart from MySQL?
You could index the code column. Note that the trade off is that inserting new rows or updating the code column on existing rows will be slowed down a bit since the index also needs to be updated. In any event, you should benchmark the improvement to make sure it's worth it.
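A minimal sketch of adding that index; the real table name isn't given in the question, so mytable stands in for it:
-- create a secondary index on code (the name idx_code is arbitrary)
ALTER TABLE mytable ADD INDEX idx_code (code);
-- the original query can then seek on the index instead of scanning every row
SELECT * FROM mytable WHERE code = 'x';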
WHERE code=x -- needs INDEX(code)
SELECT * when many of the columns are bulky: Large columns are stored "off-record". Hence they take longer to fetch. So, explicitly list the columns you really need, hoping to leave out some of the bulky columns.
When a GROUP BY or LIMIT is involved, it is sometimes best to do
SELECT lots of columns
FROM ( SELECT id FROM t WHERE ... group-by or limit ) AS x
JOIN t AS y USING(id)
etc.
That is, start by finding just the ids as simply as possible, then JOIN back to the original table and other table(s). (This is not the case you presented, but I worry that you over-simplified it.)
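A hypothetical concrete version of that pattern (the table, column, and limit are assumed for illustration):
-- find just the matching ids cheaply, then join back for the bulky columns
SELECT y.*
FROM ( SELECT id FROM t WHERE code = 'x' ORDER BY id LIMIT 100 ) AS x
JOIN t AS y USING (id);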

Computational Complexity of SELECT DISTINCT(column) FROM table on an indexed column

Question
I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling
SELECT DISTINCT(column) FROM table
or
SELECT * FROM table GROUP BY column
on a column that IS indexed? Is it proportional to the number of rows or to the number of distinct values in the column? I believe that would be O(1)*NUM_DISTINCT_COLS vs O(NUM_OF_ROWS).
Background
For example, if I have 10 million rows but only 10 distinct values/groups in that column, visually you could simply count the last item in each group, so the time complexity would be tied to the number of distinct groups and not the number of rows. The calculation would take the same amount of time for 1 million rows as it would for 100. I believe the complexity would be
O(1)*Number_Of_DISTINCT_ELEMENTS
But in the case of MySQL, if I have 10 distinct groups, will MySQL still seek through every row, basically calculating a running sum for each group, or is it set up in such a way that a group of rows with the same value can be handled in O(1) time for each distinct column value? If not, then I believe it would mean the complexity is
O(NUM_ROWS)
Why Do I Care?
I have a page in my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM(), but I was under the impression this will take longer as the number of messages grows, so instead I keep a table of stats for each category. When a new message is sent or created I increment the total_messages field. When I want to view the stats page I simply select a single row
SELECT total_unread_messages FROM stats WHERE category_id = x
instead of calculating those stats live across all messages using GROUP BY and/or DISTINCT.
The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.
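A minimal sketch of the counter-table approach described above, using the table and column names from the question and a made-up category id:
-- bump the per-category counter whenever a message arrives
UPDATE stats SET total_messages = total_messages + 1 WHERE category_id = 42;
-- the stats page then needs only a single-row lookup
SELECT total_unread_messages FROM stats WHERE category_id = 42;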
If you are doing:
select distinct column
from table
And there is an index on column, then MySQL can process this query using a "loose index scan" (described here).
This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).
I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.
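A quick way to check which plan MySQL actually picks is EXPLAIN; "Using index for group-by" in the Extra column indicates a loose index scan (col and t below are placeholders for the indexed column and its table):
EXPLAIN SELECT DISTINCT col FROM t;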

Questions on how to randomly query multiple rows from MySQL without using "ORDER BY RAND()"

I need to query MySQL with some condition and get five random, different rows from the result.
Say, I have a table named 'user', and a field named 'cash'. I can compose a SQL like:
SELECT * FROM user WHERE cash < 1000 ORDER BY RAND() LIMIT 5;
The result is good, totally random, unsorted and different from each other, exact what I want.
But I learned from Google that the efficiency is bad when the table gets large, because MySQL creates a temporary table with all the result rows and assigns each one of them a random sorting index. The results are then sorted and returned.
Then I went on searching and found a solution like:
SELECT * FROM `user` AS t1
JOIN (SELECT ROUND(RAND() * ((SELECT MAX(id) FROM `user`) - (SELECT MIN(id) FROM `user`)) + (SELECT MIN(id) FROM `user`)) AS id) AS t2
WHERE t1.id >= t2.id AND cash < 1000
ORDER BY t1.id
LIMIT 5;
This method uses JOIN and MAX(id), and the efficiency is better than the first one according to my testing. However, there is a problem: since I also need the condition "cash<1000", if the RAND() value is so big that no row past it has cash<1000, then no result will be returned.
Does anyone have a good idea of how to compose SQL that has the same effect as the first query but better efficiency?
Or, shall I just do simple query in MYSQL and let PHP randomly pick 5 different rows from the query result?
Your help is appreciated.
To make the first query faster, just SELECT the id - that will make the temporary table rather small (it will contain only IDs and not all fields of each row) and maybe it will fit in memory (temp tables with text/blob columns are always created on disk, for example). Then, when you get the result, run another query: SELECT * FROM xy WHERE id IN (a,b,c,d,...). As you mentioned this approach is not very efficient, but as a quick fix this modification will make it several times faster.
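A hedged sketch of that two-step version, reusing the user table from the question (the ids in the second query are made up; in practice they come from the first query's result):
-- step 1: randomize only the ids, so the temp table stays small
SELECT id FROM user WHERE cash < 1000 ORDER BY RAND() LIMIT 5;
-- step 2: fetch the full rows for those ids
SELECT * FROM user WHERE id IN (17, 203, 958, 1311, 4042);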
One of the best approaches seems to be getting the total number of rows, choosing random numbers, and for each of them running a new query: SELECT * FROM xy WHERE abc LIMIT $random,1. It should be quite efficient for 3-5 random rows, but not good if you want 100 random rows each time :)
Also consider caching your results. Often you don't need different random rows to be displayed on each page load. Generate your random rows only once per minute. If you generate the data via cron, for example, you can also live with a query that takes several seconds, as users will see the old data while the new data is being generated.
Here are some of my bookmarks for this problem for reference:
http://jan.kneschke.de/projects/mysql/order-by-rand/
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

randomizing large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235 But it's not exactly random, and it doesn't play well with a LIMIT clause; you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which IDs have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey  Status  StatusPkey
1     A       1
2     A       2
3     B       1
4     B       2
5     C       1
...   C       ...
n     C       m    (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit, though, since calculating the StatusPkey could get expensive.
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
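A rough sketch of that secondary table and the filtered lookup, under assumed names (status_lookup, data_table) that are not from the original post:
-- pkey mirrors the data table's PK; status_pkey is the 1-based per-status counter
CREATE TABLE status_lookup (
  pkey        INT NOT NULL PRIMARY KEY,
  status      CHAR(1) NOT NULL,
  status_pkey INT NOT NULL,
  UNIQUE KEY (status, status_pkey)
);
-- to get one random row with status 'C': pick a random number between 1 and the
-- count of 'C' rows in application code (say 1234), then join back to the data
SELECT d.*
FROM status_lookup s
JOIN data_table d USING (pkey)
WHERE s.status = 'C' AND s.status_pkey = 1234;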
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.