Randomizing a large dataset - MySQL

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random and it doesn't play well with a LIMIT clause: you don't always get the number of records you want.
So I thought, since the PK is auto_increment, I could just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records with a specific status, a status found in at most 5% of the total set. To make that work I would first need to find out which IDs have that specific status, so that's not going to work either.
I am using MySQL 5.1.46 with the MyISAM storage engine.
It might be important to know that the query selecting the random rows will be run very often, and the table it selects from is appended to frequently.
Any help would be greatly appreciated!

You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column, which will be a kind of sub-pkey that you auto-number yourself (a 1-based auto-increment relative to a single status)
Pkey  Status  StatusPkey
1     A       1
2     A       2
3     B       1
4     B       2
5     C       1
...   C       ...
n     C       m    (where m = number of rows with status C)
When you don't need to filter, you can generate random numbers against the pkey as you mentioned above. When you do need to filter, generate random numbers against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could maintain it live. The latter would be a performance hit, though, since calculating the StatusPkey could get expensive.
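A periodic rebuild might look roughly like the sketch below (the table and column names are made up for illustration, and the user-variable trick stands in for ROW_NUMBER(), which MySQL 5.1 does not have):

-- Hypothetical lookup table; data_table, id and status are assumed names.
CREATE TABLE status_lookup (
  pkey        INT NOT NULL,
  status      CHAR(1) NOT NULL,
  status_pkey INT NOT NULL,
  PRIMARY KEY (pkey),
  KEY idx_status (status, status_pkey)
) ENGINE=MyISAM;

-- Number the rows 1..m within each status while copying them over.
INSERT INTO status_lookup (pkey, status, status_pkey)
SELECT id, status, seq
FROM (
  SELECT d.id, d.status,
         @seq  := IF(d.status = @prev, @seq + 1, 1) AS seq,
         @prev := d.status AS prev_status
  FROM data_table d
  CROSS JOIN (SELECT @seq := 0, @prev := NULL) AS init
  ORDER BY d.status, d.id
) AS numbered;

-- Picking one random row of status 'C' (RAND() is evaluated once, outside the WHERE):
SELECT MAX(status_pkey) INTO @m FROM status_lookup WHERE status = 'C';
SET @r := FLOOR(1 + RAND() * @m);
SELECT pkey FROM status_lookup WHERE status = 'C' AND status_pkey = @r;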

Check out this article by Jan Kneschke (http://jan.kneschke.de/projects/mysql/order-by-rand/)... It does a great job of explaining the pros and cons of different approaches to this problem.

You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT ROUND(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
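As a rough sketch of those two steps (MyTable and the status condition are placeholders; FLOOR is used rather than ROUND so the offset can never equal the row count, which would return an empty result):

-- Step 1: random offset among the matching rows.
SELECT CAST(FLOOR(RAND() * COUNT(*)) AS UNSIGNED) INTO @offset
FROM MyTable WHERE status = 'X';

-- Step 2: fetch the row at that offset. The offset is normally bound from
-- application code; a server-side prepared statement works as well.
PREPARE pick FROM 'SELECT * FROM MyTable WHERE status = ''X'' LIMIT 1 OFFSET ?';
EXECUTE pick USING @offset;
DEALLOCATE PREPARE pick;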
Not every problem must be solved in a single SQL query.

Related

MySQL: in designing a loot drop table, is it possible to specify the number of times a query repeats itself and output each result in the same table?

As part of teaching myself SQL, I'm coding a loot drop table that I hope to use in D&D campaigns.
The simplest form of the query is:
SELECT rarity,
CASE
WHEN item='common' THEN (SELECT item FROM common.table)
WHEN item='uncommon' THEN (SELECT item FROM uncommon.table)
...etc
END AS loot
FROM rarity.table
ORDER BY RAND()*(1/weight)
LIMIT 1
The idea is that the query randomly chooses a rarity from the rarity.table based on weighted probability. There are 10 types of rarity, each represented in the rarity.table as a single row with a column for probabilistic weight.
If I want to randomly output 1 item (limit 1), this works great.
However, attempting to output more than 1 item at a time isn't probabilistic, in that the query can only put out 1 row of each rarity. If, say, I want to roll 10 items (LIMIT 10) for my players, it will just output all 10 rows, producing 1 item from each rarity, and never multiples of the higher-weighted rarities.
I have tried something similar, creating a different rarity.table that was 1000 rows long where, instead of a 'weight' column, probabilistic weight was represented by the number of rows per rarity, e.g. common is rows 1-20, uncommon rows 21-35, etc.
Then writing the query as
ORDER BY RAND()
LIMIT x
-- (where x is the number of items I want to output)
and while this is better in some ways, the results are still limited by the number of rows for each rarity. I.e. if I set the limit to 100, it again just gives me the whole table without taking probability into consideration. This is fine in that I probably won't be rolling 100 items at once, but it feels incorrect that the output will always be limited to
20 common items, 15 uncommon, etc. It is also MUCH slower, as my actual code has a lot of case and sub-case statements.
So my thinking moved on to whether it is possible to run the query with LIMIT 1, but have it run x number of times and include each result in the same output, preserving probability and not being limited by the number of rows in the table. However, I haven't figured out how to do so.
Any thoughts on how to achieve these results? Or maybe a better approach?
Please let me know if I can clarify anything.
Thank you!
A big no-no is having several virtually identical tables (common and uncommon) as separate tables. Instead, have one table with an extra column to distinguish the types. That will let your sample query be written more simply, possibly with a JOIN.
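For instance, a minimal sketch of that consolidation (the table and column names here are invented, not taken from the question):

-- One table for all loot, with rarity as a column instead of per-rarity tables.
CREATE TABLE items (
  id     INT AUTO_INCREMENT PRIMARY KEY,
  name   VARCHAR(100) NOT NULL,
  rarity VARCHAR(20)  NOT NULL,   -- 'common', 'uncommon', ...
  KEY idx_rarity (rarity)
);

CREATE TABLE rarity_weights (
  rarity VARCHAR(20) PRIMARY KEY,
  weight DECIMAL(8,4) NOT NULL    -- probabilistic weight used when rolling
);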
attempting to output more than 1 item at a time isn't probabilistic in that the query can only put out 1 row of each rarity
Let's try to tackle that with something like
SELECT ... WHERE ... 'common' ORDER BY ... LIMIT 1
UNION
SELECT ... WHERE ... 'uncommon' ORDER BY ... LIMIT 1
...
If you don't want the entire list like that, then do
(
((the UNION above))
) ORDER BY RAND() LIMIT 3; -- to pick 3 of the 10
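For example, a concrete version of that idea, assuming the single items table sketched earlier (each branch needs its own parentheses so its ORDER BY/LIMIT apply per rarity, and UNION ALL keeps every branch instead of de-duplicating):

SELECT name, rarity
FROM (
  ( SELECT name, rarity FROM items WHERE rarity = 'common'   ORDER BY RAND() LIMIT 1 )
  UNION ALL
  ( SELECT name, rarity FROM items WHERE rarity = 'uncommon' ORDER BY RAND() LIMIT 1 )
  -- ... one branch per rarity ...
) AS one_per_rarity
ORDER BY RAND()
LIMIT 3;   -- keep 3 of the 10 branches at random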
Yes, it looks inefficient. But ORDER BY RAND() LIMIT 1 is inherently inefficient -- it fetches the entire table, shuffles the rows, then peels off one row.
Munch on those. There are other possibilities.
While I'm sure there is room for improvement/optimization, I actually figured out a solution for myself, in case anyone is interested.
Instead of the first query being against the rarity table, I made a new table that is thousands of entries long, called rolls.table, and query this table first. Here, the LIMIT clause works as a way to select the number of rolls I want to make.
Then, every time this table outputs a row the query selects from the rarity.table independently.
Does that make sense?
I'll work with this for now, but would love to hear how to make it better.... it takes like 20 seconds for the output table to load haha.

How to set up MySQL tables for fast SELECT

The question is about *.FIT files (link to definition), of which there are 1 to extremely many and constantly more, coming from sports watches, speedometers, etc.,
in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database to calculate e.g. heart rates in relation to altitude over several FIT files / training units / time periods.
Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility of integrating more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything into one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:
1 x tbl_file

id  filename  date
1   xyz.fit   2022-01-01
2   vwx.fit   2022-01-02
..  ..        ..
n x tbl_parameter_xy / tbl_parameter_yz / ....

id  timestamp/counter  file_id  value
1   0                  1        value
2   1                  1        value
3   0                  2        value
..  ..                 ..       ..
And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.
I then used a test server, set up a MySQL DB to test this, and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
Took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an error somewhere, and that's the reason why it takes 30 seconds.
In comparison, fetching from a "large complete" table was done in 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution.
I created an index (file_id) for each of the parameter tables, but the result was just as slow, or slower.
Right now I'm thinking about building that big "all-in-one" table, which makes it easier to handle and faster to select from, but I would have to update it frequently to insert new columns for new parameters. And I'm afraid it will grow so big it kills itself.
I have 2 questions:
Which table setup is recommended, primarily with a focus on SELECT speed and secondarily on the size of the DB?
Do I have a basic bug in my SELECT that makes it so slow?
Run EXPLAIN SELECT on your query to see the plan MySQL chooses.
You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows that share a file_id in your two parameter tables.
If you say
SELECT * FROM a LEFT JOIN b
with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this
SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id
which gives you a similarly bloated result set.
You need another ON condition... possibly try this.
SELECT *
FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
AND x.timestamp = y.timestamp
if the timestamps in the two tables are somehow in sync.
But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type), you have almost certainly misdesigned your database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will go into a table like this.
file_id       (part of your primary key)
parameter_id  (part of your primary key)
timestamp     (part of your primary key)
value
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do
SELECT timestamp, parameter_id, value
FROM observation_table
WHERE file_id = xxxx
AND parameter_id IN (2,3)
ORDER BY timestamp, parameter_id
The multicolumn primary key I suggested will optimize this particular query.
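A sketch of such an observation table (names and types are assumptions; the composite primary key matches the columns listed above):

CREATE TABLE observation (
  file_id      INT UNSIGNED      NOT NULL,   -- references tbl_file.id
  parameter_id SMALLINT UNSIGNED NOT NULL,   -- which parameter this value belongs to
  ts           INT UNSIGNED      NOT NULL,   -- timestamp or counter within the file
  value        FLOAT             NOT NULL,
  PRIMARY KEY (file_id, parameter_id, ts)
) ENGINE=InnoDB;

With InnoDB, rows are clustered by that primary key, so all values for one file and parameter are stored next to each other, which is exactly what the query above reads.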
Once you have this working, read about denormalization.

Questions on how to randomly query multiple rows from MySQL without using "ORDER BY RAND()"

I need to query MySQL with some condition and get five different random rows from the result.
Say I have a table named 'user' and a field named 'cash'. I can compose SQL like:
SELECT * FROM user WHERE cash < 1000 ORDER BY RAND() LIMIT 5;
The result is good: totally random, unsorted, and the rows are different from each other, exactly what I want.
But I learned from Google that the efficiency is bad when the table gets large, because MySQL creates a temporary table with all the result rows and assigns each one of them a random sorting index. The results are then sorted and returned.
Then I go on searching and got a solution like:
SELECT * FROM `user` AS t1
JOIN (SELECT ROUND(RAND() * ((SELECT MAX(id) FROM `user`) - (SELECT MIN(id) FROM `user`)) + (SELECT MIN(id) FROM `user`)) AS id) AS t2
WHERE t1.id >= t2.id AND cash < 1000
ORDER BY t1.id
LIMIT 5;
This method uses JOIN and MAX(id), and its efficiency is better than the first one according to my testing. However, there is a problem: since I also need the condition "cash < 1000", if the RAND() value lands so high that no row at or beyond it has cash < 1000, then no result is returned.
Does anyone have a good idea of how to compose SQL that has the same effect as the first query but better efficiency?
Or shall I just do a simple query in MySQL and let PHP randomly pick 5 different rows from the query result?
Your help is appreciated.
To make the first query faster, just SELECT id: that will make the temporary table rather small (it will contain only IDs and not all fields of each row) and maybe it will fit in memory (temporary tables with text/blob columns are always created on disk, for example). Then when you get the result, run another query: SELECT * FROM xy WHERE id IN (a,b,c,d,...). As you mentioned, this approach is not very efficient, but as a quick fix this modification will make it several times faster.
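A sketch of that two-pass idea, using the user/cash example from the question (the id values in the second query stand in for whatever the first query returned):

-- Pass 1: sort only the ids, so the temporary table stays small.
SELECT id FROM user WHERE cash < 1000 ORDER BY RAND() LIMIT 5;

-- Pass 2: fetch the full rows for the ids returned by pass 1 (placeholder values).
SELECT * FROM user WHERE id IN (17, 203, 951, 1422, 3008);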
One of the best approaches seems to be getting the total number of rows, choosing random numbers, and for each of them running a new query: SELECT * FROM xy WHERE abc LIMIT $random,1. It should be quite efficient for 3-5 random rows, but not good if you want 100 random rows each time :)
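Roughly, that approach looks like this (the row count and the offset are example values; the application would pick a distinct random offset below the count for each of the 5 picks):

-- Once: how many rows match the condition?
SELECT COUNT(*) FROM user WHERE cash < 1000;          -- suppose it returns 8421

-- Then, per pick: one cheap query with a random offset in [0, 8420].
SELECT * FROM user WHERE cash < 1000 LIMIT 3217, 1;   -- 3217 chosen at random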
Also consider caching your results. Often you don't need different random rows to be displayed on each page load. Generate your random rows only once per minute. If you generate the data via cron, for example, you can also live with a query that takes several seconds, as users will see the old data while the new data is being generated.
Here are some of my bookmarks for this problem for reference:
http://jan.kneschke.de/projects/mysql/order-by-rand/
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

How to increase performance of LIMIT ?,1 when ? is a huge number

I have a situation where I need to use a huge number for the limit. For example,
"select * from a table limit 15824293949,1";
this is... really, really slow. Sometimes my home MySQL server just dies.
Is it possible to make it faster?
Sorry, the number was 15824293949, not 38975901200.
Added:
Table 'photos' (sample):

img_id  img_filename
1       a.jpg
2       b.jpg
3       c.jpg
4       d.jpg
5       e.jpg
...and so on
select cp1.img_id,cp2.img_id from photos as cp1 cross join photos as cp2 limit ?,1
How did I get 15824293949?
I have 177901 rows in my photos table. I can get the total number of possible combinations using
((total # of rows * total # of rows) - total # of rows) / 2
MySQL has issues with huge LIMIT offsets, mostly with the MyISAM engine, whereas InnoDB optimizes that better. There are various techniques to make MyISAM's LIMIT behave faster; in any case, add EXPLAIN before your SELECT statement to see what's actually going on. The billions of rows generated by the cross join indicate that the issue lies within the join itself, not the LIMIT clause.
If you're interested in how to make LIMIT behave faster, this link should provide you with enough information.
Try limiting the query with a WHERE clause on a column with an index on it. E.g.:
SELECT * FROM table WHERE id >= 38975901200 LIMIT 1
Update: I think perhaps you don't even need the database? You can find the nth combination of two images by calculating something like 15824293949 / 177901 and 15824293949 % 177901. I suppose you could write a query:
SELECT (15824293949 DIV 177901) AS img_id1, (15824293949 MOD 177901) AS img_id2
If you're trying to get them from the natural order that they're in the database (and it doesn't happen to be their img_id) then you might have some trouble. Does it matter? It's not clear what you're trying to do here.
Presumably you have this in some sort of script, where the reason you are looking at that specific point is because it is where you left off last.
Ideally, if you also have an auto_increment primary key field (an id), you can store that number. Then just do select * from table where id > last_seen_id limit 1 (maybe do more than 1 at a time :P)
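As a sketch, using the photos table from the question and a remembered position (123456 is just a placeholder for wherever you left off):

-- Seek by key instead of counting past billions of rows with OFFSET.
SELECT * FROM photos
WHERE img_id > 123456          -- last img_id handled so far
ORDER BY img_id
LIMIT 10;
-- Afterwards, store the largest img_id returned and start from it next time.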
Generally speaking, what you are asking it to do should be slow. Give it something to search for, rather than everything with a limit.

MySQL: SELECT(x) WHERE vs COUNT WHERE?

This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that is not even an option; you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that; use the MySQL-specific stuff where necessary (e.g. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column, both queries will have more or less equal performance. The second query will need one less line of code in your program, but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although... on second thought, it would have to do that in the first option as well, so the code knows how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster, since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
Typically, you use a GROUP BY ... HAVING clause to determine if there are duplicate rows in a table. Say you have a table with an id and a name, where id is the primary key and you want to know whether name is unique or repeated. You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the names that are repeated and how many times each one occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.