Select a Portion of Vast Data Over Time with MySQL - mysql

I have hundreds of thousands of price points spanning 40 years plus. I would like to construct a query that will only return 3000 total data points, with the last 500 being the most recent data points, and the other 2500 being just a sample of the rest of the data, evenly distributed.
Is it possible to do this in one query? How would I select just a sample of the large amount of data? This is a small example of what I mean for getting just a sample of the other 2500 data points:
1
2
3
4
5
6
7
8
9
10
And I want to return something like this:
1
5
10
Here's the query for the last 500:
SELECT * FROM price ORDER BY time_for DESC LIMIT 500
I'm not sure how to go about getting the sample data from the other data points.

Try this:
(SELECT * FROM price ORDER BY time_for DESC LIMIT 500)
UNION ALL
(SELECT * FROM price WHERE time_for < (SELECT time_for FROM price ORDER BY time_for DESC LIMIT 499, 1) ORDER BY rand() LIMIT 2500)
ORDER BY time_for
Note: It's probably going to be slow. How big is this table?
It might be faster to only get the primary ID from all these rows, then join it to the original in a secondary query once it's narrowed down. This is because ORDER BY rand() LIMIT has to sort the entire table. If the table is large this can take a LONG time, and a lot of disk space. Retrieving only the ID reduces the necessary disk space.
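A minimal sketch of the ID-first idea described above, assuming price has an integer primary key named id (a hypothetical name; the question does not say what the key is). The inner query sorts only the narrow id column by RAND(), and the join then fetches the full rows for the 2500 chosen ids.

```sql
-- Assumption: price has a primary key column `id` (hypothetical name).
SELECT p.*
FROM price AS p
JOIN (
    -- Sort only the ids randomly, so the rows being sorted stay small.
    SELECT id
    FROM price
    WHERE time_for < (SELECT time_for
                      FROM price
                      ORDER BY time_for DESC
                      LIMIT 499, 1)   -- time_for of the 500th newest row
    ORDER BY RAND()
    LIMIT 2500
) AS sample ON sample.id = p.id
ORDER BY p.time_for;
```

This returns only the random sample; it can be combined with the latest 500 rows through UNION ALL as in the query above.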

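The previous answer is good, but you did specify that you want the results to be evenly distributed, so I'll add this possibility too. By iterating a counter over the rows you can use a MOD operator to sample an even distribution. I don't have a MySQL install right now to test this, so apologies if the syntax isn't 100% spot on. But it should be close enough and may give you some ideas.
( SELECT 0 AS row_num, p1.*
FROM price p1
ORDER BY p1.time_for DESC
LIMIT 500 )
UNION ALL
( SELECT s.*
FROM ( SELECT @i := @i + 1 AS row_num, p2.*
FROM price p2,
( SELECT @i := 0 ) init
ORDER BY p2.time_for DESC ) s
WHERE s.row_num > 500
AND (s.row_num % 500) = 0 )
ORDER BY time_for DESC
The first query gives the 500 latest rows. The second query numbers all rows from newest to oldest and keeps every 500th row after the first 500, thus returning an even distribution from the rest of the data. The counter has to be computed in a derived table because a column alias such as row_num cannot be referenced in the WHERE clause of the same SELECT, and the first query carries a dummy row_num so that both halves of the UNION have the same columns. Obviously you can tune the spacing to achieve the desired sample density, or base it on the total number of rows in the table to calculate the spacing that gives exactly 2500 records. Be aware that MySQL does not guarantee the order in which user variables are evaluated relative to ORDER BY, so check the numbering on your version.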
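On MySQL 8.0 or later, window functions can do the row numbering without user variables. The following is only a sketch under that assumption: it numbers rows from newest to oldest, keeps the first 500, and spreads 2500 picks evenly over the remaining rows by keeping a row whenever the scaled position crosses an integer boundary. The names rn and total are introduced here for illustration and appear as extra columns in the result.

```sql
-- Assumption: MySQL 8.0+ (CTEs and window functions are available).
WITH ranked AS (
    SELECT p.*,
           -- CAST to SIGNED so the subtractions below cannot underflow.
           CAST(ROW_NUMBER() OVER (ORDER BY time_for DESC) AS SIGNED) AS rn,
           COUNT(*) OVER () AS total
    FROM price AS p
)
SELECT *
FROM ranked
WHERE rn <= 500                       -- the 500 most recent rows
   OR (total > 500                    -- guard against dividing by zero
       -- keep an older row when FLOOR(k * 2500 / M) increases at position k,
       -- where k = rn - 500 and M = total - 500 (number of older rows)
       AND FLOOR((rn - 500) * 2500 / (total - 500))
         > FLOOR((rn - 501) * 2500 / (total - 500)))
ORDER BY time_for;
```

When there are at least 2500 older rows, the scaled position rises by at most one per row and ends at 2500, so exactly 2500 older rows are kept; with fewer older rows, all of them are returned. The query still reads and sorts the whole table once.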

Related

SQL Query - selecting higher priority rows more often

I am trying to write a SQL query (run through mysqli) that selects rows with a higher priority more often. I have a DB where all posts are sorted by priority, but I want to select them like this (10 is the highest priority):
**Priority**
10
3
10
9
7
10
9
1
10
How can I do this? I have tried to solve this in several ways, but without result. Thank you.
If you want to sample your data with preference to higher priorities, you could do something like this:
SELECT *
FROM (
SELECT OrderDetailID
,mod(OrderDetailID, 10) + 1 AS priority
,rand() * 10 AS rand_priority
FROM OrderDetails
) A
WHERE rand_priority < priority
ORDER BY OrderDetailID
This query runs in MySQL Tryit from W3Schools.
mod(OrderDetailID, 10) + 1 simulates a 1-10 priority; your table already has this value in a column
rand() * 10 gives you a random number from 0 up to (but not including) 10
Then by filtering to only ones where the random number is less than the priority, you get a result set where the higher priorities are more likely.
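Applied to a table that already stores the priority, the same filter might look like the sketch below. The table name posts and its priority column (assumed to hold integers from 1 to 10) are hypothetical, since the question does not give the schema.

```sql
-- Assumption: a hypothetical table `posts` with an integer `priority` column in 1..10.
SELECT *
FROM posts
WHERE RAND() * 10 < priority   -- priority 10 always passes; priority 1 passes about 10% of the time
ORDER BY RAND();               -- shuffle the surviving rows so priorities come out interleaved
```

Each execution draws fresh random numbers, so the set of rows and their order change from run to run.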
You may use the RANK() window function if your MySQL version supports it (MySQL 8.0 or later). It orders your data by priority in descending order and ranks each row. If two rows have the same priority, both rows get the same rank. Then you can filter on the first rank, which always gives you the highest-priority rows.
SELECT *
FROM
(
SELECT
col1,
col2,
priority,
RANK() OVER w AS priority_rank
FROM MyTable
WINDOW w AS (ORDER BY priority DESC)
) MyQuery
WHERE priority_rank = 1
Note: the syntax might be incorrect, please feel free to edit the query. The alias is priority_rank rather than rank because RANK is a reserved word in MySQL 8.0.
This post might help you for ranking if your MySql version doesn't support Rank.

MYSQL SELECT random on large table ORDER BY SCORE [duplicate]

This question already has answers here:
Optimizing my mysql statement! - RAND() TOO SLOW
I have a large MySQL table with about 25000 rows. There are 5 table fields: ID, NAME, SCORE, AGE, SEX
I need to select 5 random MALES from the top scorers, ordered by SCORE DESC.
For instance, if there are 100 men that score 60 each and another 100 that score 45 each, the script should return a random 5 from those first 200 men out of the list of 25000.
ORDER BY RAND() is very slow.
The real issue is that the 5 men should be a random selection within the first 200 records. Thanks for the help
To get something like this I would use a subquery. That way you only apply RAND() in the outer query, over 200 rows, which is much less taxing.
From what I understood from your question, you want the 200 males with the highest scores, so that would be something like this:
SELECT *
FROM table_name
WHERE sex = 'male'
ORDER BY score DESC
LIMIT 200
Now, to pick 5 random results from those, it would be something like this:
SELECT id, score, name, age, sex
FROM
( SELECT *
FROM table_name
WHERE sex = 'male'
ORDER BY score DESC
LIMIT 200
) t -- could also be written `AS t` or anything else you would call it
ORDER BY RAND()
LIMIT 5
I don't think that sorting by random can be "optimised out" in any way, since sorting is an N*log(N) operation. The query optimizer normally avoids sorting by reading rows in index order, which is not possible for a random value.
ORDER BY RAND() generates a random number for every candidate row and then sorts the whole set on it before LIMIT is applied. On the full table of approximately 25000 rows that takes a noticeable amount of time, which is why it helps to cut the candidates down to 200 rows first.
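As a hedged illustration of the point about indexes: a composite index on the filter column followed by the sort column may let MySQL read the top 200 rows in index order instead of sorting all matching rows. The index name idx_sex_score is made up for this sketch, and the table and column names follow the question.

```sql
-- Assumption: table and column names as in the question; the index name is hypothetical.
CREATE INDEX idx_sex_score ON table_name (sex, score);

-- EXPLAIN shows whether the top-200 query still reports "Using filesort".
EXPLAIN
SELECT id, score, name, age, sex
FROM table_name
WHERE sex = 'male'
ORDER BY score DESC
LIMIT 200;
```

The outer ORDER BY RAND() over the 200 candidate rows still needs a sort, but only over those 200 rows.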

Mysql select records with offset

I'm looking for a MySQL select that will let me fetch a fixed number of records (LIMIT 8) after skipping a variable number of the first matches:
select id
from customers
where name LIKE "John%"
Limit 8
So if I have a table with 1000 Johns with various last names,
I want to be able to select records 500-508.
You can send the offset to the limit statement, like this:
SELECT id
FROM customers
WHERE name LIKE "John%"
LIMIT 8 OFFSET 500
Notice the OFFSET 500 on the limit. That sets the 'start point' past the first 500 entries (at entry #501).
Therefore, entries #501, #502, #503, #504, #505, #506, #507 and #508 will be selected.
This can also be written:
LIMIT 500, 8
Personally, I don't like that form as much because the order is easy to mix up: the first number is the offset and the second is the row count.
Pedantic point: 500-508 is 9 entries, so I had to adjust.
As a solution please try executing the following sql query
select id from customers where name LIKE "John%" Limit 500,8
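One caveat worth sketching: without an ORDER BY, MySQL does not guarantee the order in which rows come back, so the same OFFSET can return different rows between runs. A minimal variant with a deterministic order, assuming id is unique in customers:

```sql
-- Assumption: id is a unique column, used here only to break ties.
SELECT id
FROM customers
WHERE name LIKE 'John%'
ORDER BY name, id      -- id breaks ties between identical names
LIMIT 8 OFFSET 500;    -- skips 500 rows, then returns rows 501 to 508
```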

Pagination issue while sorting based on last modified property

I need to show some records sorted by the modified column (latest activity on top).
(Posts with a new edit or new comments go to the top.)
The app UI has a Twitter-like 'more' button for infinite scroll. Each 'more' adds the next 10 records to the UI.
The issue is that the pagination index breaks when any of the records still to be shown is modified.
For example:
Suppose I have records A, B, C, ... Z in the Jobs table.
The first time, I show records A-J to the user using
SELECT * FROM Jobs WHERE 1 ORDER BY last_modified DESC LIMIT 0, 10
second time if none of the records are modified
SELECT * FROM Jobs WHERE 1 ORDER BY last_modified DESC LIMIT 10, 10
will return K-T
But if somebody modifies any record after J before the user clicks the 'more' button,
SELECT * FROM Jobs WHERE 1 ORDER BY last_modified DESC LIMIT 10, 10
will return J-S
Here record J is duplicated. I can hide it by not inserting J into the UI, but then the 'more' button shows only 9 records. This mechanism also fails when a large number of records are updated: if 10 records are modified, the query returns A-J again.
What is the best way to handle this pagination issue?
Keeping a second timestamp fails if a record has multiple updates.
Would a server-side cache of the queries work?
I would do a NOT IN() and a LIMIT instead of just a straight LIMIT with a pre-set offset.
SELECT * FROM Jobs WHERE name NOT IN('A','B','C','D','E','F','G','H','I','J')
ORDER BY last_modified DESC LIMIT 10
This way you still get the most recent 10 every time, but you need to track which records have already been shown and exclude them in every query.
Twitter timelines are not paged with offsets; they are queried by IDs.
This page will help you a lot in understanding timeline basics: https://dev.twitter.com/docs/working-with-timelines
Let's say each row has an id field too:
id msg
1 A
2 B
....
The first query will give you 10 posts, and the max post id will be 10.
The next query should be
SELECT * FROM Jobs WHERE id > 10 ORDER BY last_modified DESC LIMIT 0, 10
I don't know the exact solution, but I can give it a try.
First you need an integer ID column in your Jobs table.
Now send max_id = null along with limit = 10 and offset = 0 from the UI.
If max_id is null, set max_id to (MAX(ID) + 1) of the table:
SELECT (MAX(ID) + 1) INTO max_id FROM Jobs;
Later find the records:
SELECT * FROM Jobs WHERE ID < max_id ORDER BY last_modified DESC LIMIT 10 OFFSET 0;
Return the records to UI.
Then, in the UI, set max_id to the ID of the first record in the response array and set offset = offset + limit.
From then on, query with the updated values of max_id and offset:
SELECT * FROM Jobs WHERE ID < max_id ORDER BY last_modified DESC LIMIT 10 OFFSET 10;
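The cursor idea from the answers above can also be sketched without any OFFSET, by remembering the sort key of the last row already shown. This is only an illustration: it assumes Jobs has an integer primary key id, and the session variables @last_modified and @last_id (hypothetical names, with example values) stand in for whatever the UI sends back.

```sql
-- Assumptions: Jobs has an integer primary key `id`; the two variables hold the
-- last_modified and id of the last row already shown (example values only).
SET @last_modified := '2024-01-01 12:00:00';
SET @last_id := 42;

SELECT *
FROM Jobs
WHERE last_modified < @last_modified
   OR (last_modified = @last_modified AND id < @last_id)  -- tie-break on id
ORDER BY last_modified DESC, id DESC
LIMIT 10;
```

With this approach, a record that is modified after it was shown moves above the cursor and is not repeated in later pages. A record that is modified before it was shown also moves above the cursor, so it appears only when the list is refreshed from the top.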

MySQL equally distributed random rows with WHERE clause

I have this table,
person_id int(10) pk
points int(6) index
other columns not very important
I have this random function which is very fast on a table with 10M rows:
SELECT person_id
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id >= r2.id
ORDER BY r1.person_id ASC
LIMIT 1
This is all great but now I wish to show only people with points > 0. Example table:
PERSON_ID POINTS
1 4
2 6
3 0
4 3
When I append AND points > 0 to the WHERE clause, person_id 3 can't be selected, so a gap is created: when the random value lands on person_id 3, person_id 4 is selected instead. This gives person 4 a bigger chance of being chosen. Does anyone have suggestions for how I can adjust the query so it works with the WHERE clause and gives all rows the same chance of being selected?
Info table: The table is uniform, with no gaps in person_id. About 90% of rows will have 0 points. I want to make the query work for both points = 0 and points > 0.
Before someone says "use ORDER BY RAND()": this is not a solution for tables with more than a few 100k rows.
Bonus question: is it possible to select x random rows in 1 query, so I do not have to call this query several times when I want more random rows?
Important note: performance is key. With 10M+ rows the query should not take much longer than the current query, which takes 0.0005 seconds; I prefer to stay under 0.05 seconds.
Last note: If you think the query will never be this fast with the above requirements, but another solution is possible (like fetching 100 rows and showing x random ones that have more than 0 points), please tell :)
Really appreciate your help and all help is welcome :)
You could generate in-line, gap-free ids for only the records you really want to work with, and then generate the random selector from the total number of those records.
Try this (props to the chosen answer here for the row number generator):
SELECT r1.*
FROM
(SELECT person_id,
@curRow := @curRow + 1 AS row_num
FROM persons AS p,
(SELECT @curRow := 0) r0
WHERE points>0) r1
, (SELECT COUNT(1) * RAND() id
FROM persons
WHERE points>0) r2
WHERE r1.row_num>=r2.id
ORDER BY r1.row_num ASC
LIMIT 1;
You can mess with it in this sqlfiddle.
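On MySQL 8.0 or later, the same gap-free numbering can be sketched with ROW_NUMBER() instead of a user-variable counter. This is an assumption-laden illustration rather than a tested drop-in: the random value is drawn once into a session variable (a hypothetical @r) so that every row is compared against the same target.

```sql
-- Assumption: MySQL 8.0+; @r is a hypothetical session variable drawn once per pick.
SET @r := RAND();

WITH eligible AS (
    SELECT person_id,
           ROW_NUMBER() OVER (ORDER BY person_id) AS rn,   -- gap-free 1..total
           COUNT(*) OVER () AS total
    FROM persons
    WHERE points > 0
)
SELECT person_id
FROM eligible
WHERE rn = 1 + FLOOR(@r * total);   -- one target position in 1..total
```

Like the user-variable version, this numbers every qualifying row on each call, so on a 10M-row table it is unlikely to stay within the 0.05 second budget from the question.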