How to repeatedly retrieve a certain amount of data from a large data set? - mysql

I have an INVENTORY table. There are many items inside. When I query it, I just want to retrieve 50 items each time, get the next 50 if the user requests them, and repeat this process until the end of the records. I am using a MySQL database. How do I do it in plain SQL, if possible?
thanks,

SELECT * FROM Inventory
WHERE whatever
ORDER BY something
LIMIT 50 OFFSET 0
gets you the first 50; the second time, use OFFSET 50; the third time, use OFFSET 100; and so on. It can be a problem if the table changes between requests, of course; if you do have that problem, solving it can be costly (force locks, make a copy of the table, or other unpleasant solutions -- there is no magic).
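For example, the second request (rows 51-100) keeps the same placeholder filter and ordering and only moves the offset:
SELECT * FROM Inventory
WHERE whatever
ORDER BY something
LIMIT 50 OFFSET 50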
If your client just cannot remember the offset it last used, you can store it in a separate auxiliary table, but it does complicate the SQL a lot.

Use LIMIT:
SELECT ... FROM ... LIMIT <offset>, <number>
In your case <number> would be 50 and <offset> would increase by 50 each request.
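For example, with the Inventory table from the question, the third request (rows 101-150) in this form would be:
SELECT * FROM Inventory LIMIT 100, 50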

Related

Optimize SQL to get rows count

I have a page on my site that keeps track of the number of people accessing it; another part displays data about the users that access this page, showing only about 10 at a time.
The problem is I need to create pagination, so I need to know how much data is in my table at any given time, and this causes the display page to take 2-3 seconds to load, sometimes 7-10, because I have millions of records. I am wondering how to get this page to load faster.
SELECT COUNT(*) AS Count FROM visits
My first response is... if you are paging records 10 at a time, why do you need the total count of more than a million?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from t;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.
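A minimal sketch of that trigger approach, assuming a one-row counter table (all names here are hypothetical):
-- counter table, seeded once from the real count
CREATE TABLE visit_count (total BIGINT NOT NULL);
INSERT INTO visit_count SELECT COUNT(*) FROM visits;
-- keep it in sync on every insert and delete
CREATE TRIGGER visits_after_insert AFTER INSERT ON visits
FOR EACH ROW UPDATE visit_count SET total = total + 1;
CREATE TRIGGER visits_after_delete AFTER DELETE ON visits
FOR EACH ROW UPDATE visit_count SET total = total - 1;
The pagination page then reads SELECT total FROM visit_count instead of running COUNT(*) over millions of rows.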
The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
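For example (a sketch; the hr column and its unique key are assumptions):
INSERT INTO SummaryTable (hr, hourly_count)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hourly_count = hourly_count + 1;
With hr as the unique key, each visit either creates the row for its hour or increments it, so the SUM above stays exact.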
Some more info is here.
But, before we take this too far, please inform us of how often a new 'visit' occurs.
You cannot make that query faster without changing the server's hardware or adding more servers to run it in parallel. In the second case it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by having a temporary table where you record the access logs for the past hour/day, and after that time run a cronjob that deletes the data or moves it to another table for long-term storage.
You usually do not need to know the exact number of rows for pagination.
SELECT COUNT(*) FROM
(SELECT * FROM visits LIMIT 10000) AS v
would tell you that there are at least 1000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need the exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive it is, of course.
Decide on a limit (let's say, the 1000 most recent) from a practical (business requirements) point of view. Have an auto_increment index (id) or a timestamp (createdon). Grab at most 1000 records:
select count(*) from (select id from visits order by id desc limit 1000) as t
or grab all 1000 and paginate on the client side (PHP), counting there (even if you paginate, MySQL will still go through those records):
select * from visits order by id desc limit 1000

Don't use OFFSET, they say, but what's the actual purpose of OFFSET?

It appears that OFFSET is not recommended when dealing with large record sets because of slow performance, and that you should instead use something like WHERE id < x LIMIT y.
If this is the case, why does OFFSET exist? Is there another use for it?
Conceptually, the way that offset (and limit, for that matter) works is that the entire result set is filtered after it is generated. So, in order to get the "y"th row, all the rows up to that one need to be generated.
If there is another method to get the same rows using where, then that will normally be faster (for large values of "y"), because the intermediate rows don't have to be generated.
However, offset is still very useful. How else would you easily get the candidates ranked 11+ in average poll ratings from candidates, so you know whom to put in the "also-ran" debate? How would you get the second highest value, if that is what you want? For smaller datasets, offset can be very useful for these types of questions. It is also useful for paging on smaller result sets.
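To make the contrast concrete (a sketch; the visits and candidates tables and their indexed columns are assumptions):
-- keyset pagination: the WHERE id < x LIMIT y pattern, fast with an index on id
SELECT * FROM visits WHERE id < 123456 ORDER BY id DESC LIMIT 10;
-- OFFSET for a rank question: the second-highest average rating
SELECT * FROM candidates ORDER BY avg_rating DESC LIMIT 1 OFFSET 1;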

MySQL ORDER BY RAND() with WHERE clause any better?

I understand ORDER BY RAND() is slow (I am using it to get a random subset of data). But I wonder: if there is a WHERE clause or a filtering operation like a JOIN, will it improve things? My DB size can grow as time goes by. But if I expect the WHERE to limit the number of records to, say, 1000, ORDER BY RAND() will work only with those 1000 records, correct?
In case you want more detail
What I am doing is actually generating winners for a lucky draw. So I want to randomly select a few winners. Simple example is something like:
SELECT * FROM luckydrawchance
WHERE luckydraw = 1
ORDER BY RAND()
LIMIT 5
But some users might have more chances of winning, so I am thinking
SELECT * FROM luckydrawchance
WHERE luckydraw = 1
ORDER BY RAND() * (-chances)
LIMIT 5
Maybe instead of RAND() * (-chances) I need something else (I read this does not give the right probability distribution), but just to give you an idea.
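One commonly cited trick for the weighted case (a sketch, not verified here: give each row an exponentially distributed sort key, so that a row's chance of landing in the top 5 is proportional to its chances column):
SELECT * FROM luckydrawchance
WHERE luckydraw = 1
-- smaller keys win; LOG is the natural log, and 1 - RAND() avoids LOG(0)
ORDER BY -LOG(1 - RAND()) / chances
LIMIT 5
Like plain ORDER BY RAND(), this still sorts the whole filtered set; it fixes the probability distribution, not the performance.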
Joining other tables will actually make things worse because to order by rand(), MySQL copies the result to a temporary table. The bigger and more complex the data to copy, the slower the query. As for WHERE, I can't give an absolute answer but I expect sorting a smaller subset to be faster than sorting the whole table. Using EXPLAIN on your query should help you to understand how it is executed.
EDIT2: From your extra information it is clear that fair randomness is important, but you need only a few rows and you need them only rarely. So I would combine two steps. numrows is very roughly the number of rows in the table, numwinners the number of wanted winners:
calculate part = 5 * numwinners / numrows
query your data in the way
select * from users where rand() < [part] order by rand() limit numwinners
If it returns fewer than numwinners rows (very rare, but it could happen), then repeat the query.
EDIT: clarified further
If you just want an arbitrary subset of your data for a one-shot analysis, you should find out how large your sample is compared to the whole table. Say it is a bit less than 0.1% and 1000 rows; then you could try
where rand() < 0.001
LIMIT 1000 -- EDIT: of course you should use LIMIT
This also creates tons of rand() numbers, but does not have to order your data by those rand() numbers. You MUST adapt the 0.001 to your needs, and there is no guarantee of a good result. If you make the number too small, or if you are just unlucky (random!), you get too little data. If you make it too big, you always get only older (or only newer) entries, depending on sorting.
If you need a random sample very often, then you could assign a fixed field with a random number, but you have to be a bit careful reading the sample. If you spread the range [ 0, 1 ] to your lines and want a fair sample, then you could make a check-random number between [ 0.1, 0.9 ] and read all data within [check - 0.1, check+0.1]. You could reshuffle the assigned random numbers every now and then (at night, for example).
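A sketch of that precomputed-column idea (the users table and the column names are hypothetical):
-- assign each row a fixed random number once, and index it
ALTER TABLE users ADD COLUMN rnd DOUBLE;
UPDATE users SET rnd = RAND();
CREATE INDEX idx_users_rnd ON users (rnd);
-- to sample: pick a check point in [0.1, 0.9] and read the band around it
SET @check := 0.1 + RAND() * 0.8;
SELECT * FROM users WHERE rnd BETWEEN @check - 0.1 AND @check + 0.1;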
Almost any attempt to get a random 5 rows out of a 1000-row table will hit all 1000 rows. flaschenpost's will fetch somewhere between 5 and 1000; it will vary.
Here are the only really efficient random fetchers I know of. You have not provided enough details about your table for me to pick among the five for you.

ORDER BY RAND() alternative [duplicate]

Possible Duplicate:
MySQL: Alternatives to ORDER BY RAND()
I currently have a query that ends ORDER BY RAND(HOUR(NOW())) LIMIT 40 to get 40 random results. The list of results changes each hour.
This kills the query cache, which is damaging performance.
Can you suggest an alternative way of getting a random(ish) set of results that changes from time to time? It does not have to be every hour and it does not have to be totally random.
I would prefer a random result, rather than sorting on an arbitrary field in the table, but I will do that as a last resort...
(this is a list of new products that I want to shuffle around a bit every now and then).
If you have an ID column it's better to do:
-- create variables to hold the row count and a random row number
SET @rownum := (SELECT COUNT(*) FROM tbl_name);
SET @row := (SELECT CEIL(RAND() * @rownum));
-- use the random number to select on the id column
SELECT * FROM tbl_name WHERE id = @row;
The logic of selecting the random id number can be moved to the application level. (Note that this only works cleanly if the id values are contiguous, with no gaps.)
SELECT * FROM table ORDER BY RAND() LIMIT 40
is very inefficient, because MySQL will process ALL the records in the table, performing a full table scan on all the rows and ordering them randomly.
It's going to kill the cache because you are expecting a different result set each time. There is no way you can cache a random set of values. If you want to cache a group of results, cache a large random set of values, and then, within sub-sections of the time that you are going to use those values, do a random grab within the smaller set [outside of SQL].
I think the better way is to download product identifiers to your middle layer, choose 40 random values when you need them (once per hour or for every request) and use them in the query: product_id in (#id_1, #id_2, ..., #id_40).
you may have a column with random values that you update every hour.
This is going to be a significantly nasty query if it needs to sort a large data set into a random order (which really does require a sort), then discard all but the first 40 records.
A better solution would be to just pick 40 random records. There are lots of ways of doing this and it usually depends on having keys which are evenly distributed.
Another option is to pick the 40 random records in a batch job which is only run once per hour (or whatever) and then remember which ones they are.
One way to achieve it is to shuffle the objects you map the data to. If you don't map the data to objects, you could shuffle the result array from the database. I don't know if this will perform better or not, but you will at least get the benefits from the query cache as you mention.
You could also generate a random sequence from 1 to n, and index the result array (or object array) with those.
Calculate the current hour in your PHP code and pass that to your query. This will result in a static value that can be cached.
Note that you might also have a hidden bug: since you're only taking the hour, you only have 24 different values, which will repeat every day. That means what's showing at 1 pm today will also be what shows tomorrow at 1 pm. You might want to change that.
Don't fight with the cache -- exploit it!
Write your query as you are (or even simpler). Then, in your code, cache the results, setting a cache expiry for 1 hour. If you are using a caching layer, like memcached, you are set. If not, you can build a fairly simple one:
[pseudocode]
global cache[24], cacheDay[24]
h = Time.hour
if (cache[h] == null || cacheDay[h] != Time.day) {
    cache[h] = ... run your query
    cacheDay[h] = Time.day  // remember the day, so an entry expires after 24 hours
}
return cache[h];
If you only need a new set of random data once an hour, don't hit the database - save the results to your application's caching layer (or, if it doesn't have one, just put it out into a temporary file of some sort). Query cache is handy, but if you never need to even execute a query, even better...

Why does MySQL Rand() hate me?

Here is a simplified version of a query I am trying to run as part of a larger join query. It still breaks at this small scale. I am trying to generate a random number in the range 1-60 for each row pulled back, and then order the returned rows by this random number.
SELECT downloads.*,
       FLOOR(1 + (RAND() * 60)) AS randomtimer
FROM downloads
ORDER BY randomtimer
LIMIT 25
I have 2 databases I have tried this query on: a live one and a dev one. I have compared the two side by side and they are structurally the same. It works correctly on the dev one, returning the rows ordered by randomtimer.
The live table returns all 1s in the randomtimer column. If I order by randomtimer ASC they become all 60s. If I remove randomtimer from the ORDER BY clause it returns correct individual values. So something is tweaking the values in the ORDER BY statement.
Anyone have any ideas on this? Might I be overlooking something? WTF?
Aside from what mr. unknown has said, there's another issue.
You are generating a random number between 1 and 60 then selecting the top 25 rows. If there are enough rows that you would (statistically) end up with more than 25 with a random value of 1, then the first 25 rows would of course all have a value of 1 in the "randomtimer" column.
So this is likely due to the fact that you just have a lot more data in production than on the dev server.
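To put numbers on it: with N rows, each of the 60 possible values shows up about N/60 times, so once the table passes roughly 25 * 60 = 1500 rows, the first 25 rows of an ascending sort will typically all be 1s (and the first 25 of a descending sort all 60s).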
From the RAND() docs:
You cannot use a column with RAND() values in an ORDER BY clause, because ORDER BY would evaluate the column multiple times. However, you can retrieve rows in random order like this:
mysql> SELECT * FROM tbl_name ORDER BY RAND();
I'd guess the variance is due to a different MySQL version, different query plans, or different table data, but I don't know which.
I decided to scrap that idea and instead make an array of random numbers in PHP, the same length as the returned results, and just sort and use that.
I'll throw out an idea... the RAND function is using the time as its seed. On the live system, the entire query finishes within the same millisecond, so all the random numbers are the same. On the dev system, it takes longer, so you get more varied random numbers. That might make sense if your live system is more powerful than your dev system.
Just a thought.