ORDER BY RAND() alternative [duplicate] - mysql

Possible Duplicate:
MySQL: Alternatives to ORDER BY RAND()
I currently have a query that ends with ORDER BY RAND(HOUR(NOW())) LIMIT 40 to get 40 random results. The list of results changes each hour.
This kills the query cache, which is damaging performance.
Can you suggest an alternative way of getting a random(ish) set of results that changes from time to time? It does not have to be every hour and it does not have to be totally random.
I would prefer a random result, rather than sorting on an arbitrary field in the table, but I will do that as a last resort...
(this is a list of new products that I want to shuffle around a bit every now and then).

If you have an ID column, it's better to do:
-- count the rows and pick a random id in that range
SET @rownum := (SELECT COUNT(*) FROM table);
SET @row := (SELECT CEIL(RAND() * @rownum));
-- use the random number to select on the id column
SELECT * FROM table WHERE id = @row;
The logic of selecting the random id number can be moved to the application level.
SELECT * FROM table ORDER BY RAND() LIMIT 40
is very inefficient because MySQL will process ALL the records in the table, performing a full table scan over all the rows and then sorting them randomly.
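As a rough sketch of extending this idea to the question's 40-row case (products is a hypothetical table; it assumes ids are reasonably dense, and note the 40 rows come out consecutive rather than independently random):
-- pick a random starting id, then take 40 consecutive rows
SET @start := (SELECT FLOOR(RAND() * (MAX(id) - 40)) FROM products);
SELECT * FROM products
WHERE id >= @start
ORDER BY id
LIMIT 40;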

It's going to kill the cache because you are expecting a different result set each time; there is no way to cache a random set of values. If you want to cache a group of results, cache one large random set, and then, within sub-sections of the period you are going to use those values, do a random grab within the smaller set (outside of SQL).

I think the better way is to download the product identifiers to your middle layer, choose 40 random values when you need them (once per hour or for every request), and use them in the query: product_id IN (@id_1, @id_2, ..., @id_40).

You could keep a column of random values that you update every hour, and ORDER BY that column instead.
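A minimal sketch of that, with products and rand_col as hypothetical names (an index on rand_col keeps the hot query cheap):
-- refresh the random ordering once per hour, e.g. from cron
UPDATE products SET rand_col = RAND();

-- the hot query is now deterministic between refreshes
SELECT * FROM products ORDER BY rand_col LIMIT 40;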

This is going to be a significantly nasty query if it needs to sort a large data set into a random order (which really does require a sort), then discard all but the first 40 records.
A better solution would be to just pick 40 random records. There are lots of ways of doing this and it usually depends on having keys which are evenly distributed.
Another option is to pick the 40 random records in a batch job which is only run once per hour (or whatever) and then remember which ones they are.
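A sketch of that batch-job approach, using a hypothetical random_picks table repopulated hourly by cron (ORDER BY RAND() is tolerable here because it runs offline, not per request):
-- once per hour: remember 40 random product ids
TRUNCATE TABLE random_picks;
INSERT INTO random_picks (product_id)
SELECT id FROM products ORDER BY RAND() LIMIT 40;

-- per request: just join against the remembered ids
SELECT p.*
FROM products p
JOIN random_picks r ON r.product_id = p.id;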

One way to achieve it is to shuffle the objects you map the data to. If you don't map the data to objects, you could shuffle the result array from the database. I don't know if this will perform better or not, but you will at least get the benefits from the query cache as you mention.
You could also generate a random sequence from 1 to n, and index the result array (or object array) with those.

Calculate the current hour in your PHP code and pass that to your query. This will result in a static value that can be cached.
Note that you might also have a hidden bug: since you're only taking the hour, you only have 24 different values, which repeat every day. That means what's showing at 1 pm today will be the same as what shows at 1 pm tomorrow. You might want to change that.
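For instance, the application could compute something like seed = days_since_epoch * 24 + hour and inline it as a literal, so the query text is stable within the hour but never repeats daily (12345 below is a made-up stand-in for that seed):
SELECT * FROM products
ORDER BY RAND(12345) -- application-computed hourly seed
LIMIT 40;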

Don't fight the cache; exploit it!
Write your query as you have it (or even simpler). Then, in your code, cache the results, setting a cache expiry of 1 hour. If you are using a caching layer, like memcached, you are set. If not, you can build a fairly simple one:
In PHP, for instance (runQuery() stands in for your actual query; a per-process array like this would be backed by APCu, a file, or similar when requests don't share memory):
$cache = array();            // one slot per hour of the day
$h = (int) date('G');        // current hour, 0-23
if (!isset($cache[$h])) {
    $cache[$h] = runQuery(); // run the expensive query at most once per hour
}
return $cache[$h];

If you only need a new set of random data once an hour, don't hit the database - save the results to your application's caching layer (or, if it doesn't have one, just write them out to a temporary file of some sort). The query cache is handy, but never having to execute the query at all is even better...

Related

Optimize SQL to get row count

I have a page on my site that keeps track of the number of people accessing it; another part of the site displays data about the users that accessed this page, only about 10 at a time.
The problem is that I need to create pagination, so I need to know how many rows are in my table at any given time, and this makes the display page take 2-3 seconds to load, sometimes 7-10, because I have millions of records. I am wondering how I can get this page to load faster.
SELECT COUNT(*) AS Count FROM visits
My first response is . . . if you are paging records 10 at a time, why do you need the total count of more than a million?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from t;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.
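A minimal sketch of that trigger approach (the one-row visit_counts table is an assumption):
-- one-row table holding the running total
CREATE TABLE visit_counts (total BIGINT NOT NULL);
INSERT INTO visit_counts VALUES (0);

CREATE TRIGGER visits_ai AFTER INSERT ON visits
FOR EACH ROW UPDATE visit_counts SET total = total + 1;

CREATE TRIGGER visits_ad AFTER DELETE ON visits
FOR EACH ROW UPDATE visit_counts SET total = total - 1;
Reading the count is then a constant-time SELECT total FROM visit_counts.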
The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
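A sketch of that setup (the SummaryTable layout here is an assumption):
-- one row per hour
CREATE TABLE SummaryTable (
    hour_start DATETIME PRIMARY KEY,
    hourly_count INT NOT NULL
);

-- run for every new visit: bump the current hour's counter,
-- creating the row on first use
INSERT INTO SummaryTable (hour_start, hourly_count)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hourly_count = hourly_count + 1;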
But, before we take this too far, please inform us of how often a new 'visit' occurs.
You cannot make that query faster without changing the server's hardware or adding more servers to run it in parallel. In the second case, it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by having a temporary table where you record the access logs for the past hour/day, and after that time run a cronjob that deletes the data or moves it to another table for long-term storage.
You usually do not need to know the exact number of rows for pagination:
SELECT COUNT(*) FROM
(SELECT * FROM visits LIMIT 10000) AS v
would tell you that there are at least 1,000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need the exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive it is, of course.
Decide on a limit (say, the 1,000 most recent entries) from a practical (business requirements) point of view. Have an auto_increment index (id) or a timestamp (createdon). Grab at most 1,000 records:
select count(*) from (select id from visits order by id desc limit 1000) as t
or grab all 1,000 and paginate on the client side (PHP), since even if you paginate in MySQL it will still go through those records:
select * from visits order by id desc limit 1000

Fast mysql query to randomly select N usernames

In my JSP application I have a search box that lets users search for user names in the database. I send an AJAX call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow, as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by @user1461434 is quite interesting. It describes a solution with almost constant performance. The only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.) A sketch of this follows the list.
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
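A minimal sketch of that random-column trick, applied to the question's table (rand_val and the index name are assumptions):
-- one-time setup: a persistent, indexed random value per row
ALTER TABLE tbl_mst_users ADD COLUMN rand_val DOUBLE;
UPDATE tbl_mst_users SET rand_val = RAND();
CREATE INDEX idx_rand_val ON tbl_mst_users (rand_val);

-- pick the first row at or above a random point; the derived table
-- makes sure RAND() is evaluated once, not once per row
SELECT u.userid, u.name, u.pic
FROM tbl_mst_users u
JOIN (SELECT RAND() AS r) AS pick ON u.rand_val >= pick.r
ORDER BY u.rand_val
LIMIT 1;
As noted, this yields one row per pick; for 5 names you would repeat the pick, or raise the LIMIT and accept that the rows are adjacent in rand_val order. A pick can occasionally land past the largest rand_val and return nothing, in which case you retry.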

how to repeatedly retrieve certain amount of data from a large data set?

I have an INVENTORY table with many items in it. When I query it, I want to retrieve just 50 items at a time, fetch the next 50 if the user requests them, and repeat this process until the end of the records. I am using a MySQL database. How can I do this in plain SQL, if possible?
Thanks,
SELECT * FROM Inventory
WHERE whatever
ORDER BY something
LIMIT 50 OFFSET 0
gets you the first 50; the second time, use OFFSET 50; the third time, use OFFSET 100; and so on. It can be a problem if the table changes between requests, of course; if you do have that problem, solving it can be costly (force locks, make a copy of the table, or other unpleasant solutions -- there is no magic).
If your client just cannot remember the offset it last used, you can store it in a separate auxiliary table, but it does complicate the SQL a lot.
Use LIMIT:
SELECT ... FROM ... LIMIT <offset>, <number>
In your case <number> would be 50 and <offset> would increase by 50 each request.
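Concretely, for the inventory example, the second page of 50 can be fetched with either of these equivalent spellings:
SELECT * FROM Inventory WHERE whatever ORDER BY something LIMIT 50 OFFSET 50;
SELECT * FROM Inventory WHERE whatever ORDER BY something LIMIT 50, 50;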

MySQL pagination without double-querying?

I was wondering if there was a way to get the number of results from a MySQL query, and at the same time limit the results.
The way pagination works (as I understand it) is to first do something like:
query = SELECT COUNT(*) FROM `table` WHERE `some_condition`
After I get the num_rows(query), I have the number of results. But then to actually limit my results, I have to do a second query:
query2 = SELECT * FROM `table` WHERE `some_condition` LIMIT 0, 10
Is there any way to both retrieve the total number of results that would be given, AND limit the results returned in a single query? Or are there any other efficient ways of achieving this?
I almost never do two queries.
Simply return one more row than is needed, only display 10 on the page, and if there are more than are displayed, display a "Next" button.
SELECT x, y, z FROM `table` WHERE `some_condition` LIMIT 0, 11
// Iterate through and display 10 rows.
// if there were 11 rows, display a "Next" button.
Your query should return results with the most relevant first; chances are most people aren't going to care about going to page 236 out of 412.
When you do a google search and your results aren't on the first page, you likely go to page two, not nine.
No, that's how many applications that want to paginate have to do it. It's reliable and bullet-proof, albeit it runs the query twice; but you can cache the count for a few seconds and that will help a lot.
The other way is to use the SQL_CALC_FOUND_ROWS modifier and then call SELECT FOUND_ROWS(). Apart from the fact that you have to put the FOUND_ROWS() call afterwards, there is a problem with this: there is a bug in MySQL that this tickles, affecting ORDER BY queries and making it much slower on large tables than the naive approach of two queries.
Another approach to avoiding double-querying is to fetch all the rows for the current page using a LIMIT clause first, then only do a second COUNT(*) query if the maximum number of rows were retrieved.
In many applications, the most likely outcome will be that all of the results fit on one page, and having to do pagination is the exception rather than the norm. In these cases, the first query will not retrieve the maximum number of results.
For example, answers on a Stackoverflow question rarely spill onto a second page. Comments on an answer rarely spill over the limit of 5 or so required to show them all.
So in these applications you can simply just do a query with a LIMIT first, and then as long as that limit is not reached, you know exactly how many rows there are without the need to do a second COUNT(*) query - which should cover the majority of situations.
In most situations it is much faster and less resource intensive to do it in two separate queries than to do it in one, even though that seems counter-intuitive.
If you use SQL_CALC_FOUND_ROWS, then for large tables it makes your query much slower, significantly slower even than executing two queries, the first with a COUNT(*) and the second with a LIMIT. The reason for this is that SQL_CALC_FOUND_ROWS causes the LIMIT clause to be applied after fetching the rows instead of before, so it fetches the entire row for all possible results before applying the limits. This can't be satisfied by an index because it actually fetches the data.
If you take the two-query approach, the first query only fetches COUNT(*) and no actual row data; this can be satisfied much more quickly because it can usually use indexes and doesn't have to fetch the actual row data for every row it looks at. Then, the second query only needs to look at the first $offset + $limit rows and then return.
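Using the question's placeholders, the two-query approach is simply:
-- 1) the count can often be satisfied from an index alone
SELECT COUNT(*) FROM `table` WHERE `some_condition`;

-- 2) fetch only the rows for the current page
SELECT * FROM `table` WHERE `some_condition` LIMIT 0, 10;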
This post from the MySQL performance blog explains this further:
http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
For anyone looking for an answer in 2020. As per MySQL documentation:
The SQL_CALC_FOUND_ROWS query modifier and accompanying FOUND_ROWS() function are deprecated as of MySQL 8.0.17 and will be removed in a future MySQL version. As a replacement, considering executing your query with LIMIT, and then a second query with COUNT(*) and without LIMIT to determine whether there are additional rows.
I guess that settles that.
My answer may be late, but you can skip the second query (with the limit) and just filter the rows in your back-end script. In PHP, for instance, you could do something like:
if (count($queryResult) > 0) {
    $counter = 0;
    foreach ($queryResult as $result) {
        // keep only the rows that fall inside the current page
        if ($counter >= $startAt && $counter < $startAt + $numOfRows) {
            // do what you want here
        }
        $counter++;
    }
}
But of course, when you have thousands of records to consider, this becomes inefficient very fast. A pre-calculated count may be a good idea to look into.
Here's a good read on the subject:
http://www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf
SELECT col, col2, (SELECT COUNT(*) FROM `table` WHERE `some_condition`) / 10 AS total FROM `table` WHERE `some_condition` LIMIT 0, 10
Here 10 is the page size and 0 is the offset; for page N, use an offset of (N - 1) * 10 in the query. Note that the COUNT subquery has to repeat the WHERE clause, otherwise the total will not match the filtered result set.
You can reuse most of the query in a subquery and alias it. For example, a movie query that finds movies containing the letter 's', ordered by runtime, would look like this on my site:
SELECT Movie.*, (
SELECT Count(1) FROM Movie
INNER JOIN MovieGenre
ON MovieGenre.MovieId = Movie.Id AND MovieGenre.GenreId = 11
WHERE Title LIKE '%s%'
) AS Count FROM Movie
INNER JOIN MovieGenre
ON MovieGenre.MovieId = Movie.Id AND MovieGenre.GenreId = 11
WHERE Title LIKE '%s%' LIMIT 8;
Do note that I'm not a database expert, and am hoping someone will be able to optimize that a bit better. As it stands running it straight from the SQL command line interface they both take ~0.02 seconds on my laptop.
SELECT *
FROM table
WHERE some_condition
ORDER BY RAND()
LIMIT 0, 10

mysql, force limited entries in a table

I keep some temporary data in a MEMORY table. I only need the 20 most recent entries and would prefer that the data always stay on the heap. How should I accomplish this? I am sure there's nothing I can do about the memory table itself, but how should I handle the entries? Should I add an auto-increment key and delete the oldest entry whenever I want to push a new value in?
Could you please describe in more detail what you are trying to do? I don't see why you want to keep the most recent data in an additional table when you can just use a SELECT with descending order and LIMIT 20. If the SELECT query is too expensive, then just cache the result using memcached or similar and clear the cache every time new data is inserted.
If the additional table is really necessary, there are several ways to prune old data from it. Either you fetch the id of the 20th most recent entry (again with descending order and LIMIT 19,1) and delete everything that has a smaller id (in case you have an auto increment index, timestamp, etc.), or you SELECT COUNT(*) and then do a DELETE with ascending order and a LIMIT of (all items - 20). This could be packed into a cronjob that runs every few minutes. A sketch of the first variant follows.
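Here, the extra table is assumed to be called recent_entries with an auto-increment id; the double-nested subquery works around MySQL's restriction on referencing the delete target in a subquery:
DELETE FROM recent_entries
WHERE id < (
    SELECT min_id FROM (
        -- id of the 20th most recent entry
        SELECT id AS min_id
        FROM recent_entries
        ORDER BY id DESC
        LIMIT 19, 1
    ) AS cutoff
);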
But I would really recommend using a cache and looking at the table definition. With a decent index there shouldn't be any problems.
Appending to the 20-entry table and removing the oldest element (i.e. the one with the minimum ID?) is possible. However, note that this will fragment the table.
That's OK so long as you run OPTIMIZE every once in a while.
A different way would be to pre-allocate 20 entries and keep a separate counter of which entry is the latest. Then instead of insert/delete, you would update the item ID based on the counter, which you would then increment (mod 20 + 1) and store again.
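A rough sketch of that second model (recent_entries, slot, and payload are hypothetical names; the counter itself lives in the application):
-- overwrite the oldest of the 20 pre-allocated slots instead of
-- inserting and deleting; @slot carries the application's counter value
UPDATE recent_entries
SET payload = 'new value', created_at = NOW()
WHERE slot = @slot;
The application then advances its counter (mod 20, plus 1) and stores it again, exactly as described above.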
However note that both of these models work only under a "single-threaded" model. If multiple threads are running on the table it's possible that they'll conflict.
If the counter is in program memory, shared by threads but guarded properly, that will be both thread-safe and efficient.