I'm having a problem using ORDER BY in MySQL. I have a table called "site" with 3 fields: id, name, rank. The table contains around 1.4 million records. When I run a query like,
select name from site limit 50000,10;
it returns 10 records in 7.45 seconds [checked via terminal]. But when I add ORDER BY to the above query, like,
select name from site order by id limit 50000,10;
the query never seems to complete. Since id is set as the primary key, I thought it wouldn't need another index to speed up my query, but I don't know where the mistake is.
Any help is greatly appreciated. Thanks.
This is "to be expected" with large LIMIT values:
From http://www.mysqlperformanceblog.com/2006/09/01/order-by-limit-performance-optimization/
Beware of large LIMIT Using index to sort is efficient if you need
first few rows, even if some extra filtering takes place so you need
to scan more rows by index then requested by LIMIT. However if you’re
dealing with LIMIT query with large offset efficiency will suffer.
LIMIT 1000,10 is likely to be way slower than LIMIT 0,10. It is true
most users will not go further than 10 page in results, however Search
Engine Bots may very well do so. I’ve seen bots looking at 200+ page
in my projects. Also for many web sites failing to take care of this
provides very easy task to launch a DOS attack – request page with
some large number from few connections and it is enough. If you do not
do anything else make sure you block requests with too large page
numbers.
For some cases, for example if results are static it may make sense to
precompute results so you can query them for positions. So instead of
query with LIMIT 1000,10 you will have WHERE position between 1000 and
1009 which has same efficiency for any position (as long as it is
indexed)
AND
One more note about ORDER BY … LIMIT is – it provides scary explain
statements and may end up in slow query log as query which does not
use indexes
The last point is THE important point in your case - the combination of ORDER BY and LIMIT with a big table (1.4m) and the "not use indexes" (even if there are indexes!) in this case makes for really slow performance...
EDIT - as per comment:
For this specific case you should use select name from site order by id and handle the splitting of the resultset into chunks of 50,000 each in your code!
Can you try this:
SELECT name
FROM site
WHERE id >= ( SELECT id
              FROM site
              ORDER BY id
              LIMIT 50000, 1
            )
ORDER BY id
LIMIT 10 ;
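To see that the subquery form above returns the same page as the plain offset query, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, a small generated table, and the hypothetical helper names build_site_table, page_by_offset and page_by_subquery. It only checks that the two queries agree; it says nothing about timing on a real 1.4m-row MySQL table.

```python
import sqlite3


def build_site_table(n_rows):
    """Create an in-memory stand-in for the `site` table (generated data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO site (id, name) VALUES (?, ?)",
        ((i, "site-%d" % i) for i in range(1, n_rows + 1)),
    )
    return conn


def page_by_offset(conn, offset, page_size):
    # Plain form from the question: ORDER BY with a large offset.
    sql = "SELECT name FROM site ORDER BY id LIMIT ? OFFSET ?"
    return [row[0] for row in conn.execute(sql, (page_size, offset))]


def page_by_subquery(conn, offset, page_size):
    # Rewritten form: the subquery reads only the primary key to find the
    # first id of the page, then the outer query reads page_size rows from there.
    sql = """
        SELECT name
        FROM site
        WHERE id >= (SELECT id FROM site ORDER BY id LIMIT 1 OFFSET ?)
        ORDER BY id
        LIMIT ?
    """
    return [row[0] for row in conn.execute(sql, (offset, page_size))]
```

If the offset is past the end of the table, the subquery yields NULL and the rewritten query returns no rows, which matches the plain form.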
I'm working on a database containing over 5 million rows.
Question 1.
At the moment I'm doing the following:
SELECT COUNT(*) FROM cars
Count total rows to be returned. The above example is very basic. Queries do get more complex with WHERE clause.
I'm showing 50 rows per page. Using PHP I count total pages and offset based on current page retrieved from PHP $_GET. This gets passed to the following query:
SELECT ID FROM cars ORDER BY ID DESC LIMIT $offset, 50
I fetch the IDs of all rows to be displayed on the current page and put them in a single string.
$ID_list = implode( ',', array_column( $mysqli_fetch, 'ID' ) );
This then gets passed to final query.
SELECT ID, make, model, year, price FROM cars WHERE ID IN ($ID_list)
Performance wise I find that passing IDs to third query is up to 8 times faster than just selecting all required columns in second query.
What is the most efficient way to paginate results while displaying the total row count and page numbers? OFFSET/LIMIT pagination is not efficient, but with the seek method it is not possible to display page numbers. Is there an alternative method? Maybe I should look into technologies other than MySQLi?
Question 2.
What is the best approach in displaying all possible search results of returned data?
https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=B4%206TB&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&page=1
The search in the website above starts with no filters applied. Now I can click on for example, Make and it shows a number of possible results next to car brand name. Same goes for every other option. How is this achieved?
Question 1's issues and solution are discussed in http://mysql.rjweb.org/doc.php/pagination
That strongly recommends "remember where you left off" instead of OFFSET, providing a significant performance improvement. It gets rid of $ID_list and lets you do the two SELECTs as one (which is another performance benefit). (Your 8x improvement was due to the combination of selecting multiple columns and skipping over rows (OFFSET).)
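A minimal sketch of "remember where you left off" for the cars listing, assuming Python's built-in sqlite3 module as a stand-in for MySQL. The helper names make_cars and next_page and the generated rows are hypothetical. The cursor passed between requests is simply the smallest ID seen on the previous page.

```python
import sqlite3


def make_cars(n_rows):
    """Build a small in-memory `cars` table with generated data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars (ID INTEGER PRIMARY KEY, make TEXT, model TEXT, "
        "year INTEGER, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cars VALUES (?, ?, ?, ?, ?)",
        (
            (i, "make-%d" % (i % 5), "model-%d" % (i % 7), 2000 + i % 20, 1000 + i)
            for i in range(1, n_rows + 1)
        ),
    )
    return conn


def next_page(conn, last_seen_id=None, page_size=50):
    """Return (rows, cursor) for the page that follows `last_seen_id`.

    Rows are ordered by ID DESC. `cursor` is the smallest ID on the page,
    or None when the page is empty.
    """
    columns = "ID, make, model, year, price"
    if last_seen_id is None:
        # First page: no starting point yet.
        sql = "SELECT %s FROM cars ORDER BY ID DESC LIMIT ?" % columns
        rows = conn.execute(sql, (page_size,)).fetchall()
    else:
        # Later pages: seek past the last row already shown instead of using OFFSET.
        sql = "SELECT %s FROM cars WHERE ID < ? ORDER BY ID DESC LIMIT ?" % columns
        rows = conn.execute(sql, (last_seen_id, page_size)).fetchall()
    cursor = rows[-1][0] if rows else None
    return rows, cursor
```

Each call reads only page_size rows through the primary key, however deep the user has paged. The trade-off mentioned in the question still applies: this gives "next" navigation, not a jump to an arbitrary page number.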
Question 2 is more difficult since you want to do multiple counts. Try using GROUP BY and COUNT(*) to get all the counts in a single query. The risk is that this might involve so much data (eg, all 5M rows) that it takes "too long". In the few cases where a "covering" index is available, it might not be "too long".
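As a rough illustration of the GROUP BY and COUNT(*) suggestion, the sketch below computes the count shown next to each value of one filterable column. It assumes Python's built-in sqlite3 module as a stand-in for MySQL. The names ALLOWED_FACETS, make_cars_table and facet_counts are hypothetical, as is the sample data.

```python
import sqlite3

# Hypothetical whitelist: column names cannot be bound as query parameters,
# so only known-safe names are ever interpolated into the SQL text.
ALLOWED_FACETS = {"make", "model", "year"}


def make_cars_table(rows):
    """Build an in-memory `cars` table from (make, model, year) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars (ID INTEGER PRIMARY KEY, make TEXT, model TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO cars (make, model, year) VALUES (?, ?, ?)", rows)
    return conn


def facet_counts(conn, column):
    """Return {value: number_of_rows} for one facet column."""
    if column not in ALLOWED_FACETS:
        raise ValueError("unknown facet column: %r" % (column,))
    sql = "SELECT {col}, COUNT(*) FROM cars GROUP BY {col}".format(col=column)
    return dict(conn.execute(sql).fetchall())
```

On a table of millions of rows each such query may still scan a lot of data, which is the cost the answer warns about.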
You could do big group-bys every night -- counts by make and no filtering, counts by model-year and no filtering, etc. Store those in a table for quickly fetching. Once you add filtering, the complexity makes this impractical. Note: doing such a nightly tally implies that you analyze the user's request in order to tailor the SELECT.
Even the count-how-many-rows-we-are-about-to-page-through (of Question 1) may be too costly.
See this for how to segregate the "common" attributes from the "rare" ones: http://mysql.rjweb.org/doc.php/eav . That leads to having several composite indexes of 2-3 columns in order to handle most of the SELECTs from people with random filtering criteria.
Keep the table size down by using minimal datatypes. Model_year could use a 1-byte YEAR datatype. An auto_inc for 5M cars could use a 3-byte MEDIUMINT UNSIGNED (16M limit).
Normalization (replacing a long string with a short id) saves space, but is likely to cost too much when the queries filter on multiple criteria. Eg: make = 'Ford' AND model = 'F150'.
AND is relatively easy to optimize in a WHERE clause; IN is worse and OR is even worse. For some of the IN and OR cases, you may need to resort to UNION to get rid of them. Example:
( SELECT ... WHERE make = 'BMW' )
UNION ALL
( SELECT ... WHERE make = 'Audi' )
There will be a number of other cases where you really need to "construct" the query in your app code, not simply hope that MySQL can do something optimal.
The above UNION does not allow for pagination; see my links on how to deal with such.
When I add LIMIT 1 to a MySQL query, does it stop the search after it finds 1 result (thus making it faster) or does it still fetch all of the results and truncate at the end?
Depending on the query, adding a limit clause can have a huge effect on performance. If you want only one row (or know for a fact that only one row can satisfy the query), and are not sure about how the internal optimizer will execute it (for example, WHERE clause not hitting an index and so forth), then you should definitely add a LIMIT clause.
As for optimized queries (using indexes on small tables), it probably won't matter much for performance, but again: if you are only interested in one row, then add a LIMIT clause regardless.
LIMIT can affect the performance of the query (see comments and the link below), and it also reduces the result set that MySQL outputs. For a query in which you expect a single result, there are benefits.
Moreover, limiting the result set can in fact speed up the total query time, as transferring large result sets uses memory and can create temporary tables on disk. I mention this because I recently saw an application that did not use LIMIT kill a server due to huge result sets; with LIMIT in place, resource utilization dropped tremendously.
Check this page for more specifics: MySQL Documentation: LIMIT Optimization
The answer, in short, is yes. If you limit your result to 1, then even if you are "expecting" one result, the query will be faster because your database won't look through all your records. It will simply stop once it finds a record that matches your query.
If there is only 1 result coming back, then no, LIMIT will not make it any faster. If there are a lot of results, you only need the first one, and there is no GROUP BY or ORDER BY clause, then LIMIT will make it faster.
If you really only expect one single result, it makes sense to append the LIMIT to your query. I don't know the inner workings of MySQL, but I'm sure it won't gather a result set of 100,000+ records just to truncate it back to 1 at the end.
In my JSP application I have a search box that lets the user search for user names in the database. I send an AJAX call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow, as I have more than 2000 records in my table.
Is there a better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
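The random-number-column trick from item 2 can be sketched as follows. This is not MediaWiki's actual code. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, a hypothetical users table with an indexed rnd column, and the "next larger value, wrapping around to the smallest" variant, which is one arbitrary choice since the answer does not recall the direction.

```python
import random
import sqlite3


def make_users_table(rows):
    """Build an in-memory `users` table from (userid, name, rnd) tuples.

    In a real table, rnd would be filled with a random value in [0, 1)
    when each row is inserted.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT, rnd REAL)")
    conn.execute("CREATE INDEX idx_users_rnd ON users (rnd)")
    conn.executemany("INSERT INTO users (userid, name, rnd) VALUES (?, ?, ?)", rows)
    return conn


def pick_random_row(conn, r=None):
    """Return the row whose rnd is the smallest value >= r, wrapping around."""
    if r is None:
        r = random.random()
    row = conn.execute(
        "SELECT userid, name FROM users WHERE rnd >= ? ORDER BY rnd LIMIT 1", (r,)
    ).fetchone()
    if row is None:
        # r is larger than every stored value: wrap to the smallest rnd.
        row = conn.execute(
            "SELECT userid, name FROM users ORDER BY rnd LIMIT 1"
        ).fetchone()
    return row
```

Each lookup is an index seek rather than a sort of every matching row. As the answer notes, rows that follow a large gap in the rnd values are picked more often, so the distribution is only as even as the stored numbers.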
Does anyone have experience with query of the form
select * from TableX where columnY = 'Z' limit some_limits
For example:
select * from Topics where category_id = 100
Here columnY is indexed, but it's not the primary key. ColumnY = Z could return an unpredictable number of rows (from zero to a few thousands).
I'm only wondering about the case of a quite large dataset, for example more than 10 million items in TableX. What is the performance of such a query?
A little detail about the performance should be nice (I mean specific big-O analysis, for example).
It depends on the number of records found. If your query returns a large number of records, it may take a long time to load in the browser, and an even larger result could make the browser unresponsive. But that is a matter of how you execute the query. A better solution for such problems is to limit the query with relevant limits, as you did. Furthermore, instead of limiting manually, you can use LIMIT in a loop: fetch up to a certain index, then start again from the following index. I am answering in the context of programming. Hope this answers your question.
I was wondering if there was a way to get the number of results from a MySQL query, and at the same time limit the results.
The way pagination works (as I understand it) is to first do something like:
query = SELECT COUNT(*) FROM `table` WHERE `some_condition`
After I get the num_rows(query), I have the number of results. But then to actually limit my results, I have to do a second query:
query2 = SELECT * FROM `table` WHERE `some_condition` LIMIT 0, 10
Is there any way to both retrieve the total number of results that would be given, AND limit the results returned in a single query? Or are there any other efficient ways of achieving this?
I almost never do two queries.
Simply return one more row than is needed, only display 10 on the page, and if there are more than are displayed, display a "Next" button.
SELECT x, y, z FROM `table` WHERE `some_condition` LIMIT 0, 11
// Iterate through and display 10 rows.
// if there were 11 rows, display a "Next" button.
Your query should return the most relevant results first; chances are most people aren't going to care about going to page 236 out of 412.
When you do a google search and your results aren't on the first page, you likely go to page two, not nine.
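The fetch-one-extra-row idea from this answer can be sketched like this. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, a hypothetical items table with columns x, y, z, and the hypothetical helper names make_items and fetch_page_with_next_flag.

```python
import sqlite3


def make_items(n_rows):
    """Build a small in-memory `items` table with generated data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (x INTEGER PRIMARY KEY, y TEXT, z TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        ((i, "y%d" % i, "z%d" % i) for i in range(1, n_rows + 1)),
    )
    return conn


def fetch_page_with_next_flag(conn, offset, page_size=10):
    """Return (rows_to_display, has_next) using a single query."""
    # Ask for one row more than will be displayed.
    rows = conn.execute(
        "SELECT x, y, z FROM items ORDER BY x LIMIT ? OFFSET ?",
        (page_size + 1, offset),
    ).fetchall()
    # The extra row is never shown; its presence only means another page exists.
    has_next = len(rows) > page_size
    return rows[:page_size], has_next
```

No COUNT(*) query is issued at all, at the price of not knowing the total number of pages.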
No, that's how many applications that want to paginate have to do it. It's reliable and bullet-proof, albeit it makes the query twice, but you can cache the count for a few seconds and that will help a lot.
The other way is to use SQL_CALC_FOUND_ROWS clause and then call SELECT FOUND_ROWS(). Apart from the fact you have to put the FOUND_ROWS() call afterwards, there is a problem with this: there is a bug in MySQL that this tickles which affects ORDER BY queries making it much slower on large tables than the naive approach of two queries.
Another approach to avoiding double-querying is to fetch all the rows for the current page using a LIMIT clause first, then only do a second COUNT(*) query if the maximum number of rows were retrieved.
In many applications, the most likely outcome will be that all of the results fit on one page, and having to do pagination is the exception rather than the norm. In these cases, the first query will not retrieve the maximum number of results.
For example, answers on a Stackoverflow question rarely spill onto a second page. Comments on an answer rarely spill over the limit of 5 or so required to show them all.
So in these applications you can simply just do a query with a LIMIT first, and then as long as that limit is not reached, you know exactly how many rows there are without the need to do a second COUNT(*) query - which should cover the majority of situations.
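A minimal sketch of this "LIMIT first, COUNT only when the page is full" approach, assuming Python's built-in sqlite3 module as a stand-in for MySQL. The comments table and the helper names make_comments and fetch_page_and_total are hypothetical. The third return value reports whether the second query was actually needed.

```python
import sqlite3


def make_comments(n_rows):
    """Build a small in-memory `comments` table with generated data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?)",
        ((i, "comment %d" % i) for i in range(1, n_rows + 1)),
    )
    return conn


def fetch_page_and_total(conn, page_size, offset=0):
    """Return (rows, total_rows, ran_count_query)."""
    rows = conn.execute(
        "SELECT id, body FROM comments ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    # A short page means the end of the result was reached, so the total is
    # already known. An empty page at a non-zero offset proves nothing.
    if len(rows) < page_size and (rows or offset == 0):
        return rows, offset + len(rows), False
    # Full page (or empty page past the end): fall back to a separate COUNT(*).
    total = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
    return rows, total, True
```

When most result sets fit on one page, the COUNT(*) branch is rarely taken.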
In most situations it is much faster and less resource intensive to do it in two separate queries than to do it in one, even though that seems counter-intuitive.
If you use SQL_CALC_FOUND_ROWS, then for large tables it makes your query much slower, significantly slower even than executing two queries, the first with a COUNT(*) and the second with a LIMIT. The reason for this is that SQL_CALC_FOUND_ROWS causes the LIMIT clause to be applied after fetching the rows instead of before, so it fetches the entire row for all possible results before applying the limits. This can't be satisfied by an index because it actually fetches the data.
If you take the two-query approach, with the first one only fetching COUNT(*) and not fetching any actual data, it can be satisfied much more quickly because it can usually use indexes and doesn't have to fetch the actual row data for every row it looks at. Then, the second query only needs to look at the first $offset + $limit rows and then return.
This post from the MySQL performance blog explains this further:
http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
For more information on optimising pagination, check this post and this post.
For anyone looking for an answer in 2020. As per MySQL documentation:
The SQL_CALC_FOUND_ROWS query modifier and accompanying FOUND_ROWS() function are deprecated as of MySQL 8.0.17 and will be removed in a future MySQL version. As a replacement, considering executing your query with LIMIT, and then a second query with COUNT(*) and without LIMIT to determine whether there are additional rows.
I guess that settles that.
My answer may be late, but you can skip the second query (with the limit) and just filter the info through your back end script. In PHP for instance, you could do something like:
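// Assumes $queryResult is an array of rows already fetched from the unrestricted query.
if (count($queryResult) > 0) {
    $counter = 0;
    foreach ($queryResult as $result) {
        // Keep only the rows that belong to the current page.
        if ($counter >= $startAt && $counter < $startAt + $numOfRows) {
            //do what you want here
        }
        $counter++;
    }
}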
But of course, when you have thousands of records to consider, it becomes inefficient very fast. A pre-calculated count may be a good idea to look into.
Here's a good read on the subject:
http://www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf
SELECT col, col2, (SELECT COUNT(*) FROM `table`) / 10 AS total FROM `table` WHERE `some_condition` LIMIT 0, 10
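Here 10 is the page size and 0 is the row offset, which is (pageNumber - 1) multiplied by the page size. The total column is the row count divided by the page size, so round it up to get the number of pages. Note that the subquery counts every row in the table; repeat `some_condition` inside it if the count should match the filtered results.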
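The arithmetic behind the LIMIT arguments and the page total can be written out as a small sketch. The function names page_to_offset and total_pages are hypothetical, and page numbers are assumed to start at 1.

```python
def page_to_offset(page_number, page_size):
    """Row offset for a 1-based page number, e.g. page 3 of size 10 starts at row 20."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    return (page_number - 1) * page_size


def total_pages(total_rows, page_size):
    """Number of pages needed to show total_rows, rounding up."""
    if total_rows < 0 or page_size < 1:
        raise ValueError("total_rows must be >= 0 and page_size positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_rows // page_size)
```

For example, page_to_offset(1, 10) gives the 0 used in LIMIT 0, 10, and 95 rows at 10 per page need 10 pages.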
You can reuse most of the query in a subquery and set it to an identifier. For example a movie query that finds movies containing the letter 's' ordering by runtime would look like this on my site.
SELECT Movie.*, (
SELECT Count(1) FROM Movie
INNER JOIN MovieGenre
ON MovieGenre.MovieId = Movie.Id AND MovieGenre.GenreId = 11
WHERE Title LIKE '%s%'
) AS Count FROM Movie
INNER JOIN MovieGenre
ON MovieGenre.MovieId = Movie.Id AND MovieGenre.GenreId = 11
WHERE Title LIKE '%s%' LIMIT 8;
Do note that I'm not a database expert, and am hoping someone will be able to optimize that a bit better. As it stands running it straight from the SQL command line interface they both take ~0.02 seconds on my laptop.
SELECT *
FROM table
WHERE some_condition
ORDER BY RAND()
LIMIT 0, 10