MySQL matching query performance on large dataset

Does anyone have experience with queries of the form
select * from TableX where columnY = 'Z' limit some_limits
For example:
select * from Topics where category_id = 100
Here columnY is indexed, but it's not the primary key. columnY = 'Z' could return an unpredictable number of rows (from zero to a few thousand).
I'm only wondering about the case of a fairly large dataset, for example more than 10 million rows in TableX. What is the performance of such a query?
A little detail about the performance would be nice (a specific big-O analysis, for example).

It depends on the number of records found. If your query returns a large number of records it may take a while to load in the browser, and an even larger result set could make the browser unresponsive. It also depends on how you execute the query. The better solution for such problems is to limit the query, as you did, with a sensible LIMIT. Furthermore, instead of limiting manually you can run the query in a loop, fetching rows up to a certain index and starting the next pass from the following index. I am answering in the context of application code. Hope this answers your question.
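As a rough sketch of that chunked approach (assuming Topics has an auto-increment primary key id; the chunk size of 1000 is arbitrary), each pass remembers the last id it saw instead of scanning from the start:
SELECT *
FROM Topics
WHERE category_id = 100
  AND id > 0            -- replace 0 with the largest id returned by the previous chunk
ORDER BY id
LIMIT 1000;
-- repeat until a chunk returns fewer than 1000 rows
For this to stay fast on a 10M-row table, a composite index on (category_id, id) helps, since each chunk can then be located directly.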

Related

Pagination and how to return all possible options together with database search results?

I'm working on a database containing over 5 million rows.
Question 1.
At the moment I'm doing the following:
SELECT COUNT(*) FROM cars
This counts the total rows to be returned. The above example is very basic; the queries do get more complex with WHERE clauses.
I'm showing 50 rows per page. Using PHP I calculate the total number of pages and the offset based on the current page retrieved from $_GET. This gets passed to the following query:
SELECT ID FROM cars ORDER BY ID DESC LIMIT $offset, 50
I fetch all IDs of rows to be displayed on the current page and put them in a single string:
$ID_list = implode( ',', array_column( $mysqli_fetch, 'ID' ) );
This then gets passed to final query.
SELECT ID, make, model, year, price FROM cars WHERE ID IN ($ID_list)
Performance-wise, I find that passing the IDs to the third query is up to 8 times faster than just selecting all required columns in the second query.
What is the most efficient way to paginate results while displaying a total row count and page numbers? OFFSET/LIMIT pagination is not efficient, but with the seek method it is not possible to display page numbers. Is there an alternative method? Maybe I should look into technologies other than MySQLi?
Question 2.
What is the best approach to displaying all possible search options for the returned data?
https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=B4%206TB&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&page=1
The search on the website above starts with no filters applied. I can then click on, for example, Make, and it shows the number of possible results next to each car brand name. The same goes for every other option. How is this achieved?
Question 1's issues and solution are discussed in http://mysql.rjweb.org/doc.php/pagination
That strongly recommends "remember where you left off" instead of OFFSET, providing a significant performance improvement. It gets rid of $ID_list and lets you do the two SELECTs as one (which is another performance benefit). (Your 8x improvement was due to the combination of selecting multiple columns and skipping over rows with OFFSET.)
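A minimal sketch of that "remember where you left off" idea against the cars table from the question (assuming ID is the auto-increment primary key and that your code remembers the last ID shown on the previous page):
-- first page
SELECT ID, make, model, year, price
FROM cars
ORDER BY ID DESC
LIMIT 50;

-- next page: 123456 stands in for the smallest ID shown on the previous page
SELECT ID, make, model, year, price
FROM cars
WHERE ID < 123456
ORDER BY ID DESC
LIMIT 50;
Because the WHERE clause seeks directly to the right spot in the primary key, no rows are skipped over, unlike with OFFSET.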
Question 2 is more difficult since you want to do multiple counts. Try using GROUP BY and COUNT(*) to get all the counts in a single query. The risk is that this might involve so much data (eg, all 5M rows) that it takes "too long". In the few cases where a "covering" index is available, it might not be "too long".
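For example, a sketch of one such faceted count for the cars table (the make column and the year filter are only illustrative stand-ins for whatever the user has already selected):
SELECT make, COUNT(*) AS num_matches
FROM cars
WHERE year >= 2015
GROUP BY make;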
You could do big GROUP BYs every night -- counts by make with no filtering, counts by model year with no filtering, etc. Store those in a table for quick fetching. Once you add filtering, the complexity makes this impractical. Note: doing such a nightly tally implies that you analyze the user's request in order to tailor the SELECT.
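A hedged sketch of such a nightly tally (the summary table and its columns are hypothetical):
CREATE TABLE car_counts_by_make (
    make      VARCHAR(50) NOT NULL PRIMARY KEY,
    car_count INT UNSIGNED NOT NULL
);

-- rebuild once a night
TRUNCATE TABLE car_counts_by_make;
INSERT INTO car_counts_by_make (make, car_count)
SELECT make, COUNT(*)
FROM cars
GROUP BY make;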
Even the count-how-many-rows-we-are-about-to-page-through (of Question 1) may be too costly.
See this for how to segregate the "common" attributes from the "rare" ones: http://mysql.rjweb.org/doc.php/eav . That leads to having several composite indexes of 2-3 columns in order to handle most of the SELECTs from people with random filtering criteria.
Keep the table size down by using minimal datatypes. Model_year could use the 1-byte YEAR datatype. An auto_inc for 5M cars could use a 3-byte MEDIUMINT UNSIGNED (16M limit).
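A sketch of what that shrinking could look like (the model_year column name is an assumption):
ALTER TABLE cars
    MODIFY ID MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 3 bytes, 16M limit
    MODIFY model_year YEAR NOT NULL;                        -- 1 byte instead of an INT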
Normalization (replacing a long string with a short id) saves space, but is likely to cost too much when the queries filter on multiple criteria. Eg: make = 'Ford' AND model = 'F150'.
AND is relatively easy to optimize in a WHERE clause; IN is worse and OR is even worse. For some of the IN and OR cases, you may need to resort to UNION to get rid of them. Example:
( SELECT ... WHERE make = 'BMW' )
UNION ALL
( SELECT ... WHERE make = 'Audi' )
There will be a number of other cases where you really need to "construct" the query in your app code, not simply hope that MySQL can do something optimal.
The above UNION does not allow for pagination; see my links on how to deal with such.

Does long query string affect the speed?

Suppose I have a long query string, for example:
SELECT id from users where collegeid='1' or collegeid='2' . . . collegeid='1000'
Will it affect the speed or output in any way?
SELECT m.id,m.message,m.postby,m.tstamp,m.type,m.category,u.name,u.img
from messages m
join users u on m.postby=u.uid
where m.cid = '1' or m.cid = '1' . . . . . .
or m.cid = '1000'. . . .
I would prefer to use IN in this case. However, to check the performance you can look at the execution plan of the query you are executing. That will give you an idea of the performance difference between the two.
Something like this:
SELECT id from users where collegeid IN ('1','2','3'....,'1000')
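To see how each form is actually executed, you can put EXPLAIN in front of both variants and compare the type, key and rows columns (a rough sketch; the plans you get will depend on your indexes and MySQL version):
EXPLAIN SELECT id FROM users WHERE collegeid IN ('1','2','3');
EXPLAIN SELECT id FROM users WHERE collegeid = '1' OR collegeid = '2' OR collegeid = '3';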
According to the MySQL documentation:
If all values are constants, they are evaluated according to the type
of expr and sorted. The search for the item then is done using a
binary search. This means IN is very quick if the IN value list
consists entirely of constants.
The number of values in the IN list is only limited by the
max_allowed_packet value.
You may also check "IN vs OR in the SQL WHERE Clause" and "MySQL OR vs IN performance".
The answer given by Ergec is very useful:
SELECT * FROM item WHERE id = 1 OR id = 2 ... id = 10000
This query took 0.1239 seconds
SELECT * FROM item WHERE id IN (1,2,3,...10000)
This query took 0.0433 seconds
IN is 3 times faster than OR
will it affect the speed or output in any way?
So the answer is: yes, the performance will be affected.
Obviously, there is no direct correlation between the length of a query string and its processing time (a very short query can be tremendously complex and vice versa). For your specific example, it depends on how the query is processed. This is something you can check by looking at the query execution plan (syntax depends on your DBMS; something like EXPLAIN PLAN). If the DBMS has to perform a full table scan, performance will only be affected slightly, since the DBMS has to visit all pages that make up the table anyhow. If there is an index on collegeid, performance will likely suffer more the more entries you put into your disjunction, since there will be several (though very fast) index lookups. At some point there will be a full index scan instead of individual lookups, at which point performance will not degrade significantly anymore.
However, the details depend on your DBMS and its execution planner.
I'm not sure you are facing the same thing I did.
Actually, the string length is not the problem. How many values are in IN() is what matters.
I've tested how many elements can be listed in IN().
The result: about 10,000 elements can be processed without performance loss.
The values in IN() have to be stored somewhere and searched during query evaluation, and beyond roughly 10k values things start to get slower.
So if you have something like 100k values, split them into 10 groups and run the query 10 times, or save them in a temporary table and JOIN against it (see the sketch below).
Also, a longer query uses more CPU, so IN() is better than column = 1 OR ...
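A rough sketch of the temporary-table variant mentioned above (users and collegeid come from the question; the temporary table itself is hypothetical):
CREATE TEMPORARY TABLE tmp_college_ids (collegeid INT NOT NULL PRIMARY KEY);

-- bulk-load the full list of ids here, in batches if needed
INSERT INTO tmp_college_ids (collegeid) VALUES (1), (2), (3);

SELECT u.id
FROM users u
JOIN tmp_college_ids t ON t.collegeid = u.collegeid;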

How is it possible to have a good EXPLAIN and a slow query?

How is it possible to have a good plan in EXPLAIN like the one below and still have a slow query? Few rows, using an index, no filesort.
The query is running in 9s. The main table has around 500k rows.
When I had 250k rows in that table, the query was running in < 1s.
Any suggestions?
Query (1. the commented-out fields can be enabled according to user choices; 2. without FORCE INDEX I got 14s; 3. I use SQL_NO_CACHE to prevent misleading results):
SELECT SQL_NO_CACHE
p.property_id
, lct.loc_city_name_pt
, lc.loc_community_name_pt
, lc.loc_community_image_num_default
, lc.loc_community_gmap_longitude
, lc.loc_community_gmap_latitude
FROM property as p FORCE INDEX (ix_order_by_perf)
INNER JOIN loc_community lc
ON lc.loc_community_id = p.property_loc_community_id
INNER JOIN loc_city lct FORCE INDEX (loc_city_id)
ON lct.loc_city_id = lc.loc_community_loc_city_id
INNER JOIN property_attribute pa
ON pa.property_attribute_property_id = p.property_id
WHERE p.property_published = 1
AND (p.property_property_type_id = '1' AND p.property_property_type_sale_id = '1')
AND p.property_property_housing_id = '1'
-- AND p.property_loc_community_id = '36'
-- AND p.property_bedroom_id = '2'
-- AND p.property_price >= '50000' AND p.property_price <= '150000'
-- AND lct.loc_city_id = '1'
-- AND p.property_loc_subcommunity_id IN(7,8,12)
ORDER BY
p.property_featured DESC
, p.property_ranking_date DESC
, p.property_ranking_total DESC
LIMIT 0, 15
Query Profile
The result set always contains 15 rows, but the tables property and property_attribute have around 500k rows.
Thanks all,
Armando Miani
This really seems to be an oddity in EXPLAIN in this case. This doesn't occur on MySQL 4.x, but it does on MySQL 5.x.
What MySQL is really showing you is that MySQL is trying to use the forced index ix_order_by_perf for sorting the rows, and it's showing you 15 rows because you have LIMIT 15.
However, the WHERE clause is still scanning all 500K rows since it can't utilize an index for the criteria in your WHERE clause. If it were able to use the index for finding the required rows, you would see the forced index listed in the 'possible_keys' field.
You can prove this by keeping the FORCE INDEX clause and removing the ORDER BY clause. You'll see that MySQL now won't use any indexes, even the one you're forcing (because the index doesn't work for this purpose).
Try adding property_property_type_id, property_property_type_sale_id, property_property_housing_id, and any other columns that you refer to in your WHERE clause to the beginning of the index.
At some point your query ends up optimized around a data model that is no longer valid for a given need.
A plan can look great, but even if the filters you are using in the WHERE clause respect the index definitions, that doesn't mean the engine isn't reading many rows.
What you have to analyze is how selective your indexes are. For instance, if there's an index on (name, family name) in a person table, performance is going to be poor if everybody has the same name and family name. The index is a real trap that pulls performance down when it doesn't manage to describe a given segment of your data precisely enough.
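One way to get a feel for how selective such an index really is (a sketch against the hypothetical person table from the example above):
SELECT COUNT(DISTINCT name, family_name) / COUNT(*) AS selectivity
FROM person;
-- close to 1: the index narrows the search well; close to 0: lots of duplicate values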
Based on the output of your EXPLAIN, here are my initial thoughts:
This portion of your query (rewritten to exclude the unneeded parentheses):
p.property_published = 1
AND p.property_property_type_id = '1'
AND p.property_property_type_sale_id = '1'
AND p.property_property_housing_id = '1'
puts so many conditions on the property table that it's unlikely any index you have can be used. Unless you have a single index that contains all four of those attributes, you're forcing a full table scan just to find the rows that meet those conditions (though it's possible that, if you have an index on one of the attributes, it could use that).
First, I'd add the following index (I have not checked this for syntax errors):
CREATE INDEX property_published_type_sale_housing_idx
    ON property (property_published,
                 property_property_type_id,
                 property_property_type_sale_id,
                 property_property_housing_id);
Then I'd re-run your EXPLAIN to see if you hit the index now. (Take off the FORCE INDEX on that part of the query).
Also, even given this issue, it's possible the slow down may be memory related. That is, you may have enough memory to process the table with a smaller number of rows, but it may be that when the table gets larger MySQL can't process the entire query in memory and is forced to start using disk to get the entire query handled. This would explain why there's a sudden drop off in performance.
If that's the case, then two things might help:
Adding more memory (and tuning the MySQL config file to take advantage of it) so that the number of rows that can be processed at once is larger. This is at best a temporary solution.
Tuning the indexes (as described above) so that the number of rows MySQL needs to process is lower, and so that it can be more precise in picking the rows it selects for processing.
Besides a good plan, you need enough resources to run the query.
Check the buffer sizes and other critical parameters in your config.
And your query is?

mysql order by query issue

I'm having a problem using ORDER BY in MySQL. I have a table called "site" with 3 fields: id, name, rank. The table contains around 1.4m records. When I run a query like
select name from site limit 50000,10;
it returns 10 records in 7.45 seconds [checked via terminal]. But when I use ORDER BY in the above query, like
select name from site order by id limit 50000,10;
the query never seems to complete. Since id is the primary key, I thought it wouldn't need another index to speed up the query, but I don't know where the mistake is.
Any help greatly appreciated, Thanks.
This is "to be expected" with large LIMIT values:
From http://www.mysqlperformanceblog.com/2006/09/01/order-by-limit-performance-optimization/
Beware of large LIMIT Using index to sort is efficient if you need
first few rows, even if some extra filtering takes place so you need
to scan more rows by index then requested by LIMIT. However if you’re
dealing with LIMIT query with large offset efficiency will suffer.
LIMIT 1000,10 is likely to be way slower than LIMIT 0,10. It is true
most users will not go further than 10 page in results, however Search
Engine Bots may very well do so. I’ve seen bots looking at 200+ page
in my projects. Also for many web sites failing to take care of this
provides very easy task to launch a DOS attack – request page with
some large number from few connections and it is enough. If you do not
do anything else make sure you block requests with too large page
numbers.
For some cases, for example if results are static it may make sense to
precompute results so you can query them for positions. So instead of
query with LIMIT 1000,10 you will have WHERE position between 1000 and
1009 which has same efficiency for any position (as long as it is
indexed)
AND
One more note about ORDER BY … LIMIT is – it provides scary explain
statements and may end up in slow query log as query which does not
use indexes
The last point is THE important point in your case - the combination of ORDER BY and LIMIT with a big table (1.4m) and the "not use indexes" (even if there are indexes!) in this case makes for really slow performance...
EDIT - as per comment:
For this specific case you should use select name from site order by id and handle the splitting of the resultset into chunks of 50,000 each in your code!
Can you try this:
SELECT name
FROM site
WHERE id >= ( SELECT id
              FROM site
              ORDER BY id
              LIMIT 50000, 1
            )
ORDER BY id
LIMIT 10;

MySQL pagination without double-querying?

I was wondering if there was a way to get the number of results from a MySQL query, and at the same time limit the results.
The way pagination works (as I understand it) is to first do something like:
query = SELECT COUNT(*) FROM `table` WHERE `some_condition`
After I get the num_rows(query), I have the number of results. But then to actually limit my results, I have to do a second query:
query2 = SELECT * FROM `table` WHERE `some_condition` LIMIT 0, 10
Is there any way to both retrieve the total number of results that would be given, AND limit the results returned in a single query? Or are there any other efficient ways of achieving this?
I almost never do two queries.
Simply return one more row than is needed, only display 10 on the page, and if there are more than are displayed, display a "Next" button.
SELECT x, y, z FROM `table` WHERE `some_condition` LIMIT 0, 11
// Iterate through and display 10 rows.
// if there were 11 rows, display a "Next" button.
Your query should return results in order of most relevant first; chances are most people aren't going to care about going to page 236 out of 412.
When you do a google search and your results aren't on the first page, you likely go to page two, not nine.
No, that's how many applications that want to paginate have to do it. It's reliable and bullet-proof, even though it runs the query twice, and you can cache the count for a few seconds, which will help a lot.
The other way is to use the SQL_CALC_FOUND_ROWS clause and then call SELECT FOUND_ROWS(). Apart from the fact that you have to put the FOUND_ROWS() call afterwards, there is a problem with this: there is a bug in MySQL that this tickles which affects ORDER BY queries, making it much slower on large tables than the naive approach of two queries.
Another approach to avoiding double-querying is to fetch all the rows for the current page using a LIMIT clause first, then only do a second COUNT(*) query if the maximum number of rows were retrieved.
In many applications, the most likely outcome will be that all of the results fit on one page, and having to do pagination is the exception rather than the norm. In these cases, the first query will not retrieve the maximum number of results.
For example, answers on a Stack Overflow question rarely spill onto a second page. Comments on an answer rarely spill over the limit of 5 or so required to show them all.
So in these applications you can simply do a query with a LIMIT first, and then, as long as that limit is not reached, you know exactly how many rows there are without the need for a second COUNT(*) query - which should cover the majority of situations.
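A sketch of that flow, using the placeholder table and condition from the question and a page size of 10:
-- Step 1: fetch one page
SELECT * FROM `table` WHERE `some_condition` LIMIT 0, 10;

-- If fewer than 10 rows came back, that row count is already the total.
-- Step 2 (only when the page is full): get the real total
SELECT COUNT(*) FROM `table` WHERE `some_condition`;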
In most situations it is much faster and less resource intensive to do it in two separate queries than to do it in one, even though that seems counter-intuitive.
If you use SQL_CALC_FOUND_ROWS, then for large tables it makes your query much slower, significantly slower even than executing two queries, the first with a COUNT(*) and the second with a LIMIT. The reason for this is that SQL_CALC_FOUND_ROWS causes the LIMIT clause to be applied after fetching the rows instead of before, so it fetches the entire row for all possible results before applying the limits. This can't be satisfied by an index because it actually fetches the data.
If you take the two-queries approach, the first one only fetching COUNT(*) and not any actual data, it can be satisfied much more quickly because it can usually use indexes and doesn't have to fetch the actual row data for every row it looks at. Then, the second query only needs to look at the first $offset + $limit rows and then return.
This post from the MySQL performance blog explains this further:
http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
For more information on optimising pagination, check this post and this post.
For anyone looking for an answer in 2020: as per the MySQL documentation:
The SQL_CALC_FOUND_ROWS query modifier and accompanying FOUND_ROWS() function are deprecated as of MySQL 8.0.17 and will be removed in a future MySQL version. As a replacement, considering executing your query with LIMIT, and then a second query with COUNT(*) and without LIMIT to determine whether there are additional rows.
I guess that settles that.
My answer may be late, but you can skip the second query (with the limit) and just filter the info through your back end script. In PHP for instance, you could do something like:
if($queryResult > 0) {
    $counter = 0;
    foreach($queryResult AS $result) {
        if($counter >= $startAt AND $counter < $numOfRows) {
            //do what you want here
        }
        $counter++;
    }
}
But of course, when you have thousands of records to consider, it becomes inefficient very fast. A pre-calculated count may be a good idea to look into.
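As one form of pre-calculation, a hedged sketch of a counter table maintained alongside the data (all names here are hypothetical):
CREATE TABLE result_counts (
    condition_key VARCHAR(100) NOT NULL PRIMARY KEY,
    total         INT UNSIGNED NOT NULL
);

-- refresh periodically, or whenever rows are inserted or deleted
REPLACE INTO result_counts (condition_key, total)
SELECT 'some_condition', COUNT(*) FROM `table` WHERE `some_condition`;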
Here's a good read on the subject:
http://www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf
SELECT col, col2, (SELECT COUNT(*) FROM `table` WHERE `some_condition`) / 10 AS total FROM `table` WHERE `some_condition` LIMIT 0, 10
Here 10 is the page size and 0 is the offset; for a given page number you need to use (pageNumber - 1) * 10 as the offset in the query.
You can reuse most of the query in a subquery and give it an alias. For example, a movie query that finds movies containing the letter 's', ordering by runtime, would look like this on my site.
SELECT Movie.*, (
SELECT Count(1) FROM Movie
INNER JOIN MovieGenre
ON MovieGenre.MovieId = Movie.Id AND MovieGenre.GenreId = 11
WHERE Title LIKE '%s%'
) AS Count FROM Movie
INNER JOIN MovieGenre
ON MovieGenre.MovieId = Movie.Id AND MovieGenre.GenreId = 11
WHERE Title LIKE '%s%' LIMIT 8;
Do note that I'm not a database expert, and am hoping someone will be able to optimize that a bit better. As it stands, running them straight from the SQL command-line interface, they both take ~0.02 seconds on my laptop.
SELECT *
FROM table
WHERE some_condition
ORDER BY RAND()
LIMIT 0, 10