OK, let's assume I have a big table with 1k+ records, and that I need to take three records from it. Now, let's assume there are no records that meet the conditions. By doing a COUNT(*) with the same conditions and then doing the SELECT only if the count is greater than zero, am I making my queries faster by making sure there are records available before doing the SELECT, or is this just a waste of time?
That is a tiny table in the overall scheme of things. You should just query for your filtered results directly, and if you need to do something different in your app when no results are returned, just do a check against the number of rows returned to skip trying to work with the result set.
There would never be a case where the COUNT() approach performs better, because it would be doing the same exact query logic you would do on a full select anyway.
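A minimal sketch of that single-query approach (the table, columns, and condition here are hypothetical):
-- One round trip: fetch up to three matching rows and let the application
-- check how many actually came back.
SELECT id, name
FROM big_table
WHERE status = 'pending'
LIMIT 3;
-- If zero rows come back, branch in application code;
-- no separate COUNT(*) is needed beforehand.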
I have a large database, so I use LIMIT to avoid fetching all of a query's results every time (it isn't necessary). But I have an issue: I need to count the number of results. The dumbest solution is the following, and it works:
We just get the data that we need:
SELECT * FROM table_name WHERE param > 3 LIMIT 10
And then we find the length:
SELECT COUNT(1) FROM table_name WHERE param > 3 LIMIT 10
But this solution bugs me because, unlike the query in this question, the one I actually work with is complex, and you basically have to run it twice to get the result.
Another dumb solution for me was to do:
SELECT COUNT(1), param, anotherparam, additionalparam FROM table_name WHERE param > 3 LIMIT 10
But this results in only one row. At this point I would be fine if it just filled the count column with the same number in every row; I just need this information without wasting computation time.
Is there a better way to achieve this?
P.S. I am not looking to get 10 as the result of COUNT; I need the length without the LIMIT.
You should (probably) run the query twice.
MySQL does have a FOUND_ROWS() function that reports the number of rows matched before the limit. But using this function is often worse for performance than running the query twice!
https://www.percona.com/blog/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
...when we have appropriate indexes for WHERE/ORDER clause in our query, it is much faster to use two separate queries instead of one with SQL_CALC_FOUND_ROWS.
There are exceptions to every rule, of course. If you don't have an appropriate index to optimize the query, it could be more costly to run the query twice. The only way to be sure is to repeat the tests shown in that blog, using your data and your query on your server.
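If you do run that comparison yourself, the two candidates look roughly like this, reusing the WHERE clause from the question (your real query will be more complex):
-- Option A: two separate queries (usually faster when the WHERE/ORDER BY
-- is covered by an index).
SELECT * FROM table_name WHERE param > 3 LIMIT 10;
SELECT COUNT(*) FROM table_name WHERE param > 3;

-- Option B: SQL_CALC_FOUND_ROWS / FOUND_ROWS(), which forces MySQL to find
-- every matching row even though only 10 are returned; note it is deprecated
-- as of MySQL 8.0.17.
SELECT SQL_CALC_FOUND_ROWS * FROM table_name WHERE param > 3 LIMIT 10;
SELECT FOUND_ROWS();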
This question is very similar to: How can I count the number of rows that a MySQL query returned?
See also: https://mariadb.com/kb/en/found_rows/
This is probably the most efficient solution to your problem, but it's best to test it using EXPLAIN with a reasonably sized dataset.
I need to query for the COUNT of rows that fulfill multiple filter criteria. However, I do not know which filters will be combined, so I cannot create appropriate indexes.
SELECT COUNT(id) FROM tbl WHERE filterA > 1000 AND filterD < 500
This is very slow since it has to do a full table scan. Is there any way to have a performant query in my situation?
id, filterA, filterB, filterC, filterD, filterE
1, 2394, 23240, 8543, 3241, 234
The issue here is that there are fundamental limitations in how you can index data on multiple criteria. These are standard, well-known problems, and to the extent that ElasticSearch gets away from them, it does so with brute-force parallelism and indexes on everything you might want to filter by.
Usually some filters will be more commonly used and more selective, so usually one would start by looking at actual examples of queries and build indexes around the queries which have performed slowly in the past.
This means you start with slow query logging and then focus on the most important queries first until you get everything where it is tolerable.
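A sketch of that workflow, assuming the slow query log shows that filterA and filterD are the most frequent and most selective filters (that column choice is an assumption, not a rule):
-- Index the columns that actually dominate your slow queries.
CREATE INDEX idx_filterA ON tbl (filterA);
CREATE INDEX idx_filterD ON tbl (filterD);

-- The optimizer can then use whichever index narrows the scan most
-- for a query like the one in the question:
SELECT COUNT(id) FROM tbl WHERE filterA > 1000 AND filterD < 500;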
"The GROUP BY clause groups a set of rows into a set of summary rows by values of columns or expressions. The GROUP BY clause returns one row for each group. In other words, it reduces the number of rows in the result set." - http://www.mysqltutorial.org/mysql-group-by.aspx
But when does GROUP BY actually run? While the rows are being searched, or after they have all been found (i.e., does it filter the result of the query)?
It doesn't really matter. If you're thinking of the database in terms of loops over data evaluating truths for the WHERE part, then you can probably envisage the GROUP BY as a dictionary/hash table. For performance reasons it makes sense to do the hashing at the same time that you loop over the data, but you could loop twice: once to filter, once to group. The looping part is cheap.
How you write your query can have a bearing on things too. All in all, there's a lack of specifics in your question that prevents a direct and targeted answer.
For your particular query you might get the info you need from the Display Execution Plan facility; filtering and grouping may show up separately there, and you'll then be able to infer when each is done.
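If you're on MySQL, EXPLAIN plays that role; on 8.0 you can also ask for a tree-shaped plan. The table and query below are purely hypothetical:
-- The plan shows whether grouping is done via a temporary table/hash,
-- a filesort, or by walking an index that already matches the GROUP BY.
EXPLAIN FORMAT=TREE
SELECT customer_id, COUNT(*)
FROM orders
WHERE created_at > '2024-01-01'
GROUP BY customer_id;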
Given a query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of ORDER BY, GROUP BY, DISTINCT, and the aggregate function?
Maybe a different order would give the same result but with different performance. I want to merge multiple results; I have the SQL and have parsed it, so I want to know the order that standard SQL uses.
This is bigger than just group by/aggregation/order by. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well-designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
1. Look at the query to determine which tables will be needed.
2. Look at joins and subqueries to determine which of those tables depend on other tables.
3. Look at the conditions on the joins and in the WHERE clause, in conjunction with indexes, to determine how much data from each table will be needed, and how much work it will take to extract the portions of each table that you need (how well the query matches up with your indexes or the table as stored on disk).
4. Based on the information collected in steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
5. Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the existing discovered groups before the engine can determine whether or not a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here: all the potential result rows are materialized into the working set, ordered by the columns in the GROUP BY clause, and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
6. Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
7. Now, finally, the ORDER BY can be factored in as well.
8. The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
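Applied to the query in the question, the logical evaluation order (which the optimizer is free to rearrange physically, as described above) looks roughly like this; the numbered comments are illustrative annotations, not MySQL syntax:
-- 1. FROM T_USER        -- identify the source rows
-- 2. WHERE              -- none in this query; row filtering would happen here
-- 3. GROUP BY area      -- form one group per distinct area
-- 4. max(age)           -- aggregate within each group
-- 5. SELECT DISTINCT    -- project the columns; DISTINCT is redundant here,
--                          since GROUP BY area already yields one row per area
-- 6. ORDER BY area      -- sort the grouped rows
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;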
I'm sure the answer is somehow logical but here goes.
I have three big tables joined on three columns, each of which is part of the primary key.
I want to get a distinct select on column1.
It works if I get the whole result at once, i.e. I export it into a file.
But if I paginate it the way phpMyAdmin would (LIMIT 0, 1000), I get some column1 values twice, e.g. val1 on page 1 and val1 again on the last page. This also means I'm not getting back some values that I should have.
If I add an ORDER BY column1, everything is OK again, but I lose speed on the last pages, or so I've been told.
I guess it has something to do with the way MySQL handles the pagination and returns the result without actually knowing the whole result, but it still bugs me.
Can anyone elaborate on that?
The reason for paginating the query is because I don't like to lock the tables for longer periods at a time.
Does anyone have any insight into how to achieve this and at the same time get all the data?
It doesn't make sense to implement paging using LIMIT without an ORDER BY.
Yes, you're right that it's faster without the ORDER BY, because the server is free to return arbitrary results in any order and the results don't have to be consistent from one query to the next.
If you want correct and consistent results, you must have the ORDER BY. If you are concerned about performance, consider adding an index on the column you are ordering by.
From the manual page LIMIT optimization:
If you use LIMIT row_count with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast.
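As a sketch of that advice, collapsing the three-table join into a single placeholder table t for brevity (the index name and query are assumptions about your schema, not a drop-in fix):
-- An index on the pagination column lets MySQL use the LIMIT optimization
-- quoted above instead of sorting the whole result for every page.
ALTER TABLE t ADD INDEX idx_column1 (column1);

-- Every page uses the same ORDER BY, so the pages are consistent with each other.
SELECT DISTINCT column1
FROM t
ORDER BY column1
LIMIT 0, 1000;      -- next page: LIMIT 1000, 1000, and so on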
The reason for paginating the query is because I don't like to lock the tables for longer periods at a time. Does anyone have any insight into how to achieve this and at the same time get all the data?
If you're trying to perform some operation on every row, then your approach won't work if data can be added or removed, because inserts and deletes shift the following rows onto different pages. Adding a row pushes some rows onto the next page, meaning you see one row twice. Removing a row from an earlier page causes you to skip a row.
Instead you could use one of these approaches (both sketched below):
Use some id to keep track of how far you have progressed. Select the next n rows with higher id.
Record which rows you have handled by storing a boolean in a column. Select any n rows that you haven't handled yet.
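A minimal sketch of both ideas, again using a placeholder table t with an integer primary key id and, for the second approach, a hypothetical processed flag column:
-- Approach 1: keyset pagination. Track the last id you handled in the
-- application and always ask for the next batch above it; inserted or
-- deleted rows can no longer shift work between batches.
SELECT id, column1
FROM t
WHERE id > 12345        -- the last id handled so far (example value)
ORDER BY id
LIMIT 1000;

-- Approach 2: mark rows as handled and always pick unhandled ones.
SELECT id, column1
FROM t
WHERE processed = 0
LIMIT 1000;

UPDATE t SET processed = 1
WHERE id IN (101, 102, 103);   -- the ids you just finished handling (example values)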