I need to query for the COUNT of rows that fulfill multiple filter criteria. However, I do not know which filters will be combined, so I cannot create appropriate indexes.
SELECT COUNT(id) FROM tbl WHERE filterA > 1000 AND filterD < 500
This is very slow since it has to do a full table scan. Is there any way to have a performant query in my situation?
id, filterA, filterB, filterC, filterD, filterE
1, 2394, 23240, 8543, 3241, 234
The issue here is that there are fundamental limitations in how you can index data on multiple criteria. These are standard problems; to the extent that ElasticSearch manages to get away from them, it does so with brute-force parallelism and by indexing everything you may want to filter by.
Usually some filters will be more commonly used and more selective than others, so one would typically start by looking at actual examples of queries and build indexes around the queries that have performed slowly in the past.
This means you start with slow query logging and then focus on the most important queries first until you get everything to a tolerable level.
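A minimal sketch of that starting point, assuming you can set server variables at runtime (the one-second threshold and the log path are just examples):

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                             -- log anything slower than 1 second
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';

Once the log shows which filter combinations actually come up and run slowly, those are the ones worth an index.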
Related
I have seen several questions comparing SELECT * to selecting all columns explicitly, but what about selecting fewer columns vs. more?
In other words, is:
SELECT id,firstname,lastname,lastlogin,email,phone
More than negligibly faster than:
SELECT id,firstname,lastlogin
I realize there will be small differences from more data being transferred through the system and to the application, but that is a total data/load difference, not a cost of the query itself (larger data in the cells would have the same effect anyway, I believe). I'm only trying to optimize my query, as I will have to load ALL the data at some point anyway...
When my admin user logs in, I'm going to load the entire user database into a cache, but I can either query only critical data upfront to shave some execution time, or just get everything - if it works out roughly the same. I know more rows equals longer query execution - but what about more selected values in my query?
Under most circumstances, the only difference is going to be slightly larger data for these fields and the additional time to fetch them.
There are two things to consider:
If the additional fields are very big, then this could be a big difference in performance.
If there is an index that covers the columns you actually want, then the index can be used for the query. This could speed up the query in the database.
In general, though, the advice is to return the columns you want to the application. If there is complex processing, you should consider doing that in the database rather than the application.
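As a rough sketch of the covering-index case, using the columns from the smaller query above (the table and index names are assumed for illustration):

ALTER TABLE users ADD INDEX idx_name_login (firstname, lastlogin);
-- With InnoDB, the primary key (id) is stored in every secondary index,
-- so this index covers the three-column query and no table rows need to be read:
SELECT id, firstname, lastlogin FROM users;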
I have a table called users with a couple dozen columns such as height, weight, city, state, country, age, gender etc...
The only keys/indices on the table are for the columns id and email.
I have a search feature on my website that filters users based on these various columns. The query could contain anywhere from zero to a few dozen different where clauses (such as where `age` > 40).
The search is set to LIMIT 50 and ORDER BY `id`.
There are about 100k rows in the database right now.
If I perform a search with zero filters or loose filters, MySQL basically just returns the first 50 rows and doesn't have to read many more rows than that. It usually takes less than 1 second to complete this type of query.
If I create a search with a lot of complex filters (for instance, 5+ where clauses), MySQL ends up reading through the entire database of 100k rows, trying to accumulate 50 valid rows, and the resulting query takes about 30 seconds.
How can I more efficiently query to improve the response time?
I am open to using caching (I already use Redis for other caching purposes, but I don't know where to start with properly caching a MySQL table).
I am open to adding indices, although there are a lot of different combinations of where clauses that can be built. Also, several of the columns are JSON where I am searching for rows that contain certain elements. To my knowledge I don't think an index is a viable solution for that type of query.
I am using MySQL version 8.0.15.
In general you need to create indexes for the columns that are mentioned in the criteria of the WHERE clauses. You can also create indexes for JSON columns by using a generated column index: https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html.
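A minimal sketch of the generated-column approach, assuming a JSON column named attributes with a city element (the column and index names are invented for illustration):

ALTER TABLE users
  ADD COLUMN city_gen VARCHAR(64)
    GENERATED ALWAYS AS (attributes->>'$.city') STORED,
  ADD INDEX idx_city_gen (city_gen);
-- Filters on the extracted value can now use the index:
SELECT * FROM users WHERE city_gen = 'Berlin' ORDER BY id LIMIT 50;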
Per the responses in the comments from ysth and Paul, the problem was just server capacity. After upgrading to an 8GB RAM server, query times dropped to under 1 s.
My thinking is that if I put my ANDs that filter out a greater number of rows before those that filter out just a few, my query should run quicker since that selection set is much smaller between And statements.
But does the order of ANDs in the WHERE clause of an SQL statement really affect the performance of the SQL that much, or are the engines already optimized for this?
It really depends on the optimiser.
It shouldn't matter because it's the optimiser's job to figure out the optimal way to run your query regardless of how you describe it.
In practice, no optimiser is perfect so you might find that re-ordering the clauses does make a difference to particular queries. The only way to know for sure is to test it yourself with your own schema, data etc.
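One hedged way to run that test, borrowing the table and column names from the first question above as placeholders (EXPLAIN ANALYZE is also available on MySQL 8.0.18+):

EXPLAIN SELECT COUNT(*) FROM tbl WHERE filterA > 1000 AND filterD < 500;
EXPLAIN SELECT COUNT(*) FROM tbl WHERE filterD < 500 AND filterA > 1000;
-- If both statements produce the same plan (same key, rows and filtered values),
-- the optimizer has already normalized the order of the conditions.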
Most SQL engines are optimized to do this work for you. However, I have found situations in which trying to carve down the largest table first can make a big difference - it doesn't hurt!
A lot depends how the indices are set up. If an index exists which combines the two keys, the optimizer should be able to answer the query with a single index search. Otherwise if independent indices exist for both keys, the optimizer may get a list of the records satisfying each key and merge the lists. If an index exists for one condition but not the other, the optimizer should filter using the indexed list first. In any of those scenarios, it shouldn't matter what order the conditions are listed.
If none of those scenarios applies (no usable index exists), the order in which the conditions are specified may affect the order of evaluation, but since the database is going to have to fetch every single record to satisfy the query, the time spent fetching will likely dwarf the time spent evaluating the conditions.
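To make those index scenarios concrete (table, column, and index names are placeholders only):

-- One composite index lets the optimizer answer both conditions with a single index search:
CREATE INDEX idx_a_b ON tbl (keyA, keyB);
-- Alternatively, two independent indexes whose result lists the optimizer may merge:
CREATE INDEX idx_a ON tbl (keyA);
CREATE INDEX idx_b ON tbl (keyB);
-- EXPLAIN shows the access type "index_merge" when the second strategy is chosen.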
I ran the same query on a number of tables (containing different numbers of records):
SELECT * FROM `tblTest`
ORDER BY `tblTest`.`DateAccess` DESC;
Why do the first queries behave erratically (take longer than the second, third, ...)?
I calculated the average of the second, third and fourth query, excluding the first query.
So for example, in a table with 1,000,000 records, the first run takes 4.8410 s to process and the second only 0.8940 s. Why is this happening?
P.S. I use the phpMyAdmin tool.
DBMSs are really smart applications and maintain multiple catalogues to optimize their execution. When a query is run, entries are generated in these catalogues; depending on the DBMS used, the catalogues become better optimized and the system can even go as far as automatically generating indexes for frequently run queries. They also all have what is called a query optimizer, which analyzes the query's execution plan in order to optimize it.
In your specific case, you should look at query and result caching; the following articles should help you understand how MySQL natively tries to optimize query processing.
http://dev.mysql.com/doc/refman/5.5/en/query-cache.html
http://www.cyberciti.biz/tips/enable-the-query-cache-in-mysql-to-improve-performance.html
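As a hedged illustration of what the first article covers (the query cache exists only up to MySQL 5.7 and was removed in 8.0; the 64 MB size is arbitrary):

SHOW VARIABLES LIKE 'query_cache%';        -- is the cache enabled and how big is it?
SHOW STATUS LIKE 'Qcache%';                -- hits, inserts, free memory
SET GLOBAL query_cache_size = 67108864;    -- e.g. 64 MB, on 5.x only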
Here is a comparison between Oracle, MySQL and Postgres (not a new article, but it will give you a basic idea of how different DBMSs handle complex queries on large databases):
http://dcdbappl1.cern.ch:8080/dcdb/archive/ttraczyk/db_compare/db_compare.html#Query+optimization
So I have a table that has a little over 5 million rows. When I use SQL_CALC_FOUND_ROWS the query just hangs forever. When I take it out, the query executes within a second with a LIMIT of 25. My question is: for pagination purposes, is there an alternative way to get the total number of rows?
SQL_CALC_FOUND_ROWS forces MySQL to scan for ALL matching rows, even if they'd never get fetched. Internally it amounts to the same query being executed without the LIMIT clause.
If the filtering you're doing via WHERE isn't too crazy, you could calculate and cache various types of filters to save the full-scan load imposed by calc_found_rows. Basically, run a "SELECT COUNT(*) FROM ... WHERE ..." for most of the possible WHERE clauses.
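A rough sketch of that idea, using an invented summary table keyed by a filter signature; the application would refresh these rows periodically instead of counting on every page load:

CREATE TABLE filter_counts (
  filter_signature VARCHAR(255) PRIMARY KEY,   -- e.g. 'age>40&country=US'
  row_count INT UNSIGNED NOT NULL,
  refreshed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Refresh one cached count:
REPLACE INTO filter_counts (filter_signature, row_count)
SELECT 'age>40&country=US', COUNT(*) FROM users WHERE age > 40 AND country = 'US';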
Otherwise, you could go Google-style and just spit out some page numbers that occasionally have no relation whatsoever with reality (You know, you see "Goooooooooooogle", get to page 3, and suddenly run out of results).
Detailed talk about implementing Google-style pagination using MySQL
You should choose between COUNT(*) and SQL_CALC_FOUND_ROWS depending on the situation. If your query's search criteria use columns that are in an index, use COUNT(*). In that case MySQL will "read" from the indexes only, without touching the actual data in the table, while the SQL_CALC_FOUND_ROWS method will load rows from disk, which can be expensive and time-consuming on massive tables.
More information on this topic is in this article on mysqlperformanceblog.
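A hedged side-by-side of the two approaches (table, column, and index names are placeholders; note that SQL_CALC_FOUND_ROWS is deprecated as of MySQL 8.0.17):

-- Two statements, but the count can be served from an index on `status`:
SELECT COUNT(*) FROM articles WHERE status = 'published';
SELECT * FROM articles WHERE status = 'published' ORDER BY id LIMIT 0, 25;
-- One statement, but MySQL must materialize every matching row to compute the total:
SELECT SQL_CALC_FOUND_ROWS * FROM articles WHERE status = 'published' ORDER BY id LIMIT 0, 25;
SELECT FOUND_ROWS();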