MySQL group by query complexity analysis - mysql

Which is the complexity of the "group by" statement in MySQL?
I am managing vaery big tables and I also would like to know if there is any method to calculate how much time a query is going to take.

This question is impossible to answer with knowledge of what the entire query looks like. Some group bys can be prohibitively expensive while others are very cheap, it all depends on how the indexes in the database are set up, if the value you group by can be cached etc.
For example, this is a very cheap group by:
CREATE TABLE t (a INT, KEY(a));
SELECT * FROM WHERE 1 GROUP BY a;
Since a is an index.
But for something like this, it's very expensive since it would require a table scan.
CREATE TABLE t (a INT);
SELECT * FROM WHERE 1 GROUP BY a;
Generally if a key is not available, the database will creates a temporary table in memory for group by clauses, go through all the values, insert each value into the temporary table with an index to the corresponding row in the result set, then it will select from the temporary table, pick the first row from each column and send that back as the result. Depending on if you use the "extra" rows per group by clause (ie. using MAX(), GROUP_CONCAT() or similar) it will need to fetch all rows again.
You can use EXPLAIN to figure out what strategy MySQL will use, the 'Extra' (in ascending order of cost to execute) 'Using index' if an index can be used, 'Using filesort' if reading all rows from disk will be necessary, and column will contain 'Using Temporary' if a temporary will be required

Related

MySQL: composite index fulltext+btree?

I want a query that does a fulltext search on one field and then a sort on a different field (imagine searching some text document and order by publication date). The table has about 17M rows and they are more or less uniformly distributed in dates. This is to be used in a webapp request/response cycle, so the query has to finish in at most 200ms.
Schematically:
SELECT * FROM table WHERE MATCH(text) AGAINST('query') ORDER BY date=my_date DESC LIMIT 10;
One possibility is having a fulltext index on the text field and a btree on the publication date:
ALTER TABLE table ADD FULLTEXT index_name(text);
CREATE INDEX index_name ON table (date);
This doesn't work very well in my case. What happens is that MySQL evaluates two execution paths. One is using the fulltext index to find the relevant rows, and once they are selected use a FILESORT to sort those rows. The second is using the BTREE index to sort the entire table and then look for matches using a FULL TABLE SCAN. They're both bad. In my case MySQL chooses the former. The problem is that the first step can select some 30k results which it then has to sort, which means the entire query might take of the order 10 seconds.
So I was thinking: do composite indexes of FULLTEXT+BTREE exist? If you know how a FULLTEXT index works, it first tokenizes the column you're indexing and then builds an index for the tokens. It seems reasonable to me to imagine a composite index such that the second index is a BTREE in dates for each token. Does this exist in MySQL and if so what's the syntax?
BONUS QUESTION: If it doesn't exist in MySQL, would PostgreSQL perform better in this situation?
Use IN BOOLEAN MODE.
The date index is not useful. There is no way to combine the two indexes.
Beware, if a user searches for something that shows up in 30K rows, the query will be slow. There is no straightforward away around it.
I suspect you have a TEXT column in the table? If so, there is hope. Instead of blindly doing SELECT *, let's first find the ids and get the LIMIT applied, then do the *.
SELECT a.*
FROM tbl AS a
JOIN ( SELECT date, id
FROM tbl
WHERE MATCH(...) AGAINST (...)
ORDER BY date DESC
LIMIT 10 ) AS x
USING(date, id)
ORDER BY date DESC;
Together with
PRIMARY KEY(date, id),
INDEX(id),
FULLTEXT(...)
This formulation and indexing should work like this:
Use FULLTEXT to find 30K rows, deliver the PK.
With the PK, sort 30K rows by date.
Pick the last 10, delivering date, id
Reach back into the table 10 times using the PK.
Sort again. (Yeah, this is necessary.)
More (Responding to a plethora of Comments):
The goal behind my reformulation is to avoid fetching all columns of 30K rows. Instead, it fetches only the PRIMARY KEY, then whittles that down to 10, then fetches * only 10 rows. Much less stuff shoveled around.
Concerning COUNT on an InnoDB table:
INDEX(col) makes it so that an index scan works for SELECT COUNT(*) or SELECT COUNT(col) without a WHERE.
Without INDEX(col),SELECT COUNT(*)will use the "smallest" index; butSELECT COUNT(col)` will need a table scan.
A table scan is usually slower than an index scan.
Be careful of timing -- It is significantly affected by whether the index and/or table is already cached in RAM.
Another thing about FULLTEXT is the + in front of words -- to say that each word must exist, else there is no match. This may cut down on the 30K.
The FULLTEXT index will deliver the date, id is random order, not PK order. Anyway, it is 'wrong' to assume any ordering, hence it is 'right' to add ORDER BY, then let the Optimizer toss it if it knows that it is redundant. And sometimes the Optimizer can take advantage of the ORDER BY (not in your case).
Removing just the ORDER BY, in many cases, makes a query run much faster. This is because it avoids fetching, say, 30K rows and sorting them. Instead it simply delivers "any" 10 rows.
(I have not experience with Postgres, so I cannot address that question.)

Fastest result when checking date range

User will select a date e.g. 06-MAR-2017 and I need to retrieve hundred thousand of records for date earlier than 06-MAR-2017 (but it could vary depends on user selection).
From above case, I am using this querySELECT col from table_a where DATE_FORMAT(mydate,'%Y%m%d') < '20170306' I feel that the record is kind of slow. Are there any faster or fastest way to get date results like this?
With 100,000 records to read, the DBMS may decide to read the table record for record (full table scan) and there wouldn't be much you could do.
If on the other hand the table contains billions of records, so 100,000 would just be a small part, then the DBMS may decide to use an index instead.
In any way you should at least give the DBMS the opportunity to select via an index. This means: create an index first (if such doesn't exist yet).
You can create an index on the date column alone:
create index idx on table_a (mydate);
or even provide a covering index that contains the other columns used in the query, too:
create index idx on table_a (mydate, col);
Then write your query such that the date column is accessed directly. You have no index on DATE_FORMAT(mydate,'%Y%m%d'), so above indexes don't help with your original query. You'd need a query that looks up the date itself:
select col from table_a where mydate < date '2017-03-06';
Whether the DBMS then uses the index or not is still up to the DBMS. It will try to use the fastest approach, which very well can still be the full table scan.
If you make a function call in any column at the left side of comparison, MySql will make a full table scan.
The fastest method would be to have an index created on mydate, and make the right side ('20170306') the same datatype of the column (and the index)

MySQL covering index optimization? [duplicate]

I've just heard the term covered index in some database discussion - what does it mean?
A covering index is an index that contains all of, and possibly more, the columns you need for your query.
For instance, this:
SELECT *
FROM tablename
WHERE criteria
will typically use indexes to speed up the resolution of which rows to retrieve using criteria, but then it will go to the full table to retrieve the rows.
However, if the index contained the columns column1, column2 and column3, then this sql:
SELECT column1, column2
FROM tablename
WHERE criteria
and, provided that particular index could be used to speed up the resolution of which rows to retrieve, the index already contains the values of the columns you're interested in, so it won't have to go to the table to retrieve the rows, but can produce the results directly from the index.
This can also be used if you see that a typical query uses 1-2 columns to resolve which rows, and then typically adds another 1-2 columns, it could be beneficial to append those extra columns (if they're the same all over) to the index, so that the query processor can get everything from the index itself.
Here's an article: Index Covering Boosts SQL Server Query Performance on the subject.
Covering index is just an ordinary index. It's called "covering" if it can satisfy query without necessity to analyze data.
example:
CREATE TABLE MyTable
(
ID INT IDENTITY PRIMARY KEY,
Foo INT
)
CREATE NONCLUSTERED INDEX index1 ON MyTable(ID, Foo)
SELECT ID, Foo FROM MyTable -- All requested data are covered by index
This is one of the fastest methods to retrieve data from SQL server.
Covering indexes are indexes which "cover" all columns needed from a specific table, removing the need to access the physical table at all for a given query/ operation.
Since the index contains the desired columns (or a superset of them), table access can be replaced with an index lookup or scan -- which is generally much faster.
Columns to cover:
parameterized or static conditions; columns restricted by a parameterized or constant condition.
join columns; columns dynamically used for joining
selected columns; to answer selected values.
While covering indexes can often provide good benefit for retrieval, they do add somewhat to insert/ update overhead; due to the need to write extra or larger index rows on every update.
Covering indexes for Joined Queries
Covering indexes are probably most valuable as a performance technique for joined queries. This is because joined queries are more costly & more likely then single-table retrievals to suffer high cost performance problems.
in a joined query, covering indexes should be considered per-table.
each 'covering index' removes a physical table access from the plan & replaces it with index-only access.
investigate the plan costs & experiment with which tables are most worthwhile to replace by a covering index.
by this means, the multiplicative cost of large join plans can be significantly reduced.
For example:
select oi.title, c.name, c.address
from porderitem poi
join porder po on po.id = poi.fk_order
join customer c on c.id = po.fk_customer
where po.orderdate > ? and po.status = 'SHIPPING';
create index porder_custitem on porder (orderdate, id, status, fk_customer);
See:
http://literatejava.com/sql/covering-indexes-query-optimization/
Lets say you have a simple table with the below columns, you have only indexed Id here:
Id (Int), Telephone_Number (Int), Name (VARCHAR), Address (VARCHAR)
Imagine you have to run the below query and check whether its using index, and whether performing efficiently without I/O calls or not. Remember, you have only created an index on Id.
SELECT Id FROM mytable WHERE Telephone_Number = '55442233';
When you check for performance on this query you will be dissappointed, since Telephone_Number is not indexed this needs to fetch rows from table using I/O calls. So, this is not a covering indexed since there is some column in query which is not indexed, which leads to frequent I/O calls.
To make it a covered index you need to create a composite index on (Id, Telephone_Number).
For more details, please refer to this blog:
https://www.percona.com/blog/2006/11/23/covering-index-and-prefix-indexes/

mysql query optimization - order by + indexes

I'm trying to learn something about Optimizing and indexes because I ran a insert select-query that required 4 min to complete. Now, I've added multiple indexes and it seems to have made my query run in 0.160sec. Now what I'm wondering is why the customer table is getting the using filesort message when i'm ordering by orderdate in my order table. Query and explain:
I've even tried an index in O(Orders) for (orderdate, orderid) and (orderdate, orderid, customerid). I thought one of them would help, but no dice. Can anyone help me understand why?
There is nothing wrong with having a query that uses "filesort"; all that means is that the results can't be sorted based on an index.
Now the reason why the sort can't be performed on an index is in this case because your ORDER BY contains columns from tables other than the first table in the join queue.
Since your query result doesn't contain very many rows, the temporary table being used is probably in memory.
What happens is as the query results are fetched from that query is that the results are put into a temporary table so they can later be sorted.
Adding the initial indexes sped up your query most likely because MySQL was doing a full table scan to fetch the results initially which was very time consuming. Once you added the proper indexes, finding the records is extremely quick. It probably had to do a filesort on a temporary table originally but this was likely no slower or faster than it is now.
If you try moving the join for the Orders table and put it before the join of the Products table, you may be able to eliminate the use of the temporary table and file sort.
Check out what does using filesort mean? and How MySQL Uses Internal Temporary Tables for more information.

Should I avoid COUNT all together in InnoDB?

Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, if I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyIASM is not an option because I need the features of InnoDB to maintain data integrity.
SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if
the index available is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable length indices (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indices can be counted quickly. Which makes sense.
MySQL really should perform this optimization automatically.
Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NON NULL then a COUNT() will be a million quite easily. If NULL values are allowed, those aren't indexed so the number of records is easily obtained by looking at the index size.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.
I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table. This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most case that's not going to be more than 10-20, and it's not much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.